Book Review: Machine Learning with Python Cookbook

Source: PbPython.com

Introduction

This article is a review of Chris Albon’s book, Machine Learning with Python Cookbook.
This book is in the tradition of O’Reilly’s other “cookbook” titles in that it
contains short “recipes” for dealing with common machine learning scenarios in python.
It covers the full spectrum of tasks from simple data wrangling and pre-processing
to more complex machine learning model development and deep learning implementations.
Since this is such a fast moving and broad topic, it is nice to get a new book
that covers the latest topics and presents them in a compact but very useful format.
Bottom line, I enjoyed reading this book and think it will be a useful resource to have
on my python bookshelf. Read on for some more details about the book and who will benefit
most from reading it.

Where does this book fit?

As data science, machine learning and AI have become more and more popular, there
is a proliferation of books that try to cover these topics in differing manners.
Some books go very deep in the math and theory behind the various machine learning
algorithms. Others try to cover a lot of content but do not provide a quick reference
resource with code examples for solving real world problems. Machine Learning with Python Cookbook
fills this code-heavy niche with lots of examples. There
are very few paragraphs with math equations or details behind the implementation
of machine learning algorithms. Instead, Chris Albon breaks the topics down into
bite-size chunks that each solve a very specific problem. Each of the nearly 200
recipes follows a similar format:

  • Problem definition
  • Solution
  • Discussion (optional)
  • Additional resources (optional)

In most cases, the problem definition is as simple as “You want to multiply two matrices” or
“You need to visualize a model created by a decision tree learning algorithm.” This organization
makes it easy to scan the table of contents and find the relevant section.

Each solution is fully self-contained and can be copied and pasted into
a standalone script or Jupyter notebook and executed. In addition, each code sample includes all
the necessary imports as well as sample data sets (e.g. Iris, Titanic, MNIST). The samples are all
around 12-20 lines of code, with comments included, so they are easy to dissect and understand.
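
To give a feel for the format, here is a minimal sketch of what a recipe-style solution to “You want to multiply two matrices” could look like. This is my own illustration, not code copied from the book, and the matrix values are made up for the example.

```python
# Load library
import numpy as np

# Create two small matrices (values chosen only for illustration)
matrix_a = np.array([[1, 1],
                     [1, 2]])
matrix_b = np.array([[1, 3],
                     [1, 2]])

# Multiply the two matrices (matrix product, not element-wise)
product = matrix_a @ matrix_b

# Show the result
print(product)
```

The imports, the data and the operation all live in one short block, which is what makes a snippet like this easy to paste into a notebook and run.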

In some cases, there is further discussion about the approach as well as hints and tips related
to the solutions. In many cases, topics like performance for larger and more complex
data sets are discussed and options are presented for managing those situations.

Finally, the author also includes links to more details that might be useful when
you need to dive into the problem in more depth.

Who should read it?

The author is very clear that this book is not an introduction to python or machine learning.
Since the recipes are short, the actual python code is fairly simple. There’s no need
to understand complex python data structures or programming constructs outside of
lists and dictionaries. You should know how to install python libraries such as
numpy, pandas and scikit-learn.

More importantly, you should have at least some experience using these libraries
to load and manipulate data. I also highly recommend that you have done some work
with building predictive models with scikit-learn. A lot of the value I gained from
this book was related to learning solutions to problems I encountered in my own work.

Finally, some basic understanding of supervised and unsupervised machine learning
algorithms is going to be really helpful. For example, if you do not know the types
of problems where you would use linear vs. logistic regression or
why you might need to use dimensionality reduction, then this book (especially
chapters 9 and higher) might not make sense.

How should you read it?

Because the book is a cookbook, it’s not necessary to read it from page 1 through
340. However, I do think it is best to skim through it in order to understand what
content is available. For instance, I felt very comfortable with the content in
chapter 2 (Loading Data) and Chapter 3 (Data Wrangling) so I skimmed the content.
For other chapters, I felt like I got a lot more out of reading the examples
in depth since I did not have as much experience with those topics.

Ultimately though, this is a resource that is meant to sit beside your computer and
provide a quick lookup for a specific problem. With that goal in mind, it achieves
its aim admirably.

Chapter Overview

The book only has 340 pages of content but it is broken down into 21 chapters. In my opinion,
this is a good structure because each chapter provides a concise introduction
of a topic and specific code examples that solve common problems.

The chapters start with basic numpy functions, then move to more complex pandas and scikit-learn
functions and close out with some keras examples. Here’s a list of each chapter
along with its primary focus:

  1. Vectors, Matrices and Arrays [numpy]
  2. Loading Data [scikit-learn, pandas]
  3. Data Wrangling [pandas]
  4. Handling Numerical Data [pandas, scikit-learn]
  5. Handling Categorical Data [pandas, scikit-learn]
  6. Handling Text [NLTK, scikit-learn]
  7. Handling Dates and Times [pandas]
  8. Handling Images [OpenCV, matplotlib]
  9. Dimensionality Reduction Using Feature Extraction [scikit-learn]
  10. Dimensionality Reduction Using Feature Selection [scikit-learn]
  11. Model Evaluation [scikit-learn]
  12. Model Selection [scikit-learn]
  13. Linear Regression [scikit-learn]
  14. Trees and Forests [scikit-learn]
  15. K-Nearest Neighbors [scikit-learn]
  16. Logistic Regression [scikit-learn]
  17. Support Vector Machines [scikit-learn]
  18. Naive Bayes [scikit-learn]
  19. Clustering [scikit-learn]
  20. Neural Networks [keras]
  21. Saving and Loading Trained Models [scikit-learn, keras]

To illustrate how the chapters work, let’s look at chapter 15, which covers K-Nearest Neighbors (KNN).
In this case, the introduction recipe (15.0) gives a concise summary of KNN and why it is a
popular tool.

Now that we remember what KNN is used for, we’re likely going to want to apply it
to our data. First, we will want “to find an observation’s k nearest observations (neighbors).”
Recipe 15.1 contains specific code as well as some more detail around the various
algorithm parameters we can tweak such as the distance metrics (Euclidean, Manhattan or Minkowski).
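
As a rough idea of what this kind of recipe involves, here is a minimal sketch using scikit-learn’s NearestNeighbors on the iris data. It is my own example rather than the book’s code, and the new observation is a hypothetical point that I assume is already in standardized units.

```python
# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load the iris data and standardize the features
iris = datasets.load_iris()
features = StandardScaler().fit_transform(iris.data)

# Fit a nearest-neighbors index; the metric could also be
# "manhattan" or "minkowski"
nearest_neighbors = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(features)

# Hypothetical new observation (assumed to be in standardized units)
new_observation = [[1, 1, 1, 1]]

# Find the distances to, and row indices of, the two closest observations
distances, indices = nearest_neighbors.kneighbors(new_observation)

# Show the closest observations
print(features[indices[0]])
```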

Next, recipe 15.2 shows how to take some unknown data and predict its class based on neighbors.
This recipe uses the iris data set but also includes important caveats about scaling data when using KNN.
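
A minimal sketch of that idea, again my own and not the book’s code, might look like the following. The two new flowers are made-up measurements. The scaling step matters because KNN relies on distances, so a feature measured on a larger scale would otherwise dominate the result.

```python
# Load libraries
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the iris data
iris = datasets.load_iris()

# Standardize the features so each one contributes equally to the distance
scaler = StandardScaler()
features_std = scaler.fit_transform(iris.data)

# Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(features_std, iris.target)

# Two made-up flowers in raw measurement units (cm)
new_observations = [[5.0, 3.4, 1.5, 0.2],
                    [6.5, 3.0, 5.5, 2.0]]

# Apply the same scaling used for training, then predict the classes
predicted = knn.predict(scaler.transform(new_observations))

# Show the predicted species names
print(iris.target_names[predicted])
```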

Recipe 15.3 then moves on to cover a common challenge with KNN: how do you select the
best value for k? This recipe uses scikit-learn’s Pipeline class and GridSearchCV
to conduct a cross-validated search over KNN classifiers with different values of k. The code is simple
to comprehend and easy to extend to your own data sources.
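
For readers who have not combined these two tools before, here is a minimal sketch of the general pattern. It is my own version under a few assumptions, not the book’s code: the step names, the candidate values of k and the 5-fold setting are all choices I made for the example.

```python
# Load libraries
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the iris data
iris = datasets.load_iris()

# Chain scaling and the classifier so scaling is refit inside each fold
pipeline = Pipeline([
    ("standardizer", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values of k (an arbitrary range chosen for illustration)
candidate_k = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
search_space = {"knn__n_neighbors": candidate_k}

# Run a 5-fold cross-validated grid search over k
grid_search = GridSearchCV(pipeline, search_space, cv=5).fit(iris.data, iris.target)

# Report the value of k with the best mean cross-validation score
best_k = grid_search.best_params_["knn__n_neighbors"]
print(best_k)
```

Putting the scaler inside the pipeline keeps the held-out fold from influencing the scaling, which is the main reason to pair a pipeline with the grid search.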

The point is that each chapter can be consumed at the individual recipe level or
read more broadly to understand the concept in more detail. I really like this approach
because so many topics are covered at a quick pace. If I feel the need to dive into
the mathematical rationale for an approach, I can use these recipes as a jumping off
point for further review.

Additional Considerations

The only criticism I can offer is that I wish more topics were covered.
Some specific areas I would have liked to learn more about are
ensemble methods and xgboost.

In some cases, it might be useful to understand some of the additional libraries
in the python ecosystem. From an NLP perspective, I know that NLTK is the standard
but have heard good things about spaCy as well, so I would be curious where it fits
in this space. The neural network space is changing rapidly so I think keras was
a good choice but it might be interesting to learn about some of the other options like PyTorch.

I am sure there are a lot of other potential topics that were considered so I can imagine it was really
tough to decide what was in and out of scope. All of my suggestions are based on topics that
sprang to my mind and are meant only as potential ideas for another edition (if that is the plan).

Originally, I had some concerns about using the basic data sets (Titanic, Iris, etc) in most
examples. However, now that I have reflected on it, I like that the examples are so self-contained
and think it would be much more difficult to create such a great resource if there
needed to be more explanation of the data.

Also, it would be nice if the code examples were available online so you could do some quick
copying and pasting instead of typing it all in by hand. They may already be available, so
if I find them, I’ll be sure to update this review.

The final comment I have is related to the price of the book. The current US list
price is $59.99 which may seem steep for a 340-page book. However, I think the book
is worth it and encourage those interested to purchase it. The content is
great and I see it being very useful to those using pandas + scikit-learn on a frequent
basis. It is clear that Chris knows what he is talking about and he explains
the details well. I predict that this book will become well broken in as I frequently refer to it.

Another reason to purchase books like this one is so that
authors and publishers know that the python community values this type of content.
I cannot imagine how long it took Chris to write this book. I can only guess that
the royalties will probably not afford him an early retirement any time soon! Still,
I do want to make sure he gets at least some compensation for this valuable resource
and want to provide encouragement to him for a job well done.

Conclusion

Overall, the Machine Learning with Python Cookbook is an extremely useful book that
is aptly described by its tagline: “Practical Solutions From Preprocessing to Deep Learning.”
Chris has done a fabulous job of collecting a lot of the most common machine learning
problems and summarizing solutions. I definitely encourage those of you using
any of the libraries mentioned here to pick up this book. I have added this
book to my recommended resources page so please check it out and see if
any of the other recommendations might be useful. Also, let me know if you
find this review useful.