Book Review – The Elements of Statistical Learning: Data Mining, Inference, and Prediction

The Elements of Statistical Learning
Springer Series in Statistics
Authors: Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Category: Machine Learning Algorithms
Publisher: Springer
Year: 2001
Pages: 533

About the Book:
This book is a collection of loosely organized topics, but the discussion of each topic is extremely clear. The loose organization has the advantage that one can flip around the book and read different sections without having to read earlier sections first. A beginner to machine learning might start by reading Chapters 1, 2, 3, 4, 5, 11, 13, and 14 very carefully and then read the initial sections of the remaining chapters to get an idea of the types of topics they cover.

The choice of topics hits most of the major areas of machine learning, and both the pedagogical style and the writing style are quite clear. There are lots of great exercises, great color illustrations, intuitive explanations, relevant but not excessive mathematical notation, and numerous comments which are extremely relevant for applying these ideas in practice. Both this book and the text by Bishop (Pattern Recognition and Machine Learning) are handy references which I like to keep by my side at all times! Indeed, these two texts are perhaps the most popular graduate-level textbooks on machine learning.

I would say that if your training is in statistics or mathematics, you will probably like The Elements of Statistical Learning a little better than Pattern Recognition and Machine Learning; if your training is in engineering, you may like Pattern Recognition and Machine Learning a little better.

Do not confuse this book with “An Introduction to Statistical Learning: with Applications in R” by James, Witten, Hastie, and Tibshirani. “An Introduction to Statistical Learning” is also a great book and covers similar topics, but it is less mathematical and more focused on applications and software implementations than “The Elements of Statistical Learning”.

Chapter 2 provides an overview of supervised learning. Chapters 3 and 4 discuss linear methods for regression and classification. Then, Chapter 5 introduces the key concepts of basis functions and regularization. A “basis function” can be described as a type of feature detector: with the right feature detectors, or “basis functions”, it may be possible to approximate a complicated nonlinear function as a weighted sum of basis functions (a minimal numerical sketch of this idea appears after this chapter overview). This type of approach is used in eigenvector analysis and Fourier analysis and plays a key role in deep learning methods. Regularization, the other crucial concept, is also discussed here. A nice feature of Chapter 5 is that it includes brief but useful discussions of Reproducing Kernel Hilbert Spaces and wavelet smoothing.

Chapter 6 discusses kernel methods, and Chapter 7 discusses model assessment and selection. In particular, Chapter 7 covers the VC dimension, BIC model selection methods, MDL (Minimum Description Length) model selection methods, and bootstrap methods. Chapter 8 covers model averaging and has a very nice discussion of the relationship between model averaging and bootstrap sampling methods (the second sketch below illustrates this connection through bagging). Markov chain Monte Carlo methods and the Expectation-Maximization algorithm are also discussed.

Chapter 11 explains concepts associated with parameter estimation in feedforward multilayer perceptrons and provides helpful advice and warnings. Chapter 12 introduces Support Vector Machines as an alternative to feedforward multilayer neural networks. Chapter 13 discusses K-means clustering and nearest-neighbor methods. Chapter 14 discusses unsupervised learning and includes not only a discussion of standard cluster analysis methods but also a discussion of Self-Organizing Maps. Standard Principal Component Analysis is discussed, as well as the relatively new and not yet fully established method of Independent Component Analysis.
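To make the basis-function idea concrete, here is a minimal sketch in Python with NumPy. The Gaussian bumps, their number and width, and the ridge penalty below are illustrative choices of mine, not taken from the book; the point is only the general recipe the chapter describes: expand the input through fixed basis functions, then fit the weights by regularized least squares.

    import numpy as np

    # Toy data: noisy samples of a nonlinear target function.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

    # Basis expansion: each "feature detector" h_j is a Gaussian bump.
    centers = np.linspace(0.0, 1.0, 12)   # basis function locations
    width = 0.1                           # common bandwidth
    H = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

    # Ridge-regularized least squares: minimize ||y - Hw||^2 + lam * ||w||^2.
    lam = 1e-3
    w = np.linalg.solve(H.T @ H + lam * np.eye(centers.size), H.T @ y)

    # The fit is a weighted sum of basis functions: f(x) = sum_j w_j * h_j(x).
    y_hat = H @ w
    print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))

With lam set to zero this reduces to ordinary least squares on the expanded features; increasing lam shrinks the weights, trading variance for bias, which is exactly the trade-off that the model assessment tools of Chapter 7 are designed to measure.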
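Similarly, the Chapter 8 relationship between model averaging and the bootstrap is easiest to see in bagging (bootstrap aggregation): fit the same model to many bootstrap resamples of the data and average the resulting predictions. Here is a minimal sketch along those lines; the one-split regression stump used as the base learner is my own illustrative choice, not an example from the book.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0.0, 1.0, 80))
    y = np.where(x > 0.5, 1.0, -1.0) + 0.3 * rng.standard_normal(x.size)

    def fit_stump(x, y):
        # One-split regression stump: choose the split minimizing squared error.
        best = (np.inf, 0.5, y.mean(), y.mean())
        for s in np.linspace(0.05, 0.95, 19):
            left, right = y[x <= s], y[x > s]
            if left.size == 0 or right.size == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, s, left.mean(), right.mean())
        return best[1:]                         # (split, left mean, right mean)

    def predict_stump(params, x):
        s, lo, hi = params
        return np.where(x <= s, lo, hi)

    # Bagging: average the predictions of stumps fit on bootstrap resamples.
    preds = []
    for _ in range(100):
        idx = rng.integers(0, x.size, x.size)   # resample with replacement
        preds.append(predict_stump(fit_stump(x[idx], y[idx]), x))
    bagged = np.mean(preds, axis=0)
    print("bagged RMSE:", np.sqrt(np.mean((y - bagged) ** 2)))

Because each stump is fit to a slightly different resample, the individual errors partially cancel in the average; this variance reduction for unstable base learners is the heart of the connection the chapter develops.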

Target Audience:
In order to read this textbook, a student should have taken the standard lower-division course in linear algebra, a lower-division course in calculus (although multivariate calculus is recommended), and a calculus-based probability theory course (typically an upper-division course). With this background the book may be a little challenging to read, but it is certainly accessible to students with this relatively minimal mathematical preparation. If you have a PhD in Statistics, Computer Science, Engineering, or Physics, you will find this book extremely useful because it will help you make contact between machine learning and topics with which you are already familiar.

About the Authors:
Professors Trevor Hastie and Robert Tibshirani are Professors of Statistics and Biomedical Data Science at Stanford University. Professor Jerome H. Friedman is a Professor of Statistics at Stanford University. All three authors have made important contributions to the fields of statistics and machine learning.