Episode Summary:
In this episode we discuss the problem of evaluating the ability of a learning machine to make generalizations and construct abstractions when the learning machine is provided with only a finite collection of experiences.
Show Notes:
Hello everyone! Welcome to the twelfth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in a hopefully entertaining and educational manner.
The concept of a device which could preserve information without distortion has existed for several thousand years of human history. Such devices are referred to as “books” and have existed in the form of clay tablets, bone, shells, wood, silk, wax-coated wooden tablets, parchment, paper, and (more recently) in the form of “e-books” which exist on electronic devices called “hard drives”. To be sure, the concept of a book, regardless of its implementation, has been one of the most important forces shaping human civilization. Thus, the ability to preserve information without distortion is fundamentally important. We will call this “memorization learning”, and it is not a trivial accomplishment but rather one of the great accomplishments of the human race. We depend on memorization learning in so many ways. For example, your bank records, medical documents, and legal documents can be stored for virtually indefinite periods of time and then retrieved without distortion. Human memory does not have this property. In a few special cases, humans can vividly remember a specific event without distortion, but, typically, human memory tends to be based upon a type of learning which might be referred to as “generalization learning”.
Generalization learning is based upon the idea that information is not stored without distortion; rather, incoming information is filtered and the critical aspects of the incoming information stream are identified and extracted. During the retrieval process, the learning machine attempts to understand and interpret incoming information, regardless of whether it was previously learned or is completely novel, in terms of the statistical regularities the machine extracted during the learning process. Generalization learning involves a process which can be interpreted as “similarity-matching”, where the “similarity” of an input to the contents of the learning machine’s memory is used as a retrieval cue. Computer scientists refer to such systems as “content-addressable” memory systems.
Generalization learning is a typical characteristic of not only human memory but also many machine learning algorithms. Suppose we provide a learning machine with some example situations and then have the learning machine answer questions and make inferences about situations that it has never seen before. We will refer to the example situations provided to the learning machine during the learning process as the training data. We will refer to the data containing the questions for the learning machine about situations that it has never seen before as the test data.
Suppose we train a learning machine to predict the stock market using stock market data collected over the ten-year time period 2004 through 2014. We want to use the learning machine to make good predictions in the future, so we train the learning machine with lots of data from this past decade. We evaluate the learning machine’s performance and find that it has learned the characteristics of the behavior of the stock market very accurately over the time period 2004-2014. The learning machine accurately predicts the behavior of the stock market over this entire ten-year time period. We calculate that, with an initial investment of $1,000, one could easily make $10,000 within a few months using our machine learning algorithm.
We then decide to make some money on the stock market so we make some stock market investments based upon the learning machine that we have trained. Unfortunately, we find that the learning machine’s performance is horrible and that as a result of our stock market investments based upon the learning machine’s predictions we have lost thousands of dollars.
What went wrong?
Well, one possibility is that the behavior of the stock market over the time period 2004-2014 is not representative of its behavior in 2015.
The data characterizing the behavior of the stock market over the time period 2004 through 2014 is the training data, and the data characterizing the stock market in 2015 is the test data. The learning machine is only trained on the training data. The learning machine is NOT trained on the test data.
If the training data is not similar in relevant ways to the test data, then this can contribute to poor performance of the learning machine. Therefore, a prerequisite for the learning machine to make abstractions and generalize from its experiences is that the training data set and the test data set share common important characteristics.
However, this is not enough. Even if the training data and test data share common important characteristics, the learning machine must be designed so that it learns the critical statistical regularities common to both the training data and the test data and it ignores distinctive features of the training data not present in the test data.
Failing to carefully distinguish between performance on a training data set and performance on a test data set can have serious consequences. In the field of healthcare, statistical analyses and machine learning algorithms are frequently used to determine the risk factors associated with a particular disease. Learning machines are frequently used to predict the likelihood that a patient has a serious medical condition given that the patient has specific characteristics. Suppose we predict that someone is in danger of having a stroke given certain measurements on that individual. If the learning machine that was used to make this prediction had been trained using data from patients older than 30, then its predictions might not be valid for patients younger than 30. The training data set and test data set must share certain important common characteristics.
The similarity of the “training data” and “test data” is a necessary but not a sufficient condition to guarantee legitimate inferences. Different learning machines will pay attention to, and ignore, different characteristics of a data set. For example, suppose a learning machine in a health-related area is trained upon a representative sample of the population of US citizens and is only used to make predictions about health conditions for US citizens. Although the training data and the test data share important common characteristics, this is still not sufficient to guarantee that all learning machines will extract the same types of statistical regularities.
One learning machine might extract the following rule from a set of training data:
IF someone has high blood pressure and high cholesterol,
THEN that person is very likely to have a stroke.
While a second learning machine might extract this rule from the same set of training data:
IF someone has high blood pressure and high cholesterol and rarely exercises,
THEN that person is very likely to have a stroke.
In future episodes we will explore the complex question regarding whether the inference from the first learning machine is better or worse than the inference from the second learning machine, but for now it is sufficient for us to recognize that inferences are often more dependent upon the structure of a learning machine than we might initially imagine. There may be more than one way to make correct generalizations from the same set of training data.
Another example arises in the field of forensic medicine. Suppose that a suspect has left some of their DNA at the scene of a crime and a learning machine is used to identify that suspect. The accuracy of the learning machine’s generalization performance needs to be properly understood before its predictions can be used in a court of law. A generalization error in this case might have very serious legal consequences: an entirely innocent individual might be convicted as a murderer, or a murderer might be released as an innocent person.
Similar issues are encountered in areas such as speech recognition, image processing, and the design of robot control mechanisms. In the real world, speech sounds, visual images, and motor behaviors have virtually infinite variation and are rarely repeated exactly. The statistical regularities required to implement voice recognition, visual image processing, and the acquisition of appropriate robotic motor behaviors are notoriously elusive. The generalization performance of these systems, as well as their memorization performance, must be carefully characterized.
To summarize, the ability to extract just the right statistical regularities from a data sample, while ignoring features of that sample which are not characteristic of other data samples, is the essence of a learning machine’s ability to generalize from experience. It is important that we have methods in machine learning for evaluating generalization performance. A learning machine can be 99% accurate in memorizing the training data yet have very poor performance on a test data set.
Hopefully, at this point it should be clear that the problem of characterizing the generalization performance of a learning machine is both prevalent and important. So how can this be done?
One commonly used approach is called the “train-test” method. In the “train-test” method, a data set is divided in half to form two new data sets which are called the “training data” and the “test data”. The data set is divided in such a way that both halves have similar statistical characteristics. The learning machine is “trained” on the training data set and then “tested” on the test data set. Suppose we have a learning machine which is designed to predict the presence of prostate cancer. We train the learning machine on the training data set and examine how well it learned the training data. We find that the accuracy of learning on the training data set is 98%. Next, we evaluate the performance of the learning machine on the test data set and we find that the machine’s accuracy is 95%. The performance of the learning machine on the test data set will generally tend to be lower than its performance on the training data set. These results would support the hypothesis that the learning machine is actually picking up statistical regularities which are common to both the training data and the test data, since the machine’s performance on the two data sets is comparable.
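To make the procedure concrete, here is a minimal sketch of the “train-test” method in Python. The data, the logistic-regression learning machine, and the use of the scikit-learn library are all illustrative assumptions on my part, not details from the episode:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 300 patients, 5 measurements each,
# and a binary label indicating presence of prostate cancer.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)

# Divide the data set in half so both halves have similar statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# "Train" on the training data, then "test" on the test data.
model = LogisticRegression().fit(X_train, y_train)
print("accuracy on training data:", model.score(X_train, y_train))
print("accuracy on test data:    ", model.score(X_test, y_test))
```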
On the other hand, suppose that we repeat the above “train-test” procedure with a different learning machine on the same prostate cancer data set. Our goal is the same: we are interested in predicting the presence of prostate cancer for an individual, but we are now using a different learning machine. The performance on the training data set is 98%, which is the same as in our previous example. However, the performance of the learning machine on the test data set is not 95% as it was in the previous example; it is only 20%. This might seem like a rather unusual and atypical situation but, in fact, this type of situation can often occur.
These results suggest that although the second learning machine is capable of representing crucial statistical regularities in the training data it is not effectively extracting statistical regularities common to both the training data and test data. The generalization performance of this second learning machine is extremely poor. However, if you had only used training data to evaluate both learning machines, then the difference in performance between the two learning machines would not be observable at all.
The methodology we have described here seems very appropriate and, in fact, it is widely used. However, there are some problems with the approach we have just described. The first problem with this approach is that there might be some important statistical regularity which is in the test data but which is not in the training data. When we evaluate the performance of the learning machine, it may appear to have poor generalization performance when, in fact, the problem is that we are testing the learning machine for its ability to detect events in the environment which it never learned!
A second problem with this approach is that there might be some important statistical regularity which is in the training data but which is not in the test data. Thus, the learning machine might appear to have good generalization performance when, in fact, its performance on the test data would have been decreased if the test data set assessment had been more comprehensive.
Both of these problems, as well as other related problems, can be addressed more effectively by a method which is called 2-fold cross-validation. The basic idea of 2-fold cross-validation is as follows. As in the “train-test” method, we divide the data set into two data sets which we will call “Data Set 1” and “Data Set 2”. We then train the learning machine on Data Set 1. Next, we evaluate the performance of the learning machine on both data sets. So far this is exactly the same as the “train-test” method. But now we wipe out the learning machine’s memory and repeat the above procedure, training the learning machine on Data Set 2 rather than Data Set 1. As before, we then evaluate the learning machine’s performance on both Data Set 1 and Data Set 2. Thus, this procedure gives us two different measures of training data set accuracy and two different measures of test data set accuracy.
To be a little more concrete, suppose we divide our data set into two data sets called Data Set 1 and Data Set 2. We train the learning machine to predict the incidence of prostate cancer using Data Set 1 and find its classification performance is 98%. Then we use the same trained learning machine to predict the incidence of prostate cancer using Data Set 2 and find its classification performance is 90%. Next, we repeat the above procedure, but train the machine on Data Set 2 rather than Data Set 1. In this case we find the classification performance when the machine is trained and tested on the same data set is 96%, so the performance on the training data set has decreased a little. On the other hand, we find the classification performance when the machine is trained on Data Set 2 and tested on Data Set 1 is now 94% rather than 90%, so the generalization performance has improved!
We can report the average performance on the training data as the average of 98% and 96% which is 97%. This is the “memorization performance”. We can report the average performance on the test data set as the average of 90% and 94% which is 92%. This is the “generalization performance”. This method has the advantage that it “averages out” statistical flukes associated with different data samples.
Also it has the advantage that it provides some insights into the accuracy of these estimates for memorization performance and generalization performance. That is, since the performance of the machine when trained and tested on Data Set 1 was 98% and the performance of the machine when trained and tested on Data Set 2 was 96%, this implies that average “memorization performance” is 97% plus or minus about 1%. Note that the 97% was obtained by averaging the numbers 96% and 98%.
Similarly, the performance of the machine when it was trained on Data Set 1 and tested on Data Set 2 was 90%. And, the performance of the machine when it was trained on Data Set 2 and tested on Data Set 1 was 94%. Thus, the average “generalization performance” is 92% plus or minus about 2%.
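Here is a minimal sketch of 2-fold cross-validation in Python, continuing the same hypothetical prostate-cancer setup (scikit-learn assumed). Training a fresh model for each fold plays the role of “wiping out the learning machine’s memory”:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))     # hypothetical measurements
y = rng.integers(0, 2, size=300)  # hypothetical cancer labels

train_scores, test_scores = [], []
for train_idx, test_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # "Wipe out the memory": a fresh model is trained for each fold.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    train_scores.append(model.score(X[train_idx], y[train_idx]))  # memorization
    test_scores.append(model.score(X[test_idx], y[test_idx]))     # generalization

print("memorization:  ", np.mean(train_scores), "+/-", np.std(train_scores))
print("generalization:", np.mean(test_scores), "+/-", np.std(test_scores))
```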
Two-fold cross-validation is actually a very good way to characterize memorization and generalization performance: it provides not only estimates of both quantities but also an indication of the accuracy of those estimates.
Nevertheless, the 2-fold cross-validation method has the problem that you are only training the system with half of your data. Presumably, if you trained the system with a larger amount of data, the system would be smarter because it has had more experiences. Sometimes high-quality data is readily available and inexpensive to collect, but typically every data set contains valuable information, and one wants to maximize learning machine performance by extracting as much information as possible from a given data set.
To address this problem, the method of K-fold cross-validation is widely used. The 2-fold cross-validation method described previously may be interpreted as a special case of K-fold cross-validation where K=2. A 3-fold cross-validation procedure is defined as follows.
First, the data set is divided into three smaller data sets of equal size. Second, we train the learning machine on two of the smaller data sets and then test the learning machine on the remaining data set. The remaining data set is called the “test data” in this case. Thus, if the original data set consisted of 300 examples, we would train the learning machine on 200 of the examples and then test the learning machine on the remaining 100 examples. The training data in this example consists of 200 examples and the test data consists of 100 examples. The above procedure is repeated two more times so that each of the three smaller data sets gets a chance to serve as the “test data”. This yields 3 different measures of performance on a training data set of 200 examples and 3 different measures of performance on a test data set of 100 examples. As in 2-fold cross-validation, these numbers can be averaged and their deviations from the average can be documented, so that both memorization and generalization performance can be estimated, as well as the accuracy of these estimates.
Note that one common method for estimating the average deviation is to compute the standard deviation of the learning machine’s per-fold performance estimates, which serves as an indicator of the accuracy of the average performance estimate.
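Putting these pieces together, here is a minimal K-fold sketch, again with hypothetical data and scikit-learn assumed; the standard deviation of the per-fold scores is the rough accuracy indicator just described:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def k_fold_scores(X, y, k):
    """Return per-fold memorization and generalization accuracies."""
    train_scores, test_scores = [], []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = LogisticRegression().fit(X[train_idx], y[train_idx])
        train_scores.append(model.score(X[train_idx], y[train_idx]))
        test_scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(train_scores), np.array(test_scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))     # 300 hypothetical examples
y = rng.integers(0, 2, size=300)

# 3-fold: each fold of 100 examples serves once as the test data while
# the machine is trained on the remaining 200 examples.
train_scores, test_scores = k_fold_scores(X, y, k=3)
print("memorization:  ", train_scores.mean(), "+/-", train_scores.std())
print("generalization:", test_scores.mean(), "+/-", test_scores.std())
```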
The 3-fold cross-validation method has the advantage over 2-fold cross-validation that the learning machine is trained with 200 examples out of the original 300 rather than only 150.
Taking this idea one step further leads us to “leave-one-out” cross-validation. In leave-one-out cross-validation, we train the learning machine on the entire data set except for one example. So if the original data set consisted of 300 examples, we would train the learning machine on 299 examples. Then we test the performance of the learning machine on the example that was “left out” of the training data. This procedure is then repeated 300 times so that every example in the original set of 300 examples has the opportunity to be “left out”. The generalization performance is then estimated by averaging the machine’s predictive performance over the 300 left-out examples (and memorization performance can likewise be averaged over the 300 training runs).
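A minimal sketch of leave-one-out cross-validation under the same hypothetical setup (scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))     # 300 hypothetical examples
y = rng.integers(0, 2, size=300)

scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])  # train on 299
    scores.append(model.score(X[test_idx], y[test_idx]))          # test the 1 left out

# Fraction of the 300 left-out examples that were predicted correctly.
print("leave-one-out generalization accuracy:", np.mean(scores))
```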
The method of “leave-one-out” cross-validation has the nice property that the learning machine is trained with essentially the entire data set except for one example, so the generalization performance of the learning machine when it is trained with essentially all of the data is being characterized. On the other hand, this methodology is very computationally intensive: if our original data set has 10,000 examples, then implementing “leave-one-out” cross-validation requires training the learning machine 10,000 times, where each time the learning machine learns from 9,999 training examples and is tested on the 1 remaining example.
Notice that this method needs to be carefully applied in cases where the learning examples form a sequence. For example, suppose we wanted to apply this technique to evaluate the performance of a machine which predicts the stock market. Our learning machine might work by using the performance of the stock market over the past few weeks to predict a particular security’s value for tomorrow. Also suppose that we had stock market data covering several decades. It probably wouldn’t make sense to use a 2-fold cross-validation method where the first data set consisted of stock prices from the time period 1970 through 1990 and the second data set consisted of stock prices from the time period 1990 through 2010. The types of statistical regularities guiding the stock market over the period 1970 through 1990 are probably fundamentally different from those guiding the market’s behavior over the period 1990 through 2010. The method of 2-fold cross-validation would not work in this instance, and similar problems can be constructed for K-fold cross-validation as well.
On the other hand, it might be reasonable to take a collection of stock market data from the time period 2010 through 2013 and divide that data into smaller blocks, where each small block corresponds to a history of security prices evolving over a time period of about 30 sequential days. Under the assumption that these 30-day blocks of security prices share common statistical regularities yet are independent, one could use (for example) a leave-one-out block cross-validation method, where one trains the learning machine with all but one of the 30-day blocks of security prices and then tests the learning machine on the remaining 30-day block. For a four-year time period with 12 months per year and 30 days per month, this would yield about 48 blocks of 30 days, so you would train the learning machine with 47 blocks of data and then test on the remaining block. This would then be repeated 48 times, using a different block of data each time as the “test data set”. This method, which could be called K-fold time-series block cross-validation, is applicable when the blocks of data are uncorrelated yet share important statistical characteristics.
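Here is a minimal sketch of this leave-one-block-out idea; the 30-day blocks are entirely made up, and a plain least-squares regression stands in for the learning machine:

```python
import numpy as np

# Entirely hypothetical data: 48 blocks of 30 consecutive trading days,
# each day described by 4 features, predicting a next-day value.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(30, 4)) for _ in range(48)]
targets = [rng.normal(size=30) for _ in range(48)]

def fit(X, y):
    # Stand-in "learning machine": ordinary least-squares regression.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

errors = []
for k in range(len(blocks)):
    # Train on 47 blocks, test on the held-out block k.
    train_X = np.vstack([b for i, b in enumerate(blocks) if i != k])
    train_y = np.concatenate([t for i, t in enumerate(targets) if i != k])
    w = fit(train_X, train_y)
    pred = blocks[k] @ w
    errors.append(np.mean((pred - targets[k]) ** 2))

print("block cross-validated MSE:", np.mean(errors), "+/-", np.std(errors))
```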
One additional key point: in practice, the development of a learning machine is an iterative process. One examines the performance of the learning machine and then, based upon its performance, makes modifications to the machine so that it will perform better in the future. When this iterative development methodology is employed, ideally one should use two distinct data sets. The first data set is called the model development data set. The second data set is called the model evaluation data set. The model development data set should be used to support the development and evaluation of the learning machine during the development process. Once the engineer has reached a point where the learning machine has been completely specified, a K-fold cross-validation method should be used in conjunction with the model evaluation data set to estimate memorization and generalization performance.
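One minimal way to organize this workflow, with hypothetical names and an illustrative 50/50 reservation of the evaluation set (scikit-learn assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))     # hypothetical pooled data
y = rng.integers(0, 2, size=600)

# Model development set: used repeatedly while the design is being iterated.
# Model evaluation set: untouched until the machine is completely specified.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.5, random_state=0)

# ... iterate on (X_dev, y_dev): try features, architectures, parameters ...

# Only once the design is frozen: K-fold cross-validation on the evaluation
# set, e.g., using a function like k_fold_scores(X_eval, y_eval, k=3) above.
```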
The above approaches are important tools for evaluating the generalization performance of a learning machine. We have also mentioned a few situations where failure to check generalization performance can have serious real-world consequences. Real-world applications of learning machines whose generalization performance must be carefully examined using techniques such as those discussed in this podcast are widespread. For example, machine learning algorithms are used to buy and sell stocks on the stock market, they are often used as initial screening tests for determining whether a patient requires a more invasive surgical examination, and they are often used in forensic science for DNA matching that may ultimately sway an entire jury towards deciding whether a suspect is guilty or innocent. Without methods for checking the accuracy of generalization performance such as those described in this podcast, these types of machine learning algorithms could have detrimental financial, medical, and legal effects on numerous individuals throughout the world.
In summary, we have talked about a fundamental problem in the field of machine learning which is concerned with evaluating not just the memorization performance of a learning machine but also its generalization performance. We have also shown how the K-fold cross-validation methodology can not only be used to provide estimates of both memorization and generalization performance but also can be used to estimate the accuracy of these estimators. These topics are of critical importance in the development and evaluation of machine learning algorithms which are used by all of us in our everyday lives.
Further Reading:
Cross Validation Wikipedia article (http://en.wikipedia.org/wiki/Cross-validation_(statistics))
Copyright © 2014 by Richard M. Golden. All rights reserved.
Comments:
I just came to know about this podcast a few months ago and I really like the mild theoretical treatment of fundamental topics in AI and machine learning. Please keep up the good work. Regarding the K-fold block time-series cross-validation, I’m wondering whether it takes the temporal ordering of the blocks into account. Otherwise, I guess we would need to avoid a scenario where we train on a more recent block and predict on an older one. Is it just up to the engineer to ensure that this doesn’t happen, or is there a detailed procedure for cross-validating time-series data?
Thanks for the great question. I’ll try to answer it. In my podcast I assumed that the time series has the property that the blocks are independent and identically distributed. This can occur, for example, if the time series is stationary and “tau-dependent” and each block is separated from the next by a distance of tau, where tau is an integer. A “tau-dependent” time series is a time series of observations x(1), x(2), … such that x(t) and x(t+s) are independent whenever s > tau. If tau = 0, the time series consists of independent observations. This sounds great, but unfortunately even a linear dynamical system driven by noise, such as x(t+1) = x(t) + n(t) where n(t) is noise, typically won’t be tau-dependent for any finite tau. However, the dependence may fall off exponentially, and if it falls off exponentially and certain additional conditions are satisfied, laws of large numbers and central limit theorems will still hold for such dependent stochastic sequences. In practice, I prefer to use nonparametric bootstrap methods.
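To make the “falls off exponentially” point concrete, here is a tiny numpy sketch; the damping factor a < 1 is my addition, so the simulated series is stationary rather than the pure random walk written above. The estimated autocorrelation never hits exactly zero at any finite lag, so the series is not tau-dependent, but the dependence shrinks geometrically:

```python
import numpy as np

rng = np.random.default_rng(0)
a, T = 0.9, 100_000                     # damping a < 1 keeps the series stationary
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = a * x[t] + rng.normal()  # linear dynamical system driven by noise

x -= x.mean()
for lag in (1, 5, 10, 20, 40):
    corr = np.mean(x[:-lag] * x[lag:]) / np.var(x)
    print(f"lag {lag:2d}: autocorrelation ~ {corr:+.3f}")  # decays roughly like a**lag
```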
Here is a useful (advanced) review article which might be helpful: http://www.math.ucsd.edu/~politis/StatSci03.pdf
However, I should emphasize that one must also take into account one’s knowledge of the statistical characteristics of the data set, as discussed in the podcast. Sooner or later the engineer needs to go beyond the existing scientific understanding of the algorithms!!