LM101-027: How to Learn About Rare and Unseen Events (Smoothing Probabilistic Laws)[RERUN]
Episode Summary:
In this podcast episode, we discuss the design of statistical learning machines which can make inferences about rare and unseen events using prior knowledge.
Show Notes:
Hello everyone! Welcome to a RERUN of the 11th podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.
Today we address a strange yet fundamentally important question. How do you predict the probability of something you have never seen? Or, in other words, how can we accurately estimate the probability of rare events?
This is very important because in real life we may be required to make inferences about rare events which possibly have never occurred in our experience. This is also important because in the real world we often deal with trying to reason about the likelihood of sequences of events.
In the real world we are constantly dealing with rare events, or even events we have never previously experienced. No event is ever experienced in exactly the same way. You may have many visual experiences of viewing the face of a friend. If your friend changes their expression in a novel way, however, you can still recognize your friend.
Furthermore, this is relevant to the control of behavior as well. Suppose we learn from experience how to perform an action in our favorite sport or some other task that requires physical skills. We practice that action multiple times. In a game situation, we can execute novel combinations of these component physical skills and they feel familiar to us even though we have never executed those sequences of skills before.
In a sequence of events, each event in the sequence might be quite common but specific sequences are relatively rare. So, for example, a learning machine that is learning to predict the weather might observe the sequence of events over a 3 month time period:
30 Sunny Days, followed by 30 Cloudy Days, followed by 30 Rainy Days.
As a result of the learning process over this three-month period, the learning machine might conclude that:
- it is very likely that a sunny day follows a sunny day
- it is very likely that a cloudy day follows a cloudy day
- it is very likely that a rainy day follows a rainy day
It will also learn that, approximately once every three months:
- a cloudy day will follow a sunny day
- a rainy day will follow a cloudy day
And finally, based upon the learning machine’s experiences of 30 sunny days, followed by 30 cloudy days, followed by 30 rainy days the learning machine will conclude:
- a rainy day will NEVER follow a sunny day
- a cloudy day will NEVER follow a rainy day
Ideally, we would like to increase the probability or chances that a rainy day will follow a sunny day by a small amount so that the expected frequency of that event is not considered by the learning machine to be absolutely impossible. Of course since probabilities must always add up to one that will also require that we decrease by a small amount the probability that a sunny day will follow a sunny day.
Notice that in our discussion we are making what is called a first-order Markov modeling assumption. A first-order Markov modeling assumption states that the probability that tomorrow will be rainy can be calculated if we know today’s weather.
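As a concrete illustration, here is a minimal sketch (in Python, not from the episode) of estimating these first-order Markov transition probabilities by simple counting; the state labels mirror the weather example above and the variable names are invented:

```python
from collections import Counter, defaultdict

# Hypothetical encoding of the 90-day training sequence described above.
days = ["sunny"] * 30 + ["cloudy"] * 30 + ["rainy"] * 30

# Count first-order transitions: pairs of (today's weather, tomorrow's weather).
transitions = Counter(zip(days, days[1:]))
totals = defaultdict(int)
for (today, _tomorrow), n in transitions.items():
    totals[today] += n

# Maximum likelihood estimates: observed relative frequencies of transitions.
for (today, tomorrow), n in sorted(transitions.items()):
    print(f"P({tomorrow} | {today}) = {n}/{totals[today]} = {n / totals[today]:.3f}")

# Note: the pair (sunny, rainy) never appears in the counts, so its estimated
# probability is implicitly zero -- the learning machine treats it as impossible.
```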
This exact same problem occurs in language modeling. One might think that although the statistics of language are challenging, the problem of rare events is not an issue because one always has available hundreds of thousands of language documents for calculating the probability that one word follows another. However, this assumption is not correct. Even if one has a large sample of language consisting of a million words, there will always be many sequences of words which do not occur in that large sample yet which are common in the English language.
For example, suppose we have a sentence from Shakespeare such as:
“To be or not to be that is the question”.
In this 10 word sentence, there are 8 unique words because the words “to” and “be” each appear twice. Therefore, there are 8 times 8, or 64, possible probabilities that we need to estimate: the probability or chance that one word follows another. Again, we are using a first-order Markov chain model.
The probability or chance that the word “question” follows the word “the” is 100% which means that based upon this single sentence a learning machine would think it is IMPOSSIBLE for the word “dog” to follow the word “the”.
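Here is a minimal sketch of the same bigram calculation on this single sentence; the code and its variable names are illustrative, not from the episode or the book:

```python
from collections import Counter

# The 10-word sentence, lowercased for counting.
words = "to be or not to be that is the question".split()

bigram_counts = Counter(zip(words, words[1:]))  # counts of adjacent word pairs
context_counts = Counter(words[:-1])            # how often each word is followed by anything

# P(question | the) = 1/1 = 100%, while any unseen pair gets probability 0.
print(bigram_counts[("the", "question")] / context_counts["the"])  # 1.0
print(bigram_counts[("the", "dog")] / context_counts["the"])       # 0.0
```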
Of course, this is only one sentence. What if we used lots of sentences from Shakespeare?
So instead of using 1 ten-word sentence, let’s use all of the words in every one of Shakespeare’s plays. That’s a lot more than ten words. In fact, it has been estimated to be 884,657 words, of which 29,066 are unique. Since any one of the 29,066 unique words could, in principle, follow any of the other 29,066 unique words, there are 29,066 multiplied by 29,066, or approximately 844 million, probabilities that we need to estimate. Thus, the number of probabilities we need in order to figure out the chance that one word follows another in the entire works of William Shakespeare is about 1000 times greater than the total number of words in the entire works of William Shakespeare. This means that many of the frequencies we want to estimate will be estimated unreliably, and that many of the probabilities estimated to be zero might not actually correspond to impossible transitions.
Daniel Jurafsky and James H. Martin, in their book Speech and Language Processing, actually estimated these 844 million probabilities. Once these probabilities were estimated, they were able to tell the computer to construct a collection of 29,066 dice, where each die has 29,066 sides and each side corresponds to one of the 29,066 unique words in the Shakespeare corpus. One special die is weighted so that the likelihood a side comes up corresponds to the probability that a sentence from Shakespeare starts with that word. This die is rolled and the word that comes up is recorded; for example, the word “To” might come up. Then the die corresponding to the word “To” is rolled, and the next word which comes up might be the word “be”. Then one looks for the die in the large set of 29,066 dice which corresponds to the word “be”, and that die is rolled. The “be” die assigns a probability to the word “or” under the assumption that the word “or” is preceded by the word “be” in a sentence. By rolling and selecting dice in this fashion, one can generate simulated sentences from the estimated probabilities of our Shakespeare dice probability model.
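The dice-rolling procedure is just sampling from a first-order Markov model. Here is a minimal sketch of the idea using a tiny stand-in corpus; the function names are invented for illustration:

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(words):
    """Build one weighted 'die' per word: counts of which words follow it."""
    dice = defaultdict(Counter)
    for today, tomorrow in zip(words, words[1:]):
        dice[today][tomorrow] += 1
    return dice

def roll(die):
    """Roll a weighted die: pick a next word with probability proportional
    to how often it followed the current word in the training corpus."""
    outcomes = list(die.keys())
    weights = list(die.values())
    return random.choices(outcomes, weights=weights)[0]

# A tiny stand-in corpus; Jurafsky and Martin used the full 884,657-word
# Shakespeare corpus, i.e. 29,066 dice with 29,066 sides each.
corpus = "to be or not to be that is the question".split()
dice = train_bigram_model(corpus)

word, sentence = "to", ["to"]
for _ in range(6):
    if word not in dice:   # no die exists for a word that was never followed
        break
    word = roll(dice[word])
    sentence.append(word)
print(" ".join(sentence))
```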
It is important to emphasize that this model is not only limited because it has so many free parameters relative to the number of available data points. It is also limited because it assumes that the likelihood of the next word in the sequence is only dependent upon the previous word and not the previous words or the intended meaning of the author.
Nevertheless, it is instructive to look at the output of our first-order Markov Shakespeare dice model. Here are some example sentences generated by this collection of 29,066 dice where each die has 29,066 sides. The sentences actually look pretty good!
- What means, sir. I confess she? Then all sorts he is trim captain.
- Why doest stand forth thy canopy forsooth; he is the palpable hit the King Henry. Live king. Follow.
- if it so many good direction found’st thou art a strong upon command of fear not a liberal largess given away Falstaff! Execunt
- The world shall my lord!
As previously noted, given the entire works of William Shakespeare which consists of 884,657 words, the number of probabilities which must be estimated using the assumption of a first-order Markov model is 1000 times greater than the total number of words! That is, many words in the works of William Shakespeare are best treated as “rare” or even “unseen” events!
In fact, this type of analysis reveals that over 99.96% of sequences of two unique words are never observed in the works of Shakespeare!
The most important point of the previous discussion is that it is not sufficient to have a large amount of data for learning. If you have a large amount of training data but the number of relevant statistical regularities in the training data is very small or non-existent, then learning will be difficult without the benefit of either collecting additional data or using prior knowledge. Having a lot of data does not imply that you will be able to reliably estimate probabilities of interest.
The ONLY way to do this is to inject prior knowledge into the learning process. One approach to injecting prior knowledge is to use MAP estimation, which was discussed in Episode 10.
The essential idea of MAP estimation is that we are given a large collection of probabilistic laws and for each probabilistic law we have an additional probability called a “prior”. The collection of probabilistic laws is called the “probability model”. MAP estimation uses the observed data, the probability model, and the priors to calculate the “most probable” probabilistic law given the observed data. In the initial stages of learning when the learning machine has a limited knowledge of its statistical environment, it can be shown that the “most probable” probabilistic law is influenced by the priors. While in the later stages of learning when the learning machine has had the opportunity to observe its environment for a long time period, it can be shown that the “most probable” probabilistic law is estimated using the method of maximum likelihood estimation. Maximum likelihood estimation is an approach which attempts to estimate probabilities that make the observed frequencies of events in the environment most likely. If the choice of probabilities is unconstrained and there is a sufficiently large amount of data, then the observed frequencies of a collection of events are interpreted as estimated probabilities of those events. These ideas were discussed in Episode 10.
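As a minimal numerical illustration of this fading influence (an invented sketch, not from the episode), suppose the prior is represented by fixed pseudo-counts for the four weather states:

```python
# The prior is encoded as fixed pseudo-counts; the MAP-style estimate is then
# just smoothed relative frequencies.
prior_pseudo_counts = {"sunny": 1, "cloudy": 1, "rainy": 1, "snowy": 1}

def map_estimate(observed):
    counts = {k: observed.get(k, 0) + prior_pseudo_counts[k]
              for k in prior_pseudo_counts}
    total = sum(counts.values())
    return {k: round(v / total, 3) for k, v in counts.items()}

print(map_estimate({}))                              # no data yet: pure prior, all 0.25
print(map_estimate({"sunny": 29, "cloudy": 1}))      # early learning: prior still visible
print(map_estimate({"sunny": 2900, "cloudy": 100}))  # late learning: close to 29/30 and 1/30
```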
In our Shakespeare example, we estimated probabilities using observed frequencies of events in the environment so this corresponds to estimating an unconstrained probability model in the final stages of MAP estimation. Thus, the effects of the “priors” are relevant in the initial stages of MAP estimation learning but become less relevant in the final stages of MAP estimation learning.
We now consider one of the earliest methods proposed for estimating the probability of things that are unseen! The approach is best illustrated with the following example. Suppose that we are trying to figure out whether tomorrow’s weather will be either: cloudy, rainy, sunny, or snowy given that today is a sunny day. We have prior knowledge that any of these four weather states is possible but our learning machine only has access to the following training data:
30 Sunny Days, followed by 30 Cloudy Days, followed by 30 Rainy Days.
Using the maximum likelihood estimation method, we estimate the probability that tomorrow will be a sunny day as: 29/30 and the probability that tomorrow will be a cloudy day as 1/30. The probability that tomorrow will be a rainy day is estimated to be zero since today is a sunny day and in the training data rainy days never follow sunny days. The probability that tomorrow will be a snowy day is also equal to zero since “snowy” days were never observed in the training data.
It is not very satisfying to have the learning machine conclude that a “snowy day” or a “rainy day” is absolutely impossible. We would like to increase the probability of such events, but how can we do this in a principled way?
One approach was developed by the mathematician Laplace in the early 19th century. We will call this the “add-one smoothing method”. In the “add-one smoothing method” we simply assume that we have four extra observations corresponding to the events that either a cloudy, rainy, sunny, or snowy day follows a sunny day. This is not data that we observe. This is FAKE DATA that we artificially manufacture and then combine with the original data set. Once we do this, the probability that tomorrow is a sunny day is now equal to (29 + 1)/34, the probability that tomorrow is a cloudy day is (1 + 1)/34, the probability that tomorrow is a rainy day is 1/34, and the probability that tomorrow is a snowy day is 1/34.
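Here is a minimal sketch of this add-one computation; the function name is invented for illustration:

```python
# Observed transitions out of sunny days: 29 sunny, 1 cloudy, 0 rainy, 0 snowy.
observed = {"sunny": 29, "cloudy": 1, "rainy": 0, "snowy": 0}

def add_one_smoothing(counts):
    """Add one fake observation per possible outcome, then normalize."""
    total = sum(counts.values()) + len(counts)   # 30 real + 4 fake = 34
    return {outcome: (n + 1) / total for outcome, n in counts.items()}

for outcome, p in add_one_smoothing(observed).items():
    print(f"P({outcome} | sunny) = {p:.4f}")
# sunny: 30/34, cloudy: 2/34, rainy: 1/34, snowy: 1/34 -- no event is impossible.
```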
At first this seems like a terrible thing to do. Making up data. Couldn’t we get arrested or ostracized from society for making up fake data? Before considering such nightmare scenarios, let’s evaluate this idea objectively.
Notice that before any data is observed, the probabilities of a cloudy, rainy, sunny, and snowy day are all equal to ¼. This seems reasonable given that we have no experience with our environment. Now suppose we have (as before) 30 sunny days, followed by 30 cloudy days, followed by 30 rainy days. By computing frequencies using this fake data, we are still assigning a high probability to the event that tomorrow is a sunny day given today is a sunny day, and a low probability to the event that a cloudy day will follow a sunny day. However, because we added these 4 fake data points, it is no longer IMPOSSIBLE for a rainy or snowy day to follow a sunny day! Furthermore, since we only added 4 data points to the training data, in the later stages of learning, after thousands of observations have been made, the influence of these 4 fake data points will become negligible and the estimated probabilities will again correspond to the observed frequencies in the data. Thus, even if our fake data was poorly chosen, it will not matter after we make many observations of the environment.
This case corresponds to using the fake data to represent prior knowledge that we have no knowledge of whether tomorrow will be a cloudy, rainy, sunny, or snowy day. All of these possibilities are equally likely.
Now suppose that we have prior knowledge that we are predicting the weather in San Diego, California. For San Diego weather, we might want to assign, given that today is a sunny day, a probability of 30% to sunny days, 30% to rainy days, and 30% to cloudy days, with a 10% probability that tomorrow will be snowy. We could represent this type of prior knowledge by adding 10 additional fake data points: 3 sunny days following a sunny day, 3 rainy days following a sunny day, 3 cloudy days following a sunny day, and 1 snowy day following a sunny day.
On the other hand, suppose we have prior knowledge that we are predicting the weather in Alaska. For Alaska weather, we might want to assign, given that today is a sunny day, a probability of 10% to sunny days, 10% to rainy days, and 10% to cloudy days, with a 70% probability that tomorrow will be snowy. We could represent this type of prior knowledge by adding 10 additional fake data points: 1 sunny day following a sunny day, 1 rainy day following a sunny day, 1 cloudy day following a sunny day, and 7 snowy days following a sunny day.
These probabilities correspond to the “priors” of MAP estimation. We won’t get into the technical details here but it is important to simply comment that the prior distribution is called a Dirichlet probability density.
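Here is a minimal sketch of how these weighted fake data points might be combined with the observed data; the function and variable names are invented for illustration:

```python
def smooth_with_pseudo_counts(observed, pseudo_counts):
    """Combine real counts with fake pseudo-counts, then normalize."""
    total = sum(observed.get(k, 0) + pseudo_counts[k] for k in pseudo_counts)
    return {k: (observed.get(k, 0) + pseudo_counts[k]) / total for k in pseudo_counts}

# Fake data points from the two examples above (transitions out of a sunny day).
san_diego_prior = {"sunny": 3, "rainy": 3, "cloudy": 3, "snowy": 1}
alaska_prior    = {"sunny": 1, "rainy": 1, "cloudy": 1, "snowy": 7}

observed = {"sunny": 29, "cloudy": 1}   # the 30 real transitions out of sunny days
print(smooth_with_pseudo_counts(observed, san_diego_prior))
print(smooth_with_pseudo_counts(observed, alaska_prior))
```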
The problem with the methods discussed so far, however, is that although such methods succeed at assigning positive probabilities to events that were never observed, the resulting positive probabilities might in some cases be too small. That is, they may be so small that the events are still impossible for all practical purposes. Alternatively, the positive probabilities assigned to such events might be too large. That is, the probability of a “near impossible event” inappropriately competes with the likelihood of an event which has been typically observed.
One way of dealing with this problem is to use some or all of the data that has been collected or some other data set which is comparable to the data which is presented to the learning machine. By analyzing the properties of a data set, one might be able to arrive at a “prior” which tries to assign a probability which is not too small and not too large to impossible events.
When the “prior” is based upon data that has been previously collected, or even upon properties of the current data set, it is called an empirical Bayes prior. One type of empirical Bayes prior yields a particular method for “smoothing” the probability law called the Good-Turing method.
The Good-Turing method was invented by Turing and his colleague Good during World War II, when Turing worked for the British Government Code and Cypher School. His secret mission was to develop mathematical algorithms to break Germany’s secret code systems. As a direct result of his efforts, the British government was able to decode coded messages sent by the Germans. There are many variations of the Good-Turing method, but the basic idea is simple. We illustrate the basic concept with the following example.
Suppose a learning machine is trying to predict the weather for the next day given that today is a sunny day. The learning machine observes 300 sunny days following a sunny day, 300 rainy days following a sunny day, 300 cloudy days following a sunny day, 200 snowy days following a sunny day, 10 thunderstorms with lightning following a sunny day, 1 tornado following a sunny day, and 1 hailstorm following a sunny day. Assume the learning machine never sees a hurricane following a sunny day, never sees an extremely hot day following a sunny day, never sees an extremely cold day following a sunny day, and never sees it rain cats and dogs.
The maximum likelihood estimation method would therefore estimate the probability of a hurricane following a sunny day as 0% since this event was never observed.
Now we inject the key assumption common to the majority of Good-Turing smoothing methods, which can be semantically interpreted as the following statement:
The expected number of times we observe rare events that occur only once should be used as an estimator of the expected number of times that we observe rare events that never occur.
Thus, in our previous example, the number of times that we observe rare events that occur only once is equal to 2, since we observed 1 tornado following a sunny day and 1 hailstorm following a sunny day. Therefore, we estimate the total expected count of the rare events that never occur as equal to the number of rare events that occur only once. Since the learning machine has never seen the 4 rare events (hurricane, extreme heat, extreme cold, and raining cats and dogs), and under the assumption that these 4 types of rare events are equally likely, we take the number 2 and divide it by 4. Thus, the expected count for the never-observed event of a hurricane is chosen to be ½. The probability of a hurricane in this example, using this simplified version of the Good-Turing method, would therefore be:
(1/2) divided by (the total number of observed days plus 2), that is, (1/2)/(1112 + 2), since 300 + 300 + 300 + 200 + 10 + 1 + 1 = 1112 days were observed.
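Here is a minimal sketch of this simplified Good-Turing calculation; the variable names are invented for illustration:

```python
# Counts from the example above: days observed to follow a sunny day.
observed = {"sunny": 300, "rainy": 300, "cloudy": 300, "snowy": 200,
            "thunderstorm": 10, "tornado": 1, "hailstorm": 1}
unseen = ["hurricane", "extreme heat", "extreme cold", "rain cats and dogs"]

n_total = sum(observed.values())                      # 1112 observed days
n_once = sum(1 for n in observed.values() if n == 1)  # 2 events seen exactly once

# The total count credited to unseen events equals the number of events seen
# exactly once (2), split evenly over the 4 unseen events: 2/4 = 1/2 each.
count_per_unseen = n_once / len(unseen)
p_hurricane = count_per_unseen / (n_total + n_once)   # (1/2) / (1112 + 2)
print(p_hurricane)
```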
As previously mentioned, there are many variations of the Good-Turing method, but all of these variations share the key idea that the number of distinct events observed K times should be used to estimate the expected count of events observed K-1 times. In language modeling, the Good-Turing method is widely used.
All of the methods described so far can be interpreted as “priors” which are incorporated into a MAP estimation learning problem. As we know, a MAP estimation learning machine has the property that the effects of the prior become negligible as the amount of training data becomes very large. On the one hand, this is a great property because if one incorporates prior knowledge which is misleading to the learning machine, the effects of this misleading prior knowledge will tend to become negligible as the learning machine acquires more experience in its environment. On the other hand, this is an undesirable property if one has prior knowledge which one does not want to vanish as the amount of training data becomes large.
In other words, sometimes we have prior knowledge that we do not want to have less impact as we collect more data. We might want this prior knowledge to have the same impact throughout the learning process. This can be achieved by placing constraints on the probability model rather than the priors. For example, suppose the learning machine observes:
30 Sunny Days, followed by 30 Cloudy Days, followed by 30 Rainy Days
and that a particular day is classified as either: sunny, cloudy, rainy, or snowy. As discussed in Episode 10, the technical definition of maximum likelihood estimation is to find the probability laws that make the data most likely.
Since the observed frequency that tomorrow is a sunny day given today is a sunny day is equal to 29/30, assume that the probability that tomorrow is a sunny day given today is a sunny day is a positive number K multiplied by 2 raised to the power of 29/30. The purpose of the positive number K is to “scale” the number 2 raised to the power of 29/30 so that the result is a probability. Note that the probability of an event is an increasing function of the observed frequency of that event: more frequent events are assigned larger probabilities, and less frequent events are assigned smaller probabilities. It can be shown that this method of computing probabilities does not change the priors; it directly changes only the probability law. Do not worry about what value one should choose for the constant K; it is uniquely determined by the requirement that the probabilities add up to one.
Similarly, since the observed frequency that tomorrow is a cloudy day given today is a sunny day is equal to 1/30, we assume that the probability that tomorrow is a cloudy day given today is a sunny day is equal to K multiplied by 2 raised to the power of 1/30.
The advantage of this approach becomes clearer when we consider events whose observed frequency is equal to 0 (that is, the so-called “impossible events”): 2 raised to the power of 0 is equal to 1, so such an event is assigned probability K multiplied by 1.
Thus, this approach assigns a strictly positive probability to all possible event outcomes even if the observed frequency of an event was equal to zero. Moreover, as the number of observations becomes large, the effects of this prior knowledge do not converge to zero. This type of probability distribution is called a Gibbs probability density and it is one of the most widely used methods for representing probability laws.
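Here is a minimal sketch of this Gibbs-style assignment for the weather example; the names are invented for illustration:

```python
# Observed frequencies of tomorrow's weather given today is sunny.
frequencies = {"sunny": 29/30, "cloudy": 1/30, "rainy": 0.0, "snowy": 0.0}

# Gibbs-style assignment: probability proportional to 2 raised to the frequency.
weights = {outcome: 2.0 ** f for outcome, f in frequencies.items()}
K = 1.0 / sum(weights.values())   # scaling constant that makes probabilities sum to 1

for outcome, w in weights.items():
    print(f"P({outcome} | sunny) = {K * w:.4f}")
# The never-observed transitions get 2**0 = 1 times K: strictly positive, and
# this positive probability does not vanish as more observations are collected.
```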
The concept of injecting prior knowledge into probability laws is a vast topic which easily deserves its own podcast series. We will revisit this topic many times in future podcast episodes!
Further Reading:
Allen, J. (1995). Natural Language Understanding.
Gale, W. A. and Sampson, G. (1995). Good-Turing Frequency Estimation without Tears. Journal of Quantitative Linguistics, 2, 217–37.
Jurafsky, D., and Martin, J. H. (2000). Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition.
Stone, J. V. (2013). Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis.
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
Kruschke, J. K. (in press). Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan.
MAP (Maximum A Posteriori) Estimation (Wikipedia entry): http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
Maximum Likelihood (ML) Estimation (Wikipedia entry): http://en.wikipedia.org/wiki/Maximum_likelihood
Also check out the movie Codebreaker (www.turingfilm.com). This is a dramatized documentary of Alan Turing’s contributions to breaking German code ciphers during World War II.