LM101-042: What happened at the Monte Carlo Inference Methods Tutorial at the 2015 Neural Information Processing Systems Conference?
Episode Summary:
This is the second of a short subsequence of podcasts providing a summary of events associated with Dr. Golden’s recent visit to the 2015 Neural Information Processing Systems Conference. This is one of the top conferences in the field of Machine Learning. This episode reviews and discusses topics associated with the Monte Carlo Inference Methods Tutorial held on the first day of the conference.
Show Notes:
Hello everyone! Welcome to the forty-second podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in a hopefully entertaining and educational manner. This is the second of a short subsequence of podcasts in the Learning Machines 101 series designed to provide a brief overview of my experiences and perspectives on my visit to the 2015 Neural Information Processing Systems Conference in Montreal, Canada. This episode reviews and discusses topics associated with the Monte Carlo Inference Methods Tutorial, which was held on the first day of the conference. A general introduction to the Neural Information Processing Systems Conference and a review of the morning Deep Learning Tutorial, which also took place on the first day of the conference, can be found in Episode 41 of Learning Machines 101.
I will now provide a brief review of the conference based upon the notes that I took during the conference. This review will be flavored with my own opinions and comments randomly interjected. I will do my best to distinguish my thoughts from the ideas presented when my opinions differ from or add to what was presented, and I will also try not to distort the presentations. However, please keep in mind that my notes were simply written into my Kindle and I am not referencing a video or audio recording in writing this summary. I encourage you to visit the official Neural Information Processing Systems Conference website, whose URL is provided in the show notes at: www.learningmachines101.com, because some of the presentations, video-recordings, posters, symposia, and tutorials for the Neural Information Processing Systems Conference have been posted on-line.
The first day of the Neural Information Processing Systems Conference consists of tutorials. The early afternoon tutorial session was from 1pm to 3pm and consisted of two tutorials in parallel. The slides for both of these tutorials are available on-line from the Neural Information Processing Systems website and, for convenience, I have provided explicit hyperlinks to these presentations at: www.learningmachines101.com. The two tutorials were the Monte Carlo Inference Methods tutorial taught by Iain Murray and the Probabilistic Programming tutorial taught by Frank Wood. I attended the Monte Carlo Inference Methods tutorial. As I briefly review the contents of that tutorial, I will also briefly comment upon what Probabilistic Programming is and why it is so important. Also note that Monte Carlo inference methods have been previously introduced and discussed in Episodes 22, 24, and 39 of Learning Machines 101. Visit the website: www.learningmachines101.com to either listen to these episodes or download copies of their transcripts.
The Monte Carlo Inference Methods tutorial at the 2015 Neural Information Processing Systems Conference began with some comments about the origins of Monte Carlo Markov Chain (also known as Markov chain Monte Carlo or MCMC) methods, which evolved from work in the early 1950s by scientists at Los Alamos, many of whom had worked on the top-secret Manhattan atomic bomb project. In fact, the classic paper published in 1953 titled “Equation of State Calculations by Fast Computing Machines” by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller introduces the key idea of the well-known Metropolis Monte Carlo Markov Chain Algorithm, which is a special case of the more general Metropolis-Hastings algorithm introduced in 1970 by W. K. Hastings. Apparently, the first author Metropolis mainly provided the computing systems, while Arianna Rosenbluth wrote the code for the algorithm. The Wikipedia entry on the Metropolis-Hastings algorithm notes (and this is a quote from Wikipedia):
“There is controversy over the credit for discovery of the algorithm. Edward Teller states in his memoirs that the five authors of the 1953 paper worked together for “days (and nights)”.[4] M. Rosenbluth, in an oral history recorded shortly before his death[5] credits E. Teller with posing the original problem, himself with solving it, and A.W. Rosenbluth (his wife) with programming the computer. According to M. Rosenbluth, neither Metropolis nor A.H. Teller participated in any way. Rosenbluth’s account of events is supported by other contemporary recollections.[6] According to Roy Glauber and Emilio Segrè, the original algorithm was invented by Enrico Fermi and reinvented by Stan Ulam.”
In any case, the Monte Carlo Inference Methods tutorial then turned to a simple linear regression modeling example where the goal is to make a prediction, which I will call Y, from some predictor, which I will call X. For example, X might be a student’s grade point average and Y might be the student’s score on the Graduate Record Exam, also known as the GRE. In a linear regression model, the predicted GRE score is defined as a slope parameter estimate multiplied by the student’s GPA plus an intercept parameter estimate. Thus, given data from a collection of 20 students, one estimates two parameter values: the slope parameter and the intercept parameter. Suppose the slope parameter estimate was 100 and the intercept parameter estimate was 50 for a group of 20 students. Now we take another group of 20 students, re-estimate the slope parameter and intercept parameter, and obtain a slope parameter estimate of 105 and an intercept parameter estimate of 55. If we repeat this process many times for different groups of 20 students, we obtain a probability distribution of the parameter estimates given a data sample of 20 students.
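To make this concrete, here is a minimal Python sketch of the repeated-estimation idea using simulated data; the “true” slope of 100, intercept of 50, the noise level, and the GPA range are all made-up illustration choices rather than anything taken from the tutorial:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship used only to simulate data:
# GRE = 100 * GPA + 50 + noise, mirroring the numbers in the example above.
true_slope, true_intercept = 100.0, 50.0

slope_estimates, intercept_estimates = [], []
for group in range(1000):  # many independent groups of 20 students
    gpa = rng.uniform(2.0, 4.0, size=20)
    gre = true_slope * gpa + true_intercept + rng.normal(0.0, 25.0, size=20)
    # Ordinary least-squares fit of GRE on GPA for this group of 20 students.
    slope, intercept = np.polyfit(gpa, gre, deg=1)
    slope_estimates.append(slope)
    intercept_estimates.append(intercept)

# The spread of these estimates approximates the probability distribution
# of fitted models for samples of 20 students.
print(np.mean(slope_estimates), np.std(slope_estimates))

A histogram of the collected slope estimates is exactly the kind of distribution over fitted models described above.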
Notice that a pair of slope and intercept parameter estimates uniquely determines a specific linear regression model. If a linear regression model consists of specific slope and intercept parameter values which are known constants calculated from a data set, then that linear regression model is called a “fitted linear regression model”.
Thus, another way to think about this is that we are computing a probability distribution of fitted linear regression models given the data sample of 20 students. The likelihood of a particular fitted linear regression model given the data sample corresponds to the probability distribution of the parameter estimates given the data sample of 20 students.
This is a major reconceptualization of the goal of learning. Rather than viewing the goal of learning as an attempt to identify two parameter values, one’s goal is to identify a probability distribution of fitted linear regression models. This probability distribution provides more valuable information than two numbers. Unfortunately, for many real-world applications the normalization constant for this probability distribution is computationally intractable. This type of viewpoint is often referred to as an example of a Bayesian inference strategy.
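In symbols, using standard notation rather than anything quoted from the tutorial slides, the object of interest is the posterior distribution of the parameter vector $\theta$ (here the slope and intercept) given the observed data $D$, and it is the integral in the denominator, the normalization constant, that is typically intractable:

$$ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'} $$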
Despite the fact that the probability distribution of the parameter values given the observed data is often computationally intractable, this distribution is useful for deriving improved methods of estimation and inference. For example, given the probability distribution of the parameter values given the observed data, one can construct a computationally intractable multidimensional integral which predicts a student’s GRE score given the student’s GPA by computing a weighted average across the predictions of an infinite number of linear regression models. Each weight in the weighted average specifies the likelihood of the parameter values for a particular regression model given the observed data. Computing this integral is called the “posterior predictive density problem”. A second important computationally intractable multidimensional integral essentially computes a weighted average of the average prediction error of each fitted linear regression model, where the weight for a prediction error is again the likelihood of that fitted linear regression model given the observed data; this second problem amounts to estimating the posterior expected prediction error.
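Written out, again in standard notation rather than notation quoted from the tutorial, the posterior predictive density for a new student’s GRE score $y^{\ast}$ given that student’s GPA $x^{\ast}$ averages the model’s predictive distribution over the posterior:

$$ p(y^{\ast} \mid x^{\ast}, D) = \int p(y^{\ast} \mid x^{\ast}, \theta)\, p(\theta \mid D)\, d\theta $$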
Monte Carlo Markov Chain methods provide mechanisms for addressing the above issues.
The basic idea is to come up with a computationally tractable way to generate random choices for parameter values such that the probability of generating a parameter value in some range is easily and implicitly governed by the computationally intractable probability of the parameter values given the observed data. This is the essence of the Monte Carlo Markov Chain method. Briefly, one starts with random parameter values, randomly perturbs them in a special way, then randomly perturbs them again, and if this is done in just the right way…eventually the parameter values generated will have the right properties! That is, the probability of observing a particular fitted linear regression model, as specified by its parameter values, as a result of this random perturbation process will be approximately equal to the mathematically correct expression for the probability of the fitted linear regression model given the observed data.
Then if you want to compute a weighted average of prediction errors, this works by simply generating parameter values using the Monte Carlo Markov Chain method…estimating the prediction error…generating another set of parameter values and estimating the prediction error again…and then averaging the prediction errors. This implicitly implements just the right weighted average of prediction errors. Similarly, this technique can be used to compute a weighted average of predictions, such as a predicted GRE score given an input such as a GPA.
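Here is a minimal Python sketch of both ideas: a random-walk Metropolis sampler for the slope and intercept of the toy regression example, followed by an averaged prediction. The simulated data, the assumed noise level, the Gaussian prior, the proposal width, and the burn-in length are all illustration choices, not details taken from the tutorial:

import numpy as np

rng = np.random.default_rng(0)

# Toy data set: 20 students' (GPA, GRE) pairs, simulated for illustration.
gpa = rng.uniform(2.0, 4.0, size=20)
gre = 100.0 * gpa + 50.0 + rng.normal(0.0, 25.0, size=20)

def log_posterior(theta, noise_sd=25.0, prior_sd=1000.0):
    # Unnormalized log posterior for (slope, intercept) under a Gaussian
    # noise model and a broad Gaussian prior. The intractable normalization
    # constant is never needed because it cancels in the acceptance ratio.
    slope, intercept = theta
    log_lik = -0.5 * np.sum((gre - (slope * gpa + intercept)) ** 2) / noise_sd ** 2
    log_prior = -0.5 * (slope ** 2 + intercept ** 2) / prior_sd ** 2
    return log_lik + log_prior

theta = np.array([0.0, 0.0])  # arbitrary starting parameter values
samples = []
for step in range(20000):
    proposal = theta + rng.normal(0.0, 0.5, size=2)  # random perturbation
    # Accept the perturbed parameters with probability min(1, posterior ratio).
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)
samples = np.array(samples[5000:])  # discard an initial "burn-in" period

# Averaging predictions over the sampled models approximates the posterior
# predictive mean of a new student's GRE score given a GPA of 3.5.
print(np.mean(samples[:, 0] * 3.5 + samples[:, 1]))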
The parameter vector, in general, will have many more components than just two components. This technique is applicable and has been applied in cases where there are hundreds or even thousands of parameter values. In fact, this technique is not limited to addressing problems such as “model averaging” and estimating the “posterior predictive density”. It is also useful for addressing problems such as “probabilistic constraint satisfaction”.
In the constraint satisfaction problem, we have a large collection of probabilistic constraints. For example, we might have knowledge that: “If Tweety is a bird, then there is a 90% probability that Tweety can fly”, “If Tweety is a bird and Tweety’s feet are stuck in concrete, then there is a 5% probability that Tweety can fly”, and “If Tweety is a bird, then there is a 5% probability that Tweety is not an ostrich.” Although it is often computationally intractable to explicitly compute the joint probability distribution associated with this probabilistic database, Monte Carlo Markov Chain methods can be used to generate answers to probabilistic queries such as: “If Tweety cannot fly, what is the probability that Tweety’s feet are stuck in concrete?” These methods work by generating different situations involving Tweety in such a manner that the probability of observing a situation where Tweety cannot fly and Tweety’s feet are stuck in concrete is a close approximation to the actual probability that Tweety cannot fly and Tweety’s feet are stuck in concrete. Note that here the concept of “actual probability” refers to the probabilistic laws specified in the probabilistic database as opposed to actual frequencies of events in the world.
The percentage of times that Tweety cannot fly and Tweety’s feet are stuck in concrete can thus be estimated by averaging across these Monte Carlo Markov Chain generated scenarios, and from such averages one can estimate the desired probabilities.
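Here is a minimal Python sketch of this scenario-counting idea. For clarity it uses plain forward sampling and counting rather than an actual Markov chain, and the 1% prior probability that Tweety’s feet are stuck in concrete is an invented assumption needed to complete the toy knowledge base; only the 90% and 5% flying probabilities come from the constraints above:

import numpy as np

rng = np.random.default_rng(0)

P_STUCK = 0.01                 # invented prior: feet stuck in concrete
P_FLY_GIVEN_NOT_STUCK = 0.90   # from the probabilistic constraints above
P_FLY_GIVEN_STUCK = 0.05       # from the probabilistic constraints above

stuck_and_not_fly = 0
not_fly = 0
for scenario in range(1_000_000):  # generate many situations involving Tweety
    stuck = rng.uniform() < P_STUCK
    p_fly = P_FLY_GIVEN_STUCK if stuck else P_FLY_GIVEN_NOT_STUCK
    flies = rng.uniform() < p_fly
    if not flies:
        not_fly += 1
        stuck_and_not_fly += stuck

# Estimate of P(feet stuck in concrete | Tweety cannot fly) by counting.
print(stuck_and_not_fly / not_fly)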
There are many ways to construct Monte Carlo Markov Chain algorithms which generate samples from desired probability distributions. And, as previously noted, such algorithms work by providing a computationally easy iterative algorithm which generates a sequence of state vectors which eventually can be interpreted as samples generated from a computationally intractable probability distribution. This sequence of state vectors is called the “Markov Chain”. Averaging functions of these samples then implements a way of computing weighted averages where the weight of each sample is equal to the probability of observing that sample. Monte Carlo Markov Chain algorithms can also be used to find approximate maximizers of computationally intractable functions. This is achieved by constructing a probability distribution whose maximum values or “modes” correspond to the maximizers of the function which is being maximized. Sampling from this distribution implies that the most frequently observed states will be the most probable states.
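A minimal sketch of this mode-seeking use, assuming a made-up one-dimensional objective f: the sampler targets a distribution proportional to exp(f(x)/T), the standard simulated-annealing-style construction rather than anything specific to the tutorial, so states with high values of f are visited most often:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A hypothetical hard-to-maximize objective, for illustration only.
    return -abs(x - 3.0) + np.sin(5.0 * x)

# Metropolis sampling from a distribution proportional to exp(f(x) / T).
# The modes of this distribution sit at the maximizers of f; lowering the
# "temperature" T concentrates the distribution more sharply on them.
T = 0.1
x, best_x = 0.0, 0.0
for step in range(100_000):
    proposal = x + rng.normal(0.0, 0.5)
    if np.log(rng.uniform()) < (f(proposal) - f(x)) / T:
        x = proposal
    if f(x) > f(best_x):
        best_x = x  # track the best state visited so far

print(best_x, f(best_x))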
Unfortunately there are always problems to be addressed (even in paradise). The first problem is that it may take some time for the Markov Chain to converge. The second problem is that it may be difficult to assess whether the Markov Chain has converged.
Theoretically, the first problem is partially addressed by the fact that the convergence rate of Monte Carlo Markov Chain algorithms is geometric, which means convergence is potentially very fast, but the story is more complicated in empirical settings. The second problem is often addressed by simply running a Monte Carlo Markov Chain algorithm to convergence and recording the solution, then repeating the process and recording the solution again, and possibly repeating this process several more times to obtain additional solutions. If the resulting solutions are very different, this suggests that the convergence criteria need to be re-examined.
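One standard way to make this compare-repeated-runs idea quantitative is the Gelman-Rubin diagnostic, which compares the variance between independent chains to the variance within each chain. The sketch below is a common textbook formulation of that diagnostic, not something quoted from the tutorial:

import numpy as np

def gelman_rubin(chains):
    # Gelman-Rubin potential scale reduction factor (R-hat) for m
    # independent chains of n samples each of the same scalar quantity,
    # supplied as an array of shape (m, n). Values near 1 are consistent
    # with convergence; values well above 1 suggest the chains disagree
    # and should be run longer.
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# Example: four hypothetical well-mixed chains of 1000 draws of the slope.
rng = np.random.default_rng(0)
fake_chains = rng.normal(100.0, 5.0, size=(4, 1000))
print(gelman_rubin(fake_chains))  # close to 1 for these draws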
Both problems can sometimes be effectively addressed by developing “custom-designed” Monte Carlo Markov Chain algorithms. There are many different techniques for implementing Monte Carlo Markov Chain algorithms, including: Gibbs Sampling, Block Gibbs, Collapsed Gibbs, the Metropolis algorithm, the Metropolis-Hastings algorithm, Slice Sampling, Hamiltonian Monte Carlo, and hybrid methods which combine some or all of the above algorithms in various combinations. Here is a nice quote (from my memory) from Iain which I particularly liked:
“The problem is that there are lots of clever things to do but we don’t know which thing to do.” If you are an expert in this area, then you can sometimes hit upon a good solution relatively rapidly because you might have a good feeling for what works best in particular situations, but there are not many theoretical results which can be used to guide even the expert in choosing which algorithm should be used in which situation. Of course, the problem is even more challenging for the novice, who may not be aware of all of the variations of different Monte Carlo Markov Chain algorithms and their relative strengths and weaknesses in different situations.
This leads to an exciting new area, which was actually the topic of the other tutorial (the one I did not attend), called “Probabilistic Programming”. This tutorial was taught by Frank Wood, and the slides for both the Monte Carlo tutorial and the Probabilistic Programming tutorial can be found on the Neural Information Processing Systems 2015 website. Links to these tutorials can be found in the show notes of this episode at: www.learningmachines101.com. To understand the concept of Probabilistic Programming, consider a modern computer programming language such as C++, MATLAB, Java, or Python. A software developer can program in these languages directly without having to learn to program in assembly language. A compiler typically translates the commands in a high-level language into low-level assembly language so that the computer can execute the commands of the high-level language.
By setting things up in this way, the algorithm designer can program a high-level algorithm in C++ or MATLAB without having to worry about the technical details of correct implementation in assembly language. This increases the efficiency and power of algorithm software development. On the other hand, technical software experts whose specialization is assembly language can devote their energies to developing improved optimizing compilers without having to worry about how the end-products of those compilers will be used in algorithm applications.
Probabilistic programming is inspired by this approach. The basic idea is that the algorithm engineer specifies the probability model which generates the observed data, in a high-level “probabilistic programming language”, without worrying about which Monte Carlo Markov Chain algorithm to use. The compiler for the probabilistic programming language then makes decisions regarding what type of Monte Carlo Markov Chain algorithm to use and how to determine when the Markov chain has converged to an acceptable solution. This allows experts in Machine Learning and Statistics specializing in Monte Carlo Markov Chain algorithms to contribute directly to the development of the compiler without worrying about applications. On the other hand, the probabilistic programming language and compiler make both novices and experts more productive by providing an efficient software development environment. Examples of probabilistic programming languages discussed in the tutorial include BUGS and Stan.
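As a flavor of what this looks like, here is a minimal sketch of the GPA-to-GRE regression written in Stan’s modeling language and fitted from Python, assuming the older PyStan 2 interface is installed; the model, the tiny data set, and the sampler settings are illustrative choices, not an example taken from the tutorial:

import pystan  # assumes the PyStan 2 interface to Stan is installed

# The model is written in Stan's own language: the programmer states how
# the data are assumed to be generated and never mentions a sampler.
model_code = """
data {
  int<lower=0> N;
  vector[N] gpa;
  vector[N] gre;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  gre ~ normal(intercept + slope * gpa, sigma);
}
"""

# A hypothetical toy data set of five students.
data = {"N": 5,
        "gpa": [3.1, 3.6, 2.8, 3.9, 2.5],
        "gre": [360.0, 410.0, 330.0, 445.0, 300.0]}

# Stan's compiler chooses and tunes the Monte Carlo algorithm (a form of
# Hamiltonian Monte Carlo) behind the scenes.
model = pystan.StanModel(model_code=model_code)
fit = model.sampling(data=data, iter=2000, chains=4)
print(fit)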
Finally, at the end of the tutorial, the speaker made some interesting comments about the wide range of applications which involve Monte Carlo Markov Chain methods. In particular, the speaker mentioned that the top three methods used to predict the existence of dark matter in the universe are based upon Monte Carlo Markov Chain methods. Physicists specify complex probabilistic models which represent their beliefs and experimental findings regarding the nature of the universe. The Monte Carlo Markov Chain method is then used to estimate the likelihood of dark matter in specific regions. I’ve included a few references involving Monte Carlo Markov Chain methods in astrophysics in the show notes.
I have provided in the show notes at: www.learningmachines101.com hyperlinks to all of the papers published at the Neural Information Processing Systems conference since 1987, the workshop and conference schedule for the Neural Information Processing Systems conference, and links to related episodes of Learning Machines 101!
I have been reviewing the user profiles and many of you still have not updated your user profiles. Regardless of whether or not you are currently a member of the Learning Machines 101 community, simply go to the website: www.learningmachines101.com and fill out the “subscription form”. Make sure to fill in the entry labeled “Briefly describe your interests”. If you are already a member, then you are done! If you are joining for the first time, then you should check your email for a “confirmation email” and follow the directions in the “confirmation email” to complete the subscription process.
Also note that if you are already a member, you also have the option of updating your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
Also feel free to follow me on Twitter! My twitter handle is: “lm101talk”.
You are also encouraged to connect with me on LinkedIn, where we are starting a new group called “Statistical Machine Learning” which you are welcome to join!
I have also been constructing some boards on Pinterest.
From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group! So please make sure to visit the website: www.learningmachines101.com and update your profile!
So thanks for your participation in today’s show! I greatly appreciate your support and interest in the show!!
Further Reading:
Neural Information Processing System 2015 Conference Schedule and Proceedings
Proceedings of ALL Neural Information Processing System Conferences.
2015 Neural Information Processing Systems Conference Book.
2015 Neural Information Processing Systems Workshop Book.
Related Episodes of Learning Machines 101
https://www.learningmachines101.com/lm101-024-how-to-use-genetic-algorithms-to-breed-learning-machines-stochastic-model-search-and-selection/
https://www.learningmachines101.com/lm101-022-learn-to-solve-constraint-satisfaction-problems/
Probabilistic Programming
https://en.wikipedia.org/wiki/WinBUGS
https://en.wikipedia.org/wiki/Metropolis-Hastings
http://www.stat.columbia.edu/~gelman/research/unpublished/stan-resubmit-JSS1293.pdf
https://en.wikipedia.org/wiki/Probabilistic_programming_language
http://probabilistic-programming.org/wiki/Home
Finding Dark Matter using Monte Carlo Markov Chain Technology
The role of the ILC in the study of Cosmic Dark Matter.
Dark Matters (Journal of Physics, Conference Series, 2010).
Revisiting Generalized Chaplygin Gas as a Unified Dark Matter and Dark Energy Model.
Neural Information Processing Systems 2015 Slides from Tutorials
https://media.nips.cc/Conferences/2015/tutorialslides/nips-2015-monte-carlo-tutorial.pdf
Supplemental References for Monte Carlo Markov Chain
https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo
Charles J. Geyer (1992). Practical Markov Chain Monte Carlo. Statistical Science.
Equation of State Calculations by Fast Computing Machines by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953).