LM101-056: How to Build Generative Latent Probabilistic Topic Models for Search Engine and Recommender System Applications
Episode Summary:
In this episode we discuss Latent Semantic Indexing type machine learning algorithms which have a probabilistic interpretation. We explain why such a probabilistic interpretation is important and discuss how such algorithms can be used in the design of document retrieval systems, search engines, and recommender systems.
Show Notes:
Hello everyone! Welcome to the fifty-sixth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. In this episode we discuss Latent Semantic Analysis type machine learning algorithms which have a probabilistic interpretation. We explain why such a probabilistic interpretation is important and discuss how such algorithms can be used in the design of document retrieval systems, search engines, and recommender systems.
In a document retrieval system or search engine, a collection of terms is provided to the system and the goal of the system is to retrieve the documents which are most similar to that collection of terms. So, when you are searching for a particular item of information, you often type a group of search terms into your Google or Bing search engine, and the search engine then returns a collection of web pages which are rank-ordered according to their similarity to the original set of search terms. In this example, a “term” corresponds to a word or a word phrase, while a document corresponds to a particular web page or URL address.
A common way of thinking about these ideas involves introducing the concept of a “document space”. Suppose we have 60,000 documents comprised of 30,000 terms. Then we can think of a particular document about “dogs” as a list of 30,000 numbers where each number indicates the degree to which a particular term occurs in that document. That is, the first number in the list specifies the degree to which the first term in the 30,000 term vocabulary appears in that document, the second number in the list specifies the degree to which the second term in the 30,000 term vocabulary appears in that document, and so on. This 30,000 dimensional vector, or equivalently list of 30,000 numbers, representing the target document is plotted as a single point in a 30,000 dimensional hyperspace. In a similar manner, other documents can be plotted as points in this 30,000 dimensional hyperspace. If two documents are “close together” in this space according to some distance measure, then those documents are defined as “similar” with respect to the 30,000 dimensional space and the distance measure. A retrieval cue typed into a search engine window can also be considered to be a point in “document space”, and we basically then retrieve the documents that are close to that retrieval cue.
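To make this geometry a little more concrete, here is a minimal Python sketch of documents as points in a (tiny) document space. The three-term vocabulary, the counts, and the choice of cosine similarity as the distance measure are illustrative assumptions added for these show notes, not details from the episode itself.

```python
import numpy as np

# Toy "document space": each document is a vector of term counts.
# A real system would use a vocabulary of roughly 30,000 terms; three
# made-up terms keep the sketch readable. Order of entries: [dog, leash, politics]
doc_about_dogs   = np.array([10.0, 4.0, 0.0])
doc_about_pets   = np.array([7.0, 2.0, 1.0])
doc_about_voting = np.array([0.0, 0.0, 9.0])

def cosine_similarity(a, b):
    """One common distance-like measure: 1 means the documents point in the
    same direction in document space, 0 means they share no terms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

retrieval_cue = np.array([1.0, 1.0, 0.0])   # search terms: "dog leash"
for name, doc in [("dogs", doc_about_dogs), ("pets", doc_about_pets), ("voting", doc_about_voting)]:
    print(name, round(cosine_similarity(retrieval_cue, doc), 3))
```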
Some recommender systems are based upon similar principles. In particular, one might provide a recommender system with a collection of terms where each term corresponds to an item previously purchased by an individual (e.g., a term might correspond to a “book” or a “song”), and then the goal of the recommender system is to generate a set of “books” or “songs” which are most similar to the items which have been previously purchased by the individual. Thus, the 60,000 documents in the document retrieval example correspond to 60,000 different shopping carts of consumer purchases.
Note that in the recommender system example, we need a method for computing similarity between “terms”, while in the search engine example we need a method for computing similarity between “documents”. This leads to the concept of a “term space” where a particular “term” such as a “book” or “song” is represented by a list of 60,000 numbers indicating which shopping carts included the purchase of that book or song and which shopping carts did not include that purchase.
The standard Latent Semantic Indexing or LSI machine learning algorithm works in the following manner. First, in a document retrieval application, the algorithm is “trained” with a collection of documents. For a recommender application, each document is called a “market basket” which corresponds to a collection of items purchased together by a consumer, but the idea is essentially the same.
Suppose that we have a vocabulary of 30,000 words for document retrieval; if we are instead building a recommender system, then the vocabulary might be 30,000 possible items that could be purchased by the consumer. Also assume that we have 60,000 documents which are constructed from the 30,000 words. Each document is modeled as an unordered set of words in a document retrieval application. For a recommender system application, this type of assumption seems even more plausible: the ordering of items within the market basket is irrelevant.
We now construct a large matrix array with 30,000 rows and 60,000 columns. Each column of the matrix array corresponds to a particular document or market basket. Each row of the matrix array corresponds to a particular term or purchased-item. A number such as 10 which appears in row J and column K of the term by document matrix indicates that the Jth term in the vocabulary occurred 10 times in the Kth document.
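Here is a small Python sketch of how such a term by document matrix might be constructed from a toy corpus; the vocabulary and documents are made up, and the dimensions are shrunk from 30,000 by 60,000 for readability.

```python
import numpy as np

# A minimal sketch of constructing a term-by-document matrix from a toy corpus.
vocabulary = ["dog", "leash", "cat", "politics"]
term_index = {term: j for j, term in enumerate(vocabulary)}
documents = [
    ["dog", "leash", "dog"],          # document (or market basket) 1
    ["cat", "dog"],                   # document 2
    ["politics", "politics", "cat"],  # document 3
]

# Entry (J, K) counts how many times the Jth term occurs in the Kth document.
X = np.zeros((len(vocabulary), len(documents)))
for k, doc in enumerate(documents):
    for word in doc:
        X[term_index[word], k] += 1

print(X)
```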
Using this setup, a search engine query consisting of a collection of key words can be interpreted as a document, or equivalently as a column of the term by document matrix, and the similarity of that column to the other columns in the matrix can be computed. The columns in the term by document matrix which are most similar to the document corresponding to the search engine query are retrieved.
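As a rough illustration of the retrieval step just described, the following sketch treats the query as a term-count vector and ranks the columns of a toy term by document matrix by cosine similarity. The numbers and the use of cosine similarity are assumptions made for the example.

```python
import numpy as np

# Toy term-by-document matrix matching the corpus sketched above.
X = np.array([
    [2.0, 1.0, 0.0],   # "dog"
    [1.0, 0.0, 0.0],   # "leash"
    [0.0, 1.0, 1.0],   # "cat"
    [0.0, 0.0, 2.0],   # "politics"
])

def retrieve(X, query_counts, top_k=2):
    """Return the indices of the top_k columns of X ranked by cosine
    similarity to the query term-count vector."""
    col_norms = np.linalg.norm(X, axis=0) + 1e-12
    q_norm = np.linalg.norm(query_counts) + 1e-12
    sims = (query_counts @ X) / (q_norm * col_norms)
    return list(np.argsort(-sims)[:top_k]), sims

query = np.array([1.0, 1.0, 0.0, 0.0])   # search engine query: "dog leash"
print(retrieve(X, query))
```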
Unfortunately, in the real world, one often encounters two documents which are very semantically related but do not have many key words in common. In addition, one often encounters two documents which are not semantically related but have many words in common.
Similarly, a recommender system might be designed to suggest an item for a consumer to purchase based upon an item previously purchased by the consumer. Again, each column of the term by document matrix corresponds to a document or “market basket”. Thus, the term by document matrix is constructed by using a large collection of “market baskets” which have been purchased by consumers. A number in the Jth row and Kth column of the term by document matrix now represents the number of times item J was purchased in the Kth market basket. But, just as in the search engine query case, it is possible that a consumer who purchases item X would also purchase item Y if they knew item Y was being sold, yet a recommender system based upon the above principles might not make this recommendation if the training data for this system did not include market baskets where items X and Y were purchased together.
One way these problems can be addressed is through the use of Latent Semantic Indexing. The basic idea of LSI is that the term by document matrix is “dissected” so that it is a weighted sum of many term by document matrices. The term by document matrices in this weighted sum which have large weights are assumed to represent the critical structural relationships, while the term by document matrices whose weights are very small numbers are assumed to represent random (non-structural) relationships between terms and documents. LSI works based upon the following two-step procedure. First, dissect the original term by document matrix using a matrix analysis technique called Singular Value Decomposition or SVD. Second, reconstruct an approximation to the original term by document matrix using only the term by document matrices which have large weights. The resulting approximate term by document matrix tends to reveal the “latent structure” in the data because associations between pairs of items which never co-occurred in a document or market basket tend to be visible in the reconstructed term by document matrix. In other words, two documents which might have had no terms in common might be identified as similar when the approximation of the original term by document matrix is used rather than the original term by document matrix.
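The following sketch illustrates this two-step dissect-and-reconstruct idea using the Singular Value Decomposition from NumPy; the random count matrix and the choice of five retained components are stand-ins for real data and a real choice of rank.

```python
import numpy as np

# Step 1: dissect the term-by-document matrix with an SVD.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(30, 60)).astype(float)   # 30 terms, 60 documents (stand-in data)

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U @ diag(s) @ Vt

# Step 2: reconstruct an approximation using only the components with large weights.
k = 5                                               # keep only the 5 largest singular values
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# X_approx can contain non-zero associations between term/document pairs that
# never co-occurred in the original matrix, revealing "latent structure".
print(np.round(np.linalg.norm(X - X_approx) / np.linalg.norm(X), 3))  # relative reconstruction error
```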
These ideas lead to the “topic space”, which is basically a collection of “latent topics” identified by the Latent Semantic Indexing method. To fix ideas, suppose that we decide to focus upon just 50 latent topics. In one simple variation, one specifies the number of latent topics and then the LSI method maps each document, represented as a list of 30,000 numbers corresponding to a point in document space, into a list of 50 numbers corresponding to a point in the “latent topic space”. That is, LSI provides a method for relocating the points in the 30,000 dimensional space corresponding to particular documents into a lower-dimensional, 50-dimensional latent topic space in such a manner that the distance relationships between documents in the original 30,000 dimensional space are preserved as much as possible when those documents are represented as points in the new 50-dimensional latent topic space. This method for projecting points in document space into the lower-dimensional topic space can also be used to project points in the 60,000 dimensional term space into a 50-dimensional topic space as well. The goal of the projection process is to exaggerate important distance relationships between documents or terms while minimizing unimportant distance relationships.
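One common way to carry out this projection, sketched below, is to “fold” a document’s term-count vector into the latent topic space using the truncated SVD factors; the data and dimensions are again illustrative.

```python
import numpy as np

# A common "folding-in" projection used with LSI: map a term-count vector
# (a point in document space) into the k-dimensional latent topic space.
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(30, 60)).astype(float)   # training term-by-document matrix (stand-in)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                               # number of latent topics

def to_topic_space(term_counts):
    """Project a term-count vector into the k-dimensional latent topic space."""
    return (U[:, :k].T @ term_counts) / s[:k]

new_doc = rng.poisson(1.0, size=30).astype(float)
print(to_topic_space(new_doc))    # a point in the 5-dimensional topic space
```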
Still, LSI methods of this type are often criticized because they appear to lack a legitimate probabilistic interpretation. What do we mean by a probabilistic interpretation? And why are such probabilistic interpretations desirable?
In Episode 8 and Episode 10 of Learning Machines 101, we discuss issues of probability as a theory of belief and principles of Maximum Likelihood Estimation and MAP estimation in greater depth. Briefly, a machine learning algorithm with a probabilistic interpretation means that we can interpret that machine learning algorithm as having a collection of expectations regarding the likelihood of specific events in its environment. This collection of expectations is formally called the learning machine’s “probability model”. Whether the learning machine behaves in a deterministic manner or a probabilistic manner is irrelevant. The concept of a probabilistic interpretation, or equivalently a “probability model for the learning machine”, means that the inference and learning processes of the learning machine, regardless of whether they are deterministic or non-deterministic, are designed so that the learning machine can acquire some probabilistic representation of its environment during the learning process and then use that probabilistic representation of the environment when inferences are required.
When a machine learning algorithm does not have a probabilistic interpretation, it is difficult to characterize in a precise manner the conditions under which the machine learning algorithm will make correct inductive inferences. It is also difficult to empirically test whether a given environment is compatible with the collection of expectations of the learning machine, since the learning machine’s expectations are never formally defined. For example, suppose we have an unsupervised machine learning algorithm which exhibits reasonably good generalization performance on test data. On the other hand, suppose we found out that our unsupervised machine learning algorithm was derived to find the “most likely” parameter values under a particular set of probabilistic modeling assumptions. Furthermore, suppose that these assumptions are shown to be empirically wrong. Then we know that the unsupervised machine learning algorithm might be doing the best job that it can under those assumptions, but it is never going to optimally represent its statistical environment.
In some cases, one can use goodness-of-fit types of statistical analyses to empirically check whether a probability model is capable of adequately representing its statistical environment. However, in other cases, we can obtain insights into the adequacy of a machine learning algorithm’s probabilistic representation of its environment by carefully understanding its probabilistic modeling assumptions. For example, some lower-division statistics textbooks will discuss statistical analyses where it is assumed that a person’s height follows a bell-shaped curve or Gaussian distribution, but such an assumption about heights is clearly wrong because it implies that some people must have negative heights and some people are over 1000 feet tall!
The key point here is that even if we have a machine learning algorithm which exhibits great performance upon both training and test data, there is still a tremendous advantage if this algorithm has a probabilistic interpretation. Such an interpretation allows one to obtain insights into the classes of environments in which the machine learning algorithm will exhibit good performance before deploying the algorithm in such environments. In addition, even if a machine learning algorithm works well within a particular environment, identifying problematic modeling assumptions can suggest how to improve the machine learning algorithm.
Even though LSI is a deterministic inference and learning algorithm and is often not viewed as having probabilistic modeling assumptions, it is straightforward to construct various probabilistic interpretations of classical LSI. Here is a probabilistic generative model for LSI.
Step 1: For each document, generate a k-dimensional random vector whose k elements are independent and identically distributed with zero mean and common variance using a Laplace-density probabilistic law, which makes values closer to zero more likely than values further from zero. This random vector has the semantic interpretation of a “topic” vector, which is a point in a k-dimensional topic space.
Step 2: Project a point in the k-dimensional topic space into a d-dimensional document space where each point in the d-dimensional document space corresponds to a particular document with d terms. This projection operator is learned by the learning machine. Different projection operators will give rise to different distributions of documents.
Step 3: Next, add some Gaussian random noise to the point that was created in document space. That is, we take some random steps in some random directions away from the point in document space generated in Step 2. This new point in document space is presumed to be a typical document in the learning machine’s environment.
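Putting the three steps together, here is a minimal Python sketch of this generative story. The dimensions, the projection operator, and the noise level are made-up placeholders; in a fitted model these would be learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 5, 30                        # k latent topics, d terms in the vocabulary (illustrative)
W = rng.normal(size=(d, k))         # projection operator from topic space to document space
noise_sd = 0.5                      # spread of the Gaussian noise in Step 3

def generate_document():
    topic = rng.laplace(loc=0.0, scale=1.0, size=k)     # Step 1: Laplace-distributed topic vector
    point = W @ topic                                   # Step 2: project into document space
    return point + rng.normal(scale=noise_sd, size=d)   # Step 3: add Gaussian noise

docs = np.stack([generate_document() for _ in range(3)])
print(docs.shape)   # three synthetic "documents" as points in the 30-dimensional document space
```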
So the big point here is that we have a generative probabilistic model which generates documents, represented as points in document space, according to some probability model. The goal of the learning process is to select the parameters of the probability model, which in this case are mainly determined by the projection operators, so that frequently observed documents in the environment are also documents which are generated with high probability from our probability model. As discussed in Episode 55, this type of methodology for estimating the parameters of our probability model is called maximum likelihood estimation.
In fact, using this approach to generating documents, assumptions of this type can be used to provide a probabilistic interpretation of the classical Latent Semantic Indexing method discussed in Episode 54 as either a Maximum Likelihood or a MAP estimation algorithm as discussed in Episode 55. The paper by Wang et al. (2013) published in the ACM Transactions on Information Systems discusses this idea in a little more detail.
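To connect the generative story to parameter estimation, the sketch below writes down the MAP-style objective implied by these assumptions: the Gaussian noise contributes a squared reconstruction error and the Laplace prior contributes an L1 penalty on the topic vectors. This is only an illustration of the interpretation, not the exact Regularized LSI algorithm described by Wang et al.

```python
import numpy as np

def negative_log_posterior(X, W, T, noise_sd=0.5, laplace_scale=1.0):
    """MAP-style objective implied by the generative story above (up to additive
    constants): Gaussian noise -> squared reconstruction error ||X - W T||^2,
    Laplace prior on topic vectors -> L1 penalty. A sketch of the interpretation,
    not the exact Regularized LSI algorithm of Wang et al."""
    reconstruction = np.sum((X - W @ T) ** 2) / (2.0 * noise_sd ** 2)
    prior_penalty = np.sum(np.abs(T)) / laplace_scale
    return float(reconstruction + prior_penalty)

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(30, 60)).astype(float)   # term-by-document data (stand-in)
W = rng.normal(size=(30, 5))                        # projection operator (parameters to learn)
T = rng.normal(size=(5, 60))                        # one topic vector per document
print(negative_log_posterior(X, W, T))
```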
A popular alternative to classical LSI is based upon a different type of probability model of document generation which is called Latent Dirichlet Allocation (LDA) which was introduced by Blei, Ng, and Jordan (2003). Here is the generative probabilistic model for LDA!
Step 1: For each document, pick some probability distribution of topics at random. That is, we assume that a particular document in the entire collection of documents will have some distribution of topics, and we assume that the distribution of topics will randomly vary from one document to another. So, for example, one document might have the topic distribution dogs 40%, cats 40%, and politics 20%, while another document might have the topic distribution dogs 80%, cats 20%, and politics 0%. Each topic distribution is possible with a different probability, and so we pick a topic distribution at random. Note that the learning machine thus learns the probabilities of different topic distributions.
Step 2: Given that we have already chosen a topic distribution in Step 1, for each location in the document corresponding to a word, we pick a topic from our topic distribution at random. So, for example, suppose we had picked the topic distribution dogs 80% and cats 20% in Step 1; then we would have an 80% chance of choosing the topic “dog” for that location. On the other hand, if we had instead picked the topic distribution dogs 40%, cats 40%, and politics 20% in Step 1, then the chance of choosing the topic “dog” would only be 40%.
Step 3: Given the topic we picked at random in Step 2, we now pick a particular word given that topic. So, for example, if the topic is “dog” we might generate the word “lassie” with some probability, or we might generate the word “leash” with some probability. We might also generate the word “computer” given the topic is “dog”, but the probability that we would generate the word “computer” given the topic “dog” would be very small. The learning machine learns this collection of conditional probabilities as well, that is, the probability that a particular topic generates a particular word.
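Here is a short Python sketch of this three-step LDA generative story. The topics, vocabulary, word-given-topic probabilities, and Dirichlet parameter are all toy values chosen for illustration; in practice the learning machine estimates the word-given-topic table (and the Dirichlet parameters) from data.

```python
import numpy as np

rng = np.random.default_rng(0)

topics = ["dog", "cats", "politics"]
vocabulary = ["lassie", "leash", "whiskers", "ballot"]
word_given_topic = np.array([
    [0.50, 0.40, 0.05, 0.05],   # P(word | topic = dog)
    [0.05, 0.30, 0.60, 0.05],   # P(word | topic = cats)
    [0.02, 0.03, 0.05, 0.90],   # P(word | topic = politics)
])

def generate_document(num_words=6, alpha=(1.0, 1.0, 1.0)):
    topic_dist = rng.dirichlet(alpha)          # Step 1: pick a topic distribution at random
    words = []
    for _ in range(num_words):
        z = rng.choice(len(topics), p=topic_dist)                # Step 2: pick a topic
        w = rng.choice(len(vocabulary), p=word_given_topic[z])   # Step 3: pick a word given that topic
        words.append(vocabulary[w])
    return topic_dist, words

print(generate_document())
```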
The Latent Dirichlet Allocation (LDA) model has been the most widely used topic model for many years. Intuitively, it seems more attractive since, rather than assuming that a topic vector representation has a bell-shaped Gaussian-type probability distribution, LDA assumes that there is a certain probability that a particular topic is represented in the document. The probabilistic modeling assumptions are more interpretable and more intuitively satisfying. Empirically, the LDA method generally outperforms the classical LSI topic model method.
However, if we look closely at the probability distribution over topic probability distributions, which is called a Dirichlet probability distribution, inspection of its formulas shows that certain implicit assumptions make it very difficult to generate a probability distribution of topics in which the topics are correlated in specific ways. This has led to a more recent development which is called the Correlated Topic Model.
Briefly, a Correlated Topic Model is based upon the following generative probabilistic model.
Step 1: For each document, pick a random topic vector from a bell-shaped or Gaussian curve in multidimensional space. Intuitively, we can think of an “egg-like shape” or “hyperellipsoid” located in some high-dimensional space, and we are going to randomly choose points in the vicinity of the surface of the hyperellipsoid. So these topic vectors are numerical vectors whose values do not sum to one and which can have positive and negative elements. The first element of the chosen topic vector indicates the amount of evidence supporting the first “latent” or “hidden” topic, the second element indicates the amount of evidence supporting the second “latent” or “hidden” topic, and so on.
Step 2: Turn the Gaussian random topic vector into a probability topic vector. This is done by exponentiating each element of the random topic vector which was chosen. So, for example, if the first element was a 2, then you would compute e squared, where e is approximately 2.718. Then you divide the first exponentiated element by the sum of all exponentiated elements in the vector, then divide the second exponentiated element in the vector by the sum of all exponentiated elements in the vector, and so on. The exponentiation operation forces all elements in the transformed topic vector to be positive, and dividing by the sum of all exponentiated elements forces the elements of the transformed topic vector to add up to one. This type of transformation is closely related to softmax neural network transformations and multinomial logit regression modeling transformations.
Step 3: Then we proceed in a manner similar to LDA: we sample from this transformed probability topic vector, which was generated by sampling from a Gaussian distribution. Specifically, the generated probability topic vector specifies a probability for each topic, so we can essentially roll a die whose sides are weighted by the probability topic vector, and that tells us the topic associated with a particular word in the current document. Then we roll another die to pick the word associated with that topic. The probability of a word given a topic is estimated in a manner similar to LDA.
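The sketch below strings the three Correlated Topic Model steps together: a Gaussian topic vector, the softmax-style transformation, and the two weighted “dice rolls”. All parameter values, including the correlation between the first two topics and the word-given-topic table, are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    """Step 2: exponentiate each element and divide by the sum of the
    exponentiated elements, yielding a probability topic vector."""
    e = np.exp(v - np.max(v))    # subtracting the max is a standard numerical safeguard
    return e / e.sum()

# Illustrative parameters; in a fitted Correlated Topic Model the mean,
# covariance, and word-given-topic table would all be learned from data.
topic_mean = np.zeros(3)
topic_cov = np.array([[1.0, 0.8, 0.0],   # topics 1 and 2 are positively correlated
                      [0.8, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
vocabulary = ["lassie", "leash", "whiskers", "ballot"]
word_given_topic = np.array([
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.30, 0.60, 0.05],
    [0.02, 0.03, 0.05, 0.90],
])

eta = rng.multivariate_normal(topic_mean, topic_cov)   # Step 1: Gaussian topic vector
topic_dist = softmax(eta)                              # Step 2: transform into probabilities
document = []
for _ in range(6):                                     # Step 3: roll the weighted dice
    z = rng.choice(3, p=topic_dist)
    document.append(vocabulary[rng.choice(len(vocabulary), p=word_given_topic[z])])
print(np.round(topic_dist, 2), document)
```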
So today we have discussed three different probabilistic models of topic vector generation: a probabilistic classical LSI model, an LDA model, and a correlated topics model. All three of these models have free parameters which can be estimated using the method of maximum likelihood estimation. Once the parameters of the model are estimated, then we can use these models to map documents into a latent topic space to facilitate tasks such as document retrieval and handling search engine queries as well as making recommendations in recommender systems.
An advantage of the probabilistic interpretation is that one can choose a probability model based upon “face validity”, which basically means that the probabilistic modeling assumptions seem intuitively reasonable. For example, the probabilistic interpretation of classical LSI provided here is reasonable, but it makes the assumption that points in document space and term space have a bell-shaped or Gaussian distribution, which may be questionable. The LDA model is based upon the intuitive idea of a topic vector which specifies the probability that a particular topic is mentioned, which has been argued to be a more intuitively plausible model of documents. The Correlated Topic Model is based upon a similar idea as LDA but makes a more plausible assumption regarding how to parameterize the probability distribution over topic probability distributions. This argument is discussed in more detail by Blei and Lafferty in their 2005 reference provided in the show notes of this episode.
In addition to INTUITIVELY assessing the appropriateness of a model using face validity, one can also EMPIRICALLY assess whether or not the probability model can adequately represent the data generating process by examining empirical distributions of document vectors and term vectors and comparing them to the probability model. Methods such as cross-validation can be used to take into account sampling error. I’ve also provided a reference to one of my recently published papers which describes a new advanced state-of-the-art method for detecting problems with one’s probabilistic representation of reality using Generalized Information Matrix Tests. So check out that reference which is a book chapter that was published in 2013! I also have provided a reference to an earlier paper concerned with testing hypotheses regarding which of two competing models best-fits a given data generating process in the possible presence of model misspecification. These references as well as references for all three of the probabilistic topic models discussed in today’s podcast can be found in the show notes at: www.learningmachines101.com
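As one very simplified illustration of this kind of empirical assessment, the sketch below compares two candidate probability models by their average held-out log-likelihood under 5-fold cross-validation. A smoothed multinomial bag-of-words model stands in for the per-document likelihoods a fitted topic model would supply, and the data are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus of word-count vectors (rows = documents, columns = terms).
X = rng.poisson(1.0, size=(200, 50))

def heldout_log_likelihood(train_counts, test_counts, smoothing):
    """Log-likelihood of held-out documents under a smoothed multinomial
    (bag-of-words) model fitted to the training split."""
    p = train_counts.sum(axis=0) + smoothing
    p = p / p.sum()
    return float((test_counts * np.log(p)).sum())

# 5-fold cross-validation: average held-out log-likelihood for two candidate models.
folds = np.array_split(rng.permutation(len(X)), 5)
for smoothing in (0.1, 1.0):
    scores = []
    for i in range(5):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
        scores.append(heldout_log_likelihood(X[train_idx], X[test_idx], smoothing))
    print("smoothing =", smoothing, "mean held-out log-likelihood =", round(np.mean(scores), 1))
```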
The assessment of the appropriateness of a probability model should be based upon BOTH INTUITIVE and EMPIRICAL assessments. There are no simple answers. But without a probabilistic representation of your data generating process, a principled method for understanding when your algorithm will exhibit appropriate generalization performance in novel statistical environments will be challenging to develop.
Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast.
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on Linked In.
From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on Linked In and do my best to talk about topics of interest to the members of this group!
And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”!
And finally, I noticed I have been getting some nice reviews on ITUNES. Thank you so much! Your feedback and encouragement are greatly valued!
Keywords: Latent Semantic Analysis, LSA, Latent Semantic Indexing, LSI, Regularized LSI, Latent Dirichlet Allocation, LDA, Correlated Topic Models, Model Misspecification, Information Matrix Tests, Model Selection.
Further Reading:
Wang, Q., Xu, J., Li, H., and Craswell, N. (2013). Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling. ACM Transactions on Information Systems, Vol. 31, No. 5. http://dl.acm.org/citation.cfm?doid=2414782.2414787
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
Blei, D. M., and Lafferty, J. D. (2005). Correlated Topic Models. Advances in Neural Information Processing Systems.
https://papers.nips.cc/paper/2906-correlated-topic-models.pdf
Golden, R. M. (2003). Discrepancy Risk Model Selection Test Theory for Comparing Possibly Misspecified or Nonnested Models. Psychometrika.
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=D09B379CB0B1A03B24B27F0F899A02D2?doi=10.1.1.33.6943&rep=rep1&type=pdf
Golden, R. M., Henley, S. S., White, H., and Kashner, T. M. (2013). New Directions in Information Matrix Testing: Eigenspectrum Tests. In Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis, pp. 145-177. http://link.springer.com/chapter/10.1007/978-1-4614-1653-1_6
Copyright © 2016 by Richard M. Golden. All rights reserved.
Dear Dr Golden, thanks a lot for your great podcasts. I’ve been a follower for some time and I really enjoy listening.
I have a question regarding episode 56. During this episode you mentioned that LDA tends to perform better than other topic modeling algorithms such as LSI. You also mentioned “empirical and intuitive assessment of the appropriateness of a probability model” towards the end of the episode. I wonder whether you could give more elements / research paths towards objective evaluation of such models.
Best,
Julien
Thanks for your interest! Empirical objective approaches for assessment are complicated because of the large numbers of free parameters. If one could find a solution, however, which is locally identifiable (i.e., locally unique) then one might be able to apply model selection tests or model specification analyses for the purpose of assessing the locally unique solution using classical asymptotic finite-dimensional statistical theory. However, things that are working against us are that the number of parameters tends to grow with the number of documents (but this can be addressed by fixing the number of terms initially using a model development sample) and that many solutions of this type are not locally unique (but this might be addressed through regularization).
I’m thinking of the new information matrix test methods for model misspecification detection (“New Directions in Information Matrix Testing: Eigenspectrum Tests”, http://www.amazon.com/Advances-Directions-Causality-Prediction-Specification/dp/1461416523) and more classical model selection tests such as the Generalized Akaike Information Criterion (also known as the Takeuchi Information Criterion). Future podcast episodes (more distant future) will discuss these methods in depth and cover the relevant theoretical assumptions and their applicability and evaluation. The detailed comparison of LDA and LSI is something we might discuss further in a future episode of Learning Machines 101 as well. Thanks for your question; it really deserves a much longer answer than these few quick comments…