Podcast: Play in new window | Download | Embed
LM101-041: What happened at the 2015 Neural Information Processing Systems Deep Learning Tutorial?
Episode Summary:
This is the first of a short subsequence of podcasts which provides a summary of events at the recent 2015 Neural Information Processing Systems Conference. This is one of the top conferences in the field of Machine Learning. This episode introduces the Neural Information Processing Systems Conference and reviews the content of the Morning Deep Learning Tutorial which took place on the first day of the conference.
Show Notes:
Hello everyone! Welcome to the forty-first podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. This is the first of a short subsequence of podcasts in the Learning Machines 101 series which is designed to give a brief overview of my personal experience and perspectives on my visit to the 2015 Neural Information Processing Systems Conference in Montreal. For me, this was a very exciting and intellectually stimulating conference! This episode provides a general introduction to the Neural Information Processing Systems Conference and also reviews the Morning Deep Learning Tutorial which took place on the first day of the conference.
This year the conference was held December 7 through December 12 at the Palais des Congrès de Montréal Convention and Exhibition Center in Montreal, Canada. The conference is divided into three sections. The first day consists of tutorials, the next three days are the main conference, and the main conference is followed by one day of symposia and two days of workshops.
The President of the Neural Information Processing Systems Foundation, Professor Terry Sejnowski, provided the opening address to the conference. Professor Sejnowski is the Francis Crick Professor and Director of the Crick-Jacobs Center for Theoretical and Computational Biology at the Salk Institute, in addition to his primary appointment as a Professor of Biological Sciences. According to his Wikipedia entry, he is one of only 12 living scientists who have been elected to the National Academy of Sciences, the National Academy of Engineering, and the National Institute of Medicine. Professor Sejnowski’s work has made fundamental contributions to many fields but has primarily focused upon theoretical neuroscience, computational neuroscience, cognitive neuroscience, cognitive science, and machine learning algorithms.
In his opening address, Professor Sejnowski noted that this year’s conference had a record number of participants since its beginning in 1987. Actually, as an Andrew Mellon post-doctoral fellow, I attended the very first Neural Information Processing Systems conference in 1987, where I presented my paper titled “Probabilistic Characterization of Neural Model Computations”, which argued for a probabilistic and optimization-oriented framework for interpreting the goals of classification and learning in artificial neural network models. Although the idea of interpreting machine learning algorithms within a statistical and optimization-oriented framework is widely accepted today, at the time the idea was more controversial. I can’t recall exactly, but I believe the attendance at the 1987 conference could not have been more than about 200 participants. Today, nearly 30 years later in 2015, approximately 2500 participants attended the tutorials, approximately 3262 participants attended the conference, and approximately 3000 participants attended the workshops.
Topics at the 2015 Neural Information Processing Systems conference included convex optimization machine learning algorithms, non-convex optimization machine learning algorithms, reinforcement learning algorithms, Monte Carlo inference and learning machine learning algorithms, hardware implementations of machine learning algorithms, and the neuroscience of machine learning algorithms. Despite great advances in theory, algorithms, technology, and applications over the past thirty years, many of the types of algorithms discussed at the 2015 conference are very similar to the types of algorithms discussed at the 1987 conference. In addition, like the 1987 conference, the 2015 conference is highly multidisciplinary attracting leading scientific and engineering researchers from a wide variety of fields such as computer science, engineering, mathematics, neuroscience, cognitive science, cognitive psychology, and mathematical psychology. However, there was a greater emphasis in 1987 on understanding the computational basis of inference and learning in biological systems. Hopefully, with the renewed interest in deep learning and the latest advances in neuroscience, there will be at least a partial return towards the original origins of the Neural Information Processing Systems conference whose original seeds were planted equally in both the fields of Machine Learning and Biology.
A number of industries are clearly interested in the topics covered by this conference. The 2015 Neural Information Processing Systems conference is supported by companies such as: Google, Microsoft, Amazon, Apple, Twitter, Facebook, IBM Research, the scientific journal Artificial Intelligence, Disney Research, Ebay, Adobe, Panasonic, Sony, Toyota, The Alan Turing Institute, Yahoo, Netflix as well as a number of investment management and trading companies such as Bloomberg, Cubist, DE Shaw, PDT Partners, Vatic, and Winton.
Furthermore, scientists and engineers in many fields consider acceptance of a paper at this conference to be a major accomplishment! Out of the 1826 papers submitted to this conference, only 403 were accepted. That is an acceptance rate of only 22% for the main conference! The reviewing process is completely anonymous, so even if you are a well-known researcher in the field of machine learning, your paper has a good chance of being rejected. This has the unfortunate effect that many good papers are not accepted, but from the perspective of conference attendees it dramatically helps to ensure a high standard for all papers which are accepted!
I will now provide a brief review of the conference based upon the notes that I took during the conference. This review will be flavored with my own opinions and comments. I will do my best to distinguish my thoughts from the ideas presented when my opinions differ from or add to what was presented, and I will also try not to distort the presentations. However, please keep in mind that my notes were simply written into my Kindle, and I am not referencing a video or audio recording in writing this summary. I encourage you to visit the official Neural Information Processing Systems Conference website, which is listed in the show notes at www.learningmachines101.com, because some of the presentations, video recordings, posters, symposia, and tutorials may be posted online.
In today’s podcast, we will focus on the two-hour morning tutorial session. The morning session, from 9:30am to 11:30am, had two tutorials in parallel. The first was titled “Large Scale Distributed Systems for Training Neural Networks” and was presented by Jeff Dean and Oriol Vinyals from Google, who described new software and hardware technology specially developed to conduct research on large-scale machine learning problems. However, I didn’t attend that tutorial even though it sounded quite fascinating. Instead, I attended the second morning tutorial, titled “Deep Learning”, which was led by Yoshua Bengio and Yann LeCun, who have been working in the area of deep learning for several decades and have played crucial roles in creating and shaping the field.
The first speaker began the tutorial session by reviewing the essential key idea of deep learning: the solution to many complex problems in artificial intelligence and machine learning often lies in the development of an appropriate representation of the problem. Classical artificial intelligence and machine learning approaches emphasize the importance of a domain knowledge expert or machine learning engineer who actively participates in the design of important problem features and relationships which serve as inputs to the machine learning algorithm. The presence of the correct abstractions of reality as inputs to the learning machine can transform an impossible problem into an easy one. The key idea of deep learning is to dramatically reduce the role of the domain knowledge expert or machine learning engineer by having the learning machine discover, on its own, an abstract representation of the problem which transforms an impossibly hard problem in artificial intelligence into a computationally simple one. But that is not all: deep learning is based upon the idea that it is not sufficient to learn a single abstract representation of a problem. Deep learning is characterized by learning multiple levels of representation and abstraction, thus bypassing the so-called “curse of dimensionality”. The curse of dimensionality was often discussed in introductory machine learning courses in the 1980s. The basic idea was a paradox: as you added more and more features to a learning machine, at some point the machine’s performance dramatically decreased. Deep learning networks bypass the curse of dimensionality because even though the learning machine might consist of millions of free parameters, it is highly structured and specifically designed to acquire multiple abstract representations of a problem.
Each level of the network consists of a collection of “units” which can be intuitively visualized as corresponding to an extremely high-level abstraction of a neuron in the brain. Each unit operates by computing some function of its input and its parameters and then returning that value. Although the analogy with brain information processing systems is useful for pedagogical purposes, these systems do not typically make specific testable predictions that could be confirmed by a neuroscientist. On the other hand, they do provide a computational sufficiency analysis which might be relevant to neuroscientists, psychologists, and cognitive scientists. Specifically, the analysis might be stated as follows. Consider the large class of machines (both biological and non-biological) which share the characteristics of the machine learning algorithm I am proposing. Given this explicitly defined machine learning algorithm and this explicitly defined statistical environment, my computer experiments establish that there exist situations where machines of this type are capable of learning statistical environments of this type. This type of computational sufficiency analysis is very valuable for establishing sufficient (but not necessary) conditions which could be used to understand the basis of inference and learning in both biological and non-biological systems.
For example, the most popular deep learning network, which is called a feed-forward convolutional neural network, is inspired by a particular abstract, highly speculative, and untestable model of neural information processing in the visual system. The convolutional neural network works by processing an image of pixels and identifying statistical regularities at the level of small groups of pixels in the image. Another level of the network then examines these constructed statistical regularities and identifies more abstract statistical regularities such as corners and edges. The next level of the network examines these more abstract regularities and identifies parts of images such as a patch of carpet texture, a piece of furniture leg texture, or a patch of dog fur. The next level examines these statistical regularities to form even higher-level features, and so on, until the highest levels of the network might be able to identify objects such as chairs, faces, pillows, and people.
Such networks work, as noted in Episodes 23, 29, and 30 of Learning Machines 101, by having a particular network structure. In particular, a feature detector which looks at a small region of a transformed image basically scans the image to detect the presence or absence of that feature in different parts of the transformed image. This scanning process generates a feature map whose elements indicate where the feature was detected. Multiple feature maps are learned at each level of processing. Next, a max-pooling layer is typically used, which looks at small regions of the feature map and decides whether there is sufficient evidence that the feature is present in that region.
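The scanning-and-pooling scheme just described can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the code of any actual deep learning system, and the kernel here is a hypothetical hand-chosen edge detector (in a real convolutional network the kernels are learned).

```python
import numpy as np

def feature_map(image, kernel):
    """Scan a small feature detector (kernel) across the image.
    Each entry of the returned map measures how strongly the
    feature is present at that location."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Summarize each size-by-size region of the feature map by its
    maximum: was the feature detected anywhere in that region?"""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            out[r, c] = fmap[r * size:(r + 1) * size,
                             c * size:(c + 1) * size].max()
    return out

# Hypothetical vertical-edge detector applied to a toy 6x6 "image".
image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])
fmap = feature_map(image, edge_kernel)   # 5x5 feature map
pooled = max_pool(fmap)                  # 2x2 pooled map
```

In a real network many such feature maps are learned at each level, and the pooled maps become the "image" processed by the next level.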
Another important success story is the development of recurrent networks as discussed in Episode 36 of Learning Machines 101 which are capable of detecting long-term statistical regularities over a long sequence of inputs as well as techniques for combining both feedforward and recurrent network architectures.
Most of these ideas were initially developed in the 1960s and 1970s, experienced a rebirth in the 1980s and a growth spurt in the 1990s, but have only been successful at solving incredibly complex and challenging problems in the past few years. The recent surge in interest and success in this area has been driven primarily by two factors. First, the computing technology factor is undeniable. The growth spurt in the 1990s was closely tied to the availability of high-speed computing resources which did not exist in the 1980s. Similarly, the current 21st-century growth spurt is strongly correlated with the availability of new computing technologies which simply did not exist in the 1990s. Second, there has been a fundamental change in the philosophy of how deep learning neural networks are developed. In the early 1990s there was a focus on using networks with fewer parameters and fewer hidden layers and “growing the networks” in a cautious manner. Today, the reverse philosophy has proven to be much more successful: start with very large networks with excessive numbers of parameters, and constrain them with large amounts of regularization, architectural constraints, and very large amounts of training data.
The fundamental idea underlying all of this research is the concept of gradient descent discussed in Episode 31 of Learning Machines 101. This episode and others can be found on the Learning Machines 101 website: www.learningmachines101.com. The basic idea of gradient descent is that it is possible to apply principles of matrix calculus to figure out how to perturb the connections of a network after it experiences a training stimulus, or a small group of training stimuli, so that the network’s prediction error decreases on average. It was also noted that complex network architectures such as those employed in deep learning have not only many suboptimal solutions, which are called local minima, but also many intermediate solutions called saddlepoints. A local minimum might be considered a local solution because any small perturbation to the connections associated with a local minimum is guaranteed to make performance worse. Thus, a learning machine which works by perturbing its parameter values might think that a particular local minimum is a global rather than a local solution.
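As a concrete (and much simplified) sketch of gradient descent, the following NumPy example fits a linear prediction model by repeatedly perturbing its parameters opposite the gradient of the average squared prediction error. The data, model, and learning rate are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # training stimuli (inputs)
true_w = np.array([1.0, -2.0, 0.5])     # "true" connections, for illustration
y = X @ true_w                          # desired responses

w = np.zeros(3)                         # initial connection strengths
lr = 0.1                                # learning rate (step size)
for _ in range(500):
    error = X @ w - y                   # prediction errors on the batch
    grad = X.T @ error / len(y)         # gradient of the mean squared error
    w -= lr * grad                      # perturb parameters downhill
```

After training, `w` recovers the connection strengths that minimize the average prediction error. In deep networks the same downhill-perturbation rule applies, but the error surface has the local minima and saddlepoints mentioned above.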
In addition, during the Deep Learning Tutorial, recent theoretical research was discussed which apparently shows the advantage of multiple layers of units in a network. Theoretical results in the 1990s showed that a network with one layer of hidden units was sufficiently powerful to represent any arbitrary function (that is…solve any arbitrary problem). Recent theoretical results are now showing that computing with multiple layers of hidden units might have important computational advantages. These recent theoretical results were new to me and were only briefly mentioned. We might discuss them in a future episode of Learning Machines 101.
Another interesting recent development is the availability of advanced software packages for constructing complex multi-layer networks using a module-based approach. This is an important development which has supported the rapid application of these networks to a variety of problems.
Another interesting discussion in the tutorial covered the topic of distributing training across multiple machines. This idea of asynchronous updates is a hot topic in the deep learning literature. The basic idea is that computer processor A might calculate a perturbation to the connections of the learning machine based upon the connection values it observed at time step 100, but it might take one or two time steps to complete that calculation. Then, before processor A can apply its update, computer processor B might update the learning machine’s connections, so updates are computed from slightly stale parameter values. This seems to work quite well empirically, and theoretical results explaining why it works are available.
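The effect of stale updates can be imitated in a toy single-process simulation: each "worker" computes its gradient from a parameter snapshot taken two steps earlier, yet the shared parameters still converge. This is only a sketch of the staleness idea on invented data, not a model of any particular distributed system.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)                          # shared parameters
lr = 0.05
snapshots = [w.copy(), w.copy()]         # two workers' stale copies
for step in range(1000):
    stale_w = snapshots[step % 2]        # gradient uses parameters from
                                         # two steps ago (the "staleness")
    grad = X.T @ (X @ stale_w - y) / len(y)
    snapshots[step % 2] = w.copy()       # worker grabs a fresh snapshot
    w = w - lr * grad                    # shared parameters updated anyway
```

Despite every gradient being computed from out-of-date parameters, `w` still converges toward `true_w`, which mirrors the empirical observation reported in the tutorial.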
The next part of the tutorial covered applications of deep learning and future directions. Apparently, before 2011 very few modern image understanding systems used convolutional neural networks. Today, almost all modern image understanding systems use convolutional neural networks, including many major players such as Google and Facebook, as well as many major hardware companies which use convolutional neural networks to tune the parameters of their hardware computer systems.
Applications include scene parsing, where each pixel is assigned a label such as door, sky, or window; getting a robot to drive a car autonomously; and segmenting and localizing objects in an image. Face recognition is also an interesting problem. It is easy to get lots of examples of faces, but to enhance learning the learning machine needs some additional information: even though there are a lot of faces, we often don’t know the categories. So the basic strategy is to train the network with pairs of faces of the same person versus pairs of faces of different people. This helps the learning machine learn discriminative features which map different images of the same person to similar representations and images of different people to distinct representations.
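One common way to formalize this pairs-of-faces strategy is a contrastive loss on the learned representations. The sketch below is a standard textbook formulation and an assumption on my part, not necessarily the exact loss discussed in the tutorial.

```python
import numpy as np

def contrastive_loss(f_a, f_b, same_person, margin=1.0):
    """Training signal for a pair of face representations f_a, f_b:
    pull representations of the same person together, and push
    representations of different people at least `margin` apart."""
    d = np.linalg.norm(f_a - f_b)        # distance between representations
    if same_person:
        return 0.5 * d ** 2              # penalize any separation
    return 0.5 * max(0.0, margin - d) ** 2  # penalize being too close
```

A pair of identical representations labeled "same person" incurs zero loss, while a distant pair labeled "same person" (or a close pair labeled "different people") incurs a positive loss, which is exactly the discriminative pressure described above.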
Other applications include: image captioning, where you generate a simple sentence describing the contents of an image; learning features of a video image by watching videos; and the development of networks which can actually generate pictures they have never seen before. For example, a network can be trained to draw a face of someone from a particular perspective even though the network has never been trained on that perspective before.
Important advances have also been made in the field of speech recognition. Specifically, deep learning convolutional networks have made important advances in speech recognition. And, it was very interesting to learn that better performance in one language was obtained when the network was forced to learn many different languages in addition to the target language. Apparently there are common statistical regularities across languages that this type of network is capable of exploiting.
Other variations include the encoder-decoder framework, which has been applied to a variety of problems including language translation. The basic idea is that you start with a word sequence and map it into a representation of the meaning of the sentence specified by the word sequence. This is called the “encoder” component. There is also a “decoder” component which maps the meaning representation into a word sequence. Networks such as these matched the state of the art of conventional language translation methods in 2014 and surpassed it in 2015.
Various novel architectures are also being explored. One idea is an “attention mechanism” for deep learning: a second network tells the original convolutional network how to move its feature scanner across the image by focusing on the important parts of the image.
Still, feedforward networks have important obvious limitations since they are specifically designed for supervised learning problems. In a supervised learning problem, part of the training stimulus is designated as the “input pattern” while the other part is designated as the “desired response”. Unsupervised learning machines are more powerful because they learn the internal structure of a given training stimulus and are able to predict any subset of the elements of the training stimulus from any other subset.
Approaches for dealing with the unsupervised learning problem include the “autoencoder”, which basically uses the entire training stimulus to predict itself. The network architecture is designed to force the learning machine to extract crucial statistical regularities rather than simply memorizing the training stimulus. This is typically achieved by creating “informational bottlenecks”: the original training stimulus is processed through multiple layers until it reaches a layer of units which is small enough that crucial statistical regularities must be abstracted. At this point additional layers process the intermediate results to regenerate the entire training stimulus at the output. Denoising autoencoders operate in a similar manner, but the input to the system is corrupted in various ways and the denoising autoencoder is trained to reconstruct the original input before it was corrupted with noise. This type of training helps the denoising autoencoder acquire an internal model of the world.
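The bottleneck idea can be sketched as a tiny autoencoder: an 8-dimensional training stimulus is squeezed through a 2-unit hidden layer and then expanded back to 8 outputs, so any trained network of this shape must abstract regularities rather than memorize. The layer sizes and initialization here are invented for illustration, and the network is shown untrained.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder squeezes 8 inputs into a 2-unit "informational bottleneck";
# decoder expands the 2-unit code back into 8 outputs.
W_enc = rng.normal(scale=0.1, size=(2, 8))
W_dec = rng.normal(scale=0.1, size=(8, 2))

def autoencode(x):
    code = sigmoid(W_enc @ x)   # compressed internal representation
    return W_dec @ code         # reconstruction of the training stimulus

x = rng.normal(size=8)          # a training stimulus
x_hat = autoencode(x)           # training would minimize mean((x - x_hat)**2)
```

A denoising autoencoder would use the same architecture but feed `autoencode` a corrupted copy of `x` while still scoring the reconstruction against the clean `x`.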
Other systems, such as the Boltzmann machine and the Helmholtz machine, use Monte Carlo Markov Chain methods. These methods, which will be discussed in my review of the afternoon tutorial and are also covered in Episodes 21 and 22 of Learning Machines 101, are more specifically designed to solve the unsupervised learning problem, but they are often very computationally demanding.
The tutorial ended with some general comments about the goals of learning. The goal of human learning is to figure out how the world works rather than to learn some narrow task. It was hypothesized that humans generalize more effectively by having deeper levels of representation. Deep learning provides a mechanism for constructing such deeper levels of representation by having the learning machine discover them without human intervention.
I have provided in the show notes at: www.learningmachines101.com hyperlinks to all of the papers published at the Neural Information Processing Systems conference since 1987, the workshop and conference schedule for the Neural Information Processing Systems conference, and links to related episodes of Learning Machines 101!
If you are a member of the Learning Machines 101 community, please update your user profile. If you look carefully you can provide specific information about your interests on the user profile when you register for learning machines 101 or when you receive the bi-monthly Learning Machines 101 email update!
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.
Also check out the Statistical Machine Learning Forum on LinkedIn and Twitter at “lm101talk”.
Also check us out at PINTEREST as well!
From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group! So please make sure to visit the website: www.learningmachines101.com and update your profile!
So thanks for your participation in today’s show! I greatly appreciate your support and interest in the show!!
Further Reading:
Proceedings of ALL Neural Information Processing System Conferences.
2015 Neural Information Processing Systems Conference Book.
2015 Neural Information Processing Systems Workshop Book.
Wikipedia Entry for Professor Sejnowski
Related Episodes of Learning Machines 101:
Episode 11 (Markov Modeling),
Episode 16 (Gradient Descent),
Episode 21 (Monte Carlo Markov Chain)
Episode 22 (Learning in Monte Carlo Markov Chain Machines),
Episode 23 (Deep Learning and Feedforward Networks),
Episode 29 (Convolutional Neural Networks and Rectilinear Units)
Episode 30 (Dropout in Deep Networks and Model Averaging)
Episode 36 (Recurrent Deep Networks)