Episode Summary:
In this episode we describe how to build a machine that can take any given pattern of inputs and generate any desired pattern of outputs!
Show Notes:
Hello everyone! Welcome to the fourteenth episode of the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in an entertaining and, I hope, educational manner.
In this episode, we discuss the problem of how to build a machine that can do anything! More specifically, given a set of input patterns to the machine and a desired output pattern for each of those input patterns, we would like to build a machine that generates the specified output pattern whenever the corresponding input pattern is presented. This problem may be interpreted as an example of a supervised learning problem.
Our story today begins with the birth of a fellow named Walter Pitts, who was born in Detroit, Michigan in 1923. Pitts was something of a genius, and by the age of 15 he was reading and critically reviewing mathematics texts written by world-renowned professors of mathematics. At about this time, Pitts ran away from his family to the University of Chicago, where he continued his study of mathematics. At the age of 19, he began collaborating with Warren McCulloch. McCulloch was a neuroscientist and psychologist pursuing research in the new science of biological cybernetics, which focused on understanding machines and animals interacting with the world as control systems.
Pitts and McCulloch were interested in the problem of how principles of reasoning and logic could be implemented in the nervous system. Specifically they considered an abstract mathematical model of a brain cell (also known as a neuron). Their mathematical model was generally consistent with current theories of neural processing in the early 1940s. They assumed that a neuron consisted of several inputs and an output. Each neuron had both excitatory and inhibitory inputs. If the sum of the excitatory inputs exceeded the neuron’s firing threshold, then the neuron would fire unless one or more inhibitory inputs were activated.
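To make the firing rule concrete, here is a minimal sketch in Python of the abstract neuron just described. The function name, the representation of inputs as lists of 0s and 1s, and the particular numbers are illustrative choices of mine, not details from the original McCulloch-Pitts paper.

```python
# A minimal sketch of the McCulloch-Pitts firing rule described above.
# Inputs are binary (0 = inactive, 1 = active). The unit fires (returns 1)
# only if the sum of its excitatory inputs reaches the firing threshold
# and no inhibitory input is active.

def mp_neuron(excitatory, inhibitory, threshold):
    if any(inhibitory):   # a single active inhibitory input prevents firing
        return 0
    return 1 if sum(excitatory) >= threshold else 0

print(mp_neuron(excitatory=[1, 1, 0], inhibitory=[0], threshold=2))  # fires -> 1
print(mp_neuron(excitatory=[1, 1, 0], inhibitory=[1], threshold=2))  # inhibited -> 0
```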
In 1943, McCulloch and Pitts published a paper titled “A logical calculus of ideas immanent in nervous activity” in the journal titled “Bulletin of Mathematical Biophysics”. This famous paper showed that these abstract model neurons could be wired up in such a way so that they could realize any desired logical function. That is, these abstract mathematical models of neurons could be connected together so that they implemented any desired collection of IF-THEN logical rules. An example of an IF-THEN logical rule might be: IF I am hungry AND berries are available, THEN I should eat the berries. Check out Episode 3 of this podcast series titled “How to represent knowledge using logical rules” for further discussion of the concept of IF-THEN logical rules.
The essential idea underlying the McCulloch-Pitts analysis was to semantically interpret the firing of a neuron as the event that some assertion is true while a neuron which failed to fire could be semantically interpreted as corresponding to the event that an assertion is false. A neuron whose threshold was set to be relatively high could be used to implement the logic operation of “AND”. That is, only fire if ALL of the excitatory inputs to the neuron are active and no inhibitory inputs are active. A neuron whose threshold was set relatively low could be used to implement the logic operation of “OR”. That is, if ANY of the excitatory input to the neuron are active and no inhibitory are active. A neuron whose threshold was set very low so that the neuron would normally be active but could be inhibited by an inhibitory input could be used to implement the logic operation “NOT”. These three logic operations: AND, OR, and NOT are the building blocks of logic. By combining neurons that can implement the logic operations of AND, OR, and NOT it can be shown that any logical expression in the world can be represented by a network of these special McCulloch-Pitts formal neurons.
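The sketch below shows how the three logic operations can be obtained from the same threshold rule simply by choosing the threshold (and, for NOT, routing the input to an inhibitory connection). The particular threshold values are illustrative choices consistent with the description above.

```python
# AND, OR, and NOT built from the same McCulloch-Pitts threshold rule,
# differing only in the threshold setting and the use of an inhibitory input.

def mp_neuron(excitatory, inhibitory, threshold):
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0

def AND(a, b):   # high threshold: fire only if ALL excitatory inputs are active
    return mp_neuron([a, b], [], threshold=2)

def OR(a, b):    # low threshold: fire if ANY excitatory input is active
    return mp_neuron([a, b], [], threshold=1)

def NOT(a):      # normally active unit that is silenced by an inhibitory input
    return mp_neuron([], [a], threshold=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NOT a:", NOT(a))
```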
To see this, suppose we have a logic machine with a collection of inputs and we want to build a network of McCulloch-Pitts neurons that will generate a desired pattern of outputs for each pattern of inputs. The number of possible input patterns is always finite rather than infinite. For example, if the logic machine has 3 inputs, then since each input can be either true or false there are 2^3, or 8, possible input patterns; a logic machine with 4 inputs would have 2^4, or 16, possible input patterns. The first step is to add some additional McCulloch-Pitts formal neurons which implement the NOT function: one NOT neuron for each input, so that whenever an input to the logic machine is OFF or FALSE there is a McCulloch-Pitts neuron that turns on. The second step is to include one McCulloch-Pitts neuron for each possible pattern of inputs, where each such neuron is chosen to implement the AND operation. Thus, we now have a unique neuron firing for each unique input pattern. The output of that neuron can then be used to generate the required action of the logic machine for that input pattern. A small sketch of this construction appears below.
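The following sketch applies this construction to a hypothetical logic machine with 3 inputs whose desired behavior is specified by a small truth table. The truth table, the unit names, and the wiring code are illustrative assumptions; only the strategy (one NOT unit per input, one AND-style unit per input pattern, and an OR unit collecting the results) comes from the discussion above.

```python
# Realizing an arbitrary truth table with McCulloch-Pitts style units:
# NOT units for each input, one pattern-detector (AND) unit per input pattern
# that should produce output 1, and an OR unit that collects the detectors.
from itertools import product

def threshold_unit(excitatory, inhibitory, threshold):
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# Hypothetical desired behavior: output 1 for these input patterns, 0 otherwise.
desired = {(0, 0, 1): 1, (1, 1, 0): 1}

def logic_machine(inputs):
    negated = [threshold_unit([], [x], 0) for x in inputs]   # NOT units
    detectors = []
    for pattern, output in desired.items():
        if output == 1:
            # Wire the raw input line where the pattern requires a 1 and the
            # negated line where it requires a 0; fire only if all lines are on.
            lines = [inputs[i] if bit else negated[i] for i, bit in enumerate(pattern)]
            detectors.append(threshold_unit(lines, [], threshold=len(lines)))
    return threshold_unit(detectors, [], threshold=1)        # OR unit

for pattern in product((0, 1), repeat=3):
    print(pattern, "->", logic_machine(list(pattern)))
```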
This is the essential idea of the famous paper by McCulloch and Pitts: any arbitrary logical formula can be represented by a network of McCulloch-Pitts neurons. Note that this paper was published several years before the appearance of the first digital computer and, in fact, must have had an important impact on its development, because it was one of the few papers cited by von Neumann in his technical report describing the first digital computer. Today, in fact, we refer to these abstract neurons proposed by McCulloch and Pitts as “digital logic gates”!
The ability to build machines that can implement any arbitrary logical function is an important and valuable accomplishment which can be achieved using McCulloch-Pitts formal neurons. However, in many cases we need to build other types of machines whose inputs are numerical rather than logical. Our analysis of McCulloch-Pitts neurons is based upon the idea that the logic machine only observes a finite rather than an infinite collection of input patterns. This is a limitation of the McCulloch-Pitts analysis because it restricts itself to logical machines rather than machines that can process an infinite variety of inputs.
There are many examples of situations in the world which are naturally represented using patterns of numerical values rather than patterns whose elements are restricted to the values of zero (false) or one (true). There are also many examples of machines that can process this information directly. Here is a short list.
Auditory information is perceived over time and is transmitted via acoustic pressure waves that strike eardrums or microphones to generate neural or electrical activity. Thus, a sequence of numbers is used to represent auditory information.
Visual information is perceived over space and time and transmitted via electromagnetic radiation which can be detected by eyeballs and photodetectors. In this case, we have sequences of large arrays of numbers which are used to represent incoming visual information.
Proprioceptive feedback is also naturally represented numerically rather than by a simple logical true or false. Numbers specifying the internal state of the body, such as muscle tension, are available to provide information about performance.
The magnitude of heat or cold which is detected is naturally represented on a numerical scale.
Smell and taste are difficult to represent as logical assertions. They are most naturally represented as patterns of numbers.
Indeed, virtually any perceived input pattern seems as if it is more fundamentally represented on a numerical scale rather than as a collection of assertions which can individually take on the values of true or false.
Furthermore, it is not always possible to represent the output of an intelligent system as a pattern of conclusions which can be classified as either true or false. The output of the intelligent system might involve the control of movement, which is not naturally represented as turning a switch on or off. Even the process of thinking, planning, and reaching conclusions cannot always be represented using the formalism of logic. Intelligent systems (as discussed in Episodes 7 and 8) may generate conclusions which are represented in terms of the numerical likelihood that one action is more or less preferable to another. That is, an intelligent system whose inputs are a collection of numerical preferences and whose outputs are a collection of numerical preferences is not solving a logic problem. Thus, in the real world, there may be infinite variation in the inputs to the machine, and the outputs of the machine may have infinite variation as well. We cannot always represent the world as a finite number of possible input patterns and the output of a machine as a finite collection of possible output patterns.
These thoughts lead us to a concept in mathematics which generalizes the concept of an IF-THEN logical rule. This new concept is called a “FUNCTION”. You can think of a function as a type of machine whose inputs are a collection of numerical measurements and which then generates a set of outputs which are also represented as a pattern of numbers.
As previously noted, such situations are quite common in areas such as speech signal processing, image processing, or problems involving the control of physical devices and systems. The problem of designing (or even learning) a function to achieve this goal seems daunting. Indeed, it can be shown that in this more general case the problem of designing such a machine is equivalent to selecting a function from an infinite rather than a finite collection of functions.
Fortunately, however, there is some good news! Most of the important functions in this infinite collection of functions can be approximately represented as “curvy” or “smooth” functions which are technically known as continuous functions. From the perspective of designing a machine which takes some pattern of numerical inputs and generates some pattern of numerical outputs, the assumption that the machine can be represented as a continuous function simply means that if you give the machine a pattern of numerical inputs and observe a response, then you should observe a similar response when you give the machine a very similar pattern of numerical inputs. This is actually a very desirable property for a machine which must generalize from experience and make inferences about numerical patterns of inputs which it has never seen before.
Given the concept of a continuous function, we now introduce the idea of a function space. Suppose we start with a finite number of functions which are called basis functions. We now consider the set of all possible functions which can be formed by computing weighted sums of the basis functions. This gigantic set of functions generated from the basis functions is called a linear function space. If we are careful about how we select our basis functions, we can show that any continuous function can be approximately represented as a weighted sum of basis functions. Not just any set of basis functions will do, however. I will go over a few types which are commonly used in machine learning and which have been mathematically proven to be able to approximately represent any arbitrary continuous function.
If the basis functions are chosen in an appropriate manner, one can prove using mathematical theorems that a weighted sum of basis functions may be used to approximately represent any given continuous function to any desired degree of accuracy. This type of approach is an example of function approximation which is widely used in the field of machine learning.
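As a rough numerical illustration of function approximation by a weighted sum of basis functions, the sketch below fits the weights by ordinary least squares. The target function and the choice of simple polynomial basis functions are arbitrary assumptions made only for illustration; they are not the basis functions discussed in the rest of this episode.

```python
# Approximating a continuous function as a weighted sum of basis functions,
# with the weights chosen by ordinary least squares.
import numpy as np

x = np.linspace(-1.0, 1.0, 200)
target = np.sin(3.0 * x)                 # the continuous function to approximate

# Basis functions evaluated at every input: 1, x, x^2, ..., x^7
basis = np.vstack([x ** k for k in range(8)]).T

weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
approximation = basis @ weights

print("maximum approximation error:", np.max(np.abs(approximation - target)))
```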
Although many different types of basis functions will have the appropriate properties for specifying a function space which can approximately represent any arbitrary continuous function, today we only focus on two specific types of basis functions. The two most widely used types of basis functions are Radial Basis Functions and Sigmoidal Basis Functions. Both of these types of basis functions are capable of specifying function spaces which can represent any arbitrary continuous function. We will now discuss each of these two important types of basis functions.
The first type of basis function is called an RBF or Radial Basis Function. An RBF has the property that it takes on its maximum value for one particular pattern of numerical inputs. For example, suppose an RBF is designed to respond to the input pattern of three numerical inputs: [112, 42, -10]. When the pattern [112, 42, -10] is presented, the RBF might generate the number 10, but if a very different pattern such as [0, -100, 0] is presented, the RBF might generate a number which is very close to zero. The fun part begins when we present a pattern of numerical inputs which is similar but not identical to the pattern that the RBF knows. So, if we present the pattern [112, 42, -9] instead of [112, 42, -10] the RBF might respond with the number 9 instead of the number 10. Or if we present the pattern [100, 40, -8] instead of [112, 42, -10] the RBF might respond with the number 7 instead of the number 10.
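A sketch of a single Gaussian-shaped RBF tuned to the pattern [112, 42, -10] is shown below. The peak height and width are illustrative assumptions, and the exact numbers it prints will only roughly match the values quoted above; the point is simply that the response is largest at the center and falls off with distance.

```python
# One Gaussian-shaped RBF whose center is the pattern it "knows".
import numpy as np

center = np.array([112.0, 42.0, -10.0])
height = 10.0    # maximum response, produced when the input equals the center
width = 15.0     # controls how quickly the response falls off with distance

def rbf(pattern):
    distance_squared = np.sum((np.asarray(pattern, dtype=float) - center) ** 2)
    return height * np.exp(-distance_squared / (2.0 * width ** 2))

print(rbf([112, 42, -10]))   # maximum response: 10.0
print(rbf([112, 42, -9]))    # slightly different input -> slightly smaller response
print(rbf([100, 40, -8]))    # more different input -> noticeably smaller response
print(rbf([0, -100, 0]))     # very different input -> response close to zero
```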
Geometrically, we can imagine the RBF as a “hill” where the location of the center of the hill corresponds to the identity of the input pattern the RBF “knows” and the height of the hill corresponds to the response of the RBF to a given input pattern. Locations far away from the center of the hill will correspond to “low heights” or equivalently “weak responses”. Finally, there are other parameters of the RBF which can be used to adjust the height, width, and shape of the RBF hill but regardless of the adjustment of these various parameters the geometric shape of an RBF always looks like a “bump” or “hill”.
So far we have only talked about a single RBF, but as mentioned previously, in order to approximately represent different functions we will want to compute a weighted sum of RBFs. Here is an easy way to visualize what this means. Let’s go back to the idea of the RBF as a hill where the location of the hill corresponds to the input pattern that the RBF “knows” and the height of the hill corresponds to the response of the RBF. Now suppose we want to approximate an arbitrary continuous function using a weighted sum of RBFs. Geometrically, this corresponds to the goal of building a shape in the form of a mountain range of RBF hills. By choosing the location, shape, and height of each RBF hill in an appropriate manner, it should be possible to design a mountain range of any desired shape. The shape of the mountain range is interpretable in this context as the specification of a desired continuous function which we wish to approximate using the RBFs.
We can also interpret the RBF as an alternative abstract mathematical model of a neuron which is different from a McCulloch-Pitts formal neuron. The RBF neuron is a feature detector which responds strongly to a particular pattern of numerical inputs. And, just like the McCulloch-Pitts neuron, we can “wire together” RBF neurons to generate novel machines to support intelligent computations. Specifically, we can imagine having a large collection of RBF formal neurons which receive the same pattern of numerical inputs. The outputs of these RBF formal neurons generate a new pattern of numerical outputs. This new pattern of numerical outputs can then be fed as input to another abstract neuron model which computes a weighted sum of its inputs. Note that we use the terminology “neuron model” as a pedagogical device, but if we are not explicitly making testable hypotheses regarding actual neural processing, it is more appropriate to substitute the terminology “unit” for “formal neuron”. Thus, we might refer to an RBF neuron as an “RBF unit” since our goal is not to posit particular theories of neural processing but rather to describe a particular computing strategy which may or may not have a plausible neural interpretation.
Anyway, we have thus shown that any arbitrary continuous function can be approximated by using a large number of RBF units whose outputs feed into another unit. More specifically, the outputs of the RBF units are then fed as input to another unit which computes a weighted sum of these outputs. This other unit may be called a “linear formal neuron” or “linear unit”.
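The sketch below wires this arrangement together: a layer of RBF units whose outputs feed a single linear unit that computes a weighted sum. The centers, width, target function, and the use of least squares to pick the weights are all illustrative assumptions; the episode does not specify a particular procedure for choosing the weights.

```python
# A layer of RBF units feeding one linear unit: the "mountain range" built
# from RBF hills approximates a continuous target function.
import numpy as np

x = np.linspace(0.0, 1.0, 200)
target = np.sin(2.0 * np.pi * x)         # continuous function to approximate

centers = np.linspace(0.0, 1.0, 12)      # where each RBF hill is located
width = 0.08

# Each column is one RBF unit's response to every input value.
rbf_layer = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

# The linear unit computes a weighted sum of the RBF outputs; here the weights
# are chosen by least squares so the sum matches the target function.
weights, *_ = np.linalg.lstsq(rbf_layer, target, rcond=None)
output = rbf_layer @ weights

print("maximum approximation error:", np.max(np.abs(output - target)))
```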
RBF units are very useful for representing arbitrary analog signals which don’t have the semantic interpretation of true or false. McCulloch-Pitts formal neurons or “MP units” are very useful for representing input signals in which the inputs have numerical values which are equal to the number one indicating the presence of a true assertion or are equal to the number zero indicating the presence of a false assertion. Thus, RBF units are not better or worse than MP units but rather it is important to select the appropriate unit for the appropriate modeling task.
However, suppose we have a situation as discussed in Episode 7 where we have “fuzzy logic”. In other words, the inputs to our inference machine can be equal to the number “1” indicating “true” or equal to the number “0” indicating “false”, but they can also be equal to a number between “0” and “1” indicating “maybe”. That is, an input of 9/10 means the assertion is very close to true and an input of 1/10 means the assertion is very close to false. In order to represent arbitrary continuous functions with these types of input patterns, one cannot use MP units since such units are only designed to function correctly with inputs which take on the value of “0” or “1”. RBF units are also not the best choice because they are not designed to model IF-THEN logical rules like MP units.
Thus, this leads us to a new important type of basis function called the sigmoidal unit. The sigmoidal unit is a smooth version of an MP unit. To review, a McCulloch-Pitts or MP unit fires if enough of its inputs are active. A sigmoidal unit works by computing a weighted sum of its inputs: if the weighted sum is very large, then the sigmoidal unit generates its maximum response; if the weighted sum is very small, then the sigmoidal unit generates its minimum response; and if the weighted sum is neither small nor large, then the sigmoidal unit’s response varies smoothly and approximately linearly with the weighted sum of its inputs.
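Here is a sketch of a sigmoidal unit: a weighted sum of inputs passed through a smooth logistic squashing function. The particular weights and bias are illustrative assumptions chosen so that the unit behaves roughly like a “soft AND” for the fuzzy inputs discussed above.

```python
# A sigmoidal unit: weighted sum of inputs followed by a smooth squashing function.
import numpy as np

def sigmoidal_unit(inputs, weights, bias):
    net = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-net))    # logistic function: output between 0 and 1

weights = np.array([10.0, 10.0])
bias = -15.0   # roughly: respond strongly only when both inputs are near 1

print(sigmoidal_unit([1.0, 1.0], weights, bias))   # near 1: "true AND true"
print(sigmoidal_unit([1.0, 0.0], weights, bias))   # near 0: "true AND false"
print(sigmoidal_unit([0.9, 0.8], weights, bias))   # mostly true inputs -> high response
print(sigmoidal_unit([0.7, 0.8], weights, bias))   # partially true inputs -> response of 0.5
```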
Geometrically, the response characteristics of a sigmoidal unit can be visualized as a road or ramp that gradually increases in altitude from a low altitude to a high altitude. Assume the road is level at its lowest altitude, then its altitude increases, and then the road is level again at its highest altitude. The location of all points on the highest altitude portion of the road correspond to input patterns which are assigned the value of 1 or logical true.
The location of all points on the lowest altitude portion of the road correspond to input patterns which are assigned the value of 0 or logical false. Points on the ramp section of the road which traverses from the lowest to highest altitude correspond to input patterns which are considered to be partially true and partially false with input patterns associated with greater altitudes being identified as patterns that are more true than false.
Although a weighted sum of sigmoidal units, like the RBF unit, is capable of approximately representing any continuous function, networks of sigmoidal units are useful for representing functions that approximate logical computations. Networks of RBF units, on the other hand, are useful for representing functions that process analog data such as that encountered in speech signal processing or image processing.
To summarize, we have shown that MP units may be used to construct machines that can represent any arbitrary logical function. RBF units and sigmoidal units may be used to construct machines that can approximately represent any arbitrary continuous function. As a general rule, however, sigmoidal units are preferred when the machine is intended to represent a continuous approximation to a logical function such as a fuzzy logic machine as discussed in Episode 7. In other cases, where the machine is not modeling a logic computation or a fuzzy logic computation, then RBF units should be considered.
It is important to emphasize, however, that the theoretical results we have discussed today are really theorems about representation. These results do not tell us whether a machine with RBF or sigmoidal units will exhibit better generalization performance, a topic discussed in Episode 12 on cross-validation. In other words, the theoretical results described today do not say anything about how the machine will behave when it is presented with a novel input, unless that input is very similar to one of the trained input patterns. These results provide NECESSARY conditions for effective memorization but not SUFFICIENT conditions for generalization performance. These results do not tell us how to DESIGN a learning rule.
But don’t worry! We will cover the topics of how to design learning procedures for learning machines and evaluate generalization performance in considerable detail in future episodes of Learning Machines 101!
Further Reading:
Radial Basis Function Network. Wikipedia article. (http://en.wikipedia.org/wiki/Radial_basis_function_network)
McCulloch-Pitts Formal Neuron. Wikipedia article. (http://en.wikipedia.org/wiki/Artificial_neuron) [Good introduction.]
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
White, H. and Stinchcombe, M. (1992). Approximating and learning unknown mappings using multilayer networks with bounded weights. In H. White (Ed.), Artificial Neural Networks: Approximation and Learning Theory. Blackwell: Mass. (See Theorems 2.1 and 2.6.) [Advanced reading.]
“Stone-Weierstrass Theorem”. Wikipedia article (https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem)
[Reviews theorem that any continuous function can be approximated by a polynomial.]
Basis (linear algebra). Wikipedia article. (https://en.wikipedia.org/wiki/Basis_(linear_algebra)) [Note that the elements of a basis in linear algebra may be functions rather than ordinary vectors.]
Keywords: Function Approximation, Supervised Learning, McCulloch-Pitts, Sigmoid, Radial Basis Function