LM101-068: How to Design Automatic Learning Rate Selection for Gradient Descent Type Machine Learning Algorithms
Episode Summary:
This 68th episode of Learning Machines 101 discusses a broad class of unsupervised, supervised, and reinforcement machine learning algorithms which iteratively update their parameter vector by adding a perturbation computed from all of the training data. This process is repeated until a parameter vector is generated which exhibits improved predictive performance. The magnitude of the perturbation at each learning iteration is called the “stepsize” or “learning rate,” and the direction of the perturbation vector is called the “search direction.” Simple mathematical formulas are presented, based upon research from the late 1960s by Philip Wolfe and G. Zoutendijk, that ensure convergence of the generated sequence of parameter vectors. These formulas may be used as the basis for the design of artificially intelligent automatic learning rate selection algorithms.
Show Notes:
Hello everyone! Welcome to the 68th podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. Episode 65 explained the concept of a gradient descent algorithm. If you have not yet listened to Episode 65, please do so first, because this episode is a continuation of the concepts discussed there.
Most machine learning algorithms (including unsupervised, supervised, and reinforcement learning algorithms) work by perturbing the parameters of the learning machine based upon information in the set of training data. The size of this perturbation governs the rate of learning. If the rate of learning is too slow, that is, if we change the parameter values of the learning machine by only a very small amount at each learning trial, then it takes forever to learn. However, if we try to increase the speed of learning by making large changes to the parameter values, then a large change might damage the learning machine’s existing knowledge of its environment. Ideally, the learning machine should learn at a slow enough rate that relevant key statistical regularities are distinguished from statistical flukes in the learning machine’s statistical environment. If the rate of learning is too fast, then freak statistical events in the environment might be learned by the learning machine and disrupt its previous memories of critically important statistical regularities.
This 68th episode of Learning Machines 101 discusses a broad class of unsupervised, supervised, and reinforcement machine learning algorithms which iteratively update their parameter vector by adding a perturbation based upon the training data until the resulting parameter vector generates improved predictive performance. The magnitude of this perturbation is called the “stepsize” or “learning rate,” and the direction of the perturbation vector is called the “search direction.” Simple mathematical formulas are presented, based upon research from the late 1960s by Philip Wolfe and G. Zoutendijk, that ensure convergence of the generated sequence of parameter vectors. A relevant recent review of this work can be found in Nocedal and Wright (1999); this reference is provided in the show notes of this episode of Learning Machines 101 at: www.learningmachines101.com. These formulas may be used as the basis for the design of artificially intelligent automatic learning rate selection algorithms.
Many machine learning algorithms can be interpreted as gradient descent algorithms. Gradient descent type methods are used in unsupervised and reinforcement learning algorithms as well as supervised learning algorithms. Indeed, recent advances in the development of learning methods for deep learning neural networks are based upon the concept of gradient descent, and the standard methods for estimating parameters in classical logistic and multinomial logistic regression models are based upon the concept of gradient descent as well. In this episode, we will define a generalization of the classical gradient descent algorithm which includes a large class of machine learning algorithms.
In this episode we discuss a large class of descent algorithms, variations of the classical gradient descent algorithm, which are commonly used in many machine learning algorithms including deep learning. We provide some simple, easily computed mathematical formulas and conditions for checking whether a given descent algorithm will converge to an appropriate set of solutions. These conditions can also be used, as I will explain, as the basis for designing methods for automatically adjusting the learning rate or stepsize of a descent algorithm.
Consider a machine learning algorithm which generates a prediction for a given input pattern based upon the current parameter values of the learning machine. One can then compute the prediction error for each input pattern if we know the desired response of the learning machine for that input pattern. For a given training data set, the average prediction error across all training stimuli is assumed to be a differentiable function of the learning machine’s parameter values. The generic descent algorithm works by first making an initial guess for the learning machine’s parameter values and then perturbing these parameter values in such a way that the new parameter values will have a smaller average prediction error than the previous parameter values. This perturbation is often represented as a positive number called the “step size” multiplied by a list of numbers called the “search direction”.
If the descent algorithm is a gradient descent algorithm, then the search direction is chosen to be negative one multiplied by the derivative of the prediction error with respect to the learning machine’s parameter values. It can be shown that this choice of search direction decreases the prediction error most rapidly when the step size is sufficiently small. In fact, it can also be shown that if we choose a search direction whose angular separation from the gradient descent search direction is strictly less than 90 degrees, then the prediction error is guaranteed to decrease provided the step size is sufficiently small. This provides quite a bit of flexibility in choosing the pattern of perturbations to the parameter values of the learning machine during the learning process.
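For readers who like formulas, here is one way to write this generic update rule in standard notation (the notation here is mine, not from the episode audio): if $\theta_t$ denotes the parameter vector at iteration $t$, $\ell$ the average prediction error function, $\gamma_t > 0$ the stepsize, and $d_t$ the search direction, then

$$
\theta_{t+1} = \theta_t + \gamma_t \, d_t ,
$$

where gradient descent corresponds to the choice $d_t = -\nabla \ell(\theta_t)$, and the 90-degree condition above corresponds to requiring $d_t^\top \nabla \ell(\theta_t) < 0$.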
In order to understand the concept of angular separation, consider the following example. Suppose we draw an arrow or “vector” from Earth such that the arrow point touches a nearby star; this is an example of a vector in a three-dimensional space. The distance to the star as measured by the length of the arrow is called the magnitude of the vector. We can also draw another vector using the same measurement units which points in the same direction toward the star but which has a length of one. This latter vector is called a “normalized vector”. Now draw another vector from Earth to a different star. A special formula for combining the normalized versions of these two vectors can be shown to specify the angular separation between the two vectors. When the angular separation is zero degrees, the two vectors are pointing in the same direction. If the two vectors are pointing in opposite directions, then the angular separation is 180 degrees.
So what is this special operation which has the capability of computing the angular separation between two normalized vectors? It is called a dot product operation and it is very easy to compute. I’ll illustrate it with an example. Suppose we have a coordinate system in three dimensions and the location of one star is at the point (1,2,3) and the location of the other star is at the point (0,4,2). These two points implicitly specify two unnormalized vectors in a three-dimensional space. The dot product of these two unnormalized vectors is simply equal to the sum of the products of the corresponding elements of the two vectors. That is, for the vectors (1,2,3) and (0,4,2) we have 1 times 0 plus 2 times 4 plus 3 times 2, which equals 0 plus 8 plus 6, or 14. Finally, the good news is that everything I just described about computing the angular separation between a pair of vectors in three-dimensional space also works for computing the angular separation between a pair of vectors in a space with millions of dimensions!
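Here is a minimal sketch in Python (using numpy, with the star coordinates from the example above) of the dot product and angular separation calculations; exactly the same code works for vectors with millions of elements:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])  # first star's coordinates
v = np.array([0.0, 4.0, 2.0])  # second star's coordinates

dot = np.dot(u, v)  # 1*0 + 2*4 + 3*2 = 14

# The dot product of the normalized vectors equals the cosine of the
# angular separation between them.
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))
angle_in_degrees = np.degrees(np.arccos(cosine))

print(dot)               # 14.0
print(angle_in_degrees)  # about 33 degrees
```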
Thus, we can easily check whether a particular search direction satisfies the condition that its angle with the gradient descent direction is less than 90 degrees. A common choice is to set the search direction equal to negative one multiplied by the gradient; the dot product in this case is negative one multiplied by the sum of the squares of the elements of the gradient vector, which is always negative whenever the gradient is nonzero.
So the connection with machine learning is that each star in the sky corresponds to a different choice of parameter values for your learning machine. We can think of a gradient descent algorithm as specifying the trajectory of a spaceship from Earth. Every so often, the spaceship computes the angular separation between the planned direction of flight, or equivalently the search direction, which is an arrow pointing from the spaceship, and another arrow pointing from the spaceship directly towards the target star. This latter arrow corresponds to the negative gradient direction when the performance function is convex or bowl-shaped. The pilot of the spaceship checks that the angular separation between the direction of flight (the search direction) and the arrow pointing directly at the star (the negative gradient) is always less than 90 degrees. Then the spaceship travels along the search direction for a short distance called the stepsize, and then a new flight direction and stepsize are proposed. In order for the spaceship not to “overshoot” its target star, it should probably take larger steps initially and then, as it gets closer to the target star, decrease the stepsize, resulting in more “in-flight corrections” to the search direction.
So to summarize the discussion thus far, many machine learning algorithms can be interpreted as descent algorithms which work in the following manner. First, guess initial parameter values for the learning machine. Then, using all of the training data, compute the derivative of the prediction error with respect to the parameter values, that is, the gradient. Then change the current parameter values by adding a positive number called the stepsize multiplied by the search direction. Choose the search direction so that the dot product of the normalized gradient and the normalized search direction is strictly less than some small negative number.
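As a concrete illustration, here is a minimal sketch of this generic descent loop in Python. The function gradient is a hypothetical user-supplied function (not from the episode), and a small constant stepsize is used purely for simplicity:

```python
import numpy as np

def generic_descent(gradient, theta0, stepsize=0.01,
                    max_iters=10000, tol=1e-6):
    """Generic descent sketch: theta <- theta + stepsize * direction.

    `gradient` is a hypothetical user-supplied function returning the
    derivative of the average prediction error at a parameter vector.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = gradient(theta)
        if np.linalg.norm(g) < tol:
            break  # (numerically) at a critical point
        direction = -g  # the gradient descent choice of search direction

        # Descent-direction check: the dot product of the normalized
        # gradient and normalized search direction must be strictly
        # negative (angular separation less than 90 degrees).
        cosine = np.dot(g, direction) / (np.linalg.norm(g)
                                         * np.linalg.norm(direction))
        assert cosine < 0.0

        theta = theta + stepsize * direction
    return theta
```

For example, calling generic_descent with gradient = lambda theta: 2.0 * theta (the gradient of the sum of squared parameter values) drives the parameter vector toward the zero vector, the unique minimizer of that error surface.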
Note that merely showing that the average prediction error decreases at each parameter update is not sufficient to show that the resulting sequence of parameter estimates converges. It is quite possible to generate a sequence of parameter estimates whose corresponding prediction errors are decreasing yet the sequence of parameter estimates does not converge. Analogous to the spaceship example, consider the case where a parameter vector corresponds to a particular location of a hiker on a mountainside and the height of the hiker above sea level is defined as the average prediction error. If the hiker’s goal is to reach sea level, then one strategy is to choose each step so that the hiker’s height above sea level decreases at each step. This is analogous to a machine learning algorithm which decreases its prediction error at each step by updating its parameter values in an appropriate manner.
Note that, as in the spaceship example, if the step size is chosen to be a constant number, then exact convergence is typically not possible, since the learning algorithm (or hiker) will eventually “overstep” a solution. On the other hand, it is also possible to decrease the step size too rapidly. One can imagine a situation where the hiker takes shorter and shorter steps, eventually stopping not at sea level but at some point on the mountainside.
For this reason, the problem of choosing the step size must be carefully addressed. One approach is to select an “optimal stepsize,” which means that once the search direction is identified, one attempts to choose the stepsize which generates the largest decrease in prediction error over some pre-specified stepsize range. However, in a high-dimensional parameter space, it is computationally expensive to compute the optimal stepsize exactly. Instead, one would like a computationally cheap “sloppy stepsize” criterion which includes the optimal stepsize as a special case and which still ensures that the descent algorithm will converge.
What types of technical conditions are required for a “sloppy stepsize” to ensure that a descent algorithm will converge? A sufficient set of conditions is obtained if the “sloppy stepsize” satisfies the strong Wolfe conditions, which were discussed by Philip Wolfe and others in the late 1960s. We refer to such sloppy stepsizes as Wolfe stepsizes. A Wolfe stepsize must satisfy two conditions called the Wolfe conditions.
The first Wolfe condition states that the prediction error evaluated after the proposed parameter update for a candidate stepsize, minus the prediction error evaluated at the current parameter values, must be less than or equal to a positive number α multiplied by the stepsize multiplied by the dot product of the gradient vector at the current parameter values and the proposed search direction. Since this dot product is negative for a descent direction, this condition ensures not only that the prediction error decreases at an update, but that it decreases by at least the magnitude of α δ multiplied by the stepsize, where δ is the dot product of the gradient vector at the current parameter values and the search direction, and the constant α is a positive number less than one.
The second Wolfe condition is a requirement that the magnitude of the minimum prediction error decrease α δ implied by the first Wolfe condition must be greater than or equal to a lower bound. This lower bound is equal to (1/β) multiplied by the magnitude of the dot product of the search direction and the gradient evaluated after taking the stepsize. The constant β is chosen to be larger than α but less than one.
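Stated as formulas in the notation introduced earlier (following the standard presentation in, for example, Nocedal and Wright (1999), rather than the exact wording of the episode): with current parameter vector $\theta$, descent direction $d$, candidate stepsize $\gamma$, $\delta = d^\top \nabla \ell(\theta) < 0$, and constants $0 < \alpha < \beta < 1$, the strong Wolfe conditions are

$$
\ell(\theta + \gamma d) \;\le\; \ell(\theta) + \alpha \, \gamma \, \delta
\qquad \text{(sufficient decrease)},
$$

$$
\left| \, d^\top \nabla \ell(\theta + \gamma d) \, \right| \;\le\; \beta \, \left| \, \delta \, \right|
\qquad \text{(strong curvature)}.
$$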
These conditions might seem complicated when stated in words, but they are easily computable requirements that a stepsize must satisfy: checking them only requires evaluating the prediction error and the prediction error derivative, functions which have typically already been implemented.
Also recall that a critical point is defined as a parameter vector which makes the first derivative of the prediction error function vanish. This is a necessary but not sufficient condition for a local or global minimizer of the prediction error function. In practice, it would be great if we could find the parameter vector with the smallest possible prediction error, that is, a global minimizer, but for complex problems it is not unusual that simply finding a relatively deep minimizer of the prediction error function yields acceptable performance. Humans and animals are highly optimized to survive in their environments, but there is not, technically speaking, an optimal way to pick up an apple and eat it. There is not an optimal way to run away from a predator such as a tiger. What is important is that the biological learning machine eats apples and avoids tigers. This is usually good enough for the purpose of survival!
At this point we are ready to discuss the Zoutendijk-Wolfe convergence theorem. Assume the prediction error function has continuous second derivatives and is bounded from below. We also define a subset of the parameter space called OMEGA as the set of all parameter vectors whose prediction error is less than or equal to the prediction error of our initial guess. Then it follows that the parameter updating process will generate a sequence of parameter values which either gets closer and closer to the set of critical points of the prediction error function in OMEGA or grows without bound. From an engineering perspective, this result is useful because it means that if the algorithm is empirically observed to converge to a parameter vector, then that parameter vector must be a critical point of the prediction error objective function.
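One standard formal version of this result (again following the presentation in Nocedal and Wright (1999), not the exact statement in the episode) is Zoutendijk’s condition: if $\ell$ is bounded below with a Lipschitz-continuous gradient and every stepsize satisfies the Wolfe conditions, then

$$
\sum_{t=0}^{\infty} \cos^2\theta_t \, \left\| \nabla \ell(\theta_t) \right\|^2 < \infty,
\qquad
\cos\theta_t = \frac{-\,d_t^\top \nabla \ell(\theta_t)}{\left\| d_t \right\| \left\| \nabla \ell(\theta_t) \right\|},
$$

so that if the angular separation between the search direction and the negative gradient stays bounded away from 90 degrees, the gradient of the prediction error must converge to zero.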
These results can also be used to design automatic stepsize selection algorithms for descent algorithms. The basic idea of such algorithms is that one first starts with the current parameter estimates. Then one picks a search direction whose angular separation with the negative gradient direction is strictly less than 90 degrees minus some small positive constant. Then we need to pick a stepsize associated with the search direction. This is done by first starting with a maximum stepsize, which is usually chosen to be one. Next, one checks the Wolfe conditions. If the stepsize satisfies them, then that stepsize is chosen, because it is the largest stepsize considered which satisfies the Wolfe conditions. If not, the current stepsize is reduced in magnitude and the Wolfe conditions are checked again. This latter step is repeated until the Wolfe conditions are satisfied or until it is determined that it is too computationally expensive to consider additional stepsizes. Typically, for high-dimensional parameter estimation problems, one only wants to spend a few iterations looking for a good stepsize. Next, the parameters are updated by adding the stepsize multiplied by the search direction to the current parameter estimates. This process is then repeated until convergence to a critical point is reached or the system is empirically observed to exhibit convergence failure. Automatic stepsize selection methods of this type are called “backtracking stepsize” search methods and are discussed in the references at the end of the blog for this podcast, which is located at: www.learningmachines101.com.
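Here is a minimal sketch of such a backtracking stepsize search in Python, using the same hypothetical user-supplied functions as the earlier sketch (f returns the average prediction error and grad its gradient; practical backtracking implementations often check only the first, sufficient decrease, condition, but both Wolfe conditions are checked here to match the discussion above):

```python
import numpy as np

def backtracking_stepsize(f, grad, theta, direction,
                          alpha=1e-4, beta=0.9,
                          gamma=1.0, shrink=0.5, max_tries=10):
    """Backtracking stepsize search sketch using the Wolfe conditions.

    f and grad are hypothetical user-supplied functions returning the
    average prediction error and its gradient; alpha and beta are the
    Wolfe constants with 0 < alpha < beta < 1.
    """
    f0 = f(theta)
    delta = np.dot(direction, grad(theta))  # negative for a descent direction
    for _ in range(max_tries):
        theta_new = theta + gamma * direction
        sufficient_decrease = f(theta_new) <= f0 + alpha * gamma * delta
        strong_curvature = (abs(np.dot(direction, grad(theta_new)))
                            <= beta * abs(delta))
        if sufficient_decrease and strong_curvature:
            break  # both Wolfe conditions hold; accept this stepsize
        gamma *= shrink  # otherwise reduce the stepsize and try again
    return gamma
```

Because only a handful of candidate stepsizes are tried (max_tries), the cost per parameter update stays small even in very high-dimensional parameter spaces.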
I’m very interested in your comments or specific questions about this episode and promise to address them promptly. To provide comments or specific questions about this episode please use the following special email address: (Please listen to the audio podcast to obtain the secret email address!). This email address is not listed in the show notes.
Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast. You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on LinkedIn.
From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on LinkedIn and do my best to talk about topics of interest to the members of this group!
And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”!
Also please visit us on ITUNES and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on the ITUNES icon. This will be very helpful to this podcast! Thank you so much. Your feedback and encouragement are greatly valued!
Keywords: Gradient Descent Learning, Batch Learning, Wolfe Conditions, Backtracking Linesearch, Zoutendijk-Wolfe Convergence Theorem.
Further Reading and Videos:
“Wolfe Conditions,” http://en.wikipedia.org/wiki/Wolfe_conditions. Provides explicit formulas for selecting the stepsize for batch learning.
“Steepest Descent,” http://en.wikipedia.org/wiki/Gradient_descent.
Episode 65 of Learning Machines 101 (https://www.learningmachines101.com/lm101-065-design-gradient-descent-learning-machines-rerun/)
Wolfe, P. (1969). “Convergence Conditions for Ascent Methods.” SIAM Review, 11(2), 226–235. JSTOR 2028111. doi:10.1137/1011036.
Wolfe, P. (1971). “Convergence Conditions for Ascent Methods. II: Some Corrections.” SIAM Review, 13(2), 185–188. doi:10.1137/1013035.
Nocedal, Jorge; Wright, Stephen (1999). Numerical Optimization. New York: Springer.
Zoutendijk, G. (1960). Methods of Feasible Directions. Amsterdam: Elsevier.
Copyright © 2017 by Richard M. Golden. All rights reserved.
Hi,
You may find theoretically proven results on the convergence of Backtracking Gradient Descent, under the most general assumptions so far (more general than those required by Wolfe’s method, standard gradient descent, or other methods), in the papers below:
Tuyen Trung Truong, “Coordinate-wise Armijo’s condition.” Preprint: arXiv:1911.07820.
Tuyen Trung Truong, “Convergence to minima for the continuous version of Backtracking Gradient Descent.” Preprint: arXiv:1911.04221.
Tuyen Trung Truong and Tuan Hang Nguyen, “Backtracking gradient descent method for general C^1 functions with applications in Deep Learning,” 37 pages. Preprint: arXiv:1808.05160v2. Accompanying source codes are available on GitHub.
Best,
Tuyen
Thank you for the references! I will check them out!!!