In our previous post on quasi-Newton methods, we established the Wolfe line search algorithm and how it guarantees convergence under a directional Hessian approximation. We now move on to QN methods that build approximations meeting this criterion, and which therefore enjoy global and local convergence guarantees.
- BFGS isn't the first QN algorithm (that'd be DFP), nor the simplest (that'd be SR1), but it is the most widely used thanks to its stability and strong guarantees.
- BFGS convergence proofs capture the spirit of QN methods:
- From Wolfe, we only need to build Hessian approximations that are accurate along the descent direction: it suffices that $\|(B_k - \nabla^2 f(x_k))\, p_k\| / \|p_k\| \to 0$.
- We structure the method around the secant equation, using discrete differences in the gradient to approximate the continuous difference, namely the Hessian: $B_{k+1} s_k = y_k$, where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.
- While the true Hessian need not be positive definite (PD) for arbitrary non-convex functions, enforcing this restriction on its approximation keeps updates stable and unique, and eventually yields a good directional approximation.
- The method is the basis for the more modern variants (L-BFGS, OWL-QN) which we use in our stack.
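To make the secant and PD points concrete, here is a minimal sketch of a single BFGS update of the inverse-Hessian approximation (illustrative NumPy, not our production code; names are ours):

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H.

    s = x_{k+1} - x_k (step), y = grad_{k+1} - grad_k (gradient change).
    The Wolfe curvature condition guarantees s @ y > 0, which keeps the
    updated H symmetric positive definite; the update also satisfies the
    secant equation H_new @ y = s by construction.
    """
    sy = s @ y
    assert sy > 0, "curvature condition violated; a Wolfe line search prevents this"
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    # Standard BFGS inverse update: H <- V H V^T + rho s s^T
    return V @ H @ V.T + rho * np.outer(s, s)
```

The Wolfe curvature condition guarantees $s_k^\top y_k > 0$, which is exactly what keeps this update well defined and PD.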
- N&W Theorem 6.6: global convergence, eventually (at a slow, GD-like rate)
- A non-negative potential generalizes the AM-GM inequality to PD matrices, and the PD-enforcing BFGS updates respect it. By the BFGS construction, smoothness, and strong convexity, the updates relate the cosine of the angle between the most recent step and the gradient, $\cos\theta_k = -\nabla f_k^\top p_k / (\|\nabla f_k\|\,\|p_k\|)$, to changes in the eigenspectrum of our Hessian approximation: if we keep taking steps nearly orthogonal to the gradient, the AM/GM eigengap shrinks and eventually flips sign, a contradiction. Thus $\cos\theta_k$ stays bounded away from zero along a subsequence, so N&W Theorem 3.3 applies.
- I'd guess a cleaner constructive proof is possible if we could express some objective value decrease condition in terms of the current AM/GM eigengap.
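For reference, the potential in question, as used in the Byrd-Nocedal-style analysis behind N&W's proof, can be sketched as:

```latex
% Trace--determinant potential on the PD Hessian approximations B_k:
\psi(B) \;=\; \operatorname{tr}(B) - \ln\det(B)
        \;=\; \sum_{i=1}^{n} \bigl(\lambda_i - \ln\lambda_i\bigr) \;\ge\; n \;>\; 0,
% since t - \ln t \ge 1 for all t > 0. Non-negativity of each summand is the
% scalar fact behind the AM--GM inequality for PD matrices:
% \operatorname{tr}(B)/n \;\ge\; (\det B)^{1/n}.
```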
- Powell 1976, Lemmas 3-5: linear convergence for bounded QN methods (the Powell lemmas)
- Adding smoothness and strong convexity to the weaker Wolfe conditions of N&W Theorem 3.3, the same proof bootstraps into linear convergence. This is unsurprising: in this setting, GD itself converges linearly.
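A toy run illustrates the rate claims. On a strongly convex quadratic, an exact line search satisfies the Wolfe conditions, and BFGS converges very quickly; the problem below is made up for illustration:

```python
import numpy as np

# Minimal BFGS loop on f(x) = 0.5 x^T A x with exact line search.
A = np.diag([1.0, 10.0, 100.0])   # ill-conditioned but strongly convex
grad = lambda x: A @ x

x = np.array([1.0, 1.0, 1.0])
H = np.eye(3)                     # inverse-Hessian approximation
norms = []                        # gradient norms per iteration
for _ in range(20):
    g = grad(x)
    norms.append(np.linalg.norm(g))
    if norms[-1] < 1e-9:
        break
    p = -H @ g                            # quasi-Newton direction
    alpha = -(g @ p) / (p @ A @ p)        # exact minimizer along p
    s = alpha * p
    x_new = x + s
    y = grad(x_new) - g
    rho = 1.0 / (s @ y)                   # s @ y > 0 on a PD quadratic
    V = np.eye(3) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)  # BFGS inverse update
    x = x_new
```

The gradient norms drop off a cliff: with exact line searches on an $n$-dimensional quadratic, BFGS generates conjugate directions and terminates in at most $n$ steps, so the toy run stops almost immediately.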
- N&W Theorem 6.7: superlinear convergence (reduction to Wolfe)
- Introducing Hessian Lipschitzness, a condition stronger than twice continuous differentiability, and relying on linear convergence of the iterates lets us ensure second derivatives don't change too much along the trajectory. This lets us build a stronger version of the AM-GM inequality from N&W Theorem 6.6, stated in terms of the distance between what a BFGS step would be and a Newton step at the same point; the same non-negativity argument shows this distance vanishes quickly, so Thm 3.7 applies.
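The directional criterion being reduced to is the Dennis-Moré condition; a sketch of the statement:

```latex
% Dennis--More: superlinear convergence holds iff the approximation is
% asymptotically exact along the step directions p_k:
\lim_{k\to\infty} \frac{\bigl\|\bigl(B_k - \nabla^2 f(x^\ast)\bigr)\, p_k\bigr\|}{\|p_k\|} \;=\; 0,
% equivalently, the quasi-Newton step p_k = -B_k^{-1} \nabla f_k approaches
% the Newton step at the same point:
% \|p_k + \nabla^2 f(x_k)^{-1} \nabla f_k\| = o(\|p_k\|).
```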
If you like applying these kinds of methods to practical ML problems, join our team.