
Quasi-Newton part two: BFGS

By Vlad Feinberg - October 27, 2020

Following our previous post on Quasi-Newton methods, which established the Wolfe line search and how it guarantees convergence given a good directional Hessian approximation, we move on to QN methods that build approximations meeting this criterion and thus enjoy both global and local convergence properties.

Materials

Why BFGS?

  • BFGS isn't the first QN algorithm (that'd be DFP), nor is it the simplest (that'd be SR1), but it is the most widely used thanks to its stability and strong guarantees.
  • BFGS convergence proofs capture the spirit of QN methods:
    • From Wolfe, we need Hessian approximations that are accurate along the descent direction: $\|(B_k-\nabla^2 f_k)\,p_k\|=o(\|p_k\|)$.
    • We structure the method around the secant equation, which uses discrete differences of the gradient between iterates to approximate its continuous derivative, the Hessian (see the sketch after this list).
    • While the true Hessian need not be positive definite (PD) for arbitrary non-convex functions, enforcing this restriction on its approximation keeps updates stable and unique, and eventually yields a good directional approximation.
  • The method is the basis for the more modern variants (L-BFGS, OWL-QN) which we use in our stack.
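
To make the secant and PD points above concrete, here is a minimal sketch of a BFGS loop on a toy strongly convex quadratic. This is illustrative only, not the implementation in our stack: the quadratic, the Armijo backtracking used as a stand-in for the Wolfe search from part one, and names like `bfgs_update` are all assumptions for the sketch. The update itself is the standard inverse-Hessian BFGS formula, which enforces the secant equation and preserves positive definiteness whenever the curvature condition $s^\top y > 0$ holds.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Standard inverse-Hessian BFGS update: the returned matrix satisfies the
    secant equation H_new @ y == s and stays symmetric PD as long as s @ y > 0."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Toy strongly convex quadratic: f(x) = 0.5 * x @ A @ x, grad f(x) = A @ x,
# so the curvature condition s @ y = s @ A @ s > 0 holds for any nonzero step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
H = np.eye(2)                          # initial inverse-Hessian guess
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    p = -H @ g                         # quasi-Newton search direction
    alpha = 1.0                        # Armijo backtracking stands in for
    while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
        alpha *= 0.5                   # the Wolfe search from part one
    s = alpha * p
    y = grad(x + s) - g
    if s @ y > 1e-12:                  # curvature condition (automatic here)
        H = bfgs_update(H, s, y)
    x = x + s

print(x, np.linalg.norm(grad(x)))      # approaches the minimizer at the origin
```

On a strongly convex quadratic the curvature condition holds for every nonzero step, so each update keeps $H$ PD; in general it is the Wolfe conditions that guarantee this.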

Nuggets

  • Theorem 6.6 (N&W): convergence, eventually (slow GD rate)
    • A non-negative potential $\Psi$ generalizes the AM-GM inequality to PD matrices, and the PD-enforcing BFGS updates respect it (made explicit after this list). By the BFGS construction, smoothness, and strong convexity, each update relates $\cos^2\theta_k$, the squared cosine of the angle between the most recent step and the gradient, to changes in the eigenspectrum of our Hessian approximation: if we keep taking steps nearly orthogonal to the gradient, the AM/GM eigengap shrinks and eventually flips sign, a contradiction. Thus $\cos^2\theta_k \ge \delta > 0$, so N&W Theorem 3.3 applies.
    • I'd guess a cleaner constructive proof is possible if we could express some objective value decrease condition in terms of the current AM/GM eigengap.
  • Lemmas 3, 4, 5 (Powell 1976): linear convergence for bounded QN methods (Powell lemmas)
    • Adding smoothness and strong convexity to the weaker $\mathcal{C}^1$ Wolfe conditions of N&W Theorem 3.3, the same proof can be bootstrapped into linear convergence. This is unsurprising, since GD already achieves linear convergence in this setting.
  • Theorem 6.7 (N&W): superlinear convergence (reduction to Wolfe)
    • Introducing Hessian Lipschitzness, a condition stronger than $\mathcal{C}^2$ smoothness, and relying on linear iterate convergence ensures that second derivatives don't change too much. This lets us build a stronger version of the AM-GM inequality, based on the potential $\Psi$ from N&W Theorem 6.6 above, in terms of the distance between what a BFGS step would be and a Newton step taken from the same point; the same non-negativity argument shows this distance vanishes quickly, so Theorem 3.7 applies.
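
To unpack the AM-GM remark in the Theorem 6.6 nugget: $\operatorname{tr}(B_k)$ and $\det(B_k)$ are the sum and product of the eigenvalues $\lambda_i$ of the PD approximation $B_k$ (arithmetic and geometric means, up to normalization), and the potential below, my reading of the one used in N&W's proof, is non-negative because $t - \ln t \ge 1$ for every $t > 0$:

$$
\Psi(B_k) \;=\; \operatorname{tr}(B_k) \;-\; \ln\det(B_k) \;=\; \sum_{i=1}^{n}\bigl(\lambda_i - \ln\lambda_i\bigr) \;\ge\; n \;>\; 0.
$$

The BFGS update bounds $\Psi(B_{k+1})$ with correction terms that become very negative when steps stay nearly orthogonal to the gradient, so a long run of such steps would force $\Psi$ below zero, the contradiction used above.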

Raw Notes

QN Part 2: BFGS.pdf

If you like applying these kinds of methods to practical ML problems, join our team.

