This is the first part in a series of papers we're reading about Quasi-Newton methods, which Sisu relies on for optimizing key subproblems when running workflows. In this meeting, we discuss:
- What's the point of Quasi-Newton methods?
- How do the Wolfe conditions guide algorithm design?
Why Quasi-Newton (QN)?
- Middle ground between first-order methods and second-order methods.
- If you look at the rates for strongly convex and smooth functions, this doesn't make much sense: to get an $\epsilon$-critical point, gradient descent (GD) has $O(\kappa \log(1/\epsilon))$ convergence and Newton provides $O(\log\log(1/\epsilon))$, but for most optimization problems you get in ML and statistics, the size of your dataset is usually at least $\Omega(1/\epsilon)$ for generalization error $\epsilon$, so even evaluating your final training loss takes orders of magnitude longer than the number of iterations to optimize it to a point where optimizing more won't significantly improve the real metric you care about!
- That said, in practice QN methods partially retain a key property of second-order methods: reduced sensitivity to the smoothness and convexity constants, which can get really bad!
- QN methods do this at $O(d)$ to $O(d^2)$ iteration cost for dimension $d$. Compare this to $O(d^3)$ for Newton steps.
- It's amazing that we can get practical improvements so cheaply. BFGS updates its Hessian approximation given just $O(d)$ numbers, but can achieve superlinear convergence rates, beating GD. This is crazy, because the Hessian has $O(d^2)$ numbers that define it at each iterate, but BFGS can use limited information (and limited compute) to still optimize quickly.
- Wolfe conditions are the key to understanding how this is possible (turns out you only need Hessian information along the descent direction).
- Lemma 3.1 N&W Wolfe conditions admit a solution for descent directions
- The Armijo condition upper bounds the step size locally via the Taylor expansion, and by the Mean Value Theorem the curvature condition lower bounds the step size strictly below that upper bound.
- Thm 3.2 N&W Any optimization method producing descent directions whose cosine with the negative gradient is bounded away from zero will eventually converge under Wolfe line search (slow GD rate, but global convergence)
- The Armijo condition ties the per-step drop in function value to changes in slope; since a gradient-Lipschitz assumption implies the slope changes slowly, the per-step drop in function value is significant when the curvature condition is enforced.
- Thm 3.7 N&W Superlinear convergence
- Assuming the Wolfe conditions eventually accept step size 1, the iteration of any QN method using Wolfe line search is eventually an approximation of a Newton-Raphson process, up to an error term that drops us from quadratic to merely superlinear convergence.
- Fun fact: the conditions for this to hold can be either that the function is $C^2$ near the solution or strong convexity and smoothness, but we don't actually need both.
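To make the two conditions concrete, here is a minimal numpy sketch (the helper `satisfies_wolfe` and its test problem are my own) that checks a candidate step on a simple quadratic; it illustrates the admissible interval from Lemma 3.1, with the curvature condition rejecting steps that are too short and the Armijo condition rejecting steps that are too long:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step size alpha along p."""
    g0 = grad(x) @ p  # directional derivative at x; negative for a descent direction
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0    # sufficient decrease
    curvature = grad(x + alpha * p) @ p >= c2 * g0         # slope has flattened enough
    return armijo and curvature

f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([1.0, 0.0])
p = -grad(x)  # steepest descent direction

assert satisfies_wolfe(f, grad, x, p, alpha=1.0)       # inside the admissible interval
assert not satisfies_wolfe(f, grad, x, p, alpha=1e-6)  # too short: curvature fails
assert not satisfies_wolfe(f, grad, x, p, alpha=3.0)   # too long: Armijo fails
```

Note that only directional-derivative information along `p` appears in the check, which is the point made above about Hessian information along the descent direction.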
QN Part 1: Wolfe.pdf
Huber is a bit of a trivial example of low strong convexity: it's not strongly convex outside the quadratic region at all. Cross entropy, $\log(1 + e^{-x})$, is a less trivial one where the strong convexity constant tends to zero as the domain increases.
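A quick numeric check of the cross-entropy claim, assuming the binary logistic loss $\log(1+e^{-x})$ (the helper name is mine): the second derivative is $\sigma(x)(1-\sigma(x))$, so the strong convexity constant over $[-R, R]$, attained at the endpoints, vanishes as $R$ grows:

```python
import numpy as np

def logistic_curvature(x):
    """Second derivative of the logistic loss log(1 + exp(-x))."""
    s = 1.0 / (1.0 + np.exp(-x))  # sigma(x)
    return s * (1.0 - s)

# The strong convexity constant over [-R, R] is the minimum curvature there,
# attained at the endpoints, and it decays to zero as R grows.
for R in (1.0, 5.0, 10.0):
    print(R, logistic_curvature(R))
```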
Secant equation is $B_{k+1} s_k = y_k$ (instead of $B_k s_k = y_k$, with $k+1$ in one place)
Wolfe conditions require $c_1 < c_2$, not $c_1 \le c_2$; you need this for the IVT application in Lemma 3.1 so that $c_1 \nabla f_k^\top p_k > c_2 \nabla f_k^\top p_k$ where $\nabla f_k^\top p_k < 0$
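The secant equation $B_{k+1} s_k = y_k$ holds by construction for the direct-form BFGS update, which consumes only the pair $(s_k, y_k)$, i.e. $2d$ numbers. A sanity-check sketch (the helper name is mine):

```python
import numpy as np

def bfgs_direct_update(B, s, y):
    """Direct-form BFGS update of the Hessian approximation B (not its inverse).

    Consumes only s = x_{k+1} - x_k and y = grad_{k+1} - grad_k, yet
    enforces the secant equation exactly on the updated matrix.
    """
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(1)
d = 4
s = rng.normal(size=d)
y = s + 0.1 * rng.normal(size=d)   # keeps the curvature condition y @ s > 0
B_next = bfgs_direct_update(np.eye(d), s, y)
assert np.allclose(B_next @ s, y)  # B_{k+1} s_k = y_k: the k+1 matrix, not B_k
```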
The N&W book is missing a key theorem for why Theorem 3.7's assumption (that the step size is eventually 1) holds for most QN methods (its Theorem 3.6 proves this only for the regular Newton iteration). These ad-hoc notes explore the additional regularity assumptions under which it does: as long as the QN method's Hessian approximations stay bounded and the underlying function is smooth, the assumption holds.
QN Wolfe Step Size.pdf
(a sign is flipped, but the proof structure holds because another sign was dropped when introducing the Taylor expansion into the Armijo condition)
should be another unspecified positive constant (does not affect conclusion)
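The eventual-unit-step claim in the notes above can be sanity-checked numerically. A sketch under my own choice of test problem, the smooth and strongly convex $f(x) = x^2/2 + x^4/4$: run the plain Newton iteration and verify that the Wolfe conditions accept $\alpha = 1$ at every iterate:

```python
# f(x) = x^2/2 + x^4/4: smooth and strongly convex, minimized at x = 0.
f = lambda x: x**2 / 2 + x**4 / 4
df = lambda x: x + x**3
d2f = lambda x: 1 + 3 * x**2

def newton_step_passes_wolfe(x, c1=1e-4, c2=0.9):
    """Do the Wolfe conditions accept the full Newton step (alpha = 1) from x?"""
    p = -df(x) / d2f(x)                          # Newton direction
    armijo = f(x + p) <= f(x) + c1 * df(x) * p   # sufficient decrease
    curvature = df(x + p) * p >= c2 * df(x) * p  # slope has flattened enough
    return armijo and curvature

x = 2.0
for _ in range(8):
    assert newton_step_passes_wolfe(x)  # unit step accepted at every iterate
    x -= df(x) / d2f(x)
```

On this example the Hessian approximation is exact and bounded below by 1, so the regularity assumptions discussed in the notes are easy to satisfy; the point of the sketch is only to make the step-size-1 behavior visible.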
If you like applying these kinds of methods to practical ML problems, join our team.