Optimization

By Vlad Feinberg - October 13, 2020

This is the first post in a series on papers we're reading about Quasi-Newton methods, which Sisu relies on for optimizing key subproblems when running workflows. In this meeting, we discuss:

- What's the point of Quasi-Newton methods?
- How do the Wolfe conditions guide algorithm design?

**Materials**

- Quasi-Newton YouTube lecture
- Chapter 3 of Nocedal & Wright (N&W), Numerical Optimization

**Why Quasi-Newton (QN)?**

- Middle ground between first-order methods and second-order methods.
- If you look at the rates for strongly convex and smooth functions, this middle ground doesn't make much sense: to reach an $\epsilon$-critical point, gradient descent (GD) needs $\log\epsilon^{-1}$ iterations while Newton's method needs only $\log\log\epsilon^{-1}$. But for most optimization problems you get in ML and statistics, the size of your dataset is usually at least $\epsilon^{-1}$ for generalization error $\epsilon$, so even evaluating your final training loss once takes orders of magnitude longer than running the iterations needed to optimize it to the point where further optimization won't significantly improve the real metric you care about!
- That said, in practice QN methods partially retain a key property of second-order methods: reduced sensitivity to the smoothness and convexity constants, which can be really bad!
- QN methods do this while allowing $O(d)$ to $O(d^2)$ iteration cost for dimension $d$. Compare this to $O(d^3)$ for Newton steps.
- It's amazing that we can get practical improvements so cheaply. BFGS updates its Hessian approximation given just $O(d)$ numbers, but can achieve superlinear convergence rates, beating GD. This is crazy, because the Hessian has $O(d^2)$ numbers that define it at each iterate, but BFGS can use limited information (and limited compute) to still optimize quickly.
- Wolfe conditions are the key to understanding how this is possible (turns out you only need Hessian information along the descent direction).
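To make the GD-vs-QN gap above concrete, here is a minimal NumPy sketch (an illustration, not any production implementation; the problem setup and constants are my own): on an ill-conditioned quadratic, a textbook BFGS inverse-Hessian update built only from the $O(d)$ numbers in $(s_k, y_k)$ drives the gradient to near zero within a couple dozen iterations, while GD with the classic $1/L$ step barely moves.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 x^T A x with condition number 1e4.
rng = np.random.default_rng(0)
d = 20
evals = np.logspace(0, 4, d)              # eigenvalues from 1 to 1e4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(evals) @ Q.T
grad = lambda x: A @ x                    # gradient of the quadratic

def bfgs(x, n_iter):
    H, g = np.eye(d), grad(x)             # H approximates the inverse Hessian
    for _ in range(n_iter):
        p = -H @ g                        # quasi-Newton direction
        alpha = -(g @ p) / (p @ A @ p)    # exact line search (Wolfe holds here)
        s = alpha * p                     # s_k = x_{k+1} - x_k
        x = x + s
        g_new = grad(x)
        y = g_new - g                     # y_k = grad change
        rho = 1.0 / (y @ s)
        V = np.eye(d) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)  # BFGS update from (s_k, y_k) only
        g = g_new
    return x

def gd(x, n_iter):
    lr = 1.0 / evals.max()                # classic 1/L step size
    for _ in range(n_iter):
        x = x - lr * grad(x)
    return x

x0 = rng.standard_normal(d)
print(np.linalg.norm(grad(bfgs(x0, 25))))  # tiny: finite termination on quadratics
print(np.linalg.norm(grad(gd(x0, 25))))    # still large: kappa-limited
```

On quadratics, BFGS with exact line search terminates in at most $d$ steps (it coincides with conjugate gradients), which makes the contrast especially stark.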

**Nuggets**

- Lemma 3.1 N&W: The Wolfe conditions admit a solution for descent directions.
- The Armijo condition upper-bounds the step size locally via the Taylor expansion, and by the Mean Value Theorem the curvature condition lower-bounds the step size below that upper bound.
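To make the two conditions concrete, here is a hypothetical helper (the function name and example are mine, not from N&W) that checks the weak Wolfe conditions for a candidate step size:

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Weak Wolfe check: sufficient decrease plus a flattened slope."""
    g0 = grad(x) @ p                             # phi'(0); negative for descent p
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0   # sufficient decrease
    curvature = grad(x + alpha * p) @ p >= c2 * g0        # slope has flattened
    return bool(armijo and curvature)

# phi(alpha) = (1 - alpha)^2 for f(x) = x.x, starting at x = 1 along p = -1:
f = lambda x: x @ x
grad = lambda x: 2.0 * x
x, p = np.array([1.0]), np.array([-1.0])
print(satisfies_wolfe(f, grad, x, p, 0.01))  # False: curvature fails (too short)
print(satisfies_wolfe(f, grad, x, p, 0.5))   # True: both conditions hold
```

Here any $\alpha < 1 - c_2 = 0.1$ is rejected as too short by the curvature condition, matching the lemma's picture of the two conditions bracketing an acceptable interval.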

- Thm 3.2 N&W: Any optimization method producing descent directions whose cosine with the negative gradient is bounded away from zero will eventually converge under Wolfe line search (slow GD rate, but global convergence).
- The Armijo condition relates the per-step function-value reduction to changes in slope; since the gradient-Lipschitz assumption implies the slope changes slowly, the per-step drop in function value is significant when the curvature condition is enforced.
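A sketch of the mechanics (a bisection weak-Wolfe search of my own devising; the double-well objective and constants are illustrative, not from the text): steepest descent, the cosine-one special case of the theorem, paired with a Wolfe line search drives the gradient norm toward zero even on a nonconvex objective.

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
    """Bisection search for a step satisfying the weak Wolfe conditions."""
    lo, hi, alpha = 0.0, np.inf, 1.0
    f0, g0 = f(x), grad(x) @ p                 # g0 = phi'(0) < 0 for descent p
    for _ in range(max_iter):
        if f(x + alpha * p) > f0 + c1 * alpha * g0:
            hi = alpha                         # Armijo fails: step too long
        elif grad(x + alpha * p) @ p < c2 * g0:
            lo = alpha                         # curvature fails: step too short
        else:
            return alpha
        alpha = 2 * alpha if hi == np.inf else (lo + hi) / 2
    return alpha

# Nonconvex double-well objective f(x, y) = (x^2 - 1)^2 + y^2.
f = lambda z: (z[0]**2 - 1)**2 + z[1]**2
grad = lambda z: np.array([4 * z[0] * (z[0]**2 - 1), 2 * z[1]])

x = np.array([2.0, 2.0])
for _ in range(200):
    p = -grad(x)                               # steepest descent: cosine is 1
    x = x + wolfe_line_search(f, grad, x, p) * p
print(np.linalg.norm(grad(x)))                 # near zero, as Thm 3.2 predicts
```

Note the theorem only guarantees $\|\nabla f_k\|\to 0$, not convergence to a global minimum, which is exactly what this nonconvex example delivers.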

- Thm 3.7 N&W Superlinear convergence
- Assuming the Wolfe conditions eventually accept step size 1, the iteration of any QN method using Wolfe line search eventually approximates a Newton-Raphson process, up to some error which drops us from quadratic to merely superlinear convergence.
- Fun fact: the conditions for this to hold can be either that the function is $\mathcal{C}^3$, or that it is strongly convex and smooth; we don't actually need both.
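For reference, here is my phrasing of the standard Dennis-Moré characterization that underlies such superlinear results: for iterates $x_{k+1} = x_k + p_k$ with $p_k = -B_k^{-1}\nabla f_k$ converging to a minimizer $x^*$, convergence is superlinear precisely when the Hessian approximation is accurate along the search directions,

$$
\lim_{k\to\infty}\frac{\|(B_k - \nabla^2 f(x^*))\,p_k\|}{\|p_k\|} = 0
\quad\Longleftrightarrow\quad
\|x_{k+1}-x^*\| = o(\|x_k-x^*\|),
$$

which is exactly the sense in which only Hessian information along the descent direction matters.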

**Raw Notes**

*Corrections*

Huber is a bit of a trivial example for low strong convexity—it's not strongly convex outside of the quadratic region at all. Cross entropy, $x\log x$, is a less trivial one where the strong convexity constant tends to zero as domain increases.

Secant equation is $B_{k+1}s_k=y_k$ (instead of with $B_k$ in one place)
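A quick numerical sanity check (a throwaway sketch; the particular $B$, $s$, and $y$ are illustrative) that the direct BFGS update of $B_k$ satisfies this corrected secant equation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
B = np.eye(d)                             # current Hessian approximation B_k
s = rng.standard_normal(d)                # step s_k = x_{k+1} - x_k
y = 2 * s + 0.1 * rng.standard_normal(d)  # gradient change y_k, with y.s > 0
Bs = B @ s

# Standard direct BFGS update of B_k (the non-inverse form):
B_next = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
print(np.allclose(B_next @ s, y))         # secant equation B_{k+1} s_k = y_k holds
```

The identity follows by inspection: applying the update to $s_k$ cancels the $B_k s_k$ terms and leaves exactly $y_k$.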

Wolfe conditions require $0 < c_1<c_2 < 1$, not $c_2<c_1$; you need this for the IVT application in Lemma 3.1, so that $\varphi'(0) < c_2\varphi'(0) < c_1\varphi'(0)$ where $\varphi'(0)<0$.

The N&W book is missing a key theorem for why Theorem 3.7's assumption (that step size is eventually 1) holds for most QN methods (its Theorem 3.6 proves this for regular Newton iteration). These ad-hoc notes explore the additional regularity assumptions where this can happen. As long as the QN method approximates bounded Hessians and the underlying function is $\mathcal{C}^3$ smooth, the assumption holds.

*Corrections*

$H_k\nabla f_k=-p_k$ (the sign is flipped, but the proof structure holds because another sign was dropped when introducing the Taylor expansion into the Armijo condition).

$1+c_3$ should be another unspecified positive constant $c_4$ (this does not affect the conclusion).

*If you like applying these kinds of methods to practical ML problems, join our team.*