
Quasi-Newton part one: Wolfe conditions | October 13, 2020

By Vlad Feinberg - October 13, 2020

This is the first part in a series on the papers we're reading about Quasi-Newton methods, which Sisu relies on for optimizing key subproblems when running workflows. In this meeting, we discuss:

  • What's the point of Quasi-Newton methods?
  • How do the Wolfe conditions guide algorithm design?

Materials

Why Quasi-Newton (QN)?

  • Middle ground between first-order methods and second-order methods.
  • If you look at the rates for strongly convex and smooth functions, this doesn't make much sense: to get an $\epsilon$-critical point, gradient descent (GD) needs $\log\epsilon^{-1}$ iterations while Newton needs only $\log\log\epsilon^{-1}$. But for most optimization problems you get in ML and statistics, the size of your dataset is usually at least $\epsilon^{-1}$ for $\epsilon$ generalization error, so even evaluating your final training loss once takes orders of magnitude longer than the number of iterations needed to optimize to a point where optimizing more won't significantly improve the real metric you care about!
  • That said, in practice QN methods partially retain a key property of second-order methods: reduced sensitivity to the smoothness and convexity constants, which can get really bad!
  • QN methods do this while allowing $O(d)$ to $O(d^2)$ iteration cost for dimension $d$. Compare this to $O(d^3)$ for Newton steps.
  • It's amazing that we can get practical improvements so cheaply. BFGS updates its Hessian approximation given just $O(d)$ numbers, but can achieve superlinear convergence rates, beating GD. This is crazy, because the Hessian has $O(d^2)$ numbers that define it at each iterate, but BFGS can use limited information (and limited compute) to still optimize quickly.
  • Wolfe conditions are the key to understanding how this is possible (turns out you only need Hessian information along the descent direction).
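
For reference, since the nuggets below lean on them, here are the two Wolfe conditions as N&W state them: for a step size $\alpha$ along a descent direction $p_k$, with constants $0 < c_1 < c_2 < 1$,

$$f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha\, \nabla f_k^\top p_k \quad \text{(Armijo / sufficient decrease)},$$

$$\nabla f(x_k + \alpha p_k)^\top p_k \ge c_2\, \nabla f_k^\top p_k \quad \text{(curvature)}.$$

Loosely, the Armijo condition rules out steps that are too long to make real progress, and the curvature condition rules out steps so short that the slope is still as steep as it was at $\alpha = 0$.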

Nuggets

  • Lemma 3.1 (N&W): Wolfe conditions admit a solution for descent directions
    • The Armijo condition upper bounds the step size locally via the Taylor expansion, and, by the Mean Value Theorem, the curvature condition lower bounds the step size below that upper bound (spelled out after this list).
  • Thm 3.2 (N&W): Any optimization method producing descent directions whose cosine with the steepest-descent direction is bounded away from zero will eventually converge under Wolfe line search (slow GD rate, but global convergence); the precise statement appears after this list
    • The Armijo condition bounds the per-step function value reduction in terms of the change in slope; since the gradient-Lipschitz assumption implies the slope changes slowly, the per-step drop in function value is significant when the curvature condition is enforced.
  • Thm 3.7 (N&W): Superlinear convergence
    • Assuming the Wolfe conditions eventually accept a step size of 1, the iteration of any QN method using a Wolfe line search is eventually an approximation of a Newton-Raphson process, up to some error which drops us from quadratic to merely superlinear convergence (see the code sketch after this list).
    • Fun fact: conditions for this to hold can be either that the function is $\mathcal{C}^3$ or that it is strongly convex and smooth, but we don't actually need both.
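
Spelling out the Lemma 3.1 argument (as we recall it from N&W): write $\varphi(\alpha) = f(x_k + \alpha p_k)$ and let $\ell(\alpha) = f(x_k) + c_1\alpha\varphi'(0)$ be the Armijo line. Because $c_1 < 1$ and $\varphi'(0) < 0$, $\varphi$ starts out below $\ell$, and because $f$ is bounded below along the ray, $\ell$ must cross $\varphi$ at some smallest $\alpha' > 0$, so the Armijo condition holds on $(0, \alpha']$. The Mean Value Theorem then gives an $\alpha'' \in (0, \alpha')$ with

$$\varphi'(\alpha'') = \frac{\varphi(\alpha') - \varphi(0)}{\alpha'} = c_1\varphi'(0) > c_2\varphi'(0),$$

so the curvature condition also holds at $\alpha''$: the Taylor/Armijo upper bound and the MVT/curvature lower bound are compatible.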
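
Similarly, Theorem 3.2's conclusion (Zoutendijk's condition), as we recall it: with $f$ bounded below and gradient-Lipschitz, descent directions $p_k$, and Wolfe step sizes,

$$\sum_{k \ge 0} \cos^2\theta_k \, \|\nabla f_k\|^2 < \infty, \qquad \cos\theta_k = \frac{-\nabla f_k^\top p_k}{\|\nabla f_k\|\,\|p_k\|},$$

so if $\cos\theta_k \ge \delta > 0$ for all $k$, then $\|\nabla f_k\| \to 0$, which is the global convergence claim in the nugget above.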
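
Finally, to make the BFGS claims and the step-size-1 assumption concrete, here is a minimal NumPy sketch, our own illustration rather than Sisu's implementation or N&W's bracketing Algorithm 3.5: a BFGS loop whose line search always tries $\alpha = 1$ first and bisects until both Wolfe conditions hold. The objective `f`, gradient `grad`, and constants below are placeholders.

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
    """Bisection search for a step satisfying both (weak) Wolfe conditions.
    Always tries the unit step alpha = 1 first (cf. Thm 3.7's assumption)."""
    fx, slope = f(x), grad(x) @ p           # slope = phi'(0) < 0 for a descent direction
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * p) > fx + c1 * alpha * slope:
            hi = alpha                       # Armijo fails: step too long
            alpha = 0.5 * (lo + hi)
        elif grad(x + alpha * p) @ p < c2 * slope:
            lo = alpha                       # curvature fails: step too short
            alpha = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha                     # both Wolfe conditions hold
    return alpha

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """BFGS: each iteration does O(d^2) work, fed only by the O(d) pair (s, y)."""
    x = np.asarray(x0, dtype=float)
    d = x.size
    H = np.eye(d)                            # inverse-Hessian approximation
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                           # quasi-Newton descent direction
        alpha = wolfe_line_search(f, grad, x, p)
        x_next = x + alpha * p
        s, y = x_next - x, grad(x_next) - g
        rho = 1.0 / (y @ s)                  # y @ s > 0 under the curvature condition
        I = np.eye(d)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x = x_next
    return x

# Toy usage: a poorly conditioned quadratic, the regime where plain GD crawls.
if __name__ == "__main__":
    A = np.diag([1.0, 100.0])
    f = lambda x: 0.5 * x @ A @ x
    grad = lambda x: A @ x
    print(bfgs(f, grad, np.array([1.0, 1.0])))   # converges to ~[0, 0]
```

The inverse-Hessian update consumes only the $O(d)$ secant pair $(s, y)$ yet costs $O(d^2)$ per iteration, matching the operation counts above, and enforcing the curvature condition keeps $y^\top s > 0$ so the update stays well defined.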

Raw Notes

QN Part 1: Wolfe.pdf

Corrections

Huber is a bit of a trivial example for low strong convexity: it's not strongly convex outside of the quadratic region at all. Cross entropy, $x\log x$, is a less trivial one, where the strong convexity constant tends to zero as the domain increases.

The secant equation is $B_{k+1}s_k = y_k$ (instead of with $B_k$ in one place).

The Wolfe conditions require $0 < c_1 < c_2 < 1$, not $c_2 < c_1$; you need this for the IVT application in Lemma 3.1, so that $\varphi'(0) < c_2\varphi'(0) < c_1\varphi'(0)$ where $\varphi'(0) < 0$.

The N&W book is missing a key theorem for why Theorem 3.7's assumption (that the step size is eventually 1) holds for most QN methods (its Theorem 3.6 proves this for regular Newton iteration). These ad-hoc notes explore the additional regularity assumptions under which this can happen. As long as the QN method approximates bounded Hessians and the underlying function is $\mathcal{C}^3$ smooth, the assumption holds.

QN Wolfe Step Size.pdf

Corrections

$H_k\nabla_k = -p_k$ (the sign is flipped, but the proof structure holds because another sign was dropped when introducing the Taylor expansion into the Armijo condition).

$1 + c_3$ should be another unspecified positive constant $c_4$ (does not affect the conclusion).

If you like applying these kinds of methods to practical ML problems, join our team.

