glmnet | February 23, 2021

By Vlad Feinberg - February 23, 2021

In our discussion this week, we review glment, the widely-renowned, frequently-used, and efficient Lasso optimization package that kicked off a whole class of working set optimization approaches and rekindled excitement in coordinate descent.


Why glmnet?

  • At Sisu, we perform selection by relying on Stability Selection, which uses Lasso as a subroutine. Calling Lasso multiple times in an inner loop means at the end of the day lasso optimization is the bottleneck for what we do.
  • glmnet provides very fast lasso solutions for the path-lasso problem, i.e., for various different regularization parameters λ\lambda. Its two most universal speedup workhorses are:
    • Strong rules, which tentatively discard variables from the optimization problem.
    • Usage of coordinate descent, a relatively obscure algorithm at the time, which happens to solve lasso really really well.


  • Lasso can be solved with generic ("oblivious") proximal convex solvers. Why specialize so much to this specific problem minβ12nXβy22+λβ1\min_\beta \frac{1}{2n}\|X\beta-y\|_2^2+\lambda\|\beta\|_1? The optimum β^\hat\beta to this problem has some really amazing properties.

    • Bet on sparsity principle (16.2.2 ESL):

      Use a procedure that does well in sparse problems, since no procedure does well in dense problems.

    • Under modest assumptions (Foster and George 1994, Bartlett et al 2011), in the "supervised ML setting," mean squared error (MSE) for prediction drops as O(n1/2)O(n^{-1/2}) ("slow rate") in the number of examples nn.

    • With less modest assumptions (Candès and Tao 2005, van de Geer and Bühlmann 2009) you can have MSE drop at rate O(n1)O(n^{-1}) and even recover the "true" unknown β\beta^* assuming it's sparse, λ\lambda is set appropriately, given data (x,y)(x,y) which satisfies E[y]=x,β\mathbb{E}[y]=\langle x, \beta^*\rangle. Note this is really fast, as in it's the kind of MSE you get from just estimating means, which are, like, the easiest thing to estimate.

    • Perhaps more importantly than the rate hacking above, those O()O(\cdots) only hide logp\log p factors, where pp is the dimension of β\beta. This tells us that the Lasso really is a tool for statistical search: it can uncover the "true" signals from a list of noisy ones without requiring the number of examples to grow with the number of signals we use.

  • Zoomed out, sparsity-aware ("non-oblivious") solvers have the following form. Let L(β)=12nXβy22L(\beta)=\frac{1}{2n}\|X\beta-y\|_2^2, and LAL_A the restriction to coordinates AA of its argument (i.e., with the other coordinates fixed at 0).

    Coarsifying excessively, the whole lasso algorithm looks like a loop over these two steps:

    1. Make a guess which subset of candidates A[p]A\subset[p] have nonzero coefficients. A[p]|A|\ll [p] for most λ\lambda.
    2. Optimize the small lasso problem minβALA(βA)+λβA1\min_{\beta_A}L_A(\beta_A)+\lambda\|\beta_A\|_1 over just these A\left|A\right| coefficients, not all pp. This gives some corresponding βRp\beta'\in\mathbb{R}^p with nonzero coefficients only in AA coordinates.
    3. For every one of the [p][p] original candidates, we need to check that the KKT conditions hold (given β\beta', are the rest really 0, since we just guessed in step (1)?).

    So fast solvers (A) make good guesses for Asupp βA\approx \mathrm{supp}\ \beta^*, such that we require few loops, learning based on the history of the optimization and (B) solve the inner problem in step 2 as fast as they can.

  • glmnet uses the "strong rules" heuristic for (A). Define ci(λ)=1nxi(Xβ(λ)y)c_i(\lambda)=\frac{1}{n}x_i^\top (X\beta^*(\lambda) - y) where β(λ)\beta^*(\lambda) is the solution for a given λ\lambda (notably, strong rules are useful in the situation where you want to solve lasso for many λ\lambda, in decreasing order; this is called the lasso path problem). Note cic_i is the gradient of the RSS term LL at the solution for λ\lambda. Strong rules assume cic_i changes slowly to use information from old solutions λ>λ\lambda'>\lambda to estimate AA.

    • Assume cic_i is 11-Lipschitz, i.e., ci(λ)ci(λ)λλ|c_i(\lambda')-c_i(\lambda)|\le |\lambda-\lambda'|
    • The strong rules suggests adding a coordinate ii to AA if it fails the the "sequential strong rule" condition, that ci(λ)<2λλ|c_i(\lambda)|<2\lambda-\lambda'
    • When the above two conditions hold, notice ci(λ)ci(λ)ci(λ)+ci(λ)|c_i(\lambda)|\le |c_i(\lambda)-c_i(\lambda')|+|c_i(\lambda')| by triangle inequality, which implies ci(λ)<(λλ)+(2λλ)|c_i(\lambda)|< (\lambda' - \lambda)+(2\lambda-\lambda') by using our two conditions. This simplifies to ci(λ)<λ|c_i(\lambda)|<\lambda, which by the KKT conditions of the Lasso problem implies βi(λ)=0\beta^*_i(\lambda)=0, i.e., we don't need to optimize coordinate ii.
  • glmnet leverages coordinate descent (CD), which uses more than just first order information about the objective to outperform most oblivious solvers for (B). In practice, we noticed several gotchas with CD:

    • You're actually solving Lasso-with-unregularized-intercept minβ,β012nXβyβ022+λβ1\min_{\beta,\beta_0} \frac{1}{2n}\|X\beta-y-\beta_0\|_2^2+\lambda\|\beta\|_1, so you need do a hairy two dimensional optimization step when doing line search if XX is uncentered.
    • You also need to keep track of the objective loss value — computing it is O(np)O(np), so you need to update it. Requires some additional statistics to be gathered during the coordinate step.
    • Sparse XX are tricky: you need to do additional bookkeeping to avoid paying O(n)O(n) cost per coordinate iteration (i.e., make it O(nnz)O(\mathrm{nnz})). In addition, watch out for floating point error from cancellation when doing this.

    Raw Notes

    If you like applying these kinds of methods practical ML problems, join our team.

Read more

Quasi-Newton part one: Wolfe conditions | October 13, 2020

The first part in a series of papers we're reading on Quasi-Newton methods, which Sisu relies on for optimizing key subproblems when running workflows.

Read more

Benjamini-Hochberg | August 17, 2020

A discussion on Benjamini-Hochberg, avoiding the multi-comparison problem, and False Discovery Rate (FDR).

Read more