Optimization

By Vlad Feinberg - February 23, 2021

In our discussion this week, we review glment, the widely-renowned, frequently-used, and efficient Lasso optimization package that kicked off a whole class of working set optimization approaches and rekindled excitement in coordinate descent.

- glmnet JSS paper
- R package website
- Strong rules paper
- Video presentation to R user session by Prof. Hastie

- At Sisu, we perform selection by relying on Stability Selection, which uses Lasso as a subroutine. Calling Lasso multiple times in an inner loop means at the end of the day lasso optimization is the bottleneck for what we do.
- glmnet provides very fast lasso solutions for the path-lasso problem, i.e., for various different regularization parameters $\lambda$. Its two most universal speedup workhorses are:
- Strong rules, which tentatively discard variables from the optimization problem.
- Usage of coordinate descent, a relatively obscure algorithm at the time, which happens to solve lasso really really well.

Lasso can be solved with generic ("oblivious") proximal convex solvers. Why specialize so much to this specific problem $\min_\beta \frac{1}{2n}\|X\beta-y\|_2^2+\lambda\|\beta\|_1$? The optimum $\hat\beta$ to this problem has some really amazing properties.

Bet on sparsity principle (16.2.2 ESL):

Use a procedure that does well in sparse problems, since no procedure does well in dense problems.

Under modest assumptions (Foster and George 1994, Bartlett et al 2011), in the "supervised ML setting," mean squared error (MSE) for prediction drops as $O(n^{-1/2})$ ("slow rate") in the number of examples $n$.

With less modest assumptions (Candès and Tao 2005, van de Geer and Bühlmann 2009) you can have MSE drop at rate $O(n^{-1})$ and even recover the "true" unknown $\beta^*$ assuming it's sparse, $\lambda$ is set appropriately, given data $(x,y)$ which satisfies $\mathbb{E}[y]=\langle x, \beta^*\rangle$. Note this is really fast, as in it's the kind of MSE you get from just estimating means, which are, like, the easiest thing to estimate.

Perhaps more importantly than the rate hacking above, those $O(\cdots)$ only hide $\log p$ factors, where $p$ is the dimension of $\beta$. This tells us that the Lasso really is

*a tool for statistical search*: it can uncover the "true" signals from a list of noisy ones*without*requiring the number of examples to grow with the number of signals we use.

Zoomed out, sparsity-aware ("non-oblivious") solvers have the following form. Let $L(\beta)=\frac{1}{2n}\|X\beta-y\|_2^2$, and $L_A$ the restriction to coordinates $A$ of its argument (i.e., with the other coordinates fixed at 0).

Coarsifying excessively, the

*whole lasso algorithm*looks like a loop over these two steps:- Make a guess which subset of candidates $A\subset[p]$ have nonzero coefficients. $|A|\ll [p]$ for most $\lambda$.
- Optimize the small lasso problem $\min_{\beta_A}L_A(\beta_A)+\lambda\|\beta_A\|_1$ over just these $\left|A\right|$ coefficients, not all $p$. This gives some corresponding $\beta'\in\mathbb{R}^p$ with nonzero coefficients only in $A$ coordinates.
- For every one of the $[p]$ original candidates, we need to check that the KKT conditions hold (given $\beta'$, are the rest really 0, since we just guessed in step (1)?).

So fast solvers (A) make good guesses for $A\approx \mathrm{supp}\ \beta^*$, such that we require few loops, learning based on the history of the optimization and (B) solve the inner problem in step 2 as fast as they can.

glmnet uses the "strong rules" heuristic for (A). Define $c_i(\lambda)=\frac{1}{n}x_i^\top (X\beta^*(\lambda) - y)$ where $\beta^*(\lambda)$ is the solution for a given $\lambda$ (notably, strong rules are useful in the situation where you want to solve lasso for many $\lambda$, in decreasing order; this is called the lasso path problem). Note $c_i$ is the gradient of the RSS term $L$ at the solution for $\lambda$. Strong rules assume $c_i$ changes slowly to use information from old solutions $\lambda'>\lambda$ to estimate $A$.

- Assume $c_i$ is $1$-Lipschitz, i.e., $|c_i(\lambda')-c_i(\lambda)|\le |\lambda-\lambda'|$
- The strong rules suggests adding a coordinate $i$ to $A$ if it fails the the "sequential strong rule" condition, that $|c_i(\lambda)|<2\lambda-\lambda'$
- When the above two conditions hold, notice $|c_i(\lambda)|\le |c_i(\lambda)-c_i(\lambda')|+|c_i(\lambda')|$ by triangle inequality, which implies $|c_i(\lambda)|< (\lambda' - \lambda)+(2\lambda-\lambda')$ by using our two conditions. This simplifies to $|c_i(\lambda)|<\lambda$, which by the KKT conditions of the Lasso problem implies $\beta^*_i(\lambda)=0$, i.e., we don't need to optimize coordinate $i$.

glmnet leverages coordinate descent (CD), which uses more than just first order information about the objective to outperform most oblivious solvers for (B). In practice, we noticed several gotchas with CD:

- You're actually solving Lasso-with-unregularized-intercept $\min_{\beta,\beta_0} \frac{1}{2n}\|X\beta-y-\beta_0\|_2^2+\lambda\|\beta\|_1$, so you need do a hairy two dimensional optimization step when doing line search if $X$ is uncentered.
- You also need to keep track of the objective loss value — computing it is $O(np)$, so you need to update it. Requires some additional statistics to be gathered during the coordinate step.
- Sparse $X$ are tricky: you need to do additional bookkeeping to avoid paying $O(n)$ cost per coordinate iteration (i.e., make it $O(\mathrm{nnz})$). In addition, watch out for floating point error from cancellation when doing this.

### Raw Notes

*If you like applying these kinds of methods practical ML problems, join our team.*