By Vlad Feinberg - February 23, 2021
In our discussion this week, we review glmnet, the widely renowned, frequently used, and efficient Lasso optimization package that kicked off a whole class of working-set optimization approaches and rekindled excitement in coordinate descent.
Lasso can be solved with generic ("oblivious") proximal convex solvers. Why specialize so much for this one problem? Because its optimum has some really amazing properties.
The "bet on sparsity" principle (ESL 16.2.2):
Use a procedure that does well in sparse problems, since no procedure does well in dense problems.
With less modest assumptions (Candès and Tao 2005, van de Geer and Bühlmann 2009), you can have MSE drop at rate $O(k \log p / n)$, and even recover the "true" unknown $\beta^*$, assuming it's $k$-sparse, $\lambda$ is set appropriately, and the data $X$ satisfies a restricted eigenvalue (or restricted isometry) condition. Note this is really fast, as in it's the kind of MSE you get from just estimating means, which are, like, the easiest thing to estimate.
Perhaps more importantly than the rate hacking above, those rates only hide $O(\log p)$ factors, where $p$ is the dimension of $\beta$. This tells us that the Lasso really is a tool for statistical search: it can uncover the "true" signals from a list of noisy ones without requiring the number of examples to grow linearly with the number of signals we consider.
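As a quick illustration, here's a minimal simulation sketch (hand-picked sizes and regularization strength, with scikit-learn's Lasso standing in for glmnet) where the candidate signals far outnumber the examples, yet the true support comes back:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 200, 1000, 5                  # many more candidate signals than examples
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:k] = 1.0                     # the sparse "true" unknown
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# sklearn divides the RSS term by n, so alpha plays the role of lambda / n;
# the value here is hand-picked, not tuned.
fit = Lasso(alpha=0.2).fit(X, y)
print(np.flatnonzero(fit.coef_))        # expect roughly [0 1 2 3 4]
```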
Zoomed out, sparsity-aware ("non-oblivious") solvers have the following form. Let $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, and let $f_S$ be the restriction of $f$ to coordinates $S$ of its argument (i.e., with the other coordinates fixed at 0).
Coarsifying excessively, the whole Lasso algorithm looks like a loop over these two steps:

1. Guess a working set $S$ of coordinates that hopefully contains the support of the optimum.
2. Solve the restricted inner problem $\min_{\beta_S} f_S(\beta_S)$, then check the full problem's optimality (KKT) conditions, adding any violating coordinates to $S$ before looping again.

So fast solvers (A) make good guesses for $S$, such that we require few loops, learning based on the history of the optimization, and (B) solve the inner problem in step 2 as fast as they can. A minimal sketch of this outer loop follows.
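Here the function names, tolerance, and the plain cyclic inner solver are my own illustrative choices; glmnet's actual implementation is considerably more careful:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def solve_restricted(X, y, lam, S, beta, n_sweeps=100):
    # Minimize f_S: cyclic coordinate descent over the coordinates in S,
    # with every coordinate outside S held fixed at 0.
    for _ in range(n_sweeps):
        for j in S:
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual without j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

def working_set_lasso(X, y, lam, max_rounds=50):
    p = X.shape[1]
    beta = np.zeros(p)
    S = set()
    for _ in range(max_rounds):
        c = X.T @ (y - X @ beta)                     # -(gradient of the RSS term)
        # KKT for the full problem requires |c_j| <= lam for every j,
        # so any coordinate violating that bound must enter the working set.
        violations = set(np.flatnonzero(np.abs(c) > lam * (1 + 1e-8))) - S
        if not violations:
            return beta                              # full KKT satisfied
        S |= violations
        beta = solve_restricted(X, y, lam, sorted(S), beta)
    return beta
```

The payoff is that each inner solve touches only $|S|$ columns of $X$ rather than all $p$.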
glmnet uses the "strong rules" heuristic for (A). Define $c(\lambda) = X^\top(y - X\hat\beta(\lambda))$, where $\hat\beta(\lambda)$ is the solution for a given $\lambda$ (notably, strong rules are useful in the situation where you want to solve the Lasso for many $\lambda$, in decreasing order; this is called the Lasso path problem). Note $c(\lambda)$ is the (negative) gradient of the RSS term at the solution for $\lambda$. Strong rules assume each $c_j(\lambda)$ changes slowly in $\lambda$, at most at unit rate, to use information from old solutions to estimate $c(\lambda)$ at the next point on the path.
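Concretely (in my notation; the construction follows the strong-rules paper): if $|c_j(\lambda)|$ moves at most at unit rate, then $|c_j(\lambda_k)| \le |c_j(\lambda_{k-1})| + (\lambda_{k-1} - \lambda_k)$, so any $j$ with $|c_j(\lambda_{k-1})| < 2\lambda_k - \lambda_{k-1}$ is predicted to stay at zero and can be screened out:

```python
import numpy as np

def strong_rule_survivors(X, y, beta_prev, lam_prev, lam):
    # Sequential strong rule: keep only coordinates whose correlation with
    # the residual at the previous (larger) lambda is big enough that they
    # could plausibly become active at the new, smaller lambda.
    c_prev = X.T @ (y - X @ beta_prev)   # c(lam_prev)
    return np.flatnonzero(np.abs(c_prev) >= 2 * lam - lam_prev)
```

Since the unit-rate assumption can fail, the screened-out coordinates still get a KKT check afterward, and any violators are added back and the problem re-solved.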
glmnet leverages coordinate descent (CD), which uses more than just first-order information about the objective to outperform most oblivious solvers for (B). In practice, we noticed several gotchas with CD.
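To see where the beyond-first-order information enters (a standard derivation, not glmnet-specific): minimizing $f$ over a single coordinate $\beta_j$, with partial residual $r_{-j} = y - \sum_{k \neq j} x_k \beta_k$, has the closed form

$$\hat\beta_j = \frac{\mathcal{S}\left(x_j^\top r_{-j},\ \lambda\right)}{\|x_j\|_2^2}, \qquad \mathcal{S}(z, t) = \operatorname{sign}(z)\max(|z| - t,\ 0),$$

where the curvature $\|x_j\|_2^2$ in the denominator is exactly the second-order information an oblivious first-order method leaves on the table.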
If you like applying these kinds of methods to practical ML problems, join our team.