Statistical Inference

Cross-Validation | May 25, 2021

By Vlad Feinberg - May 25, 2021

This week, we reviewed a paper by Stephan Bates, Trevor Hastie, and Robert Tibshirani on cross-validation. This paper provides an illuminating look at what cross validation (CV) actually estimates, proving that KK-fold CV better estimates the average error for a supervised learning procedure among datasets of size K1Kn\frac{K-1}{K}n for datasets of size nn than the actual average generalization error of the model we just fitted!


Why Nested Cross Validation?

  • As machine learning practitioners, if we want to have a confidence interval for our model's generalization error, we should be cognizant of the naivety of merely looking at the CV estimate for our model's error with a normal interval built around its standard errors, which often times fail to include the true error at the nominal rate.
    • This means that you might expect that if you take a lower bound of the 10% CI from your CV, that only ~5 out of a 100 times you train an ML model on a new dataset you'll actually have worse performance than expected, in practice it could be much higher.


  • (Theorem 1) Put evocatively, in a standard ordinary least squares OLS setting with predictors XX and outcomes YY, one could resample outcomes YY' from the same model, and should be just as happy with a CV estimate for the error from the dataset (X,Y)(X, Y') for the model trained on (X,Y)(X, Y) as the CV estimate on (X,Y)(X, Y) directly.
    • To an extent, we can view this as sensible: our error estimate is independent of the particular noise instantiation for our dataset.
    • However, this is also not intended, as practitioners frequently look to CV performance for an understanding of their model's expected generalization error.
    • The proof follows from a classical argument: since CV is linearly invariant its error estimate is independent of the the component of YY living in XX's column space, which determines the OLS fit.
  • (Theorem 2) Under stronger assumptions about a generative distribution of XN(0,Σp)X\sim N(0,\Sigma_p) and the asymptotic setting nn\rightarrow \infty, n/pλn/p\rightarrow\lambda with number of features growin as nn does, the CV estimate for the mean squared error (MSE) of your model itself has an MSE of Θ(n1)\Theta(n^{-1}) from the actual generalization error, which is within the magnitude of the variance of the CV estimator itself. As a result, one cannot hope to have asymptotic coverage of the true error for one's model using unadjusted confidence intervals.
    • Inspecting the generalization error of the OLS setup directly, the question essentially reduces to computing second and fourth moments of a normal quadratic form.
    • Existing results about inverse Wishart matrices bound the error growth as described.
  • Raw notes consider various extensions and how authors might pursue future work: the results are impressive for the given setting, but this CV estimation issue does not exist for fixed pp, asymptotic nn settings.
    • If Lasso regularization is used, is the high feature dimensionality problematic? Unfortunately, yes, we cannot simply rely on screening to whittle feature count down. There might be ways around this, but they'd require at a deeper look at stability of the lasso solution across CV folds.
    • Can results be extended to other covariate models? Possibly; their fourth and second moments need to be well behaved for error growth rates to remain similar.

Raw Notes


  • The proof of the abridged Theorem 1.6 from Seber and Lee starts by writing the quadratic form XAX=ijklaijaklXiXjXkXlX^\top AX=\sum_i\sum_j\sum_k\sum_la_{ij}a_{kl}X_iX_jX_kX_l; the left-hand side should be squared.
  • The proof of Corollary 3 is incorrect and longer than it should be: three lines in, the E[var(ErrXYX)]\mathbb{E}[\mathrm{var}(\mathrm{Err}_{XY}|X)] term is shown equal to var ErrXY2cov (ErrXY,ErrX)\mathrm{var}\ \mathrm{Err}_{XY}-2\cdot\mathrm{cov}\ (\mathrm{Err}_{XY},\mathrm{Err}_{X}), and the left term is immediately Ω(n1)\Omega(n^{-1}) while the right term can be written as corr(ErrXY,ErrX)var(ErrXY)var(ErrX)\mathrm{corr}(\mathrm{Err}_{XY},\mathrm{Err}{X})\cdot\sqrt{\mathrm{var}(\mathrm{Err}{XY})\mathrm{var}(\mathrm{Err}{X})}, which is o(n1)o(n^{-1}).

If you like applying these kinds of methods practical ML problems, join our team.

Read more

Model-X knockoffs | April 6, 2021

A discussion on Model-X knockoffs, part of a family of approaches for converting any variable selection into on that controls false discovery rate (FDR).

Read more

Simple Bayesian Algorithms for Best Arm Identification | March 9, 2021

This week we reviewed techniques using simple Bayesian algorithms for sequential decision-making problems.

Read more