Statistical Inference

By Vlad Feinberg - May 25, 2021

This week, we reviewed a paper by Stephan Bates, Trevor Hastie, and Robert Tibshirani on cross-validation. This paper provides an illuminating look at what cross validation (CV) actually estimates, proving that $K$-fold CV better estimates the average error for a supervised learning procedure among datasets of size $\frac{K-1}{K}n$ for datasets of size $n$ than the actual average generalization error of the model we just fitted!

**Materials**

- Cross-validation: what does it estimate and how well does it do it? by Stephen Bates, Trevor Hastie, Robert Tibshirani

**Why Nested Cross Validation?**

- As machine learning practitioners, if we want to have a confidence interval for our model's generalization error, we should be cognizant of the naivety of merely looking at the CV estimate for our model's error with a normal interval built around its standard errors, which often times fail to include the true error at the nominal rate.
- This means that you might expect that if you take a lower bound of the 10% CI from your CV, that only ~5 out of a 100 times you train an ML model on a new dataset you'll actually have worse performance than expected, in practice it could be much higher.

**Nuggets**

- (Theorem 1) Put evocatively, in a standard ordinary least squares OLS setting with predictors $X$ and outcomes $Y$, one could
*resample*outcomes $Y'$ from the same model, and should be just as happy with a CV estimate for the error from the dataset $(X, Y')$ for the model trained on $(X, Y)$ as the CV estimate on $(X, Y)$ directly.- To an extent, we can view this as sensible: our error estimate is independent of the particular noise instantiation for our dataset.
- However, this is also not intended, as practitioners frequently look to CV performance for an understanding of their model's expected generalization error.
- The proof follows from a classical argument: since CV is linearly invariant its error estimate is independent of the the component of $Y$ living in $X$'s column space, which determines the OLS fit.

- (Theorem 2) Under stronger assumptions about a generative distribution of $X\sim N(0,\Sigma_p)$ and the asymptotic setting $n\rightarrow \infty$, $n/p\rightarrow\lambda$ with number of features growin as $n$ does, the CV estimate for the mean squared error (MSE) of your model
*itself*has an MSE of $\Theta(n^{-1})$ from the actual generalization error, which is within the magnitude of the variance of the CV estimator itself. As a result, one cannot hope to have asymptotic coverage of the true error for one's model using unadjusted confidence intervals.- Inspecting the generalization error of the OLS setup directly, the question essentially reduces to computing second and fourth moments of a normal quadratic form.
- Existing results about inverse Wishart matrices bound the error growth as described.

- Raw notes consider various extensions and how authors might pursue future work: the results are impressive for the given setting, but this CV estimation issue does not exist for fixed $p$, asymptotic $n$ settings.
- If Lasso regularization is used, is the high feature dimensionality problematic? Unfortunately, yes, we cannot simply rely on screening to whittle feature count down. There might be ways around this, but they'd require at a deeper look at stability of the lasso solution across CV folds.
- Can results be extended to other covariate models? Possibly; their fourth and second moments need to be well behaved for error growth rates to remain similar.

**Raw Notes**

*Corrections:*

- The proof of the abridged Theorem 1.6 from Seber and Lee starts by writing the quadratic form $X^\top AX=\sum_i\sum_j\sum_k\sum_la_{ij}a_{kl}X_iX_jX_kX_l$; the left-hand side should be squared.
- The proof of Corollary 3 is incorrect and longer than it should be: three lines in, the $\mathbb{E}[\mathrm{var}(\mathrm{Err}_{XY}|X)]$ term is shown equal to $\mathrm{var}\ \mathrm{Err}_{XY}-2\cdot\mathrm{cov}\ (\mathrm{Err}_{XY},\mathrm{Err}_{X})$, and the left term is immediately $\Omega(n^{-1})$ while the right term can be written as $\mathrm{corr}(\mathrm{Err}_{XY},\mathrm{Err}{X})\cdot\sqrt{\mathrm{var}(\mathrm{Err}{XY})\mathrm{var}(\mathrm{Err}{X})}$, which is $o(n^{-1})$.

*If you like applying these kinds of methods practical ML problems, join our team.*