Statistical Inference

Complementary Pairs Stability Selection | August 25, 2021

By Nivedita Bhaskhar - September 3, 2021

Materials

Variable selection with error control: another look at stability selection by R.D. Shah and R.J. Samworth.

Why CPSS?

The original stability selection procedure (SS), broadly speaking, improves the performance of a base variable selection procedure ($M$) by aggregating the results of applying $M$ to subsamples of the dataset. However, the guarantee of an upper bound on the expected number of noise variables selected by SS holds only under the following assumptions on the base procedure $M$: a) $M$ is better than random guessing, and b) a strong exchangeability condition holds for the selection of noise variables. Further, the guaranteed upper bound tends to be weak, and an exact computation of the estimator used in SS is often infeasible because it requires running $M$ on all subsets of size $n/2$ of a dataset of size $n$.

Complementary Pairs Stability Selection (CPSS), introduced in this paper, is a closely related variant of stability selection. Instead of segregating variables into noise and signal, they are split into two disjoint groups: variables that have a low (resp. high) probability of being selected by the base procedure $M$. Without any further assumptions on $M$, the authors derive upper bounds on a) the expected number of low selection probability variables that are selected by CPSS and b) the expected number of high selection probability variables that are NOT selected by CPSS, where guarantee a) can be strengthened further under reasonable shape restrictions. Finally, the estimator used in the CPSS procedure involves the number of splits $B$, and the guarantees above hold for any choice of $B$.

Nuggets

• Assumptions and notations:

• Assume we have $n$ vector valued data points $z_1, z_2, \ldots, z_n$ which are values taken on by i.i.d random variables $Z_i$.
• Further, assume that each $Z_i = (X_i, Y_i)$ where feature vector $X_i$ takes values in $\mathbb{R}^p$ and target variable $Y_i$ takes values in $\mathbb{R}$. So we have $p$ features (let's call them $1,2,\ldots, p$) and real valued targets.
• For convenience, let us denote $\{1,2,\ldots, m\}$ by $[m]$ for any natural number $m$.
• We are given a base variable selection procedure $M$ which selects a subset of features $[p]$ after training on the dataset. So we can consider $M$ to be a statistic $M(Z_1,Z_2,\ldots, Z_n)$ which takes values in the power set of $[p]$.
• For any subset $A = \{a_1,a_2,\ldots, a_{|A|}\}$ of $[n]$, let $M_A$ denote the base procedure $M$ which trains on the sub-sample $\{Z_{a_1}, Z_{a_2}, \ldots, Z_{a_{|A|}}\}$ of the dataset.
• Let $M_m$ denote the base procedure $M$ which trains on a random sub-sample of size $m$ of the dataset, for any natural number $m\leq n$.
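To make the notation concrete, here is a minimal sketch of one possible base procedure $M$ (this particular choice, correlation screening with a hypothetical `top_k` parameter, is an illustrative assumption and not something prescribed by the paper):

```python
import numpy as np

def correlation_screening(X, y, top_k=5):
    """A toy base selection procedure M: select the top_k features whose
    absolute Pearson correlation with the target y is largest.
    Returns the selected feature indices as a set (a subset of {0,...,p-1})."""
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the target
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    )
    return set(np.argsort(-np.abs(corr))[:top_k])
```

Any statistic of this shape, mapping a (sub-)sample of the data to a subset of $[p]$, can serve as $M$ in what follows.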
• CPSS procedure:

• Fix a threshold $\pi_{thresh}$ in $[0,1]$.
• Fix $B$, the number of complementary splits.
• Pick $B$ random, independent complementary splits of the dataset (i.e. pick random, independent pairs of disjoint splits $(A_{2j-1}, A_{2j})$ of the dataset for $j=1,2,\ldots, B$ where $|A_{i}| = \lfloor n/2 \rfloor$ for all $i$).
• For each $1\leq k\leq p$, calculate the fraction of times feature $k$ gets picked when the base procedure $M$ is run on each sub-sample $A_i$ for $i=1,2,\ldots, 2B$. More formally, calculate

$\hat{\Pi}_B(k) := \frac{1}{2B} \sum_{i=1}^{2B} \mathbb{1}_{k\in M_{A_i}}$

• The CPSS procedure then picks every feature $k$ with $\hat{\Pi}_B(k)\geq \pi_{thresh}$. Let us denote the set of features selected by the CPSS procedure by $CPSS_{\pi_{thresh}, B}$.
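The steps above can be sketched as follows (a minimal implementation; the argument `base_M`, standing in for the base procedure $M$, is assumed to be any function mapping a sub-sample to a set of feature indices):

```python
import numpy as np

def cpss(X, y, base_M, B=50, pi_thresh=0.9, seed=None):
    """Sketch of Complementary Pairs Stability Selection.
    base_M(X_sub, y_sub) -> set of selected feature indices.
    Returns the CPSS-selected feature set and the vector \\hat{\\Pi}_B."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = n // 2                       # each half has floor(n/2) points
    counts = np.zeros(p)
    for _ in range(B):
        perm = rng.permutation(n)
        A1, A2 = perm[:m], perm[m:2 * m]   # complementary disjoint pair
        for A in (A1, A2):
            for k in base_M(X[A], y[A]):
                counts[k] += 1
    pi_hat = counts / (2 * B)        # \hat{\Pi}_B(k) for each feature k
    return {k for k in range(p) if pi_hat[k] >= pi_thresh}, pi_hat
```

Note that the guarantees below hold for any $B$, so $B$ can be chosen based on the available compute budget.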
• CPSS guarantees:

• Fix a $\theta\in [0,1]$ and fix a $\pi_{thresh}\in [0,1]$.

• We call feature $k$ a *low selection probability feature* if the probability that it gets selected by procedure $M_{\lfloor n/2 \rfloor}$ is at most $\theta$. If not, we call it a *high selection probability feature*. Thus the features $[p]$ are segregated into two disjoint subsets: low selection probability features ($L_{\theta}$) and high selection probability features ($H_{\theta}$).

• Broadly speaking, if $\pi_{thresh}>0.5$, we have a guarantee that the average number of low selection probability features picked by the CPSS procedure is upper bounded by a quantity proportional to the average number of low selection probability features picked by $M$. More formally,

$E[|CPSS_{\pi_{thresh}, B}\cap L_{\theta}|] \leq \frac{\theta}{2\pi_{thresh}-1}\,E[|M_{\lfloor n/2 \rfloor}\cap L_{\theta}|] \qquad \text{(a)}$

• Similarly, if $\pi_{thresh}<0.5$, we have a guarantee that the average number of high selection probability features NOT picked by the CPSS procedure is upper bounded by a quantity proportional to the average number of high selection probability features NOT picked by $M$. More formally, writing $S^{c}$ for the complement of a feature set $S$ in $[p]$,

$E[|CPSS_{\pi_{thresh}, B}^{c}\cap H_{\theta}|] \leq \frac{1-\theta}{1-2\pi_{thresh}}\,E[|M_{\lfloor n/2 \rfloor}^{c}\cap H_{\theta}|] \qquad \text{(b)}$

• In fact, if the assumptions needed for the original stability selection (SS) guarantee actually hold, then the CPSS guarantee (a) recovers the original SS guarantee.
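As a worked example of what these constants buy (the values $\theta = 0.01$, $\pi_{thresh} = 0.9$ and $\pi_{thresh} = 0.1$ below are illustrative assumptions, not from the paper):

```python
theta = 0.01

# Guarantee (a) applies when pi_thresh > 0.5: with pi_thresh = 0.9,
# CPSS selects (in expectation) at most 0.0125 times as many
# low selection probability features as E|M ∩ L_theta|.
pi_thresh_a = 0.9
factor_a = theta / (2 * pi_thresh_a - 1)       # 0.01 / 0.8 = 0.0125

# Guarantee (b) applies when pi_thresh < 0.5: with pi_thresh = 0.1,
# the bound constant on missed high-probability features is 1.2375.
pi_thresh_b = 0.1
factor_b = (1 - theta) / (1 - 2 * pi_thresh_b)  # 0.99 / 0.8 = 1.2375
```

The bound (a) tightens as $\pi_{thresh}$ grows toward 1, at the price of selecting fewer features overall.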

• Stronger CPSS guarantees for restricted distributions:

• One can define a related estimator

$\tilde{\Pi}_B(k) := \frac{1}{B} \sum_{j=1}^{B} \mathbb{1}_{k\in M_{A_{2j-1}}} \mathbb{1}_{k\in M_{A_{2j}}}$

• In fact, the proof for the CPSS guarantee (a) proceeds by relating $\hat{\Pi}_B(k)$ to $\tilde{\Pi}_B(k)$.
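The link between the two estimators rests on the elementary pointwise bound $\tilde{\Pi}_B(k) \geq 2\hat{\Pi}_B(k) - 1$, which follows from $\mathbb{1}_{k\in A}\mathbb{1}_{k\in B} \geq \mathbb{1}_{k\in A} + \mathbb{1}_{k\in B} - 1$ applied to each complementary pair. A quick numerical sanity check of this inequality (with randomly simulated selection indicators, an illustrative setup):

```python
import numpy as np

# Simulate, for one fixed feature k, the indicator of whether k was
# selected on each of the 2B half-samples; entries 2j and 2j+1 form pair j.
rng = np.random.default_rng(0)
B = 100
sel = rng.integers(0, 2, size=2 * B)

hat_pi = sel.mean()                        # \hat{\Pi}_B(k): fraction of halves selecting k
tilde_pi = (sel[0::2] * sel[1::2]).mean()  # \tilde{\Pi}_B(k): fraction of pairs where BOTH halves select k

# The pointwise bound holds for every realization, not just on average.
assert tilde_pi >= 2 * hat_pi - 1
```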