Statistical Inference

Complementary Pairs Stability Selection | August 25, 2021

By Nivedita Bhaskhar - September 3, 2021

Materials

Variable selection with error control: another look at stability selection by R.D. Shah and R.J. Samworth.

Why CPSS?

The original stability selection procedure (SS), broadly speaking, improves the performance of a base variable selection procedure ($M$) by aggregating the results of applying $M$ to subsamples of the dataset. However, the guarantee of an upper bound on the expected number of noise variables selected by SS holds only under the following assumptions on the base procedure $M$: a) $M$ is better than random guessing, and b) a strong exchangeability condition on the selection of noise variables holds. Further, the guaranteed upper bound tends to be weak, and often a precise calculation of the estimator used in SS is impossible because it involves running over all subsets of size $\lfloor n/2 \rfloor$ from a dataset of size $n$.

Complementary Pairs Stability Selection (CPSS), introduced in this paper, is a closely related variant of stability selection. Instead of segregating variables into noise and signal, they are split into two disjoint groups: variables that have low (resp. high) probability of being selected by the base procedure $M$. The authors derive upper bounds, without any further assumptions on $M$, for a) the average number of low selection probability variables (under $M$) that are selected by CPSS, and b) the average number of high selection probability variables (under $M$) that are NOT selected by CPSS, where guarantee a) can be further strengthened under reasonable shape restrictions. Finally, the estimator used in the CPSS procedure involves the number of splits $B$, and the guarantees derived above hold for any $B$.

Nuggets

  • Assumptions and notations:

    • Assume we have $n$ vector-valued data points $z_1, z_2, \ldots, z_n$, which are realizations of i.i.d. random variables $Z_i$.
    • Further, assume that each $Z_i = (X_i, Y_i)$, where the feature vector $X_i$ takes values in $\mathbb{R}^p$ and the target variable $Y_i$ takes values in $\mathbb{R}$. So we have $p$ features (let's call them $1, 2, \ldots, p$) and real-valued targets.
    • For convenience, let us denote $\{1, 2, \ldots, m\}$ by $[m]$ for any natural number $m$.
    • We are given a base variable selection procedure $M$ which selects a subset of the features $[p]$ after training on the dataset. So we can consider $M$ to be a statistic $M(Z_1, Z_2, \ldots, Z_n)$ which takes values in the power set of $[p]$ (a minimal code sketch of such a procedure follows this list).
    • For any subset $A = \{a_1, a_2, \ldots, a_{|A|}\}$ of $[n]$, let $M_A$ denote the base procedure $M$ trained on the sub-sample $\{Z_{a_1}, Z_{a_2}, \ldots, Z_{a_{|A|}}\}$ of the dataset.
    • Let $M_m$ denote the base procedure $M$ trained on a random sub-sample of size $m$ of the dataset, for any natural number $m \leq n$.
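
To make the notation concrete, here is a minimal sketch (not from the paper) of a base procedure $M$ as a Python function that trains on a (sub)sample and returns a subset of the features. The use of scikit-learn's Lasso, the fixed penalty `alpha=0.1`, and the name `lasso_select` are illustrative assumptions of mine, not choices made by the authors.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_select(X, y, alpha=0.1):
    """Hypothetical base procedure M: fit an l1-penalised linear model on
    the given (sub)sample and return the set of features with nonzero
    coefficients (feature indices are 0-based here, i.e. a subset of
    {0, ..., p-1})."""
    model = Lasso(alpha=alpha).fit(X, y)
    return set(np.flatnonzero(model.coef_))
```

Any procedure with this interface (subsample in, subset of features out) would do; the lasso is only one convenient instance.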
  • CPSS procedure:

    • Fix a threshold $\pi_{thresh}$ in $[0,1]$.
    • Fix $B$, the number of complementary splits.
    • Pick $B$ random, independent complementary splits of the dataset (i.e. pick random, independent pairs of disjoint subsets $(A_{2j-1}, A_{2j})$ of $[n]$ for $j = 1, 2, \ldots, B$, where $|A_i| = \lfloor n/2 \rfloor$ for all $i$).
    • For each $1 \leq k \leq p$, calculate the proportion of the $2B$ sub-samples $A_i$, $i = 1, 2, \ldots, 2B$, on which feature $k$ gets picked by the base procedure $M$. More formally, calculate

    $$\hat{\Pi}_B(k) := \frac{1}{2B} \sum_{i=1}^{2B} \mathbb{1}_{k \in M_{A_i}}$$

    • The CPSS procedure then picks every feature $k$ for which $\hat{\Pi}_B(k) \geq \pi_{thresh}$. Let us denote the set of features selected by the CPSS procedure by $CPSS_{\pi_{thresh}, B}$ (a minimal code sketch of the procedure follows this list).
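
Here is a minimal sketch of the CPSS procedure under the same illustrative assumptions (the names `cpss` and `base_selector` are mine, not the paper's): draw $B$ complementary pairs of disjoint sub-samples of size $\lfloor n/2 \rfloor$, compute $\hat{\Pi}_B(k)$ for each feature, and keep the features that cross the threshold.

```python
import numpy as np

def cpss(X, y, base_selector, B=50, pi_thresh=0.6, seed=0):
    """Sketch of Complementary Pairs Stability Selection.

    base_selector(X_sub, y_sub) should return a set of selected feature
    indices, e.g. the hypothetical lasso_select sketched above."""
    n, p = X.shape
    half = n // 2
    rng = np.random.default_rng(seed)
    counts = np.zeros(p)                          # how often each feature is selected
    for _ in range(B):
        perm = rng.permutation(n)                 # one complementary pair (A_{2j-1}, A_{2j})
        for A in (perm[:half], perm[half:2 * half]):
            for k in base_selector(X[A], y[A]):
                counts[k] += 1
    pi_hat = counts / (2 * B)                     # empirical selection proportions
    selected = set(np.flatnonzero(pi_hat >= pi_thresh))
    return selected, pi_hat
```

Calling, say, `selected, pi_hat = cpss(X, y, lasso_select, B=50, pi_thresh=0.6)` would return the CPSS selection together with the selection proportions; the default values here are placeholders, not recommendations from the paper.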
  • CPSS guarantees:

    • Fix a $\theta \in [0,1]$ and fix a $\pi_{thresh} \in [0,1]$.

    • We call feature $k$ a *low selection probability feature* if the probability that it gets selected by the procedure $M_{\lfloor n/2 \rfloor}$ is at most $\theta$. If not, we call it a *high selection probability feature*. Thus the features $[p]$ are segregated into two disjoint subsets: low selection probability features ($L_{\theta}$) and high selection probability features ($H_{\theta}$).

    • Broadly speaking, if $\pi_{thresh} > 0.5$, we have a guarantee that the average number of low selection probability features picked by the CPSS procedure is upper bounded by a quantity proportional to the average number of low selection probability features picked by $M$ (a small numerical illustration follows this list). More formally,

      $$E[|CPSS_{\pi_{thresh}, B} \cap L_{\theta}|] \leq \frac{\theta}{2\pi_{thresh}-1}\, E[|M_{\lfloor n/2 \rfloor} \cap L_{\theta}|] \qquad \text{(a)}$$

    • Similarly, if $\pi_{thresh} < 0.5$, we have a guarantee that the average number of high selection probability features NOT picked by the CPSS procedure is upper bounded by a quantity proportional to the average number of high selection probability features NOT picked by $M$. More formally,

      $$E[|CPSS_{\pi_{thresh}, B}^{complement} \cap H_{\theta}|] \leq \frac{1-\theta}{1-2\pi_{thresh}}\, E[|M_{\lfloor n/2 \rfloor}^{complement} \cap H_{\theta}|] \qquad \text{(b)}$$

    • In fact, if the assumptions needed for the original stability selection (SS) guarantee actually hold, then the CPSS guarantee (a) recovers the original SS guarantee.
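
As a quick numerical illustration of bound (a) (the numbers here are chosen for concreteness and are not from the paper): take $\theta = 0.02$ and $\pi_{thresh} = 0.9$. Then

$$E[|CPSS_{0.9, B} \cap L_{0.02}|] \;\leq\; \frac{0.02}{2(0.9)-1}\, E[|M_{\lfloor n/2 \rfloor} \cap L_{0.02}|] \;=\; 0.025\, E[|M_{\lfloor n/2 \rfloor} \cap L_{0.02}|],$$

i.e. on average CPSS keeps at most $2.5\%$ as many low selection probability features as the base procedure does on a sub-sample of size $\lfloor n/2 \rfloor$.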

  • Stronger CPSS guarantees for restricted distributions:

    • One can define a related estimator

$$\tilde{\Pi}_B(k) := \frac{1}{B} \sum_{j=1}^{B} \mathbb{1}_{k \in M_{A_{2j-1}}}\, \mathbb{1}_{k \in M_{A_{2j}}}$$
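
To make this estimator concrete, here is a short sketch (again with hypothetical names, and assuming the per-sub-sample selections from a CPSS run were recorded as a boolean matrix `sel` of shape $(2B, p)$): $\tilde{\Pi}_B(k)$ is the fraction of the $B$ complementary pairs in which both halves selected feature $k$.

```python
import numpy as np

def simultaneous_selection_proportions(sel):
    """sel: boolean array of shape (2B, p); rows 2j and 2j+1 hold the
    selections made on the two halves of the j-th complementary pair.
    Returns, for every feature k, the fraction of pairs in which both
    halves selected k."""
    both = sel[0::2] & sel[1::2]   # shape (B, p): selected by BOTH halves of a pair
    return both.mean(axis=0)       # average over the B pairs
```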

  • In fact, the proof of the CPSS guarantee (a) proceeds by relating $\hat{\Pi}_B(k)$ to the estimator $\tilde{\Pi}_B(k)$ defined above.