Causal Inference

R-Learner | December 7, 2020

By Arnaud Autef - December 7, 2020

This week we review the R-Learner, a 2-step causal inference algorithm to estimate heterogeneous treatment effects from observational data.


Why the R-learner

  • Setup

    Observations $Y_i$, features $X_i$, binary treatment decision $W_i$ for each unit $i$, with assumptions:

    • Overlap: $\forall X_i,\ \eta \le \mathbb{P}(W_i = 1 \mid X_i) \le 1 - \eta$ for some $0 < \eta < 1$

    • Unconfoundedness: $\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i$

      $\rightarrow$ treatment decision and potential outcomes are independent given the features

  • Goal

    Efficiently approximate $\tau^*(x) := \mathbb{E}(Y_i(1) - Y_i(0) \mid X_i = x)$

  • Practical motivation

    • Quite general assumptions, much less constrained than strict experimental setups like randomized control trials (RCTs).
    • Answers the very important question: "What is the expected effect of the treatment on unit $i$, given its features $X_i$?"
  • Theoretical motivation

    $\hat{\tau}$ obtained via the R-learner achieves asymptotic error rates of the same scale as $\tilde{\tau}$, where $\tilde{\tau}$ is an "oracle" learner that knows the following functions perfectly:

    $m^*(X_i) := \mathbb{E}(Y_i \mid X_i)$ and $e^*(X_i) := \mathbb{E}(W_i \mid X_i)$

    • In the paper, error rates are obtained when $\tau$ functions are approximated via penalized kernel regression

    • In the raw notes below, we sketch the proof when $\tau$ functions are approximated via Lasso-penalized linear regression. In this case, we get:

      $R(\hat{\tau}),~R(\tilde{\tau}) = \tilde{\mathcal{O}}_P\left(\dfrac{1}{\sqrt n}\right)$

  • At Sisu?

    • Today, Sisu customers can use our software to identify the subpopulations of their dataset that drive changes in their metric of interest.
    • Tomorrow with causal inference, if a customer takes action on those subpopulations, they could come back to Sisu to estimate the "treatment effect" that their action had on each subpopulation!


  • Robinson Decomposition

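    $\mathbb{E}(Y_i - m^*(X_i) \mid X_i, W_i) = \tau^*(X_i) \{W_i - e^*(X_i)\}$

    This follows from unconfoundedness: $\mathbb{E}(Y_i \mid X_i, W_i) = \mathbb{E}(Y_i(0) \mid X_i) + W_i \tau^*(X_i)$, and averaging over $W_i$ gives $m^*(X_i) = \mathbb{E}(Y_i(0) \mid X_i) + e^*(X_i) \tau^*(X_i)$.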

  • R-loss $\leftrightarrow$ squared error of the $\tau^*$ approximation correspondence


    $L(\tau) = \mathbb{E} \left[\left( \{Y_i - m^*(X_i)\} - \tau(X_i) \{ W_i - e^*(X_i)\} \right)^2\right]$

    Correspondence (leverages Overlap for tightness): under Overlap, $|W_i - e^*(X_i)|$ always lies in $[\eta, 1 - \eta]$, so

    $\eta^2\, \mathbb{E}[(\tau(X_i) - \tau^*(X_i))^2] \le L(\tau) - L(\tau^*) \le (1 - \eta)^2\, \mathbb{E}[(\tau(X_i) - \tau^*(X_i))^2]$

  • Feasible R-loss

    $\hat{L}(\tau) = \dfrac{1}{n}\sum_{1 \le i \le n} \left(\{Y_i - \hat{m}^{-i}(X_i)\} - \tau(X_i) \{ W_i - \hat{e}^{-i}(X_i)\}\right)^2$

    where $\hat{m}^{-i}$ and $\hat{e}^{-i}$ are estimates of $m^*$ and $e^*$ fitted without using observation $i$ (cross-fitting).
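
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not the paper's implementation: the nuisance functions are fitted with plain least squares (a linear probability model for the propensity, clipped away from 0 and 1), $\tau$ is restricted to a linear function of the features, and the Lasso penalty discussed above is dropped for brevity. The helper names (`add_intercept`, `ols_fit_predict`, `r_learner_linear`, `simulate`) and the simulated data-generating process are hypothetical, chosen only so the example runs end to end.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones to the feature matrix."""
    return np.column_stack([np.ones(len(X)), X])


def ols_fit_predict(X_train, y_train, X_test):
    """Fit least squares on the training fold, predict on the held-out fold."""
    coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
    return add_intercept(X_test) @ coef


def r_learner_linear(X, Y, W, n_folds=5, clip=0.05, seed=0):
    """Two-step R-learner sketch with tau(x) = beta[0] + x @ beta[1:]."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # random, balanced fold labels

    # Step 1: cross-fitted nuisance estimates m^{-i}(X_i) and e^{-i}(X_i).
    # Each unit is predicted by a model that never saw that unit.
    m_hat = np.empty(n)
    e_hat = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        m_hat[test] = ols_fit_predict(X[train], Y[train], X[test])
        e_hat[test] = ols_fit_predict(X[train], W[train], X[test])
    e_hat = np.clip(e_hat, clip, 1 - clip)  # keep propensities inside (0, 1)

    # Step 2: minimize the feasible R-loss over linear tau. With linear tau
    # this is least squares of the outcome residual on the treatment
    # residual times the features (assumption: no penalty term).
    y_res = Y - m_hat
    w_res = W - e_hat
    design = w_res[:, None] * add_intercept(X)
    beta, *_ = np.linalg.lstsq(design, y_res, rcond=None)
    return beta


def simulate(n, seed=0):
    """Hypothetical observational data with tau*(x) = 1 + 0.5 * x_1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    e_star = 0.5 + 0.3 * X[:, 0]  # overlap holds with eta = 0.2
    W = rng.binomial(1, e_star).astype(float)
    tau_star = 1.0 + 0.5 * X[:, 1]
    baseline = X[:, 0] + 2.0 * X[:, 1]
    Y = baseline + W * tau_star + rng.normal(scale=0.5, size=n)
    return X, Y, W


X, Y, W = simulate(n=5000, seed=0)
beta = r_learner_linear(X, Y, W)
print("estimated tau coefficients (intercept, x_0, x_1):", beta)
```

In this simulation the treatment effect is linear in the features, so the estimated coefficients should land close to the true values (1, 0, 0.5). Swapping the least-squares nuisance fits for more flexible learners only changes step 1; step 2 stays the same.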

Raw Notes

"proof of concept proof" of the isomorphic projection bound in the Lasso linear regression case

Rlearner 2.pdf

If you like applying these kinds of methods to practical ML problems, join our team.
