
# R-Learner

By Arnaud Autef - December 7, 2020

This week we review the R-Learner, a 2-step causal inference algorithm to estimate heterogeneous treatment effects from observational data.

Materials

Why the R-learner

• Setup

Outcomes $Y_i$, features $X_i$, and a binary treatment decision $W_i$ for each unit $i$, where $Y_i = Y_i(W_i)$ for potential outcomes $Y_i(0), Y_i(1)$, under two assumptions:

• Overlap: $\forall X_i,\ \eta \le \mathbb{P}(W_i = 1 | X_i) \le 1 - \eta$ for some $0 < \eta \le 1/2$

• Unconfoundedness: $\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i$

$\rightarrow$ treatment decision and potential outcomes are independent given the features

• Goal

Efficiently approximate the conditional average treatment effect (CATE) $\tau^*(x) := \mathbb{E}(Y_i(1) - Y_i(0)|X_i = x)$

• Practical motivation

• Quite general assumptions, much less constrained than strict experimental setups like randomized controlled trials (RCTs).
• Answers the very important question: "What is the expected effect of the treatment on unit $i$, given its features $X_i$?"
• Theoretical motivation

$\hat{\tau}$ obtained via the R-learner achieves asymptotic error rates of the same order as those of $\tilde{\tau}$, an "oracle" learner with perfect knowledge of the following nuisance functions:

$m^*(X_i) := \mathbb{E}(Y_i | X_i)$ (conditional mean outcome)

$e^*(X_i) := \mathbb{E}(W_i | X_i)$ (treatment propensity score)

• In the paper, error rates are obtained when the $\tau$ function is approximated via penalized kernel regression

• In the raw notes below, we sketch the proof when the $\tau$ function is approximated via Lasso-penalized linear regression. In this case, we get:

$R(\hat{\tau}),~R(\tilde{\tau}) = \tilde{\mathcal{O}}_P\left(\dfrac{1}{\sqrt{n}}\right)$, where $R(\tau) := L(\tau) - L(\tau^*)$ is the regret of $\tau$ (the R-loss $L$ is defined in the Nuggets below).

• At Sisu?

• Today, Sisu customers can use our software to identify subpopulations of their dataset that impact changes in their metric of interest.
• Tomorrow with causal inference, if a customer takes action on those subpopulations, they could come back to Sisu to estimate the "treatment effect" that their action had on each subpopulation!

Nuggets

• Robinson Decomposition

$\mathbb{E}(Y_i - m^*(X_i)| X_i, W_i) = \tau^*(X_i) \{W_i - e^*(X_i)\}$
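This is a short computation under Unconfoundedness; a sketch, writing $\mu_w(x) := \mathbb{E}(Y_i(w) | X_i = x)$ (a notation local to this aside):

$\mathbb{E}(Y_i | X_i, W_i) = \mu_0(X_i) + W_i \, \tau^*(X_i)$

$m^*(X_i) = \mathbb{E}(Y_i | X_i) = \mu_0(X_i) + e^*(X_i) \, \tau^*(X_i)$

Subtracting the second identity from the first yields the decomposition.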

• Correspondence between the R-loss and the squared error of the $\tau^*$ approximation

R-loss

$L(\tau) = \mathbb{E} [\left( \{Y_i - m^*(X_i)\} - \tau(X_i) \{ W_i - e^*(X_i)\} \right)^2]$

Correspondence (leverages Overlap for tightness)

$\eta^2 \, \mathbb{E}[(\tau(X_i) - \tau^*(X_i))^2] \le L(\tau) - L(\tau^*) \le (1 - \eta)^2 \, \mathbb{E}[(\tau(X_i) - \tau^*(X_i))^2]$
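To see why: expanding the R-loss around $\tau^*$, the Robinson decomposition makes the cross term vanish, leaving

$L(\tau) - L(\tau^*) = \mathbb{E}[\{W_i - e^*(X_i)\}^2 \, \{\tau(X_i) - \tau^*(X_i)\}^2]$

and since $W_i \in \{0, 1\}$, Overlap pins $\{W_i - e^*(X_i)\}^2$ between $\eta^2$ and $(1 - \eta)^2$ pointwise, giving both inequalities.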

• Feasible R-loss

$\hat{L}(\tau) = \dfrac{1}{n}\sum_{1 \le i \le n} \left(\{Y_i - \hat{m}^{-i}(X_i)\} - \tau(X_i) \{ W_i - \hat{e}^{-i}(X_i)\}\right)^2$

where $\hat{m}^{-i}$ and $\hat{e}^{-i}$ are cross-fitted estimates of $m^*$ and $e^*$, trained without using observation $i$.
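To make the two-step recipe concrete, here is a minimal sketch in Python with scikit-learn. The data-generating process, the nuisance models (gradient boosting), the Lasso final stage, and all hyperparameters are illustrative assumptions, not the paper's exact setup; step 2 rewrites the feasible R-loss as a weighted squared error so an off-the-shelf regressor can minimize it.

```python
# Minimal R-learner sketch (assumptions: scikit-learn models and a
# synthetic data-generating process chosen for illustration only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic data with a known CATE tau*(x) = x_0 (hypothetical choice).
n, d = 2000, 5
X = rng.normal(size=(n, d))
e_star = 1.0 / (1.0 + np.exp(-X[:, 1]))            # true propensity e*(X)
W = rng.binomial(1, e_star)                        # binary treatment
tau_star = X[:, 0]                                 # true treatment effect
Y = X[:, 2] + W * tau_star + rng.normal(size=n)    # observed outcomes

# Step 1: cross-fitted nuisance estimates. Out-of-fold predictions play
# the role of m^{-i}(X_i) and e^{-i}(X_i) in the feasible R-loss.
m_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
e_hat = cross_val_predict(
    GradientBoostingClassifier(), X, W, cv=5, method="predict_proba"
)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)  # enforce overlap numerically

# Step 2: minimize the feasible R-loss over linear tau(x) = x . beta + b.
# With residuals R_y = Y - m_hat and R_w = W - e_hat, each summand
# (R_y - tau(X) R_w)^2 equals R_w^2 (R_y / R_w - tau(X))^2, i.e. a
# weighted regression of R_y / R_w on X with sample weights R_w^2.
R_y = Y - m_hat
R_w = W - e_hat
tau_model = Lasso(alpha=0.01)
tau_model.fit(X, R_y / R_w, sample_weight=R_w ** 2)

tau_hat = tau_model.predict(X)
print("RMSE against the true CATE:", np.sqrt(np.mean((tau_hat - tau_star) ** 2)))
```

The same skeleton accepts any final-stage learner that supports sample weights; the Lasso here just mirrors the rate discussion above.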

Raw Notes

"proof of concept proof" of the isomorphic projection bound in the Lasso linear regression case

Rlearner 2.pdf

If you like applying these kinds of methods to practical ML problems, join our team.