Causal Inference

By Arnaud Autef - December 7, 2020

This week we review the R-Learner, a 2-step causal inference algorithm to estimate heterogeneous treatment effects from observational data.

**Materials**

- Paper by Xinkun Nie and Stefan Wager
- Presentation slides (overview of the paper and experimental results, no proof)

**Why the R-learner**

Setup

Observations $Y_i$, features $X_i$, binary treatment decision $W_i$ for each unit $i$, with assumptions:

**Overlap:** $\forall X_i$, $\eta \le \mathbb{P}(W_i = 1 | X_i) \le 1 - \eta$ for some $0 < \eta < 1$

**Un-confoundedness:** $\{Y_i(0), Y_i(1)\} \perp W_i \,|\, X_i$

*$\rightarrow$ the treatment decision and the potential outcomes are independent given the features*

Goal

Efficiently approximate $\tau^*(x) := \mathbb{E}(Y_i(1) - Y_i(0)|X_i = x)$

Practical motivation

- Quite general assumptions, much less constrained than strict experimental setups like randomized controlled trials (RCTs).
- Answers a question of real practical importance: *"What is the expected effect of the treatment on unit $i$, given its features $X_i$?"*

Theoretical motivation

$\hat{\tau}$ obtained via the R-learner achieves asymptotic error rates of the same scale as $\tilde{\tau}$, where $\tilde{\tau}$ is an "oracle" learner that knows the following functions exactly:

$m^*(X_i) := \mathbb{E}(Y_i | X_i)$ $e^*(X_i) := \mathbb{E}(W_i | X_i)$

In the paper, error rates are obtained when the $\tau$ functions are approximated via penalized kernel regression.

In the raw notes below, we sketch the proof when $\tau$ functions are approximated via Lasso-penalized linear regression. In this case, we get:

$R(\hat{\tau}),~R(\tilde{\tau}) = \tilde{\mathcal{O}}_P(\dfrac{1}{\sqrt n})$

At Sisu?

- Today, Sisu customers can use our software to identify subpopulations of their dataset that impact changes in their metric of interest.
- Tomorrow, with causal inference, if a customer takes action on those subpopulations, they could come back to Sisu to estimate the *"treatment effect"* their action had on each subpopulation!

**Nuggets**

Robinson Decomposition

$\mathbb{E}(Y_i - m^*(X_i)| X_i, W_i) = \tau^*(X_i) \{W_i - e^*(X_i)\}$
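Robinson's decomposition can be sanity-checked numerically: under a data-generating process where $m^*$, $e^*$, and $\tau^*$ are all known, the average of $Y_i - m^*(X_i)$ within any $(X, W)$ cell should match the average of $\tau^*(X_i)\{W_i - e^*(X_i)\}$ in that cell. A minimal Monte Carlo sketch (the synthetic DGP below is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative DGP where all nuisance functions are known in closed form.
X = rng.uniform(-1, 1, n)
e_star = 0.3 + 0.4 * (X > 0)      # propensity e*(x) = P(W=1|x), within [0.3, 0.7]
tau_star = 1.0 + X                # treatment effect tau*(x)
b = np.sin(X)                     # baseline outcome E[Y(0)|x]
W = rng.binomial(1, e_star)
Y = b + W * tau_star + rng.normal(0.0, 1.0, n)

# m*(x) = E[Y|x] = b(x) + e*(x) * tau*(x)
m_star = b + e_star * tau_star

# Robinson: E[Y - m*(X) | X, W] = tau*(X) * (W - e*(X)).
# Compare the two sides averaged over one (X-bin, W) cell:
mask = (X > 0) & (W == 1)
lhs = np.mean((Y - m_star)[mask])
rhs = np.mean((tau_star * (W - e_star))[mask])
print(lhs, rhs)  # the two averages should agree up to Monte Carlo noise
```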

Correspondence between the R-loss and the squared error of the $\tau^*$ approximation

R-loss

$L(\tau) = \mathbb{E} \left[\left( \{Y_i - m^*(X_i)\} - \tau(X_i) \{ W_i - e^*(X_i)\} \right)^2\right]$

Correspondence (leverages **Overlap** for tightness):

$\eta^2 \, \mathbb{E}[(\tau(X_i) - \tau^*(X_i))^2] \le L(\tau) - L(\tau^*) \le (1 - \eta)^2 \, \mathbb{E}[(\tau(X_i) - \tau^*(X_i))^2]$
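This sandwich can also be checked by simulation: for any candidate $\tau$, the excess R-loss $L(\tau) - L(\tau^*)$ should land between $\eta^2$ and $(1-\eta)^2$ times the mean squared error of $\tau$, because the treatment residual $(W_i - e^*(X_i))^2$ is pointwise bounded by $[\eta^2, (1-\eta)^2]$ under **Overlap**. A sketch under an illustrative synthetic DGP (all choices below are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
eta = 0.3  # overlap level: e*(x) stays in [eta, 1 - eta]

# Illustrative DGP with known nuisances.
X = rng.uniform(-1, 1, n)
e_star = 0.3 + 0.4 * (X > 0)          # propensity within [0.3, 0.7]
tau_star = 1.0 + X                    # true treatment effect
m_star = np.sin(X) + e_star * tau_star
W = rng.binomial(1, e_star)
Y = np.sin(X) + W * tau_star + rng.normal(0.0, 1.0, n)

def r_loss(t):
    """Empirical R-loss of a candidate effect function evaluated at X."""
    return np.mean(((Y - m_star) - t * (W - e_star)) ** 2)

tau_candidate = 1.5 + 0.5 * X          # some tau != tau*
excess = r_loss(tau_candidate) - r_loss(tau_star)
mse = np.mean((tau_candidate - tau_star) ** 2)
print(eta**2 * mse, excess, (1 - eta)**2 * mse)  # excess should sit in between
```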

Feasible R-loss

$\hat{L}(\tau) = \dfrac{1}{n}\sum_{1 \le i \le n} \left(\{Y_i - \hat{m}^{-i}(X_i)\} - \tau(X_i) \{ W_i - \hat{e}^{-i}(X_i)\}\right)^2$
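Putting the two steps together — cross-fitted nuisance estimates $\hat{m}^{-i}$, $\hat{e}^{-i}$, then minimization of the feasible R-loss — can be sketched with scikit-learn. This is a minimal sketch, not the paper's experimental setup: the synthetic data and the choice of `LassoCV` / `LogisticRegressionCV` for the nuisances are illustrative assumptions. For a linear $\tau$, minimizing $\hat{L}$ reduces to ordinary least squares on residualized data:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 5_000

# Illustrative observational data with tau*(x) = 1 + x_0 (not from the paper).
X = rng.normal(size=(n, 3))
e_star = 1.0 / (1.0 + np.exp(-X[:, 1]))   # propensity depends on x_1
W = rng.binomial(1, e_star)
tau_star = 1.0 + X[:, 0]
Y = X[:, 2] + W * tau_star + rng.normal(0.0, 1.0, n)

# Step 1: cross-fitted (out-of-fold) nuisance estimates m_hat^{-i}, e_hat^{-i}.
m_hat = cross_val_predict(LassoCV(), X, Y, cv=5)
e_hat = cross_val_predict(LogisticRegressionCV(), X, W, cv=5,
                          method="predict_proba")[:, 1]

# Step 2: minimize the feasible R-loss over linear tau(x) = <beta, (1, x)>.
# sum_i ({Y_i - m_hat} - tau(X_i){W_i - e_hat})^2 is least squares on the
# design (1, x) scaled row-wise by the treatment residual W_i - e_hat.
resid_y = Y - m_hat
resid_w = W - e_hat
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A * resid_w[:, None], resid_y, rcond=None)
tau_hat = A @ beta
mse_tau = np.mean((tau_hat - tau_star) ** 2)
print(beta, mse_tau)  # beta should be close to (1, 1, 0, 0)
```

Note that even though the Lasso fit of $\hat{m}$ is misspecified here (the true $m^*$ is nonlinear), the R-loss's residual-on-residual structure keeps the $\tau$ estimate accurate as long as one nuisance is well estimated.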

**Raw Notes**

"proof of concept proof" of the isomorphic projection bound in the Lasso linear regression case

*If you like applying these kinds of methods to practical ML problems, join our team.*