Causal Inference

By Arnaud Autef - July 6, 2020

In this week's discussion, we review the Synthetic Controls method, which extends the potential outcomes framework from the Causal Inference literature to time-dependent observational data.

**Materials**

- Synthetic Controls paper (2007) by Abadie, Diamond and Hainmueller

**Why Synthetic Controls?**

The Synthetic Control Method extends Potential Outcomes from the Causal Inference literature to time-dependent observational data.

This matters because most real-world business and operational data falls into that category!

E.g.:

Company A assigns promotional campaign P (aka the "treatment") to subpopulation X (aka the "treated unit") from time $t > T_0$.

→ With synthetic controls, we can answer the question:

What effect did this campaign P have on subpopulation X in terms of Y = average CLTV (customer long-term value)?

**Motivation for Sisu?**

The example above is a very likely use case for Sisu as we grow. For now, Sisu identifies interesting subpopulations X to target with a promotional campaign P. In the future, we might want to close the loop and soundly estimate the counterfactual effect of this promotional campaign.

**Nuggets**

**Time-dependent potential outcomes model**

Time steps $t \in \{1, \dots, T\}$, distinct units $j \in \{1, \dots, J\}$, and a time-dependent metric of interest $Y_{j, t}$ for each unit. Potential outcomes for unit $j$ at time $t$ are written:

$\{Y_{j, t}^N,Y_{j,t}^I \}$

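Where:

- $Y_{j, t}^I$ = outcome of unit $j$ at time $t$ if it is treated.
- $Y_{j, t}^N$ = outcome of unit $j$ at time $t$ if it is not treated.

Up to and including time step $T_0$, no unit has been treated, and observations follow the factorized time-series model (with $\delta_t$ a common time effect, $Z_j$ the observed covariates of unit $j$, $\theta_t$ time-varying coefficients, $\lambda_t$ unobserved common factors, $\mu_j$ unit-level factor loadings, and $\epsilon_{j,t}$ noise):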

$Y_{j, t} = Y_{j,t}^N = \delta_t + \theta_t^TZ_j + \lambda_t^T\mu_j + \epsilon_{j,t}$

After time $T_0$, unit $j=1$ is treated and the other units remain untreated, so for $t > T_0$:

$Y_{1, t} = Y_{1,t}^I, \qquad Y_{j, t} = Y_{j,t}^N \text{ for } j > 1$
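To make the model concrete, here is a minimal NumPy sketch that simulates data from the factor model above. All sizes, the noise scale and the constant treatment effect of 2.0 are made-up values for illustration, and arrays are 0-indexed, so row 0 stands for unit $j = 1$ and columns `0..T0-1` stand for $t \le T_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
J, T, T0 = 6, 20, 15   # units, time steps, last untreated time step (illustrative)
r, F = 2, 3            # dims of covariates Z_j and of factors lambda_t (illustrative)

delta = rng.normal(size=T)            # common time effect delta_t
theta = rng.normal(size=(T, r))       # time-varying coefficients theta_t
lam = rng.normal(size=(T, F))         # unobserved common factors lambda_t
Z = rng.normal(size=(J, r))           # observed covariates Z_j
mu = rng.normal(size=(J, F))          # unit-level factor loadings mu_j
eps = 0.1 * rng.normal(size=(J, T))   # noise eps_{j,t}

# Untreated potential outcomes:
# Y_N[j, t] = delta_t + theta_t^T Z_j + lambda_t^T mu_j + eps_{j,t}
Y_N = delta[None, :] + Z @ theta.T + mu @ lam.T + eps

# Observed outcomes: only unit 1 (row 0) is treated, and only after T0.
true_effect = 2.0                     # hypothetical constant tau_t
Y = Y_N.copy()
Y[0, T0:] = Y_N[0, T0:] + true_effect
```

In real data `Y_N[0, T0:]` is never observed, which is exactly the quantity the method has to estimate.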

**Treatment effect definition and estimation**

Definition: Treatment effect $\tau_t$ for treated unit $j = 1$ at times $t \ge T_0 + 1$:

$\tau_t = Y_{1,t}^I - Y_{1, t}^N$

- But, by definition, only $Y_{1, t}^I$ is observed for $t \ge T_0 + 1$!

Estimation strategy - Synthetic Controls Method:

High-level idea:

For $t \le T_0$ we observe $Y_{1, t}^N$

- Fit $Y_{1, t}^N \sim f_\theta(Y_{j>1,t}^N)$ on $t \le T_0$, get $\hat{\theta}$.

For $t > T_0$ we still observe $Y_{j,t}^N$ for untreated units $j > 1$:

- Use the estimate $\hat{Y}_{1, t}^N \approx f_{\hat{\theta}}(Y_{j>1, t}^N)$

So that the treatment effect estimate is:

$\hat{\tau}_t := Y_{1, t} - f_{\hat{\theta}}(Y_{j>1,t})$

In practice:

- Restrict the class of fitting functions $\{f_{\theta},~\theta \in \Theta\}$ to convex combinations of the untreated units $1 < j \le J$:
  - $\Theta = \Delta^{J -1}$, the set of non-negative weights over the $J - 1$ untreated units that sum to 1
  - $\forall X,~f_{\theta}(X) = \sum_{j = 2}^{J}\theta_jX_j$
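The fitting step can be sketched in a few lines of NumPy. This is one possible way to do it, not the paper's procedure: it solves least squares over the simplex with projected gradient descent, and the helper names (`project_to_simplex`, `fit_weights`) and the toy data are assumptions made for this example.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def fit_weights(Y_controls_pre, y_treated_pre, n_iter=5000):
    """Least squares over the simplex via projected gradient descent.

    Y_controls_pre: (J-1, T0) outcomes of untreated units for t <= T0.
    y_treated_pre:  (T0,) outcomes of the treated unit for t <= T0.
    """
    X = Y_controls_pre.T                          # (T0, J-1) design matrix
    w = np.full(X.shape[1], 1.0 / X.shape[1])     # start from uniform weights
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y_treated_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Toy data: the treated unit (row 0) is a noisy convex combination of two controls.
rng = np.random.default_rng(0)
J, T, T0 = 6, 50, 40
Y = rng.normal(loc=1.0, size=(J, T))
w_true = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
Y[0] = w_true @ Y[1:] + 0.05 * rng.normal(size=T)
Y[0, T0:] += 2.0                                  # hypothetical treatment effect

w_hat = fit_weights(Y[1:, :T0], Y[0, :T0])        # fit on t <= T0 only
tau_hat = Y[0, T0:] - w_hat @ Y[1:, T0:]          # estimated effect for t > T0
```

On this toy data the fitted weights should land close to `w_true` and `tau_hat` close to the injected effect. A generic constrained solver would work just as well; projected gradient is used here only to keep the example dependency-free.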

**Main theoretical result:**

Under assumptions:

(~SUTVA) The treatment of unit $1$ has no indirect effect on units $j > 1$.

(Controls approximate well the treated unit) There exists $\boldsymbol w^* \in \Delta^{J -1}$ such that:

$\tag{1} \forall t \le T_0,~Y_{1, t} = Y_{1, t}^N = \sum_{j = 2}^{J}w^*_j Y_{j, t}$

$\tag{2} Z_1 = \sum_{j = 2}^{J}w^*_j Z_{j}$

(~No confounding) Noise terms $\epsilon_{j,t}$ are i.i.d. with mean $0$ and $\mathbb{E}(\epsilon_{j,t}|Z_j,\mu_j) = 0$

The Synthetic Controls estimator is asymptotically unbiased (for large $J$, $T_0$):

$\mathbb{E}(|\tau_t - \hat{\tau}_t|) \rightarrow 0$
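To see where the bias comes from, here is a sketch of the key decomposition, assuming the factor model above and any fixed weights $w$ in the simplex. It is a reading aid, not the paper's full proof. For $t > T_0$, untreated units satisfy $Y_{j,t} = Y_{j,t}^N$, and because the weights sum to one the common term $\delta_t$ cancels:

$\hat{\tau}_t - \tau_t = Y_{1,t}^N - \sum_{j = 2}^{J} w_j Y_{j,t} = \theta_t^T\Big(Z_1 - \sum_{j = 2}^{J} w_j Z_j\Big) + \lambda_t^T\Big(\mu_1 - \sum_{j = 2}^{J} w_j \mu_j\Big) + \sum_{j = 2}^{J} w_j(\epsilon_{1,t} - \epsilon_{j,t})$

With $w = w^*$, the first term vanishes by (2), and the last term has mean zero for fixed weights. What remains is the mismatch in the unobserved $\mu_j$, which we cannot check directly. Roughly speaking, the paper's argument is that matching the pre-treatment outcomes as in (1) over a long pre-period forces this mismatch to be small.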

**Practical considerations**

BIGGEST CAVEAT: convex combinations are a very restrictive class of approximating functions. If no convex combination of the controls fits the treated unit well, the estimated treatment effects can be heavily polluted by the bias of the fit → the method requires a good control group.

Model estimation

To fit the weights $w^*$ of the convex combination, the authors advise using regularization and validation (if there is enough data).

Inference

How do we estimate the significance of the estimated treatment effects? → In the original paper, the authors propose "Placebo tests": are the treatment effects estimated for treated unit $1$ much larger than the treatment effect estimates we would have obtained by applying Synthetic Controls to each untreated unit $j > 1$?
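A placebo test can be sketched as follows. Several choices here are assumptions for the example rather than something stated above: the test statistic is the ratio of post-period to pre-period root-mean-squared gap (one common choice), the treated unit is excluded from the donor pool of each placebo run, the weights are fit by projected gradient descent, and the data is simulated. All function names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def fit_weights(Y_donors_pre, y_pre, n_iter=3000):
    """Simplex-constrained least squares via projected gradient descent."""
    X = Y_donors_pre.T
    w = np.full(X.shape[1], 1.0 / X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        w = project_to_simplex(w - step * X.T @ (X @ w - y_pre))
    return w

def rmspe_ratio(Y, unit, donors, T0):
    """Post/pre root-mean-squared gap between `unit` and its synthetic control.

    Y has shape (J, T): rows are units, columns are time steps.
    """
    w = fit_weights(Y[donors, :T0], Y[unit, :T0])
    gap = Y[unit] - w @ Y[donors]
    return np.sqrt(np.mean(gap[T0:] ** 2)) / np.sqrt(np.mean(gap[:T0] ** 2))

def placebo_p_value(Y, T0, treated=0):
    """Share of units whose ratio is at least as large as the treated unit's."""
    controls = [j for j in range(Y.shape[0]) if j != treated]
    ratios = {treated: rmspe_ratio(Y, treated, controls, T0)}
    for j in controls:
        donors = [i for i in controls if i != j]   # pretend unit j was treated
        ratios[j] = rmspe_ratio(Y, j, donors, T0)
    n_as_extreme = sum(r >= ratios[treated] for r in ratios.values())
    return n_as_extreme / len(ratios), ratios

# Simulated data from a one-factor model, with an effect of 3.0 on unit 0 after T0.
rng = np.random.default_rng(1)
J, T, T0 = 20, 40, 30
delta = np.linspace(0.0, 2.0, T)
lam = rng.normal(size=T)
mu = rng.uniform(0.0, 1.0, size=J)
mu[0] = 0.5
Y = delta[None, :] + np.outer(mu, lam) + 0.1 * rng.normal(size=(J, T))
Y[0, T0:] += 3.0

p_value, ratios = placebo_p_value(Y, T0)
```

Dividing by the pre-period gap keeps badly fitted placebo units from looking like large effects. With $J$ units the smallest attainable p-value is $1/J$, so this kind of test only has teeth when the control group is reasonably large.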

*If you like applying these kinds of methods to practical ML problems, join our team.*