Statistical Inference

By Arnaud Autef - March 9, 2021

In our discussion this week, we review Daniel Russo's paper on Simple Bayesian Algorithms for Best Arm Identification. This paper presents very simple algorithms for sequential decision-making problems where the goal is to identify, as quickly as possible, the best variant out of $k$ possible designs.

- Paper: Daniel Russo, *Simple Bayesian Algorithms for Best Arm Identification*, 29th Annual Conference on Learning Theory (COLT 2016)

- The problem setup is very general and can model many practical problems. A few examples:
  - A/B testing: $2$ designs, A or B; which one is best?
  - Clinical trials with $k$ treatment variants
  - Website optimization: $k$ different webpage layouts

- As the paper's name suggests, the proposed algorithms are very simple:
  - They have a single parameter $\beta$, which can be tuned automatically based on the data gathered.
  - Their implementation mainly requires access to samples from the posterior distribution.

Setup

- $k$ possible *designs*.
- Each design $i \in \{1,~...,~k\}$ has an *unknown* parameter $\theta_i^*$.
- At each time-step $n \in \mathbb{N}$:
  - the experimenter chooses a design $I_n \in \{1,~...,~k\}$
  - the experimenter observes a random outcome $Y_{I_n, n} \sim p(\cdot \mid \theta_{I_n}^*)$
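To make the setup concrete, here is a minimal sketch of the interaction loop, assuming Bernoulli outcomes (the success probabilities play the role of the unknown parameters $\theta_i^*$; the paper's setting is the more general 1D exponential family):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem instance: 3 designs with unknown success probabilities.
theta_star = np.array([0.3, 0.5, 0.7])  # unknown to the experimenter
k = len(theta_star)

def observe(i: int) -> int:
    """Measure design i: observe a random outcome Y ~ Bernoulli(theta_i^*)."""
    return int(rng.random() < theta_star[i])

# One time-step of the loop: choose a design I_n, observe its outcome Y_{I_n, n}.
I_n = int(rng.integers(k))
Y_n = observe(I_n)
```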

Assumptions

- The prior for the unknown parameters $\theta_i^*$ lies in a bounded hyperrectangle of $\mathbb{R}^k$, with uniformly bounded density.
- Observed outcomes $Y_i$ are sampled from a 1D exponential family where the log-partition function is twice differentiable with uniformly bounded first derivative.

Algorithms

- "Top-two" Bayesian algorithms with parameter $\beta$ (a probability).
- At step $t$, given posterior at time $t$ for parameters $\theta$
- Design selected for measurement is selected randomly
- using $\beta$
- Between the top two designs $i,~j$ most likely to be "optimal" at time $t$

- Design selected for measurement is selected randomly

- At step $t$, given posterior at time $t$ for parameters $\theta$
- The three algorithms presented (top-two sampling, top-two value sampling and top-two thompson sampling) each have a different notion of being "optimal" at time $t$.
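As an illustration, here is a sketch of the top-two Thompson sampling variant for a Beta-Bernoulli model (a simplifying assumption; the paper covers general exponential families). With probability $\beta$ the posterior-sample argmax is measured; otherwise sampling repeats until a different design comes out on top:

```python
import numpy as np

rng = np.random.default_rng(0)
k, beta = 3, 0.5

# Beta(1, 1) posteriors for each Bernoulli design: pseudo-counts of
# successes (a) and failures (b), updated as outcomes are observed.
a = np.ones(k)
b = np.ones(k)

def select_design() -> int:
    """One round of top-two Thompson sampling (sketch)."""
    theta = rng.beta(a, b)          # sample parameters from the posterior
    top = int(np.argmax(theta))     # design most likely optimal in this draw
    if rng.random() < beta:
        return top
    # Otherwise, resample until a *different* design is the argmax,
    # and measure that "challenger" design instead.
    while True:
        theta = rng.beta(a, b)
        challenger = int(np.argmax(theta))
        if challenger != top:
            return challenger

def update(i: int, y: int) -> None:
    """Posterior update after observing outcome y for design i."""
    a[i] += y
    b[i] += 1 - y
```

Note that only posterior samples are needed, which is what makes the implementation so light.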

Results

No algorithm can do better than *exponential convergence* of the posterior to the true best design on this problem:

$\limsup_{n \rightarrow +\infty} -\frac{1}{n}\log \Pi_n(\Theta_{I^*}^c) \le \Gamma^*$

- The convergence rate $\Gamma^*$ captures *how difficult* it is to *identify the best design* on a particular problem.

Top-two algorithms *reach this exponential rate* with a properly chosen parameter $\beta^*$.

Top-two algorithms with an adaptive $\beta_t$ *also attain this exponential rate*:

- $\beta_t$ must be set so that it converges to $\beta^*$,
- based only on observations (no prior knowledge of $\beta^*$ required).
Top-two algorithms with a fixed $\beta$ converge exponentially, with a rate $\Gamma_{\beta}$ "not too far" from the optimal rate $\Gamma^*$:

- In particular, $\Gamma_{\beta = \frac{1}{2}} \ge \Gamma^* / 2$
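The quantity $\Pi_n(\Theta_{I^*}^c)$, the posterior mass on "some other design is best", is easy to estimate by Monte Carlo from posterior samples. A sketch under the same Beta-Bernoulli assumption as above, showing the error mass shrinking as the posterior concentrates:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_error_mass(a, b, i_star, n_samples=10_000):
    """Monte Carlo estimate of Pi_n(Theta_{I*}^c): the posterior probability
    that some design other than i_star is best (Beta-Bernoulli sketch)."""
    theta = rng.beta(a, b, size=(n_samples, len(a)))
    return float(np.mean(np.argmax(theta, axis=1) != i_star))

# Two-design example with design 1 truly best: as pseudo-counts grow
# (more observations), the posterior concentrates and the error mass shrinks
# (exponentially fast under the paper's conditions).
early = posterior_error_mass(np.array([3.0, 5.0]), np.array([5.0, 3.0]), i_star=1)
late = posterior_error_mass(np.array([30.0, 50.0]), np.array([50.0, 30.0]), i_star=1)
```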


*If you like applying these kinds of methods to practical ML problems, join our team.*