
# Simple Bayesian Algorithms for Best Arm Identification

By Arnaud Autef - March 9, 2021

In our discussion this week, we review Daniel Russo's paper on Simple Bayesian Algorithms for Best Arm Identification. This paper presents very simple algorithms for sequential decision-making problems where the goal is to identify, as quickly as possible, the best variant out of $k$ possible designs.

### Materials

• Paper link: Daniel Russo, *Simple Bayesian Algorithms for Best Arm Identification*, 29th Annual Conference on Learning Theory (COLT 2016)

### Why this method?

• The problem setup is very general and can model many practical problems
• A few examples
• A/B testing: $2$ designs A or B, which one is best?
• Clinical trials for $k$ treatment variants
• Website optimization: $k$ different webpage layouts
• As the title suggests, the proposed algorithms are very simple
• They have a single parameter $\beta$, which can be tuned automatically from the data gathered.
• Their implementation mainly requires access to samples from the posterior distribution.
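To illustrate that last point, here is a minimal sketch for a hypothetical Beta-Bernoulli setting (the counts below are made up for illustration): with a conjugate prior, drawing a posterior sample is a single library call.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical A/B test data: observed successes/failures per design.
successes = np.array([42, 51])
failures = np.array([58, 49])

# With Beta(1, 1) priors and Bernoulli outcomes, the posterior over each
# design's success rate is Beta(1 + successes, 1 + failures), so sampling
# from the posterior is one vectorized call.
theta = rng.beta(1 + successes, 1 + failures)
best = int(np.argmax(theta))  # design favored by this posterior draw
```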

### Nuggets

Setup

• $k$ possible designs.
• Each design $i \in \{1,~...,~k\}$ has unknown parameter $\theta_i^*$
• At each time-step $n \in \mathbb{N}$
1. experimenter chooses design $I_n \in \{1,~...,~k\}$
2. experimenter observes a random outcome $Y_{I_n, n} \sim p(.|\theta_{I_n}^*)$

Assumptions

• The prior over the unknown parameters $\theta_i^*$ is supported on a bounded hyperrectangle of $\mathbb{R}^k$ and has a uniformly bounded density.
• Observed outcomes $Y_i$ are drawn from a one-dimensional exponential family whose log-partition function is twice differentiable with a uniformly bounded first derivative.

Algorithms

• "Top-two" Bayesian algorithms with a parameter $\beta$ (a probability).
• At step $t$, given the posterior over the parameters $\theta$ at time $t$:
• the algorithm identifies the top two designs $i,~j$ most likely to be "optimal" at time $t$,
• then randomly selects the design to measure between these two, playing the top design with probability $\beta$ and the second with probability $1 - \beta$.
• The three algorithms presented (top-two probability sampling, top-two value sampling, and top-two Thompson sampling) each use a different notion of being "optimal" at time $t$.
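As a concrete illustration, here is a minimal sketch of top-two Thompson sampling for a hypothetical $k = 3$ Bernoulli problem with independent Beta posteriors (the arm means, priors, and horizon below are assumptions for the example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def ttts_select(alpha, beta_post, beta=0.5, max_resample=100):
    """One step of top-two Thompson sampling for Bernoulli designs
    with independent Beta(alpha[i], beta_post[i]) posteriors."""
    theta = rng.beta(alpha, beta_post)   # posterior sample per design
    top = int(np.argmax(theta))
    if rng.random() < beta:
        return top
    # Otherwise, resample until a *different* design comes out on top
    # (the challenger), and measure that one instead.
    for _ in range(max_resample):
        theta = rng.beta(alpha, beta_post)
        challenger = int(np.argmax(theta))
        if challenger != top:
            return challenger
    return top  # fallback if the posterior is nearly degenerate

# Usage: run on a hypothetical 3-design Bernoulli problem.
true_means = np.array([0.3, 0.5, 0.6])
alpha = np.ones(3)      # Beta(1, 1) uniform priors
beta_post = np.ones(3)
for n in range(2000):
    i = ttts_select(alpha, beta_post, beta=0.5)
    y = rng.random() < true_means[i]     # observe a Bernoulli outcome
    alpha[i] += y
    beta_post[i] += 1 - y
print(int(np.argmax(alpha / (alpha + beta_post))))  # best-design estimate
```

With $\beta = \frac{1}{2}$, measurement effort is split roughly evenly between the current best design and its strongest challenger, rather than concentrating all samples on the apparent best.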

Results

1. No algorithm can do better than exponential convergence of the posterior to the true best design on a given problem:

$\limsup_{n \rightarrow +\infty}-\frac{1}{n}\log \Pi_n(\Theta_{I^*}^c) \le \Gamma^*$

• The convergence rate $\Gamma^*$ captures how difficult it is to identify the best design in a particular problem
2. Top-two algorithms attain this optimal exponential rate when run with a properly chosen parameter $\beta = \beta^*$

3. Top-two algorithms with an adaptive $\beta_t$ also attain this exponential rate

• $\beta_t$ must be tuned so that it converges to $\beta^*$,
• and this tuning can be done from observations alone (no prior knowledge of $\beta^*$ is required)
4. Top-two algorithms with a fixed $\beta = \frac{1}{2}$ still converge exponentially, at a rate $\Gamma_{\beta=\frac{1}{2}}$ "not too far" from the optimal rate $\Gamma^*$

• In particular, $\Gamma_{\beta = \frac{1}{2}} \ge \Gamma^* / 2$
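To make the exponential decay of result 1 concrete, the sketch below tracks a Monte Carlo estimate of $\Pi_n(\Theta_{I^*}^c)$, the posterior probability that the true best design is not optimal, on a hypothetical two-design Bernoulli problem (the round-robin allocation here is for illustration only and does not attain $\Gamma^*$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-design Bernoulli problem; design 1 (mean 0.6) is best.
true_means = np.array([0.4, 0.6])
alpha = np.ones(2)      # Beta(1, 1) priors
beta_post = np.ones(2)

def prob_wrong(alpha, beta_post, draws=20_000):
    """Monte Carlo estimate of Pi_n(Theta_{I*}^c): the posterior
    probability that design 1 is *not* the best."""
    theta = rng.beta(alpha, beta_post, size=(draws, 2))
    return float(np.mean(np.argmax(theta, axis=1) != 1))

for n in range(1, 1001):
    i = n % 2                          # round-robin allocation, for illustration
    y = rng.random() < true_means[i]   # observe a Bernoulli outcome
    alpha[i] += y
    beta_post[i] += 1 - y
    if n in (100, 300, 1000):
        print(n, prob_wrong(alpha, beta_post))
```

The printed probabilities shrink roughly geometrically in $n$, i.e. $\Pi_n(\Theta_{I^*}^c) \approx e^{-\Gamma n}$ for some problem- and allocation-dependent rate $\Gamma$.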


If you like applying these kinds of methods to practical ML problems, join our team.