
Simple Bayesian Algorithms for Best Arm Identification | March 9, 2021

By Arnaud Autef

In our discussion this week, we review Daniel Russo's paper Simple Bayesian Algorithms for Best Arm Identification. The paper presents very simple algorithms for sequential decision-making problems where the goal is to identify, as quickly as possible, the best variant out of $k$ possible designs.


  • Paper link: Daniel Russo, 29th Annual Conference on Learning Theory (COLT 2016)

Why this method?

  • The problem setup is very general and can model many practical problems.
    • A few examples:
      • A/B testing: $2$ designs, A or B; which one is best?
      • Clinical trials with $k$ treatment variants
      • Website optimization: $k$ different webpage layouts
  • As the paper's name suggests, the proposed algorithms are very simple:
    • They have a single parameter $\beta$, which can be tuned automatically based on the data gathered.
    • Their implementation mainly requires access to samples from the posterior distribution.



Problem setup

  • $k$ possible designs.
  • Each design $i \in \{1,~...,~k\}$ has an unknown parameter $\theta_i^*$
  • At each time-step $n \in \mathbb{N}$:
    1. the experimenter chooses a design $I_n \in \{1,~...,~k\}$
    2. the experimenter observes a random outcome $Y_{I_n, n} \sim p(\cdot \mid \theta_{I_n}^*)$
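As a concrete instance of this interaction loop, here is a minimal sketch assuming Bernoulli-distributed outcomes (the true parameters `true_theta` below are hypothetical, and the paper allows any 1D exponential family, not just Bernoulli):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true success probabilities for k = 3 designs;
# design 0 is the (unknown to the experimenter) best design.
true_theta = np.array([0.5, 0.3, 0.1])

def measure(i):
    """Choose design i and observe a Bernoulli outcome Y ~ p(. | theta_i*)."""
    return int(rng.random() < true_theta[i])

# One round of the sequential experiment: pick a design, observe an outcome.
outcome = measure(0)  # 0 or 1
```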


Assumptions

  • The prior over the unknown parameters $\theta_i^*$ is supported on a bounded hyperrectangle of $\mathbb{R}^k$, with a uniformly bounded density.
  • Observed outcomes $Y_i$ are sampled from a 1D exponential family whose log-partition function is twice differentiable with a uniformly bounded first derivative.
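For instance, Bernoulli outcomes with success probability $\theta \in (0, 1)$ satisfy the second assumption. In natural-parameter form,

$$p(y \mid \eta) = \exp\left(\eta y - A(\eta)\right), \qquad \eta = \log\frac{\theta}{1 - \theta}, \qquad A(\eta) = \log\left(1 + e^{\eta}\right),$$

and the log-partition function $A$ is twice differentiable with $A'(\eta) = \frac{e^{\eta}}{1 + e^{\eta}} = \theta \in (0, 1)$, hence uniformly bounded.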


Algorithms

  • "Top-two" Bayesian algorithms with a parameter $\beta$ (a probability).
    • At step $t$, given the posterior at time $t$ for the parameters $\theta$:
      • the two designs $i,~j$ most likely to be "optimal" at time $t$ are identified,
      • the design to measure is selected randomly between these top two, using $\beta$.
  • The three algorithms presented (top-two probability sampling, top-two value sampling, and top-two Thompson sampling) each use a different notion of being "optimal" at time $t$.
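As an illustration, here is a minimal sketch of the top-two Thompson sampling selection rule, assuming Beta posteriors over Bernoulli designs (this conjugate setup and the parameter names are our own choices for the sketch, not prescribed by the paper):

```python
import numpy as np

def tts_select(alpha, beta_post, rng, beta=0.5):
    """One selection step of top-two Thompson sampling.

    alpha, beta_post: arrays of Beta posterior parameters, one per design.
    beta: probability of measuring the top candidate rather than the challenger.
    """
    # Sample a parameter vector from the posterior; its argmax is the
    # first candidate, i.e. the design "optimal" under this draw.
    theta = rng.beta(alpha, beta_post)
    top = int(np.argmax(theta))
    if rng.random() < beta:
        return top
    # Otherwise, resample the posterior until a *different* design comes
    # out on top: that design is the challenger, and it gets measured.
    while True:
        theta = rng.beta(alpha, beta_post)
        challenger = int(np.argmax(theta))
        if challenger != top:
            return challenger
```

Note that as the posterior concentrates on a single design, the resampling loop can take many iterations before a different design wins a draw; practical implementations often cap it.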


Main results

  1. No algorithm can do better than exponential convergence

    • (of the posterior to the true best design on this problem)

    $$\limsup_{n \rightarrow +\infty} -\frac{1}{n}\log \Pi_n(\Theta_{I^*}^c) \le \Gamma^*$$

    • The convergence rate $\Gamma^*$ captures how difficult it is to identify the best design on a particular problem
  2. Top-two algorithms reach this exponential rate with a proper parameter $\beta^*$

  3. Top-two algorithms with an adaptive $\beta_t$ also attain this exponential rate

    • $\beta_t$ needs to be set so that it converges to $\beta^*$,
      • based only on observations (no knowledge of $\beta^*$ required)
  4. Top-two algorithms with a fixed $\beta$ converge exponentially at a rate $\Gamma_\beta$; for $\beta = \frac{1}{2}$, this rate is "not too far" from the optimal rate $\Gamma^*$

    • In particular, $\Gamma_{\beta = \frac{1}{2}} \ge \Gamma^* / 2$
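To make the posterior-convergence statement concrete, here is a small simulation (our own, not from the paper) that estimates $\Pi_n(\Theta_{I^*}^c)$, the posterior probability that the best design is *not* optimal, under a simple round-robin measurement scheme with Bernoulli outcomes and Beta posteriors:

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = np.array([0.5, 0.3, 0.1])  # hypothetical Bernoulli designs; 0 is best
k = len(true_theta)
a, b = np.ones(k), np.ones(k)  # Beta(1, 1) posteriors, one per design

def prob_not_best(i_star=0, n_draws=4000):
    """Monte Carlo estimate of the posterior mass where design i_star is not optimal."""
    draws = rng.beta(a, b, size=(n_draws, k))
    return float(np.mean(np.argmax(draws, axis=1) != i_star))

history = []
for n in range(1, 1201):
    arm = n % k  # round-robin allocation (not one of the paper's algorithms)
    y = int(rng.random() < true_theta[arm])
    a[arm] += y
    b[arm] += 1 - y
    if n % 300 == 0:
        history.append(prob_not_best())

print(history)  # this posterior mass shrinks quickly as n grows
```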

