Bandit Algorithms for Website Optimization #
An introductory book on Multi-armed bandit algorithm. Examples are written in Python.

Preface #

Chapter 1. Two Characters: Exploration and Exploitation #

Exploration and Exploitation:

  • Exploitation: An algorithm for solving the Multiarmed Bandit Problem exploits if it plays the arm with the highest estimated value based on previous plays.
  • Exploration: An algorithm for solving the Multiarmed Bandit Problem explores if it plays any arm that does not have the highest estimated value based on previous plays. In other words, exploration occurs whenever exploitation does not.

The Explore-Exploit Dilemma: need to (A) learn about new ideas (which we'll always call exploring from now on), while you also need to (B) take advantage of the best of your old ideas (which we'll always call exploiting from now on).

Chapter 2. Why Use Multiarmed Bandit Algorithms? #

Shortcomings of A-B testing:

Standard A/B testing consists of:

  • A short period of pure exploration, in which you assign equal numbers of users to Groups A and B.
  • A long period of pure exploitation, in which you send all of your users to the more successful version of your site and never come back to the option that seemed to be inferior.

Why might this be a bad strategy?

  • It jumps discretely from exploration into exploitation, when you might be able to smoothly transition between the two.
  • During the purely exploratory phase, it wastes resources exploring inferior options in order to gather as much data as possible. But you shouldn't want to gather data about strikingly inferior options.

Strengths of Multi-armed bandit algorithm:

Bandit algorithms provide solutions to both of these problems: (1) they smoothly decrease the amount of exploring they do over time instead of requiring you to make a sudden jump and (2) they focus your resources during exploration on the better options instead of wasting time on the inferior options that are over-explored during typical A/B testing.

Chapter 3. The epsilon-Greedy Algorithm #

What the algorithm does:

...the term epsilon in the algorithm's name refers to the odds that the algorithm explores instead of exploiting. ... the big idea behind the epsilon-Greedy algorithm really is that simple: if you flip a coin and it comes up heads, you should explore for a moment. But if the coin comes up tails, you should exploit.
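The coin-flip logic above can be sketched in a few lines of Python. This is a minimal illustrative version, not the book's exact listing; the class and attribute names (`EpsilonGreedy`, `counts`, `values`) are assumptions:

```python
import random

class EpsilonGreedy:
    """Minimal sketch of the epsilon-Greedy algorithm (names are illustrative)."""

    def __init__(self, epsilon, n_arms):
        self.epsilon = epsilon        # probability of exploring
        self.counts = [0] * n_arms    # number of plays per arm
        self.values = [0.0] * n_arms  # estimated value per arm

    def select_arm(self):
        # With probability epsilon, explore a random arm;
        # otherwise exploit the arm with the highest estimated value.
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return self.values.index(max(self.values))

    def update(self, arm, reward):
        # Incremental running average of the rewards seen for this arm.
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
```

The `update` rule keeps a plain running average, so every reward an arm has ever produced counts equally toward its estimated value.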

Chapter 4. Debugging Bandit Algorithms #

Introduces Monte Carlo method and describes how to use the method to test the epsilon-Greedy algorithm.
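The idea can be sketched as follows: simulate arms as Bernoulli draws with known payout rates, run the algorithm against them many times, and check that its average reward approaches the best arm's true rate. This is an illustrative self-contained sketch, not the book's framework; the function names and parameters are assumptions:

```python
import random

def bernoulli_draw(p):
    """Simulated arm: pays 1.0 with probability p, else 0.0."""
    return 1.0 if random.random() < p else 0.0

def run_monte_carlo(probs, epsilon, horizon, seed=0):
    """One simulated episode of epsilon-Greedy against Bernoulli arms.

    Returns the average reward per play, which we can compare against
    the best arm's true rate to judge how well the algorithm performed.
    """
    random.seed(seed)
    counts = [0] * len(probs)
    values = [0.0] * len(probs)
    total = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(len(probs))  # explore
        else:
            arm = values.index(max(values))     # exploit
        reward = bernoulli_draw(probs[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total / horizon
```

Because the true payout rates are known in the simulation, we can verify debugging claims like "the algorithm's average reward eventually exceeds what random guessing would earn."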

Chapter 5. The Softmax Algorithm #

Problems of epsilon-Greedy:

For example, in one scenario (call it Scenario A), you might have two arms, one of which rewards you 10% of the time and the other 13% of the time. In another scenario (call it Scenario B), one arm might reward you 10% of the time and the other 99% of the time. In both of these scenarios, the probability that the epsilon-Greedy algorithm explores the worse arm is exactly the same (it's epsilon / 2), despite the inferior arm in Scenario B being, in relative terms, much worse than the inferior arm in Scenario A.


We need structured exploration rather than the haphazard exploration that the epsilon-Greedy algorithm provides.

The Softmax Algorithm:

Suppose that you have two arms, A and B. Now imagine that, based on your past experiences, these two arms have had two different rates of success: rA and rB. With those assumptions, the most naive possible implementation of a Softmax-like algorithm would have you choose Arm A with probability \( \frac{ rA }{ rA + rB } \) and Arm B with probability \( \frac{ rB }{ rA + rB } \).

... To reconstruct the algorithm people actually use, we need to make two changes to it.

First, we will calculate a different scale for reward rates by exponentiating our estimates of rA and rB. Using this new scale, we will choose Arm A with probability

$$ \frac{ e^{rA} }{ e^{rA} + e^{rB} } $$

and Arm B with probability

$$ \frac{ e^{rB} }{ e^{rA} + e^{rB} } $$

This naive exponential rescaling has the virtue of not behaving strangely if someone used negative numbers as rates of success. ...

More importantly, this exponentiation trick brings us very close to the full Softmax algorithm. In fact, plain exponential rescaling gives us the Softmax algorithm if you hardcoded one of the configurable parameters that the standard Softmax algorithm possesses. This additional parameter is a different sort of scaling factor than the exponentiation we just introduced.

This new type of scaling factor is typically called a temperature parameter based on analogy with physics in which systems at high temperatures tend to behave randomly, while they take on more structure at low temperatures. ... We'll call this new temperature parameter tau. ...

  • At time T, select one of two arms with probability computed as follows:

    $$ \frac{ e^{rA / \tau} }{ e^{rA / \tau} + e^{rB / \tau} } \qquad \frac{ e^{rB / \tau} }{ e^{rA / \tau} + e^{rB / \tau} } $$

  • For whichever arm you picked, update your estimate of the mean using the same update rule we used for the epsilon-Greedy algorithm.
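The selection step above can be sketched as a small function that turns the estimated values into probabilities and samples an arm. This is an illustrative sketch (the function name and argument names are assumptions, not the book's listing):

```python
import math
import random

def softmax_select(values, tau):
    """Pick an arm with probability proportional to exp(value / tau).

    `values` holds the current estimated reward rates for each arm;
    `tau` is the temperature parameter described in the text.
    """
    z = sum(math.exp(v / tau) for v in values)
    probs = [math.exp(v / tau) / z for v in values]
    # Sample an arm index according to the computed probabilities.
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1
```

With two arms whose estimates are 1.0 and 2.0 at tau = 1.0, the better arm is chosen about 73% of the time rather than always, which is the structured exploration the chapter asks for.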

What the tau does:

It's easiest to think of tau as letting us shift the behavior of the Softmax algorithm along a continuum defined by two extreme ways to select arms. At one extreme, we set tau = 0.0. This will give us a fully deterministic choice of the arm that has the highest estimated value. At the other extreme, we set tau = Inf, which gives us purely random exploration of the sort the epsilon-Greedy algorithm performs during its exploration phase.

Annealing: It's often a good idea to encourage an algorithm to explore less over time. In the Softmax algorithm, we can achieve that by slowly decreasing the temperature, which we call annealing.
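One common schedule lets the temperature decay logarithmically with the number of plays. The following sketch is illustrative; the exact schedule and the small constant (used to avoid log(1) = 0) are design choices, not the only options:

```python
import math

def annealed_temperature(t):
    """Temperature schedule that starts hot and cools as plays accumulate.

    At t = 1 the temperature is enormous, so arm choices are nearly
    uniform; as t grows, tau shrinks and the algorithm exploits more.
    The tiny constant keeps the denominator away from log(1) = 0.
    """
    return 1.0 / math.log(t + 0.0000001)
```

In an annealing Softmax, `t` would typically be the total number of plays so far, recomputed on every call to the selection step.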

Chapter 6. UCB - The Upper Confidence Bound Algorithm #

A weakness of epsilon-Greedy and Softmax:

...they don't keep track of how much they know about any of the arms available to them. They pay attention only to how much reward they've gotten from the arms. This means that they'll underexplore options whose initial experiences were not rewarding, even though they don't have enough data to be confident about those arms. We can do better by using an algorithm that pays attention to not only what it knows, but also how much it knows.

Upper Confidence Bound Algorithm:

The algorithm, UCB, that we'll present in this chapter does exactly this.


Why is it important to keep track of our confidence in the values of the arms? The reason has to do with the nature of the rewards we receive from arms: they're noisy. If we use our past experiences with an arm, then the estimated value of any arm is always a noisy estimate of the true Return on Investment we can expect from it. Because of this noise, it might just be a coincidence that Arm A seems better than Arm B; if we had more experience with both arms, we'd eventually realize that Arm B is actually better.


Instead, UCB avoids being gullible by requiring us to keep track of our confidence in our assessments of the estimated values of all of the arms. To do that, we need to have some metric of how much we know about each arm. Thankfully, we already have information on hand that will give us that metric: we've been explicitly keeping track of the number of times we've pulled each arm for both of the algorithms we've used so far.

Strengths of UCB:

  • UCB doesn't use randomness at all.
  • UCB doesn't have any free parameters that you need to configure before you can deploy it.

A bonus quantity:

...the select_arm method for UCB1 uses a special type of value that we've called a ucb_value in this code. The ucb_value combines the simple estimated value of each arm with a special bonus quantity which is math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm])).

The most basic statement that can be made about it is that it augments the estimated value of any arm with a measure of how much less we know about that arm than we know about the other arms. ... The effect of that is that UCB is an explicitly curious algorithm that tries to seek out the unknown.
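Putting the bonus quantity quoted above into context, a UCB1 selection step can be sketched as below. The class structure and names are illustrative assumptions; only the bonus expression itself comes from the text:

```python
import math

class UCB1:
    """Minimal sketch of the UCB1 algorithm (structure is illustrative)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # number of plays per arm
        self.values = [0.0] * n_arms  # estimated value per arm

    def select_arm(self):
        # Play each arm once before applying the UCB formula,
        # so every count is nonzero.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total_counts = sum(self.counts)
        # Estimated value plus the confidence bonus from the text:
        # arms we know little about get a large bonus and are tried sooner.
        ucb_values = [
            value + math.sqrt((2 * math.log(total_counts)) / float(count))
            for value, count in zip(self.values, self.counts)
        ]
        return ucb_values.index(max(ucb_values))

    def update(self, arm, reward):
        # Incremental running average, as in the earlier algorithms.
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
```

Note that `select_arm` contains no randomness: given the same counts and values it always returns the same arm, which is why UCB has no parameters to tune.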

Chapter 7. Bandits in the Real World: Complexity and Complications #

  • A/A Testing to validate the algorithm
  • Continuous Experimentation vs. Periodic Testing:

    ...bandit algorithms look much better than A-B testing when you are willing to let them run for a very long time. If you're willing to have your site perpetually be in a state of experimentation, bandit algorithms will be many times better than A-B testing.

  • Moving Worlds:

    In the real world, the value of different arms in a bandit problem can easily change over time. ... Arms with changing values can be a very serious problem if you're not careful when you deploy a bandit algorithm. The algorithms we've presented will not handle most sorts of change in the underlying values of arms well. ...when you're dealing with millions or billions of plays, this means that recent rewards will have almost zero effect on your estimates of the value of different arms. ...

    There is a simple trick for working around this that can be used if you're careful: instead of estimating the values of the arms using strict averages, you can overweight recent events by using a slightly different update rule based on a different snippet of code:

      new_value = (1 - alpha) * value + (alpha) * reward
  • Correlated bandits

  • Contextual bandits: you can read about them in two academic papers called "A Contextual-Bandit Approach to Personalized News Article Recommendation" and "Parametric Bandits: The Generalized Linear Case".
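The moving-worlds update rule quoted above can be wrapped into a small helper to make its behavior concrete. The function name and the default alpha are illustrative assumptions:

```python
def update_value(value, reward, alpha=0.1):
    """Exponentially weighted update: recent rewards count more.

    alpha = 0 ignores new rewards entirely and alpha = 1 forgets the
    past entirely; the 0.1 default here is purely illustrative.
    """
    return (1 - alpha) * value + alpha * reward
```

Unlike the strict running average used earlier, each new reward shifts the estimate by a fixed fraction alpha, so old observations decay geometrically and the estimate can track an arm whose true value drifts.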

Chapter 8. Conclusion #

Lessons #


In the real world, you always have to trade off between gathering data and acting on that data.

God does play dice:

Randomization is the key to the good life. ... While the UCB algorithms we've used in this book aren't truly randomized, they behave at least partially like randomized algorithms from the perspective of your users. Ultimately what matters most is that you make sure that end-users can't self-select into the arms you want to experiment with.

Defaults matter a lot.

Take a chance:

You should try everything at the start of your explorations to ensure that you know a little bit about the potential value of every option.

Everybody's gotta grow up sometime:

You should make sure that you explore less over time.

Leave your mistakes behind:

You should direct your exploration to focus on the second-best option, the third-best option, and a few other options that are just a little bit further away from the best. Don't waste much or any of your time on options that are clearly losing bets.

Don't be cocky:

You should keep track of how confident you are about your evaluations of each of the options available to you.

Context matters.

A Taxonomy of Bandit Algorithms #

  1. Curiosity
  2. Increased Exploitation over Time
  3. Strategic Exploration
  4. Number of Tunable Parameters
  5. Initialization Strategy
  6. Context-Aware
