Different algorithms for the multi-armed bandit problem
The multi-armed bandit problem is a classic problem in reinforcement learning in which there is only
a single state (for stationary bandits). It has a variety of applications, including
serving as an alternative to A/B testing in digital marketing (for resource allocation). The goal of this
project is to understand some of the best-known algorithms for the bandit problem
under different settings (e.g. stationary vs non-stationary, continuous reward vs discrete reward).
The algorithms in this project are written from FIRST PRINCIPLES using Python. For each scenario, the following
algorithms are compared (a minimal sketch of each selection rule follows the list):
- Greedy action selection
- Eps-Greedy action selection (different eps)
- Upper-Confidence-Bound action selection
- Gradient bandits (different learning rates)
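As a concrete reference, here is a minimal sketch of the four selection rules. The function names, the value-estimate array Q, the count array N, the preference array H, and the hyper-parameter defaults are illustrative choices, not the project's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(Q):
    """Pick the action with the highest estimated value (ties broken at random)."""
    return int(rng.choice(np.flatnonzero(Q == Q.max())))

def eps_greedy(Q, eps):
    """With probability eps pick a random arm, otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return greedy(Q)

def ucb(Q, N, t, c=2.0):
    """Upper-Confidence-Bound: add an exploration bonus for under-tried arms."""
    bonus = c * np.sqrt(np.log(t + 1) / (N + 1e-8))  # small constant avoids divide-by-zero
    return int(np.argmax(Q + bonus))

def gradient_bandit(H):
    """Gradient bandit: sample an action from the softmax of the preferences H."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()
    return int(rng.choice(len(H), p=pi))
```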
For each scenario, 100 independent bandits are generated, where each bandit's action values are drawn
from the same underlying probability distribution. For each bandit, 1000 time steps are performed. To assess
the performance of each algorithm under each setting, the reward at each time step is averaged over the
100 bandits and plotted, showing how performance improves with experience.
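A minimal sketch of this evaluation loop is shown below. The make_bandit/make_agent interface is just one possible way to organize the code, not necessarily how the project's code is structured:

```python
import numpy as np

def average_reward_curve(make_bandit, make_agent, n_bandits=100, n_steps=1000):
    """Average the per-step reward of one algorithm over many independent bandits.

    make_bandit(seed) -> object with attribute `k` (number of arms) and a method
                         `pull(a, t)` returning the reward for arm `a` at step `t`.
    make_agent(k)     -> object with methods `act(t)` and `update(a, r)`.
    """
    curve = np.zeros(n_steps)
    for i in range(n_bandits):
        bandit = make_bandit(seed=i)     # fresh problem drawn from the same distribution
        agent = make_agent(bandit.k)     # fresh value estimates for this run
        for t in range(n_steps):
            a = agent.act(t)
            r = bandit.pull(a, t)
            agent.update(a, r)
            curve[t] += r
    return curve / n_bandits             # this is the curve plotted for each scenario
```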
Scenario 1. Stationary Bandits - Noisy Continuous Reward
In this scenario, each bandit has 10 different actions (10 arms), and the true action values are sampled
from some probability distribution (which is stationary, i.e. does not change over time). The reward for
pulling an arm is that arm's true action value plus a noise term (sampled from another probability
distribution). The average reward plot is shown as follows:
Some of the observations are:
- Ranking of performance:
Eps-Greedy (small eps) > UCB = Gradients > Eps-Greedy (big eps) > Greedy
- The greedy algorithm performs much worse than the other algorithms. This makes sense,
as the reward has high variance (due to noise), and EXPLORATION beats EXPLOITATION in this case.
- Hyper-parameters are important for these algorithms. For the eps-greedy algorithm in this case,
a big eps leads to much worse performance (and a slower convergence rate) than a small eps.
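For reference, a Scenario 1 testbed could be sketched as below. Standard normal distributions are assumed for both the true action values and the noise, since the exact distributions are not stated above, and the class name is illustrative; it plugs into the evaluation loop sketched earlier:

```python
import numpy as np

class GaussianBandit:
    """Stationary 10-armed testbed: reward = true action value + Gaussian noise."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k = k
        self.q_true = self.rng.normal(0.0, 1.0, k)  # true action values (assumed N(0, 1))

    def pull(self, a, t=None):
        # Noise term assumed standard normal; `t` is unused for a stationary bandit.
        return self.q_true[a] + self.rng.normal()
```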
Scenario 2. Stationary Bernoulli Bandits
In this scenario, the agent chooses among 4 different actions (4 arms). When arm k is pulled, the reward
is 1 (success) with some probability pk and 0 (failure) otherwise, i.e. the reward for each arm follows
a Bernoulli distribution with parameter pk. Here, all bandits are stationary: the
success probabilities p1, p2, p3, p4 remain the same over time. The average reward plot
is shown as follows:
Some of the observations are:
- The success probabilities of the four arms are 0.5, 0.7, 0.3, and 0.4. It can be seen that the
top-performing algorithms learn to favor the best arm (the second one).
- Ranking of performance:
Eps-Greedy (small eps) > Eps-Greedy (adaptive eps) > Eps-Greedy (big eps) > UCB >
Gradients > Greedy
- Again, the greedy algorithm performs much worse than the other algorithms;
EXPLORATION beats EXPLOITATION in this case.
- The gradient algorithms do not perform well compared to the other exploration algorithms,
possibly because they got stuck on a suboptimal action (here, the first arm, which has a
success probability of 0.5). Tuning the learning rate might help.
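A Scenario 2 testbed is even simpler. The sketch below uses the success probabilities quoted above; the class name is illustrative:

```python
import numpy as np

class BernoulliBandit:
    """Stationary Bernoulli testbed: arm a pays 1 with probability p[a], else 0."""

    def __init__(self, p=(0.5, 0.7, 0.3, 0.4), seed=0):
        self.rng = np.random.default_rng(seed)
        self.p = np.array(p)
        self.k = len(self.p)

    def pull(self, a, t=None):
        return float(self.rng.random() < self.p[a])  # 1 (success) or 0 (failure)
```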
Scenario 3. Non-Stationary Bandits - Noisy Continuous Reward
Compared to Scenario 1, the bandits are now non-stationary: the probability distribution from which the
true action values are sampled changes over time. Note that in this case a CONSTANT STEP SIZE
is used, so the estimated action values are exponentially recency-weighted averages of past rewards
(a short sketch of this update appears at the end of this scenario).
As before, the reward for pulling an arm is the true action value
plus a noise term (sampled from another probability distribution). The average reward plot
is shown as follows:
Some of the observations are:
- Ranking of performance:
Gradients > Eps-Greedy > Greedy > UCB
- Once again, the greedy algorithm does not perform well. This makes sense,
as the reward now has even higher variance (due to both the noise and the non-stationary true action
values), and EXPLORATION beats EXPLOITATION in this case.
- Note that the UCB algorithm does not perform well in this case. This is expected, as the standard
UCB rule is difficult to extend to non-stationary problems: its exploration bonus is based on how often
each arm has been pulled, not on how recently its value may have changed.
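For reference, the constant-step-size update mentioned above replaces the 1/N sample-average step with a fixed alpha, so recent rewards are weighted more heavily. A minimal sketch, with illustrative names and an illustrative default step size:

```python
import numpy as np

class ConstantStepValueEstimate:
    """Exponential recency-weighted average: Q <- Q + alpha * (r - Q)."""

    def __init__(self, k, alpha=0.1):
        self.Q = np.zeros(k)   # estimated action values
        self.alpha = alpha     # fixed step size (instead of 1/N)

    def update(self, a, r):
        # With a constant alpha, old rewards are discounted geometrically,
        # which lets the estimate track a drifting true action value.
        self.Q[a] += self.alpha * (r - self.Q[a])
```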
Scenario 4. Non-Stationary Bernoulli Bandits
Compared to Scenario 2, the success probability for each arm is no longer constant (stationary); it
changes in the following way. At each time t, the slot machine is, with equal probability, in one of 3
regimes, and each regime has its own fixed success probabilities for the arms. This
makes it more difficult to learn the true action values.
The average reward plot
is shown as follows:
Some of the observations are:
- The success probabilities for the four arms are (0.5, 0.7, 0.3, 0.4) in regime 1, (0.7, 0.3, 0.4, 0.5)
in regime 2, and (0.5, 0.3, 0.4, 0.7) in regime 3. In each regime the best arm has a success probability
of 0.7, but which arm that is changes across regimes.
- Ranking of performance:
Gradients = Eps-greedy (small & adaptive) > Greedy = Eps-greedy (big eps) > UCB
- First, note that all the algorithms perform worse than in the stationary scenario; this makes sense,
as the non-stationary scenario is harder to learn.
- Once again, the greedy algorithm and the eps-greedy algorithm with a big eps
perform worse than the other algorithms (except UCB).
- Again, the UCB algorithm does not perform well in this case; as noted in Scenario 3, the standard
UCB rule is not well suited to non-stationary problems.
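A Scenario 4 testbed could look like the sketch below, with the regime drawn uniformly at random at every time step and the success probabilities taken from the first bullet above; the class name is illustrative:

```python
import numpy as np

class RegimeSwitchingBernoulliBandit:
    """Non-stationary Bernoulli testbed: each step, one of 3 regimes is chosen uniformly."""

    REGIMES = np.array([
        [0.5, 0.7, 0.3, 0.4],   # regime 1
        [0.7, 0.3, 0.4, 0.5],   # regime 2
        [0.5, 0.3, 0.4, 0.7],   # regime 3
    ])

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k = self.REGIMES.shape[1]

    def pull(self, a, t=None):
        regime = self.rng.integers(len(self.REGIMES))         # equal probability per regime
        return float(self.rng.random() < self.REGIMES[regime, a])
```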