Different algorithms for the multi-armed bandit problem


The multi-armed bandit problem is a classic problem in reinforcement learning in which there is only a single state (for stationary bandits). It has a variety of applications, including serving as an alternative to A/B testing in digital marketing (for resource allocation). The goal of this project is to understand some of the best-known algorithms for the bandit problem under different settings (e.g. stationary vs. non-stationary, continuous reward vs. discrete reward).

Algorithms in this project are written from FIRST PRINCIPLES using Python. For each scenario, several well-known bandit algorithms are applied and compared.

For each scenario, 100 independent bandits are generated, each with the same underlying probability distribution of action values, and each bandit is run for 1000 time steps. To assess the performance of each algorithm under each setting, the results are averaged over the 100 bandits, and the average reward is plotted to show how performance improves with experience.
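As a rough illustration of this protocol, the sketch below runs a simple epsilon-greedy agent (with sample-average value estimates) on 100 independently generated 10-armed bandits for 1000 steps each, then averages the reward curves. The choice of epsilon-greedy, the value of EPSILON, and the Gaussian reward model are illustrative assumptions, not necessarily the exact algorithms or distributions used in this project.

```python
import numpy as np

N_BANDITS = 100   # independent bandit problems
N_STEPS = 1000    # time steps per bandit
EPSILON = 0.1     # exploration rate for the placeholder epsilon-greedy agent

def run_epsilon_greedy(true_values, rng):
    """Run epsilon-greedy with sample-average estimates on one bandit."""
    k = len(true_values)
    q_est = np.zeros(k)        # estimated action values
    counts = np.zeros(k)       # number of times each arm was pulled
    rewards = np.empty(N_STEPS)
    for t in range(N_STEPS):
        if rng.random() < EPSILON:
            a = int(rng.integers(k))          # explore: random arm
        else:
            a = int(np.argmax(q_est))         # exploit: greedy arm
        r = true_values[a] + rng.normal()     # noisy continuous reward (assumed Gaussian)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
        rewards[t] = r
    return rewards

rng = np.random.default_rng(0)
all_rewards = np.empty((N_BANDITS, N_STEPS))
for b in range(N_BANDITS):
    true_values = rng.normal(size=10)         # 10 arms, true values drawn per bandit
    all_rewards[b] = run_epsilon_greedy(true_values, rng)

avg_reward = all_rewards.mean(axis=0)         # curve to plot: average reward vs. time step
```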


Scenario 1. Stationary Bandits - Noisy Continuous Reward

In this scenario, each bandit involves 10 different actions (10 arms), and the true action values are sampled from some probability distribution, which is stationary - it does not change over time. In addition, the reward received for an action is its true action value plus a noise term (sampled from another probability distribution). The average reward plot is shown as follows:

[Figure: average reward plot for Scenario 1]
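For concreteness, one of the best-known algorithms for this stationary noisy-reward setting is UCB (upper confidence bound) action selection. Whether UCB is among the algorithms compared in this project is not stated above, but a minimal standalone sketch on an assumed 10-armed Gaussian testbed (true values and noise both drawn from standard normals) could look like this:

```python
import numpy as np

K, T, C = 10, 1000, 2.0           # arms, steps, exploration coefficient (illustrative)
rng = np.random.default_rng(1)
q_true = rng.normal(size=K)        # true action values (assumed standard normal)
q_est = np.zeros(K)
counts = np.zeros(K)

for t in range(1, T + 1):
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        a = int(untried[0])                       # pull each arm once before using the bound
    else:
        ucb = q_est + C * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(ucb))                   # pick the arm with the largest upper bound
    r = q_true[a] + rng.normal()                  # reward = true value + Gaussian noise
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]        # incremental sample average

print("best true arm:", int(np.argmax(q_true)), "most-pulled arm:", int(np.argmax(counts)))
```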

Some of the observations are:


Scenario 2. Stationary Bernoulli Bandits

In this scenario, the agent chooses among 4 different actions (4 arms). When arm k is pulled, the reward is 1 (success) with probability pk and 0 (failure) otherwise, i.e. the reward for each arm follows a Bernoulli distribution with parameter pk. Here, all bandits are stationary - the success probabilities p1, p2, p3, p4 remain the same over time. The average reward plot is shown as follows:

[Figure: average reward plot for Scenario 2]
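One standard algorithm for Bernoulli bandits is Thompson sampling with Beta priors; whether it is among the algorithms compared in this project is not stated above. The sketch below uses made-up success probabilities p_true purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([0.2, 0.4, 0.6, 0.8])   # illustrative success probabilities for the 4 arms
alpha = np.ones(4)                         # Beta(1, 1) priors: observed successes + 1
beta = np.ones(4)                          # observed failures + 1

for t in range(1000):
    theta = rng.beta(alpha, beta)          # sample a plausible pk for each arm
    a = int(np.argmax(theta))              # pull the arm whose sample is largest
    r = int(rng.random() < p_true[a])      # Bernoulli reward: 1 with probability pk, else 0
    alpha[a] += r
    beta[a] += 1 - r

post_mean = alpha / (alpha + beta)         # posterior mean estimate of each pk
print("posterior mean success rates:", np.round(post_mean, 2))
```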

Some of the observations are:


Scenario 3. Non-Stationary Bandits - Noisy Continuous Reward

Compared to scenario 1, the bandits are now non-stationary - the probability distribution from which the true action values are sampled changes over time. Note that in this case, a CONSTANT STEP SIZE is used to compute an exponentially decayed (recency-weighted) average for the estimated action values. Again, the reward received for an action is its true action value plus a noise term (sampled from another probability distribution). The average reward plot is shown as follows:

[Figure: average reward plot for Scenario 3]
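The constant-step-size update mentioned above can be sketched as follows. The epsilon-greedy action selection and the Gaussian random-walk drift of the true action values (controlled by the hypothetical DRIFT parameter) are assumptions for illustration, not necessarily the exact setup used here:

```python
import numpy as np

K, T = 10, 1000
ALPHA = 0.1        # constant step size -> exponentially decayed (recency-weighted) average
EPSILON = 0.1      # illustrative epsilon-greedy exploration rate
DRIFT = 0.01       # assumed scale of the random-walk drift of the true values

rng = np.random.default_rng(3)
q_true = rng.normal(size=K)
q_est = np.zeros(K)

for t in range(T):
    a = int(rng.integers(K)) if rng.random() < EPSILON else int(np.argmax(q_est))
    r = q_true[a] + rng.normal()            # noisy continuous reward
    q_est[a] += ALPHA * (r - q_est[a])      # constant step size: recent rewards weigh more
    q_true += DRIFT * rng.normal(size=K)    # non-stationarity: all true values drift each step

print("mean tracking error:", np.round(np.abs(q_true - q_est).mean(), 2))
```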

Some of the observations are:


Scenario 4. Non-Stationary Bernoulli Bandits

Compared to scenario 2, the success probability of each arm is no longer constant (stationary); it changes in the following way: at each time t, one of 3 regimes of the slot machine is selected with equal probability, and each regime has a different, constant success probability for each arm. This makes it more difficult to learn the true action values. The average reward plot is shown as follows:

[Figure: average reward plot for Scenario 4]
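A minimal sketch of this regime-switching reward model is given below, assuming the environment draws one of the 3 regimes uniformly at random at every time step and assuming made-up per-regime success probabilities P; a constant step size with epsilon-greedy selection is used here to track the changing values, as in scenario 3:

```python
import numpy as np

rng = np.random.default_rng(4)
# Assumed regime-specific success probabilities (3 regimes x 4 arms);
# the actual values used in the project are not shown above.
P = np.array([[0.1, 0.3, 0.5, 0.7],
              [0.7, 0.5, 0.3, 0.1],
              [0.4, 0.4, 0.4, 0.4]])
ALPHA, EPSILON = 0.1, 0.1
q_est = np.zeros(4)

for t in range(1000):
    regime = int(rng.integers(3))              # one of 3 regimes chosen with equal probability
    a = int(rng.integers(4)) if rng.random() < EPSILON else int(np.argmax(q_est))
    r = int(rng.random() < P[regime, a])       # Bernoulli reward under the active regime
    q_est[a] += ALPHA * (r - q_est[a])         # constant step size to track the moving target

# Under per-step regime switching, the effective success rate of arm k averages P[:, k].
print("estimates:", np.round(q_est, 2), "regime-averaged truth:", P.mean(axis=0))
```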

Some of the observations are:


Last updated on Nov 9, 2019