The Multi-Armed Bandit Problem and Epsilon-Greedy Algorithm

Today we were given some time to do independent research on AI topics that interest us. I elected to do some more research on reinforcement learning, which, if you remember from previous posts, is a type of machine learning where an agent cyclically takes actions in its environment and receives a reward or punishment based on how well those actions serve its goal. One classic problem that can be solved with RL is the multi-armed bandit problem. Suppose you are in a casino playing various slot machines. Some machines have a higher chance of paying out than others, and you have to figure out how to win as much money as possible without knowing the payout probability of any given machine.

One way to approach this is to simply pull the lever of every slot machine in the hope of hitting a jackpot. This is known as pure exploration: you keep trying new machines while ignoring everything you've learned about which ones previously paid out. Another way is to pull the lever of the same slot machine every time and hope you picked a good one. This is known as pure exploitation, also called the greedy strategy, since you do no exploration whatsoever. Both of these methods are rather naive and have low odds of paying out well.

A better way is a balance of the two, known as epsilon-greedy. With probability ϵ you select a random machine to pull, and the rest of the time (with probability 1 − ϵ) you pull the lever that looks best based on your current estimates of the machines. ϵ is typically rather low (around 0.1 in this case), so you mostly play greedy, with just enough exploration to keep looking for the best machine. I won't walk through my full implementation, but a minimal sketch of the core loop appears below. When I tested it against a table of pseudo-random numbers representing the payout probabilities of the various machines and graphed the simulated earnings over time, the results were striking.
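Here is a minimal sketch of the idea in Python. The payout probabilities in true_probs, ϵ = 0.1, and the number of pulls are illustrative assumptions for this sketch, not the exact values from my experiment:

```python
import random

# Hypothetical payout probability for each machine; the agent never
# sees these, it only observes rewards from pulling levers.
true_probs = [0.25, 0.50, 0.75, 0.40]

epsilon = 0.1                          # exploration rate
pulls = [0] * len(true_probs)          # times each lever has been pulled
estimates = [0.0] * len(true_probs)    # running estimate of each payout rate
total_reward = 0

for step in range(10_000):
    if random.random() < epsilon:
        # Explore: pull a random lever
        arm = random.randrange(len(true_probs))
    else:
        # Exploit: pull the lever with the best estimated payout so far
        arm = max(range(len(true_probs)), key=lambda a: estimates[a])

    # The machine pays out 1 with its true probability, otherwise 0
    reward = 1 if random.random() < true_probs[arm] else 0
    total_reward += reward

    # Update this machine's estimate with an incremental average
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]

print("Total reward:", total_reward)
print("Estimated payout rates:", [round(e, 2) for e in estimates])
```

After enough pulls the estimates converge toward the true payout rates, so the greedy choice settles on the best machine while the occasional ϵ-exploration keeps checking the others.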

As you can see from the graph, the rewards started out low, but as the algorithm explored the various machines, it learned which levers were best to pull and its earnings increased significantly. This example is rather simple, but it nevertheless demonstrates the awesome power and potential of reinforcement learning for a wide variety of real-life applications.
