Improving Epsilon-Greedy: Q-Learning

Today was mostly a wrap-up session, but we were given some more time to explore topics of interest to us. I elected to spend some more time with RL, namely improving on the epsilon-greedy algorithm explained in my last post. I decided to take the epsilon-greedy philosophy and apply it to a method of RL known as Q-Learning. Q-Learning is an algorithm where you take all the possible states of your agent and all the possible actions the agent can take, and arrange them into a table of values (the Q-Table). Each value represents the reward the agent can expect if it takes that particular action while in that particular state. The values start at zero, and the point of the algorithm is to fill in the Q-Table accurately enough that the most efficient (or close to it) series of actions is taken. The full training loop is sketched in code at the end of this post.

Most of the steps of this process (building the table, choosing actions, collecting rewards) have been explained or are self-explanatory; the less intuitive part is how the Q-values themselves get updated. The update is based on what is known as the Bellman equation:
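In its standard tabular Q-learning form, the update looks like this, where r is the reward just received, s′ is the next state, and a′ ranges over the possible next actions:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]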

As you can see, the Q-value at a certain point in the table (state, action) is updated based on the reward given and a few other parameters specified by the user, such as the learning rate (alpha) and the discount rate (gamma, which controls how much we weight short-term versus long-term rewards). The action-choice part uses the epsilon-greedy strategy I described in my previous post, but unlike last time, epsilon starts at 1 and slowly decays throughout the training process. This lets the algorithm explore its choices to the fullest extent early on, and then rely more and more on the best choices as it becomes sure of them. I tested this algorithm on the taxi cab problem using software called OpenAI Gym, and as expected the rewards increased over time while epsilon decreased. I have had so much fun at this internship, and I hope to continue to explore AI and reinforcement learning in the future. Thanks for reading!
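For anyone who wants to tinker with this themselves, here is a minimal sketch of the kind of training loop I'm describing, using Gym's Taxi-v3 environment (the hyperparameters, episode count, and decay schedule below are just illustrative defaults, not the exact values I used):

```python
import numpy as np
import gym  # classic OpenAI Gym API; newer Gymnasium versions return slightly different reset/step tuples

env = gym.make("Taxi-v3")

# Q-Table: one row per state, one column per action, all starting at zero
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha = 0.1            # learning rate
gamma = 0.9            # discount rate (how much we weight future rewards)
epsilon = 1.0          # start fully exploratory...
epsilon_min = 0.01
epsilon_decay = 0.999  # ...and slowly decay toward mostly-greedy behavior

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action choice: explore with probability epsilon,
        # otherwise exploit the best-known action for this state
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, done, info = env.step(action)

        # Bellman-style update of the Q-value for (state, action)
        best_next = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])

        state = next_state

    # Decay epsilon so later episodes rely more on what has already been learned
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```

The multiplicative decay with a floor is just one common schedule; a linear decay toward a small minimum works in much the same way.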

The Multi-Armed Bandit Problem and Epsilon-Greedy Algorithm

Today we were given some time to do independent research on topics related to AI that interest us. I elected to do some more research on reinforcement learning, which if you remember from previous posts is a type of machine learning where an agent cyclically takes actions in its environment and is given a reward or punishment based on the success of those actions in relation to the goal. One classic example of a problem that can be solved with RL is the multi-armed bandit problem. Suppose you are in a casino playing various slot machines. Certain slot machines have a higher chance of paying out than others, and you have to figure out how to get as much money from the machines as possible without knowing the probability of a reward from any given machine. One way of approaching this is to simply pull the lever of every slot machine in hopes of hitting a jackpot. This is known as pure exploration, since you keep exploring new possibilities while ignoring the ones that previously paid out. Another way is to pull the lever of the same slot machine every time and hope you picked a good one. This is known as pure exploitation, AKA greedy, as you do no exploration whatsoever. Both of these methods are rather naive and have low odds of paying out well. A better way is a balance of the two, known as epsilon-greedy: with probability ϵ you select a random machine to pull, and the rest of the time (1-ϵ) you pull the best lever based on your current knowledge of the machines. ϵ is typically rather low (around 0.1 in this case), so you mostly play greedy but still do some exploration to find the best machine. I won’t go into the implementation of this algorithm in detail, but when I tested it on a table of pseudo-random numbers representing the payout probabilities of the various slot machines and graphed the simulated earnings over time, the results were undeniable.

As you can see, the rewards started out low, but as the algorithm explored the various machines, it learned the best levers to pull and increased its rewards significantly. This example is rather simple, but it nevertheless demonstrates the awesome power and potential of reinforcement learning for a wide variety of real-life applications.
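For anyone curious what that kind of simulation looks like in code, here is a minimal sketch of an epsilon-greedy bandit (the number of machines, their payout probabilities, and the number of pulls are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up payout probabilities for each "slot machine" (arm)
true_probs = rng.uniform(0.1, 0.9, size=10)

epsilon = 0.1                          # explore 10% of the time
estimates = np.zeros(len(true_probs))  # running estimate of each machine's payout rate
pulls = np.zeros(len(true_probs))      # how many times each machine has been pulled
total_reward = 0.0

for step in range(5000):
    if rng.random() < epsilon:
        arm = int(rng.integers(len(true_probs)))  # explore: pick a random machine
    else:
        arm = int(np.argmax(estimates))           # exploit: pick the best machine so far

    reward = float(rng.random() < true_probs[arm])  # 1 if this machine pays out, else 0
    pulls[arm] += 1
    # Incremental average keeps this machine's payout estimate up to date
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    total_reward += reward

print("Best machine found:", int(np.argmax(estimates)), "| true best:", int(np.argmax(true_probs)))
print("Average reward per pull:", total_reward / 5000)
```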

Dr. Çetinkaya-Rundel and RStudio

Today we got to hear from Dr. Çetinkaya-Rundel, a statistician who has made significant contributions to RStudio, open-source statistical software that aids data scientists. RStudio is based on the R language, a programming language designed with statistics in mind that has a large variety of tools to analyze data and visualize results easily. RStudio is not only a development environment for R but also includes several other non-programming utilities for analyzing data. RStudio, as previously mentioned, is also open-source. Open-source software, or OSS, is a development model in which any verified contributor can contribute to the software, and the software is not always maintained by a specific company (although it can be). OSS is also free to use, and anyone can get a copy simply by downloading one. It’s a bit like crowdsourced development, and it has been extremely effective at giving a large variety of people access to tools they otherwise would not have. RStudio has greatly aided the advancement of data science, and its community and features are only growing.


Mount Boredoom

Today we met with several members of Dr. Laber’s organization who have been involved in several outreach activities to educate people about topics in AI, statistics, and computer science. For the past two years they have been working on a game called Mount Boredoom, which focuses on applications of dynamic programming. Mount Boredoom is a game where you control a goat on a mountain trying to gather flowers, while trees that fire pinecones try to stop you. The trees rotate every time you move, and you must avoid them on your way to the flowers. Dynamic programming is a programming method where you break a problem into smaller subproblems, recursively, until they can no longer be divided, then solve those pieces and combine their answers. It has many applications ranging from aerospace engineering to economics. It can also be used to solve the problem posed by Mount Boredoom, where you break your overall path into individual movements and recursively optimize them to get the best score. The in-game overlay that allowed us to do this gave us a greater understanding of dynamic programming and showed us its usefulness and applications.
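As a toy illustration of the idea (this is not the game’s actual code), here is a small dynamic-programming sketch that computes the maximum number of flowers a goat could collect on a grid, under the simplifying assumption that it can only move right or down; the grid itself is made up:

```python
from functools import lru_cache

# A made-up grid: 1 = flower, 0 = empty. The goat starts at the top-left corner
# and, in this simplified version, can only move right or down.
grid = [
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
rows, cols = len(grid), len(grid[0])

@lru_cache(maxsize=None)
def best_from(r, c):
    """Most flowers collectible starting from cell (r, c)."""
    if r >= rows or c >= cols:  # walked off the grid: base case
        return 0
    # Break the problem into two smaller subproblems (move right or move down)
    # and reuse their answers thanks to memoization.
    return grid[r][c] + max(best_from(r + 1, c), best_from(r, c + 1))

print(best_from(0, 0))  # maximum flowers along any right/down path from the start
```

The key dynamic-programming ingredient is that each subproblem (the best score from a given cell) is solved once and reused, rather than being recomputed for every path that passes through that cell.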


The Paperclip Maximizer

Today during Jesse Clifton’s talk on the dangers of AI, we were presented with multiple scenarios in which AI could pose a danger to society if, when asked to achieve a certain end, its means do not align with what we as humans consider acceptable. As an example of how this could happen, our attention was drawn to a parable called “the paperclip maximizer.” The story goes like this: some time in the not-too-distant future, a company that manufactures paperclips directs its AI assistant to maximize the production of paperclips. Crucially, it neglects to restrict the AI’s methods for doing this. Additionally, it allocates the AI a great amount of resources and power, which it can use to achieve this goal. Soon, paperclip factories begin to spread. The AI builds more and more of them, and they begin to take over industry. The world’s mineral resources are stripped as they are turned into paperclips, and the pollution from the factories poisons the air and the oceans. Nobody can stop the AI; it has taken countermeasures to prevent that. And the world falls to ruin, all because of paperclips. While this is a rather silly and extreme example, Clifton argued that a less extreme version of the scenario may well occur if we are not careful in how we instruct AIs to carry out tasks.

There are multiple proposed solutions to this problem, but few have shown much promise at this point. The default approach is to test the AI’s actions inside a simulation before releasing it into the real world; however, the possibility remains that the AI would see its most efficient path to its goal as deceiving us, pretending to act in acceptable ways until we release it, and only then enacting its true plan. We have no way of knowing what could happen, but we must proceed carefully.

Alex Cloud from Riot Games and RL

During the second half of our day we got to speak to Alex Cloud, a Ph.D. student in statistics who works for Riot Games. His research involves several topics, including what’s called reinforcement learning. Reinforcement learning, or RL, is a type of machine learning that deals with autonomous agents interacting with the environment around them. The agents are trained to seek a specific goal, getting “rewarded” for completing it and “punished” for doing it incorrectly, and eventually the algorithm learns how to complete the goal in an effective way. The big question Alex has been trying to answer is, “If I am able to simulate how the world works but don’t know the true state of the world, how do I plan ahead?” His approach draws on several methods, including RL, game theory, decision theory, and pure statistics, and he recently submitted his findings to a computer science conference. By training a machine learning model to analyze previous scenarios in this “world” and look for specific elements of the known world that predict the unknown parts, the entire world can be predicted fairly accurately even when not all of it is known.


RL can go a long way toward solving many current challenges faced by the world, including modeling conflict, autonomous vehicles, precision medicine, and business analytics. The example we were given was a machine that can reliably win at poker by accurately predicting its opponents’ hands. The possibilities are endless, and we’re excited to start experimenting with these technologies.
