Today was more of an independent work day, so I decided to mess around with the CartPole problem and Q-networks. I already knew a bit about machine learning, so I thought, "How much harder could it be to get a network to work with reinforcement learning?" The answer: much harder. It is much harder to get a network to work with reinforcement learning. So, let's start with the problem – CartPole.
Basically, there's a cart with a pole on top, and the agent (that's what the learning program is called) is supposed to keep the pole balanced. At each timestep it's given the cart's position, the cart's velocity, the pole's angle, and the angular velocity of the tip of the pole, and it decides whether to push the cart left or right with a fixed-magnitude force. If the pole tips more than 15 degrees from vertical or the cart drifts more than 2.4 units from the center (I believe those are the right thresholds, though they might vary between simulations), the round ends.

The base idea behind reinforcement learning is maximizing reward, which can be generated in different ways. In CartPole it's simple: +1 for every timestep survived. In other games, like Snake, it gets more complicated, because you have to balance the reward from eating apples against the reward for staying alive; otherwise the snake just cheats by circling forever, collecting survival reward instead of going for apples.

A Q-network works by predicting the Quality of each action, which is basically the expected reward from acting that way. My implementation of the agent uses a setup I've seen called actor-critic, where two networks are in play (the same two-network trick shows up in DQN as a "target network"): one network acts, and the other collects those experiences and learns from them. Every so often the actor syncs with the critic so that they share the same weights, and therefore the same knowledge. Think of it like playing a game with a friend: one of you plays while the other watches and learns, and every once in a while you stop, tell your friend what you observed about the game and how to win/maximize reward, and then watch them play with their newfound knowledge while you observe and learn again. You keep repeating this until the two of you can win the game, or at least hit your goal.

There's also a small twist called "epsilon-greedy": with probability epsilon, the agent picks a random action instead of its best guess, so it explores different paths and might find more reward from a different approach. Epsilon slowly decays, so as the agent plays more rounds/episodes, more and more of its moves come from the agent itself rather than from random choice. Or that's what should happen, in theory.

For me, the agent is minimizing reward, i.e., doing the exact opposite of what it's supposed to. It starts out earning around 20-30 reward per episode and manages to grind that down to a consistent 8-10. My main suspect is the code that decides which action to take in the environment: I probably swapped a min and a max somewhere, so reward gets minimized instead of maximized (the sketch below shows what that step should look like). My goal is to fix that and actually watch the agent get smarter and earn more reward. Hopefully I won't have to rewrite my code again and it's just a small typo somewhere.
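To make the suspected bug concrete, here's a minimal sketch of what the action-selection step should look like. This isn't my actual agent code; I'm assuming PyTorch here, and the layer sizes are just placeholders:

```python
import random

import torch
import torch.nn as nn

# Tiny placeholder Q-network for CartPole: 4 state values in, 2 action values out.
q_net = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

def choose_action(state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(2)  # random left/right push: exploration
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    # Exploitation has to take the HIGHEST predicted Q-value.
    # An accidental argmin here (or a flipped sign) is exactly the kind of
    # typo that makes the agent minimize reward instead of maximizing it.
    return int(torch.argmax(q_values).item())
```

If that argmax were an argmin, the agent would diligently learn to drop the pole as fast as possible, which is suspiciously close to what mine is doing.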
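The two-network sync is also easier to see in code than in words. Again, this is just a sketch with made-up names (make_net, SYNC_EVERY), not my real training loop:

```python
import torch.nn as nn

def make_net():
    # Same placeholder architecture as the sketch above.
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

actor_net = make_net()   # the friend who is playing the game
critic_net = make_net()  # the friend who watches and learns

SYNC_EVERY = 10  # sync every N episodes; the interval is a hyperparameter

for episode in range(200):
    # ... play an episode with actor_net, then train critic_net on what happened ...
    if episode % SYNC_EVERY == 0:
        # The "tell your friend what you learned" step: copy the weights over.
        actor_net.load_state_dict(critic_net.state_dict())
```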
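And for completeness, the environment loop itself, with the +1-per-timestep reward I mentioned above, looks roughly like this (I'm assuming the older Gym API here; newer gym/gymnasium versions return slightly different tuples from reset and step):

```python
import gym

env = gym.make("CartPole-v1")

state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a random agent, just to show the loop
    state, reward, done, info = env.step(action)
    total_reward += reward  # +1 for every timestep the pole stays up

print(f"survived for {total_reward} timesteps")
```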