Today, we wrapped up the WEP, asked any lingering questions we had, and worked on an independent project for the rest of the day. Naturally, I decided to try to figure out the issue with CartPole. The most likely culprit is a swapped min and max, but I feel like I've checked them all, so I decided to try something else – seeding the environment. Basically, every time the environment is reset at the beginning of an episode, the initial conditions are randomized. By seeding it, though, the environment is reset to the same initial condition every time. Minecraft's world generation operates on the same basic concept: if no seed is given, the world is generated from a random seed each game, but if a seed is given, the same world is generated no matter how many times that seed is used. By seeding the environment, I'll be able to tell whether the agent is too 'dumb' to get a general understanding of how to win in randomly generated environments but smart enough to figure out a solution after playing a single fixed environment enough times, or whether something is actually critically wrong with my code. It's been running for a bit and I'm feeling a little optimistic, because the score seems to be consistently around 20 points and slowly rising, although there's still the chance it'll slowly drift back down to 8-10 points again.
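For reference, here's roughly what the seeding looks like in code – a minimal sketch assuming the newer Gymnasium-style API, where reset() takes a seed argument (older gym versions used a separate env.seed() call instead, and my actual training loop looks different):

```python
import gymnasium as gym  # assuming the Gymnasium API; classic gym seeded via env.seed()

env = gym.make("CartPole-v1")

# Unseeded: each reset starts the cart/pole from a slightly different random state.
obs_a, _ = env.reset()
obs_b, _ = env.reset()
print(obs_a, obs_b)  # almost certainly different

# Seeded: passing the same seed to reset() reproduces the same initial state,
# like typing the same seed into Minecraft's world generator.
obs_c, _ = env.reset(seed=42)
obs_d, _ = env.reset(seed=42)
print(obs_c, obs_d)  # identical
```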
Update: After a bit more training, the reward is approaching 10 points again, so I think the issue is in my code rather than in the seeding/generation of the environment. Debug time!
Today was more of an independent work day, so I decided to mess around with the CartPole problem and Q Networks. I already knew a bit about machine learning, so I thought, “How much harder could it be to get a network to work with reinforcement learning?” The answer: much harder. It is much harder to get a network to work with reinforcement learning. So, let's start with the problem – CartPole.
Basically, there's a cart with a pole on top, and the agent (which is what the computer player is called) is supposed to balance it. It's given the cart's position, the cart's velocity, the angle of the pole, and the angular velocity of the pole's tip, and it decides whether to push the cart left or right with a fixed-magnitude force. If the pole tilts more than about 15 degrees from vertical or the cart moves more than 2.4 units away from center (I believe those are the correct thresholds, though they might vary based on the simulation), the round ends.

The base idea behind reinforcement learning is maximizing reward, which can be generated in different ways. In CartPole, it's simple: +1 for each timestep survived. In other games, like Snake for example, it might be a bit more complicated, because you'd want to balance the reward from the apples against a reward for staying alive, so that the snake is motivated to survive and go for apples instead of just circling forever and farming reward by merely surviving.

Q Networks work by predicting the Quality of each action, which is essentially the expected reward from acting that way. In CartPole, my implementation of the agent used something called actor-critic, where there are two networks in play: one acts, and the other collects those actions and learns from them. Eventually, the actor network syncs with the critic network so that they have the same weights/knowledge. Think of it like playing a game with your friend. One of you plays while the other watches and learns; after a while, you both pause, and the watcher explains what they observed about the game and how to win/maximize reward. Then you watch your friend play again with their newfound knowledge, observe, and learn, and you keep repeating this until the two of you can win the game or achieve your goal.

There's also a small twist on that called “epsilon-greedy”: with probability epsilon, the agent chooses a random action in order to explore different paths/actions and see if it can get more reward from a different approach. Epsilon slowly decays, so as the agent plays more rounds/episodes, the share of moves chosen by the agent itself rather than at random also increases. Or that's what should happen, in theory. For me, the agent is minimizing reward, i.e. doing the exact opposite of what it's supposed to. Initially it scores around 20-30, but it manages to get that down to a consistent 8-10 points. My main issue right now is somewhere in the code for the agent's actions and how it chooses which action to take in the environment. I probably accidentally switched a min and a max somewhere, which would cause reward to be minimized instead of maximized, so my goal is to fix that and watch it actually learn to get smarter and earn more reward. Hopefully I won't have to rewrite my code again and it's just a small typo somewhere.
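To make that concrete, here's a rough sketch of the two-network, epsilon-greedy setup I described – written with PyTorch purely for illustration, so the class name, layer sizes, and decay rate are all made up and this isn't my actual code:

```python
import random
import torch
import torch.nn as nn

class CartPoleAgent:
    """Rough sketch of the two-network, epsilon-greedy idea described above.
    All names and hyperparameters here are invented for illustration."""

    def __init__(self, n_obs=4, n_actions=2):
        def make_net():
            return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.acting_net = make_net()    # the "friend" currently playing
        self.learning_net = make_net()  # the one watching and training
        self.n_actions = n_actions
        self.epsilon = 1.0              # start fully random...
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995      # ...and slowly hand control to the network

    def choose_action(self, state):
        """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
        if random.random() < self.epsilon:
            action = random.randrange(self.n_actions)  # random left/right push
        else:
            with torch.no_grad():
                q_values = self.acting_net(torch.as_tensor(state, dtype=torch.float32))
            # argmax = pick the action with the highest predicted Quality.
            # An accidental argmin here (or a flipped sign in the training target)
            # is exactly the kind of min/max mix-up that would make the agent
            # minimize reward instead of maximizing it.
            action = int(torch.argmax(q_values).item())
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        return action

    def sync(self):
        """Every so often, the playing network copies the learner's weights."""
        self.acting_net.load_state_dict(self.learning_net.state_dict())
```

In the full training loop, choose_action would feed actions to the environment, the collected experience would be used to train the learning network, and sync would be called every so often – and somewhere in my version of that action-selection/target logic, a min and a max have presumably traded places.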