More CartPole (seeds!)

Today, we wrapped up the WEP, asked any lingering questions we had, and worked on an independent project for the rest of the day. Naturally, I decided to try to figure out the issue with CartPole. The most likely culprit is a min and a max switched somewhere, but I feel like I’ve checked them all, so I decided to try something else – seeding the environment. Basically, every time the environment is reset at the beginning of an episode, the initial condition is randomized. By seeding it, the environment resets to the same condition every time. Minecraft’s world generation works on a similar idea: if no seed is given, the world is generated randomly with a new seed each game, but if a seed is given, the same world is generated no matter how many times that seed is used. By seeding the environment, I’ll be able to tell whether the agent is just too ‘dumb’ to build a general understanding of how to win in randomly generated environments (but smart enough to memorize a solution after playing a single environment enough times), or whether something is critically wrong with my code. It’s been running for a bit and I’m feeling a bit optimistic because the score seems to be consistently around 20 points and slowly rising, although there’s still the chance it’ll slowly drop back down to 8-10 points again.
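For reference, here’s a minimal sketch of what I mean by seeding, assuming the 2021-era OpenAI Gym API for CartPole-v1 (newer Gym/Gymnasium versions pass the seed through env.reset(seed=...) instead):

```python
import gym

# Minimal sketch of seeding, assuming the 2021-era Gym API
# (newer Gym/Gymnasium versions use env.reset(seed=42) instead).
env = gym.make("CartPole-v1")

for episode in range(3):
    env.seed(42)         # fix the RNG so reset() draws the same initial state every time
    obs = env.reset()    # [cart position, cart velocity, pole angle, pole angular velocity]
    print(episode, obs)  # prints an identical starting observation for every episode
```

Without the env.seed(42) call before each reset, every episode starts from a slightly different random state, which is the “randomly generated environment” case I described above.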

An image of my code and the agent’s reward over many episodes.

 

Update: After a bit more training, the reward is creeping back down toward 10 points, so I think the issue is in my code rather than in the seeding/generation of the environment. Debug time!

CartPole – Balancing Stuff is Hard

Today was more of an independent work day, so I decided to mess around with the CartPole problem and Q Networks. I already knew a bit about machine learning, so I thought, “How much harder could it be to get a network to work with reinforcement learning?” The answer: much harder. It is much harder to get it to work with reinforcement learning. So, let’s start with the problem – CartPole.

An image of the CartPole Simulation (OpenAI Gym CartPole-v1)

Basically, there’s a cart with a pole on top, and the agent (what the computer player is called) is supposed to balance it. It’s given the cart’s position, the cart’s velocity, the pole’s angle, and the angular velocity of the tip of the pole, and it decides whether to apply a fixed-magnitude force to the left or to the right. If the pole tilts more than about 12 degrees from vertical or the cart moves more than 2.4 units away from center (I believe those are the thresholds in the Gym version, though they might vary between simulations), the round ends.

The base idea behind reinforcement learning is maximizing reward, which can be generated in different ways. In CartPole, it’s simple: +1 for each timestep survived. In other games, like Snake, it can be more complicated, because you’d want to balance the reward from the apples against a survival reward that motivates the snake to stay alive and go for apples, instead of just going in circles and cheating its way to reward by merely surviving.

The way Q Networks work is by predicting the Quality of each action, which is basically the expected reward from taking that action. My CartPole agent used something called actor-critic, where there are two networks in play: one acts, and the other collects those experiences and learns from them. Periodically, the acting network syncs with the learning network so that they share the same weights/knowledge. Think of it like playing a game with a friend: one of you plays while the other watches and learns, and after a while you pause, tell your friend what you observed about how to win/maximize reward, then swap – your friend plays with their newfound knowledge while you watch and learn, and you keep repeating this until the two of you can win the game or achieve your goal.

There’s also a small twist called “epsilon-greedy,” where with probability epsilon the agent chooses a random action, in order to explore different paths/actions and see if a different approach earns more reward. Epsilon slowly decays, so as the agent plays more rounds/episodes, more and more of its moves come from the network itself rather than from random choice. Or that’s what should happen, in theory. For me, the agent is minimizing reward – doing the exact opposite of what it is supposed to. Initially it scores around 20-30 points, but it manages to get that down to a consistent 8-10. My main issue right now is somewhere in the code for how the agent chooses which action to take in the environment. I probably accidentally switched a min and a max somewhere, which would cause reward to be minimized instead of maximized, so my goal is to fix that and watch it actually learn to get smarter and earn more reward. Hopefully I won’t have to rewrite my code again and it’s just a small typo somewhere.
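To make the min/max pitfall concrete, here’s a rough sketch of epsilon-greedy action selection, the part of my code I suspect. The q_network here is a stand-in for whatever model predicts one Q-value per action – its name and predict() call are illustrative assumptions, not my actual code:

```python
import numpy as np

def choose_action(q_network, state, epsilon, n_actions=2):
    """Epsilon-greedy action selection sketch.

    q_network is a hypothetical model whose predict() returns
    one Q-value per action for the given state.
    """
    if np.random.rand() < epsilon:
        # Explore: pick a random action (left or right in CartPole).
        return np.random.randint(n_actions)
    # Exploit: pick the action with the HIGHEST predicted Q-value.
    # Using np.argmin here instead of np.argmax is exactly the kind of
    # swapped min/max that would make an agent minimize reward.
    q_values = q_network.predict(state[np.newaxis, :])[0]
    return int(np.argmax(q_values))
```

Each episode, epsilon would then decay toward some floor (for example, epsilon = max(0.01, epsilon * 0.995)) so the agent gradually relies on its own predictions instead of random exploration.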

Sneakercon

Today, we met with Dr. Mark Hansen, a professor at the Columbia Journalism School, and learned about how modern journalism is beginning to use more and more data science and statistics, as well as some of the many things that Dr. Hansen and his students have done. One of their coolest projects (in my opinion) is Sneakercon. Sneakercon is about things called “sneakernets” in Venezuela. Because the government monitors the internet, people can’t really browse the web freely, so they have a really cool workaround – sneakernets. People would set up mesh networks, hand out thumb drives disconnected from the network, and even use Raspberry Pis (small computer boards, similar to Arduinos) as bridging devices to access the internet unrestricted. What Dr. Hansen and his students did was organize Sneakercon, a conference to teach more people about sneakernets and other kinds of offline/independent internet connections.

Me, Teddy, Alex, and Colin at Dr. Hansen’s Presentation

Justin Weltz – Duke Grad Student

This morning, we met with Justin Weltz, a grad student at Duke, about his research in Reinforcement Learning and Respondent-Driven Sampling (RDS). Interestingly, he didn’t start out planning to go into computer science: he entered undergrad focusing on English/Political Science and only decided to pursue more Statistics/Computer Science-related work in his junior year. After making that decision, Justin applied to Duke grad school on a whim and wasn’t sure if he’d actually get in. Overall, Justin found that grad school was a good experience and much more self-directed than high school or undergrad. While his first year had a lot of classes and coursework, his second year, when he began researching RDS with Dr. Laber, was much more research-oriented and independent, without a set schedule.

LaberLabs Day 1 – 05/24/2021

Today, we were introduced to Dr. Laber, a professor at Duke and the head of Laber Labs, a research lab that tackles a range of questions, mainly related to medicine, using methods like statistics and reinforcement learning to solve them.

Pictured below is an outreach project where Reinforcement Learning was used to “teach” a computer to play the Laser Cat game.

A Demonstration of Q Learning in the Laser Cat Game

For the rest of the day, we’ll be working on simulating something called the Monty Hall Problem. Basically, there are 3 doors: two hide something you don’t want and the third hides something you do want. You choose one door, and then the host eliminates one of the other two doors – always one of the losing ones – leaving you to choose between opening your current door or switching to the other remaining one. What we’re doing is building a dataset to see how often you win when you switch doors and whether there is a correlation between switching doors and winning.
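Here’s a rough sketch of the kind of simulation we’re building (the function name and trial count are just illustrative choices, not our exact code):

```python
import random

def play_round(switch):
    """One round of Monty Hall: returns True if the player wins the prize."""
    doors = [0, 1, 2]
    prize = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the player's pick nor the prize.
    opened = random.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Switch to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

trials = 100_000
stay_wins = sum(play_round(switch=False) for _ in range(trials))
switch_wins = sum(play_round(switch=True) for _ in range(trials))
print(f"stay:   {stay_wins / trials:.3f}")    # ~1/3
print(f"switch: {switch_wins / trials:.3f}")  # ~2/3
```

Over enough trials, switching should win about two thirds of the time while staying wins about one third, since your first pick is only right 1/3 of the time and switching wins whenever that first pick was wrong.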
