In this post, AYLIEN NLP Research Intern, Mahdi, talks us through a quick experiment he performed on the back of reading an interesting paper on evolution strategies, by Tim Salimans, Jonathan Ho, Xi Chen and Ilya Sutskever.
Having recently read Evolution Strategies as a Scalable Alternative to Reinforcement Learning, Mahdi wanted to run an experiment of his own using Evolution Strategies. Flappy Bird, a simple yet challenging game, has always been among his favorites when it comes to game experiments, so he decided to put theory into practice there.
The model is trained using Evolution Strategies, which in simple terms works like this:
- Create a random, initial brain for the bird (this is the neural network, with 300 neurons in our case)
- At every epoch, create a batch of modifications to the bird’s brain (also called “mutations”)
- Play the game using each modified brain and calculate the final reward
- Update the brain by pushing it towards the mutated brains, proportionate to their relative success in the batch (the more reward a brain has been able to collect during a game, the more it contributes to the update)
- Repeat steps 2-4 until a local maximum for rewards is reached.
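The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code from the experiment: the function names, the hyperparameters (batch_size, sigma, learning_rate) and the toy reward, which stands in for playing a game of Flappy Bird, are all assumptions made for the example, and the "brain" here is just a flat parameter vector.

```python
import numpy as np


def evolution_strategies(reward_fn, theta, epochs=300, batch_size=50,
                         sigma=0.1, learning_rate=0.003, seed=0):
    """Maximise reward_fn(theta) with a basic evolution strategy.

    All default values are illustrative assumptions, not the
    settings used in the experiment.
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        # Step 2: sample a batch of random mutations of the current brain.
        noise = rng.standard_normal((batch_size, theta.size))
        # Step 3: evaluate each mutated brain (one "game" per mutation).
        rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
        # Step 4: standardise rewards so that each mutation contributes
        # in proportion to its relative success within the batch.
        spread = rewards.std()
        if spread == 0:
            continue  # every brain scored the same, so there is no signal
        advantages = (rewards - rewards.mean()) / spread
        # Push the brain towards the mutations that scored above average.
        step = learning_rate / (batch_size * sigma)
        theta = theta + step * (noise.T @ advantages)
    return theta


def toy_reward(params):
    """Hypothetical stand-in for a game: reward peaks at a fixed target."""
    target = np.array([0.5, -0.3, 0.8])
    return -np.sum((params - target) ** 2)


initial = np.zeros(3)
trained = evolution_strategies(toy_reward, initial)
print("reward before:", toy_reward(initial))
print("reward after: ", toy_reward(trained))
```

In the real experiment, reward_fn would play a full game with the mutated network and return the final score. No gradients are computed anywhere, which is why backpropagation is not needed.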
At the beginning of training, the bird usually either drops too low or jumps too high and hits one of the boundary walls, losing immediately with a score of zero. To avoid scores of zero in training, which would mean there is no measure of success among brains, Mahdi set a small score of 0.1 for every frame the bird stays alive. This way the bird learns to avoid dying at the first attempt. He then set a score of 10 for passing each wall, so the bird tries not only to stay alive, but to pass as many walls as possible.
Training is quite fast because there is no need for backpropagation, and it is cheap in terms of memory because, unlike with policy gradients, there is no need to record actions.
The model learns to play pretty well after 3000 epochs. However, it is not flawless and still loses on rare occasions in difficult cases, such as when there is a large height difference between two consecutive wall entrances.
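The scoring scheme described above amounts to a simple formula. Here is a small sketch of it: the two constants come from the post, while the function name and its arguments are hypothetical and chosen only for illustration.

```python
ALIVE_REWARD = 0.1  # reward for every frame the bird stays alive
WALL_REWARD = 10.0  # reward for every wall the bird passes


def episode_reward(frames_alive, walls_passed):
    """Total reward for one game under the scheme described above."""
    return ALIVE_REWARD * frames_alive + WALL_REWARD * walls_passed


# Two birds that both crash before the first wall still get different
# scores, so the batch has a usable measure of relative success.
early_crash = episode_reward(frames_alive=12, walls_passed=0)
later_crash = episode_reward(frames_alive=30, walls_passed=0)
print(early_crash < later_crash)
```

Because a single wall is worth as much as 100 frames of survival, a brain that passes walls quickly dominates one that merely stays in the air.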
Here is a demonstration of the model after 3000 epochs
(~5 minutes on an Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz):
For ease of access, Mahdi has created a web version of the experiment which can be accessed here.
Try it yourself
Note: You need python3 and pip to install and run the code.
First, clone the repository:
git clone https://github.com/mdibaiee/flappy-es.git
Next, install dependencies (you may want to create a virtualenv):
pip install -r requirements
The pretrained parameters are in a file named
load.npy and will be loaded when you run
train.py will train the model, saving the parameters to
demo.py shows the game in a GTK window so you can see how the AI actually plays.
play.py lets you play the game yourself: press space to jump and, once you have lost, press enter to play again.
It seems that training past a certain point leads to a reduction in performance. Learning rate decay might help with this. Mahdi’s interpretation is that once the model has found a local maximum for accumulated reward and is receiving high rewards, the updates become quite large and pull the parameters too far in different directions, so the model starts to oscillate.
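As a rough idea of what learning rate decay could look like, here is a sketch of an exponential schedule. It was not part of the experiment, so whether it actually fixes the oscillation is untested; the function name and the values for initial_rate and decay are assumptions picked for illustration.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Exponentially shrink the learning rate as training progresses."""
    return initial_rate * decay ** epoch


# Hypothetical values: start at 0.03 and lose 0.1% per epoch.
for epoch in (0, 1000, 3000):
    print(epoch, decayed_learning_rate(0.03, 0.999, epoch))
```

The update at each epoch would then be scaled by this shrinking rate instead of a constant, so late updates move the brain less and are less likely to throw it off a good solution.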
To try it yourself, there is a long.npy file: rename it to load.npy (back up the existing load.npy before doing so) and run demo.py. You will see the bird failing more often than not.
long.npy was trained for only 100 more epochs than