Monday, March 2, 2015

DeepMind's Atari paper replicated

This is a followup on the previous post. Most of the results have been replicated. Turning off Nesterov momentum, pumping RMSprop decay to 0.99 and setting the discount rate to 0.95 did the trick. The spikes in the Q-score plot that were observed with the parameters used previously that I mentioned in the last post seemed to have been caused by the momentum making the optimisation diverge, not by Q-learning itself diverging as I thought previously.

Just to be completely clear, none of this is by me: all the learning code, to which I will refer to as deep_q_rl, is by Nathan Sprague, and the algorithm is from the paper "Playing Atari with Deep Reinforcement Learning" by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller that you can get from You can get the code from I'm mainly acting as a tester.

(As I was writing this post (it has taken me almost two weeks to write), DeepMind published their updated paper on Nature (, and this time they've released their source code: In this post I am mainly comparing to the original NIPS workshop paper. I will only mention the Nature paper where it seemed to clarify something about the NIPS paper)

When I write that the results are replicated, I mean that I think that the results produced by deep_q_rl are probably as good as the results that DeepMind obtained when they wrote the paper, and that all the original results seem believable considering these results. The Q-value curves still look very different, most likely because they were using a higher discount rate (*in their Nature paper they mention a discount rate of 0.99 while this code uses 0.95). Also, depending on the interpretation of the table where they list their results, DeepMind's results might have been significantly better. It is hard to tell exactly what they mean by "average total reward" in table 1. Do they mean across the whole 200 epochs? Or testing their best model? Or testing only the final network? Or across a large swathe of the networks that were trained?  I've taken a liberal interpretation of them to mean roughly how well does the agent score across many epochs once it looks like it has learnt. (From their Nature paper, it seems that they mix their use of episode and epoch a bit, so that the "best" seems to mean average across the testing epoch where it averaged highest, while "average" could be the average over all testing episodes)

Here are DeepMind's average and best scores as listed in table 1:

GameBeam riderBreakoutEnduroPongQ*BertSeaquestSpace invaders
DeepMind average40921684702019521705581
DeepMind best518422566121450017401075

I think deep_q_rl does clearly worse than DeepMind's code in Space Invaders, a little bit worse in Breakout, similar in Pong and Enduro, better on Q*Bert, and worse, similar or much better on Seaquest depending on your angle. I did not test on Beam rider.

I'll just post the average and max-score curves obtained by deep_q_rl, some qualitative commentary on how the agent plays on each game, and videos showing good (biased) examples of the agent playing or montages of it learning. The graphs show average test score on the left and highest test score on the right. The x-axis is the epoch. They were obtained using a test epsilon of 0.05 (ie on each frame there is a five percent chance of performing a random action) just to be able to compare them with DeepMind's results. I prefer watching the results of epsilon 0.01 (ie 1 percent chance of random action) since it makes it easier to tell policy problems from random errors, so I'll be describing the behaviour of runs using those settings. Also, let me be very clear that even though I might have some negative comments on the behaviour of the agent, this is all to be taken as relative to the absolute level of blown-awayedness that I feel regarding that this works at all.


Pong is very simple and the agent learns very quickly. It plays extremely aggressively always trying to win in one hit. It mostly wins 21-0 or 21-1.


The code learns to play breakout relatively quickly, taking about 40 epochs (1 epoch = 50000 frames) to get reasonably good at it. It plays without any seeming purpose, hitting the ball when it comes down but not aiming anywhere. 

At some point it gets good enough to make a tunnel in the wall, just by repeated random hitting, and put the ball through the tunnel, scoring lots of points. It never learns to hit the ball after it comes back down through the tunnel. I think that the image looks different enough that the network does not know what to do. It mostly just sits catatonic with the paddle against the left or right wall. I have never seen it clear a level.

The difference between its high scores and its average scores once it has learnt are obtained by it being lucky, first of all by getting the ball through the tunnel quickly once it's formed, since once the tunnel is formed its play deteriorates, and secondly by either having the ball hang around above the wall for longer or by having the paddle in the right place by accident once the ball does come back down.

Highest scored observed: 391. Obtained just as described above: random hitting leading to a hole in the middle of the wall, ball goes up and hangs around for a while, hole is in the right place for a new ball to go through it if hit by a paddle laying against the left wall.

The paper lists the average human score as 31, which I can only explain if they were using a keyboard to play the game. This game is meant to be played with a paddle, and 31 would be a pathetic score. I don't think the scores attained by the network as better than human level.

Space invaders

I expected the code to learn space invaders quite well, since it is a very simple game where the rewards come very shortly after the actions and where random thrashing at the beginning can quickly tell it what actions lead to good rewards, but it does not do very well at it, at least compared to a human. 

It does learn very quickly that shooting is a good thing and within 5 epochs it gets reasonably good at shooting down invaders. It just never gets very good at evading invaders' bullets. It does slowly improve at both tasks, and by epoch 180 or so, its play looks very similar to a novice human player, but it doesn't keep on improving.

The huge difference between the average scores and the maximum scores for each epoch are completely due to luck. It will randomly shoot down a passing mothership which give over 300 points each. Due to the way that the image is cropped while fed to the agent, it cannot actually see the mothership though, so it being able to shoot it down is most likely just due to luck (unless it managed to find the pattern of when they appear and from which direction, something I can't rule out).

DeepMind's version did not do that well either, but clearly better than this one. 

Highest observed score: 925 (shot two motherships)


Enduro is my favourite game to watch it play in. Its style is very non-human, combining super-human reflexes in some parts and parts where it just doesn't know what to do. 

It takes ages for it just to get any points at all in this game (almost 30 epochs). This is because points here are gotten by passing cars, and you need to keep the accelerator pressed for quite a few frames in a row to gather enough speed to pass a car. This is something that isn't likely to happen with random joystick movements. Once it learns to do that though, it quickly gets to very high scores, learning to dodge cars at very high speed in quite different-looking environments.

It never learns what to do in the bright/glare/snow/whatever-it-is stage though, where it just waits it out or goes full tilt till it slams into the car in front of it. It is the only stage where the background is brighter than the objects to avoid, so it is understandable for it to have trouble transferring the learning from the other sections. It also learnt that you cannot gain or lose places while the flags are waving in between levels, so it doesn't even try while they are being shown.

The high scores here track the average. Enduro takes ages to play so most of the time only one game fits in a testing epoch (10,000 frames).

When running without the 10,000 frames limit, I observed a run scoring 1080. It is part of the video below. It goes for over 7 minutes at 2x normal speed.


I didn't think the code would learn to play Seaquest because, unlike in all the other games in this page, all 18 actions are available (ie up-left, up, up-right, left, neutral, right, bottom-left, bottom, bottom-right, then each again but with the button pressed), which means that picking the right one is harder. It does learn to play it, though, in a very shallow sense.

It progressively gets better at hunting down sharks and at avoiding crashing into them, all the time ignoring the divers. It gets very good at that, but it never goes up for air so it keeps dying due to lack of oxygen. Eventually, it discovers that going up for air means getting points for rescued divers. This seems to have a negative effect on its agility though and it starts crashing into everything, and its scores get hammered.

Its best runs are in that short period were it is losing its basic skills but it has learnt about surfacing. The highest score I observed was 4000. It is included in the video below.


It learns to cover the first level very efficiently. The second level is a slightly different colour though, so that trips it up. It manages to learn to get some points on the second level but that seems to have a negative effect on its proficiency on the first level. I left it training for a while to see if it'd make the jump but no cigar.

Highest score I observed was 5600.


  1. HI, does anyone knows how many neurons and how many layers are in this network ? I couldn't find this information anywhere.

    Also, what is the structure of the network ? where can I see the connectivity of the neurons inside ?

    I'm very interesting about that,


    1. The description of the network is in the paper linked to at the top (, click on PDF). Top of page 6 there.

      "The input to the neural network consists is an 84 × 84 × 4 image produced by φ. The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity [10, 18]. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action."