Friday, June 10, 2016

Telling Python to prefer your user's private directory over the global library directory

Just a quick note since this caused quite a bit of frustration for me over the last couple of days:

I usually install Python modules using 'pip install --user <module>' since this is nice and safe, and I don't need to sudo to do it. These modules are installed, by default, to ${HOME}/.local/lib/python<version>/site-packages. This has served me well over the years, and it's something that worked before pip going back to easy_install and even before when running 'python install --user' manually.

Recently, trying to debug problems I am having installing TensorFlow, I noticed that the global packages were taking priority over my user-local modules in the cases where the module was installed in both. eg if I
    >>> import six
    >>> six.__file__
even though I have six installed locally. Note, this is the debian-custom global location for Python modules. I don't think the fact that I am using a Debian-derivative modifies this story, but I am not completely sure.

After a lot of Googling and looking through, this ended up being due to:

  1. Having ENABLE_USER_SITE = None in /usr/lib/python2.7/ although I don't remember ever having changed this before and this used to work ok. Anyway, set this to True
  2. More importantly, there was a row in ${HOME}/.local/lib/python2.7/easy-install.pth saying '/usr/lib/python2.7/dist-packages'. All these directories get pre-pended to sys.path, so this sneaks the global path above my user-local. Removing this line fixes the problem.
/usr/lib/python2.7/dist-packages gets added to sys.path anyway later in, but it gets appended instead of pre-prended, so all works out well.

Friday, March 4, 2016

Deep Q network trained to play Galaxian

I think DeepMind was a bit unlucky in not choosing Galaxian as one of their original set of games to train on. It's a very popular game, much more interesting than its Space Invaders ancestor, and, more importantly, their algorithm learns great on it.

Here are the usual graphs for it:

Note that, unlike in the Space Invaders case, these scores are very much human-level. Those scores in the high 20,000 range are really hard to get unless you play the game a lot. Even the 10,000 average is very respectable. It also very quickly reaches perfectly-reasonable and quite human-styled levels of play (less than 10 epochs).

I made a learning-progression video for it:

I used from Nathan Sprague's code ( You'll need a newish version of ALE ( to train this since code to support Galaxian was only incorporated recently (commit c62378aece70cfb119d0d4111d15b4b830a23d6c or later).

Sunday, December 20, 2015

Truly superhuman results with DeepMind's DeepQ algorithm

While DeepMind's papers show their algorithm doing better than some random person on a game they don't play much while having to use a keyboard when playing for five minutes ("expert human player" scoring 31 on breakout, c'mon), they don't try to show their programs getting better scores than human experts at their own games. I am not really sure why they chose their 5-minute limit for reporting, but the rest of us don't have to abide by it, and in here we can show that if you remove the 5-minute limit, the algorithm can beat the best human scores in at least one game.

One of the games that stands near the top of the DeepQ / human ratio in their papers is Atlantis. It's a game where planes fly overhead and you shoot them down from one of three guns. Most planes that fly past are harmless and are only there to score points from. Every now and then a plane firing a laser down comes across and if you fail to shoot it down it will destroy one of your guns or one of your other "buildings". When all of your buildings are destroyed, the game is over. One unusual feature of the game is that your buildings and guns reappear if you score enough points.

It would be very easy to write a program to play this game almost perfectly if you could hand-code it. The sprites are pretty much always the same, the planes always appear at the same height and the game never changes. You could write an if-detect-plane-here-then-shoot program and it would do amazingly well. It's a lot harder if you are not hand-coding almost anything like DeepQ does.

DeepQ learns to play this game relatively quickly. There is constant feedback from shooting down planes, and because of the loss-of-life marker, the program gets instant feedback when one of its buildings or guns is destroyed. Here's the training graph for it.

And here's the first 60 epochs or so since the latter epochs are hiding the early landscape:

As you can see, the learning is quite gradual until the huge spikes. There is even quite a bit of improvement before epoch 30 that is being hidden in the zoomed in image, although given the constant feedback I'd have expected it to have improved even faster. The scores are very high from epoch 40 onwards. For reference the score quoted on DeepMind's Nature paper ( ) are around 30000 for humans and 85000 for DQN.

The spikes on the first graph are huge, but they turn out to be artificially LOW because of the limit of playing time imposed on the testing period during training (125,000 actions, which is roughly two hours 20 minutes of play). I grabbed the model that produced the highest one, on epoch 78, and set it to play with unlimited time. After over 20 hours of gameplay, with the emulator running very slowly (bug yet to be investigated), and me doubting it was ever going to finish the game, I killed the run.

I've put the first 9 and a quarter hours of the game on Youtube:

It scores around 15 million points during the video, and I think over 30 million during the run (it isn't easy to check since the points clock over every million so you have to find all the clocking overs during the run). For comparison, the highest score recorded in TwinGalaxies is around 6.6 million for someone using a real Atari, and around 10 million for someone using an emulator.

If you watch the game even for a little bit you can tell that the computer doesn't play very well at all. It constantly misses on most of its shots. What it does super-humanly though, is kill the laser-shooting planes, and since that's the only way you can die, it can play for a very long time.

For those of you who want to try this out, I used Nathan Sprague's brilliant reimplementation of DeepMind's DeepQ algorithm which you can get at I ran it with the "Nature"-sized model. You can also download the pre-trained model from

Monday, March 2, 2015

DeepMind's Atari paper replicated

This is a followup on the previous post. Most of the results have been replicated. Turning off Nesterov momentum, pumping RMSprop decay to 0.99 and setting the discount rate to 0.95 did the trick. The spikes in the Q-score plot that were observed with the parameters used previously that I mentioned in the last post seemed to have been caused by the momentum making the optimisation diverge, not by Q-learning itself diverging as I thought previously.

Just to be completely clear, none of this is by me: all the learning code, to which I will refer to as deep_q_rl, is by Nathan Sprague, and the algorithm is from the paper "Playing Atari with Deep Reinforcement Learning" by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller that you can get from You can get the code from I'm mainly acting as a tester.

(As I was writing this post (it has taken me almost two weeks to write), DeepMind published their updated paper on Nature (, and this time they've released their source code: In this post I am mainly comparing to the original NIPS workshop paper. I will only mention the Nature paper where it seemed to clarify something about the NIPS paper)

When I write that the results are replicated, I mean that I think that the results produced by deep_q_rl are probably as good as the results that DeepMind obtained when they wrote the paper, and that all the original results seem believable considering these results. The Q-value curves still look very different, most likely because they were using a higher discount rate (*in their Nature paper they mention a discount rate of 0.99 while this code uses 0.95). Also, depending on the interpretation of the table where they list their results, DeepMind's results might have been significantly better. It is hard to tell exactly what they mean by "average total reward" in table 1. Do they mean across the whole 200 epochs? Or testing their best model? Or testing only the final network? Or across a large swathe of the networks that were trained?  I've taken a liberal interpretation of them to mean roughly how well does the agent score across many epochs once it looks like it has learnt. (From their Nature paper, it seems that they mix their use of episode and epoch a bit, so that the "best" seems to mean average across the testing epoch where it averaged highest, while "average" could be the average over all testing episodes)

Here are DeepMind's average and best scores as listed in table 1:

GameBeam riderBreakoutEnduroPongQ*BertSeaquestSpace invaders
DeepMind average40921684702019521705581
DeepMind best518422566121450017401075

I think deep_q_rl does clearly worse than DeepMind's code in Space Invaders, a little bit worse in Breakout, similar in Pong and Enduro, better on Q*Bert, and worse, similar or much better on Seaquest depending on your angle. I did not test on Beam rider.

I'll just post the average and max-score curves obtained by deep_q_rl, some qualitative commentary on how the agent plays on each game, and videos showing good (biased) examples of the agent playing or montages of it learning. The graphs show average test score on the left and highest test score on the right. The x-axis is the epoch. They were obtained using a test epsilon of 0.05 (ie on each frame there is a five percent chance of performing a random action) just to be able to compare them with DeepMind's results. I prefer watching the results of epsilon 0.01 (ie 1 percent chance of random action) since it makes it easier to tell policy problems from random errors, so I'll be describing the behaviour of runs using those settings. Also, let me be very clear that even though I might have some negative comments on the behaviour of the agent, this is all to be taken as relative to the absolute level of blown-awayedness that I feel regarding that this works at all.


Pong is very simple and the agent learns very quickly. It plays extremely aggressively always trying to win in one hit. It mostly wins 21-0 or 21-1.


The code learns to play breakout relatively quickly, taking about 40 epochs (1 epoch = 50000 frames) to get reasonably good at it. It plays without any seeming purpose, hitting the ball when it comes down but not aiming anywhere. 

At some point it gets good enough to make a tunnel in the wall, just by repeated random hitting, and put the ball through the tunnel, scoring lots of points. It never learns to hit the ball after it comes back down through the tunnel. I think that the image looks different enough that the network does not know what to do. It mostly just sits catatonic with the paddle against the left or right wall. I have never seen it clear a level.

The difference between its high scores and its average scores once it has learnt are obtained by it being lucky, first of all by getting the ball through the tunnel quickly once it's formed, since once the tunnel is formed its play deteriorates, and secondly by either having the ball hang around above the wall for longer or by having the paddle in the right place by accident once the ball does come back down.

Highest scored observed: 391. Obtained just as described above: random hitting leading to a hole in the middle of the wall, ball goes up and hangs around for a while, hole is in the right place for a new ball to go through it if hit by a paddle laying against the left wall.

The paper lists the average human score as 31, which I can only explain if they were using a keyboard to play the game. This game is meant to be played with a paddle, and 31 would be a pathetic score. I don't think the scores attained by the network as better than human level.

Space invaders

I expected the code to learn space invaders quite well, since it is a very simple game where the rewards come very shortly after the actions and where random thrashing at the beginning can quickly tell it what actions lead to good rewards, but it does not do very well at it, at least compared to a human. 

It does learn very quickly that shooting is a good thing and within 5 epochs it gets reasonably good at shooting down invaders. It just never gets very good at evading invaders' bullets. It does slowly improve at both tasks, and by epoch 180 or so, its play looks very similar to a novice human player, but it doesn't keep on improving.

The huge difference between the average scores and the maximum scores for each epoch are completely due to luck. It will randomly shoot down a passing mothership which give over 300 points each. Due to the way that the image is cropped while fed to the agent, it cannot actually see the mothership though, so it being able to shoot it down is most likely just due to luck (unless it managed to find the pattern of when they appear and from which direction, something I can't rule out).

DeepMind's version did not do that well either, but clearly better than this one. 

Highest observed score: 925 (shot two motherships)


Enduro is my favourite game to watch it play in. Its style is very non-human, combining super-human reflexes in some parts and parts where it just doesn't know what to do. 

It takes ages for it just to get any points at all in this game (almost 30 epochs). This is because points here are gotten by passing cars, and you need to keep the accelerator pressed for quite a few frames in a row to gather enough speed to pass a car. This is something that isn't likely to happen with random joystick movements. Once it learns to do that though, it quickly gets to very high scores, learning to dodge cars at very high speed in quite different-looking environments.

It never learns what to do in the bright/glare/snow/whatever-it-is stage though, where it just waits it out or goes full tilt till it slams into the car in front of it. It is the only stage where the background is brighter than the objects to avoid, so it is understandable for it to have trouble transferring the learning from the other sections. It also learnt that you cannot gain or lose places while the flags are waving in between levels, so it doesn't even try while they are being shown.

The high scores here track the average. Enduro takes ages to play so most of the time only one game fits in a testing epoch (10,000 frames).

When running without the 10,000 frames limit, I observed a run scoring 1080. It is part of the video below. It goes for over 7 minutes at 2x normal speed.


I didn't think the code would learn to play Seaquest because, unlike in all the other games in this page, all 18 actions are available (ie up-left, up, up-right, left, neutral, right, bottom-left, bottom, bottom-right, then each again but with the button pressed), which means that picking the right one is harder. It does learn to play it, though, in a very shallow sense.

It progressively gets better at hunting down sharks and at avoiding crashing into them, all the time ignoring the divers. It gets very good at that, but it never goes up for air so it keeps dying due to lack of oxygen. Eventually, it discovers that going up for air means getting points for rescued divers. This seems to have a negative effect on its agility though and it starts crashing into everything, and its scores get hammered.

Its best runs are in that short period were it is losing its basic skills but it has learnt about surfacing. The highest score I observed was 4000. It is included in the video below.


It learns to cover the first level very efficiently. The second level is a slightly different colour though, so that trips it up. It manages to learn to get some points on the second level but that seems to have a negative effect on its proficiency on the first level. I left it training for a while to see if it'd make the jump but no cigar.

Highest score I observed was 5600.

Wednesday, January 14, 2015

Replicating the Atari-playing paper from Deep Mind

Deep Mind is a company that got bought by Google at the beginning of 2014 for the usual gazillion dollars that everything seems to get bought by nowadays. They published a paper not long before that showing how they combined a deepish convolutional neural network with Q-learning to get computers to learn to play quite a few Atari games solely from the screen pixels and the score. This is something I've tried to do before so I thought it was very exciting news and that I should try to replicate it.

By the way, if you haven't seen any of the demo videos showing the results, or haven't read the paper,  go have a look. The paper is "Playing Atari with Deep Reinforcement Learning" by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller . Get it from The rest of the post will make much more sense if you have read through the paper.

A year passed and I still hadn't attempted any replication. This was quite lucky since Nathan Sprague did, and some of the Theano code he needed to write looks beyond me. His code is here: (which I'll call the Sprague code).

There are probably a few differences in the details between the implementations by Prof Sprague and Deep Mind. The ones that I noticed are the following:

  • The Sprague code downsamples the 160 x 210 screen to 80 x 105, and then squeezes this into an 80 x 80 input layer, while the Deep Mind version downsamples to 84 x 110 and then crops an 84 x 84 section of it for the input layer. The 80 was most likely chosen by Prof Sprague so that the downsampling would be neat (two to one), but I suspect that the 84 was chosen for the same reason ie so that the resultant image would smear a bit, making everything a bit larger.
  • While the Deep Mind implementation mentions that they used Rprop for the weight updating, the Sprague code implements RMSProp with Nesterov momentum. 
  • The discount rate (gamma in Q-learning) is not mentioned in the paper. The Sprague code sets it to 0.9

I got some time to download the code and test it a bit over the new year's break and it does brilliantly,  even though not quite as brilliantly as what's described in the paper. Deep Mind's paper claim they got an average score of 168 on breakout and 1705 on seaquest while the Sprague code topped out at about 50 for breakout and didn't seem to learn on seaquest. I haven't tried it on any other games yet.

Each 100-episode run takes a bit over 24 hours on my beastly new GTX 970, so there haven't been too many experiments yet. I did make quite a few changes to the code, but mostly to make it easier to use and other non-learning-related changes.

The few changes that I've made that could affect the learning are the following:

  • Changed it to match the Deep Mind's paper more closely by using the crop-to-84x84 method, choosing the bottom part of the image.
  • Made the conversion to grayscale follow whatever it is that cv2 and pillow do that tries to match the way that humans perceive luminance which means that it pays more attention to the green component than to the blue. The original Sprague code simply averages the red, green and blue components.
  • The number of actions is now restricted to the number of actions provided by the hardware used for that game. For most games, that is the joystick so all 18 actions are used as by default (that is neutral, up, down, left, right, up-left, up-right, down-left, down-right, and also each one of those with the button pressed). Breakout and pong and some others were played with a paddle though, so only six actions are allowed: neutral, left, right, and each one of those with the button pressed. I am not sure if the original paper did this or not.
  • I've played around with the discount rate currently settling on 0.95

As an aside, the change to the input selection showed the sensitivity of the code to the input. I initially used pillow to do the downsampling, instead of the cv2 library as used in the original Sprague code, as pillow is easier to install. This change was enough for the program to stop learning altogether on breakout. Comparing the downsampling produced by pillow vs the downsampling produced by cv2, it seems like cv2 smears the image a bit more, making the ball quite a bit bigger than what happens with the pillow downsampling (note: this is using pillow 2.6.1. The 2.7.0 release seems to change the downsampling algorithm to be a bit more like cv2).

While I still haven't seen any learning on seaquest, the program got to an average score of slightly over 100 in some of the testing epochs, while regularly averaging over 70. Here is its current masterpiece, a 331 produced after about 18 hours of training:

Here is the more accurate picture of its results, plotting epoch vs score, and the Q-value on a test set, similar to what's shown on page 5 of the paper:

The score plotted for an epoch is the result of a testing round that is done after that epoch. The testing runs for 10000 frames. That's about eight or nine games when it's doing well. Note the collapse of the score around epoch 103 or so, with the Q-value going whacked shortly after that. That, I think, is the Q-learning diverging.  If you look in the paper, their Q-value goes smoothly up. This one never does, not even before the divergence.

My fork is at . I will probably keep tweaking with it. I expect the code to learn nicely on pong, but I'm not sure on other games. Here are some ideas for it:

  • Try converting the cnn_q_learner to use the default Theano convolutional code which uses the NVIDIA cuDNN code. It should be a bit faster than the cuda_convnet code.
  • Try Rprop instead of RMSprop
  • Test adapting discount rates
  • Change the network architecture (bigger, deeper)
  • Change the algorithm to use more time-separated frames as inputs, rather than the last four frames.

Friday, April 5, 2013

Speeding up libsvm

I mentioned in my first post that a run of libsvm's tool to optimise the hyperparameters for MNIST took 36 hours on my computer. This was using a manually compiled version of libsvm using the plain source code from the site. There are two things that can massively speed up your libsvm training runs. They are both mentioned on the libsvm site, but they are probably not given enough prominence.

The first one is to parallelise the inner loop of the code by using OpenMP. This takes four lines of code. If you use Gentoo, the source code is already patched to use this. The speedup is almost linear with the number of processors in your computer. I have 4 hyperthreaded processors in mine, and I've got around a 7.5x speedup. You can read about how to do it in the libsvm's FAQ or just download the patch from Gentoo

The second speedup is to use CUDA. The CUDA implementation of libsvm was written by (at least one of) Andreas Athanasopoulos, Anastasios Dimou, Vasileios Mezaris and Ioannis Kompatsiaris and you can find it at It speeds up things even more than the OpenMP version, but only under certain cases.

For example, training a subset of MNIST using the OpenMP version:

$ time svm-train -q -v 5 -m 1000 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation Accuracy = 96.3833%

real	0m21.397s
user	2m45.332s
sys	0m0.254s

Same thing using the CUDA version:

$ time programming/cuda/libsvm-cuda/svm-train-gpu -q -v 5 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation = 96.3833%

real	0m10.649s
user	0m9.972s
sys	0m0.654s

That's a two-times speedup over the 8-processor version. (UPDATE: I realised these numbers don't mean anything if I don't tell you at least some specs of my machine.  i7-870 with a Geforce GTS 450 with 512 MB)

There are a few caveats and things to keep in mind regarding the CUDA version:
  • While the runtime for the CPU-version of the code scales linearly with the number of cross-folds, the CUDA version's runtime will scale sublinearly. eg changing that -v 5 to -v 10 takes 16.5 seconds instead of 10.6.
  • The CUDA code only runs for cross-validation-enabled runs. If I hadn't used -v 5 in that run, the single-threaded CPU version of the code would have run
  • Most importantly: the CUDA version doesn't implement SMO when solving the SVM, so its space requirements scale quadratically with the number of samples in your dataset. Since my graphics card has 512 MB of RAM, it can only handle about 7000 samples before it crashes (7000 * 7000 * 8 bytes/double ~= 400MB). My pick of 6000 for subset size was a lucky coincidence.
  • The development of the CUDA code seems to have stopped at libsvm 3.0.  I've emailed the authors and they replied that they don't have anyone working on it at the moment but that they are planning to move the code somewhere more accessible so it can be kept up to date by the rest of us.
I have patches for the code to align it with version 3.14 (although I just noticed that libsvm is up to version 3.17), and a Makefile to make it compile on Gentoo.  I pasted the Makefile below since it's an easy way to get started with the code if you want to try it.

# Change the CUDA_INSTALL_PATH to wherever you have CUDA installed
CUDA_INSTALL_PATH ?= /opt/cuda
NVCC       := $(CUDA_INSTALL_PATH)/bin/nvcc
EXECUTABLE  := svm-train-gpu
CUDACCFLAGS := -po maxrregcount=16
LIBS = -lcublas

CXXFLAGS += -fPIC -W -Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function -m32 -DUNIX

# Debug/release configuration

ifeq ($(dbg),1)
    CXXFLAGS += -O2 -fno-strict-aliasing


$(EXECUTABLE): svm.o svm-train.o
	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS) $(LD_PATH)

svm.o: svm.cpp svm.h

svm-train.o: svm.h svm-train.c kernel_matrix_calculation.c cross_validation_with_matrix_precomputation.c
	$(CXX) $(CXXFLAGS) $(INCLUDES) -c -o $@ svm-train.c

	rm svm.o svm-train.o svm-train-gpu

Wednesday, March 6, 2013

Let's start with MNIST

So, lots of Coursera subjects done and then crap all produced. Geoffrey Hinton's Neural Networks for Machine Learning was very interesting. I'll do something with that.

Interesting paper number one: "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion" by Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio and Pierre-Antoine Manzagol. Stacked = stacked. Denoising = get rid of noise. Autoencoders = things (single-hidden layer neural networks in this case) that encode themselves (not true here, they encode the input, and they try to make their output look like their input). They trained these stacked denoising autoencoders (SDAs) on a few standard datasets and use the hidden layer representations as inputs to support vector machines (SVMs) for classification. They get very good results.

I want to replicate these results to see how much fidgeting would need to be done to get such good results and to try to gauge how easy it would be to use the same technique for other problems. I can also document the parameters that weren't specified in the paper.

The easiest dataset they used for me is MNIST, which is an (over)used collection of 70000 images of hand-written single-digit numbers (from 0 to 9) created by Corinna Cortes and Yann LeCun.  You can get it from or preprocessed so that all pixel intensity values are scaled to the range 0 to 1 and in a more semi-standard format from the libsvm dataset page (which is what I did).

The images are all grayscale 28x28 images (processed from 20x20 black and white images) with the number centered in the image and spanning most of it. The numbers were collected from real people (government employees and high-school students), and they look like normal hand-written digits in all the usual fucked up ways that people write numbers.  eg here's a sample my computer picked at random:

The problem is known to be easy. The paper uses SVMs with Gaussian radial basis function (RBF) kernels as the baseline to compare against, and they get 98.6% right.  SVMs = black magic classifiers. How I understand them: they work by mapping all the data to higher dimensions and then drawing a hyperplane (ie a line/plane in n-1 dimensions) between the samples to divide them as well as possible (which is pretty much how I think of most classifiers). The kernels they use are functions of the dot product of the samples. I think of them as the way to measure the distances between samples and I thought they could also be interpreted as how the mapping to higher dimensions is done but I read Someone Smart saying this wasn't the case, so it most likely isn't (I can't remember who or where that was, but it seemed authoritative enough to stick).

The most popular SVM package seems to be libsvm by Chih-Chung Chang and Chih-Jen Lin. It has nice command-line tools, python bindings and nice utils (also written in python) so I was sold.

There are two important hyperparameters to tweak with SVMs that use gaussian RBFs. First one is the cost (C) of misclassification (ie where to set the trade off between making the dividing hyperplane more complicated versus how much we want the plane to divide the training dataset exactly). Second one is gamma (g), which is how tight we want the Gaussian RBFs to be (g is inversely proportional to the standard deviation of those gaussians).

SVMs only classify between two classes, and there are a lot of different ways to extend that to multiclass classification (like picking from 0 to 9). One way is to build one classifier per target classification (10 in this case). When you train each classifier, you put the target class samples on one side, and all other samples on the other side (eg when training the classifier for the digit 3, you call all the samples of 3s true, and all the samples of non-3s (0s, 1s, 2s, 4s, 5s, 6s, 7s, 8s and 9s) false) . When you are testing, you see which of the classifiers says it is most likely that that target class is true.  This is called one vs all classification.

Another "simple" way to do multiclass classification, and the way that libsvm does it by default, is to build a classifier for each pair of target classes (eg one for the 1s vs the 2s, one for the 7s vs the 8s, one for the 1s vs the 7s, etc). n * (n-1) / 2 classifiers in total (45 in this case).  Then when you test a sample you run the sample on all n * (n-1) / 2 classifiers and keep a tally of which class wins most of the contests for that sample, and claim the winner as the class of the sample. This is one of the one vs one classification systems. Intuitively to me the one vs all system seems better, but libsvm and others seem to like the one vs one and claim it works better. There are many other ways of mapping binary classification to multiclass classification, and I'll probably try some in the future.

libsvm comes with a little util called that looks for the best hyperparameters by sampling the two hyperparameters on a log/log grid, training and doing cross-validation on the data you give it.  The MNIST data is split into 60000 training samples and 10000 test samples. I picked 6000 out of the 60000 training samples to do the grid search to speed the process up.  Running with default parameters samples 110 points on the grid and took about 36 hours on my i7-870 on those 6000 samples (this was before I discovered libsvm-cuda, but I'll write about libsvm-cuda later).

While was running I noticed that the gamma parameter was very important and the cost was completely uninteresting as long as it was sufficiently high (anything bigger than 2). came back with log2 gamma = -5 (0.03125) and cost = some number but I picked 64. I probably should have done some more fine-grained searching on the gamma parameter but I thought 36 hours was enough runtime for my computer

Training with those parameters on the whole learning set, and testing on the testing set gives me 98.57% right. Which is ridiculous. Practically no preprocessing, only scaling between 0 and 1, no coding and automated tweaking, and it gets almost all numbers right. SVMs are something I'm going to have to keep trying on other datasets.

No coding is required for the above.  Assuming you have libsvm compiled and you have, say, the libsvm source directory under "libsvm", the full list of commands to get the above results would be:

# subselect 6000 training samples to run the grid faster
libsvm/tools/ mnist.scale 6000 mnist6000.scale 
# grid to find the best hyperparameters
libsvm/tools/ mnist6000.scale
# wait for a long time and read output from giving you the best hyperparameters
# use them to train a model
svm-train -c 64 -g 0.03125 mnist.scale mnist.model
# see how well that did
svm-predict mnist.scale.t mnist.model mnist.outputguesses

By the way, here are the 143 cases that it got wrong:

That's the straight SVM result replicated. Next to replicate the interesting bit of the autoencoder prepocessing.