Friday, April 5, 2013

Speeding up libsvm

I mentioned in my first post that a run of libsvm's grid.py tool to optimise the hyperparameters for MNIST took 36 hours on my computer. This was using a manually compiled version of libsvm using the plain source code from the site. There are two things that can massively speed up your libsvm training runs. They are both mentioned on the libsvm site, but they are probably not given enough prominence.

The first one is to parallelise the inner loop of the code by using OpenMP. This takes four lines of code. If you use Gentoo, the source code is already patched to use this. The speedup is almost linear with the number of processors in your computer. I have 4 hyperthreaded processors in mine, and I've got around a 7.5x speedup. You can read about how to do it in the libsvm's FAQ or just download the patch from Gentoo

The second speedup is to use CUDA. The CUDA implementation of libsvm was written by (at least one of) Andreas Athanasopoulos, Anastasios Dimou, Vasileios Mezaris and Ioannis Kompatsiaris and you can find it at http://mklab.iti.gr/project/GPU-LIBSVM. It speeds up things even more than the OpenMP version, but only under certain cases.

For example, training a subset of MNIST using the OpenMP version:


$ time svm-train -q -v 5 -m 1000 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation Accuracy = 96.3833%

real	0m21.397s
user	2m45.332s
sys	0m0.254s


Same thing using the CUDA version:


$ time programming/cuda/libsvm-cuda/svm-train-gpu -q -v 5 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation = 96.3833%

real	0m10.649s
user	0m9.972s
sys	0m0.654s


That's a two-times speedup over the 8-processor version. (UPDATE: I realised these numbers don't mean anything if I don't tell you at least some specs of my machine.  i7-870 with a Geforce GTS 450 with 512 MB)

There are a few caveats and things to keep in mind regarding the CUDA version:
  • While the runtime for the CPU-version of the code scales linearly with the number of cross-folds, the CUDA version's runtime will scale sublinearly. eg changing that -v 5 to -v 10 takes 16.5 seconds instead of 10.6.
  • The CUDA code only runs for cross-validation-enabled runs. If I hadn't used -v 5 in that run, the single-threaded CPU version of the code would have run
  • Most importantly: the CUDA version doesn't implement SMO when solving the SVM, so its space requirements scale quadratically with the number of samples in your dataset. Since my graphics card has 512 MB of RAM, it can only handle about 7000 samples before it crashes (7000 * 7000 * 8 bytes/double ~= 400MB). My pick of 6000 for subset size was a lucky coincidence.
  • The development of the CUDA code seems to have stopped at libsvm 3.0.  I've emailed the authors and they replied that they don't have anyone working on it at the moment but that they are planning to move the code somewhere more accessible so it can be kept up to date by the rest of us.
I have patches for the code to align it with version 3.14 (although I just noticed that libsvm is up to version 3.17), and a Makefile to make it compile on Gentoo.  I pasted the Makefile below since it's an easy way to get started with the code if you want to try it.



# Change the CUDA_INSTALL_PATH to wherever you have CUDA installed
CUDA_INSTALL_PATH ?= /opt/cuda
NVCC       := $(CUDA_INSTALL_PATH)/bin/nvcc
EXECUTABLE  := svm-train-gpu
CUDACCFLAGS := -po maxrregcount=16
INCLUDES += -I. -I$(CUDA_INSTALL_PATH)/include
LIBS = -lcublas
LD_PATH = -L$(CUDA_INSTALL_PATH)/lib

CXXFLAGS ?= $(CFLAGS)
CXXFLAGS += -fPIC -W -Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function -m32 -DUNIX


# Debug/release configuration

ifeq ($(dbg),1)
    CXXFLAGS += -g -D_DEBUG
else
    CXXFLAGS += -O2 -fno-strict-aliasing
endif

all: $(EXECUTABLE)

$(EXECUTABLE): svm.o svm-train.o
	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS) $(LD_PATH)

svm.o: svm.cpp svm.h

svm-train.o: svm.h svm-train.c kernel_matrix_calculation.c cross_validation_with_matrix_precomputation.c
	$(CXX) $(CXXFLAGS) $(INCLUDES) -c -o $@ svm-train.c

clean:
	rm svm.o svm-train.o svm-train-gpu