Friday, April 5, 2013

Speeding up libsvm

I mentioned in my first post that a run of libsvm's grid.py tool to optimise the hyperparameters for MNIST took 36 hours on my computer. That was with a copy of libsvm compiled by hand from the plain source code on the site. There are two things that can massively speed up your libsvm training runs. Both are mentioned on the libsvm site, but they are probably not given enough prominence.

The first is to parallelise the inner loop of the code with OpenMP. This takes four lines of code. If you use Gentoo, the source code is already patched to do this. The speedup is almost linear in the number of processors in your computer. Mine has 4 hyperthreaded cores (8 logical processors), and I get around a 7.5x speedup. You can read about how to do it in libsvm's FAQ, or just download the patch from Gentoo.

The second speedup is to use CUDA. The CUDA implementation of libsvm was written by (at least one of) Andreas Athanasopoulos, Anastasios Dimou, Vasileios Mezaris and Ioannis Kompatsiaris and you can find it at http://mklab.iti.gr/project/GPU-LIBSVM. It speeds things up even more than the OpenMP version, but only in certain cases.

For example, training a subset of MNIST using the OpenMP version:


$ time svm-train -q -v 5 -m 1000 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation Accuracy = 96.3833%

real	0m21.397s
user	2m45.332s
sys	0m0.254s


Same thing using the CUDA version:


$ time programming/cuda/libsvm-cuda/svm-train-gpu -q -v 5 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation = 96.3833%

real	0m10.649s
user	0m9.972s
sys	0m0.654s


That's a two-times speedup over the 8-processor version. (UPDATE: I realised these numbers don't mean anything if I don't tell you at least some specs of my machine: an i7-870 with a GeForce GTS 450 with 512 MB.)
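
As a rough check on the figures above, here is a small Python sketch that converts the `time` output into seconds and computes the two ratios. The helper name `to_seconds` is made up for illustration and is not part of libsvm.

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '2m45.332s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Figures copied from the two runs above.
openmp_real = to_seconds("0m21.397s")
openmp_user = to_seconds("2m45.332s")
cuda_real = to_seconds("0m10.649s")

# Wall-clock speedup of the CUDA run over the OpenMP run.
cuda_speedup = openmp_real / cuda_real

# user/real approximates how many logical processors the OpenMP run kept busy.
openmp_parallelism = openmp_user / openmp_real

print(f"CUDA vs OpenMP speedup: {cuda_speedup:.2f}x")
print(f"OpenMP effective parallelism: {openmp_parallelism:.2f}")
```

The first ratio comes out at about 2.0 and the second at about 7.7, which is in line with the "almost linear" OpenMP speedup on 8 logical processors.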

There are a few caveats and things to keep in mind regarding the CUDA version:
  • While the runtime of the CPU version of the code scales linearly with the number of cross-validation folds, the CUDA version's runtime scales sublinearly, e.g. changing that -v 5 to -v 10 takes 16.5 seconds instead of 10.6.
  • The CUDA code only runs for cross-validation-enabled runs. If I hadn't used -v 5 in that run, the single-threaded CPU version of the code would have run instead.
  • Most importantly: the CUDA version doesn't implement SMO when solving the SVM, so its space requirements scale quadratically with the number of samples in your dataset. Since my graphics card has 512 MB of RAM, it can only handle about 7000 samples before it crashes (7000 * 7000 * 8 bytes/double ~= 400MB). My pick of 6000 for subset size was a lucky coincidence.
  • The development of the CUDA code seems to have stopped at libsvm 3.0.  I've emailed the authors and they replied that they don't have anyone working on it at the moment but that they are planning to move the code somewhere more accessible so it can be kept up to date by the rest of us.
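
The memory limit in the third caveat can be estimated with a few lines of Python. This is only a minimal sketch: it assumes, as in the back-of-the-envelope figure above, one 8-byte double per entry of an n x n matrix and nothing else on the card, and the function names are made up for illustration.

```python
import math

BYTES_PER_DOUBLE = 8

def kernel_matrix_bytes(n_samples):
    """Bytes needed for a dense n x n matrix of doubles."""
    return n_samples * n_samples * BYTES_PER_DOUBLE

def max_samples(gpu_bytes):
    """Largest n whose n x n matrix of doubles fits in gpu_bytes."""
    return math.isqrt(gpu_bytes // BYTES_PER_DOUBLE)

gpu_bytes = 512 * 1024 ** 2  # a 512 MB card, assuming all of it were free

print(kernel_matrix_bytes(6000) / 1e6)  # MB for the 6000-sample subset
print(kernel_matrix_bytes(7000) / 1e6)  # MB for 7000 samples
print(max_samples(gpu_bytes))           # theoretical upper bound on n
```

The bound this gives is an upper limit under those assumptions. The usable figure will be lower if anything else is using the card's memory, which would be consistent with the crashes starting at around 7000 samples.
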
I have patches for the code to align it with version 3.14 (although I just noticed that libsvm is up to version 3.17), and a Makefile to make it compile on Gentoo.  I pasted the Makefile below since it's an easy way to get started with the code if you want to try it.



# Change the CUDA_INSTALL_PATH to wherever you have CUDA installed
CUDA_INSTALL_PATH ?= /opt/cuda
NVCC       := $(CUDA_INSTALL_PATH)/bin/nvcc
EXECUTABLE  := svm-train-gpu
CUDACCFLAGS := -po maxrregcount=16
INCLUDES += -I. -I$(CUDA_INSTALL_PATH)/include
LIBS = -lcublas
LD_PATH = -L$(CUDA_INSTALL_PATH)/lib

CXXFLAGS ?= $(CFLAGS)
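# -m32 hardcodes a 32-bit build; drop it if compiling on a 64-bit system fails with a missing gnu/stubs-32.h
CXXFLAGS += -fPIC -W -Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function -m32 -DUNIX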


# Debug/release configuration

ifeq ($(dbg),1)
    CXXFLAGS += -g -D_DEBUG
else
    CXXFLAGS += -O2 -fno-strict-aliasing
endif

all: $(EXECUTABLE)

$(EXECUTABLE): svm.o svm-train.o
	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS) $(LD_PATH)

svm.o: svm.cpp svm.h

svm-train.o: svm.h svm-train.c kernel_matrix_calculation.c cross_validation_with_matrix_precomputation.c
	$(CXX) $(CXXFLAGS) $(INCLUDES) -c -o $@ svm-train.c

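# all and clean are not files; -f keeps clean from failing when nothing has been built yet
.PHONY: all clean
clean:
	rm -f svm.o svm-train.o $(EXECUTABLE)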


11 comments:

  1. Hi, I would just like to ask how you were able to compile the libsvmgpu project under linux. I tried using the make file included in the sample programs and also the one that you posted above but it wasn't able to successfully compile the project. can you help me out? thanks

    1. Hi Summer
      What error message do you get when you run make on the one I posted?

    2. Hey ale, whenever I run the make file used here it gives me the following error:

      g++ -fPIC -W -Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function -m32 -DUNIX -O2 -fno-strict-aliasing -c -o svm.o svm.cpp
      In file included from /usr/include/features.h:385,
      from /usr/include/math.h:28,
      from svm.cpp:1:
      /usr/include/gnu/stubs.h:7:27: error: gnu/stubs-32.h: No such file or directory
      make: *** [svm.o] Error 1

      I'm not really familiar with developing projects yet with linux, so I hope you can help me out. Thanks a lot!


    3. @summer I am not sure exactly what the issue is, but my guess is that you are running a 64-bit system and I hardcoded the compile to 32-bit. I would try getting rid of the '-m32' in the Makefile and trying again, although I have very low confidence in that working.

      Another option, one more likely to work, would be to install the 32-bit branch of glibc. Have a look at the answer in http://stackoverflow.com/questions/7412548/gnu-stubs-32-h-no-such-file-or-directory . It lists options for many distributions.

    4. Hi ale, taking out the -m32 flag worked :) Thanks! Yes, I am working on a 64-bit machine and I think that was the problem. I just have one question regarding your work. Is the speedup, or the use of the GPU, only in the svm-train file? Or is there a way to speed up the grid.py tool as well? Because in my non-GPU implementation, it is grid.py which accounts for most of the time. Thanks for any insight that you can provide regarding this!

    5. Good to hear that taking out -m32 worked.

      grid.py calls svm-train, so it gets the speedup too. In fact grid.py is where this helps most: it calls svm-train to perform 5-fold cross-validation, and cross-validation is the main part that the GPU version speeds up.

      Make sure to set nr_local_worker = 1 in grid.py or you'll risk blowing up the memory of your GPU. Also, remember to set '-svmtrain [path-to-your-GPU-compiled-svm-train-binary]' in the command-line options to grid.py, or it will try to find an svm-train by itself, most likely finding a system one not GPU-compiled.

    6. Hi ale, I've recently started to poke around the source code to see how everything works. I'm also trying to compare running the GPU version of svm-train with the non-GPU one. I read that I have to enable cross-validation in order for it to run on the GPU. Is that true? Also, once I tried to run the GPU version with 5-fold cross-validation, it ate up my memory and wouldn't continue to run. Do you have any idea why this is? Thanks!

    7. Sorry about the lateness of the reply.

      You will only see a speedup during cross-validation since the speedup comes from doing all the cross-validation sets in parallel.

      Regarding memory, you should be OK as long as the number of elements in your training set stays below the square root of (your GPU's RAM in bytes divided by 8) or so. This is because the CUDA version uses a naive algorithm instead of SMO.

      Also note that they've released a new version of the code and moved it to GitHub: https://github.com/MKLab-ITI/CUDA/ .

  2. This comment has been removed by the author.

  3. Please explain to me how grid.py works as a whole. I don't know Python, so it is difficult to understand what is really happening.

  4. I have tried this library but it doesn't speed up libsvm.
