DPSVM

Project Checkpoint

Progress review

We finished the additional literature review for the project in the first week after submitting the proposal, which also included working through the Legion boot camp in the first few days. After realizing that Legion would be overkill for the task, owing to its large amount of boilerplate code and our project's limited use of heterogeneity, we revised our platform choice to OpenMPI and CUDA. We also finalized the algorithm we'd be using for SVM training.

In the second week, we worked on the sequential implementation of the SVM training phase and fixed the errors in our earlier algorithm, comparing against LibSVM on the number of support vectors. The sequential implementation is currently slower than LibSVM but achieves the same accuracy. In the last few days, we worked on the parallel version using CUDA with the Thrust library and cuBLAS. We have finished writing the training code, but still have to debug the implementation to complete the training phase.
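
For context on what the CUDA code does: the solver keeps an optimality-indicator vector f on the device and repeatedly selects extreme violators, which maps well onto Thrust reductions, with cuBLAS available for the dense kernel computations. Below is a minimal sketch of that selection-and-update pattern, assuming an SMO-style solver; the names (`UpScore`, `find_i_up`, `update_f`) are illustrative stand-ins, not our actual code.

```cpp
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cfloat>

// Scores index k for "up" selection: k is eligible only if raising alpha_k
// in direction y_k stays feasible; ineligible indices are pushed to FLT_MAX
// so they can never win the min-reduction.
struct UpScore {
    const float *f, *alpha, *y;
    float C;
    __host__ __device__ float operator()(int k) const {
        bool in_up = (y[k] > 0.f && alpha[k] < C) ||
                     (y[k] < 0.f && alpha[k] > 0.f);
        return in_up ? f[k] : FLT_MAX;
    }
};

// After (alpha_i, alpha_j) move by (d_i, d_j), every f_k shifts by
// d_i*y_i*K(i,k) + d_j*y_j*K(j,k); k_i and k_j hold those two kernel rows.
__global__ void update_f(float *f, const float *k_i, const float *k_j,
                         float di_yi, float dj_yj, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) f[k] += di_yi * k_i[k] + dj_yj * k_j[k];
}

// Picks the index with the smallest indicator f_k among eligible points,
// as a single Thrust min-reduction over a transform iterator.
int find_i_up(const thrust::device_vector<float> &f,
              const thrust::device_vector<float> &alpha,
              const thrust::device_vector<float> &y, float C) {
    UpScore score = {thrust::raw_pointer_cast(f.data()),
                     thrust::raw_pointer_cast(alpha.data()),
                     thrust::raw_pointer_cast(y.data()), C};
    auto first = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), score);
    auto best = thrust::min_element(thrust::device, first,
                                    first + (int)f.size());
    return (int)(best - first);
}
```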

Goals and Deliverables

We are on track to deliver a parallel implementation of the SVM training phase based on CUDA and OpenMPI. We have already written the CUDA code (as stated in the progress review) and need to incorporate OpenMPI calls to parallelize over multiple GPUs across nodes. We still need to work out how to reach different nodes through OpenMPI on the latedays cluster without authenticating multiple times on every iteration.
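
To make the plan concrete, the sketch below shows the MPI layer we have in mind: each rank binds to a GPU, computes its local extreme indicator over its slice of the data, and the ranks agree on the global working set via `MPI_Allreduce` with `MPI_MINLOC` over (value, index) pairs. Treat this as an untested outline; the rank-to-GPU mapping and the placeholder local values are assumptions, not checked-in code.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Crude rank-to-GPU binding; assumes at most one rank per GPU per node.
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus > 0) cudaSetDevice(rank % num_gpus);

    // Each rank would compute the minimal indicator over its data slice on
    // the GPU; the values below are placeholders for that result.
    struct { float val; int idx; } local_up = {0.f, rank}, global_up;

    // Merge candidates: the (value, index) pair with the smallest value
    // wins, and every rank receives the same global answer.
    MPI_Allreduce(&local_up, &global_up, 1, MPI_FLOAT_INT, MPI_MINLOC,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global up index: %d (indicator %f)\n", global_up.idx,
               global_up.val);

    MPI_Finalize();
    return 0;
}
```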

A “nice to have” goal would be multi-class classification (as an extension of the two-class problem) before the Parallelism Competition; one standard route is a one-vs-rest scheme, as sketched below. We’d also like to extend our benchmarking to include new systems such as Spark LibLinear, if we manage to complete the previous stretch goal.
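
The sketch below shows how multi-class prediction could sit on top of a two-class trainer via one-vs-rest: train one binary SVM per class (that class as +1, the rest as -1) and predict with the largest decision value. The `BinarySVM` here is a hypothetical linear stand-in, used only so the sketch is self-contained; it is not our actual trainer.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-in for a trained two-class SVM (linear, for brevity).
struct BinarySVM {
    std::vector<float> w;   // weight vector
    float b = 0.f;          // bias
    float decision_value(const std::vector<float> &x) const {
        float s = b;
        for (std::size_t d = 0; d < w.size(); ++d) s += w[d] * x[d];
        return s;           // signed distance from the separating hyperplane
    }
};

// One-vs-rest prediction: models[c] was trained with class c as +1 and all
// other classes as -1; the most confident model wins.
int predict_one_vs_rest(const std::vector<BinarySVM> &models,
                        const std::vector<float> &x) {
    int best = 0;
    float best_val = models[0].decision_value(x);
    for (int c = 1; c < (int)models.size(); ++c) {
        float v = models[c].decision_value(x);
        if (v > best_val) { best_val = v; best = c; }
    }
    return best;
}
```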

Exhibits for the Parallelism Competition

We plan to include a graph in our presentation comparing the runtimes of LibSVM, our sequential implementation, and our parallel implementation on popular datasets such as MNIST, Adult, and Covertype.

Depending on time constraints, we could also run a small demo classifying one of the smaller datasets during the presentation, since those runs would take under 30 seconds.

Preliminary Results

We’ve been using 8 increasingly large versions of the Adult dataset (available here) to verify our sequential implementation against LibSVM. Our accuracy matches LibSVM’s in all cases, while our runtime goes from roughly on par for smaller training sets (1,605 examples) to around 2x LibSVM’s for larger ones (32,561 examples). We expect our parallel implementation to beat LibSVM’s time.

Issues

We could not get OpenMPI running across multiple GHC machines without authenticating multiple times. After adding each machine's public key to the authorized_keys file of every machine we intended to use, MPI calls could transfer data within certain subsets of machines without extra authentication, but this does not satisfy our requirements: the GPUs we want to use are spread across multiple such subsets.

We have yet to work out how to use the latedays cluster for this purpose, and may have to fall back to Amazon EC2 instances if we cannot access its nodes without repeated authentication.

Schedule

| Dates | Planned Goals |
| --- | --- |
| 4/3 - 4/10 | Completed (environment setup, boilerplate code, and literature review) |
| 4/11 - 4/16 | Completed (sequential implementation and CUDA code); debugging pending |
| 4/17 - 4/24 | Implemented the sequential classifier; debug and optimize the CUDA code (4/21 - 4/24) |
| 4/25 - 5/4 | We have 3 final exams this week, so progress will be limited (Siddharth will work on adding OpenMPI and optimizing the parallel implementation from 5/1 - 5/4) |
| 5/5 - 5/8 | Benchmark against LibSVM and our sequential implementation; stretch goals |
| 5/9 - 5/10 | Write-up and presentation |