Over 90 percent of the participating teams and three of the four winners in the prestigious 2014 ImageNet Large Scale Visual Recognition Challenge used GPUs to enable their deep learning work. Deep learning is a fast-growing segment of machine learning that involves the creation of sophisticated, multi-level or “deep” neural networks. These networks enable powerful computer systems to learn to recognize patterns, objects and other items by analyzing massive amounts of training data.
To accelerate research and development in deep learning, NVIDIA has released the cuDNN machine-learning library to registered CUDA developers. The library can be downloaded from the cuDNN website.
The cuDNN library is already supported by UC Berkeley's Caffe framework, with integrations into other popular machine-learning frameworks on the way. Please see the Caffe documentation for instructions on enabling cuDNN with Berkeley's framework.
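For reference, enabling cuDNN in Caffe is essentially a one-line build-configuration change. The fragment below is a hedged sketch that assumes the flag naming used in the BVLC Caffe Makefile.config, which may differ between releases; consult the Caffe documentation for the authoritative steps.

```
# Makefile.config (BVLC Caffe): uncomment the cuDNN switch,
# then rebuild, e.g. `make clean && make all && make test`
USE_CUDNN := 1
```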
The NVIDIA website provides plots showing the speedup on a few example problems.
Other projects also provide convenient interfaces for deep learning, such as RaPyDLI by Jack Dongarra (University of Tennessee) and Geoffrey Fox (Indiana University), along with Andrew Ng (Stanford, Baidu, and Coursera). RaPyDLI provides a Python interface to run deep-learning problems on CPUs, GPUs, and Intel Xeon Phi.
The freely available farbopt deep-learning teaching code provides comparative performance across CPUs, GPUs, and Intel Xeon Phi through CUDA, OpenACC, OpenMP, Intel native, and Intel Xeon Phi offload versions. The farbopt code also exhibits near-linear scaling to tens of thousands of devices (for example, 16,384 GPUs on the ORNL Titan supercomputer and 128k processors on two CM-200 Connection Machines).
While not an apples-to-apples comparison, the Farber teaching code does deliver over a TF/s per device on both linear and nonlinear deep-learning problems, as can be seen in the per-device performance plots from the article “Deep-learning Teaching Code Achieves 13 PF/s on the ORNL Titan Supercomputer“. (I developed this high-performance mapping in the early 1980s while in the Theoretical Division at Los Alamos National Laboratory and a member of the external faculty at the Santa Fe Institute. It was the first program I ran on NVIDIA GPUs and provided the performance motivation for my 2008 Dr. Dobb's article, “CUDA, Supercomputing for the Masses: Part 1“.)
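For readers curious about why this mapping scales so well, the code below is a minimal sketch of the data-parallel objective-function evaluation (an illustrative reconstruction, not the actual farbopt source). Every training exemplar is pushed through the network independently and only the per-exemplar errors are combined in a reduction, so the objective handed to the optimizer is embarrassingly parallel. The tiny 2-2-1 network, the XOR-style data set, and all names in the code are assumptions made for illustration.

```cpp
// Minimal sketch of a data-parallel objective-function evaluation
// (illustrative only; not the actual farbopt source). A tiny 2-2-1
// feed-forward network and an XOR-style data set stand in for a real
// problem. Each exemplar is evaluated independently; only the final
// sum of squared errors is shared, via a reduction.
#include <cmath>
#include <cstdio>
#include <vector>

struct Exemplar { float in[2]; float target; };

// Forward pass through the 2-2-1 network. The nine parameters are the
// hidden-layer weights/biases followed by the output weights/bias.
static float predict(const float *p, const float *in) {
    const float h0 = std::tanh(p[0] * in[0] + p[1] * in[1] + p[2]);
    const float h1 = std::tanh(p[3] * in[0] + p[4] * in[1] + p[5]);
    return std::tanh(p[6] * h0 + p[7] * h1 + p[8]);
}

// Objective: sum of squared errors over the training set. The loop body
// is embarrassingly parallel, so the same pattern runs across OpenMP
// threads here, or across MPI ranks / GPU thread blocks at scale.
static double objective(const float *p, const std::vector<Exemplar> &data) {
    double err = 0.0;
#pragma omp parallel for reduction(+ : err)
    for (long i = 0; i < static_cast<long>(data.size()); ++i) {
        const float diff = predict(p, data[i].in) - data[i].target;
        err += static_cast<double>(diff) * diff;
    }
    return err;  // an external optimizer minimizes this value
}

int main() {
    // Replicate four XOR exemplars to mimic a large training set.
    const Exemplar base[4] = { {{0, 0}, -1}, {{0, 1}, 1}, {{1, 0}, 1}, {{1, 1}, -1} };
    std::vector<Exemplar> data;
    for (int r = 0; r < 100000; ++r)
        for (int k = 0; k < 4; ++k) data.push_back(base[k]);

    // Arbitrary starting parameters; a real run would optimize these.
    const float params[9] = {0.5f, -0.5f, 0.1f, -0.5f, 0.5f, 0.1f, 1.0f, 1.0f, 0.0f};
    std::printf("objective = %f\n", objective(params, data));
    return 0;
}
```

Because only the final reduction is shared, the same partition-plus-reduce structure maps directly onto MPI ranks or GPU thread blocks, which is what makes the near-linear scaling quoted above possible.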
The renewed popularity of deep learning, coupled with modern TF/s devices, is a powerful combination. In addition, deep-learning networks can be structured to learn individual tasks and, potentially, entire neural subsystems.
The comparison indicates that, for some reason, the cuDNN benchmarks are not showing the full performance of the K40 hardware.