Baidu Research used a small 36-node NVIDIA-powered cluster to attain the best ImageNet classification result to date, with a top-5 error of 5.98% vs. GoogLeNet’s 6.66%. These results are very close to the human error rate of 5.1%. Key to the Baidu performance is their mix of model- and data-parallelism, the use of higher-resolution images (512×512 vs. 256×256), and the incorporation of additional synthetic data derived from the ImageNet images. The paper, “Deep Image: Scaling up Image Recognition”, describes the Baidu approach. It is available on arxiv.org.
In particular, Baidu augmented the ImageNet images with various effects such as color-casting, vignetting and lens distortion. The goal was to let the system take in more features of smaller objects and to learn what objects look like without being thrown off by editing choices, lighting situations or other extraneous factors.
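To make two of these effects concrete, here is a minimal NumPy sketch of a color cast and a vignette. It is only an illustration, not the code from the Baidu paper: the function names, the `shift` and `strength` parameters, and the quadratic falloff used for the vignette are all assumptions chosen for simplicity.

```python
import numpy as np


def color_cast(image, shift):
    """Add a per-channel offset (hypothetical 'shift') to an RGB uint8 image."""
    # Widen to int16 so the addition cannot wrap around before clipping.
    shifted = image.astype(np.int16) + np.asarray(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)


def vignette(image, strength=0.5):
    """Darken pixels toward the corners; 'strength' in [0, 1] is an assumed knob."""
    h, w = image.shape[:2]
    ys = np.linspace(-1.0, 1.0, h)[:, None]
    xs = np.linspace(-1.0, 1.0, w)[None, :]
    # Normalized squared radius: 0 at the image centre, 1 at the corners.
    r2 = (xs ** 2 + ys ** 2) / 2.0
    mask = 1.0 - strength * r2
    out = image.astype(np.float32) * mask[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # A flat grey test image stands in for a real ImageNet sample.
    img = np.full((5, 5, 3), 200, dtype=np.uint8)
    augmented = vignette(color_cast(img, (20, 0, -30)), strength=0.5)
    print(augmented[2, 2], augmented[0, 0])
```

Applying several such transforms with randomly drawn parameters to each training image yields many plausible variants of the same labeled object, which is the general idea behind this kind of augmentation.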
The small NVIDIA-powered cluster is well within the reach of most universities and small companies. It comprises 36 server nodes, each with two six-core Intel Xeon E5-2620 processors. Each server contains four NVIDIA Tesla K40m GPUs and one FDR InfiniBand (56 Gb/s) adapter, a high-performance, low-latency interconnect that supports RDMA. Each GPU has a peak single-precision floating-point performance of 4.29 TFLOPS and 12 GB of memory.
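The per-node figures above can be multiplied out to get a rough sense of the whole machine. The short sketch below does that arithmetic; the function and key names are illustrative, and the totals are theoretical peaks computed from the quoted specifications, not measured or sustained performance.

```python
def cluster_totals(nodes=36, cpus_per_node=2, cores_per_cpu=6,
                   gpus_per_node=4, peak_tflops_per_gpu=4.29,
                   mem_gb_per_gpu=12):
    """Aggregate the per-node hardware figures quoted in the article."""
    gpus = nodes * gpus_per_node
    return {
        "cpu_cores": nodes * cpus_per_node * cores_per_cpu,
        "gpus": gpus,
        # Theoretical single-precision peak across all GPUs.
        "peak_tflops": gpus * peak_tflops_per_gpu,
        "gpu_memory_gb": gpus * mem_gb_per_gpu,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value:g}")
```

With the numbers from the text this gives 144 GPUs, 432 CPU cores, roughly 618 TFLOPS of peak single-precision throughput and 1,728 GB of aggregate GPU memory.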
Andrew Ng of Baidu taught the following deep-learning course in 2012
TechEnablement also makes our exascale-capable deep-learning mapping available on GitHub. You can read more about our approach in this article, “Deep-learning Teaching Code Achieves 13 PF/s on the ORNL Titan Supercomputer”.