SURFsara posted the best accuracy and a training time of under 40 minutes on popular deep learning architectures and data sets, establishing new single-model state-of-the-art results using only general-purpose CPU-based hardware rather than special accelerators. Specifically, SURFsara reports training the ResNet-50 model on the ImageNet-1k dataset (1,000 classes) in under 40 minutes, and state-of-the-art accuracy when training on the ImageNet-22k dataset (22,000 classes). Further, the results were reproduced on very different machines and processor architectures, namely Intel® Xeon Phi™ and the recently announced Intel® Xeon® Scalable processors (generally known in the HPC community by the codename Skylake).
Specifically, SURFsara makes two unique claims compared to results published by IBM:
- Better accuracy with fewer Intel Xeon Phi processor-based servers (ResNet-50: 200 KNL vs 256 GPUs, 36.91% vs 33.8%)
- Better time-to-train on Intel Xeon Scalable processors at slightly lower accuracy (ResNet-50: 44 mins @ 74.3% vs 50 mins @ 75.01%)
SURFsara, an Intel Parallel Computing Center (IPCC) collaborator, has published results using the distributed training of deep Convolutional Neural Networks (CNNs) across many hundreds of compute nodes. Joe Curley, Director of HPC Platform and Ecosystem Enabling at Intel, explained the importance of this accomplishment: “Deep learning neural networks are now becoming practical to deploy at scale”.
The fast training times, some under 40 minutes, were achieved when running on up to 768 Intel Xeon Phi processor 7250 nodes and, separately, on up to 512 dual-socket Intel® Xeon® Scalable processor nodes connected over Intel® Omni-Path Architecture (OPA) fabric. As significant as this CPU-only scaling is, it’s not a record for performance or number of nodes utilized while training a deep learning neural network[i]. A multiyear IPCC collaboration that included Stanford and NERSC (National Energy Research Scientific Computing Center) recently published a paper demonstrating 15 PF/s deep learning training performance when utilizing approximately 9,600 of the NERSC Cori Intel Xeon Phi processor compute nodes.
Prior to these two IPCC results (SURFsara and the scaling to 9,600 nodes), the reported scaling behavior of deep-learning applications using TensorFlow, Caffe, and other popular packages had been limited to a few tens of nodes. Pradeep Dubey, Intel Fellow and Director of Intel’s Parallel Computing Lab, observed, “scaling deep-learning training from single-digit nodes just a couple of years back to almost 10,000 nodes now, adding up to more than ten petaflop/s is big news”. Deep learning training is now a member of the petascale club.
The IPCC collaboration with SURFsara widens the impact of distributed training of deep learning neural networks by correlating and comparing both training time and accuracy of the trained model to existing published results.
Accuracy and time-to-model
Succinctly, accuracy and time-to-model are all that matter for neural network training performance, because the goal is to quickly develop a model that represents the training data with high accuracy. Since scaling has been limited for deep learning training applications, people have focused on hardware metrics like floating-point, cache, and memory subsystem performance to predict which hardware platforms are likely to deliver that desirable fast time-to-model. Similarly, most people use the same packages (e.g. Caffe, TensorFlow, the Torch middleware, and others), so the convergence rate to a solution and the accuracy of the resulting solution have been neglected. The assumption is that the same software and algorithms shouldn’t differ too much in convergence behavior across platforms*, hence the accuracy of the models should be the same or very close.
Both the 15 PF/s and SURFsara collaborations have focused on using existing software tools like Intel® Caffe and Intel® Machine Learning Scaling Library inside a larger distributed environment. Using different optimization algorithms (e.g. Stochastic Gradient Descent), distributed asynchronous frameworks, and more raises concerns about convergence rate and accuracy. Industry-leading results on the Imagenet-1K, ImageNet-22K, and Places-365 data sets address these concerns while also maintaining above 90% scaling efficiency. Further, convergence and accuracy results cannot be presented in isolation; they have to be compared to existing methods to allow apples-to-apples comparisons. Hence the importance of the comparisons to both IBM and Facebook GPU-based results.
Intel Xeon Phi Processor results
On the Imagenet-1K dataset, SURFsara reports that their trained models converge to a top-1 accuracy of more than 74.3% (for the 1000 category classification problem the network provides the right answer in more than 74.3% of cases). These results were obtained with the Resnet-50 network architecture, and the parallel efficiency achieved was approximately 97% for these scale-out experiments.
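To make the top-1 and top-5 figures quoted in this article concrete, here is a minimal sketch of how a top-k accuracy is typically computed from per-class scores. The function name and the toy scores and labels are hypothetical and purely illustrative; this is not SURFsara's evaluation code.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the k best.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy data (hypothetical): 4 examples, 3 classes.
scores = [
    [0.7, 0.2, 0.1],  # true label 0: best guess is correct
    [0.1, 0.3, 0.6],  # true label 1: only the second-best guess is correct
    [0.2, 0.5, 0.3],  # true label 1: best guess is correct
    [0.5, 0.4, 0.1],  # true label 2: neither of the two best guesses is correct
]
labels = [0, 1, 1, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

On ImageNet-1K the same calculation runs over the 1,000-class scores for every validation image, with k=1 for top-1 and k=5 for top-5.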
SURFsara replicated experiments of IBM and Facebook in training the Resnet-50 model on the Imagenet-1K dataset, “and achieved 73.2% top-1 accuracy in 41 minutes, and 74% top-1 in about 50 minutes”. The authors, Dr. Valeriu Codreanu and Damian Podareanu from SURFsara, plus Dr. Vikram Saletore from Intel, note they achieved these results using Intel Xeon Phi processors that deliver half the peak floating-point performance of the NVIDIA P100 GPUs used in the Facebook and IBM runs. “Moreover”, they also note, “we achieve above 97% parallel efficiency when going from 1 to 256 nodes.” All models are evaluated against the ILSVRC-2012 validation data, containing the blacklisted images that were removed in 2014.
These results are summarized in the following Stampede2 scale-out convergence results:
Figure 1: TACC Stampede2 scale-out convergence results on Intel® Xeon® Phi (Results courtesy SURFsara)
These results were obtained on the TACC (Texas Advanced Computing Center) Stampede2 supplied by Dell and incorporating Intel Xeon Phi processor 7250 nodes connected with Intel® Omni-Path Architecture (Intel® OPA) fabric. Comparative scaling behavior according to number of workers is shown in the following figure.
Figure 2: Scaling efficiency on Stampede2 (speedup vs number of workers). This plot starts from scaling on 4 workers, which has a scaling factor of 1.
When utilizing more workers the authors reported, “We achieve an accuracy level of 74.05%/92.07% top-1/top-5 for the 512-node case within 46 minutes, and 74.20%/92.20% for the 768-node case in only 39 minutes.”
SURFsara expects to speed up these results further to achieve less than 30 minute training time on the Imagenet-1K dataset.
MareNostrum 4 Intel Xeon Scalable Processor results
“We wanted to specifically evaluate deep learning training on traditional Xeon processors, as these form the backbone of most computing centers around the world”, the authors posted. In addition, there has been much interest in the new Intel Xeon Scalable processors. To address this, SURFsara reported results for similar ImageNet-1K experiments using the new processor’s nodes on the MareNostrum 4 supercomputer supplied by Lenovo at the Barcelona Supercomputing Center (BSC).[ii] MareNostrum 4 is composed of around 3500 dual-socket Intel Xeon Scalable processor 8160 nodes, each containing 96GB RAM and around 200GB of local storage.
SURFsara reported that the results were quite successful. As in the Stampede2 case the authors posted, “all our trained models achieve a top-1 validation accuracy greater than 74% on ILSVRC-2012 validation set, and 74.3% on ILSVRC-2014 respectively.”
Figure 3: MareNostrum 4 scale-out convergence results using Intel® OPA and Intel® Xeon® Scalable processors (Results courtesy SURFsara)
The Intel Xeon Scalable processors and Intel OPA network demonstrated “around 90% scaling efficiency when going from 1 to 256 SKX nodes” as reported by SURFsara and as seen in the figure below.
Figure 4: Scaling efficiency on MareNostrum 4 (speedup vs number of workers). This plot starts from scaling on 4 workers, which has a scaling factor of 1.
The authors show that the Intel Xeon Scalable processor nodes delivered higher hardware efficiency than the hardware used by Facebook and IBM.
Summary Intel Xeon Scalable processor highlights include:
- convergence in 70 minutes using 256 nodes
- convergence in 56 minutes using 400 nodes
- convergence in 44 minutes using 512 nodes
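As a rough illustration of how speedup and scaling efficiency are derived from numbers like these, the sketch below takes the three convergence times listed above and normalizes them to the 256-node run, much as the scaling figures normalize to 4 workers. The function name is hypothetical. One caveat: these are time-to-convergence figures, not throughput measurements, so the efficiencies computed here are not directly comparable to the roughly 90% scaling efficiency SURFsara reports from 1 to 256 nodes.

```python
def relative_scaling(minutes_by_nodes, baseline_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    speedup    = baseline time / time at this node count
    efficiency = speedup / (nodes / baseline_nodes), where 1.0 is ideal linear scaling
    """
    base_minutes = minutes_by_nodes[baseline_nodes]
    results = {}
    for nodes, minutes in sorted(minutes_by_nodes.items()):
        speedup = base_minutes / minutes
        ideal_speedup = nodes / baseline_nodes
        results[nodes] = (speedup, speedup / ideal_speedup)
    return results


# Convergence times (minutes) from the highlights listed above.
minutes_by_nodes = {256: 70, 400: 56, 512: 44}
results = relative_scaling(minutes_by_nodes, baseline_nodes=256)

for nodes, (speedup, efficiency) in results.items():
    print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Relative to 256 nodes, both the 400-node and 512-node runs come out at about 80% efficiency on time-to-convergence.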
Increased accuracy using wide-rather-than-deep neural networks
Originally ‘deep learning’ was used to describe the many hidden layers that scientists used to mimic the many neuronal layers in the brain. While deep ANNs (DNNs) are useful, many in the data analytics world will not use many hidden layers due to the vanishing gradient problem. Thus they will train using wider and shallower neural networks.
Reflecting this, SURFsara reports that wider neural networks can deliver better results than their deeper counterparts. “The main advantage is that widening a network produces better results on the datasets we’ve targeted so far with a lower number of total parameters as the deeper counterpart”, Valeriu Codreanu (SURFsara) observes. The large memory capacity of the Intel Xeon and Intel Xeon Phi processing nodes allowed SURFsara to experiment with even wider neural networks. Codreanu reported, “We realize that we can go even wider than we did now, as we did not find an upper limit for accuracy yet”.
The larger memory capacity is important so the machine can keep the parameters and enough training examples in memory to achieve high performance. CPUs have large memory capacities, especially compared to the limited memory capacity of accelerators. Succinctly, accelerators deliver performance through massive parallelism using thousands of threads. For performance reasons, this means that the device memory has to be large enough to hold enough examples to keep several thousand threads busy. If not, performance suffers. Large training examples can be problematic because they limit parallelism on a limited-memory device.
Wider neural networks can impose even more severe memory constraints, as the data scientist may decide to use more parameters than a comparable deep network. “So, in fact”, SURFsara explained, “such networks (wide ones) have the potential to perform better with more parameters, which can make such a network prohibitive on a GPU in a data parallel fashion.” The extra parameters can increase accuracy but will also reduce the amount of memory available for training examples. “This would hinder GPUs with limited (max 16GB memory) even more”.
Further, convergence (which directly relates to time-to-model) can be affected. Citing an example from their work, SURFsara reported, “When training some of the larger networks, we use around 32-35GB of system memory for a local minibatch size of 16. This effectively allows the GPUs to only use batches of 4-8 images per node when training such networks. We have experimented with these small batches (4-8) per node, and our experience is that convergence is negatively affected”.
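The arithmetic behind the quoted batch sizes can be sketched as follows. This is a back-of-the-envelope model, not SURFsara's measurement: it assumes memory use grows linearly with the local minibatch size, and the function name and the `fixed_gb` overhead for parameters and workspace are hypothetical. Only the 32-35GB at batch size 16 and the 16GB device limit come from the text.

```python
def max_local_batch(device_gb, observed_gb, observed_batch, fixed_gb=0.0):
    """Estimate the largest local minibatch that fits in device_gb of memory.

    Assumes (hypothetically) that memory = fixed_gb + per_image_gb * batch,
    calibrated from a single observation (observed_gb at observed_batch).
    """
    per_image_gb = (observed_gb - fixed_gb) / observed_batch
    if per_image_gb <= 0:
        raise ValueError("fixed_gb must be smaller than observed_gb")
    available_gb = device_gb - fixed_gb
    if available_gb <= 0:
        return 0  # the fixed cost alone does not fit on the device
    return int(available_gb // per_image_gb)


# Reported: about 32-35GB of system memory at a local minibatch size of 16.
low = max_local_batch(device_gb=16, observed_gb=35, observed_batch=16)
high = max_local_batch(device_gb=16, observed_gb=32, observed_batch=16)
print(f"no fixed overhead: {low}-{high} images per 16GB device")

# With a hypothetical 4GB of batch-independent memory (parameters, workspace).
with_overhead = max_local_batch(16, 32, 16, fixed_gb=4)
print(f"with 4GB fixed overhead: {with_overhead} images")
```

Under these assumptions a 16GB device holds roughly 6 to 8 images per step, which is consistent with the 4-8 range the authors quote.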
The beauty of deep learning, and machine learning in general, is that once the data and model have been specified, performance depends only on the software and computer hardware. The fastest “student” is therefore the computer or compute cluster that delivers the fastest time to solution while finding solutions of acceptable accuracy. By that measure, a 15 PF/s machine like Cori can be considered a roughly three orders of magnitude “better student” who can model real-world data sets in climate and high energy physics. The SURFsara results show that distributed training can also make big clusters and cloud instances “faster and equally accurate students” compared to other “kids in the class”. The results are available for all to see, so the deep learning community can confirm that distributed training is viable and accurate. Further, hardened software is either available or currently in development, both for users and for incorporation into productivity languages.
*Note: Numerical differences will accumulate during the training run which will cause the runtime of different machines to diverge. The assumption is that the same algorithms and code will provide approximately the same convergence rate and accuracy. However, there can be exceptions.
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. Rob can be reached at firstname.lastname@example.org.
[i] Valeriu Codreanu notes that it is a record CPU-only scaling for traditional CNNs that achieve state of the art results on the popular Imagenet-1K.
[ii] The authors wish to acknowledge PRACE for awarding access to MareNostrum 4 based in Spain at BSC.