NVIDIA announced the P4 and P40 GPUs, which are optimized for machine learning inference performance and efficiency. The products are positioned to gain market share for NVIDIA in the data center. (Inference, sometimes called scoring or prediction, utilizes a trained machine learning algorithm to perform some difficult or valuable task like recognizing a picture, predicting an outcome, processing a signal, categorizing data, or working inside a data analytic framework.) Inference tends to be the money-making machine learning workload in the data center.
According to both Intel and NVIDIA, the next 3-4 years will see a significant increase in the use of inference in the data center. This will likely cause a retooling as search providers, data centers, and cloud providers acquire hardware to keep pace with rapidly escalating machine learning inference workloads. Current estimates indicate that only 7% of existing servers are dedicated to machine learning (Source: Intel). Ian Buck (NVIDIA, VP Accelerated Computing) states, “NVIDIA believes that in the near future every piece of data in the data center will be interacted with by AI”. Diane Bryant (Intel, EVP and GM of the Data Center Group) remarks, “by 2020 servers will run data analytics more than any other workload” [1].
The problem, Ian Buck points out, “is that AI is out-innovating the CPU roadmap by up to 10x a year in some cases.”

Ian Buck (VP Accelerated Computing, NVIDIA)
This has created a market opportunity for NVIDIA, which explains why the company has worked so aggressively to improve its performance on machine learning workloads. The P40 and P4 GPU announcements continue the process of establishing NVIDIA ever more firmly in the data center and place the company in a position of leadership for the retooling expected to occur there over the next 3-4 years.

Speedups on NVIDIA hardware. Training: relative to a Kepler GPU in 2013 using Caffe. Inference: img/sec/watt relative to a CPU (Intel E5-2697v4) using AlexNet
Arithmetic precision
Precision is currently a hot topic because Artificial Neural Networks (ANNs) – the basis for many machine learning, data analytic, and deep learning algorithms – are roughly based on biological models of the brain. From a computational perspective, biological brains operate in a very low-precision and noisy environment. For this reason, computer scientists are investigating the efficacy of using low-precision 16-bit floating-point (FP16) arithmetic and even 8-bit (single byte) integer arithmetic (INT8) for inference.
The speedups can be quite significant. Very simply, utilizing single-byte (INT8) arithmetic means that appropriately designed hardware can perform 4x the number of arithmetic operations per unit time compared to single-precision (32-bit) arithmetic. Similarly, the memory subsystem can become 4x more efficient, as 4x more sequential data values can be transferred per cache-line memory transaction.
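As a rough illustration of where that 4x comes from, here is a minimal sketch using NumPy; the tensor contents and the simple symmetric quantization scheme are assumptions for illustration only, not the scheme NVIDIA uses:

```python
import numpy as np

# Hypothetical FP32 activations from a trained network layer (values are made up).
activations = np.random.randn(1024, 1024).astype(np.float32)

# Simple symmetric quantization: map the observed range onto [-127, 127].
scale = np.abs(activations).max() / 127.0
quantized = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)

print("FP32 bytes:", activations.nbytes)   # 4,194,304 bytes (4 MiB)
print("INT8 bytes:", quantized.nbytes)     # 1,048,576 bytes (1 MiB), 4x smaller
# A 64-byte cache line now carries 64 INT8 values instead of 16 FP32 values.
print("Values per 64-byte cache line: FP32 =", 64 // 4, ", INT8 =", 64 // 1)
```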
To understand the importance of low-precision arithmetic, consider the explosive growth of speech recognition in the data center.

Source: 2016 Internet Trends Report [2] from Kleiner Perkins Caufield & Byers [3]

NVIDIA benchmark results for the P40 on a speech recognition workload. Deep Speech 2 inference performance on a 16-user server | CPU: 170 ms of estimated compute time required for each 100 ms of speech sample | Pascal GPU: 51 ms of compute required for each 100 ms of speech sample. (Source: NVIDIA)
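To put those numbers in context, a quick back-of-the-envelope check (using only the figures quoted in the caption above) shows why the GPU result matters for real-time speech:

```python
# Real-time factor = compute time required per unit of audio processed.
cpu_rtf = 170 / 100   # 1.7: the CPU needs 1.7 s of compute per 1 s of speech, so it falls behind
gpu_rtf = 51 / 100    # 0.51: the Pascal GPU keeps up with roughly 2x headroom

print(f"CPU real-time factor: {cpu_rtf:.2f}")
print(f"GPU real-time factor: {gpu_rtf:.2f}")
```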
NVIDIA announces TensorRT
Along with the new GPUs, NVIDIA announced the TensorRT library to exploit this 4x performance opportunity from utilizing INT8 arithmetic while also providing a transparent migration path to the P40 and P4 inference-optimized GPUs. To allay concerns that the automatic conversion of a neural network trained using 64-bit, 32-bit, or 16-bit precision arithmetic might change its accuracy, the TensorRT library has the ability to run the INT8-migrated network on a data set (say, the training set) so the user can evaluate any changes in accuracy or behavior.

NVIDIA benchmarks showing speedup using TensorRT on Pascal GPUs
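The accuracy check described above can be sketched in a few lines. This is not the TensorRT API – the logits here are synthetic stand-ins – but it shows the kind of comparison a user would run between the original network and its INT8-migrated version:

```python
import numpy as np

def top1_agreement(ref_logits, test_logits):
    """Fraction of samples where the INT8 network predicts the same class as the FP32 original."""
    return float(np.mean(np.argmax(ref_logits, axis=1) == np.argmax(test_logits, axis=1)))

# Synthetic stand-ins for network outputs; in practice these would come from running
# the original FP32 network and the INT8-migrated network on the same data set.
rng = np.random.default_rng(0)
fp32_logits = rng.standard_normal((1000, 10)).astype(np.float32)
int8_logits = fp32_logits + 0.01 * rng.standard_normal((1000, 10)).astype(np.float32)

print("Top-1 agreement between FP32 and INT8 outputs:", top1_agreement(fp32_logits, int8_logits))
```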
The New GPUs
The NVIDIA P4 is a half-height card designed to fit into high-density “scale out” server systems. Each card delivers 5.5 TeraFLOPS of peak single-precision performance and 22 TOPS (Trillion OP/s) of peak INT8 performance. The following images show the size of a P4 card relative to a pencil as well as a Tesla P4 inside a high-density tray.

NVIDIA P4 size relative to a pencil (Source: NVIDIA)

An NVIDIA P4 in a high-density enclosure
The NVIDIA Tesla P40 is designed for the highest throughput in “scale up” servers. The card delivers 12 TeraFLOPS of peak single-precision performance and 48 TOPS of peak INT8 performance.

Eight NVIDIA P40 GPUs in a node
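Note that for both cards the peak INT8 figure is exactly 4x the peak single-precision figure, consistent with the packing argument above; a minimal check using only the numbers quoted in this article:

```python
# Peak figures quoted above: single-precision TFLOPS vs. INT8 TOPS.
cards = {"P4": (5.5, 22.0), "P40": (12.0, 48.0)}

for name, (fp32_tflops, int8_tops) in cards.items():
    print(f"{name}: INT8/FP32 ratio = {int8_tops / fp32_tflops:.1f}x")  # 4.0x for both
```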
Overall, NVIDIA claims the P4 and P40 GPUs are significantly faster and more efficient than both CPUs and Field Programmable Gate Arrays (FPGAs).

AlexNet, batch size = 128, CPU: Intel E5-2690v4 using Intel MKL 2017, FPGA: Arria10-115
The NVIDIA stock price is currently trading at a historical maximum (as of 9/15/16), so investors evidently like the moves NVIDIA is making into the data center. The next question is how quickly the new inference-optimized GPUs will be adopted in the data center.
[1] http://intelstudios.edgesuite.net/idf/2016/sf/keynote/160817_db/160817_db.html
[2] http://www.slideshare.net/kleinerperkins/2016-internet-trends-report
[3] http://www.slideshare.net/kleinerperkins/2016-internet-trends-report