TechEnablement

NVIDIA Moves Deeper into the Data Center with the P4 and P40 Inference GPUs

September 15, 2016 by Rob Farber

NVIDIA announced the P4 and P40 GPUs, which are optimized for machine learning inference performance and efficiency. The products are positioned to gain market share for NVIDIA in the data center. (Inference, sometimes called scoring or prediction, uses a trained machine learning algorithm to perform a difficult or valuable task such as recognizing a picture, predicting an outcome, processing a signal, categorizing data, or working inside a data analytic framework.) Inference tends to be the money-making machine learning workload in the data center.

According to both Intel and NVIDIA, the next 3–4 years will see a significant increase in the use of inference in the data center. This will likely cause a retooling as search providers, data centers, and cloud providers acquire hardware to keep pace with rapidly escalating machine learning inference workloads. Current estimates indicate that only 7% of existing servers are dedicated to machine learning (source: Intel). Ian Buck (NVIDIA, VP Accelerated Computing) states, "NVIDIA believes that in the near future every piece of data in the data center will be interacted with by AI". Diane Bryant (Intel, EVP and GM of the Data Center Group) remarks, "by 2020 servers will run data analytics more than any other workload" [1].

“In the near future every piece of data in the data center will be interacted with by AI” – Ian Buck (VP Accelerated Computing, NVIDIA)

“By 2020 servers will run data analytics more than any other workload” – Diane Bryant

The problem, Ian Buck points out, "is that AI is out-innovating the CPU roadmap by up to 10x a year in some cases."


Ian Buck (VP Accelerated Computing, NVIDIA)

"AI is out-innovating the CPU roadmap by up to 10x a year in some cases" – Ian Buck

This has created a market opportunity for NVIDIA, which explains why they have so aggressively worked to improve their performance on machine learning workloads. The P40 and P4 GPU announcements continue the process to establish NVIDIA ever more firmly in the data center and place them in a position of leadership for the expected retooling in the data center that will occur over the next 3-4 years.


Speedups on NVIDIA hardware. Training: relative to a Kepler GPU in 2013 using Caffe. Inference: images/sec/watt relative to an Intel E5-2697v4 CPU using AlexNet. (Source: NVIDIA)

Arithmetic precision

Precision is currently a hot topic because Artificial Neural Networks (ANNs) – the basis for many machine learning, data analytic, and deep learning algorithms – are roughly based on biological models of the brain. From a computational perspective, biological brains operate in a very low-precision and noisy environment. For this reason, computer scientists are investigating the efficacy of using low-precision 16-bit floating-point (FP16) arithmetic and even 8-bit (single-byte) integer arithmetic (INT8) for inference.

The speedups can be quite significant. Very simply, using single-byte (INT8) arithmetic means that appropriately designed hardware can perform 4x the number of arithmetic operations per unit time compared to single-precision (32-bit) arithmetic. Similarly, the memory subsystem can become 4x more efficient, as four times as many sequential data values can be transferred per cache-line memory transaction.
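A common way to map FP32 values onto single-byte integers is symmetric linear quantization: scale the tensor so its largest magnitude lands at 127, then round. The sketch below illustrates the idea (and the 4x storage saving) in NumPy; it is a minimal illustration of the general technique, not NVIDIA's implementation.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an FP32 tensor to INT8.

    Maps the largest absolute value in x to 127 and scales the rest
    linearly; returns the INT8 tensor plus the scale factor needed
    to recover approximate FP32 values.
    """
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 tensor."""
    return q.astype(np.float32) * scale

# A toy weight tensor: quantize, dequantize, and measure the error.
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()

# INT8 storage is one quarter of FP32 storage ...
assert q.nbytes * 4 == w.nbytes
# ... and the worst-case rounding error is half a quantization step.
assert max_err <= s / 2 + 1e-6
```

The same byte-width argument drives the 4x claims above: four INT8 values occupy one FP32 slot, both in vector registers and in each cache line.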

To understand the importance of low-precision arithmetic, consider the explosive growth of speech recognition in the data center.


Source: 2016 Internet Trends Report [2] from Kleiner Perkins Caufield & Byers [3]

The following shows the performance of one of the new NVIDIA P40 GPUs when running a speech recognition benchmark. The key feature of this benchmark is that the GPU needs less compute time than the duration of the speech itself, so it can keep up with live audio, while the CPU cannot.


NVIDIA benchmark results on the P40 speech recognition benchmark. Deep Speech 2 inference performance on a 16-user server | CPU: 170 ms of estimated compute time required for each 100 ms of speech sample | Pascal GPU: 51 ms of compute required for each 100 ms of speech sample. (Source: NVIDIA)
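The benchmark numbers above translate directly into a real-time factor (compute time per unit of audio), which is the standard way to judge whether a speech system can keep up with a live stream. A quick check using the figures quoted in the caption:

```python
def real_time_factor(compute_ms, audio_ms):
    """Compute time required per unit of audio.

    Values below 1.0 mean the system processes speech faster than
    it arrives and can keep up with a live stream.
    """
    return compute_ms / audio_ms

# Figures from the NVIDIA benchmark above: ms of compute per 100 ms of speech.
cpu_rtf = real_time_factor(170.0, 100.0)  # 1.7 -> falls behind live audio
gpu_rtf = real_time_factor(51.0, 100.0)   # 0.51 -> keeps up, with headroom
assert cpu_rtf > 1.0 and gpu_rtf < 1.0
```

By this measure the CPU configuration falls progressively behind a live speaker, while the P40 finishes each 100 ms of speech in roughly half real time.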

NVIDIA announces TensorRT

Along with the new GPUs, NVIDIA announced the TensorRT library to exploit this 4x performance opportunity from INT8 arithmetic while also providing a transparent migration path to the P40 and P4 inference-optimized GPUs. To allay concerns that automatically converting a neural network trained with 64-bit, 32-bit, or 16-bit precision arithmetic might degrade its accuracy, the TensorRT library can run the INT8-migrated network on a data set (say, the training set) so the user can evaluate any changes in accuracy or behavior.
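Conceptually, that validation step is a side-by-side accuracy comparison of the original and the INT8-migrated network on the same labeled data. The sketch below illustrates the workflow only; `run_fp32` and `run_int8` are hypothetical stand-ins for whatever inference entry points the library exposes, not actual TensorRT API calls.

```python
def compare_accuracy(samples, labels, run_fp32, run_int8):
    """Evaluate an FP32 network and its INT8 migration on the same
    labeled data set and report both accuracies.

    run_fp32 / run_int8 are caller-supplied inference callables
    (hypothetical stand-ins for the real library entry points).
    """
    fp32_correct = sum(run_fp32(s) == y for s, y in zip(samples, labels))
    int8_correct = sum(run_int8(s) == y for s, y in zip(samples, labels))
    n = len(samples)
    return fp32_correct / n, int8_correct / n

# Toy usage with stub classifiers standing in for real networks.
samples = list(range(10))
labels = [s % 2 for s in samples]
fp32_acc, int8_acc = compare_accuracy(
    samples, labels,
    run_fp32=lambda s: s % 2,                             # stub: always right
    run_int8=lambda s: (s % 2) if s < 9 else 1 - (s % 2),  # stub: one error
)
# The user then judges whether the accuracy gap (here 100% vs 90%)
# is acceptable for the 4x INT8 speedup.
```

The point of the workflow is that the accuracy cost of INT8 migration becomes a measured quantity rather than a leap of faith.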


NVIDIA benchmarks showing speedup using TensorRT on Pascal GPUs

The New GPUs

The NVIDIA P4 is a half-height card designed to fit into high-density "scale out" server systems. Each card delivers 5.5 TeraFLOP/s of peak single-precision performance and 22 TOPS (trillion operations per second) of peak INT8 performance. The following images show the size of a P4 card relative to a pencil and a Tesla P4 inside a high-density tray.


NVIDIA P4 size relative to a pencil (Source: NVIDIA)


An NVIDIA P4 in a high-density enclosure

The NVIDIA Tesla P40 is designed for the highest throughput in "scale up" servers. The card delivers 12 TeraFLOP/s of peak single-precision performance and 48 TOPS of peak INT8 performance.
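Note that for both cards the quoted INT8 figure is exactly four times the single-precision figure, matching the byte-width argument from the precision section. A quick sanity check on the published numbers:

```python
cards = {
    # card: (peak FP32 TeraFLOP/s, peak INT8 TOPS), as quoted in this article
    "P4":  (5.5, 22.0),
    "P40": (12.0, 48.0),
}

for name, (fp32_tflops, int8_tops) in cards.items():
    ratio = int8_tops / fp32_tflops
    # One-byte INT8 operands yield 4x the operations of 4-byte FP32 operands.
    assert ratio == 4.0, (name, ratio)
```

In other words, the INT8 peak is not an independent specification: it follows from packing four one-byte operands where one four-byte operand would go.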


Eight NVIDIA P40 GPUs in a node

Overall, NVIDIA claims the P4 and P40 GPUs are both faster and more power-efficient than both CPUs and Field Programmable Gate Arrays (FPGAs).


Benchmark configuration: AlexNet, batch size = 128; CPU: Intel E5-2690v4 using Intel MKL 2017; FPGA: Arria 10-115. (Source: NVIDIA)

The NVIDIA stock price is currently trading at an all-time high (as of 9/15/16), so investors like the moves NVIDIA is making into the data center. The next question is the rate of uptake of the new inference-optimized GPUs in the data center.

[1] http://intelstudios.edgesuite.net/idf/2016/sf/keynote/160817_db/160817_db.html

[2] http://www.slideshare.net/kleinerperkins/2016-internet-trends-report

[3] http://www.slideshare.net/kleinerperkins/2016-internet-trends-report
