
Deep-learning Teaching Code Achieves 13 PF/s on the ORNL Titan Supercomputer

April 18, 2014 by Rob Farber

The deep-learning teaching code described in my book, "CUDA Application Design and Development" [Chapters 2, 3, and 9], plus my online tutorials, achieved 13 PF/s average sustained performance using 16,384 GPUs on the Oak Ridge Titan supercomputer. Full source code for the teaching code can be found on GitHub in the farbopt directory.

Nicole Hemsoth at HPCwire noted these CUDA-based results and the potential to achieve 20 PF/s using OpenACC:

On the code front, OpenACC was a hot topic among the HPC set. Rob Farber did an excellent job of highlighting some of the key trends in programming and optimizing for GPUs at large scale. He presented on new results that extend machine learning and big data analysis to 13 petaflops average sustained performance across 16,384 GPUs on Titan—a very popular topic.

As you can see in the slide below from my GTC 2014 presentation "S4178 Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and more", the MPI scaling is nearly linear, meaning we achieve close to a 16,000-times speedup when using 16,384 GPUs.

[Slide 3 from GTC 2014 presentation S4178: near-linear scaling to 16,384 GPUs on Titan]

My online tutorials showed it is possible to get close to 1 TF/s average sustained performance on a single device for linear problems (labeled PCA below) and an excellent 600-800 GF/s on a nonlinear machine-learning problem (labeled NLPCA below).

[Figures: PCA and NLPCA single-device performance from GTC 2013 talks]

Update: The CUDA 6.5 release boosted this performance by hundreds of GF/s. We look forward to re-running on the ORNL Titan.

[Figure: CUDA 6.5 significantly boosts performance on nonlinear (NLPCA) machine-learning and deep-learning codes]

[Figure: CUDA 6.5 significantly boosts performance on PCA machine-learning and deep-learning codes]

As the inset in the previous slide shows, performance variation (shown by the red lines) across the 16k nodes is minimal, so the average sustained performance (an important measure) remains high.

[Figure: Titan shows minimal performance variation across nodes]

This near-linear scaling means that leadership-class supercomputers like Titan can deliver substantial, near-peak performance. My tutorial, "Numerical and Computational Optimization on the Intel Phi", shows how to achieve 2.2 PF/s average sustained performance on the TACC Stampede supercomputer using MPI and 3,000 Intel Xeon Phi coprocessors. Full source code is provided in the tutorial. The ORNL run utilized a slightly modified version of the code in my Intel Xeon Phi tutorial.

The near-linear scaling is achieved through the SIMD mapping I created in the 1980s.

[Figure: Farber high-performance and scalable deep-learning mapping]
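To make the mapping concrete, here is a minimal sketch in C with MPI (the names objFunc, Shard, and model are hypothetical, not taken from farbopt): the optimizer's trial parameters are broadcast to every rank, each rank computes a partial sum of squared errors over its own shard of the training data (on Titan that inner loop would run as a CUDA kernel on the rank's GPU), and a single MPI_Allreduce combines the partial errors so every rank sees the same objective value.

```c
#include <mpi.h>

/* Hypothetical per-rank state: each MPI process owns one shard of the
 * training examples. */
typedef struct {
    int     nExamples;  /* examples in this rank's shard */
    double *input;      /* one input value per example (simplified) */
    double *known;      /* target outputs */
} Shard;

/* Assumed model evaluation; on Titan this work would be a CUDA kernel. */
extern double model(const double *param, double x);

/* Farber-style SIMD mapping: every rank applies the SAME parameters to
 * DIFFERENT data, then one collective sums the partial errors. */
double objFunc(double *param, int nParam, const Shard *s)
{
    /* Rank 0 runs the optimizer; broadcast its trial parameters. */
    MPI_Bcast(param, nParam, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double partialErr = 0.0;
    for (int i = 0; i < s->nExamples; ++i) {
        double diff = model(param, s->input[i]) - s->known[i];
        partialErr += diff * diff;  /* sum of squared errors */
    }

    double totalErr = 0.0;
    MPI_Allreduce(&partialErr, &totalErr, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
    return totalErr;  /* identical on every rank */
}
```

Because the only communication per objective evaluation is one broadcast and one reduction, the time per evaluation stays nearly flat as ranks are added, which is what produces the near-linear speedup in the slide above.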

The farbopt teaching code utilizes the freely available nlopt numerical optimization package as the "Optimization Method" indicated in the previous picture.
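For readers new to nlopt, here is a minimal sketch of its C API in the "Optimization Method" role (the toy quadratic objective, parameter count, and tolerance are hypothetical stand-ins; farbopt's actual setup may differ):

```c
#include <nlopt.h>
#include <stdio.h>

/* Objective in the form nlopt expects; grad is NULL for derivative-free
 * algorithms such as COBYLA.  A real run would forward to the parallel
 * MPI/GPU objective sketched earlier; a toy quadratic stands in here. */
static double objWrapper(unsigned n, const double *x, double *grad, void *data)
{
    (void)grad; (void)data;
    double err = 0.0;
    for (unsigned i = 0; i < n; ++i)
        err += (x[i] - 1.0) * (x[i] - 1.0);
    return err;
}

int main(void)
{
    enum { N = 4 };                        /* hypothetical parameter count */
    double x[N] = { 0.0, 0.0, 0.0, 0.0 };  /* starting guess */
    double minErr;

    nlopt_opt opt = nlopt_create(NLOPT_LN_COBYLA, N);  /* derivative-free */
    nlopt_set_min_objective(opt, objWrapper, NULL);
    nlopt_set_xtol_rel(opt, 1e-6);         /* relative stopping tolerance */

    if (nlopt_optimize(opt, x, &minErr) < 0)
        fprintf(stderr, "nlopt failed\n");
    else
        printf("minimum error %g\n", minErr);

    nlopt_destroy(opt);
    return 0;
}
```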


This teaching code can also be used to train deep-learning systems to perform a task.

The use of MPI is mandatory, as each Titan XK7 node contains a single K20X GPU that is programmed using one GPU per MPI process, as shown below.

[Figure: Titan XK7 compute node]
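As a minimal sketch of that one-GPU-per-rank binding (assumed code, not farbopt's; on Titan every XK7 node exposes exactly one device, so the modulo is trivial there but generalizes to multi-GPU nodes when ranks are packed node by node):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Bind this MPI process to one GPU.  On a Titan XK7 node,
     * nDevices == 1, so every rank simply gets device 0. */
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    cudaSetDevice(rank % nDevices);

    /* ... allocate device data, evaluate the objective, optimize ... */

    MPI_Finalize();
    return 0;
}
```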

Additional system information can be found in the Cray XK7 brochure.

More information about programming the ORNL Titan supercomputer can be found in the accelerator user guide.

Click here for more TechEnablement machine-learning articles and tutorials! 

