
Part 1: Load-Balanced, Strong-Scaling Task-Based Parallelism on GPUs

July 9, 2014 by Rob Farber

Achieve a 7.4x speedup with 8 GPUs over the performance of a single GPU through the use of task-based parallelism and concurrent kernels! Traditional GPU programming typically views the GPU as a monolithic device that runs a single parallel kernel across the entire device. This approach is fantastic when one kernel can provide enough work to keep the GPU busy. The conundrum is that newer GPUs require ever larger computational problems to keep them fully occupied. The solution is to treat each Streaming Multiprocessor (SMX) inside the GPU as a separate processing core (similar to a single core of a multicore x86 or ARM processor) that runs its own task list composed of smaller concurrent kernels. Host-based, dynamically scheduled OpenMP loops can then very simply and effectively queue these concurrent task lists as GPU SMX processors become available. This mini-tutorial is the first in a series of TechEnablement.com tutorials about GPU task-based parallelism and how to achieve strong scaling across multiple GPUs as well as across the many SMX processors internal to each GPU.
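To make the idea concrete, here is a minimal sketch of concurrent kernels driven from per-SMX streams. The doTask kernel, buffer sizes, and use of device 0 are illustrative assumptions, not code from the application discussed below:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical small kernel standing in for one task.
__global__ void doTask(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int nStreams = prop.multiProcessorCount; // roughly one stream per SMX

  std::vector<cudaStream_t> streams(nStreams);
  std::vector<float*> d_data(nStreams);
  const int n = 1024; // small tasks so several kernels can run concurrently

  for (int i = 0; i < nStreams; i++) {
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&d_data[i], n * sizeof(float));
    cudaMemset(d_data[i], 0, n * sizeof(float));
  }

  // Asynchronously queue one small kernel per stream; none of these
  // launches block the host, so the streams fill with pending work.
  for (int i = 0; i < nStreams; i++)
    doTask<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_data[i], n);

  cudaDeviceSynchronize(); // wait for all queued kernels to finish

  for (int i = 0; i < nStreams; i++) {
    cudaFree(d_data[i]);
    cudaStreamDestroy(streams[i]);
  }
  return 0;
}

Whether the kernels actually overlap depends on the device's resources, but the essential point is that the host never waits between launches.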

The following timeline from nvvp, the NVIDIA Visual Profiler, demonstrates how a single concurrent-kernel, task-based parallel application can keep multiple GPUs busy in a workstation (in this case an NVIDIA K40 and a K20 GPU) even when the GPUs have widely different performance capabilities. The various colored lines represent different parallel kernels. Note that the length of each line represents the runtime of that kernel, which can vary in this problem from microseconds to tens of seconds depending on the input data. Overall utilization of both GPUs stays in the 95%–100% range throughout the application's lifespan, as reported by the nvidia-smi (NVIDIA System Management Interface) tool.

(Figure: Cold Spring Harbor K40C timeline)

The logic for the load-balancing loop is very simple, as shown in the following code snippet. The schedule(dynamic,1) clause on the OpenMP pragma performs the load balancing: iterations are handed out one at a time, so the loop is never held up by one or more slow threads. The key is that each task merely signals its work onto the queue and returns, so more kernels can be asynchronously queued on a CUDA stream by the same thread. As a result, all the CUDA stream queues (generally one per SMX) are kept fully loaded with kernels waiting to run.

Load-balancing loop for task-based parallelism:

double startTime = omp_get_wtime();

// Dynamically hand out one protein (task) at a time to the next free host thread.
#pragma omp parallel for schedule(dynamic,1)
for (int i = 0; i < bootstrap.get_n_proteins(); i++) {
   int tid = omp_get_thread_num();
   // Queue the work for task i on this thread's worker; returns without blocking.
   bootstrap.calcBootstrap(worker[tid], tid, i);
}

// All tasks are queued; now wait for every worker's CUDA stream to drain.
for (int i = 0; i < worker.size(); i++) {
   cudaStream_t s = worker[i].setStream();
   cudaStreamSynchronize(s);
}
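The snippet works because nothing inside the parallel loop blocks: each call only enqueues work and returns, so a fast thread immediately picks up the next protein. Below is a minimal sketch of what one such task might look like; the Worker struct, enqueueTask() method, and bootstrapKernel are illustrative assumptions, since the article does not show the body of calcBootstrap():

#include <cuda_runtime.h>

// Hypothetical stand-in kernel for one bootstrap task.
__global__ void bootstrapKernel(float *d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += 1.0f;
}

struct Worker {
  cudaStream_t stream; // one stream per worker (roughly one per SMX)
  float *d_data;       // device buffer owned by this worker
  int n;

  // Enqueue one task without blocking: an async copy plus an async launch.
  // (h_data should be pinned via cudaMallocHost for the copy to be truly
  // asynchronous with respect to the host.)
  void enqueueTask(const float *h_data) {
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    bootstrapKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    // Deliberately no cudaStreamSynchronize() here: the host synchronizes
    // once, after the parallel loop, as in the snippet above.
  }
};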

The initial code was developed and tested on an HP Z800 workstation containing three GPUs: a Tesla K40C, a Tesla K20C, and a GeForce GT 640.

The 8-way GPU results showing a 7.4x speedup with 8 GPUs were observed on a Cirrascale BladeRack 2 XL blade server configured so that all eight GPUs appear in a single node (i.e., no MPI was involved, only CUDA calls).
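Since only CUDA calls are involved, all eight GPUs can be driven from a single process. A common pattern (a sketch of the general technique, not necessarily the exact code used for these results) binds one OpenMP host thread to each device with cudaSetDevice():

#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

int main()
{
  int nDevices = 0;
  cudaGetDeviceCount(&nDevices);

  // One host thread per GPU; each thread binds to its own device, so the
  // kernels it queues run on that GPU while the other GPUs work in parallel.
  #pragma omp parallel num_threads(nDevices)
  {
    int dev = omp_get_thread_num();
    cudaSetDevice(dev);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // ... queue this device's share of the task list on stream s ...
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    printf("device %d finished its tasks\n", dev);
  }
  return 0;
}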

(Image: Cirrascale BladeRack 2 XL)

Application

The scaling results and timeline are from a statistical bootstrapping code written by TechEnablement.com and BlackDog Endeavors, LLC in collaboration with Dr. John Wilson of Cold Spring Harbor Laboratory. This production code will be incorporated into an online interactive bootstrapping workflow that will be generally accessible via the Internet in the near future.

A quick synopsis of this work was presented in my GTC 2014 talk “S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and more” [pdf][video]. The two pertinent slides follow:

(Slides 27 and 28 from the GTC 2014 talk)

For more information, please see the ASMS 2014 poster “A GPU-powered, massively parallel nonparametric statistics server for analysis and exploration of large-scale quantitative data between and across quantitative experiments” by John P. Wilson, Robert M. Farber, and Darryl J.C. Pappin.


Following is the poster introduction:

We often employ proteomics in the quest to understand two or more states: what makes this healthy and that diseased? Quantitative techniques such as iTRAQ and SILAC are powerful tools in this pursuit. However, their results immediately beg the question “What’s different?”, making determination of “different” between states essential to distilling meaning from quantitative MS data. Arguably, the most helpful output is some measure of certainty that the variable of study actually caused the difference observed. We have developed a novel experimental and analytical methodology which employs exclusively nonparametric statistics to determine p-values and fold-changes (including their uncertainties with confidence intervals) for observed differences as a function of biological variation, number of samples and observations, and measurement error (most crucially accounting for increased variance at low or high MS/MS spectral intensities). Our workflow uses no parametric modeling or parametric simplifications, accepts data from all MS quantification techniques, can combine across experiments with ≥1 common control channel, and is accessible through an interactive, real-time web interface.
