The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) is working to educate users about how best to use its computing resources. As part of that effort, the OLCF has published two introductory tutorials on running concurrent kernels on its systems. Part 1 (concurrent kernels) and Part 2 (batched library calls) show how to launch concurrent kernels using CUDA and OpenACC with C and Fortran.
TechEnablement readers may also find our own tutorials useful. They discuss key aspects of task-based parallelism, including:
- Why task parallelism can be a more efficient approach than pure loop parallelism for a multitude of reasons, including memory consumption.
- How to load-balance parallel tasks across a multitude of devices with a single OpenMP `schedule(dynamic)` loop.
- Demonstrations that task-parallelism can achieve strong scaling both within a GPU and across a number of devices (a 7.4x speedup was achieved in a single computational node containing eight GPUs).
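To illustrate the load-balancing idea in the list above, here is a minimal sketch of an OpenMP loop with a dynamic schedule, written in C++. It is not taken from the linked articles: `run_task` and `run_all_tasks` are hypothetical names, and the task body is CPU-only busy work standing in for whatever a real application would launch on a device. With `schedule(dynamic, 1)`, each thread picks up the next unclaimed task as soon as it finishes its current one, so uneven task sizes do not leave threads idle.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one task. The cost varies with the task index
// so that the tasks are deliberately uneven in size.
static long run_task(int task_id) {
    long acc = 0;
    const int work = (task_id % 4 + 1) * 1000;
    for (int i = 0; i < work; ++i) {
        acc += i % 7;
    }
    return acc;
}

// Runs every task and returns one result per task.
// schedule(dynamic, 1) hands out iterations one at a time, on demand,
// instead of splitting the loop into fixed chunks up front.
std::vector<long> run_all_tasks(int num_tasks) {
    assert(num_tasks >= 0);
    std::vector<long> results(num_tasks, 0);
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < num_tasks; ++t) {
        results[t] = run_task(t);  // each iteration writes only its own slot
    }
    return results;
}
```

Calling `run_all_tasks(32)` from a `main` function and compiling with OpenMP enabled (for example `-fopenmp` with GCC) runs the tasks across the available threads. Without that flag the pragma is ignored and the loop simply runs serially with the same results.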
Please see our articles:
- Part 1: Load-Balanced, Strong-Scaling Task-Based Parallelism on GPUs
- Part 2: No Idle Time CUDA Task Parallelism Across Eight GPUs
The OLCF has other introductory articles at https://www.olcf.ornl.gov/support/tutorials/.