
The CUDA Thrust API Now Supports Streams and Concurrent Tasks

May 7, 2014 by Rob Farber

The CUDA Thrust API now supports streams and concurrent kernels through a new API called Bulk, created by Jared Hoberock at NVIDIA. Bulk is designed to extend the parallel execution policies described in the evolving Technical Specification for C++ Extensions for Parallelism (N3960). Note that Bulk is not part of the CUDA 6.0 distribution and must be downloaded from https://github.com/jaredhoberock/bulk.

Thrust Bulk

Bulk leverages Hyper-Q and CUDA streams to run concurrent tasks on the GPU. It lets the programmer describe a parallel task (e.g., sort, for_each, reduction, etc.) as a hierarchical grouping of execution agents.

The big news is that Bulk provides concurrent kernel execution without requiring the programmer to:

  • Specify a launch configuration
  • Decompose the problem into sub-tasks
  • Marshal parameters

CUDA programmers will appreciate that bulk::par() can take an existing CUDA stream as a parameter to indicate that any CUDA kernels it launches should be enqueued on that stream. The async_reduce.cu example uses this capability. (Note the use of par(s1,1) in the async call below.)

bulk async_reduce.cu example (C++):
#include <iostream>
#include <cstdio>
#include <cassert>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <bulk/bulk.hpp>

struct reduce_kernel
{
  template<typename Iterator, typename Pointer>
  __device__ void operator()(volatile bool *wait_for_me, Iterator first, Iterator last, Pointer result)
  {
    // spin until the greenlight kernel, running in a different stream, sets the flag
    while(!*wait_for_me)
    {
      printf("waiting...\n");
    }

    *result = thrust::reduce(thrust::device, first, last);
  }
};


struct greenlight
{
  __device__ void operator()(bool *set_me)
  {
    *set_me = true;
  }
};


int main()
{
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  using bulk::par;
  using bulk::async;

  thrust::device_vector<int> vec(1 << 20);
  thrust::sequence(vec.begin(), vec.end());

  thrust::device_vector<int> result(1);
  thrust::device_vector<bool> flag(1);

  // note we launch the reduction before the greenlight;
  // the two kernels run concurrently on separate streams, so the
  // reduction can proceed as soon as the flag is set
  async(par(s1,1), reduce_kernel(), thrust::raw_pointer_cast(flag.data()), vec.begin(), vec.end(), result.begin());

  async(par(s2,1), greenlight(), thrust::raw_pointer_cast(flag.data()));

  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);

  std::cout << "result: " << thrust::reduce(vec.begin(), vec.end()) << std::endl;
  std::cout << "asynchronous result: " << result[0] << std::endl;

  assert(thrust::reduce(vec.begin(), vec.end()) == result[0]);

  return 0;
}

Individually, Bulk agents are like tiny, lightweight task-parallel threads, but when grouped together they have the potential to fully utilize the GPU. Parallel groups of agents are launched with bulk::async.

One can think of Bulk execution agents as logical lightweight threads, but in reality they may not run as separate physical threads. Think instead in terms of a grainsize parameter: a quantum of work that controls how much sequential work an individual execution agent performs. Grainsize is therefore orthogonal to warp size or vector width; it is the “vertical” dimension of the decomposition.

Bulk uses a launch-configuration heuristic, bulk::par(n), to configure the kernel launch automatically, choosing the largest block size that maximizes occupancy.
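For a sense of what this looks like in practice, here is a minimal hello-world style launch, sketched after the examples in the Bulk repository. Only bulk::async and bulk::par(n) are used exactly as in the reduction example above; the bulk::agent<> parameter, the bulk::root.this_exec placeholder, and the index() call follow the repository's examples as I understand them and should be treated as assumptions if the API has evolved.

#include <cstdio>
#include <bulk/bulk.hpp>

// each execution agent reports its own index; no launch configuration,
// problem decomposition, or parameter marshaling is written by hand
struct hello
{
  __device__ void operator()(bulk::agent<> &self)
  {
    printf("Hello from agent %d\n", (int)self.index());
  }
};

int main()
{
  // request 32 agents; Bulk's heuristic chooses the actual kernel configuration
  bulk::async(bulk::par(32), hello(), bulk::root.this_exec);

  cudaDeviceSynchronize();
  return 0;
}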

Jared explains this mapping to the hardware in the following way:

You can describe most kinds of processors via some sort of hierarchy. CPUs are decomposed into cores, each of which have some vector lanes. You could describe the hierarchy of NVIDIA GPUs as a set of cores, each of which have some warps, each of which have some vector lanes. Putting multiple GPUs into the machine might add an additional level. So there’s some actual hardware that corresponds to the physical machine, and sometimes it’s helpful to describe it as a hierarchy.

The physical hierarchy varies from machine to machine. For example, GPU A might have P cores, and GPU B might have Q cores. As a programmer, you might want to ignore this minor difference between GPU A & GPU B because it can be burdensome. You might instead wish to think about *all* GPUs as having N cores. Maybe you want N itself to vary depending on the task the program is executing that instant. This is what is meant by a virtual hierarchy – it’s just a decomposition of the processor into logical components, in the same way that CUDA thread blocks can be thought of as logical multiprocessors.

 There are still other cases where the programmer doesn’t want to think about configuring any of this at all, and that’s where Bulk’s automatic launch configuration comes in.

Bulk also provides a performance enhancement: shared dynamic memory allocations for a group are performed as a single bulk allocation via a call to bulk::malloc().

In my own tests of cudaMalloc, as noted in my GTC 2014 presentation “Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and more” [video][pdf] and in my Dr. Dobb's tutorial “A Massively Parallel Stack”, allocating a single block holding a million objects takes 0.00001766 seconds, while individually allocating a million objects with cudaMalloc takes 7.86 seconds. As you can see, using bulk::malloc() can yield an appreciable performance benefit for concurrent Thrust applications.
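As a rough host-side illustration of that allocation-granularity gap, the sketch below times one large cudaMalloc against a million individual cudaMalloc calls. The object type (int), the exact count (1 << 20), and the std::chrono timing are my own assumptions for illustration, not the benchmark from the presentation.

#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
  const int N = 1 << 20;  // roughly one million objects (assumed count)

  // strategy 1: a single allocation that holds all N objects
  int *block = nullptr;
  auto t0 = std::chrono::high_resolution_clock::now();
  cudaMalloc(&block, N * sizeof(int));
  auto t1 = std::chrono::high_resolution_clock::now();
  printf("single block allocation: %f s\n",
         std::chrono::duration<double>(t1 - t0).count());
  cudaFree(block);

  // strategy 2: N separate allocations, one per object
  std::vector<int*> ptrs(N);
  t0 = std::chrono::high_resolution_clock::now();
  for(int i = 0; i < N; ++i)
    cudaMalloc(&ptrs[i], sizeof(int));
  t1 = std::chrono::high_resolution_clock::now();
  printf("per-object allocations:  %f s\n",
         std::chrono::duration<double>(t1 - t0).count());

  for(int i = 0; i < N; ++i)
    cudaFree(ptrs[i]);

  return 0;
}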

For more information about Bulk, check out Jared’s GTC 2014 presentation, “Building Parallel Algorithms with Bulk”.

 

 
