Today at the 2016 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the first of a new line of Pascal architecture GPUs: the Tesla P100. Jen-Hsun noted that the P100 GPUs are in volume production now, but the initial production runs have already been purchased, so customers should expect their first access to P100 GPUs to be in the cloud. While Jen-Hsun’s keynote focused on improvements in machine learning performance, other NVIDIA speakers fleshed out the implications of the Pascal improvements for HPC in later GTC presentations. Succinctly, Marc Hamilton (VP, Solutions Architecture and Engineering at NVIDIA) noted that the P100 GPUs “will perform well on other HPC applications.”
Mark Harris (Chief Technologist, GPU Computing Software, NVIDIA) wrote a very detailed blog about the P100 specifications, “Inside Pascal: NVIDIA’s Newest Computing Platform”. His presentation with Lars Nyland (Senior Architect, NVIDIA), titled “S6176 – Deep Dive into NVIDIA’s Latest Architectures”, provided much more detail about why Pascal is such a good accelerator for commercial and scientific HPC. (The slides can be found here.)
A key aspect of the Pascal architecture is that it is preemptible and has a fully functional MMU (Memory Management Unit). The engineering feat of creating a preemptible virtual-memory computing device with 3584 CUDA cores spread across 56 SMs (Streaming Multiprocessors) is remarkable, but this is not simply an engineering coup: it spells the end of the assumption that “offload mode is required for accelerator programming”.
Pascal GPUs support a 49-bit virtual memory address space, which is sufficient to map the full 48-bit (2.8 x 10^14 bytes) CPU address space plus all of GPU memory, using pages of up to 2MB. NVLink support is also provided, so multiple Pascal GPUs can share this large address space. Once operating system support is in place, programmers will be able to simply allocate space as normal (say, with malloc() and free() operations) and the demand-paged virtual memory will dynamically migrate pages between the GPUs and host memory – no offload mode required. In the meantime, programmers can use CUDA unified memory or explicit data copies.
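The unified-memory path available today looks like the following minimal sketch (the kernel, sizes, and values are illustrative, not from the article): cudaMallocManaged() returns a single pointer that both the CPU and GPU can dereference, and on Pascal the pages migrate on demand rather than requiring explicit cudaMemcpy() calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scale an array in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));    // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;     // CPU writes; no explicit copy needed

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // GPU touches the pages; on Pascal
    cudaDeviceSynchronize();                     // they are demand-paged to the device

    printf("x[0] = %f\n", x[0]);                 // CPU reads the result directly
    cudaFree(x);
    return 0;
}
```

The point of the sketch is what is missing: there is no cudaMemcpy() in either direction, which is exactly the offload boilerplate that demand paging makes optional.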
A new cudaMemAdvise() API will be provided to give hints to the GPU virtual memory system so that it can prefetch data, much like users of the mmap()/madvise() Linux APIs do today. Harris did note that to guarantee maximum performance, programmers should still perform explicit data copies.
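A hedged sketch of how such hints look in practice, using the names that shipped in the CUDA 8 toolkit (cudaMemAdvise() and its companion cudaMemPrefetchAsync(); the helper function and its arguments here are illustrative):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: advise the driver about a managed allocation and
// stage its pages onto a GPU ahead of a kernel launch, analogous to
// madvise() hints and readahead on an mmap()ed file.
void stage_for_gpu(float *data, size_t bytes, int device) {
    // Hint: this region will mostly be read, so the driver may keep
    // read-only copies resident on multiple processors.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

    // Explicitly prefetch the pages to the target GPU on the default
    // stream, avoiding page-fault stalls when the kernel first touches them.
    cudaMemPrefetchAsync(data, bytes, device, 0);
}
```

This mirrors Harris’s caveat: the hints recover most of the performance of explicit copies while keeping the single-pointer programming model.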
Improvements to the Pascal architecture include 64-bit atomic operations and performance increases on existing atomic operations. Further, atomic operations will work across multiple GPUs connected via NVLink. This means that parallel codes can operate in the shared memory space, use atomic counters, and perform other essential parallel operations!
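A minimal sketch of what the new atomics enable (assuming Pascal hardware and the CUDA 8 compiler, which added a native double-precision atomicAdd() and system-scope atomics; the kernels are illustrative):

```cuda
#include <cuda_runtime.h>

// 64-bit floating-point reduction via atomics: the double overload of
// atomicAdd() is executed natively on Pascal rather than emulated with
// a compare-and-swap loop.
__global__ void sum(const double *x, int n, double *total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, x[i]);
}

// System-scope atomic: the _system suffix makes the update coherent
// with the CPU and with peer GPUs sharing the address space, which is
// what allows atomic counters to span NVLink-connected devices.
__global__ void bump_shared_counter(unsigned long long *counter) {
    atomicAdd_system(counter, 1ULL);
}
```

These scoped atomics are what turn the shared 49-bit address space into a genuinely shared-memory parallel programming model rather than just a convenient copy mechanism.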
In short, Pascal unified memory now supports demand-paged virtual memory across multiple GPUs. A collaboration with Red Hat will (hopefully soon) provide the ability to create a single CPU and GPU dynamic memory allocator to end the “accelerators require offload” requirement.
Those running pre-Pascal GPUs (i.e. all of us) can use unified memory without demand-paged hardware support. The TechEnablement article “Inside NVIDIA’s Unified Memory: Multi-GPU Limitations and the Need for a cudaMadvise API Call” discusses that in detail.