CUDA 6.0 Unified Memory offers a “single-pointer-to-data” model that is similar to CUDA’s zero-copy mapped memory. Both make it trivially easy for the programmer to access memory on the CPU or GPU, but applications that use mapped memory incur a PCI bus transfer every time a memory access steps outside a cache line. In contrast, a kernel running in a Unified Memory environment effectively “owns” the memory region, so the runtime only needs to transfer units of memory (called pages) to the device as needed, and possibly back to the host on kernel completion if a page was modified on the GPU.
As can be seen in the following timeline schematic, every mapped, pinned memory access causes a PCI bus transfer (indicated by the arrows). In contrast, the CUDA 6.0 Unified Memory timeline shows one bulk transfer at kernel startup and another at kernel completion. From a performance point of view, CUDA Unified Memory can be a good way to go because:
- Once the data is on the device, no additional transfers are required (unless the data needs to be written back to the host because it was modified on the GPU).
- Transferring larger blocks of data across the PCIe bus is more efficient than making many small transfers.
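The contrast between the two models can be sketched in code. This is an illustrative example, not from the article: the kernel, sizes, and launch configuration are all assumptions, and the point is only where the PCIe traffic occurs in each case.

```cuda
// Sketch: zero-copy mapped memory vs. CUDA 6.0 managed memory.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;

    // Zero-copy: host memory mapped into the device address space.
    // Each out-of-cache-line access by the kernel crosses the PCIe bus.
    float *mapped, *devPtr;
    cudaHostAlloc(&mapped, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devPtr, mapped, 0);
    scale<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();
    cudaFreeHost(mapped);

    // Managed: pages migrate in bulk to the device at kernel launch,
    // and come back to the host only if the GPU modified them.
    float *managed;
    cudaMallocManaged(&managed, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();   // required before the CPU touches managed data
    printf("%f\n", managed[0]);
    cudaFree(managed);
    return 0;
}
```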
[Ed. Note] NVIDIA corrected our post on 7/24/14 – Thank you! – to note that CUDA managed memory does implement a full lazy copy for memory touched on both the host and device. More specifically:
In the most basic case, managed memory is allocated on the device (by cudaMallocManaged). On CPU access, data is copied from device to host. So only those pages that are accessed are copied. On kernel launch, any pages previously copied to host are copied back to the device. (Note, it’s illegal to access managed data from the CPU while the GPU is accessing it. Such CPU accesses will generate a segmentation fault. Appropriate synchronization is required before CPU accesses can be made. The CUDA programming guide covers this in detail.)
So, when cudaMallocManaged() is used to allocate 1GB of managed data, only pages touched by the CPU will be copied to the host from the device. These copies occur at the point when the CPU accesses the pages, not at cudaDeviceSynchronize(). Then, when a kernel is launched, the pages previously copied from device to host are copied back to the device before the kernel runs. So if the CPU touches all 1GB, then yes, 1GB of data will be copied. But if it only touches a few pages, only a few pages will be copied. (In reality the runtime may copy a bit more than is used, for the purpose of speculative prefetching to improve performance.)
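NVIDIA's description can be made concrete with a short sketch. The kernel and launch parameters are illustrative assumptions; the comments mark where the lazy page copies described above would occur.

```cuda
// Sketch of lazy copying under CUDA 6.0 managed memory.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(char *p, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;
}

int main(void)
{
    const size_t n = 1ULL << 30;     // 1 GB of managed data
    char *p;
    cudaMallocManaged(&p, n);        // allocated on the device

    fill<<<4096, 256>>>(p, n);
    cudaDeviceSynchronize();         // mandatory before any CPU access

    // Only the pages the CPU actually touches are copied device-to-host,
    // at the moment of access. Reading one byte migrates roughly one
    // page, not the whole gigabyte.
    printf("%d\n", p[0]);

    // At the next kernel launch, only the pages previously copied to
    // the host are copied back to the device.
    fill<<<4096, 256>>>(p, n);
    cudaDeviceSynchronize();

    cudaFree(p);
    return 0;
}
```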
CUDA Managed Memory and GMAC
[Ed. Note] NVIDIA commented (7/24/14) on the use of GMAC in CUDA Managed Memory.
The GMAC (Global Memory for ACcelerators) library was a source of inspiration and valuable insight for CUDA Managed Memory (GMAC is being developed by the Operating System Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the University of Illinois). However, NVIDIA clarified that, “GMAC is not part of the Unified Memory implementation, and there are some significant differences in programming model.“
It is understandable why NVIDIA has not delved deeper into the differences between GMAC and CUDA Managed Memory, but we do know that full lazy copy works (hopefully through the use of MMU dirty bits!). Presumably other differences leverage NVIDIA hardware for greater performance.
For a more detailed description of GMAC, check out the paper:
“An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems.” I. Gelado, J. Cabezas, N. Navarro, J. E. Stone, S. Patel, and Wen-mei W. Hwu. In Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010), Pittsburgh, USA, March 2010.
Reduce the complexity of your code: GMAC transparently takes care of the consistency and coherency of data used both on the CPU and on the GPU, so your code gets simpler and cleaner. This means you only need one pointer per allocation to access data on both the CPU and the GPU.
The multi-GPU challenge
A challenge with the current implementation of the CUDA 6.0 Unified Memory is that the running kernel “owns” the region of memory. Some form of ownership is necessary to enforce a single-writer model and avoid the multiple writer concurrency problem. Anyone who has worked with a revision control system knows how complicated (and confusing) it can be to merge changes from multiple sources (e.g. the merge problem). With a single writer, there is no ambiguity about how to update a data block.
Exclusive ownership causes problems in a multi-GPU environment (and even in a single GPU + CPU configuration) because it serializes access in a multi-device environment. Say we wish to share a common read-only region of memory among all the GPUs in a system. Exclusive ownership means that only one kernel on one device can run at any given time, and that nice multi-GPU system will only deliver single GPU performance.
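Until concurrent read-only sharing is supported, one workaround is to replicate the read-only data into each GPU's global memory so kernels on all devices can run at once. This is a sketch under assumed conditions (device count capped, illustrative kernel), not a recommendation from the article:

```cuda
// Sketch: replicate read-only data per device to avoid serializing
// on a single owner.
#include <cuda_runtime.h>

__global__ void consume(const float *ro, int n)
{
    // ... read-only work on ro[0..n-1] ...
}

int main(void)
{
    const int n = 1 << 20;
    const int MAX_DEV = 8;
    float *host = new float[n];          // shared read-only source data

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > MAX_DEV) ndev = MAX_DEV;

    float *replica[MAX_DEV];
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&replica[d], n * sizeof(float));
        // Each device gets its own copy, so no exclusive-ownership
        // transfer is needed and all GPUs can compute concurrently.
        cudaMemcpy(replica[d], host, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        consume<<<(n + 255) / 256, 256>>>(replica[d], n);
    }
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(replica[d]);
    }
    delete[] host;
    return 0;
}
```

The obvious cost is memory: every GPU holds a full copy, which is exactly what a shared read-only ownership mode would avoid.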
What is needed, in my opinion, is a cudaMadvise(void *ptr, size_t length, int advice) CUDA API call similar to the madvise() call available in most virtual-memory operating systems. Two possible advice flags (out of many possibilities) are:
- CUDA_MADV_READONLY: The read-only region of memory can be concurrently owned by multiple devices. Exclusive ownership is not required because no device will write to this region of memory.
- CUDA_MADV_READONLY_PAGEABLE: Similar to an mmap() of a PROT_READ region of memory, this advice flag tells the runtime that the GPUs can share read-only regions of memory far larger than the actual amount of global memory on the device. This flag requires that the GPU have an MMU (Memory Management Unit) that can handle page faults, which I understand is not the case with current NVIDIA GPUs.
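To be clear, cudaMadvise() is a proposal, not an existing call. The following sketch only shows how the proposed API might look in use; the enum values, signature, and helper function are all hypothetical.

```cuda
// HYPOTHETICAL: cudaMadvise() does not exist in CUDA 6.0. This sketch
// illustrates the proposed usage only and will not compile against the
// real CUDA runtime.
#include <cuda_runtime.h>

enum {
    CUDA_MADV_READONLY          = 1,  // proposed: concurrent multi-device ownership
    CUDA_MADV_READONLY_PAGEABLE = 2   // proposed: demand-paged read-only sharing
};

// Proposed signature, modeled on POSIX madvise(void*, size_t, int).
cudaError_t cudaMadvise(void *ptr, size_t length, int advice);

void shareReadOnlyAcrossGpus(float *managed, size_t bytes)
{
    // Advise the runtime that no device will write this region, so all
    // GPUs could hold it concurrently instead of serializing kernel
    // launches behind a single exclusive owner.
    cudaMadvise(managed, bytes, CUDA_MADV_READONLY);
}
```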
At GTC 2014, I had many informative and productive discussions with the NVIDIA engineers about a possible cudaMadvise() call. My hope is that some form of advisory call is indicated by the “Optimizations” arrow in the Unified Memory roadmap slide:
CUDA 6 Unified Memory Roadmap
For more information about Unified Memory