Part B - CUDA Programming Model

Streams

Introduce pinned memory on the host
Describe concurrent capabilities that depend on device properties
Introduce streams and their asynchronous execution



Managing concurrency across the host and the device involves separating related host requests into tasks that execute independently of one another and managing the data transfers associated with each task.  The kernels and the data transfers execute asynchronously.  To manage these data transfers efficiently, we introduce the concept of pinning host memory.

This chapter describes some CUDA calls that execute asynchronously, describes pinned memory, and introduces the use of streams to manage concurrent execution. 


Concurrent Capabilities

We implement concurrent execution at the host level through asynchronous function calls to device-related operations.  An asynchronous call returns control to the host thread before the device has necessarily completed the requested operation.  Such calls allow data transfers and kernel executions to overlap one another; the sketch following the list below illustrates the pattern.

CUDA asynchronous function calls include:

  • kernel launches
  • memory copies between two addresses within the same device's memory
  • memory copies of 64KB or less from host memory to device memory
  • memory copies performed by functions with the Async suffix in their identifiers
  • memory set function calls
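
The following sketch illustrates this behaviour.  It is a minimal example, not drawn from the reading: the kernel scale, its launch configuration, and the buffer size are arbitrary placeholder choices.  The launch returns control to the host immediately, independent host work can proceed while the device executes, and cudaDeviceSynchronize() blocks the host until the device has finished.

 __global__ void scale(float* d, int n) {           // placeholder kernel
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) d[i] *= 2.0f;
 }

 int main() {
     const int n = 1 << 20;
     float* d_data;
     cudaMalloc((void**)&d_data, n * sizeof(float));

     scale<<<(n + 255) / 256, 256>>>(d_data, n);    // asynchronous: returns immediately
     // ... independent host work can execute here, overlapping the kernel ...
     cudaDeviceSynchronize();                       // block until the device has finished

     cudaFree(d_data);
 }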

Concurrency Types

To determine the concurrent-execution capabilities available on the installed hardware, we interrogate two device properties:

 int iDevice;
 cudaDeviceProp prop;
 cudaGetDevice(&iDevice);                     // index of the currently selected device
 cudaGetDeviceProperties(&prop, iDevice);     // retrieve its properties
 if (prop.asyncEngineCount ... )              // copy engines: data-kernel and data-data overlap
 if (prop.concurrentKernels ... )             // kernel-kernel overlap

Data-Kernel Concurrency

Some devices of compute capability 1.1 and higher can perform copies between host memory and device memory alongside kernel execution.  This capability is available if the device property asyncEngineCount is greater than 0.

 if (prop.asyncEngineCount > 0)
     // data-kernel concurrency is available

Kernel-Kernel Concurrency

Some devices of compute capability 2.x and higher can execute multiple kernels alongside one another.  This capability is available if the device property concurrentKernels is 1.

 if (prop.concurrentKernels == 1)
     // kernel-kernel concurrency is available

Data-Data Concurrency

Some devices of compute capability 2.x and higher can perform copies between host memory and device memory in both directions concurrently; that is, a copy from host to device can overlap a copy from device to host.  This capability is available if the device property asyncEngineCount is equal to 2.

 if (prop.asyncEngineCount == 2)
     // data-data concurrency is available
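
Putting the three checks together, a minimal sketch that queries the installed device and reports which of these capabilities it supports might look like this:

 #include <cstdio>

 int main() {
     int iDevice;
     cudaDeviceProp prop;
     cudaGetDevice(&iDevice);
     cudaGetDeviceProperties(&prop, iDevice);

     if (prop.asyncEngineCount > 0)               // a copy can overlap kernel execution
         printf("data-kernel concurrency is available\n");
     if (prop.concurrentKernels == 1)             // kernels can overlap one another
         printf("kernel-kernel concurrency is available\n");
     if (prop.asyncEngineCount == 2)              // copies in both directions can overlap
         printf("data-data concurrency is available\n");
 }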

Pinned Memory

Pinning is an operation on system memory that keeps a region of virtual memory attached to the same physical memory.  Operating systems normally manage system memory through paging between primary and secondary memory.  Paging lets a process's virtual address space map to physical memory that is noncontiguous and that may change over time.  Dynamic memory allocated by malloc() and deallocated by free() is pageable memory on the host.

Pinned memory is host memory that has been page-locked; that is, excluded from this normal paging mechanism.  Once locked, this memory is unavailable to other applications and to the operating system's memory manager.  Pinned memory is a scarce resource: pinning too much physical memory reduces the memory available for paging, which can severely degrade overall system performance.

CUDA Calls

Allocation and Deallocation

To allocate pinned memory on the host, we call

 cudaError_t
 cudaMallocHost(void** p, size_t nb);

The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy*().

To deallocate pinned memory on the host, we call

 cudaError_t
 cudaFreeHost(void* p);

By default, pinned memory is cacheable.
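
As a minimal sketch (the buffer names h_data and d_data, the element type, and the buffer size are arbitrary choices), the following program allocates a pinned host buffer, uses it for transfers, and releases it:

 int main() {
     const size_t n  = 1 << 20;
     const size_t nb = n * sizeof(float);

     float* h_data;                               // pinned host buffer
     cudaMallocHost((void**)&h_data, nb);

     float* d_data;                               // device buffer
     cudaMalloc((void**)&d_data, nb);

     // transfers to and from pinned memory are accelerated by the driver
     cudaMemcpy(d_data, h_data, nb, cudaMemcpyHostToDevice);
     cudaMemcpy(h_data, d_data, nb, cudaMemcpyDeviceToHost);

     cudaFree(d_data);
     cudaFreeHost(h_data);                        // release the pinned allocation
 }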

Write-Combining Memory

Write-Combining memory is pinned memory that minimizes snooping and frees up L1 and L2 cache resources, making more cache available to the rest of an application.  Snooping is a technique used to maintain coherence across caches on a set of processors that may hold copies of the same location in system memory. 

Write-combining memory is not snooped during transfers across the PCIe bus.  This can improve transfer rates in writing by up to 40%.  However, reading from write-combining memory is extremely slow. 

Write-combining memory is a good option for buffers that will be written by the CPU and read by the device using mapped pinned memory or host to device transfers.  It should only be used for memory to which the host writes.  To allocate write-combined memory, we call:

 cudaError_t
 cudaHostAlloc(void** p, size_t nb, unsigned flags);

with cudaHostAllocWriteCombined as the flags argument.  Passing a zero-valued flags argument makes this call equivalent to a call to cudaMallocHost().

Memory allocated by cudaHostAlloc() must be freed by:

 cudaError_t
 cudaFreeHost(void* p);
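
A minimal sketch of allocating a write-combined buffer that the host only writes and the device reads follows; the buffer names and size are arbitrary choices for illustration:

 int main() {
     const size_t n  = 1 << 20;
     const size_t nb = n * sizeof(float);

     float* h_in;                                 // write-combined, pinned host buffer
     cudaHostAlloc((void**)&h_in, nb, cudaHostAllocWriteCombined);

     for (size_t i = 0; i < n; i++)               // host writes only; host reads would be slow
         h_in[i] = (float)i;

     float* d_in;                                 // device buffer
     cudaMalloc((void**)&d_in, nb);
     cudaMemcpy(d_in, h_in, nb, cudaMemcpyHostToDevice);

     cudaFree(d_in);
     cudaFreeHost(h_in);                          // memory from cudaHostAlloc() is freed with cudaFreeHost()
 }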

Streams

We manage concurrent execution across the host and the device using streams.  A stream is a sequence of commands that execute in a well-defined order.  Different streams can execute out of order or concurrently with respect to one another. 

Create and Release

To create a set of NSTREAMS streams, we declare an array of cudaStream_t objects and create each stream:

 cudaStream_t stream[NSTREAMS];
 for (int i = 0; i < NSTREAMS; i++)
     cudaStreamCreate(&stream[i]);
          

To release the cudaStream_t objects:

 for (int i = 0; i < NSTREAMS; i++)
     cudaStreamDestroy(stream[i]);
          

cudaStreamDestroy() waits until all commands in the stream have completed before releasing its stream.

Default Stream

Kernel launches and memory copies that do not specify a stream, or that set the stream parameter to 0, are issued to the default stream.  Commands in the default stream execute in order, just like commands in any other stream.

Simple Streaming

In the following example, each stream executes its own sequence of a host-to-device copy, a kernel launch, and a device-to-host copy, possibly concurrently with the other streams.  a is a pinned host buffer, b and c are device buffers allocated elsewhere, and size holds the number of bytes of data processed by each stream:

 float* a;
 cudaMallocHost((void**)&a, NSTREAMS * size);
 cudaStream_t stream[NSTREAMS];
 for (int i = 0; i < NSTREAMS; i++)
     cudaStreamCreate(&stream[i]);

 // define each stream as a copy, a launch and a copy
 for (int i = 0; i < NSTREAMS; i++) {
     cudaMemcpyAsync(b + i * size, a + i * size, size,
      cudaMemcpyHostToDevice, stream[i]);
     kernel<<<dimGrid, dimBlock, 0, stream[i]>>>(c + i * size, b + i * size, size);
     cudaMemcpyAsync(a + i * size, c + i * size, size,
      cudaMemcpyDeviceToHost, stream[i]);
 }

 for (int i = 0; i < NSTREAMS; i++)
     cudaStreamDestroy(stream[i]);
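
Note that the stream appears as the fourth argument of the kernel's execution configuration.  Omitting it would launch the kernel into the default stream, and the copy-launch-copy sequences would no longer belong to independent streams.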

Analysis

On devices that do not support concurrent data transfers, the streams do not overlap because the memory copy from host to device for stream[i + 1] is issued after the memory copy from device to host for stream[i].

If the device supports concurrent data transfers, the streams overlap: the memory copy from host to device overlaps the memory copy from device to host.  If the device supports concurrent data transfer and kernel execution as well, the kernel execution and the memory copy from host to device overlap the memory copy from device to host of the preceding stream.

Implicit Synchronization

Two commands from different streams cannot run concurrently if the host thread issues one of the following operations between them:

  • allocation of pinned memory
  • allocation of device memory
  • setting of device memory
  • copying between two addresses in device memory
  • a CUDA command to the default stream
  • a switch between L1 and shared memory configurations

On devices that support concurrent kernel execution, operations that require a dependency check to see if a kernel launch is complete:

  • start executing only when all thread blocks of all prior kernel launches from any stream have started executing
  • block all later kernel launches from any stream until the kernel launch is complete
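
For example, a brief sketch reusing the buffers, streams, and kernel from the simple streaming example above: a command issued to the default stream between two asynchronous copies prevents them from overlapping, even though they belong to different streams.

 cudaMemcpyAsync(b, a, size, cudaMemcpyHostToDevice, stream[0]);
 kernel<<<dimGrid, dimBlock>>>(c, b, size);     // default stream: implicit synchronization
 cudaMemcpyAsync(b + size, a + size, size,
  cudaMemcpyHostToDevice, stream[1]);           // cannot start until the kernel completes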

Managing Overlap

In the following example, the host issues all of the memory copies to the device first, then all of the kernel launches, and finally all of the memory copies back to the host.  Each stream still processes its own portion of the data:

 float* a;
 cudaMallocHost((void**)&a, NSTREAMS * size);
 cudaStream_t stream[NSTREAMS];
 for (int i = 0; i < NSTREAMS; i++)
     cudaStreamCreate(&stream[i]);

 // issue all copies to device, then all launches, then all copies back to host
 for (int i = 0; i < NSTREAMS; i++)
     cudaMemcpyAsync(b + i * size, a + i * size, size,
      cudaMemcpyHostToDevice, stream[i]);
 for (int i = 0; i < NSTREAMS; i++)
     kernel<<<dimGrid, dimBlock, 0, stream[i]>>>(c + i * size, b + i * size, size);
 for (int i = 0; i < NSTREAMS; i++)
     cudaMemcpyAsync(a + i * size, c + i * size, size,
      cudaMemcpyDeviceToHost, stream[i]);

 for (int i = 0; i < NSTREAMS; i++)
     cudaStreamDestroy(stream[i]);

Analysis

If the device supports overlap of data transfer and kernel execution, then the memory copy from host to device issued to stream NSTREAMS-1 overlaps with the kernel launch issued to stream 0, and the kernel launch issued to stream NSTREAMS-1 overlaps with the memory copy from device to host issued to stream 0.

If the device supports concurrent kernel execution, the kernel executions overlap since kernel launches issue to successive streams before the memory copy from device to host issues for stream 0.  However, this overlap pertains to the last thread blocks of the kernel launch, which can represent a small portion of the kernel's total execution time. 


Exercises





