Part B - CUDA Programming Model

Kernels and Grids

Describe how to write code that executes on the device
Describe the organization and launching of threads for execution on the device
Show how to identify a thread during its execution on the device

Execution on the Device | Grid Design | Thread Identification | Exercises


The CUDA programming model provides facilities for incorporating custom source code that executes on the GPU.  The CUDA model is an SPMD (single program, multiple data) model in which threads execute the same program on multiple data streams concurrently.  The CUDA runtime maps the threads defined in the source code to the CUDA cores of the device. 

This chapter describes how to code instructions that execute on the GPU and the structure for organizing the threads that execute on the device.  It shows how to configure the thread layout for execution on the multiprocessors and how to launch that configuration.


Execution on the Device

We store the program that executes on the device's multiprocessors in the form of a kernel.  We organize the software threads that execute this program in the form of a grid.  The grid's dimensions depend on the nature of our algorithm.  We define the kernel and the grid in the source code, which we store in a file with the extension .cu.

The nvcc compiler driver distinguishes the code destined for the device from the rest of the source code, which executes on the host.  The driver sends the host code to the default C/C++ compiler and compiles the device code to PTX (Parallel Thread Execution) instructions for subsequent execution on the GPU. 

Kernel Definition

Kernel code is similar to C/C++ function code.  CUDA kernels do not return values: their return type is void.  Their parameter list follows their identifier, as shown below.  Typically, this list includes one or more pointers, which receive addresses within device memory that we allocated outside the kernel definition.  The __global__ keyword identifies a kernel definition as code that executes on the device as a direct result of launching a grid of threads: 

 __global__ void kernel_identifier(float* a, ...) {

     // add device logic here

 }

Each thread in the grid executes the kernel's instructions on a CUDA core independently of every other thread. 

Thread Identifier

The built-in identifier threadIdx is a struct consisting of three members x, y and z, which taken together uniquely identify a thread within its block.  threadIdx.x holds the x-index of the thread that is executing the kernel's instructions. 

The following kernel code initializes the element of a associated with thread threadIdx.x to a received value (v):

 __global__ void initialize(float* a, float v) {

     a[threadIdx.x] = v;

 }

Launching a Grid of Threads

We launch a grid of threads from the host code.  A launch is similar to a function call in that it transfers control to the kernel identified in the call.  A launch differs from a function call in that it initiates the concurrent execution of all of the threads in the specified grid rather than initiating a single flow of control. 

Launch Statement

The launch statement consists of the kernel's identifier, the execution configuration and an argument list.  The identifier is the name of the kernel code.  The execution configuration specifies the grid design, and the argument list defines the values passed to the kernel's parameters. 

The execution configuration consists of several arguments enclosed within a template-like triple chevron pair:

 kernel_identifier<<<dGrid,dBlock>>>(a, n);

The arguments to the execution configuration - dGrid and dBlock in this case - specify the layout of blocks of threads within the grid and threads within a block respectively. 

Example

The following program launches a grid of n threads, where n does not exceed 1024 (MAX_THREADS).  Each thread initializes one element in the device array a to the value received in the second kernel parameter:

 // Initialize Memory using a Kernel
 // initialize.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>
 const int MAX_THREADS = 1024;

 __global__ void initialize(float* a, float v) {
     a[threadIdx.x] = v;
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_elements initial_value\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     if (n > MAX_THREADS) n = MAX_THREADS;
     float v = std::atof(argv[2]);

     // allocate host memory
     float* h_a = new float[n];
     // allocate device memory
     float* d_a;
     cudaMalloc((void**)&d_a, n * sizeof(float));

     // launch a grid of 1 block of n threads
     initialize<<<1, n>>>(d_a, v);

     // copy from device memory to host memory
     cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost); 

     // display contents of host memory
     for (int i = 0; i < n; i++)
         std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
     std::cout << std::endl;

     // deallocate device memory
     cudaFree(d_a);
     // deallocate host memory
     delete [] h_a;
 }

The result of compiling this program using nvcc at the command line should look something like:

 >nvcc initialize.cu
 initialize.cu
    Creating library a.lib and object a.exp initialize.cu  

Executing the compiled code should produce results like:

 >a 50 2.5
 2.5 2.5 2.5 2.5 2.5 
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5
 2.5 2.5 2.5 2.5 2.5

Coordination

A launch statement differs from a function call in that it returns control to the caller before completing kernel execution.  That is, after a kernel launch, the host code can continue executing concurrently with the device code. 

Concurrent execution can lead to the host code accessing certain data before it has been updated by the device.  This is important for copy operations.  Before copying from the device to the host, we need to ensure that the data on the device has been updated and is ready to be copied.  We do so by coordinating the host operations with the device operations. 

The CUDA runtime coordinates some operations automatically and treats others as asynchronous.  Copy operations between host and device memories synchronize automatically and don't require intervention.  For instance, a call to cudaMemcpy() does not return until all copying is complete.  Similarly, a launch starts executing kernel code only after the device has completed all previous device calls. 
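
For example, the copy in initialize.cu needs no explicit synchronization; the following sketch, using the variables from that program, relies entirely on this implicit ordering:

     // the launch returns immediately; the copy waits for the kernel
     // to finish before transferring the data to the host
     initialize<<<1, n>>>(d_a, v);
     cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);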

The need for intervention may arise after a kernel launch.  The device returns control immediately and does not block continued execution of subsequent host code while the device executes.  To block execution on the host until all device calls have completed, we call the CUDA API function: 

 cudaDeviceSynchronize();

Error Handling

Identifying errors during concurrent execution may require coordination.

Initialization

To determine if an error has occurred during the execution of a kernel, we initialize the error state just before the launch by calling cudaGetLastError().  This call resets the error code to cudaSuccess.  If we omit this call, a call to cudaGetLastError() after the launch might return an error code that was generated before the launch. 

Synchronization

Because a launch is asynchronous, we need to coordinate execution on the host and device before checking for kernel errors.  We do so by synchronizing the device with the host immediately following the launch. 
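
Putting these two steps together with the generic launch shown earlier gives the following pattern (a minimal sketch; the complete program appears in the example below):

     cudaGetLastError();                          // reset the error state to cudaSuccess
     kernel_identifier<<<dGrid, dBlock>>>(a, n);  // asynchronous launch
     cudaDeviceSynchronize();                     // wait for the kernel to complete
     cudaError_t error = cudaGetLastError();      // error, if any, from this launch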

Example

 // Synchronize the Host and Device for Error Handling
 // sync.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>
 const int MAX_THREADS = 1024;

 __global__ void initialize(float* a, float v) {
     a[threadIdx.x] = v;
 }

 bool check(cudaError_t error) {
     bool rc = (error != cudaSuccess);
     if (rc)
         std::cout << cudaGetErrorString(error) << std::endl;
     return rc;
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_elements initial_value\n"; 
         return 1;
     }
     int   n = std::atoi(argv[1]);
     if (n > MAX_THREADS) n = MAX_THREADS;
     float v = std::atof(argv[2]);
     float* d_a = nullptr;
     float* h_a = new float[n];
     cudaMalloc((void**)&d_a, n * sizeof(float));

     // reset the error state to cudaSuccess before the launch
     cudaGetLastError(); 

     // launch a grid of n threads
     initialize<<<1, n>>>(d_a, v);

     // synchronize the device and the host
     cudaDeviceSynchronize();

     // check the error state for the kernel's execution
     if (check(cudaGetLastError())) {
         cudaFree(d_a);
         delete [] h_a;
         return 3;
     }

     if (check(cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost))) {
         cudaFree(d_a);
         delete [] h_a;
         return 4;
     }

     // output for correctness check
     for (int i = 0; i < n; i++)
         std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
     std::cout << std::endl;

     cudaFree(d_a);
     delete [] h_a;
 }

Grid Design

The grid design for an application can take a variety of forms.  We select a layout that suits the solution of our problem.  For instance, a two-dimensional layout suits a matrix-matrix multiplication, with one dimension identifying the rows and the other the columns.  In this case, each thread in the grid has two identifiers: the index of the corresponding row and the index of the corresponding column. 
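
For instance, a two-dimensional configuration along the following lines might map threads to matrix elements (a sketch only; nRows, nCols and the 16 by 16 block size are illustrative assumptions, and the dim3 type and built-in index variables used here are described later in this chapter):

     dim3 dBlock(16, 16);                             // 16 x 16 threads per block
     dim3 dGrid((nCols + dBlock.x - 1) / dBlock.x,    // enough blocks to cover the columns
                (nRows + dBlock.y - 1) / dBlock.y);   // enough blocks to cover the rows
     // within the kernel, each thread recovers its column and row indices:
     //     int col = blockIdx.x * blockDim.x + threadIdx.x;
     //     int row = blockIdx.y * blockDim.y + threadIdx.y;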

Scalability

Most applications that require parallel processing need many more threads than the number of processing units available on the device.  To map the number of threads onto the number of processing units, we divide the grid into blocks of threads.  Each block is independent of any other block and can execute in any order with respect to any other block.  This divisibility of the grid into independent blocks provides scalability with respect to the installed hardware. 

Each multiprocessor executes a subset of blocks from the grid of threads.  Blocks executing on a device with 4 multiprocessors are illustrated in the Figure below.  All of the threads in a block execute on the same multiprocessor. 

Grid of blocks

The threads within a block execute on the processing units (CUDA cores) within the multiprocessor assigned to the block.  The multiprocessor schedules the threads within the block and switches contexts as it moves them in and out of its CUDA cores. 

Block of threads

Grid and Block Layouts

We define the execution configuration by specifying the number of blocks in the grid and the number of threads in each block. 

Blocks in a Grid

We define the block layout within a grid in terms of grid dimensions.  The layout can be one-dimensional, two-dimensional or three-dimensional for devices of compute capability 2.x and higher.  The total number of blocks in a grid is the product of these dimensions. 

Threads in a Block

We define the thread layout within each block in terms of block dimensions.  This layout can also be one-dimensional, two-dimensional or three-dimensional.  The total number of threads in each block is the product of these dimensions. 

Type dim3

We use the built-in type dim3 to specify the execution configuration.  The dim3 struct has three members:

  • .x - the fastest varying dimension
  • .y - the second fastest varying dimension
  • .z - the third fastest varying dimension

The constructor for a dim3 type defaults the value of any unspecified dimension to 1.

Examples

To define a one-dimensional grid of 1500 blocks, with each block consisting of 256 threads, we write:

         dim3 dGrid(1500);
         dim3 dBlock(256);

To define a two-dimensional grid of 30 by 50 blocks, with each block consisting of 16 by 8 by 2 threads, we write:

         dim3 dGrid(30, 50);
         dim3 dBlock(16, 8, 2);
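
As a quick check on the second configuration, the totals follow directly from the products of the dim3 members (a minimal sketch using the declarations above):

         unsigned totalBlocks     = dGrid.x * dGrid.y * dGrid.z;     // 30 * 50 * 1 = 1500 blocks
         unsigned threadsPerBlock = dBlock.x * dBlock.y * dBlock.z;  // 16 * 8 * 2  = 256 threads per block
         unsigned totalThreads    = totalBlocks * threadsPerBlock;   // 384000 threads in the grid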

Ordering Convention

The indexing of the blocks within a grid progresses from the fastest varying dimension to the slowest:

(0, 0), (1, 0), ..., (29, 0),
(0, 1), (1, 1), ..., (29, 1),
...,
(0, 49), (1, 49), ..., (29, 49)

Similarly, the indexing of the threads within each block progresses from fastest varying dimension to slowest:

(0, 0, 0), (1, 0, 0), ..., (15, 0, 0),
(0, 1, 0), (1, 1, 0), ..., (15, 1, 0),
...,
(0, 7, 1), (1, 7, 1), ..., (15, 7, 1)

This ordering system follows the convention for defining screen coordinate axes.  Note that this system does not follow the language conventions for storing the elements of a multi-dimensional array (column-major or row-major). 

The ordering system described in the CUDA documentation is shown in the Figure below.

Ordering of blocks and threads
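
This convention also implies a simple formula for the linear rank of a thread within its block: the x index varies fastest, then y, then z.  In kernel code (a minimal sketch):

     // rank of a thread within its block under the x-fastest ordering
     int rank = threadIdx.x
              + threadIdx.y * blockDim.x
              + threadIdx.z * blockDim.x * blockDim.y;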

Maximum Values

A device's compute capability determines the maximum values for grid and block dimensions as well as the maximum number of threads within a block.  Appendix G of the CUDA Programming Guide lists the limiting values for the different compute capabilities. 

The maximum .x dimension of a grid is

  • 65535 blocks for devices of compute capability up to and including 2.x
  • 2^31 - 1 blocks for devices of compute capability 3.x and 5.x

The maximum .y dimension of a grid is 65535 blocks. 

The maximum .z dimension of a grid is

  • 1 block for devices of compute capability 1.x
  • 65535 blocks for devices of compute capability 2.x or 3.x

The maximum .x or .y dimension of a block is

  • 512 threads for devices of compute capability 1.x
  • 1024 threads for devices of compute capability 2.x, 3.x or 5.x

The maximum .z dimension of a block is 64 threads. 

The maximum number of threads in a block is

  • 512 threads for devices of compute capability 1.x
  • 1024 threads for devices of compute capability 2.x, 3.x or 5.x
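
Rather than memorizing these limits, we can query the installed device at run time, as later examples in this chapter do.  A minimal sketch:

 // query_limits.cu - report the grid and block limits of the installed device
 #include <iostream>
 #include <cuda_runtime.h>

 int main() {
     int d;
     cudaDeviceProp prop;
     cudaGetDevice(&d);
     cudaGetDeviceProperties(&prop, d);
     std::cout << "max grid dimensions  : " << prop.maxGridSize[0] << " x "
               << prop.maxGridSize[1] << " x " << prop.maxGridSize[2] << std::endl;
     std::cout << "max block dimensions : " << prop.maxThreadsDim[0] << " x "
               << prop.maxThreadsDim[1] << " x " << prop.maxThreadsDim[2] << std::endl;
     std::cout << "max threads per block: " << prop.maxThreadsPerBlock << std::endl;
 }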

Selecting Block Dimensions

The limitation on the total number of threads in a block may impose a restriction on our selection of block dimensions.  For example, if the maximum number is limited to 1024 threads, the following values are admissible block dimensions: (1024, 1, 1), (32, 32, 1), (16, 32, 2). 

However, (32, 32, 2) is inadmissible because the total number of threads would be 2048! 
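
A run-time guard against an inadmissible configuration might look like the following sketch (prop is the cudaDeviceProp structure obtained as in the query above):

     dim3 dBlock(32, 32, 2);   // 32 * 32 * 2 = 2048 threads - inadmissible if the limit is 1024
     if (dBlock.x * dBlock.y * dBlock.z > unsigned(prop.maxThreadsPerBlock)) {
         std::cerr << "block configuration exceeds maxThreadsPerBlock\n";
         return 2;
     }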

Launching a Grid

The statement that launches a grid of threads is similar to a templated function call and takes the form

 kernel_identifier<<< dim3 dGrid, dim3 dBlock >>>(argument list);

kernel_identifier is the name of the kernel to be invoked in the launch, dGrid specifies the dimensions of the grid, dBlock specifies the dimensions of each block and argument list specifies the values copied into the kernel's parameters.

The runtime generates the grid structure automatically.  No further coding is necessary.

One-Dimensional Case

To launch a one-dimensional configuration we can specify the .x values directly:

 kernel_identifier<<< int dGrid, int dBlock >>>(argument list);
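
For instance, assuming a kernel that accepts an array and its size, a launch of 8 blocks of 256 threads each would be (illustrative values):

 kernel_identifier<<<8, 256>>>(a, n);   // 8 blocks of 256 threads = 2048 threads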

Thread Identification

Each thread executes the instructions listed in the kernel associated with its grid's launch.  Each thread has access to the configuration parameters that define its location within its parent block and its parent block's location within the grid.  While a thread is executing the kernel's instructions, the registers contain six index values that define the thread's location within the grid:

  • blockIdx.x
  • blockIdx.y
  • blockIdx.z
  • threadIdx.x
  • threadIdx.y
  • threadIdx.z

along with the six dimensions that define the structure of the grid and of each of its blocks:

  • gridDim.x
  • gridDim.y
  • gridDim.z
  • blockDim.x
  • blockDim.y
  • blockDim.z

Depending on the application, we select a suitable number of dimensions within the grid and within each block of the grid.  We start with .x, introduce .y if necessary and finally .z if also necessary.

One-Dimensional Layouts

One-dimensional layouts are the simplest.  They only use the .x member values of these built-in variables.  The other values default to 1.

Example - A Tiny Vector

Consider a vector with few enough elements to fit within a single block.  A one-dimensional grid structure with one block is sufficient.  The number of threads within the block is equal to the number of elements in the vector.  The element index is simply the thread index within that single block:

     int idx = threadIdx.x;

The following program initializes each element of a vector to a prescribed value:

 // Initialize a Tiny Vector using a Kernel
 // tiny_vector.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 __global__ void initialize(float* a, float v) {
     int idx = threadIdx.x;
     a[idx] = v;
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << "***Incorrect number of arguments***\n";
         return 1;
     }
     unsigned n = atoi(argv[1]);
     float v = atof(argv[2]);
     int   d;
     cudaDeviceProp prop;
     cudaGetDevice(&d);
     cudaGetDeviceProperties(&prop, d);
     unsigned n_max = prop.maxThreadsDim[0];
     if (n > n_max) {
        n = n_max;
        std::cout << "n reduced to " << n << std::endl;
     }

     float* d_a;
     float* h_a = new float[n];
     cudaMalloc((void**)&d_a, n * sizeof(float));

     // launch a grid of 1 block of n threads
     initialize<<<1, n>>>(d_a, v);

     // copy from device to host memory
     cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost); 

     // output for correctness check
     std::cout << h_a[0] << " ... " << h_a[n - 1] << std::endl;
     cudaFree(d_a);
     delete [] h_a;
 }

Executing the compiled code should produce results like:

 >a 5000 2.5
 n reduced to 1024
 2.5 ... 2.5

The check on n prevents the code from crashing if n exceeds the maximum number of threads allowed within a single block on the installed device.

Example - A Medium-Sized Vector

Consider a medium-sized vector with few enough elements to fit within a fully one-dimensional grid.  A one-dimensional grid of one-dimensional blocks is sufficient.  The element index follows from the block index and the thread index:

     int idx = blockIdx.x * blockDim.x + threadIdx.x; 

The following program initializes each element of a vector to a prescribed value:

 // Initialize a Medium-Sized Vector using a Kernel
 // medium_vector.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 __global__ void initialize(float* a, float v, int n) {
     int idx = blockIdx.x * blockDim.x + threadIdx.x;
     if (idx < n)
         a[idx] = v;
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << "***Incorrect nuber of arguments***\n";
         return 1;
     }
     unsigned n = atoi(argv[1]);
     float v = atof(argv[2]);
     int   d;
     cudaDeviceProp prop;
     cudaGetDevice(&d);
     cudaGetDeviceProperties(&prop, d);
     unsigned ntpb = prop.maxThreadsDim[0];
     unsigned ntpg = ntpb * prop.maxGridSize[0];
     if (n > ntpg) {
        n = ntpg;
        std::cout << "n reduced to " << n << std::endl;
     }

     float* d_a;
     float* h_a = new float[n];
     cudaMalloc((void**)&d_a, n * sizeof(float));

     // launch a grid with enough blocks of ntpb threads each
     initialize<<<(n + ntpb - 1) / ntpb, ntpb>>>(d_a, v, n);

     // copy from device to host memory
     cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost); 

     // output for correctness check
     std::cout << h_a[0] << " ... " << h_a[n - 1] << std::endl;
     cudaFree(d_a);
     delete [] h_a;
 }

Since the size of the vector is not necessarily exactly divisible by the number of threads in a block, the last block may contain fewer threads than the maximum number of threads in a block (<= ntpb).  For example, on a device where ntpb is 1024, a vector of 5000 elements requires (5000 + 1023) / 1024 = 5 blocks, or 5120 threads in total.  The check on idx within the kernel excludes the surplus threads - here, the last 120 - from writing outside the device memory allocated by the host.

The results of executing the compiled code should look something like:

 >a 5000 4.5
 4.5 ... 4.5 

Two-Dimensional Data

Consider a two-dimensional data structure like a matrix or an image.  A two-dimensional layout of this structure uses the .x and .y members of the built-in variables to identify the threads.  Each thread in the grid has a unique pair of indices:

     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;

The following program initializes the coefficients of a matrix that is stored in one-dimensional form:

 // Initialize Memory using a Kernel - Two-Dimensional Data
 // matrix_thread_id.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 const unsigned ntpb = 32;

 __global__ void initialize(float* a, float v, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < n && j < n)
         a[j * n + i] = v;
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << "***Incorrect number of arguments***\n";
         return 1;
     }
     unsigned n = atoi(argv[1]);
     float v = atof(argv[2]);

     int nb = (n + ntpb - 1) / ntpb;
     std::cout << "n = " << n << ", No of Blocks = " << nb
      << ", No of Threads Per Block = " << ntpb << std::endl;

     float* d_a = nullptr;
     cudaMalloc((void**)&d_a, n * n * sizeof(float));
     if (!d_a) {
         std::cerr << "***Out of Memory***\n";
         return 2;
     }
     float* h_a = new float[n * n];

     // launch
     dim3 dGrid(nb, nb, 1);
     dim3 dBlock(ntpb, ntpb, 1);
     initialize<<<dGrid, dBlock>>>(d_a, v, n);

     // copy from device to host memory
     cudaMemcpy(h_a, d_a, n * n * sizeof(float), cudaMemcpyDeviceToHost); 

     // check correctness
     for (int i = 0; i < n * n; i++)
         if (h_a[i] != v) std::cout << h_a[i] << " != " << v << std::endl;
     std::cout << "done" << std::endl;

     cudaFree(d_a);
     delete [] h_a;
     cudaDeviceReset();
 }

The results of executing the compiled code should look something like:

 >a 5000 4.5
 n = 5000, No of Blocks = 157, No of Threads Per Block = 32
 done

 >a 22400 4.5
 n = 22400, No of Blocks = 700, No of Threads Per Block = 32
 done

 >a 25000 4.5
 n = 25000, No of Blocks = 782, No of Threads Per Block = 32 
 ***Out of Memory***

Exercises

  • Upgrade the code in initialize.cu to set the limit of the number of threads to the maximum value for the installed device 
  • Start the Workshop on A Simple Kernel



