Part B - CUDA Programming Model

Warp Partitioning

Describe the structure of CUDA-enabled devices
Describe the scheduling of threads
Identify and resolve thread divergence



The CUDA runtime launches kernel execution by assigning blocks of threads to the streaming multi-processors (SMs) on the device.  The number of multi-processors depends on the device model.  Each multi-processor executes the kernel on the blocks assigned to it independently of the other multi-processors. 

Each SM consists of a set of streaming processors (SPs).  The number of SPs in an SM depends on the architecture of the device but is the same for every SM in the device.  Each SP, also called a CUDA core, processes a single thread.  Each SM partitions the threads in each block assigned to it by the CUDA runtime and schedules each partition to execute on its SPs.  The total time that the device needs to execute a kernel on a grid of threads depends on how we design that grid and on the compatibility of our kernel code with the partitioning scheme implemented by the SM. 

This chapter describes the resource limitations of an SM, outlines the partitioning scheme that each SM uses to schedule threads within each block and shows how to write kernel code that is compatible with that mechanism. 


Device Resources

Designing an optimal execution configuration requires some familiarity with the physical resources provided by the device.  A table listing all of the resources available on devices of different compute capabilities may be found in the Technical Specifications section of Appendix G of the CUDA Toolkit Documentation.

Streaming Multi-Processors

A CUDA-enabled device consists of a set of streaming multi-processors (SMs).  The SMs share all of the global memory on the device. 

The resources that each SM provides vary with the device's compute capability.  These resources represent limits on the processing of blocks and the scheduling of threads.  The resources are the same for each SM within a particular device.  Three resources that are key to optimal design are listed in the table below: 

Compute Capability            1.0, 1.1  1.2, 1.3  2.x   3.0   3.5   3.7   5.0   5.2   5.3
max no of threads per block   512       512       1024  1024  1024  1024  1024  1024  1024
max no of resident blocks     8         8         8     16    16    16    32    32    32
max no of resident threads    768       1024      1536  2048  2048  2048  2048  2048  2048

The number of resident blocks is the number of blocks that an SM can process simultaneously.  The number of resident threads is the number of threads that an SM can track and schedule simultaneously. 

Streaming Processors

Each SP executes a kernel instruction on one thread.  The SP shares the control logic and instruction cache with the other SPs on the SM.  Each SP has its own multiply-add unit (MAD) and its own multiply unit (MUL). 

Designing within Physical Constraints

We maximize the use of each SM within the constraints imposed by the device's compute capability by selecting optimal block sizes for our threads.  For example, an SM of compute capability 1.2 can execute 1024 resident threads simultaneously.  We can organize the threads that run on that SM into 2 blocks of 512, 4 blocks of 256, or 8 blocks of 128 threads each.  No SM on the device can process 1 block of 1024 threads, because 1024 exceeds the maximum number of threads per block (512).  No SM can process 16 blocks of 64 threads each, because 16 exceeds the maximum number of resident blocks (8).
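
These limits can also be queried at run time.  The following host program is a minimal sketch (assuming a single device at index 0) that reports the cudaDeviceProp fields most relevant to this discussion:

 // Device Limits (illustrative sketch)
 // query_limits.cu

 #include <iostream>
 #include <cuda_runtime.h>

 int main() {
     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, 0);   // properties of device 0
     std::cout << "Compute capability     : "
               << prop.major << "." << prop.minor << std::endl;
     std::cout << "Multi-processors (SMs) : "
               << prop.multiProcessorCount << std::endl;
     std::cout << "Max threads per block  : "
               << prop.maxThreadsPerBlock << std::endl;
     std::cout << "Max resident threads   : "
               << prop.maxThreadsPerMultiProcessor << std::endl;
     std::cout << "Warp size              : "
               << prop.warpSize << std::endl;
 }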

Grid Design

The CUDA runtime allocates threads to SMs block by block.  The resources available on each SM determine the number of blocks that each SM can run.  If our grid design exceeds any resource limit, the CUDA runtime reduces the number of blocks that it assigns to each SM incrementally until the number of blocks assigned fits within all resource constraints. 

Note that runtime reduction to fit within resource limits occurs at the block level, not at the thread level. 

Example

Consider a grid of 2048 threads on a device of compute capability 1.2.  If we configure the grid into 8 blocks of 16 x 16 threads, each SM will process 4 blocks at a time (16 * 16 = 256 < 512 threads per block; 4 * 16 * 16 = 1024 <= 1024 resident threads).  This configuration requires 2 SMs on the device for simultaneous processing of all of the threads in the grid. 

Now consider the same grid of 2048 threads on the same device, but configured into two-dimensional blocks of 8 x 8 threads.  This design requires 32 blocks for all 2048 threads.  Each SM can only process 8 blocks at a time (<= 8 resident blocks), that is, 8 * 64 = 512 threads.  Hence, this configuration requires 4 SMs (2048 / (8 * 64) = 4) for simultaneous processing of all of the threads in the grid. 

The former design makes significantly better use of a 1.2 device's resources. 
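
The arithmetic behind these two designs can be expressed in a few lines of host code.  The sketch below hard-codes the 1.2 limits from the table above rather than querying the device; it simply reproduces the calculations in this example:

 // Grid Design Arithmetic (illustrative sketch)
 // grid_design.cpp

 #include <iostream>
 #include <algorithm>

 int main() {
     const int maxResidentBlocks  = 8;     // compute capability 1.2 limits
     const int maxResidentThreads = 1024;
     const int gridThreads        = 2048;

     int blockSizes[] = { 16 * 16, 8 * 8 };
     for (int threadsPerBlock : blockSizes) {
         int blocksInGrid = gridThreads / threadsPerBlock;
         // an SM exhausts either its resident blocks or its resident threads first
         int blocksPerSM  = std::min(maxResidentBlocks,
                                     maxResidentThreads / threadsPerBlock);
         int smsRequired  = (blocksInGrid + blocksPerSM - 1) / blocksPerSM;
         std::cout << threadsPerBlock << " threads per block -> "
                   << blocksPerSM << " blocks per SM, "
                   << smsRequired << " SMs required" << std::endl;
     }
 }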


Thread Scheduling

Each SM schedules the threads within a block for execution in partitions.  CUDA calls each partition within a resident block a warp.  Each warp consists of a contiguous subset of threads within the resident block.  The number of threads within a warp on devices of compute capability 1.x, 2.x, 3.x and 5.x is 32.  (This number is implementation dependent and not part of the CUDA specification proper.) 

The first warp (0) within a resident block consists of threads 0,...,31; the second warp (1) consists of threads 32,...,63; the third warp (2) consists of threads 64,...,95; and so forth.  For a one-dimensional block, warp i consists of threads 32*i,...,32*(i+1)-1; equivalently, the warp index of a thread is its thread index divided by 32. 
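
A kernel can recover its warp and lane indices directly from the thread index.  The kernel below is an illustrative sketch (device-side printf assumes compute capability 2.0 or later); warpSize is a built-in device variable equal to 32 on current devices:

 // Warp and Lane Indices (illustrative sketch)
 // warp_index.cu

 #include <cstdio>
 #include <cuda_runtime.h>
 #include <device_launch_parameters.h>

 __global__ void showWarps() {
     int warp = threadIdx.x / warpSize;   // index of this thread's warp
     int lane = threadIdx.x % warpSize;   // position within the warp
     if (lane == 0)                       // report once per warp
         printf("warp %d holds threads %d ... %d\n",
                warp, warp * warpSize, (warp + 1) * warpSize - 1);
 }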

Each SM loads a warp of threads into one of its schedulers and executes a single kernel instruction on all of the threads within that warp.  The SM does not issue the next kernel instruction to the warp until it has completed the current instruction on all of the threads in the warp.  This Single Instruction Multiple Thread (SIMT) style amortizes the cost of fetching and processing an instruction over all of the threads in the warp. 

Thread Divergence

SIMT execution works efficiently provided that all threads within a warp follow the same control path.  If different threads in the warp follow different control paths, the SM requires multiple passes to complete execution of the instruction across the entire warp: one pass for each distinct control path.  We say that the threads in such a kernel diverge in their execution.  Thread divergence occurs when the kernel includes a selection or iteration construct whose condition depends on the thread index. 
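
For instance, the two kernels sketched below (the kernel names are illustrative) show the difference.  In the first, threads within the same warp take different branches, so each warp executes the selection construct in two passes; in the second, the condition is uniform across each warp, so no warp diverges:

 // Divergent and Non-Divergent Branches (illustrative sketch)
 // branch.cu

 #include <cuda_runtime.h>
 #include <device_launch_parameters.h>

 // diverges: even and odd threads in the same warp follow different paths
 __global__ void divergent(float* a) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (threadIdx.x % 2 == 0)
         a[i] *= 2.0f;
     else
         a[i] += 1.0f;
 }

 // does not diverge: the condition is the same for every thread in a warp
 __global__ void uniform(float* a) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if ((threadIdx.x / warpSize) % 2 == 0)
         a[i] *= 2.0f;
     else
         a[i] += 1.0f;
 }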

Example

Consider the reduction loop in the following program.  The program uses a reduction algorithm with shared memory to calculate the sum of the elements of an n-element vector, which is stored in global memory as a device vector with nblks * ntpb elements. 

 // Thread Divergence
 // divergence.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 const int ntpb = 512;

 __global__ void reduce(float* a, float* b) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     __shared__ float s[ntpb];
     s[t] = a[i];
     __syncthreads();


     for (int stride = 1; stride < blockDim.x; stride <<= 1) {
         if (t % (2 * stride) == 0)
             s[t] += s[t + stride];
         __syncthreads();
     }

     if (t == 0)
         b[blockIdx.x] = s[0];
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
             << ": invalid no of arguments\n"
             << "Usage: " << argv[0]
             << "  no_of_elements\n";
         return 1;
     }
     int   n = atoi(argv[1]);
     // determine the number of blocks required
     int nblks = (n + ntpb - 1) / ntpb;

     // allocate host memory
     float* h_a = new float[nblks * ntpb];
     float* h_b = new float[nblks];
     // initialize host memory
     for (int i = 0; i < n; i++)
         h_a[i] = float(std::rand()) / RAND_MAX;
     for (int i = n; i < nblks * ntpb; i++)
         h_a[i] = 0.0f;
     float h_sum = 0.0f;
     for (int i = 0; i < n; i++)
         h_sum += h_a[i];

     // allocate device memory
     float* d_a; // full device vector
     float* d_b; // device sum per block
     cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
     cudaMalloc((void**)&d_b, nblks * sizeof(float));
     cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice); 

     // reduce to partial sums
     reduce<<<nblks, ntpb>>>(d_a, d_b);

     // copy from device to host memory
     cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
     float d_sum = 0.0f;
     for (int i = 0; i < nblks; i++)
         d_sum += h_b[i];

     // report sums
     std::cout << std::fixed << std::setprecision(1);
     std::cout << "Host sum   = " << h_sum << std::endl;
     std::cout << "Device sum = " << d_sum << std::endl;

     // deallocate memory
     delete[] h_a;
     delete[] h_b;
     cudaFree(d_a);
     cudaFree(d_b);
 } 

The reduction loop sums the values in shared memory using the thread index and the summation stride.  On the first pass, the control path differs between odd and even thread indices.  On each pass, only threads whose index is divisible by twice the stride accumulate values; all other threads follow the control path that skips the accumulation statement.

[Figure: Thread Divergence]

Thread divergence dominates this reduction algorithm. 

To minimize thread divergence, we can rewrite the iteration to move from the largest stride (half the number of threads in the block) to the smallest (a stride of one) rather than from the smallest stride to the largest.  Starting from the largest stride, the additions are performed by all of the threads in the first half of the block, then by all of the threads in the first quarter of the block, and so forth, as illustrated in the Figure below. 

[Figure: Less Thread Divergence]

Thread divergence only dominates this algorithm once the stride drops below the number of threads in a warp.  The optimized loop appears in the listing below:

 // Less Thread Divergence
 // less_divergence.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 const int ntpb = 512;

 __global__ void reduce(float* a, float* b) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     __shared__ float s[ntpb];
     s[t] = a[i];
     __syncthreads();


     for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
         if (t < stride)
             s[t] += s[t + stride];
         __syncthreads();
     }

     if (t == 0)
         b[blockIdx.x] = s[0];
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
             << ": invalid no of arguments\n"
             << "Usage: " << argv[0]
             << "  no_of_elements\n";
         return 1;
     }
     int   n = atoi(argv[1]);
     // determine the number of blocks required
     int nblks = (n + ntpb - 1) / ntpb;

     // allocate host memory
     float* h_a = new float[nblks * ntpb];
     float* h_b = new float[nblks];
     // initialize host memory
     for (int i = 0; i < n; i++)
         h_a[i] = float(std::rand()) / RAND_MAX;
     for (int i = n; i < nblks * ntpb; i++)
         h_a[i] = 0.0f;
     float h_sum = 0.0f;
     for (int i = 0; i < n; i++)
         h_sum += h_a[i];

     // allocate device memory
     float* d_a; // full device vector
     float* d_b; // device sum per block
     cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
     cudaMalloc((void**)&d_b, nblks * sizeof(float));
     cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice); 

     // reduce to partial sums
     reduce<<<nblks, ntpb>>>(d_a, d_b);

     // copy from device to host memory
     cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
     float d_sum = 0.0f;
     for (int i = 0; i < nblks; i++)
         d_sum += h_b[i];

     // report sums
     std::cout << std::fixed << std::setprecision(1);
     std::cout << "Host sum   = " << h_sum << std::endl;
     std::cout << "Device sum = " << d_sum << std::endl;

     // deallocate memory
     delete[] h_a;
     delete[] h_b;
     cudaFree(d_a);
     cudaFree(d_b);
 } 

Resident Warps

The physical limitations on each SM in the device include a maximum number of resident warps.  This number is the number of resident threads divided by the number of threads in a warp.  The maximum number of resident warps for different compute capabilities is listed in the table below:

Compute Capability          1.0, 1.1  1.2, 1.3  2.0   2.1   3.0   3.5   3.7   5.0   5.2   5.3
max no of resident warps    24        32        48    48    64    64    64    64    64    64
max no of resident threads  768       1024      1536  1536  2048  2048  2048  2048  2048  2048
no of streaming processors  8         8         32    48    192   192   192   128   128   128

The maximum number of resident warps in an SM exceeds the number of SPs in the SM for compute capabilities 1.x and 2.0; in every case, the number of resident threads exceeds the number of SPs.  This over-subscription lets the SM hide long-latency operations, such as global memory accesses, by switching execution to warps whose next instruction is ready. 
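
The number of blocks that actually becomes resident on an SM for a given kernel and block size can be queried through the occupancy API.  The following host program is a minimal sketch, assuming CUDA 6.5 or later and a trivial kernel defined here for the purpose:

 // Resident Warps (illustrative sketch)
 // resident_warps.cu

 #include <iostream>
 #include <cuda_runtime.h>
 #include <device_launch_parameters.h>

 __global__ void copy(float* a, float* b) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     b[i] = a[i];
 }

 int main() {
     const int ntpb = 512;
     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, 0);

     int blocksPerSM = 0;
     cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copy, ntpb, 0);

     int warpsPerSM    = blocksPerSM * ntpb / prop.warpSize;
     int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
     std::cout << "resident blocks per SM : " << blocksPerSM << std::endl;
     std::cout << "resident warps per SM  : " << warpsPerSM
               << " of a maximum " << maxWarpsPerSM << std::endl;
 }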

Multi-Dimensional Blocks

The partitioning of threads within a multi-dimensional block follows the linear ordering of the thread indices: the .x index increases fastest, the .y index next, and the .z index slowest.  Accordingly, each SM assigns the threads within a block to warps by .x index first, then by .y index and finally by .z index. 

For example, in a 2 x 2 x 2 block, the thread to warp configuration is [.x .y .z] = {[0 0 0], [1 0 0], [0 1 0], [1 1 0], [0 0 1], [1 0 1], [0 1 1], [1 1 1]}. 
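
The kernel below sketches this ordering (device-side printf assumes compute capability 2.0 or later).  It computes the linear position of each thread in its block and the warp to which that position belongs:

 // Warp Partitioning in a Multi-Dimensional Block (illustrative sketch)
 // linearize.cu

 #include <cstdio>
 #include <cuda_runtime.h>
 #include <device_launch_parameters.h>

 __global__ void linearize() {
     // .x increases fastest, then .y, then .z
     int linear = threadIdx.x
                + threadIdx.y * blockDim.x
                + threadIdx.z * blockDim.x * blockDim.y;
     printf("[%d %d %d] -> linear %d, warp %d\n",
            threadIdx.x, threadIdx.y, threadIdx.z,
            linear, linear / warpSize);
 }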


Exercises