Part B - CUDA Programming Model

Optimization

Calculate the occupancy of a streaming multi-processor
Describe the technique of accessing global memory
Outline techniques for fine tuning computational performance



The multi-processor design of CUDA-enabled devices accommodates a variety of grid configurations.  In specifying a grid configuration for a kernel, we determine the occupancy of the device.  The optimal design is one that fully occupies all multi-processors on the device. 

The kernel code determines the memory usage of each block in each multi-processor.  Since a single memory access may take many cycles to complete, communication alone can determine performance.  Designing the kernel to minimize memory access time is one aspect of optimizing performance.

This chapter describes the dynamic partitioning of a multi-processor's resources and introduces the CUDA occupancy calculator.  This calculator helps optimize resource allocation for full occupancy.  This chapter also describes how a multi-processor accesses global memory and shows how to write kernel code to minimize access time.  Lastly, this chapter describes fine-tuning techniques for improving computational throughput.


Dynamic Partitioning of Resources

The multi-processors of a CUDA-enabled device allocate the resources available to kernel code dynamically.  Dynamic partitioning of these resources allows flexibility in grid configurations. 

The CUDA runtime assigns thread blocks to a multi-processor as it becomes unoccupied.  Each multi-processor partitions its assigned thread blocks amongst its thread slots.  Partitioning is constrained by the available physical resources:

  • the number of available thread slots
  • the number of available block slots

Each multi-processor allocates shared memory amongst its active blocks dynamically.  This partitioning is only constrained by the size of the multi-processor's shared memory.  If kernel execution of the allocated blocks requires too much shared memory, the multi-processor reduces the number of blocks that it processes simultaneously. 

Each multi-processor allocates register memory amongst its active threads dynamically.  This partitioning is only constrained by the size of the multi-processor's register file.  If kernel execution of the allocated threads requires too much register memory, the multi-processor reduces the number of blocks that it processes simultaneously. 

The constraints that determine the number of blocks that can execute simultaneously on a multi-processor are listed in the table below:

Compute Capability                   1.0, 1.1  1.2, 1.3    2.x    3.0    3.2    3.5    3.7    5.0    5.2    5.3
max no of threads per block               512       512   1024   1024   1024   1024   1024   1024   1024   1024
max no of resident blocks                   8         8      8     16     16     16     16     32     32     32
max no of resident threads                768      1024   1536   2048   2048   2048   2048   2048   2048   2048
shared memory per multi-processor         16K       16K    48K    48K    48K    48K   112K    64K    96K    64K
shared memory per thread block            16K       16K    48K    48K    48K    48K    48K    48K    48K    48K
register file per multi-processor          8K       16K    32K    64K    64K    64K    64K   128K    64K    64K
register file per thread block              -         -    32K    64K    64K    64K    64K    64K    64K    32K
register file per thread                    -         -     63     63    255    255    255    255    255    255

(A dash indicates a value not listed for that compute capability.)

Designing for Resource Constraints

In designing a grid configuration, we start with the resources that the kernel requires.  The principal limitations, most of which can also be queried at run time as shown in the sketch following this list, are:

  1. the maximum number of resident threads in each multi-processor
  2. the maximum number of resident blocks in each multi-processor
  3. the size of each multi-processor's register file
  4. the amount of shared memory in each multi-processor
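The following minimal sketch (error checking omitted; device 0 assumed) queries these limits through the runtime's cudaGetDeviceProperties() call.  The maximum number of resident blocks per multi-processor does not appear in older versions of the cudaDeviceProp structure and is read from the table above.

 #include <cstdio>
 #include <cuda_runtime.h>

 int main() {
     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, 0);  // properties of device 0

     std::printf("max threads per block       : %d\n",  prop.maxThreadsPerBlock);
     std::printf("max resident threads per SM : %d\n",  prop.maxThreadsPerMultiProcessor);
     std::printf("registers per block         : %d\n",  prop.regsPerBlock);
     std::printf("shared memory per block     : %zu\n", prop.sharedMemPerBlock);
     std::printf("multi-processors on device  : %d\n",  prop.multiProcessorCount);
 }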

Evaluating a Configuration

Once we have selected a design in terms of

  • the number of threads per block
  • the number of registers per thread
  • the amount of shared memory per block

we can determine the occupancy of each multi-processor for a specific compute capability. 

For example, consider a device of compute capability 2.0.  If our grid consists of blocks of 64 threads each, each multi-processor will partition its 1536 thread slots into 24 blocks.  Since each multi-processor has only 8 block slots, it will only process 8 blocks at a time, filling 8 * 64 = 512 of its 1536 thread slots for an occupancy of 33%.  For full occupancy, we could select a grid composed of blocks of 192 threads each. 
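A similar calculation applies to register use.  For example, ignoring the granularity of register allocation, consider a hypothetical kernel that uses 20 registers per thread with blocks of 512 threads on the same compute capability 2.0 device.  Each block requires 20 * 512 = 10,240 registers, so 3 blocks fit within the 32K-register file and fill all 1536 thread slots for full occupancy.  If the kernel required 24 registers per thread, each block would need 12,288 registers, only 2 blocks would fit, and occupancy would drop to 1024 / 1536, or roughly 67%.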

Once we have a trial grid design, we must still confirm that our kernel's memory requirements do not exceed any other resource limitations.

Occupancy Calculator

The CUDA Toolkit includes a spreadsheet that accepts as parameters the compute capability, the number of threads per block, the number of registers per thread and the shared memory per block.  This spreadsheet evaluates these parameters against the resource limitations of the specified compute capability.  This spreadsheet is named CUDA_Occupancy_Calculator.xls and stored under the ../tools/ sub-directory of the installed Toolkit. 

The following figure shows the results of an analysis of a grid configuration composed of blocks of 512 threads and a kernel that uses 16 registers and 512 bytes of shared memory per block for a device of compute capability 2.0.

CUDA Occupancy Calculator

We enter our parameter values under the calculator's first two sections (colored green and orange) and the spreadsheet generates a complete analysis of our design.

The graphs on the right-hand side show the variation in occupancy, measured in the number of warps, over the full range of each parameter:

  • the number of threads per block
  • the number of registers per thread
  • the shared memory per block

The highlighted triangle on each graph identifies the current parameter settings.
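As a programmatic alternative to the spreadsheet, CUDA toolkits from version 6.5 onward also expose an occupancy API in the runtime.  The following minimal sketch (using a trivial placeholder kernel rather than one of the kernels in this chapter) reports the number of resident blocks and the resulting occupancy for a chosen block size:

 #include <cstdio>
 #include <cuda_runtime.h>

 // placeholder kernel for illustration only
 __global__ void kernel(float* d_a) { d_a[threadIdx.x] *= 2.0f; }

 int main() {
     int blockSize = 512;  // threads per block under evaluation
     int numBlocks = 0;    // resident blocks per multi-processor

     // evaluate this kernel at this block size with 0 bytes of dynamic shared memory
     cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel, blockSize, 0);

     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, 0);
     std::printf("resident blocks per multi-processor : %d\n", numBlocks);
     std::printf("occupancy                           : %d%%\n",
                 100 * numBlocks * blockSize / prop.maxThreadsPerMultiProcessor);
 }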


Memory Accesses

In addition to resource constraints, a major determinant of device performance is access time to global memory.  A single access to global memory takes many instruction cycles.  The lag that a single access introduces can leave the streaming processors executing a warp's instruction idle for a significant period of time.  We refer to this lag as the latency of the communication.

The source of the high latency of global memory is its structure.  Global memory is composed of Dynamic Random Access Memory (DRAM) chips.  Each chip consists of a set of very weak capacitors.  A capacitor is a passive electrical component that stores electrical energy.  Each capacitor of a DRAM chip holds the data for a single bit.  An electrical charge on the capacitor identifies an on-bit.  Detecting the presence of a charge on a capacitor is a naturally slow process. 

To improve the rate of data access, DRAMs are designed to support parallel access.  Their sensors return the bit values for the set of capacitors that store the data for a set of contiguous memory locations, rather than the bit value for a single capacitor that stores the data for one memory location. 

Reducing Latency

The CGMA (compute to global memory access) ratio of a kernel is the ratio of the computation that the kernel performs to the number of global memory accesses that it makes.  The higher this ratio, the less the kernel's performance is limited by the latency of global memory, so increasing (improving) this ratio minimizes latency effects. 
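For a rough sense of scale, consider a matrix multiplication whose inner loop reads one coefficient of each input matrix from global memory for every multiply-add that it performs: two global accesses for two floating-point operations gives a CGMA ratio of 2 / 2 = 1.0 (a count that ignores index arithmetic).  The tiling technique described later in this chapter stages a TILE_SIDE by TILE_SIDE tile of each input matrix in shared memory, so that every coefficient loaded from global memory is reused TILE_SIDE times, raising the ratio towards TILE_SIDE.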

The first-in-line technique, described in the previous chapter, involves copying data from global memory to shared memory, performing most of the computations through accesses to shared memory and copying the result of the computations back to global memory upon completion. 

The three next-in-line techniques for reducing or hiding latency are:

  • warp scheduling
  • coalesced access
  • tiling for shared memory

Warp Scheduling

Warp scheduling is the technique of switching between warps on a multi-processor, replacing warps that are waiting for a memory access to complete with warps that are ready to compute. 

An active warp stalls when the global data that its current instruction requires has not yet arrived.  The warp schedulers issue instructions from other active warps that are ready to execute, allowing the stalled warps to wait for their accesses to complete without tying up the streaming processors. 

Streaming multi-processors with many active warps can continue computing and hide some of the latency associated with memory accesses.  With a sufficiently large number of warps, they completely hide the latency involved in accessing global memory. 

Context switching between active warps does not consume any instruction cycles.  These switches are free, unlike context switches between CPU threads, which can consume hundreds of instruction cycles.

Coalesced Access

Coalesced access refers to the threads of a warp accessing a contiguous set of global memory locations in the same instruction.  By coding our kernels to take full advantage of the parallel design of global memory, we ensure that the threads within each warp access data from a single set of contiguous memory locations.  Through coalesced access, we can approach the theoretical bandwidth of a DRAM chip.

A streaming multi-processor detects whether the threads in a warp are accessing contiguous memory locations.  If so, the multi-processor combines the individual accesses into a single coalesced access. 

The need to design a kernel for coalesced access arises with two- and three-dimensional grid configurations.  Since shared memory is on-chip memory that provides relatively high data rates, access to shared memory does not require coalescence. 

The figures below illustrate the layout of elements in a 5 by 5 C-style array.  The elements are stored in memory in row-major order.  The upper figure shows the row and column indices of the two-dimensional array.  The lower figure shows the layout of these elements in memory. 

two-dimensional memory

Un-Coalesced Example

Mapping the grid configuration directly onto the C language convention for storing two-dimensional arrays, with the .x member of the thread index tied to the row index, produces un-coalesced access.  Consider the kernel code listed below and the colour coding in the corresponding figure below.  This kernel performs a naive matrix multiplication on two 5 x 5 matrices. 

two-dimensional memory

 __global__ void matMul(const float* d_a, const float* d_b, float* d_c) {
     int i = threadIdx.x;
     int j = threadIdx.y;
     __shared__ float s_a[5][5];
     __shared__ float s_b[5][5];

     // copy to shared memory
     s_a[i][j] = d_a[5 * i + j];
     s_b[i][j] = d_b[5 * i + j];
     __syncthreads();

     // calculate dot product
     float sum = 0.0f;
     for (int k = 0; k < 5; k++)
         sum += s_a[i][k] * s_b[k][j];
     __syncthreads();

     // store result in global memory
     d_c[5 * i + j] = sum;
 }

Each thread, as it executes this kernel, calculates a single coefficient in the resultant matrix (d_c).  The complete calculation involves three distinct steps:

  1. copy from global memory to shared memory the coefficient at [row][column] for each input matrix (d_a and d_b)
  2. calculate the dot product for the coefficient of the resultant matrix at [row][column] from the shared memory values
  3. copy the result of the calculation for the coefficient of the resultant matrix at [row][column] to global memory

The kernel design ties the row and column indices of the 2-D array respectively to the .x and .y members of the thread index.  Since .x is the fastest changing index in the grid, the threads that are adjacent to one another in a warp access elements in global memory (d_a[], d_b[] and d_c[]) that are separated from one another by 5 locations.  The array elements that adjacent threads in the same warp process are of identical color in the upper figure.  The lower figure shows that these elements do not occupy adjacent locations in global memory.  Since adjacent threads in the same warp do not access adjacent locations in global memory, the accesses to global memory are not coalesced.

Coalesced Access

Coalesced access is achieved by reordering the relation between the element indices and the thread index members.  Note the difference between the values assigned to i and j in the kernel listed below and in the kernel listed above.

The row and column indices of the 2-D array are tied respectively to the .y and .x members of the thread index.  Since .x is the fastest changing index in the grid, adjacent threads in a warp access contiguous elements in d_a[], d_b[] and d_c[].  Elements of identical color are processed by threads in the same warp.  Since adjacent threads in the same warp access adjacent locations in global memory, accesses are coalesced. 

two-dimensional memory

The kernel code listed below demonstrates coalesced access.  The reordering appears in the first two lines of the kernel body:

 __global__ void matMul(const float* d_a, const float* d_b, float* d_c) {
     int j = threadIdx.x;
     int i = threadIdx.y;
     __shared__ float s_a[5][5];
     __shared__ float s_b[5][5];

     // copy to shared memory
     s_a[i][j] = d_a[5 * i + j];
     s_b[i][j] = d_b[5 * i + j];
     __syncthreads();

     // calculate dot product
     float sum = 0.0f;
     for (int k = 0; k < 5; k++)
         sum += s_a[i][k] * s_b[k][j];
     __syncthreads();

     // store result in global memory
     d_c[5 * i + j] = sum;
 }
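For completeness, a minimal host-side sketch for launching either version of this 5 x 5 kernel might look as follows (error checking omitted; the kernel is assumed to be defined in the same source file, and the initialization values are arbitrary):

 #include <cuda_runtime.h>

 int main() {
     const int N = 5;
     float h_a[N * N], h_b[N * N], h_c[N * N];
     for (int i = 0; i < N * N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

     // allocate device memory and copy the input matrices
     float *d_a, *d_b, *d_c;
     cudaMalloc(&d_a, N * N * sizeof(float));
     cudaMalloc(&d_b, N * N * sizeof(float));
     cudaMalloc(&d_c, N * N * sizeof(float));
     cudaMemcpy(d_a, h_a, N * N * sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(d_b, h_b, N * N * sizeof(float), cudaMemcpyHostToDevice);

     // launch a single 5 x 5 block of threads
     dim3 block(N, N);
     matMul<<<1, block>>>(d_a, d_b, d_c);

     // copy the product back to the host (implicitly synchronizes)
     cudaMemcpy(h_c, d_c, N * N * sizeof(float), cudaMemcpyDeviceToHost);

     cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
 }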

Tiling

Tiling is a technique for processing data larger than the available resources of a multi-processor. 

Shared memory supports much faster access times than global memory, but is a limited resource.  For problems requiring more than the available shared memory, we circumvent this constraint by partitioning the solution into small tiles and storing the data for a single tile at a time in shared memory. 

One-Dimensional Example

Consider a one-dimensional grid configuration for the dot product of two very large vectors of equal size, where the maximum number of blocks in the grid is insufficient to accommodate all of the vectors' elements.  In such cases, we divide the vectors into tiles of equal size that fit within the constraints of each multi-processor and iterate through the set of tiles. 

1d Tiling

Tile Product

 __global__ void reduce(const float* a, const float* b, float* c) {
     // index in a[] and b[]
     int i;
     // offset from the start of each tile
     int o = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     __shared__ float s[ntpb];
     // running sum of this block's per-tile partial results
     float blockSum = 0.0f;

     // accumulate for one tile at a time
     for (int iTile = 0; iTile < nTiles; iTile++) {
         i = gridDim.x * blockDim.x * iTile + o;
         // copy from global memory to shared memory
         s[t] = a[i] * b[i];
         __syncthreads();

         // perform reduction on shared memory values
         for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
             if (t < stride)
                 s[t] += s[t + stride];
             __syncthreads();
         }

         // save this tile's partial sum before the next tile overwrites s[]
         if (t == 0)
             blockSum += s[0];
         __syncthreads();
     }

     // copy reduction result to global memory
     if (t == 0)
         c[blockIdx.x] = blockSum;
 }

gridDim.x refers to the number of blocks in a tile; gridDim.x * blockDim.x refers to the number of threads in a tile, which is also the number of elements each tile covers.  ntpb (the number of threads per block) and nTiles (the number of tiles) are constants defined elsewhere in the program.  Since each tile overwrites the shared array s[], the kernel accumulates each tile's partial result in blockSum and writes the total for the block to c[blockIdx.x] on completion. 
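A hedged sketch of the host-side usage follows.  It assumes illustrative values for ntpb, nTiles and the number of blocks (nblk), and it completes the reduction on the host by adding the per-block partial sums that the kernel leaves in c[]:

 #include <cstdio>
 #include <cuda_runtime.h>

 const int ntpb   = 256;                   // threads per block (used by the kernel above)
 const int nTiles = 16;                    // tiles processed by each block
 const int nblk   = 64;                    // blocks in the grid
 const int n      = ntpb * nblk * nTiles;  // total number of elements in each vector

 // ... kernel reduce() as listed above ...

 int main() {
     float *d_a, *d_b, *d_c;
     cudaMalloc(&d_a, n * sizeof(float));
     cudaMalloc(&d_b, n * sizeof(float));
     cudaMalloc(&d_c, nblk * sizeof(float));
     // ... copy the two input vectors into d_a and d_b ...

     reduce<<<nblk, ntpb>>>(d_a, d_b, d_c);

     // finish the reduction on the host: one partial sum per block
     float h_c[nblk];
     cudaMemcpy(h_c, d_c, nblk * sizeof(float), cudaMemcpyDeviceToHost);
     float dot = 0.0f;
     for (int i = 0; i < nblk; i++)
         dot += h_c[i];
     std::printf("dot product = %f\n", dot);

     cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
 }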

Two-Dimensional Examples

Consider a two-dimensional grid configuration for a matrix multiplication composed of multiple blocks.  Let us partition each matrix into equal-sized square tiles (of width and height TILE_SIDE) such that the coefficients for each tile fit into shared memory.  Each tile holds the coefficients from a subset of the rows and a subset of the columns of its parent matrix, as shown in the figure below. 

Tiling

A single tile in the product matrix (C) holds the result of multiplications of the coefficients in the tiles associated with its rows and columns as shown in the figure below.  That is, the results in the product tile only need to access the coefficients in the corresponding rows in the first matrix (A) and the coefficients in the corresponding columns of the second matrix (B).  We obtain the results for the product tile by adding the results of the multiplications for all of the related tiles in matrices A and B as shown on the right-hand side of the figure. 

Tile Product

The kernel listed below implements this tiling technique without coalescing the memory accesses:

 // n is an integral multiple of TILE_SIDE
 //
 __global__ void matMul(const float* d_a, const float* d_b, float* d_c, int n) {
     int l_row = threadIdx.x;
     int l_col = threadIdx.y;
     int g_row = blockIdx.x * blockDim.x + threadIdx.x;
     int g_col = blockIdx.y * blockDim.y + threadIdx.y;
     __shared__ float s_a[TILE_SIDE][TILE_SIDE];
     __shared__ float s_b[TILE_SIDE][TILE_SIDE];
     float sum = 0.0f;
     for (int tile = 0; tile < n / TILE_SIDE; ++tile) {
         // copy from global to shared memory
         int g_r = g_row * n + tile * TILE_SIDE + l_col;
         int g_c = (tile * TILE_SIDE + l_row) * n + g_col;
         s_a[l_row][l_col] = d_a[g_r];
         s_b[l_row][l_col] = d_b[g_c];
         __syncthreads();

         // accumulate the dot product contributions for [l_row][l_col] at tile
         for (int k = 0; k < TILE_SIDE; k++)
             sum += s_a[l_row][k] * s_b[k][l_col];
         __syncthreads();
     }

     // copy to global memory
     d_c[g_row * n + g_col] = sum;
 }

n refers to the width and height of each square input matrix and is an integral multiple of TILE_SIDE.  g_row refers to the absolute row index and g_col refers to the absolute column index.  l_row and l_col refer to the row and column indices within the tile itself. 
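For reference, a minimal host-side launch sketch for this kernel follows.  It is a fragment: it assumes that TILE_SIDE is the same file-scope constant the kernel uses and that d_a, d_b and d_c are device arrays of n * n floats allocated and initialized elsewhere:

 // hypothetical launch fragment (allocation, transfers and error checking omitted)
 int n = 1024;                             // matrix width and height, a multiple of TILE_SIDE
 dim3 block(TILE_SIDE, TILE_SIDE);         // one TILE_SIDE x TILE_SIDE tile of threads per block
 dim3 grid(n / TILE_SIDE, n / TILE_SIDE);  // one block per tile of the product matrix
 matMul<<<grid, block>>>(d_a, d_b, d_c, n);

Because the grid and blocks are square, the same launch configuration also serves the coalesced version listed below.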

The kernel listed below implements coalesced access.  The changes are in the first four lines of the kernel body:

 // n is an integral multiple of TILE_SIDE
 //
 __global__ void matMul(const float* d_a, const float* d_b, float* d_c, int n) {
     int l_row = threadIdx.y;
     int l_col = threadIdx.x;
     int g_row = blockIdx.y * blockDim.y + threadIdx.y;
     int g_col = blockIdx.x * blockDim.x + threadIdx.x;
     __shared__ float s_a[TILE_SIDE][TILE_SIDE];
     __shared__ float s_b[TILE_SIDE][TILE_SIDE];
     float sum = 0.0f;
     for (int tile = 0; tile < n / TILE_SIDE; tile++) {
         // copy from global to shared memory
         int g_r = g_row * n + tile * TILE_SIDE + l_col;
         int g_c = (tile * TILE_SIDE + l_row) * n + g_col;
         s_a[l_row][l_col] = d_a[g_r];
         s_b[l_row][l_col] = d_b[g_c];
         __syncthreads();

         // accumulate the dot product contributions for [l_row][l_col] at tile
         for (int k = 0; k < TILE_SIDE; k++)
             sum += s_a[l_row][k] * s_b[k][l_col];
         __syncthreads();
     }

     // copy to global memory
     d_c[g_row * n + g_col] = sum;
 }

With x and y grid members switched, adjacent threads within a warp access elements in d_a[] and d_b[] that are adjacent to one another. 


Fine Tuning

In addition to utilizing shared memory and optimizing global memory accesses, we can reduce latency and instruction bandwidth through pre-fetching and instruction mixing. 

Pre-Fetching

Pre-fetching refers to accessing data before it is required by an instruction.  Pre-fetching hides the latency associated with global memory access while other instructions operate on previously pre-fetched data. 

Example

The following code pre-fetches data and stores it in register memory in anticipation of copying that data into shared memory:

 // n is an integral multiple of TILE_SIDE
 //
 __global__ void matMul(const float* d_a, const float* d_b, float* d_c, int n) {
     int l_row = threadIdx.y;
     int l_col = threadIdx.x;
     int g_row = blockIdx.y * blockDim.y + threadIdx.y;
     int g_col = blockIdx.x * blockDim.x + threadIdx.x;
     __shared__ float s_a[TILE_SIDE][TILE_SIDE];
     __shared__ float s_b[TILE_SIDE][TILE_SIDE];
     // fetch for the first tile
     float r_a = d_a[g_row * n + l_col]; // pre-fetch
     float r_b = d_b[l_row * n + g_col]; // pre-fetch
     float sum = 0.0f;
     // iterate on n / TILE_SIDE - 1 tiles (exclude last tile)
     for (int tile = 0; tile < n / TILE_SIDE - 1; ++tile) {
         // copy from register to shared memory
         s_a[l_row][l_col] = r_a; // register to shared
         s_b[l_row][l_col] = r_b; // register to shared
         __syncthreads();

         // pre-fetch for the next tile pair
         r_a = d_a[g_row * n + (tile + 1) * TILE_SIDE + l_col]; // pre-fetch
         r_b = d_b[((tile + 1) * TILE_SIDE + l_row) * n + g_col]; // pre-fetch

         // accumulate the dot product contributions for [l_row][l_col] at tile
         for (int k = 0; k < TILE_SIDE; k++)
             sum += s_a[l_row][k] * s_b[k][l_col];
         __syncthreads();
     }
     // process the last tile pair
     s_a[l_row][l_col] = r_a; // register to shared
     s_b[l_row][l_col] = r_b; // register to shared
     __syncthreads();
     for (int k = 0; k < TILE_SIDE; k++)
         sum += s_a[l_row][k] * s_b[k][l_col];
     __syncthreads();

     // copy from register to global memory
     d_c[g_row * n + g_col] = sum;
 }

Instruction Mix

The innermost iteration in the kernel listed above has a constant range (TILE_SIDE).  Constructs that execute a constant number of iterations are good candidates for unrolling, particularly when they are the innermost iterations. 

Consider this iteration as a candidate for optimization.  Each iteration executes roughly six instructions: a loop-counter update, a branch, two address calculations and two floating-point instructions (a multiply and an add).  Since the number of iterations (TILE_SIDE) is a compile-time constant, we can unroll the construct, removing the counter and branch, and let the compiler fold the now-constant indices into the remaining instructions, eliminating the address arithmetic.  Such unrolling can reduce the instruction count to roughly one third of the original.

Example

The following example unrolls the inner iteration for the compile-time constant TILE_SIDE = 16:

 // n is an integral multiple of TILE_SIDE
 //
 __global__ void matMul(const float* d_a, const float* d_b, float* d_c, int n) {
     int l_row = threadIdx.y;
     int l_col = threadIdx.x;
     int g_row = blockIdx.y * blockDim.y + threadIdx.y;
     int g_col = blockIdx.x * blockDim.x + threadIdx.x;
     __shared__ float s_a[TILE_SIDE][TILE_SIDE];
     __shared__ float s_b[TILE_SIDE][TILE_SIDE];
     // fetch for the first tile
     float r_a = d_a[g_row * n + l_col]; // pre-fetch
     float r_b = d_b[l_row * n + g_col]; // pre-fetch
     float sum = 0.0f;
     for (int tile = 0; tile < n / TILE_SIDE - 1; ++tile) {
         // copy from register to shared memory
         s_a[l_row][l_col] = r_a; // register to shared
         s_b[l_row][l_col] = r_b; // register to shared
         __syncthreads();

         // pre-fetch for the next tile pair
         r_a = d_a[g_row * n + (tile + 1) * TILE_SIDE + l_col]; // pre-fetch
         r_b = d_b[((tile + 1) * TILE_SIDE + l_row) * n + g_col]; // pre-fetch

         // accumulate the dot product contributions for [l_row][l_col] at tile
         sum += s_a[l_row][0] * s_b[0][l_col] +
             s_a[l_row][1] * s_b[1][l_col] +
             s_a[l_row][2] * s_b[2][l_col] +
             s_a[l_row][3] * s_b[3][l_col] +
             s_a[l_row][4] * s_b[4][l_col] +
             s_a[l_row][5] * s_b[5][l_col] +
             s_a[l_row][6] * s_b[6][l_col] +
             s_a[l_row][7] * s_b[7][l_col] +
             s_a[l_row][8] * s_b[8][l_col] +
             s_a[l_row][9] * s_b[9][l_col] +
             s_a[l_row][10] * s_b[10][l_col] +
             s_a[l_row][11] * s_b[11][l_col] +
             s_a[l_row][12] * s_b[12][l_col] +
             s_a[l_row][13] * s_b[13][l_col] +
             s_a[l_row][14] * s_b[14][l_col] +
             s_a[l_row][15] * s_b[15][l_col];
         __syncthreads();
     }
     // process the last tile pair
     s_a[l_row][l_col] = r_a; // register to shared
     s_b[l_row][l_col] = r_b; // register to shared
     __syncthreads();
     sum += s_a[l_row][0] * s_b[0][l_col] +
         s_a[l_row][1] * s_b[1][l_col] +
         s_a[l_row][2] * s_b[2][l_col] +
         s_a[l_row][3] * s_b[3][l_col] +
         s_a[l_row][4] * s_b[4][l_col] +
         s_a[l_row][5] * s_b[5][l_col] +
         s_a[l_row][6] * s_b[6][l_col] +
         s_a[l_row][7] * s_b[7][l_col] +
         s_a[l_row][8] * s_b[8][l_col] +
         s_a[l_row][9] * s_b[9][l_col] +
         s_a[l_row][10] * s_b[10][l_col] +
         s_a[l_row][11] * s_b[11][l_col] +
         s_a[l_row][12] * s_b[12][l_col] +
         s_a[l_row][13] * s_b[13][l_col] +
         s_a[l_row][14] * s_b[14][l_col] +
         s_a[l_row][15] * s_b[15][l_col];
     __syncthreads();

     // copy from register to global memory
     d_c[g_row * n + g_col] = sum;
 }
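Rather than unrolling by hand, we can ask the compiler to do the unrolling.  nvcc honours the #pragma unroll directive, which fully unrolls a loop whose trip count is known at compile time.  The fragment below shows the inner iteration of the tiled kernel annotated this way, with the rest of the kernel unchanged:

 // compiler-driven alternative to manual unrolling
 #pragma unroll
 for (int k = 0; k < TILE_SIDE; k++)
     sum += s_a[l_row][k] * s_b[k][l_col];

This keeps the source code independent of the value of TILE_SIDE.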

Exercises



