Part B - CUDA Programming Model
Warp Partitioning
Describe the structure of CUDA-enabled devices
Describe the scheduling of threads
Identify and resolve thread divergence
Device Resources | Thread Scheduling | Exercises
The CUDA runtime launches kernel execution by assigning blocks of threads to
the streaming multi-processors (SMs) on the device. The number of
multi-processors on the device depends on its model number. Each
multi-processor executes the kernel on the blocks assigned to it independently
of the other multi-processors.
Each SM consists of a set of streaming processors (SPs). The number of SPs in
an SM depends on the architecture of the device but is the same for each SM in the
device. Each SP is called a CUDA core. Each CUDA core processes a single
thread. Each SM partitions the threads in each block assigned to it by the CUDA
runtime and schedules each partition to execute on its SPs. The total time that
the device needs to execute a kernel on a grid of threads depends on how we design
that grid and on the compatibility of our kernel code with the partitioning scheme
implemented by the SM.
This chapter describes the resource limitations of an SM, outlines the partitioning
scheme that each SM uses to schedule threads within each block and shows how
to write kernel code that is compatible with that mechanism.
Device Resources
Designing an optimal execution configuration requires some familiarity with the physical
resources provided by the device. A table listing all of the resources available
on devices of different compute capabilities may be found in the Technical
Specifications section of Appendix G of the CUDA Toolkit Documentation.
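These limits can also be queried at run time. The following is a minimal host-side sketch (not part of this chapter's programs) that uses cudaGetDeviceProperties to report the compute capability and the number of SMs on device 0:
// query_device.cu - sketch: report the compute capability and SM count of device 0
#include <iostream>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::cerr << "no CUDA-enabled device found\n";
        return 1;
    }
    std::cout << "device name              : " << prop.name << '\n'
              << "compute capability       : " << prop.major << '.' << prop.minor << '\n'
              << "streaming multiprocessors: " << prop.multiProcessorCount << '\n';
}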
Streaming Multi-Processors
A CUDA-enabled device consists of a set of streaming multi-processors (SMs). The
SMs share all of the global memory on the device.
The resources that each SM provides to the device vary with its compute capability.
These resources represent limits on the processing of blocks and the scheduling of
threads. Resources are the same for each SM within a particular device.
Three resources that are key to optimal design are listed in the table below:
| Compute Capability          | 1.0, 1.1 | 1.2, 1.3 | 2.x  | 3.0  | 3.5  | 3.7  | 5.0  | 5.2  | 5.3  |
|-----------------------------|----------|----------|------|------|------|------|------|------|------|
| max no of threads per block | 512      | 512      | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| max no of resident blocks   | 8        | 8        | 8    | 16   | 16   | 16   | 32   | 32   | 32   |
| max no of resident threads  | 768      | 1024     | 1536 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 |
The number of resident blocks is the number of blocks that an SM can process
simultaneously. The number of resident threads is the number of threads
that an SM can track and schedule simultaneously.
Streaming Processors
Each SP executes a kernel instruction on one thread. The SP shares the
control logic and instruction cache with the other SPs on the SM.
Each SP has its own multiply-add unit (MAD) and its own multiply unit (MUL).
Designing within Physical Constraints
We maximize the use of each SM within the constraints imposed by the device's
compute capability by selecting optimal block sizes for our threads. For
example, an SM of compute capability 1.2 can execute up to 1024 threads simultaneously.
We can organize the threads that run on that SM into 2 blocks of 512, 4 blocks
of 256, or 8 blocks of 128 threads each. No SM on the device can process
1 block of 1024 threads because 1024 exceeds the maximum number of threads per block.
No SM can process 16 blocks of 64 threads each since 16 exceeds the maximum number
of resident blocks.
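The same limits are exposed through cudaDeviceProp, so a host program can check a candidate block size before launching a kernel. A minimal sketch, assuming a hypothetical candidate of 256 threads per block (on older toolkits the resident-block limit itself must still be read from the table above):
// check_config.cu - sketch: check a candidate block size against the SM limits
#include <iostream>
#include <cuda_runtime.h>

int main() {
    const int candidate = 256;              // hypothetical threads per block
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // blocks that one SM could hold if only the resident-thread limit applied
    int blocksByThreads = prop.maxThreadsPerMultiProcessor / candidate;
    std::cout << "max threads per block      : " << prop.maxThreadsPerBlock << '\n'
              << "max resident threads per SM: " << prop.maxThreadsPerMultiProcessor << '\n'
              << "candidate fits in a block  : "
              << (candidate <= prop.maxThreadsPerBlock ? "yes" : "no") << '\n'
              << "blocks per SM (thread limit only): " << blocksByThreads << '\n';
}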
Grid Design
The CUDA runtime allocates threads to SMs block by block. The resources
available on each SM determine the number of blocks that each SM can run.
If our grid design exceeds any resource limit, the CUDA runtime reduces the
number of blocks that it assigns to each SM incrementally until the number
of blocks assigned fits within all resource constraints.
Note that runtime reduction to fit within resource limits occurs at
the block level, not at the thread level.
Example
Consider a grid of 2048 threads on a device of compute capability 1.2.
If we configure the grid into 8 blocks of 16 x 16 threads, each SM will
process 4 blocks at a time (16 * 16 = 256 < 512 threads per block; 4 * 16 *
16 = 1024 <= 1024 resident threads). This configuration requires
2 SMs on the device for simultaneous processing of all of the threads in
the grid.
Now consider the same grid of 2048 threads on the same device, but configured into
two-dimensional blocks of 8 x 8 threads. This design requires 32 blocks
for all 2048 threads. Each SM can only process 8 blocks, that is, 8 * 8 * 8 = 512
threads, at a time (<= 8 resident blocks). Hence, this configuration requires
4 SMs (2048 / (8 * 8 * 8) = 4) for simultaneous processing of all of the threads
in the grid.
The former design makes significantly better use of a 1.2 device's resources.
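The arithmetic behind both configurations can be captured in a small hypothetical helper. The sketch below assumes the compute capability 1.2 limits from the table above (512 threads per block, 8 resident blocks, 1024 resident threads) and reproduces the 2-SM and 4-SM results:
// occupancy_sketch.cpp - sketch: reproduce the example's arithmetic (assumed 1.2 limits)
#include <iostream>
#include <algorithm>

int main() {
    const int maxThreadsPerBlock = 512;    // compute capability 1.2 limits
    const int maxResidentBlocks  = 8;
    const int maxResidentThreads = 1024;
    const int gridThreads        = 2048;   // total threads in the grid

    for (int threadsPerBlock : { 16 * 16, 8 * 8 }) {
        if (threadsPerBlock > maxThreadsPerBlock) continue;  // would be rejected at launch
        int blocksInGrid = gridThreads / threadsPerBlock;
        // blocks resident on one SM: limited by both resident blocks and resident threads
        int blocksPerSM  = std::min(maxResidentBlocks,
                                    maxResidentThreads / threadsPerBlock);
        int smsRequired  = (blocksInGrid + blocksPerSM - 1) / blocksPerSM;
        std::cout << threadsPerBlock << " threads per block: "
                  << blocksPerSM << " blocks per SM, "
                  << smsRequired << " SMs for the whole grid\n";
    }
}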
Thread Scheduling
Each SM schedules the threads within a block for execution in partitions.
CUDA calls each partition within a resident block a warp. Each
warp consists of a contiguous subset of threads within the resident block.
The number of threads within a warp on devices of compute capability 1.x, 2.x,
3.x and 5.x is 32. (This number is implementation dependent and not part
of the CUDA specification proper.)
The first warp (0) within a resident block consists of threads 0,...,31; the second
warp (1) consists of threads 32,...,63; the third warp (2) consists of threads
64,...,95; and so forth. In general, for a one-dimensional block, warp i consists of
threads 32*i,...,32*(i+1)-1.
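Inside a kernel, a thread can recover its warp index and its lane (its position within the warp) from the built-in variables. A minimal sketch, assuming a one-dimensional block (warpInfo is a hypothetical kernel name):
// warp_id.cu - sketch: warp index and lane index for a one-dimensional block
__global__ void warpInfo(int* warp, int* lane) {
    int t = threadIdx.x;
    warp[t] = t / warpSize;    // warp 0 holds threads 0..31, warp 1 holds 32..63, ...
    lane[t] = t % warpSize;    // position of the thread within its warp
}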
Each SM loads a warp of threads into one of its schedulers and executes a single
kernel instruction on all of the threads within that warp. The SM does not
execute the next kernel instruction until it has completed executing the currently
resident instruction. This Single Instruction Multiple Thread (SIMT) style
amortizes the cost of fetching and processing an instruction over all of the threads
in the warp.
Thread Divergence
SIMT execution works efficiently provided that all threads within a warp
follow the same control path. If different threads in the warp follow
different control paths, the SM requires multiple passes to complete execution
of the instruction across the entire warp: one pass for each distinct control
path. We say that the threads in such a kernel diverge in their
execution. Thread divergence typically occurs when the kernel includes a
selection or iteration construct whose condition depends on the thread index.
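As a minimal illustration (not taken from the program below), the following kernel branches on the parity of the thread index; every warp contains both even- and odd-indexed threads, so the SM must execute the two control paths in separate passes for each warp:
// divergent_branch.cu - sketch: a branch on the thread index splits every warp
__global__ void branchOnParity(float* a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)      // even lanes take this path ...
        a[i] *= 2.0f;
    else                           // ... odd lanes take this one: two passes per warp
        a[i] += 1.0f;
}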
Example
Consider the for loop in the kernel of the following program. The program uses a
reduction algorithm with shared memory to calculate the sum of the elements of
a vector of n elements, which is stored in global memory as a device vector of
nblks * ntpb elements.
// Thread Divergence
// divergence.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

const int ntpb = 512;

__global__ void reduce(float* a, float* b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float s[ntpb];
    s[t] = a[i];
    __syncthreads();
    // divergent reduction: the stride grows from 1 to blockDim.x / 2
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (t % (2 * stride) == 0)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        b[blockIdx.x] = s[0];
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0]
                  << ": invalid no of arguments\n"
                  << "Usage: " << argv[0]
                  << " no_of_elements\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    // determine the number of blocks required
    int nblks = (n + ntpb - 1) / ntpb;
    // allocate host memory
    float* h_a = new float[nblks * ntpb];
    float* h_b = new float[nblks];
    // initialize host memory
    for (int i = 0; i < n; i++)
        h_a[i] = float(std::rand()) / RAND_MAX;
    for (int i = n; i < nblks * ntpb; i++)
        h_a[i] = 0.0f;
    float h_sum = 0.0f;
    for (int i = 0; i < n; i++)
        h_sum += h_a[i];
    // allocate device memory
    float* d_a; // full device vector
    float* d_b; // device sum per block
    cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
    cudaMalloc((void**)&d_b, nblks * sizeof(float));
    cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice);
    // reduce to partial sums
    reduce<<<nblks, ntpb>>>(d_a, d_b);
    // copy from device to host memory
    cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
    float d_sum = 0.0f;
    for (int i = 0; i < nblks; i++)
        d_sum += h_b[i];
    // report sums
    std::cout << std::fixed << std::setprecision(1);
    std::cout << "Host sum = " << h_sum << std::endl;
    std::cout << "Device sum = " << d_sum << std::endl;
    // deallocate memory
    delete[] h_a;
    delete[] h_b;
    cudaFree(d_a);
    cudaFree(d_b);
}
The for loop in the kernel sums the values in shared memory using the
thread index and the summation stride. In the first pass (stride = 1), the
control path differs between even and odd thread indices. In general, only
threads whose index is divisible by twice the stride accumulate values; all
other threads follow the control path that skips the accumulation statement.

Thread divergence dominates this reduction algorithm.
To minimize thread divergence, we can rewrite the iteration to
move from the largest stride (half the number of threads in the block)
down to the smallest (a stride of one) rather than from the smallest stride
to the largest. Starting from the largest stride, we sum over all
of the threads in the first half of the block, then all of the
threads in the first quarter of the block, and so forth, as illustrated
in the Figure below.

Thread divergence only dominates this algorithm once the stride drops
below the number of threads in a warp. The optimized loop is shown below:
// Less Thread Divergence
// less_divergence.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

const int ntpb = 512;

__global__ void reduce(float* a, float* b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float s[ntpb];
    s[t] = a[i];
    __syncthreads();
    // less divergent reduction: the stride shrinks from blockDim.x / 2 to 1
    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        b[blockIdx.x] = s[0];
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0]
                  << ": invalid no of arguments\n"
                  << "Usage: " << argv[0]
                  << " no_of_elements\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    // determine the number of blocks required
    int nblks = (n + ntpb - 1) / ntpb;
    // allocate host memory
    float* h_a = new float[nblks * ntpb];
    float* h_b = new float[nblks];
    // initialize host memory
    for (int i = 0; i < n; i++)
        h_a[i] = float(std::rand()) / RAND_MAX;
    for (int i = n; i < nblks * ntpb; i++)
        h_a[i] = 0.0f;
    float h_sum = 0.0f;
    for (int i = 0; i < n; i++)
        h_sum += h_a[i];
    // allocate device memory
    float* d_a; // full device vector
    float* d_b; // device sum per block
    cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
    cudaMalloc((void**)&d_b, nblks * sizeof(float));
    cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice);
    // reduce to partial sums
    reduce<<<nblks, ntpb>>>(d_a, d_b);
    // copy from device to host memory
    cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
    float d_sum = 0.0f;
    for (int i = 0; i < nblks; i++)
        d_sum += h_b[i];
    // report sums
    std::cout << std::fixed << std::setprecision(1);
    std::cout << "Host sum = " << h_sum << std::endl;
    std::cout << "Device sum = " << d_sum << std::endl;
    // deallocate memory
    delete[] h_a;
    delete[] h_b;
    cudaFree(d_a);
    cudaFree(d_b);
}
Resident Warps
The physical limitations on each SM in the device include a maximum number of resident
warps. This number equals the maximum number of resident threads divided by the number
of threads in a warp. The maximum number of resident warps for different compute
capabilities is listed in the table below:
| Compute Capability          | 1.0, 1.1 | 1.2, 1.3 | 2.0  | 2.1  | 3.0  | 3.5  | 3.7  | 5.0  | 5.2  | 5.3  |
|-----------------------------|----------|----------|------|------|------|------|------|------|------|------|
| max no of resident warps    | 24       | 32       | 48   | 48   | 64   | 64   | 64   | 64   | 64   | 64   |
| max no of resident threads  | 768      | 1024     | 1536 | 1536 | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 |
| no of streaming processors  | 8        | 8        | 32   | 48   | 192  | 192  | 192  | 128  | 128  | 128  |
The maximum number of resident warps in an SM exceeds the number of SPs in the
SM for compute capabilities 1.x and 2.0. This over-subscription allows the SM to
execute long-latency operations efficiently: while one warp waits on a long-latency
instruction, the scheduler can issue instructions from another resident warp.
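The resident-warp limit for the installed device can also be derived at run time from its properties; a minimal host-side sketch:
// resident_warps.cu - sketch: derive the resident-warp limit from device properties
#include <iostream>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int residentWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    std::cout << "max resident threads per SM: " << prop.maxThreadsPerMultiProcessor << '\n'
              << "warp size                  : " << prop.warpSize << '\n'
              << "max resident warps per SM  : " << residentWarps << '\n';
}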
Multi-Dimensional Blocks
The partitioning of threads within a multi-dimensional block follows the linear
ordering of the thread indices. The .x dimension is the fastest increasing one,
the .y dimension is the next fastest increasing one, and the .z dimension is the
slowest increasing one. Accordingly, each SM assigns the threads within a block
to warps by .x index first, then by .y index, and finally by .z index.
For example, in a 2 x 2 x 2 block, the thread to warp configuration is [.x .y .z]
= {[0 0 0], [1 0 0], [0 1 0], [1 1 0], [0 0 1], [1 0 1], [0 1 1], [1 1 1]}.
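Equivalently, the warp index of a thread within a multi-dimensional block can be computed from its linearized thread index. A minimal sketch (warpOfThread is a hypothetical device function), assuming the ordering described above:
// warp_of_thread.cu - sketch: warp index from a multi-dimensional thread index
__device__ int warpOfThread() {
    // linearize the thread index: .x fastest, then .y, then .z
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    return linear / warpSize;      // consecutive groups of warpSize threads form a warp
}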
Exercises