Part B - CUDA Programming Model
Kernels and Grids
- Describe how to write code that executes on the device
- Describe the organization and launching of threads for execution on the device
- Show how to identify a thread during its execution on the device
Execution on the Device | Grid Design | Thread Identification | Exercises
The CUDA programming model provides facilities to incorporate custom source code
that executes on the GPU. The CUDA model is an SPMD model in which threads
execute the same program on multiple data streams concurrently. The CUDA
runtime maps the threads defined in the source code to the CUDA cores of the
device.
This chapter describes how to code instructions that execute on the GPU and the
structure for organizing the threads that will execute on the device. It
shows how to configure the thread design for execution on the multiprocessors
and how to launch the configuration.
Execution on the Device
We store the program that executes on the multiprocessors' threads in the form of
a kernel. We organize the software threads that execute this program
in the form of a grid. The grid's dimensions depend on the nature of
our algorithm. We define the kernel and the grid in the source code, which
we store in a file with the extension .cu.
The nvcc compiler driver distinguishes the code destined
for the device from the rest of the source code, which executes on the host.
The driver sends the host code to the default C/C++ compiler and compiles the
device code into PTX (Parallel Thread Execution) instructions, the instruction
set of a virtual machine, for subsequent translation and execution on the GPU.
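To inspect the device code that the driver produces, we can compile with the -ptx option, which stops after generating the PTX instructions and writes them to a text file. For example, for a source file such as the initialize.cu program later in this chapter:
>nvcc -ptx initialize.cu
This writes the intermediate device code to initialize.ptx.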
Kernel Definition
Kernel code is similar to C/C++ function code. CUDA kernels do not return
values: their return type is void.
Their parameter list follows their identifier as shown below. Typically,
this list includes one or more pointers, which receive addresses within device
memory that we allocated outside the kernel definition. The
__global__ keyword identifies a kernel definition
as code that executes on the device as a direct result of launching a grid of
threads:
__global__ void kernel_identifier(float* a, ...) {
    // add device logic here
}
Each thread in the grid executes the kernel's instructions on a CUDA core
independently of every other thread.
Thread Identifier
The built-in variable threadIdx is a struct
consisting of three members x,
y and z, which taken
together uniquely identify a software thread within its block.
threadIdx.x holds the x-index of the thread
that is executing the kernel's instructions.
The following kernel code initializes the element of a
associated with thread threadIdx.x to a received
value (v):
__global__ void initialize(float* a, float v) {
    a[threadIdx.x] = v;
}
Launching a Grid of Threads
We launch a grid of threads from the host code. A launch is
similar to a function call in that it transfers control to the kernel identified
in the call. A launch differs from a function call in that it initiates
the concurrent execution of all of the threads in the specified grid rather than
initiating a single flow of control.
Launch Statement
The launch statement consists of the kernel's identifier, the execution configuration
and an argument list. The identifier is the name of the kernel code.
The execution configuration specifies the grid design, and the argument list
defines the values passed to the kernel's parameters.
The execution configuration consists of several arguments enclosed within a
template-like triple chevron pair:
kernel_identifier<<<dGrid,dBlock>>>(a, n);
The arguments to the execution configuration - dGrid and dBlock in this case - specify the layout of blocks of threads within the
grid and threads within a block respectively.
Example
The following program launches a grid of n threads, where n does not
exceed 1024.
Each thread initializes one element in the device array a
to the value received in the second kernel parameter:
// Initialize Memory using a Kernel
// initialize.cu
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

const int MAX_THREADS = 1024;

__global__ void initialize(float* a, float v) {
    a[threadIdx.x] = v;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << argv[0]
            << ": invalid no of arguments\n"
            << "Usage: " << argv[0]
            << " no_of_elements initial_value\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    if (n > MAX_THREADS) n = MAX_THREADS;
    float v = std::atof(argv[2]);

    // allocate host memory
    float* h_a = new float[n];
    // allocate device memory
    float* d_a;
    cudaMalloc((void**)&d_a, n * sizeof(float));

    // launch a grid of 1 block of n threads
    initialize<<<1, n>>>(d_a, v);

    // copy from device memory to host memory
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    // display contents of host memory
    for (int i = 0; i < n; i++)
        std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
    std::cout << std::endl;

    // deallocate device memory
    cudaFree(d_a);
    // deallocate host memory
    delete [] h_a;
}
The result of compiling this program using nvcc
at the command line should look something like:
>nvcc initialize.cu
initialize.cu
   Creating library a.lib and object a.exp
Executing the compiled code should produce results
like:
>a 50 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
2.5 2.5 2.5 2.5 2.5
Coordination
A launch statement differs from a function call in that it returns control to
the caller before completing kernel execution. That is, after a kernel
launch, the host code can continue executing concurrently with the device code.
Concurrent execution can lead to the host code accessing certain data before it
has been updated by the device. This is important for copy operations.
Before copying from the device to the host, we need to ensure that the data on the
device has been updated and is ready to be copied. We do so by coordinating
the host operations with the device operations.
The CUDA runtime synchronizes some operations automatically and treats others as
asynchronous. Copy operations between host and device memories synchronize
automatically and don't require intervention. For instance, a call to
cudaMemcpy() does not return until all copying is complete. Similarly, a launch
starts executing kernel code only after the device has completed all previous
device calls.
The need for intervention may arise after a kernel launch. The launch
returns control to the host immediately and does not block continued execution of
subsequent host code while the device executes. To block execution on the host
until all device calls have completed, we call the CUDA API function
cudaDeviceSynchronize().
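For instance, a minimal sketch of the pattern, using the kernel defined above:
initialize<<<1, n>>>(d_a, v);   // asynchronous: control returns immediately
cudaDeviceSynchronize();        // blocks until the device completes all preceding work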
Error Handling
Identifying errors during concurrent execution may require coordination.
Initialization
To determine if an error has occurred during the execution of a kernel,
we initialize the error state just before the launch by calling
cudaGetLastError(). This call resets the
error code to cudaSuccess. If we omit this
call, a call to cudaGetLastError() after the launch
might return an error code that was generated before the launch.
Synchronization
Because a launch is asynchronous, we need to coordinate execution on the host
and device before checking for kernel errors. We do so by synchronizing
the device with the host immediately following the launch.
Example
// Synchronize the Host and Device for Error Handling
// sync.cu
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

const int MAX_THREADS = 1024;

__global__ void initialize(float* a, float v) {
    a[threadIdx.x] = v;
}

// report any runtime error and return true if one occurred
bool check(cudaError_t error) {
    bool rc;
    if ((rc = (error != cudaSuccess))) {
        std::cout << cudaGetErrorString(error) << std::endl;
    }
    return rc;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << argv[0]
            << ": invalid no of arguments\n"
            << "Usage: " << argv[0]
            << " no_of_elements initial_value\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    if (n > MAX_THREADS) n = MAX_THREADS;
    float v = std::atof(argv[2]);
    float* d_a = nullptr;
    float* h_a = new float[n];
    cudaMalloc((void**)&d_a, n * sizeof(float));

    // reset the error state to cudaSuccess
    cudaGetLastError();
    // launch a grid of 1 block of n threads
    initialize<<<1, n>>>(d_a, v);
    // synchronize the device and the host
    cudaDeviceSynchronize();
    // extract the error code for the kernel's execution
    if (check(cudaGetLastError())) {
        cudaFree(d_a);
        delete [] h_a;
        return 3;
    }
    if (check(cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost))) {
        cudaFree(d_a);
        delete [] h_a;
        return 4;
    }

    // output for correctness check
    for (int i = 0; i < n; i++)
        std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
    std::cout << std::endl;

    cudaFree(d_a);
    delete [] h_a;
}
Grid Design
The grid design for an application can take a variety of forms.
We select a layout that suits the solution of our problem.
For instance, a two-dimensional layout suits a matrix-matrix multiplication,
with one dimension identifying the rows and the other the columns. In
this case, each thread in the grid has two identifiers: the index of the
corresponding row and the index of the corresponding column.
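For example, a minimal sketch of a kernel that uses such a two-dimensional layout to compute C = A * B for square n by n matrices in row-major storage. The kernel name and parameter names are ours; the built-in index variables are described under Thread Identification below:
__global__ void matMul(const float* A, const float* B, float* C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x; // column identifier
    int row = blockIdx.y * blockDim.y + threadIdx.y; // row identifier
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}
Assuming device arrays d_A, d_B and d_C of n * n floats, a launch might look like:
dim3 dGrid((n + 31) / 32, (n + 31) / 32);
dim3 dBlock(32, 32);
matMul<<<dGrid, dBlock>>>(d_A, d_B, d_C, n);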
Scalability
Most applications that require parallel processing need many more threads
than the number of processing units available on the device. To map the
number of threads onto the number of processing units, we divide the grid into
blocks of threads. Each block is independent of any other block
and can execute in any order with respect to any other block. This grid
divisibility into independent blocks provides scalability with respect
to the installed hardware.
Each multiprocessor executes a subset of blocks from the grid of threads.
Blocks executing on a device with 4 multiprocessors are illustrated in the
Figure below. All of the threads in a block execute on the same
multiprocessor.
[Figure: the blocks of a grid distributed across the 4 multiprocessors]
The threads within a block execute on the processing units (CUDA cores) within
the multiprocessor assigned to the block. The multiprocessor schedules the
threads within the block and switches contexts as it moves them in and out
of its CUDA cores.

Grid and Block Layouts
We define the execution configuration by specifying the number of blocks in the
grid and the number of threads in each block.
Blocks in a Grid
We define the block layout within a grid in terms of grid dimensions.
The layout can be one-dimensional, two-dimensional or three-dimensional for
devices of compute capability 2.x and higher.
The total number of blocks in a grid is the product of these dimensions.
Threads in a Block
We define the thread layout within each block in terms of block dimensions.
This layout can also be one-dimensional, two-dimensional or three-dimensional.
The total number of threads in each block is the product of these dimensions.
Type dim3
We use the built-in type dim3
to specify the execution configuration. The dim3
struct has three members:
- .x - the fastest varying dimension
- .y - the second fastest varying dimension
- .z - the third fastest varying dimension
The constructor for a dim3 type
defaults the value of each dimension to 1.
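For instance, a declaration that specifies fewer than three dimensions leaves the remaining members at 1:
dim3 dBlock(256);  // dBlock.x == 256, dBlock.y == 1, dBlock.z == 1
dim3 dGrid;        // dGrid.x == 1, dGrid.y == 1, dGrid.z == 1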
Examples
To define a one-dimensional grid of 1500 blocks, with each block consisting of
256 threads, we write:
dim3 dGrid(1500);
dim3 dBlock(256);
To define a two-dimensional grid of 30 by 50 blocks, with each
block consisting of 16 by 8 by 2 threads, we write:
dim3 dGrid(30, 50);
dim3 dBlock(16, 8, 2);
Ordering Convention
The indexing of the blocks within a grid progresses
from the fastest varying dimension to the slowest:
(0, 0), (1, 0), ..., (29, 0),
(0, 1), (1, 1), ..., (29, 1),
...,
(0, 49), (1, 49), ..., (29, 49).
Similarly, the indexing of the threads within each block progresses
from fastest varying dimension to slowest:
(0, 0, 0), (1, 0, 0), ..., (15, 0, 0),
(0, 1, 0), (1, 1, 0), ..., (15, 1, 0),
...,
(0, 7, 1), (1, 7, 1), ..., (15, 7, 1).
This ordering system follows the convention for defining screen coordinate
axes. Note that this system does not follow the language conventions for
storing the elements of multi-dimensional arrays (column-major or row-major).
The ordering system described in the CUDA documentation is shown in the Figure below.
[Figure: the ordering of blocks within a grid and of threads within a block]
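In code, this convention assigns each thread a linear rank within its block, with the .x index varying fastest. A sketch, using the built-in variables described under Thread Identification below; the variable name is ours:
int threadRank = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;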
Maximum Values
A device's compute capability determines the maximum values
for grid and block dimensions as well as the maximum number
of threads within a block. Appendix G
of the CUDA Programming Guide lists the limiting values for the different
compute capabilities.
The maximum .x dimension of a grid is
- 65535 blocks for devices of compute capability up to and including 2.x
- 2^31 - 1 blocks for devices of compute capability 3.x and 5.x
The maximum .y dimension of a grid is 65535 blocks.
The maximum .z dimension of a grid is
- 1 block for devices of compute capability 1.x
- 65535 blocks for devices of compute capability 2.x or 3.x
The maximum .x or .y dimension of a block is
- 512 threads for devices of compute capability 1.x
- 1024 threads for devices of compute capability 2.x, 3.x or 5.x
The maximum .z dimension of a block is 64 threads.
The maximum number of threads in a block is
- 512 threads for devices of compute capability 1.x
- 1024 threads for devices of compute capability 2.x, 3.x or 5.x
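Rather than hard-coding these limits, we can query the values for the installed device at run time. A minimal sketch using the runtime API; the member names are those of the cudaDeviceProp struct:
// query_limits.cu - report the grid and block limits of the installed device
#include <iostream>
#include <cuda_runtime.h>

int main() {
    int d;
    cudaDeviceProp prop;
    cudaGetDevice(&d);
    cudaGetDeviceProperties(&prop, d);
    std::cout << "max grid dimensions  : " << prop.maxGridSize[0] << " x "
        << prop.maxGridSize[1] << " x " << prop.maxGridSize[2] << std::endl;
    std::cout << "max block dimensions : " << prop.maxThreadsDim[0] << " x "
        << prop.maxThreadsDim[1] << " x " << prop.maxThreadsDim[2] << std::endl;
    std::cout << "max threads per block: " << prop.maxThreadsPerBlock << std::endl;
}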
Selecting Block Dimensions
The limitation on the total number of threads in a block may impose
a restriction on our selection of block dimensions.
For example, if the maximum number is limited to 1024 threads, the
following values are admissible block dimensions:
(1024, 1, 1), (32, 32, 1), (16, 32, 2).
However, (32, 32, 2) is inadmissible because the total number of threads
would be 2048!
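A hypothetical run-time guard against an inadmissible selection, assuming prop holds the properties of the installed device (see the query sketch above):
dim3 dBlock(32, 32, 2);
if (dBlock.x * dBlock.y * dBlock.z > unsigned(prop.maxThreadsPerBlock))
    std::cerr << "***Inadmissible block dimensions***\n";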
Launching a Grid
The statement that launches a grid of threads is similar to
a templated function call and takes the form
kernel_identifier<<< dim3 dGrid, dim3 dBlock >>>(argument list);
kernel_identifier is the name of the kernel
to be invoked in the launch, dGrid specifies the
dimensions of the grid, dBlock specifies the
dimensions of each block and argument list
specifies the values copied into the kernel's parameters.
The runtime generates the grid structure automatically.
No further coding is necessary.
One-Dimensional Case
To launch a one-dimensional configuration we can specify the
.x values directly:
kernel_identifier<<< int dGrid, int dBlock >>>(argument list);
Thread Identification
Each thread executes the instructions listed
in the kernel associated with its grid's launch.
Each thread has access to the configuration parameters
that define its location within its parent block and its
parent block's location within the grid. While a
thread is executing the kernel's instructions, the
registers contain six index values that define the
thread's location within the grid:
- blockIdx.x
- blockIdx.y
- blockIdx.z
- threadIdx.x
- threadIdx.y
- threadIdx.z
along with the six dimensions that define the structure
of the grid and of each of its blocks:
- gridDim.x
- gridDim.y
- gridDim.z
- blockDim.x
- blockDim.y
- blockDim.z
Depending on the application, we select a suitable number of
dimensions within the grid and within each block of the grid.
We start with .x, introduce .y
if necessary and finally .z if also
necessary.
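As a sketch of how these twelve values relate, in the fully three-dimensional case the rank of a thread within the entire grid follows from its block's rank within the grid and its own rank within the block (the variable names are ours):
int blockRank  = blockIdx.x
               + blockIdx.y * gridDim.x
               + blockIdx.z * gridDim.x * gridDim.y;
int threadRank = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
int gridRank   = blockRank * blockDim.x * blockDim.y * blockDim.z + threadRank;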
One-Dimensional Layouts
One-dimensional layouts are the simplest. They only use the
.x member values of these built-in variables.
The other values default to 1.
Example - A Tiny Vector
Consider a vector with few enough elements to fit within a single
block. A one-dimensional grid structure with one block
is sufficient. The number of threads within the block
is equal to the number of elements in the vector. The
element index is simply the thread index within that single
block.
The following program initializes each element of a vector
to a prescribed value:
// Initialize a Tiny Vector using a Kernel
// tiny_vector.cu
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void initialize(float* a, float v) {
    int idx = threadIdx.x;
    a[idx] = v;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "***Incorrect number of arguments***\n";
        return 1;
    }
    unsigned n = std::atoi(argv[1]);
    float v = std::atof(argv[2]);

    // retrieve the maximum block size for the installed device
    int d;
    cudaDeviceProp prop;
    cudaGetDevice(&d);
    cudaGetDeviceProperties(&prop, d);
    unsigned n_max = prop.maxThreadsDim[0];
    if (n > n_max) {
        n = n_max;
        std::cout << "n reduced to " << n << std::endl;
    }

    float* d_a;
    float* h_a = new float[n];
    cudaMalloc((void**)&d_a, n * sizeof(float));

    // launch a grid of 1 block of n threads
    initialize<<<1, n>>>(d_a, v);

    // copy from device to host memory
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    // output for correctness check
    std::cout << h_a[0] << " ... " << h_a[n - 1] << std::endl;

    cudaFree(d_a);
    delete [] h_a;
}
Executing the compiled code should produce results like:
>a 5000 2.5
n reduced to 1024
2.5 ... 2.5
The check on n prevents the launch from failing
when n exceeds the maximum number
of threads in a block that the device supports.
Example - A Medium-Sized Vector
Consider a medium-sized vector with few enough elements to fit within a fully one-dimensional
grid. A one-dimensional grid of one-dimensional blocks is sufficient.
The element index follows from the block index and the thread index:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
The following program initializes each element of a vector to a
prescribed value:
// Initialize a Medium-Sized Vector using a Kernel
// medium_vector.cu
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void initialize(float* a, float v, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = v;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "***Incorrect number of arguments***\n";
        return 1;
    }
    unsigned n = std::atoi(argv[1]);
    float v = std::atof(argv[2]);

    // retrieve the limits for the installed device
    int d;
    cudaDeviceProp prop;
    cudaGetDevice(&d);
    cudaGetDeviceProperties(&prop, d);
    unsigned ntpb = prop.maxThreadsDim[0];      // threads per block
    unsigned ntpg = ntpb * prop.maxGridSize[0]; // threads per grid (assumes the product fits in unsigned)
    if (n > ntpg) {
        n = ntpg;
        std::cout << "n reduced to " << n << std::endl;
    }

    float* d_a;
    float* h_a = new float[n];
    cudaMalloc((void**)&d_a, n * sizeof(float));

    // launch a grid with enough blocks of ntpb threads
    initialize<<<(n + ntpb - 1) / ntpb, ntpb>>>(d_a, v, n);

    // copy from device to host memory
    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    // output for correctness check
    std::cout << h_a[0] << " ... " << h_a[n - 1] << std::endl;

    cudaFree(d_a);
    delete [] h_a;
}
Since the size of the vector is not necessarily exactly divisible
by the number of threads in a block, the last block may have
fewer elements to process than it has threads (<= ntpb).
The check on n within the kernel excludes
processing of any element with an index outside the device memory
allocated by the host.
Two-Dimensional Data
Consider a two-dimensional data structure like a matrix or an image.
A two-dimensional layout of this structure uses the .x
and .y members of the built-in variables to identify
the threads.
Each thread in the grid has a unique pair of indices:
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
The following program initializes the coefficients of a matrix
that is stored in one-dimensional form:
// Initialize Memory using a Kernel - Two-Dimensional Data
// matrix_thread_id.cu
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

const unsigned ntpb = 32;

__global__ void initialize(float* a, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n)
        a[j * n + i] = v;
}

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "***Incorrect number of arguments***\n";
        return 1;
    }
    unsigned n = std::atoi(argv[1]);
    float v = std::atof(argv[2]);
    int nb = (n + ntpb - 1) / ntpb;
    std::cout << "n = " << n << ", No of Blocks = " << nb
        << ", No of Threads Per Block = " << ntpb << std::endl;

    float* d_a = nullptr;
    cudaMalloc((void**)&d_a, n * n * sizeof(float));
    if (!d_a) {
        std::cerr << "***Out of Memory***\n";
        return 2;
    }
    float* h_a = new float[n * n];

    // launch a two-dimensional grid of two-dimensional blocks
    dim3 dGrid(nb, nb, 1);
    dim3 dBlock(ntpb, ntpb, 1);
    initialize<<<dGrid, dBlock>>>(d_a, v, n);

    // copy from device to host memory
    cudaMemcpy(h_a, d_a, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    // check correctness - report any element that missed initialization
    for (int i = 0; i < n * n; i++)
        if (h_a[i] != v) std::cout << h_a[i] << " != " << v << std::endl;
    std::cout << "done" << std::endl;

    cudaFree(d_a);
    delete [] h_a;
    cudaDeviceReset();
}
The results of executing the compiled code should look something like:
>a 5000 4.5
n = 5000, No of Blocks = 157, No of Threads Per Block = 32
done
>a 22400 4.5
n = 22400, No of Blocks = 700, No of Threads Per Block = 32
done
>a 25000 4.5
n = 25000, No of Blocks = 782, No of Threads Per Block = 32
***Out of Memory***
Exercises
- Upgrade the code in initialize.cu to set the limit on the number of threads to the maximum value for the installed device
- Start the Workshop on A Simple Kernel