The CUDA Programming Model offers a hierarchy of four distinct memories for keeping
data as close as possible to the streaming processors. Each memory has a
different access time and exposure. Global memory is at the base of this
hierarchy and includes a read-only part. CUDA API calls allocate, populate,
retrieve and deallocate this memory from the host code.
This chapter describes each memory in detail.
It also describes the parallel reduction algorithm and
implements it using the different memories. It concludes with a simple
implementation of a ray-tracing algorithm to demonstrate the use of the
read-only part of global memory.
Memory Hierarchy
The memories on the device in order of access speed from slowest
to fastest are:
- global - read/write - public to all threads of all blocks
- constant - read only - public to all threads of all blocks
- shared - read/write - private to each block - public to all threads within a block
- register - read/write - private to each thread
Host code can only access global and constant memories directly through CUDA
API calls. Shared and
register memories are only accessible from the kernel code.

Kernel Efficiency
One measure of kernel efficiency is the compute to
global memory access (CGMA) ratio. This ratio
compares the number of computations to the number of global
memory accesses within a critical part of the kernel.
Consider a GTX 750M. Since its memory bandwidth is 80 GB/s,
we can expect to load no more than 20 (= 80/4) billion single-precision
floats per second. Its peak performance is 723 GFLOPS. If the
critical part of the kernel performs one floating-point operation for
each float it loads from global memory, its CGMA ratio is 1 and its
throughput cannot exceed 20 GFLOPS, a small fraction of peak performance.
To approach peak performance, we need to increase this ratio significantly.
We increase the CGMA ratio for the critical part of the kernel by
accessing less global memory and more shared and register memory
during the computation.
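For example, consider the dot-product loop of the matrix multiplication kernel shown later in
this chapter (the version that accumulates its sum in a register). Each iteration reads two
floats from global memory and performs two floating-point operations (one multiply and one add),
so its CGMA ratio is 1. A minimal annotated sketch of this count:
sum += a[i * n + k] *   // 1 read from global memory
       b[k * n + j];    // 1 read from global memory: 2 FLOPs per 2 reads -> CGMA = 1
|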
Global Memory
Global memory is the device memory that is accessible by any thread executing
on any streaming processor in the device. The host code allocates this
memory on the device using cudaMalloc() calls and accesses its contents
through cudaMemcpy() calls.
Global memory is the slowest memory on the device.
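A minimal sketch of this host-side lifecycle follows; the buffer size and names are
illustrative only:
int n = 1024;                                // illustrative element count
float* h_a = new float[n];                   // host buffer
float* d_a;                                  // pointer to global (device) memory
cudaMalloc((void**)&d_a, n * sizeof(float)); // allocate global memory
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice); // host to device
// ... launch kernels that read and write d_a ...
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost); // device to host
cudaFree(d_a);                               // deallocate global memory
delete[] h_a;
|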
Constant Memory
Part of global device memory is available for read-only access.
Constant memory is faster than global memory.
Register Memory
Register memory is private to its thread. We allocate this
memory within the kernel using scalar variables.
Compare the two kernels listed below. Each kernel
calculates a single element of matrix c
using the dot product of a row in matrix a
with a column in matrix b.
The first kernel accumulates this dot product in global
memory. The second kernel accumulates this dot product
in register memory:
__global__ void matMul(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n) {
        c[i * n + j] = 0.0f;
        for (int k = 0; k < n; k++)
            c[i * n + j] += a[i * n + k] * b[k * n + j];
    }
}
|
__global__ void matMul(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
    }
}
|
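A possible execution configuration for launching either kernel is sketched below; the 16 x 16
block size and the device pointer names d_a, d_b and d_c are assumptions, not part of the
original example:
int ntpb = 16; // assumed threads per block in each dimension
dim3 dGrid((n + ntpb - 1) / ntpb, (n + ntpb - 1) / ntpb);
dim3 dBlock(ntpb, ntpb);
matMul<<<dGrid, dBlock>>>(d_a, d_b, d_c, n);
|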
Register memory is the fastest memory on the device.
Local Variables
CUDA allocates memory for local variables in either of two memories:
- scalar variables - data stored in a register and private to its thread
- array variables - data stored in global memory and private to its thread
CUDA stores arrays that are declared locally within the kernel
in global memory. There is therefore no speed advantage to allocating
array memory locally rather than through the host; that is, to
using local array memory instead of global memory.
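A short sketch contrasting the two kinds of local variables; the kernel below is hypothetical
and for illustration only:
__global__ void local_example(float* a) {
    int i = threadIdx.x; // scalar local variable - stored in a register
    float t = a[i];      // scalar local variable - stored in a register
    float buf[32];       // local array - stored in global memory, as noted above
    buf[0] = t;
    // ... computations on buf ...
    a[i] = buf[0];
}
|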
Shared Memory
Each multi-processor on the device contains its own memory.
The threads within a block execute on the same multi-processor and can
share this memory with other threads in the same block.
Shared memory is faster than global memory but slower than register memory.
Reduction
A reduction algorithm is an algorithm that extracts a single value from
an array of values or a set of arrays of values. This single reduced
value may be the sum of an array's elements, their maximum value, their
minimum value, their average value, or some other scalar result.
A serial reduction algorithm moves through the elements of an array or
set of arrays one element at a time. Its Big-O classification is O(n).
If the reduction operation is associative, this algorithm can be parallelized.
A parallel reduction algorithm processes pairs of elements concurrently;
given enough processors, its Big-O classification is O(log n).
Example
The conventional process of eliminating contestants in a tournament is a
parallel reduction. Contestants are paired for the first round; one
contestant is eliminated from each pair. Successful contestants are
paired for the second round; one contestant is eliminated from each pair.
Round-by-round elimination continues until only one successful contestant
is left as illustrated in the Figure below.

Implementations
We can implement a parallel reduction algorithm in different ways with different
degrees of efficiency.
Let us store the result of each pairwise operation in the array element with the
lower index. In this case, the final result will be
in the first element of the array.

Stages and Strides
Each round in a reduction is called a stage.
The difference between the indices of the elements in a pair being evaluated
at any stage is called the stride. As we move from one stage or
round to the next, the stride doubles. That is, it grows as a power
of 2. In any particular stage, we only process those elements with
indices that are exact multiples of the stride for that stage.
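The following host-only sketch, which is not part of the reduction programs below, prints the
pairs combined at each stage for an array of 8 elements; the result of each stage accumulates
at indices that are multiples of twice the stride:
#include <iostream>
int main() {
    int n = 8;
    for (int stride = 1; stride < n; stride <<= 1) // one stage per stride
        for (int i = 0; i < n; i += 2 * stride)    // results accumulate at multiples of 2 * stride
            std::cout << "stride " << stride << ": a[" << i
                      << "] += a[" << i + stride << "]\n";
}
|
After the last stage, the reduced value is in a[0].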

Examples
Consider the reduction of the elements of a one-dimensional array to their arithmetic sum.
In the following set of examples, the host code retrieves the number of elements
from the command line and populates the host array with random values.
Since the array is one-dimensional, we design our execution configuration to
consist of a single dimension of threads. We only use the
.x members of the built-in variables:
threadIdx.x,
blockIdx.x, blockDim.x
and gridDim.x. The other members default
to unit values.
A Small Vector
Let us assume that the user is working with a small vector which has fewer
elements than the maximum number of threads in a block. The execution
configuration only requires one block and the number of threads in that block
is the number of elements in the vector. We obtain the index of each
element in the kernel code by the following statement:
int i = threadIdx.x;
|
After each stage, we need to ensure that all of the threads in the
block have completed their reduction. For this purpose, we block
execution by calling the intrinsic function:
__syncthreads();
|
Once the threads have synchronized, the kernel is ready to perform the reduction
for the next stage.
// Sum a Small-Sized Vector
// small_reduction.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

__global__ void reduce(float* a, int n) {
    int i = threadIdx.x;
    // all threads iterate the same number of times so that
    // every thread reaches each __syncthreads() call
    for (int stride = 1; stride < n; stride <<= 1) {
        if (i % (2 * stride) == 0 && i + stride < n)
            a[i] += a[i + stride];
        __syncthreads();
    }
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0]
            << ": invalid no of arguments\n"
            << "Usage: " << argv[0]
            << " no_of_elements\n";
        return 1;
    }
    int n = atoi(argv[1]);

    // allocate host memory
    float* h_a = new float[n];
    // initialize host memory
    for (int i = 0; i < n; i++)
        h_a[i] = float(std::rand()) / RAND_MAX;
    float h_sum = 0.0f;
    for (int i = 0; i < n; i++)
        h_sum += h_a[i];

    // allocate device memory
    float* d_a; // full device vector
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

    // reduce on the device
    reduce<<<1, n>>>(d_a, n);

    // copy from device to host memory
    float d_sum;
    cudaMemcpy(&d_sum, d_a, sizeof(float), cudaMemcpyDeviceToHost);

    // report sums
    std::cout << std::fixed << std::setprecision(1);
    std::cout << "Host sum = " << h_sum << std::endl;
    std::cout << "Device sum = " << d_sum << std::endl;

    // deallocate memory
    delete[] h_a;
    cudaFree(d_a);
}
|
The results for an array of 500 elements are:
>a 500
Host sum = 248.7
Device sum = 248.7
|
Note that this code does not check whether the number of elements (n)
exceeds the maximum number of threads allowed in a block. If it does,
the kernel launch fails and the reported device sum will be incorrect.
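One way to guard against this, sketched below, is to query the device properties in the host
code and reject vectors that do not fit within a single block; this check is not part of the
listing above:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0); // properties of device 0
if (n > prop.maxThreadsPerBlock) {
    std::cerr << "n exceeds the maximum block size ("
        << prop.maxThreadsPerBlock << ")\n";
    return 2;
}
|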
A Medium-Sized Vector
Let us assume that the user is working with a medium-sized vector which has more elements
than the maximum number of threads in a block, but fewer elements than the maximum
number of threads in the grid. The execution configuration requires more than
one block and no more than the maximum number of blocks in the grid.
To determine each element's index, we combine the block index with the thread index:
int i = blockIdx.x * blockDim.x + threadIdx.x;
|
In this example, we allocate enough memory on the host and device to hold values
that completely fill the number of blocks required, noting that the last block in
the grid may be partly empty. To account for a partly empty block we add
host code that fills the empty elements with 0. The kernel code below performs
the reduction for each block separately and copies the result to an intermediate
array (b). The host code copies that array's elements
to a host array and sums them on the host to obtain the final result:
// Sum a Medium-Sized Vector
// medium_reduction.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

const int ntpb = 512;

__global__ void reduce(float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int stride = 1; stride < ntpb; stride <<= 1) {
        if (i % (2 * stride) == 0)
            a[i] += a[i + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        b[blockIdx.x] = a[i];
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0]
            << ": invalid no of arguments\n"
            << "Usage: " << argv[0]
            << " no_of_elements\n";
        return 1;
    }
    int n = atoi(argv[1]);
    // determine the number of blocks required
    int nblks = (n + ntpb - 1) / ntpb;

    // allocate host memory
    float* h_a = new float[nblks * ntpb];
    float* h_b = new float[nblks];
    // initialize host memory
    for (int i = 0; i < n; i++)
        h_a[i] = float(std::rand()) / RAND_MAX;
    for (int i = n; i < nblks * ntpb; i++)
        h_a[i] = 0.0f;
    float h_sum = 0.0f;
    for (int i = 0; i < n; i++)
        h_sum += h_a[i];

    // allocate device memory
    float* d_a; // full device vector
    float* d_b; // device sum per block
    cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
    cudaMalloc((void**)&d_b, nblks * sizeof(float));
    cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice);

    // reduce to partial sums
    reduce<<<nblks, ntpb>>>(d_a, d_b, n);

    // copy from device to host memory
    cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
    float d_sum = 0.0f;
    for (int i = 0; i < nblks; i++)
        d_sum += h_b[i];

    // report sums
    std::cout << std::fixed << std::setprecision(1);
    std::cout << "Host sum = " << h_sum << std::endl;
    std::cout << "Device sum = " << d_sum << std::endl;

    // deallocate memory
    delete[] h_a;
    delete[] h_b;
    cudaFree(d_a);
    cudaFree(d_b);
}
|
The results of executing the compiled code
should look something like:
>a 2000
Host sum = 993.0
Device sum = 993.0
|
Shared Memory
Shared memory is device memory that is shared by all of the threads within
a single block. Different threads within the same block can access any
value stored in this memory, while threads in other blocks cannot access this
memory. Shared memory is a part of a specific multi-processor, unlike
global memory, which is accessible by all of the device's multi-processors.
We call the threads within a block cooperating threads. All
cooperating threads see the same version of any shared variable.
Shared memory is notably faster than global memory.
To improve an algorithm's CGMA ratio, we copy data from global memory into shared
memory, work on that copy in shared memory rather than in global memory,
and, only once we have finished working on the copy, write the result(s) back
to global memory.
The keyword for allocating shared memory
is __shared__. For example,
__global__ void foo(float* a) {
    __shared__ float s[512]; // allocate shared memory
    int tid = threadIdx.x;
    s[tid] = a[tid];         // copy from global to shared memory
    ...                      // perform computations on s[tid]
    a[tid] = s[tid];         // copy results to global memory
}
|
Example
Let us improve the CGMA ratio of the medium_reduction example
by introducing shared memory into the implementation.
Reduction with Shared Memory
In the kernel code listed below, we store the element value for each thread
in a block in shared memory and access that value (in its updated form) from
shared memory during each stage. Just before leaving the kernel,
we copy the reduced value for the block of threads from shared memory to the
element in global memory assigned to store the result of the completed reduction
for that block of threads.

The following program implements this caching algorithm in shared memory:
// Sum a Medium-Sized Vector
// shared_reduction.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

const int ntpb = 512;

__global__ void reduce(float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float s[ntpb];
    s[t] = a[i];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (t % (2 * stride) == 0)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        b[blockIdx.x] = s[0];
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0]
            << ": invalid no of arguments\n"
            << "Usage: " << argv[0]
            << " no_of_elements\n";
        return 1;
    }
    int n = atoi(argv[1]);
    // determine the number of blocks required
    int nblks = (n + ntpb - 1) / ntpb;

    // allocate host memory
    float* h_a = new float[nblks * ntpb];
    float* h_b = new float[nblks];
    // initialize host memory
    for (int i = 0; i < n; i++)
        h_a[i] = float(std::rand()) / RAND_MAX;
    for (int i = n; i < nblks * ntpb; i++)
        h_a[i] = 0.0f;
    float h_sum = 0.0f;
    for (int i = 0; i < n; i++)
        h_sum += h_a[i];

    // allocate device memory
    float* d_a; // full device vector
    float* d_b; // device sum per block
    cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
    cudaMalloc((void**)&d_b, nblks * sizeof(float));
    cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice);

    // reduce to partial sums
    reduce<<<nblks, ntpb>>>(d_a, d_b, n);

    // copy from device to host memory
    cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
    float d_sum = 0.0f;
    for (int i = 0; i < nblks; i++)
        d_sum += h_b[i];

    // report sums
    std::cout << std::fixed << std::setprecision(1);
    std::cout << "Host sum = " << h_sum << std::endl;
    std::cout << "Device sum = " << d_sum << std::endl;

    // deallocate memory
    delete[] h_a;
    delete[] h_b;
    cudaFree(d_a);
    cudaFree(d_b);
}
|
The results of executing the compiled code
should look something like:
>a 2000
Host sum = 993.0
Device sum = 993.0
|
Constant Memory
Constant memory is a dedicated part of global memory that provides read-only
access. Constant memory is cached, and its access times are significantly
faster than those of global memory.
We allocate constant memory statically in the device code and copy host data
to it from the host code.
The keyword that identifies constant memory is __constant__.
The CUDA API function that copies data into constant memory has the
following prototype:
cudaError_t
cudaMemcpyToSymbol(const void* symbol, const void* source, size_t n_bytes,
                   size_t offset = 0, cudaMemcpyKind kind = cudaMemcpyHostToDevice);
|
The first parameter (symbol) receives the device symbol - the statically
allocated __constant__ variable - that is the destination of the copy, the
second parameter (source) receives the address of the host memory that
contains the data to be copied to the device, the third parameter (n_bytes)
receives the number of bytes to be copied, the fourth parameter (offset)
receives the offset in bytes from the start of the symbol at which copying
begins, and the fifth parameter receives the direction of copying. The
fourth and fifth parameters are optional.
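A minimal sketch of a call to this function follows; the array name coeff and its size are
hypothetical:
__constant__ float coeff[16];   // statically allocated constant memory (device code)
// host code
float h_coeff[16];
// ... fill h_coeff ...
cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float));
|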
Ray Tracing Example
The following program draws the parts of the spheres in a set that are visible
to a viewpoint located on the z axis. The spheres are dispersed randomly
throughout a three-dimensional space and the viewing direction is perpendicular
to the plane of the drawing.

The code determines which sphere, if any, is visible at each pixel in turn.
The algorithm iterates through every sphere for every pixel. If the ray
for the pixel hits a sphere and that sphere is closer than any sphere
processed before, the algorithm marks that sphere as the one visible at that pixel.
To accelerate execution, we store all of the information for each sphere in constant
memory. This information does not change.
We name the function that determines a hit hit() and call that
function from within the kernel code. The keyword __device__
identifies the function as one that executes on the device.
// Visible Spheres - after Sanders and Kandrot CUDA by Example
// raytrace.cu

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

#define DIM 64
#define DIMDIM (DIM * DIM)
#define NTPB 16
#define M_SPHERES 500
#define RADIUS (DIM / 10.0f)
#define MIN_RADIUS 2.0f
#define rnd(x) ((float) (x) * rand() / RAND_MAX)
#define INF 2e10f

class Sphere {
    float x, y, z, r;
public:
    Sphere() {}
    void init() {
        x = rnd(DIM) - DIM / 2.0f;
        y = rnd(DIM) - DIM / 2.0f;
        z = rnd(DIM) - DIM / 2.0f;
        r = rnd(RADIUS) + MIN_RADIUS;
    }
    __device__ float hit(float ox, float oy) {
        float dx = ox - x;
        float dy = oy - y;
        if (dx * dx + dy * dy < r * r)
            return sqrtf(r * r - dx * dx - dy * dy) + z;
        else
            return -INF;
    }
};

__constant__ Sphere s[M_SPHERES];

__global__ void raytrace(bool* a, int n) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int k = x + y * blockDim.x * gridDim.x;
    float ox = (x - DIM / 2);
    float oy = (y - DIM / 2);
    float mz = -INF;
    for (int i = 0; i < n; i++) {
        float t = s[i].hit(ox, oy);
        if (t > mz)
            mz = t;
    }
    a[k] = mz != -INF;
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0]
            << ": invalid no of arguments\n"
            << "Usage: " << argv[0]
            << " no_of_spheres\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    if (n > M_SPHERES) n = M_SPHERES;

    // create spheres and store in constant memory
    Sphere* s_temp = new Sphere[n];
    for (int i = 0; i < n; i++)
        s_temp[i].init();
    cudaMemcpyToSymbol(s, s_temp, sizeof(Sphere) * n);
    delete [] s_temp;

    // allocate device memory for hit data
    bool* d_a;
    cudaMalloc((void**)&d_a, DIMDIM * sizeof(bool));

    // launch the grid of threads
    dim3 dimGrid(DIM / NTPB, DIM / NTPB);
    dim3 dimBlock(NTPB, NTPB);
    raytrace<<<dimGrid, dimBlock>>>(d_a, n);

    // copy hit data to host
    bool* h_a = new bool[DIMDIM];
    cudaMemcpy(h_a, d_a, DIMDIM * sizeof(bool), cudaMemcpyDeviceToHost);

    // display results
    int k = 0;
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++)
            std::cout << (h_a[k++] ? 'x' : ' ');
        std::cout << std::endl;
    }

    // clean up
    delete [] h_a;
    cudaFree(d_a);
}
|
The results of a ray-trace for 40 spheres look something like:
>a 40
xxxxxx
xxxxxxxx
xxxxxxxxxx
xxxxxxxxxx
xxx xxxxxxxxxx
xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxx
xxxxx xxxxxxxxxx xxxxxxx
xxx xxxxxxxxx xxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxx
xxxxxxxxxxx xxxxxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxxxx
xxxxxxxxx xxxxxxxxxxxxxxx
xxxxxxxxx xxxxxxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxxxx
x xxxxxxxxx xxxxxxxxxxxxxxxxxx
xxx xxxxxxxx xxxxxxxxxxxxxxxxxx
xxx xxxxx xxxxxxxxx xxxx xxxxxxxxxxxxxxxxxxx
xxxxx xxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx
xxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx
xxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx
xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxx
xxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxx
xxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xx xxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxx
x xxxxxxxxxxxxxxxx xxxx xxxxxxxxxxxxxx
xxxxxxxx xxxxxxxxxxxxxxxx xxxxxxxxxxx
xxxxxxxxxx xxxxxxxxxxxxxxx xxxxxx xxx
xxxxxxxxxxxxxxxx xxxxxxxxxxxxxx xxxxxx
xxxxxxxxxxxxxxxxx xxxxxxxxxxxx xxxxxx
xxxxxxxxxxxxxxxxx xxxxxxxxxx xxxxxxx
xxxxxxxxxxxxxxxxxx xxxxxxxx xxxxxxx
xxxxxxxxxxxxxxxxxx xxxxx xxxxxxx
xxxxxxxxxxxxxxxxxx xxxxx xxxxxx
xxxxxxxxxxxxxxxxxx xxxxx xxxxxx
xxxxxxxxxxxxxxxxx xxx xxxx
|
Exercises