Part B - CUDA Programming Model
Case Studies
Demonstrate how to refine a kernel to hide memory latency
Demonstrate how to use constant memory to enable caching
Demonstrate how to reorder instructions to maximize performance
MRI Reconstruction | Molecular Visualization | Exercises
Case studies provide practical insight into optimization
and reveal the additional steps required to bring actual performance
closer to the theoretical potential.
This chapter presents the series of refinements used to approach
optimal solutions for two cases:
MRI reconstruction and molecular visualization.
MRI Reconstruction
Magnetic Resonance Imaging (MRI) consists of two phases:
acquisition (scanning) and reconstruction. The reconstruction
phase creates the sought-after image from the acquired data.
For a full description of this case study see chapter 8 of
the recommended text pages 141-171. This section summarizes
the main points of that case study. Identifiers
have been shortened to improve clarity.
CPU Code
The following code snippet is part of a larger algorithm that
reconstructs an MRI scan. This snippet reconstructs an
image sequentially on the CPU.
// M - number of samples
// N - number of voxels (volume pixels)
float e, c, s; // e = expfhd in the text
for (int m = 0; m < M; m++) {
    // mu[m] = conj(phi[m]) * d[m] (complex product)
    rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
    imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
    // accumulate the contribution of sample m to every voxel n
    for (int n = 0; n < N; n++) {
        e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
        c = cos(e);
        s = sin(e);
        rfhd[n] += rmu[m] * c - imu[m] * s;
        ifhd[n] += imu[m] * c + rmu[m] * s;
    }
}
M is the number of samples, which is typically on the order of 10^5.
N is the number of voxels in the reconstructed image, which is typically on the order
of 128^3 (2,097,152), or about 10^6.
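Taken together, these sizes give a rough sense of the scale of the work: the nested loops above execute
M * N ≈ 10^5 * 2 x 10^6 = 2 x 10^11
inner-loop iterations, each performing roughly a dozen floating-point operations.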
This algorithm exhibits substantial data parallelism and is suited to
execution on the GPU. All iterations in the outer loop
can be executed in parallel. All iterations in the inner loop can also
be executed in parallel.
CUDA Kernels
There are several performance bottlenecks in this algorithm.
Let us address them one at a time.
Initial Version
Since M can be in the hundreds of thousands, let us convert the CPU code
into a grid of multiple thread blocks, with one thread per sample. Let each block
contain ntpb threads ( = FHD_THREADS_PER_BLOCK
in the text). The number of blocks in the grid is then
nblks = (M + ntpb - 1)/ntpb.
The initial version of the kernel looks like
__global__ void cmpfhd(float* rphi, float* iphi, float* rd, float* id,
float* rmu, float* imu, float* rfhd, float* ifhd,
float* kx, float* ky, float* kz,
float* x, float* y, float* z,
int mm, int N) {
int m = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (m < mm) {
rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
for (int n = 0; n < N; n++) {
e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
c = cos(e);
s = sin(e);
rfhd[n] += rmu[m] * c - imu[m] * s;
ifhd[n] += imu[m] * c + rmu[m] * s;
}
}
}
// execution configuration
ntpb = FHD_THREADS_PER_BLOCK;
nblks = (M + ntpb - 1) / ntpb;
cmpfhd<<<nblks, ntpb>>>(rphi, iphi, rd, id,
rmu, imu, rfhd, ifhd, kx, ky, kz, x, y, z, M, N);
Note that in this version every thread writes to every [rfhd,
ifhd] pair: the inner loop of each thread updates rfhd[n] and ifhd[n] for all n.
One thread can interfere with another while writing to the same pair.
This kernel will not execute correctly: conflicts (race conditions)
arise between threads updating the same locations in the rfhd
and ifhd arrays.
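As an aside (this is not the approach taken in the text), the conflicting updates could be made safe with atomic operations. The sketch below assumes a device that supports single-precision atomicAdd(); it is correct but serializes the updates and would be far slower than the restructuring that follows.
// inside the inner loop of the initial kernel, replacing the two += statements
atomicAdd(&rfhd[n], rmu[m] * c - imu[m] * s);   // conflict-free but serialized update
atomicAdd(&ifhd[n], imu[m] * c + rmu[m] * s);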
Loop Fission
One solution to this conflict is to restructure the algorithm so that
each thread writes to only one [rfhd,
ifhd] pair.
For this, we swap the order of the iterations; that is, we
exchange the inner loop and the outer loop. Loop
interchange requires a perfectly nested loop structure.
To make the loops interchangeable, we remove the
rmu and imu
calculations from the outer loop, splitting the nested
loop into two separate loops: a simple loop followed
by a perfectly nested one. This splitting technique
is called loop fission.
float e, c, s;
// first loop: compute mu for every sample
for (int m = 0; m < M; m++) {
    rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
    imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
}
// second loop: perfectly nested accumulation of the Fhd values
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
        c = cos(e);
        s = sin(e);
        rfhd[n] += rmu[m] * c - imu[m] * s;
        ifhd[n] += imu[m] * c + rmu[m] * s;
    }
}
Loop fission is legal here because the first loop does not depend on the
second: the second loop only reads rmu and imu values that the first loop has already computed.
Mu Kernel
Let us now write a separate kernel for the first iteration:
__global__ void cmpmu(float* rphi, float* iphi, float* rd, float* id,
float* rmu, float* imu, int mm) {
int m = blockIdx.x * blockDim.x + threadIdx.x;
if (m < mm) {
rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
}
}
// execution configuration
ntpb = MU_THREADS_PER_BLOCK;
nblks = (M + ntpb - 1) / ntpb;
cmpmu<<<nblks, ntpb>>>(rphi, iphi, rd, id, rmu, imu, M);
Fhd Kernel
There are three ways to consider for parallelizing the second loop:
- one thread per sample-voxel combination - N * M threads - about 10^6 * 10^5 - far too many for a grid
- one thread per sample - M threads - will fit within the grid
- one thread per voxel ([rfhd, ifhd] pair) - N threads - will fit within the grid
The second option has each thread write to all rfhd
and ifhd pairs, causing the same conflicts as before:
__global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
float* kx, float* ky, float* kz,
float* x, float* y, float* z, int mm, int N) {
int m = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (m < mm) {
for (int n = 0; n < N; n++) {
e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
c = cos(e);
s = sin(e);
rfhd[n] += rmu[m] * c - imu[m] * s;
ifhd[n] += imu[m] * c + rmu[m] * s;
}
}
}
// execution configuration
ntpb = FHD_THREADS_PER_BLOCK;
nblks = (M + ntpb - 1) / ntpb;
cmpfhd<<<nblks, ntpb>>>(rmu, imu, rfhd, ifhd,
    kx, ky, kz, x, y, z, M, N);
Computing One Pair of Fhd Values
To avoid these conflicts, we adopt the third option and interchange
the roles of the two loops: we define
a grid of N threads, each executing M iterations, instead of
a grid of M threads, each executing N iterations. Each thread then
accumulates exactly one [rfhd, ifhd] pair.
__global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
float* kx, float* ky, float* kz,
float* x, float* y, float* z, int nn, int M) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (n < nn) {
for (int m = 0; m < M; m++) {
e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
c = cos(e);
s = sin(e);
rfhd[n] += rmu[m] * c - imu[m] * s;
ifhd[n] += imu[m] * c + rmu[m] * s;
}
}
}
// execution configuration
ntpb = FHD_THREADS_PER_BLOCK;
nblks = (N + ntpb - 1) / ntpb;
cmpfhd<<<nblks, ntpb>>>(rmu, imu, rfhd, ifhd,
    kx, ky, kz, x, y, z, N, M);
Of the three options, this is in fact the only workable choice: it avoids the write conflicts and still fits within a grid.
Increased Resolutions
The size of global memory on the device imposes an upper bound
on resolution. If the number of voxels exceeds the
space available (for instance, N = 1024^3), then we
need to divide the voxels into subsets and invoke the
kernel several times to reconstruct the entire image, as sketched below.
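A minimal sketch of this strategy, assuming the k-space and mu arrays remain resident on the device and that TILE_S, d_x, d_y, d_z, d_rfhd and d_ifhd are a tile size and device buffers of TILE_S floats introduced here for illustration (they are not from the text):
int sz = sizeof(float);
for (int off = 0; off < N; off += TILE_S) {
    int nvox = (N - off < TILE_S) ? N - off : TILE_S;   // voxels in this tile
    cudaMemcpy(d_x, &x[off], nvox * sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, &y[off], nvox * sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, &z[off], nvox * sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_rfhd, &rfhd[off], nvox * sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ifhd, &ifhd[off], nvox * sz, cudaMemcpyHostToDevice);
    nblks = (nvox + ntpb - 1) / ntpb;
    cmpfhd<<<nblks, ntpb>>>(rmu, imu, d_rfhd, d_ifhd,
        kx, ky, kz, d_x, d_y, d_z, nvox, M);
    cudaMemcpy(&rfhd[off], d_rfhd, nvox * sz, cudaMemcpyDeviceToHost);
    cudaMemcpy(&ifhd[off], d_ifhd, nvox * sz, cudaMemcpyDeviceToHost);
}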
Compute to Memory Access Ratio
The compute-to-memory-access ratio is low in the kernel immediately above:
each iteration of the loop accesses memory 14 times and performs about 13 floating-point
operations, for a ratio of roughly 1.
Using Registers
To improve the compute-to-memory access ratio, we use more local variables:
__global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
float* kx, float* ky, float* kz,
float* x, float* y, float* z, int nn, int M) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (n < nn) {
float xn = x[n], yn = y[n], zn = z[n];
float rfhdn = rfhd[n], ifhdn = ifhd[n];
for (int m = 0; m < M; m++) {
e = 2 * PI * (kx[m] * xn + ky[m] * yn + kz[m] * zn);
c = cos(e);
s = sin(e);
rfhdn += rmu[m] * c - imu[m] * s;
ifhdn += imu[m] * c + rmu[m] * s;
}
rfhd[n] = rfhdn, ifhd[n] = ifhdn;
}
}
// execution configuration
ntpb = FHD_THREADS_PER_BLOCK;
nblks = (N + ntpb - 1) / ntpb;
cmpfhd<<<nblks, ntpb>>>(rmu, imu, rfhd, ifhd,
    kx, ky, kz, x, y, z, N, M);
This upgrade reduces the memory accesses per iteration from 14 to 7,
but increases each thread's register count from 5 to 10.
We need to ensure that we do not exceed the size
of each SM's register file; one way to check is sketched below.
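A sketch (not from the text) of how to check: the compiler reports per-thread register usage when invoked with nvcc -Xptxas -v, and the same information is available at run time through the CUDA runtime API.
#include <cstdio>
// report the kernel's register usage and the device's register limit
void reportRegisterUsage() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void*)cmpfhd);   // attributes of the compiled kernel
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("registers per thread:      %d\n", attr.numRegs);
    printf("registers per block limit: %d\n", prop.regsPerBlock);
}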
Constant Memory
To hide memory latency, we need to improve the compute-to-memory-access
ratio further (from 13:7 to about 10:1). Each thread iterates over
every k-space sample, so the kx, ky, and kz values cannot all be
kept in registers. The kernel does not modify
these values, so multiple threads may read them simultaneously.
These values are ideal candidates for constant memory: all threads
in a warp access the same element at the same time, and every time
an element is loaded into the constant cache, the other 31 threads of the
warp are served from the cache.
Let us store the kx, ky,
and kz arrays in constant memory. Since
constant memory is of limited size, we copy the samples in chunks and
iterate through a single chunk at a time, launching a separate kernel for
each chunk. If CHUNK_S denotes the number of samples in one chunk:
__constant__ float kxc[CHUNK_S], kyc[CHUNK_S], kzc[CHUNK_S];
__global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
float* x, float* y, float* z, int nn, int M) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (n < nn) {
float xn = x[n], yn = y[n], zn = z[n];
float rfhdn = rfhd[n], ifhdn = ifhd[n];
for (int m = 0; m < M; m++) {
e = 2 * PI * (kxc[m] * xn + kyc[m] * yn + kzc[m] * zn);
c = cos(e);
s = sin(e);
rfhdn += rmu[m] * c - imu[m] * s;
ifhdn += imu[m] * c + rmu[m] * s;
}
rfhd[n] = rfhdn, ifhd[n] = ifhdn;
}
}
int main() {
    int ntpb = FHD_THREADS_PER_BLOCK;
    int nblks = (N + ntpb - 1) / ntpb;
    int c = CHUNK_S;
    int s = sizeof(float) * c;
    int nchunks = (M + c - 1) / c;
    for (int i = 0; i < nchunks; i++) {
        if (i == nchunks - 1) {        // last chunk may be partial
            c = M - CHUNK_S * i;
            s = sizeof(float) * c;
        }
        // copies into __constant__ memory use cudaMemcpyToSymbol
        cudaMemcpyToSymbol(kxc, &kx[i * CHUNK_S], s);
        cudaMemcpyToSymbol(kyc, &ky[i * CHUNK_S], s);
        cudaMemcpyToSymbol(kzc, &kz[i * CHUNK_S], s);
        cmpfhd<<<nblks, ntpb>>>(&rmu[i * CHUNK_S], &imu[i * CHUNK_S],
            rfhd, ifhd, x, y, z, N, c);
    }
}
Array of Structs
The above solution is not ideal: the constant cache does not perform as well
as expected. The cause is the cache design. Each cache entry is designed
to store multiple consecutive words, which reduces the cost of the cache hardware.
The three values used by an iteration in the above solution (kxc[m], kyc[m], kzc[m])
are not in consecutive words and hence occupy three different cache entries, and the
constant cache has a very limited number of entries.
The solution to this inefficient use of the constant cache is to rearrange the three
arrays into a single array of structs. Storing the x,
y and z members within the same
instance places these components in consecutive locations in constant memory.
Then, all components used by an iteration fit into one cache entry.
struct KData {
float x, y, z;
};
__constant__ KData kdata[CHUNK_S];
__global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
float* x, float* y, float* z, int nn, int M) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (n < nn) {
float xn = x[n], yn = y[n], zn = z[n];
float rfhdn = rfhd[n], ifhdn = ifhd[n];
for (int m = 0; m < M; m++) {
e = 2 * PI * (kdata[m].x * xn + kdata[m].y * yn +
    kdata[m].z * zn);
c = cos(e);
s = sin(e);
rfhdn += rmu[m] * c - imu[m] * s;
ifhdn += imu[m] * c + rmu[m] * s;
}
rfhd[n] = rfhdn, ifhd[n] = ifhdn;
}
}
int main() {
    int ntpb = FHD_THREADS_PER_BLOCK;
    int nblks = (N + ntpb - 1) / ntpb;
    int c = CHUNK_S;
    int s = sizeof(KData) * c;
    int nchunks = (M + c - 1) / c;
    for (int i = 0; i < nchunks; i++) {
        if (i == nchunks - 1) {        // last chunk may be partial
            c = M - CHUNK_S * i;
            s = sizeof(KData) * c;
        }
        // k is the host array of KData (one entry per sample)
        cudaMemcpyToSymbol(kdata, &k[i * CHUNK_S], s);
        cmpfhd<<<nblks, ntpb>>>(&rmu[i * CHUNK_S], &imu[i * CHUNK_S],
            rfhd, ifhd, x, y, z, N, c);
    }
}
Hardware Trigonometry
The software trigonometric functions require on the order of 13 floating-point
operations each. After implementing the refinements listed above,
the computation of these trigonometric functions is our next
bottleneck.
CUDA exposes hardware trigonometric functions with much higher
throughput than the software-implemented alternatives.
Special function units (SFUs) on each SM execute these
hardware functions. We call the single-precision hardware versions
by prefixing the function names with a double underscore
(__sinf(), __cosf()).
__global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
float* x, float* y, float* z, int nn, int M) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
float e, c, s;
if (n < nn) {
float xn = x[n], yn = y[n], zn = z[n];
float rfhdn = rfhd[n], ifhdn = ifhd[n];
for (int m = 0; m < M; m++) {
e = 2 * PI * (kdata[m].x * xn + kdata[m].y * yn +
    kdata[m].z * zn);
c = __cosf(e);
s = __sinf(e);
rfhdn += rmu[m] * c - imu[m] * s;
ifhdn += imu[m] * c + rmu[m] * s;
}
rfhd[n] = rfhdn, ifhd[n] = ifhdn;
}
}
The hardware implementations are less accurate than the
software functions, so we need to exercise caution to ensure that
the difference stays within our acceptable margin of error.
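A minimal validation sketch (not from the text): reconstruct the image once with the software functions and once with the hardware versions, then compare the two results on the host. The function and array names here are placeholders.
#include <algorithm>
#include <cmath>
// largest relative difference between a fast and a reference reconstruction
float maxRelErr(const float* fast, const float* ref, int n) {
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        float denom = std::max(std::fabs(ref[i]), 1e-6f);   // avoid division by zero
        worst = std::max(worst, std::fabs(fast[i] - ref[i]) / denom);
    }
    return worst;
}
// usage: printf("max relative error: %g\n", maxRelErr(rfhd_fast, rfhd_ref, N));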
Configuration Parameters
Adjusting the execution configuration can lead to significant improvements
in performance, subject to hardware constraints. We experiment with
different settings to find the best configuration.
The configuration parameters for this application are
- the number of threads per block - 32, 64, 128, 256, 512, ...
- the chunk size for constant memory - 32, 64, 128, 256, 1024, 2048, 4096, ...
- the number of registers per thread - controlled indirectly through loop unrolling
Larger thread blocks increase register usage per block.
Larger chunks require fewer kernel invocations.
Unrolling increases register usage per thread but reduces loop overhead.
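As a sketch of the last point (assuming the chunked Fhd kernel above), the unrolling factor can be requested explicitly with #pragma unroll; the compiler then replicates the loop body, trading additional registers for less branch overhead:
// inside the kernel, replacing the plain loop over the chunk
#pragma unroll 4
for (int m = 0; m < M; m++) {
    e = 2 * PI * (kdata[m].x * xn + kdata[m].y * yn + kdata[m].z * zn);
    c = __cosf(e);
    s = __sinf(e);
    rfhdn += rmu[m] * c - imu[m] * s;
    ifhdn += imu[m] * c + rmu[m] * s;
}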
Summary
The following table summarizes the runtimes and throughput of the different versions.
Version | Runtime (s) | GFLOPS
CPU Double | 518.0 | 0.4
CPU Single | 342.3 | 0.7
GPU Naive | 41.0 | 5.4
GPU Registers and Constant Memory | 9.8 | 22.8
GPU Registers, Constant Memory, SFU and Tuning | 1.5 | 144.5
Molecular Visualization
Simulating molecular dynamics is another application area suited to GPUs.
A three-dimensional electrostatic potential map is often calculated
when placing ions into a molecular structure. Direct Coulomb
summation (DCS) is one method used for such calculations on GPUs.
This method accumulates, at each grid point in three-dimensional
space, the potential contributions of all of the atoms within the system. The number
of calculations is proportional to the product of the number of atoms within the
system and the number of grid points in the three-dimensional space.
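In symbols, and with any physical constants assumed folded into each atom's charge (as in the code below), the potential accumulated at grid point j is
\[ V_j = \sum_{i=1}^{n_{\text{atoms}}} \frac{q_i}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert} \]
where q_i is the (scaled) charge of atom i and r_i and r_j are the positions of the atom and the grid point; the total work is therefore proportional to the number of atoms times the number of grid points.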
CPU Code
The following code snippet performs the DCS calculation for a single slice through
a lattice of equally spaced grid points. k is the slice's index,
z is its z-coordinate, egrid is the
array of energy values at the different grid points, grid
holds the number of grid points in the x, y and z directions, gridspacing
is the spacing between adjacent grid points in any given direction and
na is the number of atoms. The atoms
array holds the x, y, and z coordinates of each atom plus its charge in consecutive memory
locations:
void cenergy(float* egrid, dim3 grid, float gridspacing,
             int k, float z, const float* atoms, int na) {
for (int j = 0; j < grid.y; j++) {
float y = gridspacing * (float)j;
for (int i = 0; i < grid.x; i++) {
float x = gridspacing * (float) i;
float energy = 0.0f;
for (int n = 0; n < na * 4; n += 4) {
float dx = x - atoms[n];
float dy = y - atoms[n + 1];
float dz = z - atoms[n + 2];
energy += atoms[n + 3] /
sqrt(dx * dx + dy * dy + dz * dz);
}
egrid[grid.x * grid.y * k + grid.x * j + i] = energy;
}
}
}
CUDA Kernels
Let us address the performance bottlenecks in this algorithm one at a time.
Initial Version
If we assign one thread to each grid point, we obtain the following kernel
__global__ void cenergy(float* egrid, float gridspacing, int k,
float z, const float* atoms, int na) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
float x = gridspacing * (float)i;
float y = gridspacing * (float)j;
float energy = 0.0f;
for (int n = 0; n < na * 4; n+= 4) {
float dx = x - atoms[n];
float dy = y - atoms[n + 1];
float dz = z - atoms[n + 2];
energy += atoms[n + 3] /
sqrt(dx * dx + dy * dy + dz * dz);
}
int nx = blockDim.x * gridDim.x;   // grid points along x
int ny = blockDim.y * gridDim.y;   // grid points along y
egrid[nx * ny * k + nx * j + i] = energy;
}
Constant Memory
We improve performance by storing the atom
data in an array of structs in constant memory and iterating through
the atoms chunk by chunk (as in the previous case study):
struct Atom {
float x, y, z, w;
};
__constant__ Atom atom[CHUNK_S];
__global__ void cenergy(float* egrid, float gridspacing,
int k, float z, int na) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int nx = blockDim.x * gridDim.x;   // grid points along x
int ny = blockDim.y * gridDim.y;   // grid points along y
int m = nx * ny * k + nx * j + i;
float x = gridspacing * (float)i;
float y = gridspacing * (float)j;
float e = egrid[m]; // pre-fetch
float energy = 0.0f;
for (int n = 0; n < na; n++) {
float dx = x - atom[n].x;
float dy = y - atom[n].y;
float dz = z - atom[n].z;
energy += atom[n].w /
sqrt(dx * dx + dy * dy + dz * dz);
}
egrid[m] = energy + e;
}
int main() {
// ...
int s = sizeof(Atom) * CHUNK_S;
dim3 dimGrid(GX/ntpb, GY/ntpb);
dim3 dimBlock(ntpb, ntpb);
for (int k = 0; k < GZ; k++) {
float z = gridspacing * (float)k;
for (int i = 0; i < na/CHUNK_S; i++) {
cudaMemcpyToSymbol(atom, &atoms[i * CHUNK_S], s);
cenergy<<<dimGrid, dimBlock>>>(
egrid, gridspacing, k, z, CHUNK_S);
}
}
// ...
}
Since we are processing the atoms in chunks, the kernel builds the
grid energy in chunks. Each thread retrieves the currently
accumulated value of its grid point, augments that value by the
contributions from the atoms in the chunk and stores the augmented
value in global memory for the next chunk of atoms.
We hide memory access latency by fetching the currently
accumulated value (egrid[m]) before iterating
over the atoms in the current chunk and using it only after the
iteration, so the load overlaps with the computation.
Our kernel performs about 9 floating-point operations for every 4 memory accesses,
which by itself is a poor ratio. However, all four accesses read the atom
array in constant memory: the values are held in each SM's constant cache and broadcast
to the threads of a warp. As a result, the effective compute-to-global-memory-access ratio is much
higher than 10:1 and global memory bandwidth is not a bottleneck.
Initial Improvements
Note that, since we launch the kernel for threads with a constant z
value, the difference between each grid point's z coordinate and an
atom's z coordinate (z - atom[n].z) is the same
for all threads. Accordingly, we calculate this difference for
each atom on the host, once per slice, and store the square of the
difference in place of the atom's z coordinate (atom[n].z).
This way, we avoid calculating and squaring the difference individually for each
thread in the grid.
We also replace the division by sqrt() with a
multiplication by the single-precision reciprocal square root function - rsqrtf():
struct Atom {
float x, y, z, w;
};
__constant__ Atom atom[CHUNK_S];
__global__ void cenergy(float* egrid, float gridspacing,
int k, float z, int na) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int nx = blockDim.x * gridDim.x;   // grid points along x
int ny = blockDim.y * gridDim.y;   // grid points along y
int m = nx * ny * k + nx * j + i;
float x = gridspacing * (float)i;
float y = gridspacing * (float)j;
float e = egrid[m];
float energy = 0.0f;
for (int n = 0; n < na; n++) {
float dx = x - atom[n].x;
float dy = y - atom[n].y;
energy += atom[n].w *
    rsqrtf(dx * dx + dy * dy + atom[n].z);
}
egrid[m] = energy + e;
}
int main() {
    // ...
    int s = sizeof(Atom) * CHUNK_S;
    // atomsdz (assumed allocated elsewhere) is a host staging copy of atoms
    // whose z member holds the precomputed (z - atom z)^2 for the current slice
    dim3 dimGrid(GX/ntpb, GY/ntpb);
    dim3 dimBlock(ntpb, ntpb);
    for (int k = 0; k < GZ; k++) {
        float z = gridspacing * (float)k;
        for (int n = 0; n < na; n++) {
            atomsdz[n] = atoms[n];
            atomsdz[n].z = (z - atoms[n].z) * (z - atoms[n].z);
        }
        for (int i = 0; i < na/CHUNK_S; i++) {
            cudaMemcpyToSymbol(atom, &atomsdz[i * CHUNK_S], s);
            cenergy<<<dimGrid, dimBlock>>>(
                egrid, gridspacing, k, z, CHUNK_S);
        }
    }
    // ...
}
Thread Fusion
Constant memory accesses consume resources that could otherwise
be used to increase compute throughput. We reduce the number
of these accesses by fusing multiple threads into one: each value fetched
from constant memory is reused for four grid points held in registers.
struct Atom {
float x, y, z, w;
};
__constant__ Atom atom[CHUNK_S];
__global__ void cenergy(float* egrid, float gridspacing,
int k, float z, int na) {
int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread index along x
int j = blockIdx.y * blockDim.y + threadIdx.y;
int nx = blockDim.x * gridDim.x * 4;             // grid points along x (4 per thread)
int ny = blockDim.y * gridDim.y;                 // grid points along y
int m = nx * ny * k + nx * j + 4 * i;            // first of this thread's 4 adjacent points
float x1 = gridspacing * (float)(4 * i);
float x2 = x1 + gridspacing;
float x3 = x2 + gridspacing;
float x4 = x3 + gridspacing;
float y = gridspacing * (float)j;
float e1 = egrid[m];
float e2 = egrid[m + 1];
float e3 = egrid[m + 2];
float e4 = egrid[m + 3];
float energy1 = 0.0f;
float energy2 = 0.0f;
float energy3 = 0.0f;
float energy4 = 0.0f;
for (int n = 0; n < na; n++) {
float dy = y - atom[n].y;
float yy = (dy * dy) + atom[n].z;
float xa = atom[n].x;
float dx1 = x1 - xa;
float dx2 = x2 - xa;
float dx3 = x3 - xa;
float dx4 = x4 - xa;
float chr = atom[n].w;
energy1 += chr * rsqrtf(dx1 * dx1 + yy);
energy2 += chr * rsqrtf(dx2 * dx2 + yy);
energy3 += chr * rsqrtf(dx3 * dx3 + yy);
energy4 += chr * rsqrtf(dx4 * dx4 + yy);
}
egrid[m ] = energy1 + e1;
egrid[m + 1] = energy2 + e2;
egrid[m + 2] = energy3 + e3;
egrid[m + 3] = energy4 + e4;
}
int main() {
    // ...
    int s = sizeof(Atom) * CHUNK_S;
    dim3 dimGrid(GX/ntpb/4, GY/ntpb);   // each thread now covers 4 grid points along x
    dim3 dimBlock(ntpb, ntpb);
    for (int k = 0; k < GZ; k++) {
        float z = gridspacing * (float)k;
        // atomsdz: host staging copy (assumed allocated elsewhere)
        // with (z - atom z)^2 stored in .z for this slice
        for (int n = 0; n < na; n++) {
            atomsdz[n] = atoms[n];
            atomsdz[n].z = (z - atoms[n].z) * (z - atoms[n].z);
        }
        for (int i = 0; i < na/CHUNK_S; i++) {
            cudaMemcpyToSymbol(atom, &atomsdz[i * CHUNK_S], s);
            cenergy<<<dimGrid, dimBlock>>>(
                egrid, gridspacing, k, z, CHUNK_S);
        }
    }
    // ...
}
Memory Coalescing
The access pattern in the above solution involves uncoalesced memory
writes: for each egrid[] access, the threads
in a warp do not write to adjacent locations. We therefore fuse a different
selection of grid points into each thread - four points spaced a block width
apart - so that the warp's writes are coalesced.
struct Atom {
float x, y, z, w;
};
__constant__ Atom atom[CHUNK_S];
__global__ void cenergy(float* egrid, float gridspacing,
int k, float z, int numatoms) {
int i = blockIdx.x * blockDim.x * 4 + threadIdx.x;   // first of this thread's 4 points along x
int j = blockIdx.y * blockDim.y + threadIdx.y;
int nx = blockDim.x * gridDim.x * 4;                 // grid points along x
int ny = blockDim.y * gridDim.y;                     // grid points along y
int m = nx * ny * k + nx * j + i;
float x1 = gridspacing * (float)i;
float x2 = x1 + gridspacing * BLOCK_S;
float x3 = x2 + gridspacing * BLOCK_S;
float x4 = x3 + gridspacing * BLOCK_S;
float y = gridspacing * (float)j;
float e1 = egrid[m];
float e2 = egrid[m + BLOCK_S];
float e3 = egrid[m + 2 * BLOCK_S];
float e4 = egrid[m + 3 * BLOCK_S];
float energy1 = 0.0f;
float energy2 = 0.0f;
float energy3 = 0.0f;
float energy4 = 0.0f;
for (int n = 0; n < numatoms; n++) {
float dy = y - atom[n].y;
float yy = (dy * dy) + atom[n].z;
float xa = atom[n].x;
float dx1 = x1 - xa;
float dx2 = x2 - xa;
float dx3 = x3 - xa;
float dx4 = x4 - xa;
float chr = atom[n].w;
energy1 += chr * rsqrtf(dx1 * dx1 + yy);
energy2 += chr * rsqrtf(dx2 * dx2 + yy);
energy3 += chr * rsqrtf(dx3 * dx3 + yy);
energy4 += chr * rsqrtf(dx4 * dx4 + yy);
}
egrid[m ] = energy1 + e1;
egrid[m + BLOCK_S ] = energy2 + e2;
egrid[m + 2 * BLOCK_S] = energy3 + e3;
egrid[m + 3 * BLOCK_S] = energy4 + e4;
}
int main() {
    // ...
    int s = sizeof(Atom) * CHUNK_S;
    dim3 dimGrid(GX/ntpb/4, GY/ntpb);   // each thread still covers 4 grid points along x
    dim3 dimBlock(ntpb, ntpb);
    for (int k = 0; k < GZ; k++) {
        float z = gridspacing * (float)k;
        // atomsdz: host staging copy (assumed allocated elsewhere)
        // with (z - atom z)^2 stored in .z for this slice
        for (int n = 0; n < na; n++) {
            atomsdz[n] = atoms[n];
            atomsdz[n].z = (z - atoms[n].z) * (z - atoms[n].z);
        }
        for (int i = 0; i < na/CHUNK_S; i++) {
            cudaMemcpyToSymbol(atom, &atomsdz[i * CHUNK_S], s);
            cenergy<<<dimGrid, dimBlock>>>(
                egrid, gridspacing, k, z, CHUNK_S);
        }
    }
    // ...
}
Exercises
- Read Chapter 8 in the recommended textbook