Part B - CUDA Programming Model

Case Studies

  • Demonstrate how to refine a kernel to hide memory latency
  • Demonstrate how to use constant memory to enable caching
  • Demonstrate how to reorder instructions to maximize performance

MRI Reconstruction | Molecular Visualization | Exercises


Case studies provide practical insight into optimization and reveal other steps required to bring actual performance closer to the theoretical potential.

This chapter presents the series of upgrades used to achieve optimal solutions for two cases: MRI reconstruction and molecular visualization.


MRI Reconstruction

Magnetic Resonance Imaging (MRI) consists of two phases: acquisition (scanning) and reconstruction.  The reconstruction phase creates the sought-after image from the acquired data. 

For a full description of this case study, see Chapter 8 of the recommended text (pages 141-171).  This section summarizes the main points of that case study.  Identifiers have been shortened to improve clarity.

CPU Code

The following code snippet is part of a larger algorithm that reconstructs an MRI scan.  This snippet reconstructs an image sequentially on the CPU.

 // M - number of samples
 // N - number of voxels (volume pixels)

 float e, c, s; // e = expfhd in text

 for (int m = 0; m < M; m++) {
     rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
     imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
     for (int n = 0; n < N; n++) {
         e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
         c = cos(e);
         s = sin(e);
         rfhd[n] += rmu[m] * c - imu[m] * s;
         ifhd[n] += imu[m] * c + rmu[m] * s;
     }
 }

M is the number of samples, which is typically on the order of 10⁵.  N is the number of voxels in the reconstructed image, which is typically 128³ (2,097,152), or on the order of 10⁶.  The doubly nested loop therefore executes on the order of M × N ≈ 2 × 10¹¹ iterations.

This algorithm exhibits substantial data parallelism and is suited to execution on the GPU.  All iterations in the outer loop can be executed in parallel.  All iterations in the inner loop can also be executed in parallel. 

CUDA Kernels

There are several performance bottlenecks in this algorithm.  Let us address them one at a time.

Initial Version

Since M can be in the hundreds of thousands, let us convert the CPU code into a grid of multiple thread blocks.  Let each block contain ntpb threads ( = FHD_THREADS_PER_BLOCK in the text).  Then, the number of blocks in the grid is nblks = (M + ntpb - 1) / ntpb.

The initial version of the kernel looks like this:

 __global__ void cmpfhd(float* rphi, float* iphi, float* rd,   float* id,
                        float* rmu,  float* imu,  float* rfhd, float* ifhd,
                        float* kx,   float* ky,   float* kz,
                        float* x,    float* y,    float* z,
                        int mm,      int N) {

     int m = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (m < mm) {
         rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
         imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
         for (int n = 0; n < N; n++) {
             e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
             c = cos(e);
             s = sin(e);
             rfhd[n] += rmu[m] * c - imu[m] * s;
             ifhd[n] += imu[m] * c + rmu[m] * s;
         }
     }
 }

 // execution configuration
 ntpb  = FHD_THREADS_PER_BLOCK;
 nblks = (M + ntpb - 1) / ntpb;
 cmpfhd<<<nblks, ntpb>>>(rphi, iphi, rd, id,
  rmu, imu, rfhd, ifhd, kx, ky, kz, x, y, z, M, N);

Note that in this version, every thread writes to every element of the rfhd and ifhd arrays (the two accumulation statements in the inner loop).


One thread can interfere with another while writing to the same pair.  This kernel will not execute correctly: race conditions arise between threads writing to the same locations in the rfhd and ifhd arrays.

Loop Fission

One solution to this conflict is to reconstruct the algorithm so that each thread writes to only one [rfhd, ifhd] pair. 

For this, we swap the order of the iterations; that is, we exchange the inner loop and the outer loop.  Loop interchanges require perfectly nested loop structures. 

To achieve loop interchangeability, we remove the rmu and imu calculations from the outer loop.  We split the nested loop into two separate loops: a simple loop followed by a perfectly nested loop.  This splitting technique is called loop fission.

 float e, c, s;

 for (int m = 0; m < M; m++) {
     rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
     imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
 }

 for (int m = 0; m < M; m++) {
     for (int n = 0; n < N; n++) {
         e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
         c = cos(e);
         s = sin(e);
         rfhd[n] += rmu[m] * c - imu[m] * s;
         ifhd[n] += imu[m] * c + rmu[m] * s;
     }
 }

Loop fission is possible here because the first loop does not depend on any result produced by the second loop.

Mu Kernel

Let us now write a separate kernel for the first loop:

 __global__ void cmpmu(float* rphi, float* iphi, float* rd, float* id,
                        float* rmu, float* imu, int mm) {

     int m = blockIdx.x * blockDim.x + threadIdx.x;

     if (m < mm) {
         rmu[m] = rphi[m] * rd[m] + iphi[m] * id[m];
         imu[m] = rphi[m] * id[m] - iphi[m] * rd[m];
     }
 }

 // execution configuration
 ntpb  = MU_THREADS_PER_BLOCK;
 nblks = (M + ntpb - 1) / ntpb;
 cmpmu<<<nblks, ntpb>>>(rphi, iphi, rd, id, rmu, imu, M);

Fhd Kernel

There are three ways to consider for restructuring the second loop:

  1. one thread per calculation - N * M threads - 10⁶ * 10⁵ - too many for a grid
  2. one thread per sample - M threads - will fit within the grid
  3. one thread per voxel - N threads - will fit within the grid

The second option writes to all rfhd and ifhd pairs, causing conflicts. 

 __global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
                        float* kx, float* ky, float* kz,
                        float* x, float* y, float* z, int mm, int N) {

     int m = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (m < mm) {
         for (int n = 0; n < N; n++) {
             e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
             c = cos(e);
             s = sin(e);
             rfhd[n] += rmu[m] * c - imu[m] * s;
             ifhd[n] += imu[m] * c + rmu[m] * s;
         }
     }
 }

 // execution configuration
 ntpb  = FHD_THREADS_PER_BLOCK;
 nblks = (M + ntpb - 1) / ntpb;
 cmpfhd<<<nblks, ntpb>>>(rmu, imu, rfhd, ifhd,
  kx, ky, kz, x, y, z, M, N);

Computing One Pair of Fhd Values

To avoid conflicts while writing to the rfhd and ifhd pairs, we interchange the roles of the grid and the kernel loop: we define a grid of N threads, each executing M iterations, instead of a grid of M threads, each executing N iterations.

 __global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
                        float* kx, float* ky, float* kz,
                        float* x, float* y, float* z, int nn, int M) {

     int n = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (n < nn) {
         for (int m = 0; m < M; m++) {
             e = 2 * PI * (kx[m] * x[n] + ky[m] * y[n] + kz[m] * z[n]);
             c = cos(e);
             s = sin(e);
             rfhd[n] += rmu[m] * c - imu[m] * s;
             ifhd[n] += imu[m] * c + rmu[m] * s;
         }
     }
 }

 // execution configuration
 ntpb  = FHD_THREADS_PER_BLOCK;
 nblks = (N + ntpb - 1) / ntpb;
 cmpfhd<<<nblks, ntpb>>>(rmu, imu, rfhd, ifhd,
  kx, ky, kz, x, y, z, N, M);

This is in fact the only possible choice of the three options. 

Increased Resolutions

The size of global memory on the device imposes an upper bound on resolution.  If the number of voxels exceeds the space available (for instance, N = 1024³), then we need to divide the voxels into subsets and invoke several kernels to reconstruct the entire image.
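The following is a minimal sketch of this subset processing, not from the text.  It assumes the cmpfhd kernel from the previous section, that the sample arrays (rmu, imu, kx, ky, kz) already reside on the device, and that the voxel arrays live on the host; NSUB (the subset size) and the d-prefixed device buffers are hypothetical names introduced for illustration.

 size_t sub = NSUB * sizeof(float);
 float *dx, *dy, *dz, *drfhd, *difhd;                // device buffers for one subset of voxels
 cudaMalloc((void**)&dx, sub);
 cudaMalloc((void**)&dy, sub);
 cudaMalloc((void**)&dz, sub);
 cudaMalloc((void**)&drfhd, sub);
 cudaMalloc((void**)&difhd, sub);
 for (int off = 0; off < N; off += NSUB) {
     int nv = (N - off < NSUB) ? (N - off) : NSUB;   // voxels in this subset
     size_t sz = nv * sizeof(float);
     cudaMemcpy(dx, x + off, sz, cudaMemcpyHostToDevice);
     cudaMemcpy(dy, y + off, sz, cudaMemcpyHostToDevice);
     cudaMemcpy(dz, z + off, sz, cudaMemcpyHostToDevice);
     cudaMemcpy(drfhd, rfhd + off, sz, cudaMemcpyHostToDevice);
     cudaMemcpy(difhd, ifhd + off, sz, cudaMemcpyHostToDevice);
     int nblks = (nv + ntpb - 1) / ntpb;
     cmpfhd<<<nblks, ntpb>>>(rmu, imu, drfhd, difhd,
      kx, ky, kz, dx, dy, dz, nv, M);
     cudaMemcpy(rfhd + off, drfhd, sz, cudaMemcpyDeviceToHost);
     cudaMemcpy(ifhd + off, difhd, sz, cudaMemcpyDeviceToHost);
 }
 cudaFree(dx); cudaFree(dy); cudaFree(dz);
 cudaFree(drfhd); cudaFree(difhd);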

Compute to Memory Access Ratio

The compute-to-memory-access ratio of the kernel immediately above is low: each iteration of its loop accesses memory 14 times and performs 13 floating-point operations, for a ratio of about 1.
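Reading directly from the loop body, one way to arrive at the count of 14 (assuming no value is reused across the two accumulation statements):

 // memory accesses per inner-loop iteration (assuming no register reuse):
 //   kx[m], ky[m], kz[m], x[n], y[n], z[n]   ->  6 reads
 //   rmu[m] twice, imu[m] twice              ->  4 reads
 //   rfhd[n] and ifhd[n], read then written  ->  2 reads + 2 writes
 //                                     total -> 14 accesses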

Using Registers

To improve the compute-to-memory access ratio, we use more local variables:

 __global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
                        float* kx, float* ky, float* kz,
                        float* x, float* y, float* z, int nn, int M) {

     int n = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (n < nn) {
         float xn = x[n], yn = y[n], zn = z[n];
         float rfhdn = rfhd[n], ifhdn = ifhd[n];
         for (int m = 0; m < M; m++) {
             e = 2 * PI * (kx[m] * xn + ky[m] * yn + kz[m] * zn);
             c = cos(e);
             s = sin(e);
             rfhdn += rmu[m] * c - imu[m] * s;
             ifhdn += imu[m] * c + rmu[m] * s;
         }
         rfhd[n] = rfhdn, ifhd[n] = ifhdn;
     }
 }

 // execution configuration
 ntpb  = FHD_THREADS_PER_BLOCK;
 nblks = (N + ntpb - 1) / ntpb;
 cmpfhd<<<nblks, ntpb>>>(rmu, imu, rfhd, ifhd,
  kx, ky, kz, x, y, z, N, M);

This upgrade reduces the memory accesses per iteration from 14 to 7 but increases each thread's register count from 5 to 10.  We need to ensure that we do not exceed the capacity of each SM's register file.
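To check whether the larger register count still fits, we can query the device at run time.  The following is a minimal sketch, not part of the case study; it assumes the 10 registers per thread counted above and FHD_THREADS_PER_BLOCK threads per block.

 cudaDeviceProp prop;
 cudaGetDeviceProperties(&prop, 0);              // properties of device 0
 int regsPerBlock = 10 * FHD_THREADS_PER_BLOCK;  // 10 registers per thread
 if (regsPerBlock > prop.regsPerBlock)
     printf("block exceeds the register budget (%d > %d)\n",
            regsPerBlock, prop.regsPerBlock);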

Constant Memory

To hide memory latency, we need to improve the compute-to-memory access ratio further (from 13:7 to about 10:1).  Since each iteration accesses different k-space samples, we cannot preload their values into registers.  Since the kernel does not modify these values, multiple threads may read them simultaneously.  These values are ideal candidates for constant memory: all threads in a warp access the same element at the same time, so every time an element loads into the constant cache it also serves the other 31 threads in the warp.

Let us store the kx, ky, and kz arrays in constant memory.  Since constant memory is of limited size, we store the arrays in chunks, iterating through a single chunk at a time and launching a separate kernel for each chunk.  If CHUNK_S denotes the size of one chunk:

 __constant__ float kxc[CHUNK_S], kyc[CHUNK_S], kzc[CHUNK_S];

 __global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
                        float* x, float* y, float* z, int nn, int M) {

     int n = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (n < nn) {
         float xn = x[n], yn = y[n], zn = z[n];
         float rfhdn = rfhd[n], ifhdn = ifhd[n];
         for (int m = 0; m < M; m++) {
             e = 2 * PI * (kxc[m] * xn + kyc[m] * yn + kzc[m] * zn);
             c = cos(e);
             s = sin(e);
             rfhdn += rmu[m] * c - imu[m] * s;
             ifhdn += imu[m] * c + rmu[m] * s;
         }
         rfhd[n] = rfhdn, ifhd[n] = ifhdn;
     }
 }

 int main() {
     ntpb = FHD_THREADS_PER_BLOCK;
     nblks = (N + ntpb - 1) / ntpb;
     int c = CHUNK_S;
     int s = sizeof(float) * c;
     int nchunks = (M + c - 1) / c;
     for (int i = 0; i < nchunks; i++) {
         if (i == nchunks - 1) {
             c = M - CHUNK_S * i;
             s = sizeof(float) * c;
         }
         // copy this chunk of the k-space trajectory into constant memory
         cudaMemcpyToSymbol(kxc, &kx[i * CHUNK_S], s);
         cudaMemcpyToSymbol(kyc, &ky[i * CHUNK_S], s);
         cudaMemcpyToSymbol(kzc, &kz[i * CHUNK_S], s);
         cmpfhd<<<nblks, ntpb>>>(  // offset rmu and imu to this chunk's samples
          &rmu[i * CHUNK_S], &imu[i * CHUNK_S], rfhd, ifhd, x, y, z, N, c);
     }
 }

Array of Structs

The above solution is not ideal: the constant cache does not perform as well as expected.  The cause is the cache design.  Each cache entry stores multiple consecutive words, which reduces the cost of the cache hardware.  The three arrays in the above solution are not stored in consecutive words, so each iteration touches multiple cache entries, and the constant cache has a very limited number of entries.

The solution to this inefficient use of constant cache is to rearrange the three arrays into a single array of structs.  Storing the x, y and z members within the same instance places these components in consecutive locations in constant memory.  Then, all components used by an iteration fit into one cache entry. 

 struct KData {
     float x, y, z;
 };
 __constant__ KData kdata[CHUNK_S];

 __global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
                        float* x, float* y, float* z, int nn, int M) {

     int n = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (n < nn) {
         float xn = x[n], yn = y[n], zn = z[n];
         float rfhdn = rfhd[n], ifhdn = ifhd[n];
         for (int m = 0; m < M; m++) {
             e = 2 * PI * (kdata[m].x * xn + kdata[m].y * yn +
              kdata[m].z * zn);
             c = cos(e);
             s = sin(e);
             rfhdn += rmu[m] * c - imu[m] * s;
             ifhdn += imu[m] * c + rmu[m] * s;
         }
         rfhd[n] = rfhdn, ifhd[n] = ifhdn;
     }
 }

 int main() {
     ntpb  = FHD_THREADS_PER_BLOCK;
     nblks = (N + ntpb - 1) / ntpb;
     int c = CHUNK_S;
     int s = sizeof(KData) * c;
     int nchunks = (M + c - 1) / c;
     for (int i = 0; i < nchunks; i++) {
         if (i == nchunks - 1) {
             c = M - CHUNK_S * i;
             s = sizeof(KData) * c;
         }
         // k is the host array of KData structs for the k-space trajectory
         cudaMemcpyToSymbol(kdata, &k[i * CHUNK_S], s);
         cmpfhd<<<nblks, ntpb>>>(  // offset rmu and imu to this chunk's samples
          &rmu[i * CHUNK_S], &imu[i * CHUNK_S], rfhd, ifhd, x, y, z, N, c);
     }
 }

Hardware Trigonometry

The trigonometric functions require about 13 floating-point operations.  After implementing the refinements listed above, the computation of these trigonometric functions is our next bottleneck. 

CUDA exposes hardware trigonometric functions with higher throughput than the software-implemented alternatives.  Special function units (SFUs) on each SM execute these hardware functions.  We call them by prefixing the single-precision names with a double underscore: __sinf() and __cosf().

 __global__ void cmpfhd(float* rmu, float* imu, float* rfhd, float* ifhd,
                        float* x, float* y, float* z, int nn, int M) {

     int n = blockIdx.x * blockDim.x + threadIdx.x;
     float e, c, s;

     if (n < nn) {
         float xn = x[n], yn = y[n], zn = z[n];
         float rfhdn = rfhd[n], ifhdn = ifhd[n];
         for (int m = 0; m < M; m++) {
             e = 2 * PI * (kdata[m].x * xn + kdata[m].y * yn +
              kdata[m].z * zn);
             c = __cosf(e);
             s = __sinf(e);
             rfhdn += rmu[m] * c - imu[m] * s;
             ifhdn += imu[m] * c + rmu[m] * s;
         }
         rfhd[n] = rfhdn, ifhd[n] = ifhdn;
     }
 }

The hardware implementations are less accurate than the software functions, so we need to exercise caution to ensure that the difference is within our acceptable margin of error.
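The following is a minimal sketch, not from the text, of one way to check that margin: evaluate both versions on the device over the expected range of arguments and inspect the largest difference.  The kernel name and buffers are hypothetical.

 __global__ void trigError(const float* e, float* err, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         err[i] = fabsf(__sinf(e[i]) - sinf(e[i])); // per-sample error of the intrinsic
 }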

Configuration Parameters

Adjusting the execution configuration can lead to significant improvements in performance, subject to hardware constraints.  We experiment with different settings to find the optimal configuration.

The configuration parameters for this application are

  • the number of threads per block - 32, 64, 128, 256, 512, ...
  • the chunk size for constant memory - 32, 64, 128, 256, 1024, 2048, 4096, ...
  • the number of registers per thread - unrolling

Larger numbers of threads increase register usage.  Larger chunks require fewer kernel invocations.  Unrolling increases register usage. 
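For the unrolling experiment, one option is to ask the compiler to unroll the m loop in the final kernel.  This is a minimal sketch, not from the text; the unroll factor of 4 is an arbitrary choice, and the register count grows with it.

 // inside the if (n < nn) block of the final cmpfhd kernel
 #pragma unroll 4  // unroll by 4; increases the registers used per thread
 for (int m = 0; m < M; m++) {
     e = 2 * PI * (kdata[m].x * xn + kdata[m].y * yn + kdata[m].z * zn);
     c = __cosf(e);
     s = __sinf(e);
     rfhdn += rmu[m] * c - imu[m] * s;
     ifhdn += imu[m] * c + rmu[m] * s;
 }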

Summary

The following table summarizes the runtimes and performance of the different versions.

Version                                           Runtime (s)   GFLOPS
CPU Double                                              518.0      0.4
CPU Single                                              342.3      0.7
GPU Naive                                                41.0      5.4
GPU Registers and Constant Memory                         9.8     22.8
GPU Registers, Constant Memory, SFU and Tuning            1.5    144.5

Molecular Visualization

Simulating molecular dynamics is another application area suited to GPUs.  The calculation of a three-dimensional electrostatic potential map is often used in the placement of ions into a molecular structure.  Direct Coulomb summation (DCS) is one method used for such calculations on GPUs.  This method accumulates the potential value at each grid point in three-dimensional space using contributions from all of the atoms in the system.  The number of calculations is proportional to the product of the number of atoms in the system and the number of grid points in the three-dimensional space.

CPU Code

The following code snippet calculates the DCS energies for a single slice through a grid of equally spaced points.  k is the slice's index, z is its z-coordinate, egrid is the array of energy values at the different grid points, grid holds the number of grid points in the x, y and z directions, gridspacing is the spacing between adjacent grid points in any given direction, and na is the number of atoms.  The atoms array holds the x, y, and z coordinates of each atom plus its charge in consecutive memory locations:

 void cenergy(float* egrid, dim3 grid, float gridspacing,
              int k, float z, const float* atoms, int na) {

     for (int j = 0; j < grid.y; j++) {
         float y = gridspacing * (float)j;
         for (int i = 0; i < grid.x; i++) {
             float x = gridspacing * (float) i;
             float energy = 0.0f;
             for (int n = 0; n < na * 4; n += 4) {
                 float dx = x - atoms[n];
                 float dy = y - atoms[n + 1];
                 float dz = z - atoms[n + 2];
                 energy += atoms[n + 3] /
                  sqrt(dx * dx + dy * dy + dz * dz);
             }
             egrid[grid.x * grid.y * k + grid.x * j + i] = energy;
         }
     }
 }

CUDA Kernels

Let us address the performance bottlenecks in this algorithm one at a time. 

Initial Version

If we assign one thread to each grid point, we obtain the following kernel

 __global__ void cenergy(float* egrid, float gridspacing, int k,
                         float z, const float* atoms, int na) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     int nx = gridDim.x * blockDim.x; // grid points in the x direction
     int ny = gridDim.y * blockDim.y; // grid points in the y direction
     float x = gridspacing * (float)i;
     float y = gridspacing * (float)j;
     float energy = 0.0f;
     for (int n = 0; n < na * 4; n+= 4) {
         float dx = x - atoms[n];
         float dy = y - atoms[n + 1];
         float dz = z - atoms[n + 2];
         energy += atoms[n + 3] /
          sqrt(dx * dx + dy * dy + dz * dz);
     }
     egrid[nx * ny * k + nx * j + i] = energy;
 }

Constant Memory

We improve performance by storing the atom data in an array of structs in constant memory and iterating through the atoms chunk by chunk (as in the previous case study):

 struct Atom {
     float x, y, z, w;
 };

 __constant__ Atom atom[CHUNK_S];

 __global__ void cenergy(float* egrid, float gridspacing,
                         int k, float z, int na) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     int nx = gridDim.x * blockDim.x; // grid points in the x direction
     int ny = gridDim.y * blockDim.y; // grid points in the y direction
     int m = nx * ny * k + nx * j + i;
     float x = gridspacing * (float)i;
     float y = gridspacing * (float)j;
     float e = egrid[m]; // pre-fetch
     float energy = 0.0f;
     for (int n = 0; n < na; n++) {
         float dx = x - atom[n].x;
         float dy = y - atom[n].y;
         float dz = z - atom[n].z;
         energy += atom[n].w /
          sqrt(dx * dx + dy * dy + dz * dz);
     }
     egrid[m] = energy + e;
 }

 int main() {
     // ...
     int s = sizeof(Atom) * CHUNK_S;
     dim3 dimGrid(GX/ntpb, GY/ntpb);
     dim3 dimBlock(ntpb, ntpb);
     for (int k = 0; k < GZ; k++) {
         float z = gridspacing * (float)k;
         for (int i = 0; i < na/CHUNK_S; i++) {
             cudaMemcpyToSymbol(atom, &atoms[i * CHUNK_S], s);
             cenergy<<<dimGrid, dimBlock>>>(
              egrid, gridspacing, k, z, CHUNK_S);
         }
     }
     // ...
 }

Since we are processing the atoms in chunks, the kernel builds the grid energy in chunks.  Each thread retrieves the currently accumulated value of its grid point, augments that value by the contributions from the atoms in the chunk and stores the augmented value in global memory for the next chunk of atoms.

We have hidden memory access latency by retrieving the currently accumulated value (egrid[m]) before iterating on the atoms in the current chunk.  We use this value after the iteration.

Our kernel performs 9 floating-point operations for every 4 memory accesses, which appears to be a poor ratio.  However, all four accesses are to the atom array in constant memory.  Those values are cached on each SM and broadcast to all threads executing on it.  As a result, the effective compute-to-memory-access ratio is much higher than 10:1 and global memory bandwidth is not a bottleneck.

Initial Improvements

Note that, since we launch the kernel for threads with a constant z value, the difference between each grid point's z coordinate and an atom's z coordinate (z - atom[n].z) is constant across all threads.  Accordingly, we calculate this difference for each atom in the current chunk on the host and store the square of the difference in place of the atom's z coordinate (atom[n].z).  This way, we avoid calculating and squaring the difference separately in each thread of the grid.

We also replace the division by sqrt() with a multiplication by the reciprocal square root function rsqrt():

 struct Atom {
     float x, y, z, w;
 };

 __constant__ Atom atom[CHUNK_S];

 __global__ void cenergy(float* egrid, float gridspacing,
                         int k, float z, int na) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     int nx = gridDim.x * blockDim.x; // grid points in the x direction
     int ny = gridDim.y * blockDim.y; // grid points in the y direction
     int m = nx * ny * k + nx * j + i;
     float x = gridspacing * (float)i;
     float y = gridspacing * (float)j;
     float e = egrid[m];
     float energy = 0.0f;
     for (int n = 0; n < na; n++) {
         float dx = x - atom[n].x;
         float dy = y - atom[n].y;
         energy += atom[n].w *
          rsqrt(dx * dx + dy * dy + atom[n].z);
     }
     egrid[m] = energy + e;
 }

 int main() {
     // ...
     int s = sizeof(Atom) * CHUNK_S;
     dim3 dimGrid(GX/ntpb, GY/ntpb);
     dim3 dimBlock(ntpb, ntpb);
     for (int k = 0; k < GZ; k++) {
         float z = gridspacing * (float)k;
         // store the squared z-distance for this slice in place of the
         // z coordinate; atomz[] holds the original z coordinates
         for (int i = 0; i < na; i++)
             atoms[i].z = (z - atomz[i]) * (z - atomz[i]);
         for (int i = 0; i < na/CHUNK_S; i++) {
             cudaMemcpyToSymbol(atom, &atoms[i * CHUNK_S], s);
             cenergy<<<dimGrid, dimBlock>>>(
              egrid, gridspacing, k, z, CHUNK_S);
         }
     }
     // ...
 }

Thread Fusion

Constant memory accesses consume resources that could otherwise be used to increase compute throughput.  We reduce the number of these accesses by fusing several threads into one: each fused thread computes four adjacent grid points, so each atom value fetched from constant memory serves four calculations.

 struct Atom {
     float x, y, z, w;
 };

 __constant__ Atom atom[CHUNK_S];

 __global__ void cenergy(float* egrid, float gridspacing,
                         int k, float z, int na) {
     int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per 4 points in x
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     int nx = gridDim.x * blockDim.x * 4; // grid points in the x direction
     int ny = gridDim.y * blockDim.y;     // grid points in the y direction
     int m = nx * ny * k + nx * j + 4 * i;
     float x1 = gridspacing * (float)(4 * i);
     float x2 = x1 + gridspacing;
     float x3 = x2 + gridspacing;
     float x4 = x3 + gridspacing;
     float y = gridspacing * (float)j;
     float e1 = egrid[m];
     float e2 = egrid[m + 1];
     float e3 = egrid[m + 2];
     float e4 = egrid[m + 3];
     float energy1 = 0.0f;
     float energy2 = 0.0f;
     float energy3 = 0.0f;
     float energy4 = 0.0f;
     for (int n = 0; n < na; n++) {
         float dy = y - atom[n].y;
         float yy = (dy * dy) + atom[n].z;
         float xa = atom[n].x;
         float dx1 = x1 - xa;
         float dx2 = x2 - xa;
         float dx3 = x3 - xa;
         float dx4 = x4 - xa;
         float chr = atom[n].w;
         energy1 += chr * rsqrt(dx1 * dx1 + yy);
         energy2 += chr * rsqrt(dx2 * dx2 + yy);
         energy3 += chr * rsqrt(dx3 * dx3 + yy);
         energy4 += chr * rsqrt(dx4 * dx4 + yy);
     }
     egrid[m    ] = energy1 + e1;
     egrid[m + 1] = energy2 + e2;
     egrid[m + 2] = energy3 + e3;
     egrid[m + 3] = energy4 + e4;
 }

 int main() {
     // ...
     int s = sizeof(Atom) * CHUNK_S;
     dim3 dimGrid(GX/ntpb/4, GY/ntpb);
     dim3 dimBlock(ntpb, ntpb);
     for (int k = 0; k < GZ; k++) {
         float z = gridspacing * (float)k;
         // store the squared z-distance for this slice in place of the
         // z coordinate; atomz[] holds the original z coordinates
         for (int i = 0; i < na; i++)
             atoms[i].z = (z - atomz[i]) * (z - atomz[i]);
         for (int i = 0; i < na/CHUNK_S; i++) {
             cudaMemcpyToSymbol(atom, &atoms[i * CHUNK_S], s);
             cenergy<<<dimGrid, dimBlock>>>(
              egrid, gridspacing, k, z, CHUNK_S);
         }
     }
     // ...
 }

Memory Coalescing

The access pattern in the above solution involves uncoalesced memory writes: for each write to egrid[], the threads in a warp do not access adjacent locations.  We reselect the threads to fuse so that the memory writes are coalesced.
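A sketch of the difference, not from the text, assuming four fused points per thread and BLOCK_S threads per block in the x direction:

 // previous layout: thread t of a warp writes egrid[base + 4*t], egrid[base + 4*t + 1], ...
 //                  for any one of the four writes, neighbouring threads are
 //                  4 elements apart -> uncoalesced
 // this layout:     thread t of a warp writes egrid[base + t], egrid[base + t + BLOCK_S], ...
 //                  for any one of the four writes, neighbouring threads touch
 //                  adjacent elements -> coalesced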

 struct Atom {
     float x, y, z, w;
 };

 __constant__ Atom atom[CHUNK_S];

 __global__ void cenergy(float* egrid, float gridspacing,
                         int k, float z, int numatoms) {
     int i = blockIdx.x * blockDim.x * 4 + threadIdx.x; // first of this thread's 4 points
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     int nx = gridDim.x * blockDim.x * 4; // grid points in the x direction
     int ny = gridDim.y * blockDim.y;     // grid points in the y direction
     int m = nx * ny * k + nx * j + i;
     float x1 = gridspacing * (float)i;
     float x2 = x1 + gridspacing * BLOCK_S;
     float x3 = x2 + gridspacing * BLOCK_S;
     float x4 = x3 + gridspacing * BLOCK_S;
     float y = gridspacing * (float)j;
     float e1 = egrid[m];
     float e2 = egrid[m + BLOCK_S];
     float e3 = egrid[m + 2 * BLOCK_S];
     float e4 = egrid[m + 3 * BLOCK_S];
     float energy1 = 0.0f;
     float energy2 = 0.0f;
     float energy3 = 0.0f;
     float energy4 = 0.0f;
     for (int n = 0; n < numatoms; n++) {
         float dy = y - atom[n].y;
         float yy = (dy * dy) + atom[n].z;
         float xa = atom[n].x;
         float dx1 = x1 - xa;
         float dx2 = x2 - xa;
         float dx3 = x3 - xa;
         float dx4 = x4 - xa;
         float chr = atom[n].w;
         energy1 += chr * rsqrt(dx1 * dx1 + yy);
         energy2 += chr * rsqrt(dx2 * dx2 + yy);
         energy3 += chr * rsqrt(dx3 * dx3 + yy);
         energy4 += chr * rsqrt(dx4 * dx4 + yy);
     }
     egrid[m              ] = energy1 + e1;
     egrid[m + BLOCK_S    ] = energy2 + e2;
     egrid[m + 2 * BLOCK_S] = energy3 + e3;
     egrid[m + 3 * BLOCK_S] = energy4 + e4;
 }

 int main() {
     // ...
     int s = sizeof(Atom) * CHUNK_S;
     dim3 dimGrid(GX/ntpb/4, GY/ntpb);
     dim3 dimBlock(ntpb, ntpb);
     for (int k = 0; k < GZ; k++) {
         float z = gridspacing * (float)k;
         // store the squared z-distance for this slice in place of the
         // z coordinate; atomz[] holds the original z coordinates
         for (int i = 0; i < na; i++)
             atoms[i].z = (z - atomz[i]) * (z - atomz[i]);
         for (int i = 0; i < na/CHUNK_S; i++) {
             cudaMemcpyToSymbol(atom, &atoms[i * CHUNK_S], s);
             cenergy<<<dimGrid, dimBlock>>>(
              egrid, gridspacing, k, z, CHUNK_S);
         }
     }
     // ...
 }

Exercises

  • Read Chapter 8 in the recommended textbook



