Workshop 8

In this workshop, you evaluate the effects of thread divergence on a reduction algorithm.

Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities to

1. reduce accesses to global memory through shared memory programming
2. write kernel code that minimizes thread divergence
3. summarize what you think that you have learned in completing this workshop

Specifications

This workshop consists of two parts:

1. write a pair of kernels to calculate the dot product of two vectors using shared memory
2. upgrade the kernels to minimize thread divergence within each block

Shared Memory

Complete the following program by adding two kernels:

• the first kernel calculates the product of corresponding elements in the first two arrays, stores each product in shared memory, accumulates the data in shared memory, and stores each block's result in the third array
• the second kernel accumulates the data stored in each element of the array it receives and stores the result in the first element of that array
```
// Thread Divergence - Workshop 8
// w8.1.cu

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

const int ntpb = 1024; // number of threads per block

void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // [0.0f 1.0f]
}

// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n) {
    // store the product of a[i] and b[i] in shared memory
    // sum the data in shared memory
    // store the sum in c[blockIdx.x]
}

// accumulate the block sums
__global__ void accumulate(float* c, int n) {
    // store the elements of c[] in shared memory
    // sum the data in shared memory
    // store the sum in c[0]
}

int main(int argc, char** argv) {
    // interpret command-line arguments
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " size_of_vectors\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    int nblocks = (n + ntpb - 1) / ntpb;
    if (nblocks > ntpb) {
        nblocks = ntpb;
        n = nblocks * ntpb;
    }

    // host vectors
    float* h_a = new float[n];
    float* h_b = new float[n];
    init(h_a, n);
    init(h_b, n);

    // device vectors (d_a, d_b, d_c)
    float* d_a;
    float* d_b;
    float* d_c;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMalloc((void**)&d_b, n * sizeof(float));
    cudaMalloc((void**)&d_c, nblocks * sizeof(float));

    // copy from host to device h_a -> d_a, h_b -> d_b
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // dot product on the device
    dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);

    // synchronize
    cudaDeviceSynchronize();

    // accumulate the block sums on the device
    accumulate<<<1, nblocks>>>(d_c, nblocks);

    // copy from device to host d_c[0] -> h_c
    float h_c;
    cudaMemcpy(&h_c, d_c, sizeof(float), cudaMemcpyDeviceToHost);

    // compare results
    float hx = 0.0f;
    for (int i = 0; i < n; i++)
        hx += h_a[i] * h_b[i];
    std::cout << "Device = " << h_c << " Host = " << hx << std::endl;

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // free host memory
    delete [] h_a;
    delete [] h_b;

    // reset the device
    cudaDeviceReset();
}
```

Compile and test your code, checking that it calculates the same result on the device as it does on the host.

Copy your first solution to a file named w8.2.cu.  Upgrade the kernels in this new file to minimize thread divergence.  For this upgrade, you will need to account for warp partitioning of the threads within a block.

```
// Thread Divergence - Minimized - Workshop 8
// w8.2.cu
// ...

// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n) {
    // upgrade your original kernel here
}

// accumulate the block sums
__global__ void accumulate(float* c, int n) {
    // upgrade your original kernel here
}

int main(int argc, char** argv) {
    // same as above ...
}
```

Compile and test your code to ensure that it calculates the same result on the device as on the host.

Profiles

Start the Visual Profiler by entering the following at the command line:

`nvvp`

Complete the table below from the dotProduct kernel times reported by the profiler.

    n      With TD    Minimal TD
 2500
 5000
 7500
10000

Prepare a 3D realistic-look column chart plotting the memcpy, kernel, and session times against the number of elements in the two vectors (n) along the horizontal axis, following the steps below.

You can create the chart in Open Office using the following steps:

• Highlight data and labels
• Select Chart in the Toolbar
• Chart Type - check 3D Look Realistic Column
• Data Range - 1st row as label, 1st column as label
• Chart Elements - add title, subtitle, axes labels

You can create the chart in Excel using the following steps:

• Select Insert Tab -> Column -> 3D Clustered Column
• Select Data -> remove n -> select edit on horizontal axis labels -> add n column
• Select Chart tools -> Layout -> Chart Title - enter title and subtitle
• Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label

SUBMISSION

Copy the results of your tests for both versions into a file named w8.txt.  This file should include

• console output from running your test cases