Workshop 8
Thread Divergence
In this workshop, you evaluate the effects of thread divergence on a reduction
algorithm.
Learning Outcomes
Upon successful completion of this workshop, you will have
demonstrated the abilities to
- describe the partitioning of threads within a thread block
- reduce accesses to global memory through shared memory programming
- write kernel code that minimizes thread divergence
- summarize what you think you have learned in completing this workshop
Specifications
This workshop consists of two parts:
- write a pair of kernels to calculate the dot product of two vectors
using shared memory
- upgrade the kernels to minimize thread divergence during reduction
operations
Shared Memory
Complete the following program by adding two kernels:
- the first kernel calculates the product of corresponding elements in the
first two arrays, stores each product in shared memory, accumulates the data
in shared memory, and stores the block's result in the third array
- the second kernel accumulates the data stored in each element of the array
received and stores the result in the first element of that array
A sketch of one possible implementation follows the program listing below.
// Thread Divergence - Workshop 8
// w8.1.cu

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
// to remove intellisense highlighting
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>

const int ntpb = 1024; // number of threads per block

void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // [0.0f, 1.0f]
}

// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n) {
    // store the product of a[i] and b[i] in shared memory
    // sum the data in shared memory
    // store the sum in c[blockIdx.x]
}

// accumulate the block sums
__global__ void accumulate(float* c, int n) {
    // store the elements of c[] in shared memory
    // sum the data in shared memory
    // store the sum in c[0]
}

int main(int argc, char** argv) {
    // interpret command-line arguments
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " size_of_vectors\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    int nblocks = (n + ntpb - 1) / ntpb;
    // cap the grid size so that accumulate() can reduce all of the
    // block sums within a single block
    if (nblocks > ntpb) {
        nblocks = ntpb;
        n = nblocks * ntpb;
    }

    // host vectors
    float* h_a = new float[n];
    float* h_b = new float[n];
    init(h_a, n);
    init(h_b, n);

    // device vectors (d_a, d_b, d_c)
    float* d_a;
    float* d_b;
    float* d_c;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMalloc((void**)&d_b, n * sizeof(float));
    cudaMalloc((void**)&d_c, nblocks * sizeof(float));

    // copy from host to device h_a -> d_a, h_b -> d_b
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    // dot product on the device
    dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);

    // synchronize
    cudaDeviceSynchronize();

    // accumulate the block sums on the device
    accumulate<<<1, nblocks>>>(d_c, nblocks);

    // copy from device to host d_c[0] -> h_c
    float h_c;
    cudaMemcpy(&h_c, d_c, sizeof(float), cudaMemcpyDeviceToHost);

    // dot product on the host
    float hx = 0.0f;
    for (int i = 0; i < n; i++)
        hx += h_a[i] * h_b[i];

    // compare results
    std::cout << "Device = " << h_c << " Host = " << hx << std::endl;

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // free host memory
    delete [] h_a;
    delete [] h_b;

    // reset the device
    cudaDeviceReset();
}
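For reference, one possible way to complete the two kernels is sketched below.
It uses an interleaved-addressing reduction, in which the modulo test on the
thread index is precisely what introduces thread divergence in this first
version. Treat this as a sketch of one workable approach, not as the required
solution.

__global__ void dotProduct(const float* a, const float* b, float* c, int n) {
    __shared__ float s[ntpb];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    // store the product of a[i] and b[i] in shared memory
    s[t] = i < n ? a[i] * b[i] : 0.0f;
    __syncthreads();
    // sum the data in shared memory - interleaved addressing
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (t % (2 * stride) == 0 && t + stride < blockDim.x)
            s[t] += s[t + stride];
        __syncthreads();
    }
    // store the sum in c[blockIdx.x]
    if (t == 0)
        c[blockIdx.x] = s[0];
}

__global__ void accumulate(float* c, int n) {
    __shared__ float s[ntpb];
    int t = threadIdx.x;
    // store the elements of c[] in shared memory
    s[t] = t < n ? c[t] : 0.0f;
    __syncthreads();
    // sum the data in shared memory
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (t % (2 * stride) == 0 && t + stride < blockDim.x)
            s[t] += s[t + stride];
        __syncthreads();
    }
    // store the sum in c[0]
    if (t == 0)
        c[0] = s[0];
}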
Compile and test your code, checking that it calculates the same result
on the device as it does on the host.
Minimize Thread Divergence
Copy your first solution to a file named w8.2.cu.
Upgrade the kernels in this new file to minimize thread divergence. For
this upgrade, you will need to account for warp partitioning of the threads
within a block; the illustration below shows why partitioning matters.
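As a reminder, the hardware partitions each block into warps of 32
consecutively indexed threads that execute in lockstep, and a warp serializes
whenever its threads follow different control paths. The fragment below is
only an illustration of the difference, not part of the workshop code:

// divergent: even- and odd-indexed threads share a warp, so every
// warp must execute both branches one after the other
if (threadIdx.x % 2 == 0) { /* path A */ } else { /* path B */ }

// non-divergent: threads 0-31 form warp 0, threads 32-63 form warp 1,
// so each warp follows a single branch
if (threadIdx.x < 32) { /* path A */ } else { /* path B */ }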
// Thread Divergence - Minimized - Workshop 8
// w8.2.cu
// ...

// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n) {
    // upgrade your original kernel here
}

// accumulate the block sums
__global__ void accumulate(float* c, int n) {
    // upgrade your original kernel here
}

int main(int argc, char** argv) {
    // same as above ...
}
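One common upgrade, sketched below, replaces interleaved addressing with
sequential addressing: on each pass the active threads are packed into the
lowest-numbered warps, so entire warps retire together instead of diverging
internally. The power-of-two rounding in accumulate is an assumption added
here so that the halving loop still covers every element when nblocks is not
a power of two; again, treat this as one workable approach rather than the
required solution.

__global__ void dotProduct(const float* a, const float* b, float* c, int n) {
    __shared__ float s[ntpb];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    s[t] = i < n ? a[i] * b[i] : 0.0f;
    __syncthreads();
    // sequential addressing: active threads occupy indices [0, stride),
    // so whole warps stay uniform while stride is at least the warp size
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        c[blockIdx.x] = s[0];
}

__global__ void accumulate(float* c, int n) {
    __shared__ float s[ntpb];
    int t = threadIdx.x;
    s[t] = t < n ? c[t] : 0.0f;
    __syncthreads();
    // round the thread count up to a power of two so the halving
    // loop reaches every element when n is not a power of two
    int m = 1;
    while (m < blockDim.x) m <<= 1;
    for (int stride = m / 2; stride > 0; stride >>= 1) {
        if (t < stride && t + stride < blockDim.x)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        c[0] = s[0];
}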
Compile and test your code to ensure that it calculates the same
result on the device as on the host.
Profiles
Start the Visual Profiler (nvvp) from the command line.
Complete the table below from the dotProduct kernel times
reported by the profiler. (If the profiler is unavailable, a CUDA-event
timing sketch follows the table as a fallback.)
n     | With TD | Minimal TD
------+---------+-----------
2500  |         |
5000  |         |
7500  |         |
10000 |         |
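CUDA events offer a rough substitute for timing the kernel when the profiler
is not available. The snippet below is a sketch of how the dotProduct launch
in main could be bracketed; it is not part of the required solution.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
std::cout << "dotProduct: " << ms << " ms" << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);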
Prepare a 3D-look realistic column chart plotting the memcpy, kernel,
and session times against the number of elements in the two vectors (n)
along the horizontal axis, as shown below.

You can create the chart in Open Office using the following steps:
- Highlight data and labels
- Select Chart in the Toolbar
- Chart Type - check 3D Look Realistic Column
- Data Range - 1st row as label, 1st column as label
- Chart Elements - add title, subtitle, axes labels
You can create the chart in Excel using the following steps:
- Select Insert Tab -> Column -> 3D Clustered Column
- Select Data -> remove n -> select edit on horizontal axis labels -> add n column
- Select Chart tools -> Layout -> Chart Title - enter title and subtitle
- Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label
Save your chart as part of your spreadsheet file.
SUBMISSION
Copy the results of your tests for both versions into a file
named w8.txt. This file should include
- your userid
- console output from running your test cases
Upload your typescript to Blackboard:
- Log in to Blackboard
- Select your course code
- Select Workshop 8 under Workshops
- Upload w8.txt and w8.ods or w8.xls
- Under "Add Comments" write a short note to your instructor:
Add a sentence or two describing what you think you have learned
in this workshop.
- When ready to submit, press "Submit"