Workshop 7

Reduction

In this workshop, you code kernels that calculate the dot product of two vectors.

Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities

• to scale a problem by reducing it to a set of smaller problems
• to store intermediate results for blocks of cooperating threads
• to use shared memory to improve the CGMA ratio
• to summarize what you think that you have learned in completing this workshop

Specifications

This workshop consists of three parts:

• a small vector solution that uses a single block of cooperating threads
• a scaled solution that uses multiple blocks of cooperating threads
• an efficient scaled solution that uses shared memory

Small Vector Reduction

The incomplete program listed below populates each of two host vectors (stored as arrays) with equal numbers of random values.  The user specifies the size of each vector on the command line.  The program copies the host data to the device, calculates the dot product of the two vectors, retrieves the result from the device, and compares it to the dot product calculated on the host.

The grid configuration is a single block with the number of threads equal to the number of elements in each vector.

Kernel

The kernel calculates the dot product on a set of cooperating threads in two parts:

1. calculates the product of two identically indexed elements of the two vectors
2. sums the products

The kernel stores the product in one of the vectors and performs the reduction on the elements of that vector, leaving the result in its first element.
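The two parts described above can be sketched as a single kernel.  This is a minimal sketch, not a required implementation: the kernel name dotProduct and the round-up handling of odd element counts are choices of this example, and no error checking is shown.

```
// dot product on a single block of cooperating threads (sketch)
__global__ void dotProduct(float* a, const float* b) {
    int i = threadIdx.x;

    // part 1: product of identically indexed elements
    a[i] *= b[i];
    __syncthreads();

    // part 2: tree reduction - len elements remain before each pass
    for (int len = blockDim.x; len > 1; ) {
        int half = (len + 1) / 2;   // round up so odd counts are handled
        if (i < len / 2)
            a[i] += a[i + half];
        __syncthreads();
        len = half;
    }
    // the result is left in a[0]
}
```

A matching launch for the single-block grid would be dotProduct<<<1, n>>>(d_a, d_b); afterwards, copying the first element of d_a back to the host retrieves the result.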

Complete the following source code, assuming that all CUDA calls succeed.  Note that your solution will only work for up to 512 or 1024 elements, depending on your device's compute capability.  Name your source file w7_1.cu

```
// Reduction - Workshop 7
// w7_1.cu

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h> // CUDA header file

void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // [0.0f 1.0f]
}

// kernel code

int main(int argc, char** argv) {
    // interpret command-line arguments
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " size_of_vectors\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    std::srand((unsigned)time(nullptr));

    // host vectors
    float* h_a = new float[n];
    float* h_b = new float[n];
    init(h_a, n);
    init(h_b, n);

    // dot product on the host
    float h_h = 0.0f;
    for (int i = 0; i < n; i++)
        h_h += h_a[i] * h_b[i];

    // allocate memory for device vectors (d_a[n], d_b[n])

    // copy host vectors to device vectors h_a -> d_a, h_b -> d_b

    // launch the grid of threads

    // copy the result from the device to the host d_a -> h_c
    float h_c;

    // compare the results
    std::cout << "Device = " << h_c << "\nHost = " << h_h << std::endl;

    // free device memory

    // free host memory
    delete [] h_a;
    delete [] h_b;
}
```

Compilation

Compile your solution and test the executable for 200 elements.  The command-line output looks something like this:

```
>nvcc w7_1.cu
w7_1.cu
   Creating library a.lib and object a.exp

>a 200
Device = 46.87
Host = 46.87
```

Scalable Solution

Upgrade your small-vector solution to accommodate vectors larger than the maximum number of threads in a block, starting with the incomplete code listed below.  Name your scalable solution w7_2.cu

Two Kernels

Pre-set the number of threads per block to ntpb (a pre-defined constant, say 512 or 1024).

Code two separate kernels.  The first kernel, named product, multiplies corresponding elements of the two vectors and stores each product in the first vector (d_a).  The second kernel, named reduce, sums the elements of d_a generated by the first kernel: it reduces all of the elements accessible to a block of threads to a single scalar sum and stores that sum in the element of d_c that corresponds to the block.
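One possible shape for the two kernels is sketched below.  It assumes the vectors have been zero-padded to nblks * ntpb elements on the host (as the template does), so no bounds checks are needed and every block covers a power-of-two element count:

```
// kernel 1 - product: multiply corresponding elements, store in a (sketch)
__global__ void product(float* a, const float* b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= b[i]; // padding elements are zero, so no bounds check needed
}

// kernel 2 - reduce one block's elements of a to a single sum in c[blockIdx.x]
__global__ void reduce(float* a, float* c) {
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            a[base + t] += a[base + t + stride];
        __syncthreads();
    }
    if (t == 0)
        c[blockIdx.x] = a[base]; // one partial sum per block
}
```

Note that the second launch of reduce (a single block over d_c) only halves cleanly when the number of partial sums is a power of two; for other block counts, pad d_c with zeros or reuse the round-up strategy from the small-vector solution.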

Complete the following source code for your scalable solution:

```
// Reduction - Workshop 7
// w7_2.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h> // CUDA header file

const int ntpb = 1024; // number of threads per block

void init(float* a, int n, bool debug) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        if (debug)
            a[i] = 1.0f;
        else
            a[i] = std::rand() * f; // [0.0f 1.0f]
}

// kernel 1 - product

// kernel 2 - reduction on a single block

int main(int argc, char** argv) {
    // interpret command-line arguments
    if (argc != 2 && argc != 3) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " size_of_vectors\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    bool debug = argc == 3;
    std::srand((unsigned)time(nullptr));
    int nblks = ; // calculate required number of blocks

    // host vectors
    float* h_a = new float[ntpb * nblks];
    float* h_b = new float[ntpb * nblks];
    init(h_a, n, debug);
    init(h_b, n, debug);
    for (int i = n; i < nblks * ntpb; i++) {
        h_a[i] = 0.0f;
        h_b[i] = 0.0f;
    }

    // dot product on the host
    float h_h = 0.0f;
    for (int i = 0; i < n; i++)
        h_h += h_a[i] * h_b[i];

    // allocate device vectors (d_a[nblks * ntpb], d_b[nblks * ntpb], d_c[nblks])

    // copy from the host to the device h_a -> d_a, h_b -> d_b

    // calculate product (launch kernel 1)

    // reduce products to one value per block (launch kernel 2)

    // intermediate debugging output
    if (debug) {
        float* h_c = new float[nblks];
        cudaMemcpy(h_c, d_c, nblks * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < nblks; i++)
            std::cout << h_c[i] << ' ';
        std::cout << std::endl;
        delete[] h_c;
    }

    // reduce block values to a single value (launch kernel 2 again)

    // copy final result from device to host - from d_c to h_c
    float h_c;

    // report your results
    std::cout << std::fixed << std::setprecision(3);
    std::cout << "Device = " << h_c << "\nHost = " << h_h << std::endl;

    // free device memory

    // free host memory
    delete[] h_a;
    delete[] h_b;
}
```

The first command line argument specifies the number of elements in each vector.  The second argument, if present, requests debug-friendly initialization and output.

Compilation

Compile your completed solution and test the executable for different sizes of vectors in debug and no-debug modes.  A command line output looks something like this:

```
>nvcc w7_2.cu
   Creating library a.lib and object a.exp

>a 2048 x
1024 1024
Device = 2048
Host = 2048

>a 2049 x
1024 1024 1
Device = 2049
Host = 2049

>a 2048
Device = 514.223
Host = 514.223

>a 2049
Device = 506.374
Host = 506.374
```

Shared Memory Solution
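The scalable solution rereads and rewrites global memory on every pass of the reduction.  For w7_3.cu, copy each block's elements into shared memory first, so that only the initial read and the final write touch global memory.  One possible sketch of such a reduce kernel, assuming ntpb threads per block and zero-padded input as before:

```
// reduction within a block using shared memory (sketch)
__global__ void reduce(const float* a, float* c) {
    __shared__ float s[ntpb];   // one element per thread in the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    s[t] = a[i];                // one global read per thread
    __syncthreads();

    // all intermediate traffic now stays in fast on-chip shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        c[blockIdx.x] = s[0];   // one global write per block
}
```

Keeping the intermediate sums on chip improves the CGMA ratio: each block now performs blockDim.x - 1 additions against only blockDim.x global reads and one global write.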

SUBMISSION

Your solution files for this workshop include:

1. w7_1.cu - the small-vector solution
2. w7_2.cu - the scalable solution
3. w7_3.cu - the scalable solution with shared memory

Create a typescript (see below for details on how) using your w7_3.cu solution file.  Your typescript should include:

1. a compilation of your source code
2. a set of test runs for different test cases

• Select Workshop 7 under Assignments