Workshop 6
A Simple Kernel
In this workshop, you code a kernel that multiplies two square matrices
and profile your application for different matrix sizes.
Learning Outcomes
Upon successful completion of this workshop, you will have
demonstrated the abilities to
- code a kernel that executes on the device
- calculate the number of grid blocks required for a CUDA solution
- launch an execution configuration
- profile an application using a parallel profiler
- summarize what you think you have learned in completing this workshop
Specifications
This workshop consists of two parts:
- coding a kernel that calculates a coefficient in the multiplication of two
square matrices
- profiling your solution for a range of square-matrix sizes
Kernel
The following incomplete application takes a matrix of
user-specified size, initializes its components, multiplies the
matrix by itself and copies the result to host memory.
The number of rows (and columns) in the matrices is a command-line argument.
// Simple Matrix Multiply - Workshop 6
// w6.cu
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
// add CUDA runtime header file
using namespace std::chrono;

const int ntpb = 32; // number of threads per block

// - add your kernel here

// check reports error if any
//
void check(const char* msg, const cudaError_t err) {
    if (err != cudaSuccess)
        std::cerr << "*** " << msg << ": " << cudaGetErrorString(err) << " ***\n";
}

// display matrix M, which is stored in row-major order
//
void display(const char* str, const float* M, int nr, int nc) {
    std::cout << str << std::endl;
    std::cout << std::fixed << std::setprecision(4);
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < nc; j++)
            std::cout << std::setw(10) << M[i * nc + j];
        std::cout << std::endl;
    }
    std::cout << std::endl;
}

// report system time
//
void reportTime(const char* msg, steady_clock::duration span) {
    auto ms = duration_cast<milliseconds>(span);
    std::cout << msg << " - took - " << ms.count() << " millisecs" << std::endl;
}

// matrix multiply
//
void sgemm(const float* h_a, const float* h_b, float* h_c, int n) {
    // - calculate number of blocks for n rows
    // allocate memory for matrices d_a, d_b, d_c on the device
    // - add your allocation code here
    // copy h_a and h_b to d_a and d_b (host to device)
    // - add your copy code here
    // launch execution configuration
    // - define your 2D grid of blocks
    // - define your 2D block of threads
    // - launch your execution configuration
    // - check for launch termination
    // copy d_c to h_c (device to host)
    // - add your copy code here
    // deallocate device memory
    // - add your deallocation code here
    // reset the device
    cudaDeviceReset();
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " size_of_matrices\n";
        return 1;
    }
    int n = std::atoi(argv[1]); // number of rows/columns in h_a, h_b, h_c
    steady_clock::time_point ts, te;

    // allocate host memory
    ts = steady_clock::now();
    float* h_a = new float[n * n];
    float* h_b = new float[n * n];
    float* h_c = new float[n * n];
    // populate host matrices a and b
    for (int i = 0, kk = 0; i < n; i++)
        for (int j = 0; j < n; j++, kk++)
            h_a[kk] = h_b[kk] = (float)kk / (n * n);
    te = steady_clock::now();
    reportTime("allocation and initialization", te - ts);

    // h_c = h_a * h_b
    ts = steady_clock::now();
    sgemm(h_a, h_b, h_c, n);
    te = steady_clock::now();
    reportTime("matrix-matrix multiplication", te - ts);

    // display results
    if (n <= 5) {
        display("h_a :", h_a, n, n);
        display("h_b :", h_b, n, n);
        display("h_c = h_a h_b :", h_c, n, n);
    }

    // check correctness
    std::cout << "correctness test ..." << std::endl;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += h_a[i * n + k] * h_b[k * n + j];
            if (std::abs(h_c[i * n + j] - sum) > 1.0e-3f)
                std::cout << "[" << i << "," << j << "] " << h_c[i * n + j]
                          << " != " << sum << std::endl;
        }
    std::cout << "done" << std::endl;

    // deallocate host memory
    delete [] h_a;
    delete [] h_b;
    delete [] h_c;
}
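For reference, one possible completion of the kernel and launch configuration is sketched below. The kernel name matMul and the variable names nb, dGrid, and dBlock are illustrative, not part of the required solution; the sketch assumes row-major storage, matching display(), and each thread computes one coefficient of the result.

```cuda
// sketch only - each thread computes one coefficient of c = a * b
// assumes row-major storage; matMul is an illustrative name
__global__ void matMul(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x; // column index
    if (i < n && j < n) {                          // mask out excess threads
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += a[i * n + k] * b[k * n + j];
        c[i * n + j] = sum;
    }
}

// corresponding launch inside sgemm():
// int nb = (n + ntpb - 1) / ntpb; // blocks per grid dimension
// dim3 dGrid(nb, nb);             // 2D grid of blocks
// dim3 dBlock(ntpb, ntpb);        // 2D block of threads
// matMul<<<dGrid, dBlock>>>(d_a, d_b, d_c, n);
```

The bounds check is what makes the partial blocks at the grid edges safe when n is not a multiple of ntpb.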
Complete the coding of sgemm, compile your solution, and
record the timings for the sizes listed below.
n    | Allocation and Initialization | Matrix Multiplication
-----+-------------------------------+----------------------
 250 |                               |
 500 |                               |
 750 |                               |
1000 |                               |
You may find that for larger matrix sizes the kernel takes too long and the
launch is terminated by the display driver's watchdog timer. Add error
handling around the launch to confirm the cause.
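Kernel launches are asynchronous: an error in the launch configuration is reported immediately by cudaGetLastError(), while an error raised during execution (such as a watchdog termination) surfaces only after synchronizing with the device. A sketch using the check() helper from the listing (the kernel call and its arguments are illustrative):

```cuda
// sketch: error handling around the launch (matMul is an illustrative name)
matMul<<<dGrid, dBlock>>>(d_a, d_b, d_c, n);
check("launch", cudaGetLastError());          // invalid configuration, etc.
check("execution", cudaDeviceSynchronize());  // errors raised while the kernel ran
```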
Profile
Start the Nsight Profiler in Visual Studio:
- Nsight -> Start Performance Analysis
- Select Trace Application under Activity Type
- Select CUDA under Trace Settings
- Click the Launch Button under Application Control
Results
Complete the table below from the results reported by the profiler.
n    | Memcpy | Kernel | Session
-----+--------+--------+--------
 250 |        |        |
 500 |        |        |
 750 |        |        |
1000 |        |        |
Prepare a 3D-look realistic column chart plotting the
memcpy, kernel and session times against n
along the horizontal axis, as shown below.
[sample chart]
You can create the chart in Open Office using the following steps:
- Highlight data and labels
- Select Chart in the Toolbar
- Chart Type - check 3D Look Realistic Column
- Data Range - 1st row as label, 1st column as label
- Chart Elements - add title, subtitle, axes labels
You can create the chart in Excel using the following steps:
- Select Insert Tab -> Column -> 3D Clustered Column
- Select Data -> remove n -> select edit on horizontal axis labels -> add n column
- Select Chart tools -> Layout -> Chart Title - enter title and subtitle
- Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label
Save your chart as part of your spreadsheet file.
SUBMISSION
Copy the results of your tests into a file
named w6.txt. This file should include
- your userid
- the source code for your solution
- output from running your test cases
Upload your typescript to Blackboard:
- Login to
- Select your course code
- Select Workshop 6 under Workshops
- Upload w6.txt and w6.ods or
w6.xls
- Under "Add Comments" write a short note to your instructor:
Add a sentence or two describing what you think you have learned
in this workshop.
- When ready to submit, press "Submit"