Workshop 4
cuBLAS
In this workshop, you compare the cuBLAS matrix multiplier with
the gcc cblas matrix multiplier.
Learning Outcomes
Upon successful completion of this workshop, you will have
demonstrated the ability to
- code matrix multiplication using the cblas library functions
- code matrix multiplication using the cuBLAS library functions
- summarize what you have learned in completing this workshop
Specifications
This workshop repeats the task started in Workshop 2.
There are two versions of the same matrix-multiplication program to complete: the
cblas version and the cuBLAS version.
Each executable takes one command-line argument, which is the number of rows/columns
in the square matrices being multiplied.
cblas Version
Complete the following program using the cblas
implementation of the BLAS standard.
For more information, refer to the L.A. in Science
chapter and Workshop 2.
// Level 3 cblas - Workshop 4
// w4_cblas.cpp

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
// add cblas header file
using namespace std::chrono;

// indexing function (column major order)
//
inline int idx(int r, int c, int n)
{
    // ... add indexing formula
}

// display matrix M, which is stored in column-major order
//
void display(const char* str, const float* M, int nr, int nc)
{
    std::cout << str << std::endl;
    std::cout << std::fixed << std::setprecision(4);
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < nc; j++)
            std::cout << std::setw(10)
             << // ... access in column-major order;
        std::cout << std::endl;
    }
    std::cout << std::endl;
}

// report system time
//
void reportTime(const char* msg, steady_clock::duration span) {
    auto ms = duration_cast<milliseconds>(span);
    std::cout << msg << " - took - " <<
     ms.count() << " millisecs" << std::endl;
}

// matrix multiply
//
void sgemm(const float* A, const float* B, float* C, int n) {
    steady_clock::time_point ts, te;

    // level 3 calculation: C = alpha * A * B + beta * C
    // add any preliminaries
    ts = steady_clock::now();
    // ... add call to cblas sgemm
    te = steady_clock::now();
    reportTime("matrix-matrix multiplication", te - ts);
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " size_of_matrices\n";
        return 1;
    }
    int n = std::atoi(argv[1]); // no of rows/columns in A, B, C

    // allocate host memory
    float* h_A = new float[n * n];
    float* h_B = new float[n * n];
    float* h_C = new float[n * n];

    // populate host matrices a and b
    for (int i = 0, kk = 0; i < n; i++)
        for (int j = 0; j < n; j++, kk++)
            h_A[kk] = h_B[kk] = (float)kk;

    // C = A * B
    sgemm(h_A, h_B, h_C, n);

    // display results
    if (n <= 5) {
        display("A :", h_A, n, n);
        display("B :", h_B, n, n);
        display("C = A B :", h_C, n, n);
    }

    // deallocate host memory
    delete [] h_A;
    delete [] h_B;
    delete [] h_C;
}
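The column-major layout that the idx() and display() stubs rely on can be sketched as follows. This is only an illustration under stated assumptions, not the workshop's solution: the names colMajorIndex and makeSequential are hypothetical, and the sketch assumes that elements of one column sit next to each other in memory, so stepping one column to the right skips a full column of nr elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (r, c) in a column-major
// matrix that has nr rows. Elements of one column are contiguous.
inline int colMajorIndex(int r, int c, int nr) {
    assert(r >= 0 && r < nr && c >= 0);
    return c * nr + r;
}

// Hypothetical helper: fill an n x n matrix the same way main() does,
// so that the element at linear offset kk holds the value kk.
std::vector<float> makeSequential(int n) {
    std::vector<float> m(static_cast<std::size_t>(n) * n);
    for (std::size_t kk = 0; kk < m.size(); ++kk)
        m[kk] = static_cast<float>(kk);
    return m;
}
```

With n = 4, the element in row 0, column 1 is stored at offset 4, which is why the first row of A in the test results reads 0, 4, 8, 12.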
Compile and Link
Compile and link your completed version of this program using version
4.8 of GCC or higher along with the -O2 optimization
switch. Version 7.2.0 is available in matrix's
local system directory and accessible using the following
Makefile:
# Makefile for w4
#
GCC_VERSION = 7.2.0
PREFIX = /usr/local/gcc/${GCC_VERSION}/bin/
CC = ${PREFIX}gcc
CPP = ${PREFIX}g++
w4_cblas: w4_cblas.o
	$(CPP) -ow4_cblas w4_cblas.o -lgslcblas
w4_cblas.o: w4_cblas.cpp
	$(CPP) -c -O2 -std=c++17 w4_cblas.cpp
clean:
	rm *.o
To execute this Makefile, enter the command
To run the executable, enter the command
where the argument is the number of rows/columns in the square matrices.
Test Results
Test the executable for a command line argument of 4.
The results should look something like:
matrix-matrix multiplication - took - 0 millisecs
A :
0.0000 4.0000 8.0000 12.0000
1.0000 5.0000 9.0000 13.0000
2.0000 6.0000 10.0000 14.0000
3.0000 7.0000 11.0000 15.0000
B :
0.0000 4.0000 8.0000 12.0000
1.0000 5.0000 9.0000 13.0000
2.0000 6.0000 10.0000 14.0000
3.0000 7.0000 11.0000 15.0000
C = A B :
56.0000 152.0000 248.0000 344.0000
62.0000 174.0000 286.0000 398.0000
68.0000 196.0000 324.0000 452.0000
74.0000 218.0000 362.0000 506.0000
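To check output like the above independently of any BLAS library, a naive triple-loop reference can be used. The following is a minimal sketch, assuming n x n matrices stored in column-major order. The name referenceSgemm is hypothetical, and the routine is meant only for small test sizes, not for the timing comparison.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference routine: C = alpha * A * B + beta * C
// for n x n matrices stored in column-major order.
void referenceSgemm(const float* A, const float* B, float* C, int n,
                    float alpha = 1.0f, float beta = 0.0f) {
    for (int c = 0; c < n; ++c) {
        for (int r = 0; r < n; ++r) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[k * n + r] * B[c * n + k];  // A(r,k) * B(k,c)
            float& out = C[c * n + r];
            // When beta is zero, ignore the old contents of C so that
            // uninitialized values cannot leak into the result.
            out = (beta == 0.0f) ? alpha * sum : alpha * sum + beta * out;
        }
    }
}
```

Called with A and B both filled with 0 to 15 and n = 4, this produces 56 in the top-left corner of C and 506 in the bottom-right, matching the listing above.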
cuBLAS Version
Modify your cblas version by replacing its definition
of the sgemm() function with a definition that makes the
cuBLAS calls needed to invoke the equivalent
cublasSgemm()
library function.
// Level 3 cuBLAS - Workshop 4
// w4_cublas.cu

// ...

// matrix multiply
//
void sgemm(const float* h_A, const float* h_B, float* h_C, int n) {
    steady_clock::time_point ts, te;

    // level 3 calculation: C = alpha * A * B + beta * C
    ts = steady_clock::now();
    // ... allocate memory on the device
    te = steady_clock::now();
    reportTime("allocation of device memory for matrices d_A, d_B and d_C",
     te - ts);

    // ... create cuBLAS context

    ts = steady_clock::now();
    // ... copy host matrices to the device
    te = steady_clock::now();
    reportTime("copying of matrices h_A and h_B to device memory", te - ts);

    ts = steady_clock::now();
    // ... calculate matrix-matrix product
    te = steady_clock::now();
    reportTime("matrix-matrix multiplication", te - ts);

    // restart the clock so that this timing covers only the copy back
    ts = steady_clock::now();
    // ... copy result matrix from the device to the host
    te = steady_clock::now();
    reportTime("copying of matrix d_C from device", te - ts);

    // ... destroy cuBLAS context

    ts = steady_clock::now();
    // ... deallocate device memory
    te = steady_clock::now();
    reportTime("deallocation of device memory for matrices d_A, d_B and d_C",
     te - ts);
}
The instructions to build this version of your program can be found
in the chapter entitled CUDA Libraries.
Test Results
Test the executable for a command-line argument of 4
and compare the results to those shown above.
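For sizes too large to inspect by eye, the comparison can also be done numerically. The sketch below is one possible approach, not part of the workshop's required code: the names maxRelDiff and matricesAgree and the default tolerance of 1.0e-5 are assumptions. A tolerance is used because two implementations may add the products in a different order, which can change the last digits of a float result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper: largest element-wise difference between x and y,
// scaled by the larger magnitude (or by 1 for values near zero).
float maxRelDiff(const float* x, const float* y, std::size_t count) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        float scale = std::max({std::fabs(x[i]), std::fabs(y[i]), 1.0f});
        worst = std::max(worst, std::fabs(x[i] - y[i]) / scale);
    }
    return worst;
}

// Hypothetical helper: true if every element agrees within tol.
// The default tolerance is an assumed value; adjust it to the problem size.
bool matricesAgree(const float* x, const float* y, std::size_t count,
                   float tol = 1.0e-5f) {
    return maxRelDiff(x, y, count) <= tol;
}
```

Passing the h_C arrays produced by the two versions, with count equal to n * n, gives a single yes/no answer instead of a visual check.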
Comparison
Run each version for the matrix sizes listed
below. Record the reported matrix-multiplication
elapsed time for each size and each version.
n    | cblas | cuBLAS
500  |       |
1000 |       |
1500 |       |
2000 |       |
2500 |       |
3000 |       |
3500 |       |
4000 |       |
Save this table in a spreadsheet file named w4.ods
or w4.xls.
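Because reportTime() prints whole milliseconds while the chart plots seconds, the recorded values need converting before they are charted. A minimal sketch of that conversion follows. Both function names are hypothetical, and either the spreadsheet or the program could do the division instead.

```cpp
#include <chrono>

// Hypothetical helper: convert a steady_clock duration to fractional seconds.
double toSeconds(std::chrono::steady_clock::duration span) {
    return std::chrono::duration<double>(span).count();
}

// Hypothetical helper: convert a millisecond count, as printed by
// reportTime(), to seconds.
double msToSeconds(long long ms) {
    return ms / 1000.0;
}
```

For example, a reported time of 250 millisecs corresponds to 0.25 seconds on the vertical axis.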
Prepare a 3D Look Realistic column chart showing the clock times in seconds
along the vertical axis and the number of rows/columns (n)
along the horizontal axis as shown below.

You can create the chart in Open Office using the following steps:
- Highlight data and labels
- Select Chart in the Toolbar
- Chart Type - check 3D Look Realistic Column
- Data Range - 1st row as label, 1st column as label
- Chart Elements - add your title, your subtitle, your axes labels
You can create the chart in Excel using the following steps:
- Select Insert Tab -> Column -> 3D Clustered Column
- Select Data -> remove n -> select edit on horizontal axis labels -> add n column (500-4000)
- Select Chart tools -> Layout -> Chart Title - enter title and subtitle
- Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label
Save your chart as part of your spreadsheet file.
Submission
Copy the results of your initial tests for both versions into a file
named w4.txt. This file should include
- a listing of your cblas version
- output from running your cblas version
- a listing of your cuBLAS version
- output from running your cuBLAS version
Upload your typescript to Blackboard:
- Login to
- Select your course code
- Select Workshop 4 under Workshops
- Upload w4.txt and w4.ods or
w4.xls
- Under "Add Comments" describe to your instructor in detail
what you have learned in completing this workshop.
- When ready to submit, press "Submit"