Part B - CUDA Programming Model

CUDA Libraries

Describe Nvidia's implementation of BLAS on CUDA
List the other libraries that ship with the CUDA Toolkit

cuBLAS | Visual Studio Builds | Other Libraries | Exercises


The CUDA Toolkit includes a set of libraries that implement standard mathematical algorithms developed in the field of High Performance Computing.  These libraries cover the BLAS standard, fast Fourier transforms, sparse linear algebra, random number generation, and image and video processing.  Some of these algorithms, originally developed in the Fortran language, assume 1-based indexing and storage of two-dimensional matrices in column-major order. 

This chapter describes the three levels of the cuBLAS library and introduces the other libraries.  It covers the transitions between 0-based and 1-based indexing and between row-major and column-major ordering.  The section on cuBLAS includes examples that illustrate the conventions for calling the cuBLAS functions. 


cuBLAS

cuBLAS is Nvidia's CUDA implementation of the BLAS standard. 

The cuBLAS library exposes two sets of APIs:

  • cuBLAS API - the standard interface - the application allocates the required matrices and vectors in the GPU memory space, fills them with data, calls the cuBLAS function(s) and downloads the results back to the host.
  • cuBLASXt API - the multi-GPU interface for 64-bit platforms - the application keeps the data on the host and the library dispatches the operation(s) to one or more GPUs, depending on the application's request.

Documentation for both APIs is available in the cuBLAS section of the CUDA Toolkit documentation.

The associated header files are:

  • Legacy - cublas.h
  • Regular - cublas_v2.h

For the cuBLAS API we allocate and deallocate device memory host-side.  This API follows the same naming convention as the BLAS standard. 

Overview of the cuBLAS API

Features

The regular version of the cuBLAS API includes:

  • a handle to a context that is passed to each function
  • scalar parameters α and β that can be passed by reference on the host or the device
  • scalar results returned by reference
  • an error status of type cublasStatus_t returned by each function
  • CUBLAS_STATUS_SUCCESS returned by each function on successful execution

Context

Before calling any cuBLAS function, we create a cuBLAS context.  The call returns a handle to that context:

 cublasHandle_t handle;
 cublasStatus_t status;
 status = cublasCreate(&handle); 

Once finished with cuBLAS, we destroy the context by calling

 cublasDestroy(handle); 

Memory Allocation and Copying

We allocate memory for the matrices and vectors on both the host and the device, fill host memory with the appropriate values, upload the data to the device, perform the operations on the device and download the results to the host.

Compatibility with Fortran

For compatibility with Fortran, cuBLAS uses:

  1. 1-based array indexing
  2. column-major storage for two-dimensional arrays

C and C++ functions assume 0-based indexing and row-major order.  When calling any cuBLAS function from C or C++, we need to convert to the cuBLAS conventions.  One solution is to linearize the two-dimensional arrays into one-dimensional arrays and use inline functions to implement matrices on top of these arrays. 

To convert to column-major order, we calculate the C/C++ storing index in an inline function idx() and replace the array index with a call to this inline function:

 // Column-Major Order
 // colMajor.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>

 inline int idx(int i, int j, int ld) {
     return j * ld + i;
 }

 int main(int argc, char** argv) {
     if (argc != 3) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_rows no_of_cols\n"; 
         return 1;
     }
     int nr = std::atoi(argv[1]);
     int nc = std::atoi(argv[2]);
     float* a = new float[nr * nc];
     std::cout << std::fixed << std::setprecision(1); 

     // initialize in column-major order
     for (int i = 0; i < nr; i++)
         for (int j = 0; j < nc; j++)
             a[idx(i, j, nr)] =
              (float)i + j * 0.1f; 

     // display in column-major order
     for (int i = 0; i < nr; i++) {
         for (int j = 0; j < nc; j++) {
             std::cout << std::setw(4);
             std::cout << a[idx(i, j, nr)];
         }
         std::cout << std::endl;
     }
     std::cout << std::endl;

     // display the stored order
     int k = 0;
     for (int i = 0; i < nc; i++) {
         for (int j = 0; j < nr; j++)
             std::cout << std::setw(4) << a[k++];
         std::cout << std::endl;
     }

     delete [] a;
 }

 >a 5 6
  0.0 0.1 0.2 0.3 0.4 0.5
  1.0 1.1 1.2 1.3 1.4 1.5
  2.0 2.1 2.2 2.3 2.4 2.5
  3.0 3.1 3.2 3.3 3.4 3.5
  4.0 4.1 4.2 4.3 4.4 4.5

  0.0 1.0 2.0 3.0 4.0
  0.1 1.1 2.1 3.1 4.1
  0.2 1.2 2.2 3.2 4.2
  0.3 1.3 2.3 3.3 4.3
  0.4 1.4 2.4 3.4 4.4
  0.5 1.5 2.5 3.5 4.5

To convert to 1-based indexing, we change the calculation of the C/C++ storing index in the inline function idx() and the iteration arguments:

 // 1 Based Indexing
 // 1_based_indexing.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>

 inline int idx(int i, int j, int ld) {
     return (j - 1) * ld + i - 1;
 }

 int main(int argc, char** argv) {
     if (argc != 3) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_rows no_of_cols\n"; 
         return 1;
     }
     int nr = std::atoi(argv[1]);
     int nc = std::atoi(argv[2]);
     float* a = new float[nr * nc];
     std::cout << std::fixed << std::setprecision(1); 

     // initialize in column-major order
     for (int i = 1; i <= nr; i++)
         for (int j = 1; j <= nc; j++)
             a[idx(i, j, nr)] =
              (float)(i - 1) + (j - 1) *
              0.1f; 

     // display in column-major order
     for (int i = 1; i <= nr; i++) {
         for (int j = 1; j <= nc; j++) {
             std::cout << std::setw(4);
             std::cout << a[idx(i, j, nr)];
         }
         std::cout << std::endl;
     }
     std::cout << std::endl;

     // display the stored order
     int k = 0;
     for (int i = 1; i <= nc; i++) {
         for (int j = 1; j <= nr; j++)
             std::cout << std::setw(4) << a[k++];
         std::cout << std::endl;
     }

     delete [] a;
 }

 >a 5 6
  0.0 0.1 0.2 0.3 0.4 0.5
  1.0 1.1 1.2 1.3 1.4 1.5
  2.0 2.1 2.2 2.3 2.4 2.5
  3.0 3.1 3.2 3.3 3.4 3.5
  4.0 4.1 4.2 4.3 4.4 4.5

  0.0 1.0 2.0 3.0 4.0
  0.1 1.1 2.1 3.1 4.1
  0.2 1.2 2.2 3.2 4.2
  0.3 1.3 2.3 3.3 4.3
  0.4 1.4 2.4 3.4 4.4
  0.5 1.5 2.5 3.5 4.5

Compilation and Linking

To compile and link a cuBLAS program from the command line, we link the cublas library:

 nvcc sourceFile.cpp -lcublas

To compile, link and execute a cuBLAS program in Visual Studio, we set the following:

  • Project Properties -> Configuration Properties -> Linker -> Input: insert cublas.lib;
  • Project Properties -> Configuration Properties -> Build Events -> Post-Build Event -> Command Line: append copy "$(CudaToolkitBinDir)\cublas*.dll" "$(OutDir)"

Helpers

The cuBLAS library provides helper functions for writing data to and retrieving data from the device.  These helper functions assist in building and retrieving vectors and matrices for the cuBLAS operations.

cublasSetVector

This function copies n elements from host vector h_x to device vector d_x.  size defines the size of each element, inc_h_x defines the element spacing on the host, and inc_d_x defines the element spacing on the device.

 cublasStatus_t
 cublasSetVector(int n, int size, const void* h_x, int inc_h_x, 
  void* d_x, int inc_d_x);

cublasGetVector

This function copies n elements from device vector d_x to host vector h_x.  size defines the size of each element, inc_d_x defines the element spacing on the device, and inc_h_x defines the element spacing on the host.

 cublasStatus_t
 cublasGetVector(int n, int size, const void* d_x, int inc_d_x, 
  void* h_x, int inc_h_x);

cublasSetMatrix

This function copies an r by c host matrix h_A to a device matrix d_A.  size defines the size of each element, ld_h_A defines the leading dimension on the host, and ld_d_A defines the leading dimension on the device.

 cublasStatus_t
 cublasSetMatrix(int r, int c, int size, const void* h_A, int ld_h_A, 
  void* d_A, int ld_d_A);

cublasGetMatrix

This function copies an r by c device matrix d_A to a host matrix h_A.  size defines the size of each element, ld_d_A defines the leading dimension on the device, and ld_h_A defines the leading dimension on the host.

 cublasStatus_t
 cublasGetMatrix(int r, int c, int size, const void* d_A, int ld_d_A, 
  void* h_A, int ld_h_A);

Level 1 Functions

The Level 1 functions perform the scalar and vector operations.  The following example illustrates the use of one of these functions along with helper support and error handling.

cublasIsamax()

This function returns the 1-based index of the first element with the largest magnitude.  The function prototype is:

 cublasStatus_t
 cublasIsamax(cublasHandle_t, int n, const float* d_x, int inc_d_x, int* idx);

The following program fills a vector with random values and reports the element with the largest magnitude:

 // Level 1 cuBLAS
 // cuBLAS_1.cpp

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>
 #include "cublas_v2.h"

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_elements\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     float* d_a;
     float* h_a = new float[n];
     cudaMalloc((void**)&d_a, n * sizeof(float));

     float f = 1.0f / RAND_MAX;
     for (int i = 0; i < n; i++)
         h_a[i] = f * rand();

     cublasStatus_t status;
     cublasHandle_t handle;
     // create the context
     status = cublasCreate(&handle);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasCreate failed***\n";
         return 2;
     }
     // copy to device
     int inc_h_a = 1;
     int inc_d_a = 1;
     status = cublasSetVector(n, sizeof(float), h_a, inc_h_a, d_a, inc_d_a);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSetVector failed***\n";
         return 2;
     }
     // retrieve index of the first maximum
     int idx_max;
     status = cublasIsamax(handle, n, d_a, inc_d_a, &idx_max);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasIsamax failed***\n";
         return 2;
     }

     std::cout << "First largest is " << h_a[idx_max - 1] << std::endl; 

     // destroy the context
     cublasDestroy(handle);
     cudaFree(d_a);
     delete [] h_a;
 }

Level 2 Functions

The Level 2 functions perform matrix-vector operations.  The following example illustrates use of the sgemv function along with helper support and error handling.

cublasSgemv()

This function performs the Level 2 matrix-vector multiplication and returns the result in vector d_y.  The prototype is:

 cublasStatus_t
 cublasSgemv(cublasHandle_t, cublasOperation_t, int m, int n,
  const float* alpha, const float* d_A, int ld_d_A, const float* d_x,
  int inc_d_x, const float* beta, float* d_y, int inc_d_y);

The vector and matrix components are sequential values, with the matrix size specified through command line arguments:

 // Level 2 cuBLAS
 // cuBLAS_2.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 #include "cublas_v2.h"

 inline int idx(int i, int j, int nr) {
     return j * nr + i;
 }

 // display matrix a, which is stored in column-major order
 //
 void display(const char* str, const float* a, int nr, int nc) {
     std::cout << str << std::endl;
     for (int i = 0; i < nr; i++) {
         for (int j = 0; j < nc; j++)
             std::cout << std::setw(10) << a[idx(i, j, nr)];
         std::cout << std::endl;
     }
     std::cout << std::endl;
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_rows no_of_cols\n"; 
         return 1;
     }
     int m = std::atoi(argv[1]);
     int n = std::atoi(argv[2]);
     float* d_x;
     float* d_y;
     float* d_A;
     float* h_x = new float[n];
     float* h_y = new float[m];
     float* h_A = new float[m * n];
     cudaMalloc((void**)&d_x, n * sizeof(float));
     cudaMalloc((void**)&d_y, m * sizeof(float));
     cudaMalloc((void**)&d_A, m * n * sizeof(float));

     int k = 0;
     for (int i = 0; i < n; i++) {
         h_x[i] = float(i + 1);
         for (int j = 0; j < m; j++) {
             h_A[k] = float(k);
             k++;
         }
     }

     cublasStatus_t status;
     cublasHandle_t handle;
     // create the context
     status = cublasCreate(&handle);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasCreate failed***\n";
         return 2;
     }
     // copy host vector h_x to device vector d_x
     status = cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSetVector x failed***\n";
         return 2;
     }
     // copy host matrix h_A to device matrix d_A
     status = cublasSetMatrix(m, n, sizeof(float), h_A, m, d_A, m); 
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSetMatrix A failed***\n";
         return 2;
     }
     // level 2 calculation y = alpha * A * x + beta * y
     float alpha = 1.0f;
     float beta  = 0.0f;
     int lda = m;
     status = cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A,
      lda, d_x, 1, &beta, d_y, 1);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSgemv failed***\n";
         return 2;
     }
     // copy device vector d_y to host vector h_y
     status = cublasGetVector(m, sizeof(float), d_y, 1, h_y, 1);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasGetVector y failed***\n";
         return 2;
     }

     // display results
     std::cout << std::fixed << std::setprecision(4);
     display("A :", h_A, m, n);
     display("x :", h_x, n, 1);
     display("y = A x :", h_y, m, 1);
     std::cout << "done " << std::endl;

     // destroy the context
     cublasDestroy(handle);
     cudaFree(d_x);
     cudaFree(d_y);
     cudaFree(d_A);
     delete [] h_x;
     delete [] h_y;
     delete [] h_A;
 }

The output for command line arguments 3 5 is:

 A :
     0.0000    3.0000    6.0000    9.0000   12.0000 
     1.0000    4.0000    7.0000   10.0000   13.0000
     2.0000    5.0000    8.0000   11.0000   14.0000

 x :
     1.0000
     2.0000
     3.0000
     4.0000
     5.0000

 y = A x :
   120.0000
   135.0000
   150.0000

 done

Level 3 Functions

The Level 3 functions perform matrix-matrix operations.  The following example illustrates use of the sgemm function along with helper support and error handling.

cublasSgemm()

This function performs the Level 3 matrix-matrix multiplication and returns the result in matrix C.  The prototype is:

 cublasStatus_t
 cublasSgemm(cublasHandle_t, cublasOperation_t, cublasOperation_t,
  int m, int n, int k, const float* alpha, const float* d_A, int ld_d_A,
  const float* d_B, int ld_d_B, const float* beta, float* d_C, int ld_d_C);

In the following example the matrix components are sequential values and the matrix sizes are command line arguments:

 // Level 3 cuBLAS
 // cuBLAS_3.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 #include "cublas_v2.h"

 inline int idx(int i, int j, int nr) {
     return j * nr + i;
 }

 // display matrix a, which is stored in column-major order
 //
 void display(const char* str, const float* a, int nr, int nc) {
     std::cout << str << std::endl;
     for (int i = 0; i < nr; i++) {
         for (int j = 0; j < nc; j++)
             std::cout << std::setw(10) << a[idx(i, j, nr)];
         std::cout << std::endl;
     }
     std::cout << std::endl;
 }

 int main(int argc, char* argv[]) {
     if (argc != 4) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_rows_A_C no_of_cols_B_C "
                   "no_of_cols_A|no_of_rows_B\n"; 
         return 1;
     }
     int m = atoi(argv[1]); // number of rows in A, C
     int n = atoi(argv[2]); // number of columns in B, C
     int k = atoi(argv[3]); // number of columns in A, rows in B
     float* d_A;
     float* d_B;
     float* d_C;
     float* h_A = new float[m * k];
     float* h_B = new float[k * n];
     float* h_C = new float[m * n];
     cudaMalloc((void**)&d_A, m * k * sizeof(float));
     cudaMalloc((void**)&d_B, k * n * sizeof(float));
     cudaMalloc((void**)&d_C, m * n * sizeof(float));

     int kk = 0;
     for (int i = 0; i < m; i++)
         for (int j = 0; j < k; j++) {
             h_A[kk] = float(kk);
             kk++;
         }
     kk = 0;
     for (int i = 0; i < k; i++)
         for (int j = 0; j < n; j++) {
             h_B[kk] = float(kk);
             kk++;
         }

     cublasStatus_t status;
     cublasHandle_t handle;
     // create the context
     status = cublasCreate(&handle);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasCreate failed***\n";
         return 2;
     }

     // copy host matrix h_A to device matrix d_A [m, k]
     int ld_h_A = m;
     int ld_d_A = m;
     status = cublasSetMatrix(m, k, sizeof(float), h_A, ld_h_A, d_A, ld_d_A);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSetMatrix A failed***\n";
         return 2;
     }
     // copy host matrix h_B to device matrix d_B [k, n]
     int ld_h_B = k;
     int ld_d_B = k;
     status = cublasSetMatrix(k, n, sizeof(float), h_B, ld_h_B, d_B, ld_d_B);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSetMatrix B failed***\n";
         return 2;
     }
     // level 3 calculation: C = alpha * A * B + beta * C
     float alpha = 1.0f;
     float beta  = 0.0f;
     int ld_d_C = m;
     status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, 
      &alpha, d_A, ld_d_A, d_B, ld_d_B, &beta, d_C, ld_d_C);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasSgemm failed***\n";
         return 2;
     }
     // copy device matrix d_C to host matrix h_C [m, n]
     status = cublasGetMatrix(m, n, sizeof(float), d_C, m, h_C, m);
     if (status != CUBLAS_STATUS_SUCCESS) {
         std::cerr << "***cublasGetMatrix C failed***\n";
         return 2;
     }

     // display results
     std::cout << std::fixed << std::setprecision(4);
     display("A :", h_A, m, k);
     display("B :", h_B, k, n);
     display("C = A B :", h_C, m, n);
     std::cout << "done " << std::endl;

     // destroy the context
     cublasDestroy(handle);
     cudaFree(d_A);
     cudaFree(d_B);
     cudaFree(d_C);
     delete [] h_A;
     delete [] h_B;
     delete [] h_C;
 }

The output for command line arguments 4 5 6 is:

 A :
     0.0000    4.0000    8.0000   12.0000   16.0000   20.0000 
     1.0000    5.0000    9.0000   13.0000   17.0000   21.0000
     2.0000    6.0000   10.0000   14.0000   18.0000   22.0000
     3.0000    7.0000   11.0000   15.0000   19.0000   23.0000

 B :
     0.0000    6.0000   12.0000   18.0000   24.0000
     1.0000    7.0000   13.0000   19.0000   25.0000
     2.0000    8.0000   14.0000   20.0000   26.0000
     3.0000    9.0000   15.0000   21.0000   27.0000
     4.0000   10.0000   16.0000   22.0000   28.0000
     5.0000   11.0000   17.0000   23.0000   29.0000

 C = A B :
   220.0000  580.0000  940.0000 1300.0000 1660.0000
   235.0000  631.0000 1027.0000 1423.0000 1819.0000
   250.0000  682.0000 1114.0000 1546.0000 1978.0000
   265.0000  733.0000 1201.0000 1669.0000 2137.0000

 done

Visual Studio Builds

A Visual Studio project that accesses the CUDA libraries can be built in either of two ways:

  • as a Visual C++ project with additional include and linker files to be referenced
  • as an NVIDIA CUDA project that has some of the additional files referenced by default

Visual C++ Project

The steps involved in building a Visual C++ project are:

  • Start New Project
  • Select Visual C++ -> Empty Project
  • Project -> Add New Item -> C++ File
  • Project -> Properties -> Configuration Properties -> C/C++ -> General -> Additional Include Directories
  • Add %CUDA_PATH%\include
  • Apply
  • Linker -> General -> Additional Library Directories
  • Add %CUDA_PATH%\lib\x64
  • Apply
  • Linker -> Input -> Additional Dependencies
  • Add cudart.lib; cublas.lib
  • OK
  • Build

NVIDIA CUDA 8.0 Project

The steps involved in building an NVIDIA CUDA project are:

  • Start New Project
  • Select NVIDIA -> CUDA 8.0 -> CUDA 8.0 Runtime
  • Right Click on kernel.cu -> Remove -> Delete
  • Select Project in Solution Explorer
  • Project -> Add New Item -> C++ File
  • add source code
  • Build

Missing DLL Error

Some installations report a message that cudart??_80.dll is missing.  To correct this:

  • Project -> Properties -> Configuration Properties -> Build Events -> Post-Build Events -> Command Line
  • enter copy "%CUDA_PATH%\bin\cudart*.dll" "$(OutDir)"
  • enter copy "%CUDA_PATH%\bin\cublas*.dll" "$(OutDir)"
  • OK
  • Re-Build

Other Libraries

cuSparse

cuSPARSE is an implementation of the three levels of Sparse BLAS on top of the CUDA runtime.  The documentation is available online.

cuFFT

cuFFT provides facilities for fast Fourier transform calculations.  The fast Fourier transform is an algorithm used extensively in computational physics and signal processing.  The documentation is available online.

cuRand

cuRAND provides facilities for high-quality pseudo-random number generation.  The documentation is available online.


Exercises

  • Complete the Workshop on cuBLAS






  Designed by Chris Szalwinski