Part B - CUDA Programming Model

Parallel Profiling

Introduce the CUDA Visual Profiler
Introduce NSight for Visual Studio



The CUDA Toolkit includes a profiler that collects performance data from code executing on the GPU.  The performance of a CUDA application depends on many factors, including the compute to global memory access (CGMA) ratio of the kernel logic, the memory copy throughput, the compute to memory copy ratio, the multiprocessor utilization and the patterns used to access global memory.  To optimize an application's performance, we measure these quantities using a parallel profiler.

This chapter introduces the parallel profiling facilities that are available for CUDA applications.  The chapter describes the Visual Profiler that ships with the toolkit and the Parallel NSight Profiler that integrates with the Visual Studio and Eclipse IDEs. 


Visual Profiler

The Visual Profiler analyzes an application by executing it several times and reporting the average results for the time spent and the ratios achieved.  Based on these results, the Visual Profiler suggests where to optimize the application.  Reported times are in milliseconds and microseconds.

To profile an application, first compile it using nvcc.  Then, to launch the Visual Profiler, enter the following at the command line:

 nvvp
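For example, compiling the naive matrix multiplication presented below and launching the profiler might look like this (the file and executable names are illustrative):

```shell
# compile the source file into an executable
nvcc matMul_0.cu -o matMul_0

# launch the Visual Profiler
nvvp
```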

The Visual Profiler opens as an independent application.  To profile the application:

  • Select File->New Session
    • Browse to find the name of the executable file
    • Enter the name of the working directory (defaults to the current directory)
    • Enter the command-line arguments
    • Press Next
    • Enter the maximum execution time (optional)
    • If necessary, uncheck Enable unified memory profiling
    • Press Finish
    • The profiler will analyze the executable automatically if the Run Analysis Box is checked


  • The profile display is shown below

    [Figure: Visual Profiler]


    • The timeline for the application appears in the top left frame
    • Numerical values appear in the top right frame (to obtain detailed values for a particular stage, hover the mouse over that stage in the top left frame)
    • The analysis, the results and the settings appear in the lower frame
      • The Analysis tab summarizes the results for four categories
        • Timeline
        • Multiprocessor
        • Kernel Memory
        • Kernel Instruction
      • The Details tab provides statistics for each stage of the analysis
      • The Console tab displays the console output
      • The Settings tab shows the settings for the current analysis
  • To analyze the results for a different set of command-line arguments
    • Change the command-line arguments under the Settings tab
    • Select Run->Analyze Application
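The toolkit also includes nvprof, a command-line profiler that collects the same data without the GUI.  A quick check from the command line might look like this (the executable name is illustrative, and the metric names shown vary by GPU generation):

```shell
# summary mode: time spent in each kernel and memory copy
nvprof ./matMul_0 1000 1000 1000

# collect selected metrics for each kernel launch
nvprof --metrics achieved_occupancy,gld_throughput ./matMul_0 1000 1000 1000
```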

Example

The following program multiplies two matrices together on the device and returns the result to the host.  This naive version does not include any optimizations:

 // Matrix Multiplication - Naive Version
 // matMul_0.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 #include "device_launch_parameters.h" // intellisense on CUDA syntax

 const int ntpb = 32;

 __global__ void matMul(const float* d_a, const float* d_b, float* d_c,
  int ni, int nj, int nk) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < ni && j < nj) {
         float sum = 0.0f;
         for (int k = 0; k < nk; k++)
             sum += d_a[i * nk + k] * d_b[k * nj + j];
         d_c[i * nj + j] = sum;
     }
 }

 // display matrix a, which is stored in row-major order
 //
 void display(const char* str, const float* a, int ni, int nj) {
     std::cout << str << std::endl;
     for (int i = 0; i < ni; i++) {
         for (int j = 0; j < nj; j++)
             std::cout << std::setw(10) << a[i * nj + j];
         std::cout << std::endl;
     }
     std::cout << std::endl;
 }

 int main(int argc, char* argv[]) {
     if (argc < 4) {
         std::cerr << "Usage: " << argv[0] << " m n k [d]" << std::endl;
         return 1;
     }
     int m = atoi(argv[1]); // number of rows in A, C
     int n = atoi(argv[2]); // number of columns in B, C
     int k = atoi(argv[3]); // number of columns in A, rows in B
     bool disp = argc == 5; // display results?
     float* d_a;
     float* d_b;
     float* d_c;
     float* h_a = new float[m * k];
     float* h_b = new float[k * n];
     float* h_c = new float[m * n];
     cudaMalloc((void**)&d_a, m * k * sizeof(float));
     cudaMalloc((void**)&d_b, k * n * sizeof(float));
     cudaMalloc((void**)&d_c, m * n * sizeof(float));

     // initialize a[] and b[]
     int kk = 0;
     for (int i = 0; i < m; i++)
         for (int j = 0; j < k; j++) {
             h_a[kk] = (float)kk; // assign before incrementing to avoid an
             kk++;                // unsequenced modification of kk
         }
     kk = 0;
     for (int i = 0; i < k; i++)
         for (int j = 0; j < n; j++) {
             h_b[kk] = (float)kk;
             kk++;
         }

     // copy to the device
     cudaMemcpy(d_a, h_a, m * k * sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(d_b, h_b, k * n * sizeof(float), cudaMemcpyHostToDevice);

     // launch grid
     int nbx = (m + ntpb - 1) / ntpb;
     int nby = (n + ntpb - 1) / ntpb;
     dim3 dGrid(nbx, nby);
     dim3 dBlock(ntpb, ntpb);
     matMul<<<dGrid, dBlock>>>(d_a, d_b, d_c, m, n, k);

     // copy from the device
     cudaMemcpy(h_c, d_c, m * n * sizeof(float), cudaMemcpyDeviceToHost);

     // display results
     if (disp) {
         std::cout << std::fixed << std::setprecision(4);
         display("A :", h_a, m, k);
         display("B :", h_b, k, n);
         display("C = A B :", h_c, m, n);
     }
     std::cout << "done " << std::endl;

     // deallocate
     cudaFree(d_a);
     cudaFree(d_b);
     cudaFree(d_c);
     delete [] h_a;
     delete [] h_b;
     delete [] h_c;
     cudaDeviceReset();
 }

The cudaDeviceReset() API call at the end of main() is necessary for profiling.  It resets the device and flushes all of the profiling data collected during the current run before the application terminates, so that the profiler can retrieve it.
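An alternative, when resetting the device is undesirable, is to bracket the region of interest with the profiler control API declared in cuda_profiler_api.h; cudaProfilerStop() also flushes the collected data.  A minimal sketch (this must be compiled with nvcc and requires a CUDA-capable GPU to run; the kernel name is illustrative):

```cpp
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

__global__ void emptyKernel() {}

int main() {
    cudaProfilerStart();      // begin collecting profile data here
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // wait for the kernel to finish
    cudaProfilerStop();       // stop collecting and flush the data
    return 0;
}
```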

The output for command line arguments 4 5 6 d is:

 A :
     0.0000    1.0000    2.0000    3.0000    4.0000    5.0000
     6.0000    7.0000    8.0000    9.0000   10.0000   11.0000
    12.0000   13.0000   14.0000   15.0000   16.0000   17.0000
    18.0000   19.0000   20.0000   21.0000   22.0000   23.0000

 B :
     0.0000    1.0000    2.0000    3.0000    4.0000
     5.0000    6.0000    7.0000    8.0000    9.0000
    10.0000   11.0000   12.0000   13.0000   14.0000
    15.0000   16.0000   17.0000   18.0000   19.0000
    20.0000   21.0000   22.0000   23.0000   24.0000
    25.0000   26.0000   27.0000   28.0000   29.0000

 C = A B :
   275.0000  290.0000  305.0000  320.0000  335.0000
   725.0000  776.0000  827.0000  878.0000  929.0000
  1175.0000 1262.0000 1349.0000 1436.0000 1523.0000
  1625.0000 1748.0000 1871.0000 1994.0000 2117.0000

 done

The profile for command line arguments 1000 1000 1000 is:

[Figure: Visual Profiler]


NVIDIA NSight 5.2

NVIDIA NSight 5.2 is a development environment for CUDA and graphics applications running on NVIDIA GPUs.  Its Visual Studio and Eclipse editions integrate with those IDEs to provide analysis and profiling facilities for CUDA applications.  The documentation is available on NVIDIA's website.

Visual Studio 2015

To prepare for a Visual Studio profile:

  • Create an NVIDIA CUDA 8.0 Runtime Project
  • Replace kernel.cu code with the source code above
  • Rename the source file to matMul_0.cu
  • Build the solution
  • Add the command line arguments
    • Solution explorer - Right click the project - Select Properties->Configuration Properties->Debugging->Command Arguments
    • Enter the arguments
  • Execute the solution

Analysis Tools

The Activity Configuration Document provides an overview of the available activities.  To generate a timeline:

  • NSight -> Start Performance Analysis
  • Select Trace Application under Activity Type
  • Select CUDA under Trace Settings
  • Select the Launch button

[Figure: Start NSight]


To view the timeline

  • Select Timeline in the drop-down menu below the window's title bar
  • Hover over a specific bar to obtain detailed information about that bar

[Figure: NSight Timeline]


Exercises






  Designed by Chris Szalwinski
Creative Commons License