The CUDA Toolkit includes a profiler that measures execution on the GPU. The
performance of a CUDA application depends on many factors, including the
compute to global memory access ratio of the kernel logic, the memory copy
throughput, the compute to memory copy ratio, the multiprocessor utilization,
and the access patterns of global memory loads. To optimize an application's
performance, we measure these quantities using a parallel profiler.
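For example, the inner loop of the naive matrix multiplication kernel shown
later in this chapter performs two floating-point operations (one multiply
and one add) for every two global memory loads, giving a compute to global
memory access ratio of 1.0.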
This chapter introduces the parallel profiling facilities that are
available for CUDA applications. The chapter describes the
Visual Profiler that ships with the toolkit and the NVIDIA Nsight
profiler that integrates with the Visual Studio and Eclipse
IDEs.
Visual Profiler
The Visual Profiler analyzes an application by executing it multiple
times and reports the average results in terms of the time spent and
the ratios achieved. Based on these results, the Visual Profiler
provides advice on where to optimize the application. Reported
times are in milliseconds and microseconds.
To profile an application, first compile the application using
nvcc. Then, to launch the Visual Profiler, enter the following at
the command line:
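 nvvp

For example, assuming the source file below is saved as matMul_0.cu and the
toolkit binaries are on the PATH:

 nvcc matMul_0.cu -o matMul_0
 nvvp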
The Visual Profiler opens as an independent, stand-alone
application. To profile the application:
- Select File->New Session
- Browse to find the name of the executable file
- Enter the name of the working directory (defaults to the current directory)
- Enter the command-line arguments
- Press Next
- Enter the maximum execution time (optional)
- If necessary, uncheck Enable unified memory profiling
- Press Finish
- The profiler analyzes the executable automatically if the Run Analysis box is checked
- The profile display is shown below
[Figure: the Visual Profiler display]
- The timeline for the application appears in the left top frame
- Numerical values appear in the right top frame (to obtain detailed values
for a particular stage, hover the mouse over that stage in the
left top frame)
- The analysis, the results and the settings appear in the lower frame
- The Analysis tab summarizes the results for four categories
- Timeline
- Multiprocessor
- Kernel Memory
- Kernel Instruction
- The Details tab provides statistics for each stage of the analysis
- The Console tab displays the console output
- The Settings tab shows the settings for the current analysis
- To analyze the results for a different set of command-line arguments
- change the command-line arguments under the Settings tab
- select Run->Analyze Application
Example
The following program multiplies two matrices on the device
and returns the result to the host. This naive version does not
include any optimizations:
// Matrix Multiplication - Naive Version
// matMul_0.cu

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cuda_runtime.h>
#include "device_launch_parameters.h" // intellisense on CUDA syntax

const int ntpb = 32; // number of threads per block in each dimension

__global__ void matMul(const float* d_a, const float* d_b, float* d_c,
    int ni, int nj, int nk) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < ni && j < nj) {
        float sum = 0.0f;
        for (int k = 0; k < nk; k++)
            sum += d_a[i * nk + k] * d_b[k * nj + j];
        d_c[i * nj + j] = sum;
    }
}

// display matrix a, which is stored in row-major order
//
void display(const char* str, const float* a, int ni, int nj) {
    std::cout << str << std::endl;
    for (int i = 0; i < ni; i++) {
        for (int j = 0; j < nj; j++)
            std::cout << std::setw(10) << a[i * nj + j];
        std::cout << std::endl;
    }
    std::cout << std::endl;
}

int main(int argc, char* argv[]) {
    if (argc < 4) {
        std::cout << "Incorrect number of arguments" << std::endl;
        return 1;
    }
    int m = atoi(argv[1]); // number of rows in A, C
    int n = atoi(argv[2]); // number of columns in B, C
    int k = atoi(argv[3]); // number of columns in A, rows in B
    bool disp = argc == 5; // display results?

    float* d_a;
    float* d_b;
    float* d_c;
    float* h_a = new float[m * k];
    float* h_b = new float[k * n];
    float* h_c = new float[m * n];
    cudaMalloc((void**)&d_a, m * k * sizeof(float));
    cudaMalloc((void**)&d_b, k * n * sizeof(float));
    cudaMalloc((void**)&d_c, m * n * sizeof(float));

    // initialize a[] and b[]
    int kk = 0;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < k; j++, kk++)
            h_a[kk] = (float)kk;
    kk = 0;
    for (int i = 0; i < k; i++)
        for (int j = 0; j < n; j++, kk++)
            h_b[kk] = (float)kk;

    // copy to the device
    cudaMemcpy(d_a, h_a, m * k * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, k * n * sizeof(float), cudaMemcpyHostToDevice);

    // launch grid
    int nbx = (m + ntpb - 1) / ntpb;
    int nby = (n + ntpb - 1) / ntpb;
    dim3 dGrid(nbx, nby);
    dim3 dBlock(ntpb, ntpb);
    matMul<<<dGrid, dBlock>>>(d_a, d_b, d_c, m, n, k);

    // copy from the device
    cudaMemcpy(h_c, d_c, m * n * sizeof(float), cudaMemcpyDeviceToHost);

    // display results
    if (disp) {
        std::cout << std::fixed << std::setprecision(4);
        display("A :", h_a, m, k);
        display("B :", h_b, k, n);
        display("C = A B :", h_c, m, n);
    }
    std::cout << "done " << std::endl;

    // deallocate
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    delete [] h_a;
    delete [] h_b;
    delete [] h_c;

    // flush profiling data before exit
    cudaDeviceReset();
}
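For example, with command-line arguments 1000 1000 1000, nbx = nby =
(1000 + 31) / 32 = 32, so the launch creates a 32 x 32 grid of 32 x 32
blocks; the bounds check in the kernel masks off the threads beyond row 999
and column 999.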
The cudaDeviceReset() API call is necessary for
profiling. It resets the device so that all of the profiling data for the
current run is flushed before a subsequent run begins.
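To restrict profiling to a region of interest, the CUDA runtime also
provides cudaProfilerStart() and cudaProfilerStop(), declared in
cuda_profiler_api.h. A minimal sketch, assuming we want to capture only the
kernel launch from the program above:

#include <cuda_profiler_api.h>
// ...
cudaProfilerStart(); // begin collecting profile data
matMul<<<dGrid, dBlock>>>(d_a, d_b, d_c, m, n, k);
cudaProfilerStop(); // stop collecting profile data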
The output for command-line arguments 4 5 6 d is:
A :
0.0000 1.0000 2.0000 3.0000 4.0000 5.0000
6.0000 7.0000 8.0000 9.0000 10.0000 11.0000
12.0000 13.0000 14.0000 15.0000 16.0000 17.0000
18.0000 19.0000 20.0000 21.0000 22.0000 23.0000
B :
0.0000 1.0000 2.0000 3.0000 4.0000
5.0000 6.0000 7.0000 8.0000 9.0000
10.0000 11.0000 12.0000 13.0000 14.0000
15.0000 16.0000 17.0000 18.0000 19.0000
20.0000 21.0000 22.0000 23.0000 24.0000
25.0000 26.0000 27.0000 28.0000 29.0000
C = A B :
275.0000 290.0000 305.0000 320.0000 335.0000
725.0000 776.0000 827.0000 878.0000 929.0000
1175.0000 1262.0000 1349.0000 1436.0000 1523.0000
1625.0000 1748.0000 1871.0000 1994.0000 2117.0000
done
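As a check, the first entry of C is the dot product of the first row of A
and the first column of B: 0·0 + 1·5 + 2·10 + 3·15 + 4·20 + 5·25 = 275.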
The profile for command-line arguments 1000 1000 1000 is:
[Figure: the Visual Profiler results for matMul_0 with arguments 1000 1000 1000]
NVIDIA Nsight 5.2
NVIDIA Nsight 5.2 is a development environment for CUDA and graphics applications
running on NVIDIA GPUs. It integrates with the Visual Studio and Eclipse IDEs
to provide analysis and profiling facilities for CUDA applications.
You can find the documentation
here.
Visual Studio 2015
To prepare for a Visual Studio profile:
- Create an NVIDIA CUDA 8.0 Runtime Project
- Replace the kernel.cu code with the source code above
- Rename the source file matMul_0.cu
- Build the solution
- Add the command-line arguments
- In Solution Explorer, right-click the project and select Properties->Configuration Properties->Debugging->Command Arguments
- Enter the arguments
- Execute the solution
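For example, to match the profile shown earlier, enter 1000 1000 1000 as the
command-line arguments.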
Analysis Tools
The Activity Configuration Document provides an overview of the activities available.
For a timeline:
- Select Nsight -> Start Performance Analysis
- Select Trace Application under Activity Type
- Select CUDA under Trace Settings
- Select the Launch button
[Figure: the performance analysis activity settings]
To view the timeline
- Select Timeline in the drop-down menu below the window's title bar
- Hover over a specific bar to obtain detailed information about that bar
[Figure: the Nsight timeline display]
Exercises