CUDA treats the GPU as a massively data-parallel processor. This simple
model subdivides the data into equal small parts and executes instructions on
each part concurrently. CUDA provides a direct interface to the GPU
hardware.
CUDA programmers write source code for the CPU and GPU in the same files.
The CUDA compiler driver distinguishes those parts that execute on the CPU as
host code from those parts that execute on the GPU as device
code. The driver passes the host code to the installed C/C++ compiler
and the device code to the CUDA PTX compiler. The PTX compiler translates
the device code into a pseudo-assembly language called PTX. The graphics
driver for the installed hardware then translates this PTX code into binary code
that executes on the GPU.
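To see this pipeline from the command line, the nvcc compiler driver can build a
mixed source file in one step or stop after the PTX stage. A minimal sketch,
assuming a source file named app.cu:
nvcc app.cu -o app
nvcc -ptx app.cu
The first command compiles both parts and links an executable; the second stops
after translation to PTX and writes app.ptx for inspection.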
This chapter outlines the instructions needed to install the CUDA Toolkit,
describes the API that manages the device and its memory, and introduces the
error-handling system used by CUDA.
Getting Started
As of January 2017, CUDA 8.0 is available for download for the following platform families:
- Windows: 7, 8.1, 10, Server 2008 R2, Server 2012 R2 (x86_64)
- Linux x86_64: Fedora 23, OpenSUSE 13.2, RHEL 7/6, CentOS 7/6, SLES 12/11-SP4 (64-bit), Ubuntu 16.04/14.04
- Linux ppc64le: Ubuntu 14.04
- Mac OS X x86_64: 10.12, 10.11
The release notes on the download pages contain up-to-date information regarding
known issues, bugs and workarounds.
All families require a CUDA-enabled GPU to complete the installation.
The CUDA-enabled GPUs are listed at https://developer.nvidia.com/cuda-gpus.
Windows
System Requirements
For a Windows installation, you need
- a CUDA-enabled GPU
- a supported version of Windows
- the CUDA Toolkit
- Visual Studio
To confirm that your GPU is CUDA-enabled, navigate to
Start > Control Panel > Hardware and Sound > Devices and Printers
> Device Manager > Display Adapters
If one of the display adapters is listed at https://developer.nvidia.com/cuda-gpus,
your GPU qualifies.
Installation
The installer package installs the driver, the toolkit and the samples.
If you've already installed a driver, make sure that it meets the minimum
requirements for your specific toolkit. To check whether you've already
installed a driver and, if so, its version number, navigate to
Start > Control Panel > Hardware and Sound > NVIDIA Control Panel
> System Information
By default, the CUDA Toolkit installs to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#,
where #.# is the version number. The directory
includes the following sub-directories:
- bin - compiler executables and runtime libraries
- include - header files for compiling CUDA source code
- lib - library files for linking CUDA object code
- doc - documentation in html and pdf form
The installer automatically sets the CUDA_PATH
environment variable to the installation directory and creates the .rules
file for locating the nvcc compiler driver.
By default, the samples install to C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.#.
The source code for each sample is under the corresponding <category_name> sub-directory.
Build Your Sample Executables
Build the sample executables as follows:
- navigate to the C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.# folder in File Explorer
- start the solution file that corresponds to your Visual Studio installation
- build the solution (this will take some time)
- note the configuration for which you have compiled the samples (required below)
Verification
Verify your installation as follows:
- check the driver version number - from the NVIDIA Control Panel
Start > Control Panel > NVIDIA Control Panel > System Information
- check path to current CUDA Toolkit - from the Windows command prompt
echo %CUDA_PATH%
- compiler driver version number - from the Windows command prompt
cd %CUDA_PATH%\bin
nvcc -V
- switch to the samples' executable directory for the configuration you built - from the Windows command prompt
cd %NVCUDASAMPLES_ROOT%\bin\win64\Debug
or
cd %NVCUDASAMPLES_ROOT%\bin\win64\Release
- run the following executables - from the Windows command prompt
deviceQuery
bandwidthTest
Device Management
Compute Capability
CUDA classifies devices according to their compute capability.
Compute capability consists of two integers: a major version and a minor version.
The major integer identifies the core architecture (1 - Tesla, 2 - Fermi, 3 - Kepler, 5 - Maxwell, 6 - Pascal), while
the minor integer identifies the revision within that architecture. For example, a device
of compute capability 5.0 is a first-generation Maxwell-class GPU.
A table of CUDA-enabled GPUs with compute capabilities is at
https://developer.nvidia.com/cuda-gpus.
Latest Drivers
Drivers for CUDA-enabled GPUs may be downloaded from
http://www.nvidia.com/Download/index.aspx?lang=en-us.
CUDA API Functions
The CUDA Toolkit includes a comprehensive library of functions for
managing devices.
These functions include the following (a short device-selection sketch follows the list):
- cudaGetDeviceCount(int*) - returns by parameter the number of compute-capable devices
- cudaGetDevice(int*) - returns by parameter the index of the device currently being used
- cudaGetDeviceProperties(struct cudaDeviceProp*, int) - returns by parameter the properties for
the device identified by the second argument
- cudaSetDevice(int) - selects a device
- cudaChooseDevice(int*, const struct cudaDeviceProp*) - identifies by parameter the device that
best matches the specified properties
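The following sketch shows how some of these functions might combine to select a
device; the requirement of compute capability 3.0 is only illustrative:
// Select a Device
// select_device.cpp - a minimal sketch, not part of the samples
#include <iostream>
#include <cuda_runtime.h>
int main() {
    int count;
    cudaGetDeviceCount(&count);                // number of compute-capable devices
    std::cout << count << " device(s) found" << std::endl;
    cudaDeviceProp desired = cudaDeviceProp(); // zero-initialized property set
    desired.major = 3;                         // ask for compute capability 3.0
    desired.minor = 0;
    int iDevice;
    cudaChooseDevice(&iDevice, &desired);      // index of the best match
    cudaSetDevice(iDevice);                    // use it for subsequent calls
    int iCurrent;
    cudaGetDevice(&iCurrent);                  // confirm the selection
    std::cout << "using device " << iCurrent << std::endl;
}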
The cudaDeviceProp struct contains the properties of a device.
Its members include the following (a sketch reporting some of them follows the list):
- char name[256] - C-style null-terminated string identifying the device
- size_t totalGlobalMem - total amount of global memory available on the device in bytes
- int major - major version number of the compute capability
- int minor - minor version number of the compute capability
- int multiProcessorCount - number of multiprocessors on the device
- int maxThreadsPerMultiProcessor - maximum number of resident threads per multiprocessor
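For example, a minimal sketch that reports the compute capability and
multiprocessor configuration of device 0 (assuming at least one device is installed):
// Compute Capability
// capability.cpp - a minimal sketch, not part of the samples
#include <iostream>
#include <cuda_runtime.h>
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    std::cout << "Compute Capability : " << prop.major << '.' << prop.minor << std::endl;
    std::cout << "Multiprocessors    : " << prop.multiProcessorCount << std::endl;
    std::cout << "Max Threads per MP : " << prop.maxThreadsPerMultiProcessor << std::endl;
}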
Examples
Hello World
For your first CUDA program, save the following source file as hello.cpp:
// Hello World
// hello.cpp

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int nDevices;
    cudaGetDeviceCount(&nDevices);  // number of compute-capable devices
    std::cout << "Hello World! " << nDevices <<
        " CUDA-enabled device(s) installed" << std::endl;
    system("pause");
}
Visual Studio Test Run
Start a Visual Studio project named hello. Use Visual
Studio 2012 or better for a CUDA 8.0 installation.
- New Project -> Visual C++ -> Empty Project
- name: hello | OK
- select Project -> Add New Item and add a C++ file named hello.cpp
- paste in the source code listed above
- set configuration to x64
- select Project -> Properties -> C/C++ -> General -> Additional Include Directories -> Edit
- add path to CUDA include directory
- select Linker -> General -> Additional Library Directories -> Edit
- add path to CUDA lib\x64 directory
- select Linker -> Input -> Additional Dependencies -> Edit
- add cudart.lib
- close Properties: select OK
- select Build Solution
- select Debug -> Start without Debugging
Executing the hello project produces the following output:
Hello World! 1 CUDA-enabled device(s) installed
Check Device Properties
To identify the properties of the installed device, build and run the
following source code:
// Check Properties
// properties.cpp

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int iDevice;
    cudaDeviceProp prop;
    cudaGetDevice(&iDevice);                  // index of the current device
    cudaGetDeviceProperties(&prop, iDevice);  // its properties
    std::cout << "Name of the Device : " << prop.name << std::endl;
    std::cout << "Total Global Memory : " << prop.totalGlobalMem << std::endl;
    system("pause");
}
Your results should look something like:
Name of the Device : Quadro M1000M
Total Global Memory : 2147483648
Press any key to continue . . .
Memory Management
GPU memory hardware is physically distinct from the system memory hardware
on the host. We refer to the memory on the GPU as device memory.
We manage device memory allocation and deallocation from the host code
itself. We allocate device memory, deallocate it, initialize it and
copy data between the host and the device using CUDA API functions.
Memory Functions
Allocation and Deallocation
The functions for allocating and deallocating device memory have
similar names to their C language counterparts:
- cudaMalloc(void**, size_t) -
returns the address of the allocated memory through its first parameter;
the function call passes the address of the pointer in its first
argument and the number of bytes in its second argument (see the sketch after this list)
- cudaFree(void*) -
deallocates the memory; the function call passes the address of the
allocated memory in its first and only argument
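The void** cast in the first argument is the idiom to note; a minimal sketch
(the count of 1024 ints is arbitrary):
// Allocate and Free
// alloc.cpp - a minimal sketch, not part of the samples
#include <cuda_runtime.h>
int main() {
    int n = 1024;                               // arbitrary element count
    int* d_a;                                   // receives the device address
    cudaMalloc((void**)&d_a, n * sizeof(int));  // note the cast to void**
    cudaFree(d_a);                              // release the device memory
}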
Initialization
The function for initializing device memory byte by byte is:
- cudaMemset(void*, int, size_t) -
the function call passes the address of the memory to be initialized in its first
argument, the initial value of each byte in its second argument and the number of
bytes to be initialized in its third argument
Copying
The function for copying between system and device memory is:
- cudaMemcpy(void*, const void*, size_t, enum cudaMemcpyKind) -
the function call passes the address of the destination memory in its first argument, the address of the
source memory in its second argument, the number of bytes to copy in its third
argument and the copying direction
in its fourth argument
The following enumeration constants specify the copying direction:
- cudaMemcpyHostToDevice - from host to device
- cudaMemcpyDeviceToHost - from device to host
- cudaMemcpyDeviceToDevice - from device to device
Example
The following program
- allocates dynamic memory on the host and the
device for the user-specified number of ints
- sets each byte of device memory to the user-specified value
- copies the initialized values from the device to the host
- displays the values
- deallocates the host and device memory
// Set Memory
// memset.cpp

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " number_of_ints initial_value\n";
        return 1;
    }
    int n = std::atoi(argv[1]);  // number of ints
    int v = std::atoi(argv[2]);  // initial value of each byte
    int* d_a;                    // device memory
    int* h_a = new int[n];       // host memory
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMemset(d_a, v, n * sizeof(int));
    cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
    std::cout << std::endl;
    cudaFree(d_a);
    delete [] h_a;
    system("pause");
}
Some programmers prefix their pointer identifiers:
- h_ for host memory
- d_ for device memory
The results for command-line arguments of 10 1 are:
16843009 16843009 16843009 16843009 16843009
16843009 16843009 16843009 16843009 16843009
Question: why does each memory location hold a value of 16843009
for a second command-line argument of 1?
Hint: cudaMemset() sets each byte, not each int; consider 16843009 in hexadecimal.
Dereferencing
Since the compiler cannot distinguish an address on the host from an address on
the device, we cannot dereference a pointer to device memory from within host
code or a pointer to host memory from within device code.
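A minimal sketch of this rule: the commented-out dereference compiles, but its
behavior is undefined; the value must travel through cudaMemcpy() instead:
// Dereferencing
// deref.cpp - a minimal sketch, not part of the samples
#include <iostream>
#include <cuda_runtime.h>
int main() {
    int h_x = 0;
    int* d_x;
    cudaMalloc((void**)&d_x, sizeof(int));
    cudaMemset(d_x, 0, sizeof(int));
    // int y = *d_x;  // compiles, but undefined behavior: d_x is a device address
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);  // correct
    std::cout << h_x << std::endl;
    cudaFree(d_x);
}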
Error Handling
CUDA supports two separate methods for identifying runtime errors:
- direct - error state monitoring
- indirect - memory address evaluation
Error State Monitoring
The fundamental error-handling system in CUDA is a state system that reports the latest error.
Determining success involves examining the error state after a function has returned
control to its caller. Many of the CUDA API functions return the error state directly.
The cudaGetLastError() function returns the error code
that identifies the current error state. This error code refers to the
last error generated by the runtime binary code. Error codes are of cudaError_t
enumerated type. An error code on success has the value
cudaSuccess. The Toolkit Reference Manual
lists the enumeration constants that identify the error states associated
with different error codes.
To check for a particular error, we compare its enumeration constant with the value
returned by the API function or the cudaGetLastError() function.
A call to cudaGetLastError() resets the error state to
cudaSuccess.
The cudaGetErrorString() function provides a user-friendly
description of an error code. This function receives the error code as its
first and only argument and returns the address of a C-style null-terminated string
that describes that code.
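Before the fuller example below, a minimal sketch of the reset behavior; the call
to cudaSetDevice(-1) deliberately provokes an error:
// Last Error Reset
// last_error.cpp - a minimal sketch, not part of the samples
#include <iostream>
#include <cuda_runtime.h>
int main() {
    cudaSetDevice(-1);  // no device has index -1: sets the error state
    // the first call reports the error and resets the state ...
    std::cout << cudaGetErrorString(cudaGetLastError()) << std::endl;
    // ... so the second call reports success
    std::cout << cudaGetErrorString(cudaGetLastError()) << std::endl;
}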
Example
// Error State Check
// error_state.cpp

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " number_of_elements\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    int* d_a;
    cudaError_t error;
    // this API function returns an error code
    error = cudaMalloc((void**)&d_a, n * sizeof(int));
    if (error != cudaSuccess)
        std::cerr << "***" << cudaGetErrorString(error) << "***\n";
    else {
        std::cout << "cudaMalloc succeeded" << std::endl;
        cudaFree(d_a);
        // this API function returns the current error state
        error = cudaGetLastError();
        if (error != cudaSuccess)
            std::cerr << "***" << cudaGetErrorString(error) << "***\n";
        else
            std::cout << "cudaFree succeeded" << std::endl;
    }
    system("pause");
}
Memory Address
CUDA API functions that allocate memory assign the address of the reserved memory
only if the allocation succeeds; they do not alter the address if the
allocation fails. Initializing the address to nullptr
and checking whether the function has changed it once it has returned control is an
indirect way of identifying an error.
Example
The cudaMalloc() function returns a memory
address through its first parameter. Let us initialize the receiving pointer
to nullptr (C++) or NULL (C).
The pointer's value will change only if the allocation succeeds:
// Memory Address Check
// address_check.cpp

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " number_of_elements\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    int* d_a = nullptr;  // changes only if the allocation succeeds
    cudaMalloc((void**)&d_a, n * sizeof(int));
    if (d_a) {
        std::cout << "cudaMalloc succeeded" << std::endl;
        cudaFree(d_a);
    } else {
        std::cerr << "***cudaMalloc failed***\n";
    }
    system("pause");
}
Combined Example
The following example demonstrates both methods of trapping errors:
// Error Handling
// error.cpp

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << " number_of_ints initial_value\n";
        return 1;
    }
    int rc = 0;
    int n = std::atoi(argv[1]);
    int v = std::atoi(argv[2]);
    // check device memory pointer - method 2
    int* d_a = nullptr;
    cudaMalloc((void**)&d_a, n * sizeof(int));
    if (!d_a) {
        std::cerr << "***Memory not allocated on the device***\n";
        return 2;
    }
    int* h_a = new int[n];
    cudaError_t error;
    // check return value - method 1
    error = cudaMemset(d_a, v, n * sizeof(int));
    // check error code
    if (error != cudaSuccess) {
        std::cerr << "***" << cudaGetErrorString(error) << "***\n";
        rc = 3;
    } else {
        // check return value - method 1
        error = cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
        if (error != cudaSuccess) {
            std::cerr << "***" << cudaGetErrorString(error) << "***\n";
            rc = 4;
        } else {
            for (int i = 0; i < n; i++)
                std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
            std::cout << std::endl;
        }
    }
    cudaFree(d_a);
    delete [] h_a;
    system("pause");
    return rc;
}
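Checking every return value inline quickly clutters a program. A common
refinement wraps the check in a small helper; the sketch below assumes the
caller simply wants to exit on failure (the name check() is our own, not part of
the CUDA API):
// Error-Checking Helper
// check.cpp - a minimal sketch, not part of the samples
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
// exit with a descriptive message if a CUDA API call reports an error
void check(cudaError_t error, const char* msg) {
    if (error != cudaSuccess) {
        std::cerr << msg << ": " << cudaGetErrorString(error) << '\n';
        std::exit(EXIT_FAILURE);
    }
}
int main() {
    int* d_a;
    check(cudaMalloc((void**)&d_a, 1024 * sizeof(int)), "cudaMalloc");
    check(cudaMemset(d_a, 0, 1024 * sizeof(int)), "cudaMemset");
    check(cudaFree(d_a), "cudaFree");
    std::cout << "all calls succeeded" << std::endl;
}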
Exercises