Part B - CUDA Programming Model

CUDA Preliminaries

Install the toolkit and check the hardware's capabilities
Describe the API functions for memory allocation and data transfer
Introduce the error handling system


CUDA treats the GPU as a massively data-parallel processor.  This simple model subdivides the data into equal small parts and executes instructions on each part concurrently.  CUDA provides a direct interface to the GPU hardware. 

CUDA programmers write source code for the CPU and GPU in the same files.  The CUDA compiler driver distinguishes those parts that execute on the CPU as host code from those parts that execute on the GPU as device code.  The driver passes the host code to the installed C/C++ compiler and the device code to the CUDA PTX compiler.  The PTX compiler translates the device code into a pseudo-assembly language code (PTX).  The graphics driver for the installed hardware translates this PTX code into binary code that executes on the GPU. 
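
The nvcc command lines below sketch this pipeline; the source file name app.cu is hypothetical:

 nvcc app.cu -o app      compiles host and device code into a single executable
 nvcc -ptx app.cu        stops after translating the device code to PTX (app.ptx)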

This chapter outlines the instructions needed to install the CUDA Toolkit, describes the API that manages the device and its memory, and introduces the error handling system used by CUDA.


Getting Started

As of January 2017, CUDA 8.0 is available for download to the following platform families:

  • Windows: 7, 8.1, 10, Server 2008 R2, Server 2012 R2 (x86_64)
  • Linux x86_64: Fedora 23, OpenSUSE 13.2, RHEL 7|6, CentOS 7|6, SLES 12|11-SP4 (64-bit), Ubuntu 16.04|14.04
  • Linux ppc64le: Ubuntu 14.04
  • Mac OS X x86_64: 10.12, 10.11

The release notes on the download pages contain up-to-date information regarding known issues, bugs and workarounds. 

All platform families require a CUDA-enabled GPU to complete the installation.  The qualifying GPUs are listed at https://developer.nvidia.com/cuda-gpus

Windows

System Requirements

For a Windows installation, you need

  • a CUDA-enabled GPU
  • a supported version of Windows
  • the CUDA Toolkit
  • Visual Studio

To confirm that your GPU is CUDA-enabled, navigate to

 Start > Control Panel > Hardware and Sound > Devices and Printers
       > Device Manager > Display Adapters

If one of the display adapters appears in the list at https://developer.nvidia.com/cuda-gpus, your GPU qualifies.

Installation

The installer package installs the driver, the toolkit and the samples.  If you've already installed a driver, make sure that it meets the minimum requirements for your specific toolkit.  To check whether you've already installed a driver and, if so, its version number, navigate to

 Start > Control Panel > Hardware and Sound > NVIDIA Control Panel
       > System Information

By default the CUDA Toolkit installs to c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#, where #.# is the version number.  The directory includes the following sub-directories:

  • bin - compiler executables and runtime libraries
  • include - header files for compiling CUDA source code
  • lib - library files for linking CUDA object code
  • doc - documentation in html and pdf form

The installer automatically sets the CUDA_PATH environment variable to the installation directory and creates the .rules file for locating the nvcc compiler driver. 

By default, the samples install to C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.#.  The source code for each sample is under its <category_name> sub-directory.

Build Your Sample Executables

Build the sample executables as follows:

  • navigate to the C:\ProgramData\NVIDIA Corporation\CUDA Samples\v#.# folder in File Explorer
  • start the solution file that corresponds to your Visual Studio installation
  • build the solution (this will take some time)
  • note the configuration for which you have compiled the samples (required below)

Verification

Verify your installation as follows:

  • check the driver version number
    Start > Control Panel > NVIDIA Control Panel > System Information
  • check path to current CUDA Toolkit - from the Windows command prompt
    echo %CUDA_PATH%
  • compiler driver version number - from the Windows command prompt
    cd %CUDA_PATH%\bin
    nvcc -V
  • switch to the samples' executable directory - from the Windows command prompt
    cd %NVCUDASAMPLES_ROOT%\bin\win64\Debug or cd %NVCUDASAMPLES_ROOT%\bin\win64\Release
  • run the following executables - from the Windows command prompt
    deviceQuery
    bandwidthTest

Device Management

Compute Capability

CUDA classifies devices according to their compute capability.  Compute capability consists of two integers: a major version and a minor version.  The major integer identifies the architecture (1 - Tesla, 2 - Fermi, 3 - Kepler, 5 - Maxwell or 6 - Pascal) while the minor integer identifies the revision within that architecture.

A table of CUDA-enabled GPUs with compute capabilities is at https://developer.nvidia.com/cuda-gpus

Latest Drivers

Drivers for CUDA-enabled GPUs may be downloaded from http://www.nvidia.com/Download/index.aspx?lang=en-us

CUDA API Functions

The CUDA Toolkit includes a comprehensive library of functions for managing devices.  These functions include:

  • cudaGetDeviceCount(int*) - returns by parameter the number of compute-capable devices
  • cudaGetDevice(int*) - returns by parameter the index of the device currently being used
  • cudaGetDeviceProperties(struct cudaDeviceProp*, int) - returns by parameter the properties for the device identified by the second argument
  • cudaSetDevice(int) - selects a device
  • cudaChooseDevice(int*, const struct cudaDeviceProp*) - identifies by parameter the device that best matches the specified properties

The cudaDeviceProp struct contains the properties of a device.  Its members include:

  • char name[256] - C-style null-terminated string identifying the device
  • totalGlobalMem - total amount of global memory available on the device in bytes
  • major - major version number of the compute capability
  • minor - minor revision number of the compute capability
  • multiProcessorCount - number of multiprocessors on the device
  • maxThreadsPerMultiProcessor - maximum number of resident threads per multi-processor
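
The examples below exercise the query functions; none of them calls cudaSetDevice() or cudaChooseDevice().  A minimal selection sketch, under the hypothetical file name choose.cpp:

 // Device Selection - a sketch only
 // choose.cpp

 #include <iostream>
 #include <cstring>
 #include <cuda_runtime.h>

 int main() {
     // describe the desired device; zeroed members mean "don't care"
     cudaDeviceProp desired;
     std::memset(&desired, 0, sizeof(desired));
     desired.major = 3; // prefer compute capability 3.x or higher

     int iDevice;
     cudaChooseDevice(&iDevice, &desired); // index of the best match
     cudaSetDevice(iDevice);               // make it the current device

     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, iDevice);
     std::cout << "Selected device " << iDevice << ": " << prop.name
               << std::endl;
 }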

Examples

Hello World

For your first CUDA program, save the following source file as hello.cpp:

 // Hello World
 // hello.cpp

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main() {
     int nDevices;
     cudaGetDeviceCount(&nDevices);
     std::cout << "Hello World!  " << nDevices << 
       " CUDA-enabled device(s) installed" << std::endl;
     system("pause);
 }

Visual Studio Test Run

Start a Visual Studio project named hello.  Use Visual Studio 2012 or better for a CUDA 8.0 installation. 

  • New Project -> Visual C++ -> Empty Project
  • name: hello | OK
  • select Project -> Add New Item and add a C++ file named hello.cpp
  • paste in the source code listed above
  • set the solution platform to x64
  • select Project -> Properties -> C/C++ -> General -> Additional Include Directories -> Edit
  • add path to CUDA include directory
  • select Linker -> General -> Additional Library Directories -> Edit
  • add path to CUDA lib\x64 directory
  • select Linker -> Input -> Additional Dependencies -> Edit
  • add cudart.lib
  • close Properties: select OK
  • select Build Solution
  • select Debug -> Start without Debugging

Executing the hello project produces output like the following (here, for a machine with a single device):

 Hello World!  1 CUDA-enabled device(s) installed 

Check Device Properties

To identify the properties of the installed device, build and run the following source code:

 // Check Properties
 // properties.cpp

 #include <iostream>
 #include <cuda_runtime.h>

 int main() {
     int iDevice;
     cudaDeviceProp prop;
     cudaGetDevice(&iDevice);
     cudaGetDeviceProperties(&prop, iDevice);
     std::cout << "Name of the Device  : " << prop.name << std::endl;
     std::cout << "Total Global Memory : " << prop.totalGlobalMem << std::endl; 
     system("pause");
 }

Your results should look something like:

 Name of the Device  : Quadro M1000M
 Total Global Memory : 2147483648
 Press any key to continue . . .
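
The other members listed above can be reported the same way.  For example, a fragment that could be added to properties.cpp to print the compute capability and multiprocessor count:

     std::cout << "Compute Capability  : " << prop.major << '.'
               << prop.minor << std::endl;
     std::cout << "Multiprocessors     : " << prop.multiProcessorCount
               << std::endl;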

Memory Management

GPU memory hardware is physically distinct from the system memory hardware on the host.  We refer to the memory on the GPU as device memory.  We manage device memory from the host code itself: we allocate it, deallocate it, initialize it and copy data between the host and the device using CUDA API functions.

Memory Functions

Allocation and Deallocation

The functions for allocating and deallocating device memory have similar names to their C language counterparts:

  • cudaMalloc(void**, size_t) - returns the address of the allocated memory through its first parameter; the function call passes the address of the pointer in its first argument and the number of bytes in its second argument
  • cudaFree(void*) - deallocates the memory; the function call passes the address of the allocated memory in its first and only argument

Initialization

The function for initializing device memory byte by byte is:

  • cudaMemset(void*, int, size_t) - the function call passes the address of the memory to be initialized in its first argument, the initial value of each byte in its second argument and the number of bytes to be initialized in its third argument

Copying

The function for copying between system and device memory is:

  • cudaMemcpy(void*, const void*, size_t, enum cudaMemcpyKind) - the function call passes the address of the destination memory in its first argument, the address of the source memory in its second argument, the number of bytes to copy in its third argument and the copying direction in its fourth argument

The following enumeration constants specify the copying direction:

  • cudaMemcpyHostToDevice - from host to device
  • cudaMemcpyDeviceToHost - from device to host
  • cudaMemcpyDeviceToDevice - from device to device

Example

The following program

  1. allocates dynamic memory on the host and the device for the user-specified number of ints
  2. sets each byte of device memory to the user-specified value
  3. copies the initialized values from the device to the host
  4. displays the values
  5. deallocates the host and device memory

 // Set Memory
 // memset.cpp

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  number_of_ints initial_value\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     int v = std::atoi(argv[2]);
     int* d_a;
     int* h_a = new int[n];
     cudaMalloc((void**)&d_a, n * sizeof(int));
     cudaMemset(d_a, v, n * sizeof(int));
     cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost); 
     for (int i = 0; i < n; i++)
         std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
     std::cout << std::endl;
     cudaFree(d_a);
     delete [] h_a;
     system("pause);
 }

Some programmers preface their pointer identifiers:

  • h_ for host memory
  • d_ for device memory

The results for command-line arguments of 10 1 are:

 16843009 16843009 16843009 16843009 16843009
 16843009 16843009 16843009 16843009 16843009

Question: why does each memory location hold a value of 16843009 when the second command-line argument is 1?  (Hint: cudaMemset() initializes memory byte by byte, and 0x01010101 equals 16843009.)

Dereferencing

Since the compiler cannot distinguish an address on the host from an address on the device, we cannot dereference a pointer to device memory from within host code or a pointer to host memory from within device code. 
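
A minimal fragment that sketches the pitfall; the dereference on the commented line compiles but is invalid:

     int* d_a;
     cudaMalloc((void**)&d_a, sizeof(int));
     // *d_a = 5; // INVALID: host code cannot dereference a device address
     int h_a = 5;
     cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice); // correct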


Error Handling

CUDA supports two separate methods for identifying runtime errors:

  • direct - error state monitoring
  • indirect - memory address evaluation

Error State Monitoring

The fundamental error handling system in CUDA is a state system that reports the latest error.  Determining success involves examining the error state after a function has returned control to its caller.  Many of the CUDA API functions return the error state directly.

The cudaGetLastError() function returns the error code that identifies the current error state.  This error code refers to the last error generated by the runtime binary code.  Error codes are of cudaError_t enumerated type.  An error code on success has the value cudaSuccess.  The Toolkit Reference Manual lists the enumeration constants that identify the error states associated with different error codes. 

To check for a particular error, we compare its enumeration constant with the value returned by the API function or the cudaGetLastError() function.  A call to cudaGetLastError() resets the error state to cudaSuccess.
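
For instance, a hedged fragment (with d_a and n declared as in the example below) that traps one particular error; cudaErrorMemoryAllocation is one of those enumeration constants:

     cudaError_t error = cudaMalloc((void**)&d_a, n * sizeof(int));
     if (error == cudaErrorMemoryAllocation)
         std::cerr << "***insufficient device memory***\n";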

The cudaGetErrorString() function provides a user-friendly description of an error code.  This function receives the error code as its first and only argument and returns the address of a C-style null-terminated string that describes that code.

Example

 // Error State Check
 // error_state.cpp

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  number_of_elements\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     int* d_a;
     cudaError_t error;

     // this API function returns an error code
     error = cudaMalloc((void**)&d_a, n * sizeof(int));
     if (error != cudaSuccess)
         std::cerr << "***" << cudaGetErrorString(error) << "***\n";
     else {
         std::cout << "cudaMalloc succeeded" << std::endl;
         cudaFree(d_a);

         // this API function returns the current error state
         error = cudaGetLastError();
         if (error != cudaSuccess)
             std::cerr << "***" << cudaGetErrorString(error) << "***\n";
         else
             std::cout << "cudaFree succeeded" << std::endl;
     }
     system("pause);
 }

Memory Address

CUDA API functions that allocate memory assign the address of the reserved memory only if the allocation succeeds; they do not alter the address if it fails.  Initializing the address to nullptr before the call and checking whether the function has changed it after the call returns is an indirect way of identifying an error.

Example

The cudaMalloc() function returns a memory address through its first parameter.  Let us initialize the receiving pointer to nullptr (C++) or NULL (C).  The pointer's value will change only if the allocation succeeds:

 // Memory Address Check
 // address_check.cpp

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  number_of_elements\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     int* d_a = nullptr;

     cudaMalloc((void**)&d_a, n * sizeof(int));

     if (d_a) {
         std::cout << "cudaMalloc succeeded" << std::endl;
         cudaFree(d_a);
     } else {
         std::cerr << "***cudaMalloc failed***\n";
     }
     system("pause);
 }

Combined Example

The following example demonstrates both methods of trapping errors:

 // Error Handling
 // error.cpp

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  number_of_ints initial_value\n"; 
         return 1;
     }
     int rc = 0;
     int n = std::atoi(argv[1]);
     int v = std::atoi(argv[2]);

     // check device memory pointer - method 2
     int* d_a = nullptr;
     cudaMalloc((void**)&d_a, n * sizeof(int));
     if (!d_a) {
         std::cerr << "***Memory not allocated on the device***\n";
         return 2;
     }

     int* h_a = new int[n];
     cudaError_t error;

     // check return value - method 1
     error = cudaMemset(d_a, v, n * sizeof(int));
     // check error code
     if (error != cudaSuccess) {
         std::cerr << "***" << cudaGetErrorString(error) << "***\n";
         rc = 3;
     } else {

         // check return value - method 1
         error = cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
         if (error != cudaSuccess) {
             std::cerr << "***" << cudaGetErrorString(error) << "***\n";
             rc = 4;
         } else {
             for (int i = 0; i < n; i++)
                 std::cout << h_a[i] << (i % 5 == 4 ? '\n' : ' ');
             std::cout << std::endl;
         }
     }

     cudaFree(d_a);
     delete [] h_a;
     system("pause);
     return rc;
 }

Exercises



