Part D - Directive Programming

OpenACC

Show how to generate parallel code using the compiler
Show how to minimize copies between host and device using directives



Directive programming is a well-established, straightforward technique for guiding the compiler in generating parallel code.  The programmer passes hints to the compiler, which generates parallel code accordingly where possible.  The compiler may or may not satisfy every request. 

OpenACC is the programming standard for directing compilers in their generation of parallel code for accelerators.  These accelerators include not only GPUs but also many-core CPUs.  Compilers that implement this standard are commercially available from The Portland Group (PGI), Cray and CAPS.  PGI is currently owned by Nvidia.

This chapter introduces the OpenACC API available from the Portland Group, demonstrates the use of OpenACC directives in generating parallel code from serial code, shows how to minimize memory transfers between the host and the device, and offers a few tips for maximizing accelerator performance.


OpenACC

The OpenACC standard describes compiler directives, library routines and environment variables for expressing parallelism in C, C++ and Fortran programs and offloading code to accelerators.  The currently supported release of OpenACC is version 2.0a.  The standard is available at http://www.openacc.org/sites/default/files/OpenACC%202%200.pdf

Each OpenACC directive starts with the following prefix

 #pragma acc

#pragma identifies a compiler directive, while acc identifies the directive as an OpenACC directive. 

Most OpenACC directives apply to the structured block of code immediately following the directive:

 #pragma acc ...
 for (int i = 0; i < n; i++) {
        a[i] = i * i;
 }  // end of the directive's scope 

The OpenACC execution model defines three levels: gangs, workers and vectors.  The correspondence to CUDA levels is implementation dependent.  Two possible correspondences are shown below.

 OpenACC   CUDA (a)            CUDA (b)
 gang      block               block
 worker    warp                ---
 vector    threads of a warp   threads of a block

C Syntax

A complete OpenACC directive takes the form

 #pragma acc construct [clause [[,] clause] ...] new-line

construct is a placeholder that defines the type of construct.  clause is a placeholder for one of the construct's qualifiers.  The brackets identify an optional element.

Constructs

OpenACC constructs include:

  • parallel - launches a number of gangs that execute in parallel

  • loop - describes the type of parallelism to use on the immediately following loop and its nested loops

  • data - defines a region of the program within which data is accessible by the accelerator

  • kernels - identifies loops to be executed on the accelerator, typically a sequence of operations

parallel assumes that the code is safe to parallelize and requires that the programmer ensure that it is indeed safe.  kernels performs its own analysis and parallelizes only what the compiler accepts as safe to parallelize.

It is common to see parallel immediately followed by loop.
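
As a hedged illustration, the sketch below applies each construct to the same loop; the arrays a, b, c and the size n are assumed:

 // the programmer asserts that the loop is safe to parallelize
 #pragma acc parallel loop
 for (int i = 0; i < n; i++)
     c[i] = a[i] + b[i];

 // the compiler analyzes the region and parallelizes only what it can prove safe
 #pragma acc kernels
 for (int i = 0; i < n; i++)
     c[i] = a[i] + b[i];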

Runtime Library Routines

The OpenACC data types, enumeration types, and runtime library prototypes are exposed through the header file

 #include "openacc.h"

Runtime routines include:

  • int acc_get_num_devices(acc_device_t) - number of devices of the specified type attached to the host

  • void acc_set_device_type(acc_device_t) - set the device type to use when executing an accelerator parallel or kernels region

  • void acc_set_device_num(int, acc_device_t) - set the device to use from amongst the attached devices

  • int acc_get_device_num(acc_device_t) - returns the device number of the specified device that will be used to run the next accelerator parallel or kernels region

  • void acc_shutdown(acc_device_t) - shutdown the connection to the given accelerator device

  • d_void* acc_malloc(size_t) - allocate memory on the device

  • void acc_free(d_void*) - deallocate memory on the device
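
For example, a host program might query the attached devices and select one before offloading any work.  A minimal sketch using the routines above (acc_device_nvidia is the enumerator for NVIDIA GPUs in PGI's openacc.h):

 #include <stdio.h>
 #include "openacc.h"

 int main(void) {
     // number of NVIDIA devices attached to the host
     int n = acc_get_num_devices(acc_device_nvidia);
     printf("%d NVIDIA device(s) found\n", n);

     if (n > 0) {
         acc_set_device_type(acc_device_nvidia);    // use NVIDIA devices
         acc_set_device_num(0, acc_device_nvidia);  // use the first one
     }

     // ... accelerator regions execute on the selected device ...

     acc_shutdown(acc_device_nvidia);  // release the connection
     return 0;
 }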

PGI Compiler

Currently, the PGI compiler is available free of charge for a 30-day trial period.  You will need to purchase a license to use the compiler after the trial period expires.

Compilation

To compile a C program named sample.c containing OpenACC directives using the PGI compiler, enter the following at the command line

 pgcc -acc [-Minfo] [-ta=nvidia,host] -o sample sample.c 

The brackets [ ] enclose optional flags.

The -acc flag identifies the presence of OpenACC directives in the source code.  The -Minfo flag directs the compiler to output informative messages while processing the directives.  The -ta flag directs the compiler to generate host and/or device versions. 

Environment Variables

The PGI_ACC_TIME environment variable controls the output of timing statistics.  If this variable is set to 1, the executable will output timing statistics. 

We set this variable at the command line (in bash and csh respectively):

 export PGI_ACC_TIME=1
 setenv PGI_ACC_TIME 1

Example

The following example converts a serial version of saxpy, a level 1 BLAS function, to a parallel version using OpenACC.

Serial Version

The following host function executes in a single thread:

 void saxpy(int n, float a, const float* x, float* restrict y) { 
     for (int i = 0; i < n; i++)
         y[i] = a * x[i] + y[i];
 }

 // called for 1,048,576 elements
 saxpy(1<<20, 2.0f, x, y);

The restrict keyword informs the compiler that no other pointer accesses the memory referenced by y during the function's execution, which frees the compiler to optimize more aggressively. 

Parallel Version

Source with OpenACC Directive

The following OpenACC directive offloads the iteration to execute in parallel on an accelerator

 void saxpy(int n, float a, const float* x, float* restrict y) { 
     #pragma acc parallel loop
     for (int i = 0; i < n; i++)
         y[i] = a * x[i] + y[i];
 }

 // called for 1,048,576 elements
 saxpy(1<<20, 2.0f, x, y);

The compiler

  1. generates parallel code to execute on the device
  2. allocates memory on the device
  3. copies the input data from the host to the device
  4. executes the parallel code on the device
  5. copies the output from the device to the host
  6. deallocates memory on the device
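
A complete host program built around this function might look as follows.  This is a minimal sketch; the initial values are illustrative:

 #include <stdlib.h>

 void saxpy(int n, float a, const float* x, float* restrict y) {
     #pragma acc parallel loop
     for (int i = 0; i < n; i++)
         y[i] = a * x[i] + y[i];
 }

 int main(void) {
     int n = 1 << 20;                               // 1,048,576 elements
     float* x = (float*)malloc(n * sizeof(float));
     float* y = (float*)malloc(n * sizeof(float));
     for (int i = 0; i < n; i++) {
         x[i] = 1.0f;
         y[i] = 2.0f;
     }
     saxpy(n, 2.0f, x, y);                          // y[i] == 4.0f afterwards
     free(x);
     free(y);
     return 0;
 }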

Compiler Output

The output from compilation at the command line looks like

 pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
 saxpy:
    11, Accelerator kernel generated
        13, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */ 
    11, Generating present_or_copyin(x[0:n])
        Generating present_or_copy(y[0:n])

        Generating NVIDIA code
        Generating compute capability 1.0 binary
        Generating compute capability 2.0 binary
        Generating compute capability 3.0 binary

0:n identifies an array section that starts at index 0 and contains n elements.  This subarray notation is common in parallel programming.  Note that the compiler-generated executable includes code for three different compute capabilities.
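
The same notation can appear in explicit data clauses to state exactly which elements to transfer.  A hedged sketch, reusing the names from the saxpy example above and showing one possible choice of clauses:

 // copy x to the device only; copy y to the device and back
 #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
 for (int i = 0; i < n; i++)
     y[i] = a * x[i] + y[i];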

Profiling Output

Output from running the executable at the command line looks like

 Accelerator Kernel Timing data
  saxpy NVIDIA devicenum=0
        time(us): 3,256
        11: data copyin reached 2 times
            device time(us): total=1,619 max=892 min=727 avg=809 
        11: kernel launched 1 times
            grid: [4096] block: [256]
            device time(us): total=714 max=714 min=714 avg=714
            elapsed time(us): total=724 max=724 min=724 avg=724
        15: data copyout reached 1 times
            device time(us): total=923 max=923 min=923 avg=923

The times listed here are in microseconds (us).


Memory Transfers

By default, OpenACC compute regions copy their data from the host at entry and back to the host at exit.  If parallelism is expressed within an iteration, this default behavior warrants intervention.  Without intervention, a directive within an iteration can cause repeated copies between the host and the device and, consequently, a notable performance penalty.  The solution is to define a data region on the device and keep the data within that region.
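
In outline, the data construct opens a region within which arrays stay resident on the device, so the compute regions inside it reuse the data instead of re-copying it.  A minimal sketch, assuming arrays a and b of n elements:

 #pragma acc data copyin(a[0:n]) copyout(b[0:n])
 {
     #pragma acc parallel loop   // a is already resident on the device
     for (int i = 0; i < n; i++)
         b[i] = 2.0f * a[i];

     #pragma acc parallel loop   // b remains on the device between kernels
     for (int i = 0; i < n; i++)
         b[i] = b[i] + a[i];
 }   // b is copied back to the host here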

Jacobi Iteration

Jacobi iteration is a common algorithm that repeatedly replaces the value at each interior location with the average of the values of its neighbours.  The iteration continues until all corrections become negligible.

Serial Version

The serial code for a Jacobi iteration is as follows (fmaxf and fabsf come from <math.h>; err starts above tol so that the loop executes):

 float err = 1.0f, tol = 1.0e-5f;
 int iter = 0, iter_max = 100;

 while (err > tol && iter < iter_max) {
     err = 0.0f;
     for (int j = 1; j < n - 1; j++) {
         for (int i = 1; i < m - 1; i++) {
             Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] 
              + A[j-1][i] + A[j+1][i]);
             err = fmaxf(err, fabsf(Anew[j][i] - A[j][i]));
         }
     }
     for (int j = 1; j < n - 1; j++) {
         for (int i = 1; i < m - 1; i++) {
             A[j][i] = Anew[j][i];
         }
     }
     iter++;
 }

Naive OpenACC Version

The naive OpenACC equivalent of this serial code is

 float err = 1.0f, tol = 1.0e-5f;
 int iter = 0, iter_max = 100;

 while (err > tol && iter < iter_max) {
     err = 0.0f;

     #pragma acc parallel loop
     for (int j = 1; j < n - 1; j++) {
         for (int i = 1; i < m - 1; i++) {
             Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] 
              + A[j-1][i] + A[j+1][i]);
             err = fmaxf(err, fabsf(Anew[j][i] - A[j][i]));
         }
     }

     #pragma acc parallel loop
     for (int j = 1; j < n - 1; j++) {
         for (int i = 1; i < m - 1; i++) {
             A[j][i] = Anew[j][i];
         }
     }
     iter++;
 }

The compiler output at the command line looks like

 pgcc -Minfo=all -ta=nvidia:5.0,cc3x -acc -Minfo=accel -o sample2 sample2.c 
  main:
    56, Accelerator kernel generated
    57, #pragma acc loop gang /* blockIdx.x*/
    59, #pragma acc loop vector(256) /*threadIdx.x*/
    56, Generating present_or_copyin(A[0:][0:])
        Generating present_or_copyout(Anew[1:4094][1:4094])
        Generating NVIDIA code
        Generating compute capability 3.0 binary
    59, Loop is parallelizable
    68, Accelerator kernel generated
    69, #pragma acc loop gang /* blockIdx.x*/
    71, #pragma acc loop vector(256) /* threadIdx.x*/
    68, Generating present_or_copyout(A[1:4094][1:4094])
        Generating present_or_copyin(Anew[1:4094][1:4094])
        Generating NVIDIA code
        Generating compute capability 3.0 binary
    71, Loop is parallelizable

Output from the executable at the command line looks like

 Accelerator Kernel Timing data
  main NVIDIA devicenum=0
        time(us): 93,201,190
        56: data copyin reached 1000 times
            device time(us): total=23,049,452 max=28,928 min=22,761 avg=23,049
        56: kernel launched 1000 times
            grid: [4094] block: [256]
            device time(us): total=2,609,928 max=2,812 min=2,593 avg=2,609
            elapsed time(us): total=2,872,585 max=3,022 min=2,642 avg=2,872
        56: reduction kernel launched 1000 times
            grid: [1] block: [256]
            device time(us): total=19,218 max=724 min=16 avg=19
            elapsed time(us): total=29,070 max=734 min=26 avg=29
        68: data copyin reached 1000 times
            device time(us): total=23,888,588 max=33,546 min=23,378 avg=23,888
        68: kernel launched 1000 times
            grid: [4094] block: [256]
            device time(us): total=2,398,101 max=2,961 min=2,137 avg=2,398
            elapsed time(us): total=2,407,481 max=2,971 min=2,146 avg=2,407
        68: data copyout reached 1000 times
            device time(us): total=20,664,362 max=27,788 min=20,511 avg=20,664 
        77: data copyout reached 1000 times
            device time(us): total=20,571,541 max=24,837 min=20,521 avg=20,571

Note the number of copies between host and device and the excessive amount of time spent completing the iteration. 

The figure below illustrates the difference between the elapsed times using CPU threads versus OpenACC.  The OpenACC code is slower than the serial code!

[Figure: Naive OpenACC - elapsed time, CPU threads versus naive OpenACC]

OpenACC with Data Regions

To minimize the copies between host and device, we define a data region on the device and keep the data within that region: 

 float err = 1.0f, tol = 1.0e-5f;
 int iter = 0, iter_max = 100;

 #pragma acc data copy(A), create(Anew)
 while (err > tol && iter < iter_max) {
     err = 0.0f;

     #pragma acc parallel loop reduction(max:err)
     for (int j = 1; j < n - 1; j++) {
         for (int i = 1; i < m - 1; i++) {
             Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] 
              + A[j-1][i] + A[j+1][i]);
             err = fmaxf(err, fabsf(Anew[j][i] - A[j][i]));
         }
     }

     #pragma acc parallel loop
     for (int j = 1; j < n - 1; j++) {
         for (int i = 1; i < m - 1; i++) {
             A[j][i] = Anew[j][i];
         }
     }
     iter++;
 }

Then, output from the executable at the command line looks like

 Accelerator Kernel Timing data
  main NVIDIA devicenum=0
        time(us): 4,802,950
        51: data copyin reached 1 times
            device time(us): total=22,768 max=22,768 min=22,768 avg=22,768
        57: kernel launched 1000 times
            grid: [4094] block: [256]
            device time(us): total=2,611,387 max=2,817 min=2,593 avg=2,611
            elapsed time(us): total=2,620,044 max=2,900 min=2,601 avg=2,620
        57: reduction kernel launched 1000 times
            grid: [1] block: [256]
            device time(us): total=18,083 max=842 min=16 avg=18
            elapsed time(us): total=27,731 max=852 min=25 avg=27
        69: kernel launched 1000 times
            grid: [4094] block: [256]
            device time(us): total=2,130,162 max=2,599 min=2,112 avg=2,130
            elapsed time(us): total=2,139,919 max=2,712 min=2,122 avg=2,139 
        83: data copyout reached 1 times
            device time(us): total=20,550 max=20,550 min=20,550 avg=20,550

Note that with data regions the data is copied between host and device only once in each direction, and the total time to complete the iteration drops dramatically. 

The figure below illustrates the difference between the elapsed times involved using OpenACC with data regions and using CPU threads.

[Figure: OpenACC with Data Regions - elapsed time, OpenACC with data regions versus CPU threads]


Performance Tips

Keep in mind the following tips when using OpenACC:

  • eliminate pointer arithmetic - use array indexing instead and the restrict keyword
  • make while loops parallelizable by rewriting them as for loops (see the sketch after this list)
  • rectangles are better than triangles - avoid inner triangular loops
  • restructure linearized arrays into arrays with computed indices
  • inline the function calls in directive regions
  • be well aware of data movement between host and device
  • use data regions to avoid inefficiencies
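
For instance, the second tip rewrites a sequential while loop as a countable for loop so that the compiler can parallelize it.  A hedged sketch, assuming arrays x and y of n elements and a scalar a:

 // resists parallelization: the trip count is not expressed in countable form
 int i = 0;
 while (i < n) {
     y[i] = a * x[i];
     i++;
 }

 // parallelizable: the trip count is known before the loop starts
 #pragma acc parallel loop
 for (int i = 0; i < n; i++)
     y[i] = a * x[i];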

Exercises



