Part D - Directive Programming
OpenACC
Show how to generate parallel code using the compiler
Show how to minimize copies between host and device using directives
Directive programming is a well-established, straightforward technique
for guiding the compiler in generating parallel code. The
programmer passes hints to the compiler, which generates
parallel code accordingly where possible. The compiler may or
may not satisfy every request.
OpenACC is a programming standard for directing compilers in their
generation of parallel code for accelerators. These accelerators
include not only GPUs, but also many-core CPUs. Compilers
that implement this standard are commercially available from The
Portland Group (PGI), Cray and CAPS Enterprise.
PGI is currently owned by Nvidia.
This chapter introduces the OpenACC API available from the Portland
Group, demonstrates the use of OpenACC directives in generating parallel
code from serial code, demonstrates how to minimize memory transfers
between the host and the device, and provides a few tips for maximizing
accelerator performance.
OpenACC
The OpenACC standard describes compiler directives, library routines and
environment variables for expressing parallelism in C, C++ and Fortran
programs and offloading code to accelerators. The currently
supported release of OpenACC is version 2.0a.
The standard is available at
http://www.openacc.org/sites/default/files/OpenACC%202%200.pdf.
Each OpenACC directive starts with the prefix #pragma acc.
#pragma identifies a compiler directive, while
acc identifies the directive as an OpenACC directive.
Most OpenACC directives apply to the structured block of code
immediately following the directive:
#pragma acc ...
for (int i = 0; i < n; i++) {
a[i] = i * i;
} // end of the directive's scope
The OpenACC execution model defines three levels: gangs,
workers and vectors. The correspondence
to CUDA levels is implementation dependent. Two possible
correspondences are shown below.
| OpenACC | CUDA (a)          | CUDA (b)            |
|---------|-------------------|---------------------|
| gang    | block             | block               |
| worker  | warp              | ---                 |
| vector  | threads of a warp | threads of a block  |
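The level at which a loop executes can be requested explicitly with gang, worker and vector clauses on the loop directive. The sketch below (the function name and bounds are illustrative, not from the text) maps the rows of a matrix to gangs and the columns to vector lanes; a compiler without OpenACC support simply ignores the pragmas and runs the loops serially:

```c
// Map the rows of an n x m matrix to gangs and the columns to
// vector lanes; each element is set to i + j.
void init_matrix(int n, int m, float *a) {
    #pragma acc parallel loop gang
    for (int i = 0; i < n; i++) {
        #pragma acc loop vector
        for (int j = 0; j < m; j++)
            a[i * m + j] = (float)(i + j);
    }
}
```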
C Syntax
A complete OpenACC directive takes the form
#pragma acc construct [clause[,], clause ...] new-line
construct is a placeholder
that defines the type of construct.
clause is a placeholder
for the construct's qualifiers. The brackets
enclose optional elements of the directive.
Constructs
OpenACC constructs include:
- parallel - launches a number
of gangs that execute in parallel
- loop - describes the type
of parallelism to use on the immediately following loop
and its nested loops
- data - defines a region of
the program within which data is accessible by the
accelerator
- kernels - identifies loops
to be executed on the accelerator, typically a sequence
of operations
parallel assumes that the code is safe
to parallelize and requires that the programmer ensures that the
code is indeed safe.
kernels performs a parallelism
analysis and parallelizes only what the compiler accepts as
safe to parallelize.
It is common to see parallel followed
by loop.
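The difference shows on the same loop: with parallel loop the programmer asserts that the iterations are independent, while with kernels the compiler parallelizes only what its own analysis accepts. A minimal sketch (function names are illustrative):

```c
// parallel loop: the programmer guarantees the iterations are
// independent; the compiler parallelizes without further analysis.
void square_parallel(int n, float *restrict a) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        a[i] = a[i] * a[i];
}

// kernels: the compiler analyzes the region and parallelizes
// only the loops it can prove safe.
void square_kernels(int n, float *restrict a) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        a[i] = a[i] * a[i];
}
```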
Runtime Library Routines
The OpenACC data types, enumeration types, and runtime library prototypes
are exposed through the header file openacc.h.
Runtime routines include:
- int acc_get_num_devices(acc_device_t) - number of devices of the specified type attached to the host
- void acc_set_device_type(acc_device_t) - set the device type to use when executing an accelerator parallel or kernels region
- void acc_set_device_num(int, acc_device_t) - set the device to use from amongst the attached devices
- int acc_get_device_num(acc_device_t) - returns the device number of the specified device that will be used to run the next accelerator parallel or kernels region
- void acc_shutdown(acc_device_t) - shutdown the connection to the given accelerator device
- d_void* acc_malloc(size_t) - allocate memory on the device
- void acc_free(d_void*) - deallocate memory on the device
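A minimal sketch of these routines in use; the #ifdef _OPENACC guard (the macro defined by OpenACC-aware compilers) lets the same file build with compilers that lack OpenACC support:

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

// Query the attached NVIDIA devices and select the first one,
// falling back gracefully when OpenACC support is not compiled in.
int select_first_device(void) {
#ifdef _OPENACC
    int n = acc_get_num_devices(acc_device_nvidia);
    if (n > 0) {
        acc_set_device_type(acc_device_nvidia);
        acc_set_device_num(0, acc_device_nvidia);
    }
    return n;
#else
    return 0;   /* no accelerator support in this build */
#endif
}
```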
PGI Compiler
Currently, the PGI compiler is available for free for a
30-day trial period.
You will need to purchase a license to use the compiler after the trial period.
Compilation
To compile a C program named sample.c containing OpenACC
directives using the PGI compiler, enter the following at the command line
pgcc -acc [-Minfo] [-ta=nvidia,host] -o sample sample.c
The brackets [ ] enclose optional flags.
The -acc flag identifies the presence
of OpenACC directives in the source code.
The -Minfo flag directs the compiler
to output informative messages while processing the directives.
The -ta flag directs the compiler to
generate host and/or device versions.
Environment Variables
The PGI_ACC_TIME environment variable
controls the output of timing statistics. If this variable
is set to 1, the executable will output
timing statistics.
We set this variable at the command line. Under bash
export PGI_ACC_TIME=1
or under csh
setenv PGI_ACC_TIME 1
Example
The following example converts a serial version of a level 1
BLAS function to a parallel version using OpenACC.
Serial Version
The following host function
executes in a single thread:
void saxpy(int n, float a, const float* x, float* restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
// called for 1,048,576 elements
saxpy(1<<20, 2.0f, x, y);
The restrict keyword informs the
compiler that no other pointer shares this address during
the function's execution.
Parallel Version
Source with OpenACC Directive
The following OpenACC directive offloads the iteration to execute
in parallel on an accelerator
void saxpy(int n, float a, const float* x, float* restrict y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
// called for 1,048,576 elements
saxpy(1<<20, 2.0f, x, y);
The compiler
- generates parallel code to execute on the device
- allocates memory on the device
- copies the input data from the host to the device
- executes the parallel code on the device
- copies the output from the device to the host
- deallocates memory on the device
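These implicit steps can also be spelled out with explicit data clauses. The sketch below (the _explicit suffix is ours) mirrors the compiler's default treatment of saxpy: copyin for the read-only input x, and copy for y, which is both read and written:

```c
// Explicit data clauses mirroring the compiler's defaults:
// x is only read on the device, so copyin suffices; y is read
// and written, so it must be copied in and back out.
void saxpy_explicit(int n, float a, const float *x, float *restrict y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```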
Compiler Output
The output from compilation at the command line looks like
pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
11, Accelerator kernel generated
13, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
11, Generating present_or_copyin(x[0:n])
Generating present_or_copy(y[0:n])
Generating NVIDIA code
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
0:n identifies an array section starting at index
0 and containing n elements.
This is a common notation in parallel programming.
Note that the compiler-generated executables include code for three different
compute capabilities.
Profiling Output
Output from running the executable at the command line looks like
Accelerator Kernel Timing data
saxpy NVIDIA devicenum=0
time(us): 3,256
11: data copyin reached 2 times
device time(us): total=1,619 max=892 min=727 avg=809
11: kernel launched 1 times
grid: [4096] block: [256]
device time(us): total=714 max=714 min=714 avg=714
elapsed time(us): total=724 max=724 min=724 avg=724
15: data copyout reached 1 times
device time(us): total=923 max=923 min=923 avg=923
The times listed here are in microseconds (us).
Memory Transfers
OpenACC directives access data on the host by default.
If parallelism is expressed within an iteration, this default
behavior warrants intervention. Without intervention, a
directive within an iteration can cause multiple copies
between the host and the device and, consequently, a notable
performance penalty. The solution is to define a
data region on the device and store the data within that region.
Jacobi Iteration
Jacobi iteration is a common algorithm that repeatedly replaces the value
at each location with the average of the values of its neighbours.
The iteration continues until all corrections become negligible.
Serial Version
The serial code for a Jacobi Iteration is as follows
float err = 1.0f, tol = 1.0e-5f;
int iter = 0, iter_max = 100;
while (err > tol && iter < iter_max) {
    err = 0.0f;
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmaxf(err, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}
Naive OpenACC Version
The naive OpenACC equivalent of this serial code is
float err = 1.0f, tol = 1.0e-5f;
int iter = 0, iter_max = 100;
while (err > tol && iter < iter_max) {
    err = 0.0f;
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmaxf(err, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}
The compiler output at the command line looks like
pgcc -Minfo=all -ta=nvidia:5.0,cc3x -acc -Minfo=accel -o sample2 sample2.c
main:
56, Accelerator kernel generated
57, #pragma acc loop gang /* blockIdx.x*/
59, #pragma acc loop vector(256) /*threadIdx.x*/
56, Generating present_or_copyin(A[0:][0:])
Generating present_or_copyout(Anew[1:4094][1:4094])
Generating NVIDIA code
Generating compute capability 3.0 binary
59, Loop is parallelizable
68, Accelerator kernel generated
69, #pragma acc loop gang /* blockIdx.x*/
71, #pragma acc loop vector(256) /* threadIdx.x*/
68, Generating present_or_copyout(A[1:4094][1:4094])
Generating present_or_copyin(Anew[1:4094][1:4094])
Generating NVIDIA code
Generating compute capability 3.0 binary
71, Loop is parallelizable
Output from the executable at the command line looks like
Accelerator Kernel Timing data
main NVIDIA devicenum=0
time(us): 93,201,190
56: data copyin reached 1000 times
device time(us): total=23,049,452 max=28,928 min=22,761 avg=23,049
56: kernel launched 1000 times
grid: [4094] block: [256]
device time(us): total=2,609,928 max=2,812 min=2,593 avg=2,609
elapsed time(us): total=2,872,585 max=3,022 min=2,642 avg=2,872
56: reduction kernel launched 1000 times
grid: [1] block: [256]
device time(us): total=19,218 max=724 min=16 avg=19
elapsed time(us): total=29,070 max=734 min=26 avg=29
68: data copyin reached 1000 times
device time(us): total=23,888,588 max=33,546 min=23,378 avg=23,888
68: kernel launched 1000 times
grid: [4094] block: [256]
device time(us): total=2,398,101 max=2,961 min=2,137 avg=2,398
elapsed time(us): total=2,407,481 max=2,971 min=2,146 avg=2,407
68: data copyout reached 1000 times
device time(us): total=20,664,362 max=27,788 min=20,511 avg=20,664
77: data copyout reached 1000 times
device time(us): total=20,571,541 max=24,837 min=20,521 avg=20,571
Note the number of copies between host and device and the excessive amounts
of time involved in completing the iteration.
The figure below illustrates
the difference between the elapsed times using CPU threads versus OpenACC.
The OpenACC code is slower than the serial code!

OpenACC with Data Regions
To minimize the copies between host and device, we define data regions
on the device and store the data in those regions:
float err = 1.0f, tol = 1.0e-5f;
int iter = 0, iter_max = 100;
#pragma acc data copy(A), create(Anew)
while (err > tol && iter < iter_max) {
    err = 0.0f;
    #pragma acc parallel loop reduction(max:err)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmaxf(err, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}
Then, output from the executable at the command line looks like
Accelerator Kernel Timing data
main NVIDIA devicenum=0
time(us): 4,802,950
51: data copyin reached 1 times
device time(us): total=22,768 max=22,768 min=22,768 avg=22,768
57: kernel launched 1000 times
grid: [4094] block: [256]
device time(us): total=2,611,387 max=2,817 min=2,593 avg=2,611
elapsed time(us): total=2,620,044 max=2,900 min=2,601 avg=2,620
57: reduction kernel launched 1000 times
grid: [1] block: [256]
device time(us): total=18,083 max=842 min=16 avg=18
elapsed time(us): total=27,731 max=852 min=25 avg=27
69: kernel launched 1000 times
grid: [4094] block: [256]
device time(us): total=2,130,162 max=2,599 min=2,112 avg=2,130
elapsed time(us): total=2,139,919 max=2,712 min=2,122 avg=2,139
83: data copyout reached 1 times
device time(us): total=20,550 max=20,550 min=20,550 avg=20,550
Note how few copies now occur between host and device and how much
less time the iteration takes with data regions.
The figure below illustrates
the difference between the elapsed times involved using OpenACC with data regions
and using CPU threads.

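When the host must inspect data while it still resides in a device data region, for example to report progress periodically, the OpenACC update directive refreshes one copy from the other without ending the region. A minimal sketch (the function, array size and halving step are illustrative, not from the text):

```c
#include <stdio.h>

#define N 256

// Keep A resident on the device for the whole loop, but refresh
// the host copy every 100 iterations so it can be inspected.
// With a non-OpenACC compiler the pragmas are ignored and the
// code runs serially on the host.
void iterate(float *A, int iter_max) {
    #pragma acc data copy(A[0:N])
    for (int iter = 0; iter < iter_max; iter++) {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            A[i] = 0.5f * A[i];
        if (iter % 100 == 0) {
            #pragma acc update host(A[0:N])
            printf("iter %d: A[0] = %f\n", iter, A[0]);
        }
    }
}
```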
Performance Tips
Keep in mind the following tips when using OpenACC:
- eliminate pointer arithmetic - use array indexing instead and the restrict keyword
- make while loops parallelizable by using for loops instead
- rectangles are better than triangles - avoid inner triangular loops
- restructure linearized arrays into arrays with computed indices
- inline the function calls in directive regions
- be well aware of data movement between host and device
- use data regions to avoid inefficiencies
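Some of these tips are mechanical rewrites. For instance, a convergence while loop has no countable trip count, so it cannot be analyzed for parallelism; recasting it as a bounded for loop with an early exit makes the iteration space explicit. A sketch (the Newton iteration for sqrt(2) is just an illustrative payload):

```c
#include <math.h>

// Before: while (fabsf(next - x) > tol) { ... }  -- no countable
// trip count. After: a bounded for loop with an early exit once
// the correction becomes negligible.
float relax(float x, float tol, int iter_max) {
    for (int iter = 0; iter < iter_max; iter++) {
        float next = 0.5f * (x + 2.0f / x);   // Newton step for sqrt(2)
        if (fabsf(next - x) < tol)
            return next;
        x = next;
    }
    return x;
}
```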
Exercises