Converting an optimized serial application to an application that includes
concurrent processing is an iterative process. Once we have identified
the principal hotspots in the serial application and resolved them within a
serial context, we parallelize the principal hotspot. Our solution
enables:
- a reduction in execution time
- the solution of larger problems in the same time
- a refinement of the existing solution in the same time
We redesign the algorithm at the principal hotspot to
execute on parallel hardware. By doing so, we improve
utilization of the host hardware by our software.
Converting serial code to parallel code cannot be left entirely to compilers.
Compilers that generate parallel code make assumptions that
might not hold in every case. At a minimum, we must explicitly
inform an auto-parallelizing compiler that a code segment may be parallelized. In
the absence of such a compiler, we parallelize serial algorithms ourselves.
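For example, a directive-based system such as OpenMP (named here only as an
illustration; the course text does not prescribe a particular tool) requires an
explicit annotation before the compiler may parallelize a loop:

/* Illustrative sketch only: the pragma tells the compiler that the iterations
   of this loop are independent and may execute concurrently; without it,
   the loop remains serial. */
void add_arrays(int n, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}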
This chapter describes the APOD pattern for converting serial code to parallel code,
defines basic terminology used in concurrency-oriented programming, introduces the
concept of granularity and identifies two parallel patterns that are ideally suited
for execution on GPU hardware.
Parallelization Cycle
A typical design cycle in GPU programming consists of
four distinct steps. We repeat these four steps until we find the
application's performance acceptable.
In each cycle, we:
- Assess
- Parallelize
- Optimize
- Deploy
The acronym for this pattern is APOD.

Assess
The Assess step identifies the major hotspot, which is the one that
consumes the largest share of execution time.
Initially, we use a standard serial profiler like that described in the
chapter entitled Initial Assessment.
In subsequent cycles, we use a parallel profiler like that described in
the chapter entitled Parallel Profiling.
Parallelize
In the Parallelize step, we focus on the major hotspot
identified in the Assess step. We convert the serial
code to parallel code either by implementing an
available algorithm or by introducing our own code.
Optimize
In the Optimize step, we refine our parallel solution by taking
the specific hardware directly into account. For example, we prefer a
solution that keeps intermediate values in register memory
rather than moving them to and from system memory.
Deploy
In the Deploy step, we pause the iteration process and report the
improvement that we have achieved.
Terminology
An algorithm that we choose to parallelize often consumes both computational
and communications resources. The algorithm may involve a large
number of variables, access a large amount of data and involve a
large number of similar, if not identical, computations.
Hardware-wise, arithmetic logic units (ALUs) perform the computations,
while buses transfer the data between the fast memory alongside the ALUs and
system memory (RAM). Concurrent use of these resources enables
us to hide some of the time spent by one resource (computation or
communication) within the time spent by the other resource.
The basic terms of parallel programming used to model such scenarios
include:
- tasks
- threads
- granularity
Tasks
A task is a unit of work with a separate flow of control. Tasks
are potential units of parallel work. A task can execute independently
of any other task.
Standard strategies for identifying tasks and creating work units include:
- divide and conquer - divide the algorithm into sub-algorithms
where each sub-algorithm executes independently
- scatter and gather - send different parts of the data to different
processors, perform the same computations concurrently and then
collect (gather) the results of those computations
Threads
We schedule the work units created for parallel execution onto software threads.
In general, a software thread may be:
- control-intensive - searching, sorting, and parsing
- data-intensive - image processing, simulation, modeling, and data mining
- compute-intensive - iterative methods, numerical methods, and financial modeling
The underlying system schedules the software threads into hardware threads.
Granularity
The granularity of a task is the ratio of its computation time to
its communication time.
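For a rough, invented illustration: a task that computes for 8 microseconds and
communicates for 2 microseconds has a granularity of 8 / 2 = 4 and is relatively
coarse-grained; a task with those two figures reversed has a granularity of 1 / 4
and is relatively fine-grained.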
Granularity may vary significantly across different tasks. A well-designed
solution may consist of many tasks with differing granularities.
Design Options
We classify designs according to the granularity that they address:
- A coarse-grained design (high granularity)
- suits a task of relatively high computational intensity
- the dominant computations hide asynchronous data transfers
- A fine-grained design (low granularity)
- suits a task of relatively high communication intensity
- the dominant communications hide the computations
Each design has distinct benefits and drawbacks:
- A coarse-grained design
- can perform completely different tasks concurrently
- might lead to difficulties in balancing load efficiently
- A fine-grained solution
- can balance load by providing enough work units
- might not generate enough work to hide asynchronous communications
Hardware Suitability
Multi-core CPUs are better suited to coarse-grained solutions and compute-intensive tasks.
Many-core GPUs are better suited to fine-grained solutions and data-intensive tasks.
Introductory Examples
The following examples highlight the parallelize
and optimize steps of the APOD cycle.
Parallelize Step
Data-Independent Example
Consider the simple iteration shown below. The serial code, listed first, adds
the elements of two arrays element by element and stores each result in the
corresponding element of the second array. The serial code completes all computation
and communication for one element before starting the same process for the next
element. The parallel version, listed after it, adds the element values and stores
the results at the same time for several tasks.
Serial:
for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];

Parallel (tasks execute in any order):
thread i:  y[i] = x[i] + y[i]      thread j:  y[j] = x[j] + y[j]      ...
The parallel code does not impose the sequential order of the serial code.
In the parallel code, the indices of the tasks that
execute concurrently have no particular relation to one another.
A separate task is associated with each index and the tasks do not
communicate with one another. Each task involves two operations
and requires three memory accesses. This high intensity of communication
relative to computation indicates that a fine-grained solution is preferable.
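A minimal CUDA sketch of this fine-grained decomposition appears below; the kernel
name, the use of unified memory, and the launch configuration are illustrative
assumptions rather than details taken from the course text.

// Illustrative CUDA sketch: one thread (task) per element, y[i] = x[i] + y[i].
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addKernel(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this task
    if (i < n)                                       // guard threads past the end
        y[i] = x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));        // unified memory keeps the sketch short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;       // enough blocks to cover n elements
    addKernel<<<blocks, threads>>>(x, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                     // expect 3.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Each thread performs the same two operations and three memory accesses described
above; the order in which the threads run is left entirely to the hardware.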
Reduction Example
The serial code listed first below accumulates the values of all elements of the
array in its first element. This iteration accumulates the elements
in a single pass, element by element. The nested iteration listed after it
accumulates partial sums in equally spaced elements, which leads
to a final accumulation in the first element.
Serial:
for (int i = 1; i < n; i++)
    a[0] += a[i];

Nested iteration:
for (int s = 1; s <= n / 2; s *= 2)
    for (int j = 0; j < n; j += 2 * s)
        a[j] += a[j + s];

The steps of the innermost iteration are independent of one another.
Accordingly, we can execute them in any order. On the other hand,
since each outermost iteration depends on values calculated in the preceding
iteration, we cannot execute those iterations in any order.
We must wait until the inner iteration has completed before proceeding to
the next set (that is, before incrementing the stride s).
We refer to each such enforced wait as a required
synchronization.
Nested iteration:
for (int s = 1; s <= n / 2; s *= 2)
    for (int j = 0; j < n; j += 2 * s)
        a[j] += a[j + s];

Parallel (tasks within each stride execute in any order):
for (int s = 1; s <= n / 2; s *= 2)
    a[0] += a[s]      a[2*s] += a[3*s]      ...
    // synchronize here

At each stage, the number of tasks that are independent of one
another is halved.
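On a GPU, the required synchronization at the end of each stride maps naturally onto
a barrier. The single-block CUDA sketch below is one illustrative way to express the
strided accumulation; the kernel name and the restriction to a single block (n no
larger than the block size, and a power of two) are simplifying assumptions.

// Illustrative CUDA sketch of the strided reduction: thread j owns element j.
__global__ void reduceKernel(float *a, int n)
{
    int j = threadIdx.x;                        // one thread per array element
    for (int s = 1; s <= n / 2; s *= 2) {
        if (j % (2 * s) == 0 && j + s < n)      // threads at multiples of 2*s accumulate
            a[j] += a[j + s];
        __syncthreads();                        // required synchronization before the next stride
    }
}

// Illustrative launch for n up to the maximum block size (e.g. 1024):
//   reduceKernel<<<1, n>>>(d_a, n);

After the kernel completes, a[0] holds the accumulated sum, exactly as in the
nested serial iteration.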
Optimize Step
Consider the communication in the unoptimized code shown first below.
Each accumulation involves a system-memory access. To reduce the
communication overhead, we optimize the code to store the intermediate
accumulated value in much faster memory and only copy the final result
to system memory once the accumulation has completed.
Unoptimized:
for (int s = 1; s <= n / 2; s *= 2)
    a[0] += a[s]      a[2*s] += a[3*s]      ...
    // synchronize here

Optimized:
for (int s = 1; s <= n / 2; s *= 2)
    fm0 += a[s]       fm1 += a[3*s]         ...
    // synchronize here
a[0] = fm0;

The fm variables (fm0, fm1, ...) are fast-memory (register) variables.
For iterations with many strides (that is, for large n),
the reduction in communication time can be significant.
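One common CUDA realization of this idea stages the values in fast on-chip shared
memory, which here plays the role of the fm variables (a per-thread register could
hold the running value equally well), and touches system (global) memory only to
load the inputs and to store the final result. The kernel below is an illustrative
sketch under the same single-block, power-of-two assumptions as before, not the
course text's implementation.

// Illustrative CUDA sketch: accumulate in fast on-chip memory and write the
// final result to global memory exactly once.
__global__ void reduceFastKernel(const float *a, float *out, int n)
{
    extern __shared__ float fm[];               // fast on-chip staging area
    int j = threadIdx.x;

    fm[j] = (j < n) ? a[j] : 0.0f;              // one global read per element
    __syncthreads();

    for (int s = 1; s <= blockDim.x / 2; s *= 2) {
        if (j % (2 * s) == 0)
            fm[j] += fm[j + s];                 // intermediate traffic stays on chip
        __syncthreads();                        // required synchronization per stride
    }

    if (j == 0)
        out[0] = fm[0];                         // single global write of the final result
}

// Illustrative launch for n up to the maximum block size:
//   reduceFastKernel<<<1, n, n * sizeof(float)>>>(d_a, d_out, n);

Only the final accumulated value travels back to system memory, so the per-stride
global traffic of the unoptimized version disappears.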
Exercises
Primary Reading:
- Chapter 10 of the Course Text
Secondary Reading:
- Chapter 1 of the OpenCL Text