Part A - Introduction

From Serial to Parallel

Introduce the APOD pattern
Introduce granularity and synchronization

Parallelization Cycle | Terminology | Introductory Examples | Exercises

Converting an optimized serial application to an application that includes concurrent processing is an iterative process.  Once we have identified the principal hotspots in the serial application and resolved them within a serial context, we parallelize the principal hotspot.  Our solution enables:

  • a reduction in execution time
  • the solution of larger problems in the same time
  • a refinement of the existing solution in the same time

We redesign the algorithm at the principal hotspot to execute on parallel hardware.  By doing so, we improve utilization of the host hardware by our software. 

Converting serial code to parallel code cannot be left entirely to compilers.  Compilers that generate parallel code make assumptions that do not hold in every case.  At a minimum, we need to inform a parallelizing compiler explicitly that a code segment may be parallelized.  In the absence of such a compiler, we parallelize serial algorithms ourselves. 
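For instance, with a compiler that supports OpenMP, a single pragma informs the compiler that the iterations of a loop are independent and may be parallelized; a compiler without OpenMP support simply ignores the hint and leaves the loop serial.  A minimal sketch (the function name add is ours):

```cpp
#include <vector>

// adds the elements of x to the corresponding elements of y
// the pragma tells an OpenMP-aware compiler that the iterations
// are independent and may execute concurrently
void add(const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp parallel for
    for (int i = 0; i < (int)y.size(); i++)
        y[i] = x[i] + y[i];
}
```

Compiled without OpenMP enabled, this code still runs correctly, just serially.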

This chapter describes the APOD pattern for converting serial code to parallel code, defines basic terminology used in concurrency-oriented programming, introduces the concept of granularity and identifies two parallel patterns that are ideally suited for execution on GPU hardware. 

Parallelization Cycle

A typical design cycle in GPU programming consists of four distinct steps.  We repeat these four steps until we find the application's performance acceptable. 

In each step, we

  • Assess
  • Parallelize
  • Optimize
  • Deploy

The acronym for this pattern is APOD. 


The Assess step identifies the major hotspot, which is the one that consumes the largest share of execution time. 

Initially, we use a standard serial profiler like that described in the chapter entitled Initial Assessment.  In subsequent cycles, we use a parallel profiler like that described in the chapter entitled Parallel Profiling.


In the Parallelize step, we focus on the major hotspot identified in the Assess step.  We convert the serial code to parallel code either by implementing an available algorithm or by introducing our own code. 


In the Optimize step, we refine our parallel solution by taking the specific hardware directly into account.  We attempt a solution that accesses intermediate values through register memory rather than moving them to and from system memory. 


In the Deploy step, we pause the iteration process and report the improvement that we have achieved. 


An algorithm that we choose to parallelize often consumes both computational and communication resources.  The algorithm may involve a large number of variables, access a large amount of data, and perform a large number of similar, if not identical, computations.  In hardware terms, arithmetic logic units (ALUs) perform the computations, while buses transfer the data between the memory close to the ALUs and system memory (RAM).  Concurrent use of these resources enables us to hide some of the time spent on one activity (computation or communication) within the time spent on the other.

The basic terms of parallel programming used to model such scenarios include:

  • tasks
  • threads
  • granularity


A task is a unit of work with a separate flow of control.  Tasks are potential units of parallel work.  A task can execute independently of any other task. 

Standard strategies for identifying tasks and creating work units include:

  • divide and conquer - divide the algorithm into sub-algorithms where each sub-algorithm executes independently
  • scatter and gather - send different parts of the data to different processors and perform the same computations concurrently - collect the results of computations performed concurrently
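The scatter-and-gather strategy can be sketched on a shared-memory host using standard C++ threads; the names parallelSum, partial, and chunk below are ours, and the thread count is an illustrative parameter:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// scatter: each thread receives its own slice of the data
// gather:  the host collects the per-thread partial sums
double parallelSum(const std::vector<double>& data, unsigned nThreads) {
    std::vector<double> partial(nThreads, 0.0);
    std::vector<std::thread> pool;
    std::size_t chunk = (data.size() + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; t++)
        pool.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; i++)
                partial[t] += data[i];    // same computation, different data
        });
    for (auto& th : pool) th.join();      // gather point
    double sum = 0.0;
    for (double p : partial) sum += p;
    return sum;
}
```

Each thread performs the same computation on a different part of the data; the final loop gathers the concurrently computed results.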


We schedule the work units created for parallel execution onto software threads.  In general, a software thread may be:

  • control-intensive - searching, sorting, and parsing
  • data-intensive - image processing, simulation, modeling, and data mining
  • compute-intensive - iterative methods, numerical methods, and financial modeling

The underlying system schedules the software threads into hardware threads. 


The granularity of a task is the ratio of its computation time to its communication time. 

Granularity may vary significantly across different tasks.  For example, a task that computes for 90 ms and communicates for 10 ms has a granularity of 9 (relatively coarse), while a task that computes for 10 ms and communicates for 90 ms has a granularity of about 0.11 (relatively fine).  A well-designed solution may consist of many tasks with differing granularities. 

Design Options

We classify designs according to the granularity that they address:

  • A coarse-grained design (high granularity)
    • suits a task of relatively high computational intensity
    • the dominant computations hide asynchronous data transfers
  • A fine-grained design (low granularity)
    • suits a task of relatively high communication intensity
    • the dominant communications hide the computations

Each design has distinct benefits and drawbacks:

  • A coarse-grained design
    • can perform completely different tasks concurrently
    • might lead to difficulties in balancing load efficiently
  • A fine-grained design
    • can balance load by providing enough work units
    • might not generate enough work to hide asynchronous communications

Hardware Suitability

Multi-core CPUs are better suited to coarse-grained solutions and compute-intensive tasks. 

Many-core GPUs are better suited to fine-grained solutions and data-intensive tasks.

Introductory Examples

The following examples highlight the parallelize and optimize steps of the APOD cycle.

Parallelize Step

Data-Independent Example

Consider the simple iteration shown below.  The serial code adds the elements of two arrays element by element and stores each result in the corresponding element of the second array.  This serial code completes all computation and communication for one element before starting the same process for the next element.  The parallel pseudocode that follows adds the element values and stores the results at the same time for several tasks. 

 // serial
 for (int i = 0; i < n; i++)
     y[i] = x[i] + y[i];

 // parallel: thread i, thread j, ... in any order
 y[i] = x[i] + y[i];   y[j] = x[j] + y[j];   ... 

The parallel code does not impose the sequential order of the serial code.  In the parallel code, the indices of the tasks that execute concurrently have no particular relation to one another.  A separate task is associated with each index and the tasks do not communicate with one another.  Each task involves two operations and requires three memory accesses.  The intensity of communication over computation indicates that a fine-grained solution is preferable.
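The fine-grained version can be sketched on a multi-core host with one software thread per index.  Launching a thread per element is wasteful in practice, but the sketch mirrors the pseudocode above; the function name vectorAdd is ours:

```cpp
#include <thread>
#include <vector>

// each task handles one index; the tasks neither communicate
// with one another nor execute in any particular order
void vectorAdd(const std::vector<int>& x, std::vector<int>& y) {
    std::vector<std::thread> tasks;
    for (std::size_t i = 0; i < y.size(); i++)
        tasks.emplace_back([&, i] { y[i] = x[i] + y[i]; });
    for (auto& t : tasks) t.join();
}
```

The join loop only ensures that all tasks have finished; it imposes no order on their execution.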

Reduction Example

The serial code listed below accumulates the values of all elements of the array in its first element.  This iteration accumulates the elements in a single pass, element by element.  The nested iteration that follows accumulates partial sums in equally spaced elements, which leads to a final accumulation in the first element (as illustrated in the diagram below). 

 for (int i = 1; i < n; i++) 
     a[0] += a[i];

 for (int s = 1; s <= n / 2; s *= 2)
     for (int j = 0; j < n; j += 2 * s) 
         a[j] += a[j + s];

The steps of the innermost iteration are independent of one another.  Accordingly, we can execute them in any order.  On the other hand, since each outermost iteration depends on values calculated in the preceding iteration, we cannot execute those iterations in any order.  We must wait until the inner iteration has completed before proceeding to the next set (that is, before incrementing the stride s).  We refer to each such enforced wait as a required synchronization.
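The staged reduction with its required synchronization can be sketched on a multi-core host, with joining the threads of each stage standing in for the synchronization barrier; the function name reduce is ours, and n is assumed to be a power of 2:

```cpp
#include <thread>
#include <vector>

// strided reduction: the inner iterations run concurrently;
// joining the stage's threads enforces the required
// synchronization before the stride doubles
void reduce(std::vector<int>& a) {
    std::size_t n = a.size();                  // assumed a power of 2
    for (std::size_t s = 1; s <= n / 2; s *= 2) {
        std::vector<std::thread> stage;
        for (std::size_t j = 0; j < n; j += 2 * s)
            stage.emplace_back([&, j, s] { a[j] += a[j + s]; });
        for (auto& t : stage) t.join();        // required synchronization
    }
}
```

Within a stage, the tasks touch disjoint elements and may run in any order; across stages, the join enforces the required order.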

 for (int s = 1; s <= n / 2; s *= 2)
     for (int j = 0; j < n; j += 2 * s) 
         a[j] += a[j + s];

 for (int s = 1; s <= n / 2; s *= 2)
     a[0] += a[s];   a[2*s] += a[3*s];   ... 
     // synchronize here
At each stage, the number of mutually independent tasks halves.

Optimize Step

Consider the communication in the first code fragment below.  Each accumulation involves a system-memory access.  To reduce the communication overhead, we optimize the code to store each intermediate accumulated value in much faster memory and only copy the final result to system memory once the accumulation has completed. 

 for (int s = 1; s <= n / 2; s *= 2)
     a[0] += a[s];   a[2*s] += a[3*s];   ... 
     // synchronize here

 for (int s = 1; s <= n / 2; s *= 2)
     fm0 += a[s];   fm1 += a[3*s];   ... 
     // synchronize here
 a[0] = fm0;

The fm? variables are fast-memory (register) variables.  For iterations with many strides (that is, when n is large), the communication speedup can be significant.
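The optimization can be sketched on a multi-core host by having each task accumulate its share in a local variable, which the compiler can keep in a register, and write back to array memory only once; the names reduceFast, result, and fm are ours, and the task count is an illustrative parameter:

```cpp
#include <thread>
#include <vector>

// each task accumulates its strided share in a local (register)
// variable; only the final sum is copied back to a[0] in the
// slower array memory
void reduceFast(std::vector<int>& a, unsigned nTasks) {
    std::vector<int> result(nTasks, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nTasks; t++)
        pool.emplace_back([&, t] {
            int fm = 0;                          // fast-memory accumulator
            for (std::size_t i = t; i < a.size(); i += nTasks)
                fm += a[i];
            result[t] = fm;                      // one write per task
        });
    for (auto& th : pool) th.join();             // synchronize
    int fm0 = 0;
    for (int r : result) fm0 += r;
    a[0] = fm0;                                  // single copy to memory
}
```

Each task performs one memory write instead of one per accumulation, which is where the communication savings come from.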


Primary Reading:

  • Chapter 10 of the Course Text 

Secondary Reading:

  • Chapter 1 of the OpenCL Text 


Designed by Chris Szalwinski.  Licensed under a Creative Commons License.