Part A - Introduction

Heterogeneous Computers

  • Introduce heterogeneous computing
  • Introduce the concurrency-oriented programming paradigm
  • Define performance metrics



Heterogeneous computers are computers that consist of several distinct computational units.  Each unit may have its own instruction set architecture (ISA), which the manufacturer has specialized for that unit's particular tasks.  Common computational units include

  • CPUs (central processing units)
  • GPUs (graphics processing units)
  • DSPs (digital signal processors)
  • ASICs (application-specific integrated circuits)
  • FPGAs (field-programmable gate arrays)

This course introduces the subject of programming heterogeneous computers that consist of CPUs and GPUs. 

Modern CPUs are highly sophisticated processors originally designed for serial operations.  Engineers have perfected their design over the last half century.  CPUs that have several cores on the same chip are called multi-core CPUs.  Each core processes a single instruction at a time while minimizing memory access times. 

Modern GPUs are processors designed from the outset to perform floating-point operations in parallel.  Engineers have designed these units to process thousands of instructions concurrently.  GPUs are today's popular answer to massive but simple parallel processing.  Currently, the number of processors on a GPU is in the four-digit range.  To distinguish CPU designs with several cores from GPU designs with a large number of cores, we use the terms multi-core and many-core respectively. 


[Figure source: SIGGRAPH Asia 2010 - OpenCL by Example]

A heterogeneous application combines the advantageous features of a CPU with those of a GPU for an efficient overall solution.  The developer offloads massively parallel tasks to the GPU and executes the serial tasks on the CPU.  Software design involves determining which tasks to offload to the GPU and how to optimize the offloading and the offloaded code.  This involves identifying bottlenecks and parallelizing candidate algorithms.  Offloading to the GPU reduces the workload on the CPU, enabling the CPU to perform its own tasks more efficiently. 
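
To make the offloading pattern concrete, the following minimal sketch uses CUDA (our illustration, not part of the course notes; this course's own examples may use a different API such as OpenCL, and the names add and n are arbitrary).  The CPU performs the serial setup, launches a data-parallel task on the GPU, and synchronizes before using the result.

    // offload.cu - illustrative only: the CPU runs the serial control
    // flow and offloads element-wise vector addition to the GPU
    #include <cstdio>

    __global__ void add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];               // one GPU thread per element
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));    // memory visible to CPU and GPU
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }  // serial setup on the CPU
        add<<<(n + 255) / 256, 256>>>(a, b, c, n);   // massively parallel task on the GPU
        cudaDeviceSynchronize();                     // coordinate CPU and GPU execution
        printf("c[0] = %f\n", c[0]);                 // 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }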

The challenges facing software developers accustomed primarily to serial programming include:

  • learning to think in parallel
  • learning to write and optimize parallel code
  • learning to coordinate concurrent execution of code on the CPU and the GPU

Introduction

Engineers developed GPU technology over the first decade of the twenty-first century, initially for gaming applications.  At the time, GPU technology was not considered a serious challenge to the CPU ecosystem.  Many universities and colleges offered game programming courses as optional courses in their curricula.  Digital game development and high-performance computing were niche courses in graduate degree programs.  Academics treated these fields as marginal.  Programming heterogeneous computers remained a special skill found primarily amongst expert game programmers. 

GPU technology has now disrupted the CPU ecosystem.  GPU technology exhibits three features of a disruptive technology: a wide audience, a sustained demand and a high-volume market.  The GPUs developed for the digital game industry possess all three features: digital games are apps that many participants play and enjoy, the need for increased realism continues to drive demand, and the digital games market is a high-volume consumer market.  In 2010, Activision's Call of Duty grossed $775M in the first 5 days of its release; in 2011, Electronic Arts sold 5M units of Battlefield 3 within its first week.

Before the advent of GPU technology, parallel programs were written primarily for high-performance computing (HPC) applications.  Many HPC applications used clusters of commodity hardware, which required cooling and high-speed interconnect technologies.  These platforms were not a viable alternative to the processors available in desktops, workstations or portable devices.  Serial processors in these devices had kept up well with the demands of the non-gamer marketplace.  Their performance had doubled every 18 months for decades.  Shrinking processor feature sizes kept driving down cost and creating room for more transistors.  Conventional wisdom amongst software developers was a wait-a-bit attitude: after a year or two, the processing power that we need will become available.  Little incentive existed to program in parallel outside the specialized HPC community. 

Moore's law summarizes the evolution of single-processor technology.  In 1965, Gordon Moore, a co-founder of Intel Corporation, predicted that the number of transistors that can be placed inexpensively on an integrated circuit would double every two years (Moore, Gordon E. (1965). "Cramming more components onto integrated circuits". Electronics Magazine, p. 4). 



The Concurrency Paradigm

The first major transition in the consumer space towards GPU technology became apparent around 2005.  Manufacturers had been aware that serial processors alone would prove unable to deliver increased performance with further increases in the number of transistors. 

The graph below illustrates how the increases in clock rate and performance that accompanied increases in the number of transistors have levelled off since 2005.  The top curve shows the steady increase in the number of transistors on a single chip, following Moore's law.  The frequency curve shows the levelling off in clock speed.  The sequential-processor performance curve shows the levelling off in performance per clock cycle.  Note how 2005 marks the point at which the need for a transition became evident and pressing. 

[Graph: transistor count, clock frequency and sequential-processor performance versus year]

As single-processor performance levelled off, CPU manufacturers like Intel and AMD started placing multiple cores on a single die.  The era of multi-core and many-core processors had begun. 

Paradigm Shift

The turn towards concurrency in hardware elevated multi-core programming to an essential skill on par with single-core programming.  New programming languages enabled mainstream developers to implement algorithms that execute concurrently without relying on experts or enduring the steep learning curves encountered in HPC programming. 

The significant drop in the cost of manufacturing single processors, the availability of massively parallel processors at consumer price points, and the inability of serial-processor technology to continue delivering increased performance with further increases in the number of transistors set the stage for the adoption of a new programming paradigm.  We call this paradigm the concurrency-oriented paradigm. 

The free lunch to which developers had become accustomed over the preceding decades had come to an end.  The potential of single-processor solutions was saturated, and computer scientists with an eye to the future began designing programming tools to enhance performance in novel ways. 

The paradigm-shift to concurrency-oriented programming represents the most pervasive shift in software development since the introduction of the object-oriented paradigm. 


Under the concurrency-oriented programming paradigm, the software developer actively optimizes the performance of applications by minimizing the elapsed execution time using heterogeneous computing techniques.  The developer assigns serial-processing-friendly tasks to multi-core CPUs and parallel-processing-friendly tasks to other dedicated units. 

The terms concurrent and parallel are not synonyms.  Concurrency-oriented programming (COP) refers to programming multiple tasks that execute independently, but not necessarily simultaneously.  Parallel programming refers to programming multiple tasks that execute not only independently but also simultaneously.  Parallel programming is therefore a proper subset of concurrency-oriented programming. 
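
The distinction shows up directly in code.  The following minimal C++ sketch (ours, for illustration; the function name sum is arbitrary) launches two tasks with std::async.  The tasks are concurrent by construction; whether they also run in parallel depends on how many cores the scheduler grants them.

    // concurrent.cpp - two concurrent tasks; on a multi-core CPU
    // the scheduler may also run them in parallel
    #include <future>
    #include <iostream>

    long long sum(long long lo, long long hi) {
        long long s = 0;
        for (long long i = lo; i < hi; ++i) s += i;   // independent work
        return s;
    }

    int main() {
        // launch two independent tasks concurrently
        auto a = std::async(std::launch::async, sum, 0LL, 50000000LL);
        auto b = std::async(std::launch::async, sum, 50000000LL, 100000000LL);
        std::cout << a.get() + b.get() << std::endl;  // combine the partial results
        return 0;
    }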

Concurrency-oriented programming is complementary to both object-oriented and structured programming.  Just as object-oriented languages provide opportunities to implement structured-programming principles, concurrency-oriented language extensions provide opportunities to implement object-oriented principles. 

The GPU Alternative

In 2005, GPU technology offered an obvious alternative route to increased performance.  As this technology matured through the growing sophistication of digital games, per-processor cost dropped dramatically.  In less than a decade, GPU technology stood as the realistic, inexpensive alternative not only to further refinements in serial-processor technology, but also to traditional high-performance computing hardware. 

The rapid evolution of GPUs, the availability of these massively parallel processors at consumer price points and the inability of serial-processor technology to continue delivering increased performance with further increases in the number of transistors promoted the adoption of the new programming paradigm. 


Performance

The primary measure of performance in concurrency-oriented programming is the speedup in elapsed execution time.  Two formulas predict this speedup in ideal conditions:

  • Amdahl's Law
  • Gustafson's Law

These laws help us predict the benefits that may arise from implementing algorithms in parallel form.

Amdahl's Law

Amdahl's law predicts the maximum speedup that an application can achieve as a result of converting processes that execute on a serial processor to ones that execute on parallel processors.  Amdahl's speedup law uses the proportion of the program that can be processed in parallel (P) and the number of processors available (n):

S(n) = 1 / ( 1 - P + P/n )

S(n) is the maximum speedup that we can expect after converting the parallelizable portion of a program into processes that execute in parallel. 

For example, if we convert 50% of the executable code in a program to parallel code executing on 10 processors, we can expect no more than a 1.818x speedup.  If we can convert 90% of the executable code to parallel code executing on 10 processors, we can expect no more than a speedup of 5.263x. 

As n approaches a very large number, P/n approaches zero and the formula simplifies to:

S ≈ 1 / ( 1 - P )

For example, if we convert 50% of the executable code in a program to parallel code, we can expect no more than a 2x speedup.  If we can convert 90% of the executable code to parallel code, we can expect no more than a speedup of 10x. 

These are ideal scenarios.  Actual speedups will always be below these theoretical values.

Amdahl's law assumes that the problem size remains unchanged. 
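
The following short sketch (ours, not part of the course notes) evaluates both forms of Amdahl's law and reproduces the figures quoted above:

    // amdahl.cpp - evaluate Amdahl's law for parallelizable
    // fraction P and n processors, and its large-n limit
    #include <iostream>

    double amdahl(double P, double n) { return 1.0 / (1.0 - P + P / n); }
    double amdahlLimit(double P)      { return 1.0 / (1.0 - P); }

    int main() {
        std::cout << amdahl(0.5, 10)  << std::endl;   // 1.81818 - 50% parallel, 10 processors
        std::cout << amdahl(0.9, 10)  << std::endl;   // 5.26316 - 90% parallel, 10 processors
        std::cout << amdahlLimit(0.5) << std::endl;   // 2  - 50% parallel, unlimited processors
        std::cout << amdahlLimit(0.9) << std::endl;   // 10 - 90% parallel, unlimited processors
        return 0;
    }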

Gustafson's Law

In 1988, John Gustafson and Edwin Barsis noted that many developers are less interested in minimizing the elapsed time for a given problem size and more interested in maximizing the problem size that can be handled within a given elapsed time.  Accordingly, they proposed a law that accounts for increases in problem size with increases in the number of parallel processors. 

Gustafson's law predicts speedup using the number of processors (n) and the non-parallelizable fraction of the process (1-P):

S(n) = n - ( 1 - P ) ∙ ( n - 1 )

For example, with 10 processors, if we convert 50% of the executable code to parallel code, we can expect no more than a 5.5x speedup.  If we convert 90% of the executable code to parallel code, we can expect no more than a 9.1x speedup.  Compare these figures to 1.8x and 5.3x under Amdahl's law. 

With 100 processors, if we convert 50% of the executable code to parallel code, we can expect no more than a 50.5x speedup.  If we convert 90% of the executable code to parallel code, we can expect no more than a 90.1x speedup.  In other words, from this viewpoint, any speedup is possible given enough processors.
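
A corresponding sketch for Gustafson's law (again ours, for illustration) reproduces these figures:

    // gustafson.cpp - evaluate Gustafson's law for parallelizable
    // fraction P of the work and n processors
    #include <iostream>

    double gustafson(double P, double n) { return n - (1.0 - P) * (n - 1.0); }

    int main() {
        std::cout << gustafson(0.5, 10)  << std::endl;   // 5.5  - 50% parallel, 10 processors
        std::cout << gustafson(0.9, 10)  << std::endl;   // 9.1  - 90% parallel, 10 processors
        std::cout << gustafson(0.5, 100) << std::endl;   // 50.5 - 50% parallel, 100 processors
        std::cout << gustafson(0.9, 100) << std::endl;   // 90.1 - 90% parallel, 100 processors
        return 0;
    }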

Gustafson's reference point is the available equipment, while Amdahl's reference point is the problem size.  Amdahl's law encourages sticking with serial processing as long as possible, while Gustafson's law encourages parallelizing wherever possible. 

Process Time

While the primary measure of performance is elapsed time, process time is also important.  The process time of each computational unit helps identify bottlenecks within an application.  Since a single task may determine when other tasks can start execution, reducing the process time of a critical task may result in a significant decrease in the elapsed time of the application as a whole. 

We design algorithms that reduce process time wherever those reductions significantly affect elapsed time.  Since CPUs and GPUs may execute their processes concurrently, not all individual process times are necessarily crucial to the final result. 


Exercises

Recommended Reading:





