While some software applications are ideal candidates for speedup, others
are less so. To determine if enhancing an application's performance
is a worthwhile effort, we assess the application by identifying those parts
that consume disproportionate amounts of execution time. We call
these parts the application's hotspots and try to reduce
their consumption of elapsed time by restructuring their logic.
Once we have restructured the hotspots' logic and identified the
surviving hotspots, we apply Amdahl's and Gustafson's
laws to predict the maximum possible speedup.
If the predicted speedup is sufficiently significant,
it may be worth our while to parallelize the
application. If not, we leave the application in
its optimized serial form.
This chapter introduces the requisite tools, demonstrates
how to replace hotspot code with more efficient code, and describes
Big-O notation, which we use to predict how execution time
grows with problem size.
Profiling
Profiling is the process of determining where an application spends most of its
execution time.
Standard compiler suites such as gcc include support for
profiling serial applications.
Profiling Steps
Profiling involves three distinct steps:
- Preparing an executable for profiling
- Generating a profile by running the executable
- Analyzing the generated profile
Preparation
To prepare an executable of a C++ source file named myapp.cpp
for profiling using the g++ compiler, enter
the following command:
> g++ -O2 -g -pg -o myapp myapp.cpp
-O2 directs the compiler (g++) to optimize the code to level 2.
-g directs the compiler to produce debugging information for use with the gdb debugger.
-pg directs the compiler to include the instrumentation code required for profiling.
-o directs the compiler to name the executable myapp.
Generation
To generate a profile, run the executable in the normal way, for example:
> myapp 500000
The string following myapp is the command-line
argument read by the executable. Running the instrumented executable
writes the profiling data to a file named gmon.out in the current
working directory; gprof reads this file in the analysis step.
Analysis
The GNU profiler, named gprof,
provides two distinct profiles of an application:
- a flat profile
- a call graph
To produce the flat profile, enter the following command:
> gprof -p -b myapp > myapp.flt
-p directs the profiler (gprof) to output a flat profile.
-b directs the profiler to omit detailed explanations of the column headings from the
output.
To produce the call graph, enter the following command:
> gprof -q -b myapp > myapp.clg
-q directs the profiler (gprof) to output a call graph.
-b directs the profiler to omit detailed explanations of the column headings from the
output.
Example - Sort
Bubble Sort
The following program sorts n random numbers in ascending order
using a bubble-sort algorithm:
// Bubble Sort
// bs.cpp
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <ctime>

// fill a[0..n-1] with random values in [0, 1]
void init(double* a, int n) {
    std::srand((unsigned)std::time(NULL));
    double f = 1.0 / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f;
}

// write a[0..n-1] to sorted.txt, 10 values per line
void display(double* a, int n) {
    std::ofstream out("sorted.txt");
    out << std::fixed << std::setprecision(3);
    for (int i = 0; i < n; i++) {
        out << std::setw(10) << a[i] << '|';
        if (i % 10 == 9) out << std::endl;
    }
}

// sort a[0..n-1] in ascending order using bubble sort
void bubbleSort(double* a, int n) {
    int i, j;
    double temp;
    for (i = n - 1; i > 0; i--) {
        for (j = 0; j < i; j++) {
            if (a[j] > a[j+1]) {
                temp = a[j];
                a[j] = a[j+1];
                a[j+1] = temp;
            }
        }
    }
}

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "** invalid number of arguments **\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    double* a = new double[n];
    init(a, n);
    bubbleSort(a, n);
    display(a, n);
    delete [] a;
}
To produce the flat profile, enter the following commands:
> g++ -O2 -g -pg -o bs bs.cpp
> bs 500000
> gprof -p -b bs > bs.flt
The flat profile looks something like:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 751.31 751.31 1 751.31 751.31 bubbleSort(double*, int)
0.00 751.32 0.01 1 0.01 0.01 init(double*, int)
0.00 751.33 0.01 1 0.01 0.01 display(double*, int)
0.00 751.33 0.00 1 0.00 0.00 global constructors keyed
to _Z4initPdi
0.00 751.33 0.00 1 0.00 0.00 __static_initialization_
and_destruction_0(int, int)
0.00 751.33 0.00 1 0.00 0.00 std::fixed(std::ios_base&)
To produce the call graph, enter the following command:
> gprof -q -b bs > bs.clg
The call graph looks something like:
granularity: each sample hit covers 4 byte(s) for 0.00% of 751.33 seconds
index % time self children called name
<spontaneous>
[1] 100.0 0.00 751.33 main [1]
751.31 0.00 1/1 bubbleSort(double*, int) [2]
0.01 0.00 1/1 init(double*, int) [3]
0.01 0.00 1/1 display(double*, int) [4]
----------------------------------------
751.31 0.00 1/1 main [1]
[2] 100.0 751.31 0.00 1 bubbleSort(double*, int) [2]
----------------------------------------
0.01 0.00 1/1 main [1]
[3] 0.0 0.01 0.00 1 init(double*, int) [3]
----------------------------------------
0.01 0.00 1/1 main [1]
[4] 0.0 0.01 0.00 1 display(double*, int) [4]
0.00 0.00 1/1 std::fixed(std::ios_base&) [12]
----------------------------------------
0.00 0.00 1/1 __do_global_ctors_aux [13]
[10] 0.0 0.00 0.00 1 global constructors keyed
to _Z4initPdi [10]
0.00 0.00 1/1 __static_initialization_
and_destruction_0(int, int) [11]
----------------------------------------
0.00 0.00 1/1 global constructors keyed
to _Z4initPdi [10]
[11] 0.0 0.00 0.00 1 __static_initialization_
and_destruction_0(int, int) [11]
----------------------------------------
0.00 0.00 1/1 display(double*, int) [4]
[12] 0.0 0.00 0.00 1 std::fixed(std::ios_base&) [12]
----------------------------------------
Index by function name
[10] global constructors keyed to _Z4initPdi (bs.cpp)
[11] __static_initialization_and_destruction_0(int, int) (bs.cpp)
[4] display(double*, int)
[2] bubbleSort(double*, int)
[3] init(double*, int)
[12] std::fixed(std::ios_base&)
Clearly, this application spends nearly all of its execution time in
its bubble sort function.
Quick Sort
Let us replace the bubble-sort algorithm with
the well-known, significantly faster quick-sort algorithm:
// Quick Sort
// qs.cpp
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <ctime>

// fill a[0..n-1] with random values in [0, 1]
void init(double* a, int n) {
    std::srand((unsigned)std::time(NULL));
    double f = 1.0 / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f;
}

// write a[0..n-1] to sorted.txt, 10 values per line
void display(double* a, int n) {
    std::ofstream out("sorted.txt");
    out << std::fixed << std::setprecision(3);
    for (int i = 0; i < n; i++) {
        out << std::setw(10) << a[i] << '|';
        if (i % 10 == 9) out << std::endl;
    }
}

// sort a[left..right] in ascending order using quick sort
// with the leftmost element as the pivot
void quickSort(double* a, int left, int right) {
    int i = left;  // index of left-to-right scan
    int j = right; // index of right-to-left scan
    if (right - left >= 1) {
        double tmp, pivot = a[left];
        while (j > i) {
            while (a[i] <= pivot && i <= right && j > i)
                i++;                 // skip elements <= pivot
            while (a[j] > pivot && j >= left && j >= i)
                j--;                 // skip elements > pivot
            if (j > i) {             // swap the out-of-place pair
                tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
            }
        }
        // place the pivot in its final position
        tmp = a[left];
        a[left] = a[j];
        a[j] = tmp;
        quickSort(a, left, j - 1);   // sort the left partition
        quickSort(a, j + 1, right);  // sort the right partition
    }
}

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "** invalid number of arguments **\n";
        return 1;
    }
    int n = std::atoi(argv[1]);
    double* a = new double[n];
    init(a, n);
    quickSort(a, 0, n - 1);
    display(a, n);
    delete [] a;
}
The flat profile for this quick-sort version looks something like:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
80.00 0.08 0.08 1 80.00 80.00 quickSort(double*, int, int)
10.00 0.09 0.01 1 10.00 10.00 init(double*, int)
10.00 0.10 0.01 1 10.00 10.00 display(double*, int)
0.00 0.10 0.00 1 0.00 0.00 global constructors keyed
to _Z4initPdi
0.00 0.10 0.00 1 0.00 0.00 __static_initialization_
and_destruction_0(int, int)
0.00 0.10 0.00 1 0.00 0.00 std::fixed(std::ios_base&)
The call graph for this quick-sort version looks something like:
granularity: each sample hit covers 4 byte(s) for 10.00% of 0.10 seconds
index %time self children called name
<spontaneous>
[1] 100.0 0.00 0.10 main [1]
0.08 0.00 1/1 quickSort(double*, int, int) [2]
0.01 0.00 1/1 init(double*, int) [3]
0.01 0.00 1/1 display(double*, int) [4]
-----------------------------------------
333399 quickSort(double*, int, int) [2]
0.08 0.00 1/1 main [1]
[2] 80.0 0.08 0.00 1+333399 quickSort(double*, int, int) [2]
333399 quickSort(double*, int, int) [2]
-----------------------------------------
0.01 0.00 1/1 main [1]
[3] 10.0 0.01 0.00 1 init(double*, int) [3]
-----------------------------------------
0.01 0.00 1/1 main [1]
[4] 10.0 0.01 0.00 1 display(double*, int) [4]
0.00 0.00 1/1 std::fixed(std::ios_base&) [12]
-----------------------------------------
0.00 0.00 1/1 __do_global_ctors_aux [13]
[10] 0.0 0.00 0.00 1 global constructors keyed to
_Z4initPdi [10]
0.00 0.00 1/1 __static_initialization_
and_destruction_0(int, int) [11]
-----------------------------------------
0.00 0.00 1/1 global constructors keyed to
_Z4initPdi [10]
[11] 0.0 0.00 0.00 1 __static_initialization_
and_destruction_0(int, int) [11]
-----------------------------------------
0.00 0.00 1/1 display(double*, int) [4]
[12] 0.0 0.00 0.00 1 std::fixed(std::ios_base&) [12]
-----------------------------------------
Index by function name
[10] global constructors keyed to _Z4initPdi (qs.cpp)
[3] init(double*, int)
[2] quickSort(double*, int, int)
[11] __static_initialization_and_destruction_0(int, int) (qs.cpp)
[4] display(double*, int)
[12] std::fixed(std::ios_base&)
Evaluation
Replacing the bubble-sort algorithm with the quick-sort algorithm
produced a speedup of 751.31 / 0.08 = 9391 on a sort of 500,000 elements!
Note that the speedup for this application varies with problem size. While
the improvement in elapsed time for a large number of elements is
clearly dramatic, improvements for much smaller numbers of elements may be
insignificant.
To predict the growth rate in elapsed time of an application with
problem size, we focus on the hotspot algorithms within the application
and examine their growth rates.
Big-O Classification
The conventional notation for classifying the growth rate of execution time
with problem size is called Big-O. Big-O notation defines
an upper bound on the growth rate of an algorithm's execution time and is written
as O(.), where the dot denotes the growth rate.
For example, an algorithm that adds a constant to each element of an array
scales linearly. That is, an array with 2000 elements takes twice as
long to process as an array with 1000 elements. We classify that algorithm
as O(n), where n is the number
of elements.
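The following sketch is one way to express such an O(n) operation in C++; the function name addConstant is ours, chosen for illustration. The loop performs a fixed amount of work per element, so the total number of steps grows linearly with n.
// add the constant c to every element of a[0..n-1]
// one pass through the array: the step count grows linearly with n
void addConstant(double* a, int n, double c) {
    for (int i = 0; i < n; i++)
        a[i] += c;
}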
To identify the Big-O for an algorithm, we examine its basic operations and
apply two rules:
- if there are several growth rates, select the highest rate and ignore
the lower rates
- if the growth rate includes a constant value, ignore the constant
The functions below describe the number of steps in three different
algorithms.
In each case, we drop all but the highest-order term and ignore its constant
coefficient. The order of each function is shown on the right.
f(n) = 3                  f(n) is O(1)
g(n) = 3n² + 3            g(n) is O(n²)
h(n) = 5n³ + 3n² + 3      h(n) is O(n³)
The more commonly encountered growth rates are listed below:
- O(1), constant: the number of steps does not change; example: retrieve the value of a particular element of an array
- O(log n), logarithmic: the number of steps cuts in half each time; example: binary search of a sorted array
- O(n), linear: the number of steps for each element in an array is constant; example: search an unsorted array for a specified value
- O(n log n), log linear: the number of steps for each element in an array is O(log n); example: the quick-sort algorithm (best case)
- O(n²), quadratic: the number of steps for each element in an array is O(n); example: the bubble-sort algorithm (average case)
- O(n³), cubic: the number of steps for each element in an array is O(n²); example: naive multiplication of two matrices
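To illustrate the logarithmic entry in the list above, the following sketch performs a binary search of a sorted array. Each comparison halves the remaining search range, so the number of steps is O(log n). The function name binarySearch is ours, chosen for illustration.
// return the index of key in the sorted array a[0..n-1], or -1 if absent
// each iteration halves the search range, so the search takes O(log n) steps
int binarySearch(const double* a, int n, double key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2; // midpoint of the current range
        if (a[mid] == key)
            return mid;
        else if (a[mid] < key)
            lo = mid + 1;             // discard the lower half
        else
            hi = mid - 1;             // discard the upper half
    }
    return -1;                        // key not found
}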
Given the Big-O for an algorithm, we can predict the growth
rate in its execution time with problem size.
The table below lists the approximate number of steps for different
Big-O classes and problem sizes.
Problem Size   O(1)   O(log n)        O(n)   O(n log n)               O(n²)                        O(n³)
          10      1          3          10           30                 100                        1,000
         100      1          5         100          500              10,000                    1,000,000
       1,000      1          7       1,000        7,000           1,000,000                1,000,000,000
      10,000      1         10      10,000      100,000         100,000,000            1,000,000,000,000
     100,000      1         12     100,000    1,200,000      10,000,000,000        1,000,000,000,000,000
   1,000,000      1         14   1,000,000   14,000,000   1,000,000,000,000    1,000,000,000,000,000,000
Note that a quick-sort (O(n log n)) of 1,000,000 elements running on a
desktop can beat a bubble-sort (O(n²)) running on a supercomputer.
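A rough calculation supports this claim. For n = 1,000,000, quick-sort performs on the order of
n log n ≈ 1,000,000 × 20 = 20,000,000 steps, while bubble-sort performs on the order of
n² = 1,000,000,000,000 steps. Even if the supercomputer executed each step, say, 10,000 times
faster than the desktop, the bubble-sort would still cost the equivalent of 100,000,000 desktop
steps, roughly five times more than the quick-sort.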
Amdahl's Law
Once we have Big-O-classified an algorithm, we can estimate the
problem size at which parallelization is worthwhile.
Consider running the quick-sort algorithm on 1,000,000 elements.
For this problem size, the process time is 0.21 seconds and the
algorithm consumes 80% of this time.
Is there any point in parallelizing this algorithm?
What would be the benefit?
According to Amdahl's law, the maximum expected speedup with
480 processors would be
S480 = 1 / (1 - 0.80 + 0.80 / 480) ≈ 4.96
At best, we expect the process time to drop
from 0.21 secs to about 0.21 / 4.96 = 0.04 secs. Since
this algorithm only takes 0.21 secs to execute, we clearly
need a much larger problem size to justify the effort involved
in parallelizing the application.
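The same calculation is easy to repeat for other fractions and processor counts. A minimal sketch, using a helper function amdahlSpeedup that we define here for illustration (P is the parallelizable fraction, N is the number of processors):
#include <iostream>

// Amdahl's law: maximum speedup for a parallelizable fraction P on N processors
double amdahlSpeedup(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main() {
    // the quick-sort example: 80% parallelizable, 480 processors
    std::cout << amdahlSpeedup(0.80, 480) << std::endl; // prints approximately 4.96
    return 0;
}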
Exercises
- Run the bubble-sort and quick-sort algorithms for different array
sizes. Tabulate the execution time for each run. Add a column
with the logarithm of the execution time. Use a spreadsheet to plot
your results on a log(time) versus problem size graph. (A timing
sketch follows this list.)
- Read Rob Bell on Big-O notation
- Read Recursive Design on Big-O notation
- Read NIST for a more formal definition of Big-O notation
- Check out the Big-O Cheatsheet
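For the first exercise, elapsed time can be measured directly inside the program. A minimal sketch using std::chrono follows; the timing harness is our own and assumes one of the sort functions from bs.cpp or qs.cpp is pasted alongside it or linked in.
#include <iostream>
#include <chrono>
#include <cstdlib>

// minimal timing harness: report the elapsed time of a sort for a given n
int main(int argc, char** argv) {
    if (argc != 2) return 1;
    int n = std::atoi(argv[1]);
    double* a = new double[n];
    // ... fill a with random values, as in init() from bs.cpp or qs.cpp ...

    auto start = std::chrono::steady_clock::now();
    // bubbleSort(a, n);             // or quickSort(a, 0, n - 1);
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::cout << n << " elements: " << elapsed.count() << " seconds\n";
    delete [] a;
}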