Part A - Introduction

Initial Assessment

  • Describe how to profile a purely serial application
  • Introduce Big-O classification

Profiling | Big-O Classification | Exercises


While some software applications are ideal candidates for speedup, others are less so.  To determine if enhancing an application's performance is a worthwhile effort, we assess the application by identifying those parts that consume disproportionate amounts of execution time.  We call these parts the application's hotspots and try to reduce their consumption of elapsed time by restructuring their logic. 

Once we have restructured the hotspots' logic and identified the surviving hotspots, we apply Amdahl's and Gustafson's laws to predict the maximum possible speedup.  If the predicted speedup is sufficiently significant, it may be worth our while to parallelize the application.  If not, we leave the application in its optimized serial form. 

This chapter introduces the requisite tools, demonstrates how to replace hotspot code with more efficient code, and describes Big-O notation.  We use Big-O notation to predict the growth rate of execution time with problem size. 


Profiling

Profiling is the process of determining where an application spends most of its process time.  Compilers such as gcc include built-in support for profiling serial applications. 

Profiling Steps

Profiling involves three distinct steps:

  1. Preparing an executable for profiling
  2. Generating a profile by running the executable
  3. Analyzing the generated profile

Preparation

To prepare an executable of a C++ source file named myapp.cpp for profiling using the g++ compiler, enter the following command:

 > g++ -O2 -g -pg -o myapp myapp.cpp

-O2 directs the compiler (g++) to optimize the code to level 2. 
-g directs the compiler to produce debugging information for use with the gdb debugger. 
-pg directs the compiler to include the executable code required for profiling. 
-o directs the compiler to name the executable myapp. 

Generation

To generate a profile, run the executable in the normal way:

 > myapp 500

The string following myapp is the command-line argument read by the executable.  Running the instrumented executable also writes the raw profiling data to a file named gmon.out in the current working directory; gprof reads this file during analysis. 

Analysis

The GNU profiler, named gprof, provides two distinct profiles of an application:

  • a flat profile
  • a call graph

To analyze the flat profile, enter the following command:

 > gprof -p -b myapp > myapp.flt

-p directs the profiler (gprof) to output a flat profile. 
-b directs the profiler to omit detailed explanations of the column headings from the output. 

To analyze the call graph, enter the following command:

 > gprof -q -b myapp > myapp.clg

-q directs the profiler (gprof) to output a call graph. 
-b directs the profiler to omit detailed explanations of the column headings from the output. 

Example - Sort

Bubble Sort

The following program sorts n random numbers in ascending order using a bubble-sort algorithm:

 // Bubble Sort
 // bs.cpp

 #include <iostream>
 #include <fstream>
 #include <iomanip>
 #include <cstdlib>
 #include <ctime>

 void init(double* a, int n) {
     std::srand(std::time(NULL));
     double f = 1.0 / RAND_MAX;
     for (int i = 0; i < n; i++)
         a[i] = std::rand() * f;
 }

 void display(double* a, int n) {
     std::ofstream out("sorted.txt");
     out << std::fixed << std::setprecision(3);
     for (int i = 0; i < n; i++) {
         out << std::setw(10) << a[i] << '|';
         if (i % 10 == 9) out << std::endl;
     }
 }

 void bubbleSort(double* a, int n) {
     int i, j;
     double temp;
     for (i = n - 1; i > 0; i--) {
         for (j = 0; j < i; j++) {
             if (a[j] > a[j+1]) {
                 temp = a[j];
                 a[j] = a[j+1];
                 a[j+1] = temp;
             }
         }
     }
 }

 int main(int argc, char** argv) {
     if (argc != 2) {
         std::cerr << "** invalid number of arguments **\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     double* a = new double[n];
     init(a, n);
     bubbleSort(a, n);
     display(a, n);
     delete [] a;
 }

To produce the flat profile, enter the following commands:

 > g++ -O2 -g -pg -obs bs.cpp
 > bs 500000
 > gprof -p -b bs > bs.flt

The flat profile looks something like:

 Each sample counts as 0.01 seconds.
   %   cumulative  self         self   total
   time  seconds seconds calls s/call s/call name
 100.00   751.31  751.31     1 751.31 751.31 bubbleSort(double*, int)
   0.00   751.32    0.01     1   0.01   0.01 init(double*, int)
   0.00   751.33    0.01     1   0.01   0.01 display(double*, int)
   0.00   751.33    0.00     1   0.00   0.00 global constructors keyed
                                             to _Z4initPdi
   0.00   751.33    0.00     1   0.00   0.00 __static_initialization_
                                             and_destruction_0(int, int) 
   0.00   751.33    0.00     1   0.00   0.00 std::fixed(std::ios_base&)

To produce the call graph, enter the following command:

 > gprof -q -b bs > bs.clg

The call graph looks something like:

 granularity: each sample hit covers 4 byte(s) for 0.00% of 751.33 seconds

 index % time    self  children called name
                                           <spontaneous>
 [1]    100.0    0.00  751.33          main [1]
               751.31    0.00     1/1      bubbleSort(double*, int) [2]
                 0.01    0.00     1/1      init(double*, int) [3]
                 0.01    0.00     1/1      display(double*, int) [4]
 ----------------------------------------
               751.31    0.00     1/1      main [1]
 [2]    100.0  751.31    0.00     1    bubbleSort(double*, int) [2]
 ----------------------------------------
                 0.01    0.00     1/1      main [1]
 [3]      0.0    0.01    0.00     1    init(double*, int) [3]
 ----------------------------------------
                 0.01    0.00     1/1      main [1]
 [4]      0.0    0.01    0.00     1    display(double*, int) [4]
                 0.00    0.00     1/1      std::fixed(std::ios_base&) [12]
 ----------------------------------------
                 0.00    0.00     1/1      __do_global_ctors_aux [13]
 [10]     0.0    0.00    0.00     1    global constructors keyed
                                           to _Z4initPdi [10]
                 0.00    0.00     1/1      __static_initialization_
                                           and_destruction_0(int, int) [11]
 ----------------------------------------
                 0.00    0.00     1/1      global constructors keyed
                                           to _Z4initPdi [10]
 [11]     0.0    0.00    0.00     1    __static_initialization_
                                           and_destruction_0(int, int) [11] 
 ----------------------------------------
                 0.00    0.00     1/1      display(double*, int) [4]
 [12]     0.0    0.00    0.00     1    std::fixed(std::ios_base&) [12]
 ----------------------------------------
 
 Index by function name

   [10] global constructors keyed to _Z4initPdi (bs.cpp)
   [11] __static_initialization_and_destruction_0(int, int) (bs.cpp)
   [4]  display(double*, int)
   [2]  bubbleSort(double*, int)
   [3]  init(double*, int)
   [12] std::fixed(std::ios_base&)

Clearly, this application spends nearly all of its execution time in its bubble sort function. 

Quick Sort

Let us replace the bubble-sort algorithm with the well-known, significantly faster quick-sort algorithm:

 // Quick Sort
 // qs.cpp

 #include <iostream>
 #include <fstream>
 #include <iomanip>
 #include <cstdlib>
 #include <ctime>

 void init(double* a, int n) {
     std::srand(std::time(NULL));
     double f = 1.0 / RAND_MAX;
     for (int i = 0; i < n; i++)
         a[i] = std::rand() * f;
 }

 void display(double* a, int n) {
     std::ofstream out("sorted.txt");
     out << std::fixed << std::setprecision(3);
     for (int i = 0; i < n; i++) {
         out << std::setw(10) << a[i] << '|';
         if (i % 10 == 9) out << std::endl;
     }
 }

 void quickSort(double* a, int left, int right) {
     int i = left;  // index of left to right scan
     int j = right; // index of right to left scan

     if (right - left >= 1) {
         double tmp, pivot = a[left];

         while (j > i) {
             while (a[i] <= pivot && i <= right && j > i)
                 i++;
             while (a[j] > pivot && j >= left && j >= i)
                 j--;
             if (j > i) {
                 tmp = a[i];
                 a[i] = a[j];
                 a[j] = tmp;
             }
         }
         tmp = a[left];
         a[left] = a[j];
         a[j] = tmp;
         quickSort(a, left, j - 1);
         quickSort(a, j + 1, right);
     }
 }

 int main(int argc, char** argv) {
     if (argc != 2) {
         std::cerr << "** invalid number of arguments **\n";
         return 1;
     }
     int n = std::atoi(argv[1]);
     double* a = new double[n];
     init(a, n);
     quickSort(a, 0, n - 1);
     display(a, n);
     delete [] a;
 }

The flat profile for this quick-sort version looks something like:

 Each sample counts as 0.01 seconds.
   %   cumulative  self          self    total
   time  seconds seconds calls ms/call ms/call name
  80.00     0.08    0.08     1   80.00   80.00 quickSort(double*, int, int) 
  10.00     0.09    0.01     1   10.00   10.00 init(double*, int)
  10.00     0.10    0.01     1   10.00   10.00 display(double*, int)
   0.00     0.10    0.00     1    0.00    0.00 global constructors keyed
                                               to _Z4initPdi
   0.00     0.10    0.00     1    0.00    0.00 __static_initialization_
                                               and_destruction_0(int, int)
   0.00     0.10    0.00     1    0.00    0.00 std::fixed(std::ios_base&)

The call graph for this quick-sort version looks something like:

 granularity: each sample hit covers 4 byte(s) for 10.00% of 0.10 seconds

 index %time    self children called    name
                                             <spontaneous>
 [1]   100.0    0.00   0.10             main [1]
                0.08   0.00   1/1           quickSort(double*, int, int) [2]
                0.01   0.00   1/1           init(double*, int) [3]
                0.01   0.00   1/1           display(double*, int) [4]
 -----------------------------------------
                              333399        quickSort(double*, int, int) [2] 
                0.08   0.00   1/1           main [1]
 [2]    80.0    0.08   0.00   1+333399  quickSort(double*, int, int) [2]
                              333399        quickSort(double*, int, int) [2]
 -----------------------------------------
                0.01   0.00   1/1           main [1]
 [3]    10.0    0.01   0.00   1         init(double*, int) [3]
 -----------------------------------------
                0.01   0.00   1/1           main [1]
 [4]    10.0    0.01   0.00   1         display(double*, int) [4]
                0.00   0.00   1/1           std::fixed(std::ios_base&) [12]
 -----------------------------------------
                0.00   0.00   1/1           __do_global_ctors_aux [13]
 [10]    0.0    0.00   0.00   1         global constructors keyed to
                                            _Z4initPdi [10]
                0.00   0.00   1/1           __static_initialization_
                                            and_destruction_0(int, int) [11]
 -----------------------------------------
                0.00   0.00   1/1           global constructors keyed to
                                            _Z4initPdi [10]
 [11]    0.0    0.00   0.00   1         __static_initialization_
                                            and_destruction_0(int, int) [11]
 -----------------------------------------
                0.00   0.00   1/1           display(double*, int) [4]
 [12]    0.0    0.00   0.00   1         std::fixed(std::ios_base&) [12]
 -----------------------------------------
 
 Index by function name

  [10] global constructors keyed to _Z4initPdi (qs.cpp)
  [3]  init(double*, int)
  [2]  quickSort(double*, int, int)
  [11] __static_initialization_and_destruction_0(int, int) (qs.cpp)
  [4]  display(double*, int)
  [12] std::fixed(std::ios_base&)

Evaluation

Replacing the bubble-sort algorithm with the quick-sort algorithm produced a speedup of 751.31 / 0.08 = 9391 on a sort of 500,000 elements! 

Note that the speedup for this application varies with problem size.  While the improvement in elapsed time for a large number of elements is clearly dramatic, improvements for much smaller numbers of elements may be insignificant.
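To see this variation directly, we can time a sort for several problem sizes using the &lt;chrono&gt; facilities.  In this sketch, std::sort stands in for either sort routine, and timeSort is a name chosen here for illustration:

 ```cpp
 // time_sort.cpp - time a sort for several problem sizes (sketch)
 #include <algorithm>
 #include <chrono>
 #include <cstdlib>
 #include <initializer_list>
 #include <iostream>

 // fill an array with n random values and time one sort of it;
 // returns elapsed wall-clock time in seconds
 double timeSort(int n) {
     double* a = new double[n];
     double f = 1.0 / RAND_MAX;
     for (int i = 0; i < n; i++)
         a[i] = std::rand() * f;
     auto start = std::chrono::steady_clock::now();
     std::sort(a, a + n);            // stand-in for bubbleSort/quickSort
     auto stop = std::chrono::steady_clock::now();
     delete [] a;
     return std::chrono::duration<double>(stop - start).count();
 }

 int main() {
     for (int n : {1000, 10000, 100000})
         std::cout << n << " elements: " << timeSort(n) << " s\n";
 }
 ```

Substituting bubbleSort or quickSort for std::sort and tabulating the results shows how the gap between the two algorithms widens with n.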

To predict the growth rate in elapsed time of an application with problem size, we focus on the hotspot algorithms within the application and examine their growth rates. 


Big-O Classification

The conventional notation for classifying the growth rate in execution time with problem size is called Big-O.  Big-O notation defines an upper bound on an algorithm's performance and is represented symbolically as O(.), where the dot denotes the growth rate.  For example, an algorithm that adds a constant to each element of an array scales linearly.  That is, an array with 2000 elements takes twice as long to process as an array with 1000 elements.  We classify that algorithm as O(n), where n is the number of elements. 
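The add-a-constant example above can be written directly.  A minimal sketch (addConstant is a name chosen here for illustration):

 ```cpp
 #include <cassert>

 // add a constant to every element: one pass over the array,
 // so the number of steps grows linearly with n - O(n)
 void addConstant(double* a, int n, double c) {
     for (int i = 0; i < n; i++)
         a[i] += c;
 }

 int main() {
     double a[4] = {1.0, 2.0, 3.0, 4.0};
     addConstant(a, 4, 0.5);
     assert(a[0] == 1.5 && a[3] == 4.5);
 }
 ```

Doubling n doubles the number of loop iterations, which is precisely what the O(n) classification captures.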

To identify the Big-O for an algorithm, we examine its basic operations and apply two rules:

  1. if there are several growth rates, select the highest rate and ignore the lower rates
  2. if the growth rate includes a constant value, ignore the constant

The functions on the left describe the number of steps in three different algorithms.  In each case, we keep only the highest-order term and ignore its constant coefficient.  The order of each function is shown on the right:

f(n) = 3                    f(n) is O(1)
g(n) = 3n² + 3              g(n) is O(n²)
h(n) = 5n³ + 3n² + 3        h(n) is O(n³)

More commonly encountered growth rates are listed below:

 Big-O       Scale        Description                                  Example
 O(1)        constant     the number of steps does not change          retrieve the value of a particular element of an array
 O(log n)    logarithmic  the number of steps halves at each stage     binary search of a sorted array
 O(n)        linear       the number of steps per element is constant  search an unsorted array for a specified value
 O(n log n)  log linear   the number of steps per element is O(log n)  quick-sort algorithm (best case)
 O(n²)       quadratic    the number of steps per element is O(n)      bubble-sort algorithm (average case)
 O(n³)       cubic        the number of steps per element is O(n²)     naive multiplication of two matrices
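For instance, the O(log n) entry can be illustrated with a binary search of a sorted array, which halves the search interval on each iteration:

 ```cpp
 #include <cassert>

 // binary search of a sorted array: each iteration halves the
 // interval [lo, hi], so at most about log2(n) + 1 steps - O(log n)
 int binarySearch(const double* a, int n, double key) {
     int lo = 0, hi = n - 1;
     while (lo <= hi) {
         int mid = lo + (hi - lo) / 2;   // avoids overflow of lo + hi
         if (a[mid] < key)
             lo = mid + 1;
         else if (a[mid] > key)
             hi = mid - 1;
         else
             return mid;                 // found
     }
     return -1;                          // not found
 }

 int main() {
     double a[5] = {1.1, 2.2, 3.3, 4.4, 5.5};
     assert(binarySearch(a, 5, 3.3) == 2);
     assert(binarySearch(a, 5, 6.6) == -1);
 }
 ```

Doubling the array's size adds only one more halving step, which is why the search scales logarithmically rather than linearly.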

Given the Big-O for an algorithm, we can predict the growth rate in its execution time with problem size.  The table below lists orders for different Big-O classes and different problem sizes.

 Problem Size   O(1)  O(log n)       O(n)  O(n log n)              O(n²)                       O(n³)
 10                1         3         10          30                100                       1,000
 100               1         5        100         500             10,000                   1,000,000
 1,000             1         7      1,000       7,000          1,000,000               1,000,000,000
 10,000            1        10     10,000     100,000        100,000,000           1,000,000,000,000
 100,000           1        12    100,000   1,200,000     10,000,000,000       1,000,000,000,000,000
 1,000,000         1        14  1,000,000  14,000,000  1,000,000,000,000  1,000,000,000,000,000,000
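The logarithm column appears to round the natural logarithm of n up to the nearest integer; under that assumption, the table can be regenerated with a few lines of code:

 ```cpp
 #include <cmath>
 #include <iomanip>
 #include <iostream>

 int main() {
     std::cout << std::fixed << std::setprecision(0);
     std::cout << "n  log(n)  n*log(n)  n^2  n^3\n";
     for (double n = 10.0; n <= 1.0e6; n *= 10.0) {
         double logn = std::ceil(std::log(n));  // natural log, rounded up
         std::cout << n << "  " << logn << "  " << n * logn << "  "
                   << n * n << "  " << n * n * n << '\n';
     }
 }
 ```

Any base and rounding convention would do for a growth-rate comparison; constant factors disappear in the Big-O classification anyway.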

Note that a quick-sort (O(n log n)) on 1,000,000 elements running on a desktop can beat a bubble-sort (O(n²)) running on a supercomputer. 

Amdahl's Law

Once we have Big-O-classified an algorithm, we can estimate the problem size at which parallelization is worthwhile. 

Consider running the quick-sort algorithm on 1,000,000 elements.  For this problem size, the process time is 0.21 seconds and the algorithm consumes 80% of this time. 

Is there any point in parallelizing this algorithm?  What would be the benefit?  According to Amdahl's law the maximum expected speedup with 480 processors would be

 S480 = 1 / ( 1 - 0.80 + 0.80 / 480 ) = 4.96

At best, we expect the process time to drop from 0.21 secs to about 0.21 / 4.96 = 0.04 secs.  Since this algorithm only takes 0.21 secs to execute, we clearly need a much larger problem size to justify the effort involved in parallelizing the application. 
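The calculation above can be expressed as a small helper function.  In this sketch, amdahl is a name chosen here for illustration; P denotes the parallelizable fraction and N the number of processors:

 ```cpp
 #include <cassert>
 #include <iostream>

 // Amdahl's law: S(N) = 1 / ((1 - P) + P / N)
 // P = fraction of execution time that can be parallelized
 // N = number of processors
 double amdahl(double P, int N) {
     return 1.0 / ((1.0 - P) + P / N);
 }

 int main() {
     double S = amdahl(0.80, 480);
     assert(S > 4.95 && S < 4.97);           // about 4.96
     std::cout << "S480 = " << S << '\n';
 }
 ```

Raising N further yields diminishing returns: even with infinitely many processors, the speedup cannot exceed 1 / (1 - P) = 5 when only 80% of the work is parallelizable.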


Exercises

  • Run the bubble-sort and quick-sort algorithms for different array sizes.  Tabulate the execution time for each run.  Add a column with the logarithm of the execution time.  Use a spreadsheet to plot your results on a log(time) versus problem size graph. 
  • Read Rob Bell on Big-O notation
  • Read Recursive Design on Big-O notation
  • Read NIST for a more formal definition of Big-O notation
  • Check out the Big-O Cheatsheet






Designed by Chris Szalwinski