Part E - Standard Library

Multi-Threading

Introduce concurrency and multi-threading
Define the common terms associated with multi-threading

"Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly non-deterministic, and the job of the programmer becomes one of pruning that nondeterminism." Lee (2006)

Performance | Processes and Threads | Terminology | Policy Execution | Exercises



The industry-wide introduction of multi-core CPUs to the consumer market at the beginning of the millenium promoted the need for multi-threading support in all object-oriented languages.  Multi-core CPUs can execute different strands of code concurrently.  They enhance program performance through concurrent execution of independent tasks.  A multi-threaded solution improves the elapsed time to complete execution by distributing independent tasks across separate hardware threads.  C++11 introduced direct multi-threading facilities.  The standard library includes a comprehensive set of libraries that support simple as well as complex multi-threading.  C++17 enhanced the algorithms of the Standard Template Library to accommodate execution policies that allow both sequential and parallel computations.

This chapter introduces the topic of multi-threading.  It describes the difference between processes and threads, the importance of tasks and communications in multi-threaded programming and some of the terminology related to multi-threading.  This chapter concludes by listing some of the STL algorithms that accept execution policy arguments, which allow for parallel execution.


Performance

The performance of a program can be described in terms of two concepts:

  • tasks
  • communications

Tasks take time to compute and communications take time to complete.  A task refers to a group of instructions that represents an algorithm.  A task can refer to an entire program or an relatively small part of a large program.  A communication refers to the transfer of information to or from a processor.  The communication channel may be between the processor and main memory or the processor and a resource, such as a file stream, input stream or output stream. 

The time to complete a communication is called its latency.  The latency of a communication between a processor and a resource is at least one order of magnitude greater than that between the processor and main memory.  Accessing a resource can leave the processor idle for a considerable period of time.  A processor waiting for a communication to complete is available to execute another program. 

Operating systems are designed to switch from one program to another.  The system of switching from one program to another is called multi-tasking.  In this case, each task refers to a separate program. 

Role of Problem Size

The amount of time and space that a program requires to execute increases with problem size.  A large program can consume enough time or space to make its performance an important programming issue. 

The time and space that an algorithm requires for completion depends on its complexity.  Two distinct measures of a program's complexity are time complexity and space complexity.  Time complexity refers to the rate at which solution time grows with problem size.  Space complexity refers to the rate at which memory requirements grow with problem size.  For instance, the initialization of 1000 elements to a constant value takes 1000 units of time (linear complexity), while a bubble sort of the same 1000 elements takes about 1000 * 1000 = 1,000,000 units of time (quadratic complexity).  In other words, the sorting parts of a program may require our attention, while the access of individual elements of an array need not require our attention; especially if the problem size is large.


Processes and Threads

Multi-tasking differs from multi-threading.  Their difference lies in the distinction between a process and a thread.

Process

A process is an instance of a program executing on the host platform.  When the user instructs the operating system to load a program from a file into main memory, the system creates a process.  When the operating system transfers control to the program's entry point, the process starts executing.  At any instant during its execution the process contains all of the information related to its current activity.  When the process finishes executing, it returns control to the operating system and leaves the operating system ready to initiate a new process. 

If a process requests a resource, the operating system's multi-tasking scheduler blocks the process and may start or restart another process.  Once the requested resource is available, the scheduler places the blocked process in a wait state ready to restart.  When the scheduler blocks another executing process, it can restart the waiting process. 

This switch from process to process is called a context switch.  A process' context refers to the contents of the CPU's registers and program counters at the instant of blocking.  A context switch suspends an executing process and stores its context in memory.  The scheduler then retrieves the context of a waiting process from memory, restores it in the CPU's registers and program counters and restarts the reloaded process. 

A context switch on a process is computationally intensive.  The processor time required may be in the order of thousands of clock cycles. 

Thread

A thread is a sequence of program instructions that the scheduler can execute independently.  Threads exist within a process and are light-weight versions of processes.  A thread can run concurrently with other threads and share the same address space with other threads.  The task that each thread executes is part of the host process itself. 

process and threads

A thread may be a software thread or a hardware thread.  A software thread is a virtual thread associated with a task.  The task may be the same or different from the task associated with another thread.  A hardware thread is a mini-process that executes a software thread (that is, its task).  The operating system's scheduler maps software threads to hardware threads. 

Depending on the hardware available the operating system can schedule several threads for concurrent execution.  This is called multi-threading.  While the operating system completely controls the multi-tasking of its loaded processes, we can include within our programs instructions that control multi-threading of various tasks.

A context switch on a thread is similar to that on a process.  As a processor waits to complete a software thread's communication, the scheduler blocks the thread and may switch to another thread.  A context switch on a thread is much less computationally instensive than a switch on a process.  The processor time required may be in the order of hundreds of clock cycles.  In other words, a thread is the lighter unit of system scheduling, while a process is the heavier.


Terminology

A process starts executing on a single thread.  When it encounters an instruction that divides the execution path into concurrent tasks, we say that the process launches a concurrent task or spawns a child thread.  When the child thread reunites with its parent, we say that they join or the child and parent threads synchronize.  The join point in the execution path is a synchronization point.  If either task is still executing when the one thread reaches this point, that thread waits until the other has completed its task. 

Shared Memory

The complexity of multi-threaded programming arises from threads accessing variables that they share with other threads.  Users expect the results of a program to be reproducible no matter how many times they are run.  With multi-threaded programs, users expect the same results for the same set of inputs.  That is, they expect a determinate solution.  Since the order in which threads execute their instructions is indeterminate, multi-threaded programming demands attention and care whenever concurrently executing threads access a shared variable. 

Race Condition

A race condition occurs when two threads try to update the same memory location at the same time.  If race conditions are present, one run may produce different results from a subsequent run.  Different techniques of avoiding race conditions include shared states, mutexes, locks and atomics. 

Deadlock

A deadlock occurs when two or more threads are waiting for one another to complete execution and therefore are blocked forever. 

Shared State

A shared state is a state shared by two threads for the purpose of communicating a value from one thread to the other.  C++11 uses this technique in its thread libraries. 

Mutexes and Locks and Atomics

Mutex stands for mutual exclusion, which is a technique that excludes access by other threads while a particular thread updates a variable.  Mutexes are implemented using locks.  The thread that owns a lock must release it before any other thread can acquire that lock. 

The C++11 thread libraries include a library for implementing mutexes and locks.  This technique, which is used in more complex multi-threaded programming solutions is beyond the scope of these notes. 

Atomics

The C++11 libraries also include an atomics library for implementing lock-free execution of instruction.  Atomics treat an instruction as indivisible.  This technique is also used in more complex multi-threaded programming solutions and is also beyond the scope of these notes. 


Policy-Based Execution

C++17 introduced the concept of execution policies to control the method of computation.  This concept is implemented in the C++17 version of the STL algorithms.

Execution Policies

Policy Types

C++17 identifies three distinct execution policy types:

  • std::execution::sequenced_policy - execution may not be parallelized - the element access functions are indeterminately sequenced in the calling thread
  • std::execution::parallel_policy - execution may be parallelized - the element access functions may execute in either the invoking thread or in a thread implicitly created by the library to support parallel algoithm execution
  • std::execution::parallel_unsequenced_policy - execution may be parallelized, vectorized or migrated across threads- the element access functions may execute in unordered fashio in unspecified threads

Vectorization is the register-based technique that implements a single instruction multiple data scenario.  Vectorization takes advantage of any vector registers available within a CPU.  It allows the loading of one instruction that operates on the data values stored in a set (or vector) of registers.  This method of computation is beyond the scope of this course.

If the invocation of any element access function exits with an uncaught exception, std::terminate() is called.

Policy Objects

C++17 instantiates three distinct execution policy objects:

  • std::execution::seq - sequential - single-threaded execution
  • std::execution::par - parallel - multi-threaded execution
  • std::execution::par_vec - parallel plus vector - multi-threaded execution with vectorization

The programmer is responsible for avoiding deadlocks. 

STL Algorithms

C++17 overloads the following STL algorithms (along with others) to accept an execution policy as the first argument in the function call:

  • std::all_of - checks if all elements satisfy the predicate
  • std::any_of - checks if any element satisfies the predicate
  • std::copy - copies elements to another range
  • std::copy_if - copies elements that satisfy the predicate to another range
  • std::copy_n - copies n elements to another range
  • std::count - counts the elements that are equal to a value
  • std::count_if - counts the elements that satisfies the predicate
  • std::equal - returns true is the range is equal to another range
  • std::fill - assigns a value to elements in the range
  • std::fill_n - assigns a value to the first n elements in the range
  • std::find - finds the first element in the range equal to the specified value
  • std::find_first_of - finds the first element in the range that satisfies a criterion in a speficied range
  • std::find_if - finds the first element in the range that satisfies the predicate
  • std::find_if_not - finds the first element in the range that does not satisfy the predicate
  • std::generate - assigns a value to each element in the range generated by a function
  • std::generate_n - assigns a value to the first n elements in the range generated by a function
  • std::inner_product - computes the inner product of two ranges
  • std::max_element - finds the greatest element in the range
  • std::merge - merges two ranges into a single range
  • std::min_element - finds the smallest element in the range
  • std::move - moves the element of the range into another range
  • std::none_of - checks if the predicate returns true for none of the elements in the range
  • std::reduce - reduces all elements using a specified function
  • std::remove - removes all elements satisfying a criterion
  • std::remove_copy - copies elements in the range to another range omitting those elements that have the specified value
  • std::remove_copy_if - copies elements in the range to another range omitting those elements that satisfy the predicate
  • std::replace - replaces all elements in the range equal to the specified value
  • std::replace_copy - copies all elements in the range to another range replacing all elements that have the specified value
  • std::replace_copy_if - copies all elements in the range to another range replacing all elements that satisfy the predicate
  • std::replace_if - replace all elements in the range that satisfy the predicate
  • std::reverse - reverses the order of the elements in the range
  • std::reverse_copy - copies all elements in the range reversing thier order
  • std::search - searches the range for the first subsequence of elements
  • std::search_n - searches the range for the first sequence of n identical elements
  • std::sort - sorts elements in the range
  • std::transform - applies a function to a range and stores the result in another range

Example

The following program sums the elements in vector x, using multi-threading, and displays the result:

 // Algorithms - Execution Policies - C++17
 // policies.cpp

 #include <iostream>
 #include <numeric>
 #include <vector>
 #include <execution>

 int main() {
     std::vector<double> x(10000000, 0.5);

     double s = std::reduce(std::execution::par, 
      x.begin(), x.end());
     std::cout << "sum = " << s <<  std::endl;
 }













 sum = 5e+06 


Exercises




Previous Reading  Previous: Smart Pointers Next: Thread Classes   Next Reading


  Designed by Chris Szalwinski   Copying From This Site   
Logo
Creative Commons License