Part B - CUDA Programming Model

Thrust

Introduce the Thrust Template Library
Describe the use of iterators in Thrust
Describe the algorithms available in Thrust

Introduction | Iterators | Algorithms | Exercises


The Standard Template Library (STL) of C++ consists of container classes, algorithms, and iterators and implements many of the basic algorithms of computer science.  These algorithms use function objects to define operations on elements in containers in general.  The CUDA Toolkit includes a library that provides templated parallel processing facilities similar to the serial facilities provided by the STL.  This library is called ThrustThrust is a high-level interface that implements the common containers, iterators and algorithms on the GPU. 

This chapter describes basic usage of the Thrust library, the role of iterators, the definition of function objects (or functors) and some of the available algorithms. 


Introduction

Thrust installs with the CUDA Toolkit and is interoperable with CUDA, TBB and OpenMP.  TBB stands for Intel's Threading Building Blocks library.  Thrust documentation is available at docs dot nvidia dot com/cuda/thrust.  The most recent version and the documentation are available at thrust dot github dot com

Preliminaries

The file extension on Thrust source files is .cu.  The header files are in a sub-directory named thrust off the system directory. 

Thrust identifies all of its types, variables, and functions in the thrust namespace.  This avoids name collisions with identifiers declared in other namespaces, such as the std namespace.

Vectors

Thrust defines two vector templates:

  • host_vector
  • device_vector

These templates hide the logic for allocating memory, copying data, and deallocating memory on the host and the device.  Although these templates simplify the coding considerably, they also hide the CUDA calls completely.  Consequently, they may not always produce the most efficient solution.  Keep this in mind and use these templates judiciously. 

Examples

Verification

The following program reports the version of the Thrust library that is installed.

 // Thrust Version
 // thrust.cu

 #include <thrust/version.h>
 #include <iostream>

 int main(int argc, char* argv[]) {
     int major = THRUST_MAJOR_VERSION;
     int minor = THRUST_MINOR_VERSION;

     std::cout << "Thrust v" << major << "." 
      << minor  << std::endl;
 }

 >a
 Thrust v1.8 










Sorting

The following program creates a vector of 32M random ints on the host, copies the vector to a device vector, sorts the elements on the device, copies the sorted vector back to the host vector and displays the largest value. 

 // Thrust Sort
 // thrust_sort.cu

 #include <thrust/host_vector.h>
 #include <thrust/device_vector.h>
 #include <thrust/generate.h>
 #include <thrust/sort.h>
 #include <thrust/copy.h>
 #include <algorithm>
 #include <cstdlib>
 #include <iostream>

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << "***Incorrect no of arguments***\n"; 
         return 1;
     }

     // generate random numbers serially
     thrust::host_vector<int> h_vec(std::atoi(argv[1])); 
     std::generate(h_vec.begin(), h_vec.end(), std::rand);

     // transfer data to the device
     thrust::device_vector<int> d_vec = h_vec;

     // sort data on the device
     thrust::sort(d_vec.begin(), d_vec.end());

     // transfer data back to host
     thrust::copy(d_vec.begin(), d_vec.end(),
      h_vec.begin()); 

     // output largest value
     std::cout << "Largest is\n"
      << h_vec[h_vec.size()-1] << std::endl;
 }



 >a 4194240
 Largest is
 32767 































Iterators

Thrust accesses objects within containers using iterators.  Iterators are pointer-like objects that enable movement through a container structure a single element at a time.  Iterators improve the functionality of indices by allowing insertion of objects into the container before the current object without invalidating the iterator.  (With indices, the value of the current index changes with each insertion before that index).  Iterators hide the data structure's internal organization and make the structure appear as a simple sequence.  For more information and detailed examples of C++ STL iterators see the Containers and Iterators chapter of the BTP305 course notes.

Dispatching

Iterators, unlike pointers, carry type information.  Thrust inspects the iterator to determine whether to call the host implementation of a function or the device implementation.  This type checking occurs at compile time.  We call this static dispatching.

If we pass a raw pointer to a Thrust function, it dispatches the host version of the function.  To dispatch the device version, we need to convert the raw pointer to a device pointer:

 // Raw Pointers and Iterators - Dispatching
 // dispatching.cu

 #include <thrust/host_vector.h>
 #include <thrust/device_vector.h>
 #include <thrust/generate.h>
 #include <thrust/sort.h>
 #include <thrust/copy.h>
 #include <algorithm>
 #include <cstdlib>
 #include <iostream>

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << "***Incorrect no of arguments***\n"; 
         return 1;
     }

     int n = std::atoi(argv[1]);

     // declare a raw pointer
     int* raw_ptr;
     // allocate memory on the device
     cudaMalloc((void**)&raw_ptr, n * sizeof(int));

     // create a device pointer from the raw pointer
     thrust::device_ptr<int> d_ptr(raw_ptr);

     // use an iterator on the device to fill device memory with 0
     thrust::fill(d_ptr, d_ptr + n, (int)0);

     // extract the raw pointer from the device pointer
     raw_ptr = thrust::raw_pointer_cast(d_ptr);

     // free the device memory
     cudaFree(raw_ptr);
 }

Algorithms

Thrust supports parallel versions for many STL algorithms using the same identifiers as the STL.  For example, thrust::sort() and thrust::copy() are parallel versions of std::sort() and std::copy().

Thrust algorithms, like Thrust containers, have host and device versions.  The C/C++ compiler implements the version that matches the signature of the function call.  Only the thrust::copy() algorithm can take both host and device iterators as its arguments.  All other algorithms take as arguments either host or device iterators but not combinations of the two. 

Transformations

Transformation algorithms are algorithms that apply an operation to each element in a range or a set of ranges and store the results in a target range.  The Thrust transformation algorithms include

  • thrust::fill() - sets all elements in a range to a specified value
  • thrust::sequence() - fills the range with a sequence of numbers
  • thrust::replace() - replaces each element in a range that is equal to a specified value with a new value
  • thrust::transform() - applies a unary function to each element in a range and stores the result in the corresponding position in an output sequence

Functors

We define the operation that a transformation algorithm performs on the range (or set of ranges) using a functor.  A functor is an object that acts like a function.  Its class definition overloads the call operator (operator()) to perform a specified operation using the object's instance variable(s), where appropriate.  A functor may be

  • a generator (f())
  • a unary functor (f(x)) or
  • a binary functor (f(x, y)). 

Examples include

  • thrust::negate() - unary
  • thrust::modulus() - binary
  • thrust::plus() - binary
  • thrust::multiplies() - binary

The thrust/functional.h header file contains their definitions.  Unlike functions, functors contain state.  The presence of state allows us to customize the operation performed on the elements of a container class depending on the instance value(s). 

Custom Functors

We can define custom functors (in addition to those defined in the Thrust library) for use with thrust::transform().  For example, to increment the elements in a host array by a constant value (4), we store the constant in the instance variable shown below and through thrust::transform() access that variable with each call to the functor.  To increment the elements in the updated array by a constant value (5), we store the constant in the instance variable and once again through thrust::transform() access that variable with each call to the functor.

 // Functors
 // functor.cpp

 #include <thrust/host_vector.h>
 #include <thrust/functional.h>
 #include <iostream>

 class Add {
     const int inc;
 public:
     Add(int c) : inc(c) { }
     int operator()(int x) const { return x + inc; }
 };

 int main() {
     int array[] = {1, 2, 3, 4, 5};
     thrust::transform(array, array + 5, array, Add(4)); 
     for (int i = 0; i < 5; i++)
         std::cout << array[i] << ' ';
     thrust::transform(array, array + 5, array, Add(5)); 
     for (int i = 0; i < 5; i++)
         std::cout << array[i] << ' ';
 }

 >a
 5 6 7 8 9
 10 11 12 13 14


















 

saxpy

Consider the saxpy operation (y <- a * x + y) defined in Level 1 of the BLAS and cuBLAS libraries.  We can implement this operation through a single Thrust transformation using a functor: 

 // saxpy
 // saxpy.cu

 #include <thrust/host_vector.h>
 #include <thrust/device_vector.h>
 #include <thrust/generate.h>
 #include <thrust/functional.h>
 #include <thrust/copy.h>
 #include <cstdlib>
 #include <algorithm>
 #include <iostream>
 #include <iomanip>

 class saxpy {
     const int a;
   public:
     saxpy(int aa) : a(aa) {}
     __host__ __device__ int operator()(
      const int& x, const int& y) const {
         return a * x + y;
     }
 };

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr <<
          "***Incorrect number of arguments***\n";
         return 1;
     }

     int n = std::atoi(argv[1]);
     int m = std::atoi(argv[2]);

     // generate random numbers serially
     thrust::host_vector<int> a(n);
     thrust::host_vector<int> b(n);
     thrust::host_vector<int> c(n);
     std::generate(a.begin(), a.end(), std::rand);
     std::generate(b.begin(), b.end(), std::rand); 

     // transfer data to the device
     thrust::device_vector<int> d_a = a;
     thrust::device_vector<int> d_b = b;

     // transform data on the device
     thrust::transform(d_a.begin(), d_a.end(),
      d_b.begin(), d_b.begin(), saxpy(m));

     // transfer data back to host
     thrust::copy(d_b.begin(), d_b.end(),
      c.begin()); 

     // output largest value
     for (int i = 0; i < n; i++ )
         std::cout << std::setw(6) << c[i] << " = " 
          << std::setw(2) << m
          << "*" << std::setw(5) << a[i]
          << "+" << std::setw(5) << b[i]
          << std::endl;
 }



 >a 20 5
  32596 = 5*   41+32391
 106939 = 5*18467+14604
  35572 = 5* 6334+ 3902
 132653 = 5*26500+  153
  96137 = 5*19169+  292
  91002 = 5*15724+12382
  74811 = 5*11478+17421
 165506 = 5*29358+18716
 154528 = 5*26962+19718
 142215 = 5*24464+19895
  33972 = 5* 5705+ 5447
 162451 = 5*28145+21726
 131176 = 5*23281+14771
  95673 = 5*16827+11538
  51674 = 5* 9961+ 1869
  22367 = 5*  491+19912
  40642 = 5* 2995+25667
  86009 = 5*11942+26299
  41170 = 5* 4827+17035
  37074 = 5* 5436+ 9894 



































 

The keywords __host__ and __device__ specify that the call operation should be defined separately on the host and on the device. 

Reductions

Thrust also provides templates for a variety of reduction algorithms:

  • int sum = thrust::reduce(vec.begin(), vec.end()) - returns the sum of the elements of vec
  • int cnt = thrust::count_if(vec.begin(), vec.end(), predicate) - counts the number of elements in vec for which predicate is true
  • iterator itr = thrust::max_element(vec.begin(), vec.end()) - returns the iterator of the maximal element in vec
  • iterator itr = thrust::min_element(vec.begin(), vec.end()) - returns the iterator of the minimal element in vec
  • type inp = thrust::inner_product(vec1.begin(), vec1.end(), vec2.begin(), type) - returns the inner product of vec1 and vec2

Exercises




Previous Reading  Previous: CUDA Libraries Next: Kernels and Grids   Next Reading


  Designed by Chris Szalwinski   Copying From This Site   
Logo
Creative Commons License