Part B - CUDA Programming Model
Thrust
Introduce the Thrust Template Library
Describe the use of iterators in Thrust
Describe the algorithms available in Thrust
Introduction
| Iterators
| Algorithms
| Exercises
The Standard Template Library (STL) of C++ consists of container classes,
algorithms, and iterators and implements many of the basic algorithms of
computer science. These algorithms use function objects to define
operations on elements in containers in general.
The CUDA Toolkit includes a library that provides templated parallel
processing facilities similar to the serial facilities provided by the STL.
This library is called Thrust.
Thrust
is a high-level interface that implements the common containers, iterators
and algorithms on the GPU.
This chapter describes basic usage of the Thrust library,
the role of iterators, the definition of function objects (or functors) and some
of the available algorithms.
Introduction
Thrust installs with the CUDA Toolkit and is interoperable
with CUDA, TBB and OpenMP. TBB stands for Intel's Threading
Building Blocks library. Thrust
documentation is available at
docs dot
nvidia dot com/cuda/thrust. The most recent version and the documentation
are available at thrust dot github dot com
Preliminaries
The file extension on Thrust source files is
.cu. The header files are in a sub-directory
named thrust off the system directory.
Thrust identifies all of its types, variables, and functions in the
thrust namespace. This
avoids name collisions with identifiers declared in other namespaces,
such as the std namespace.
Vectors
Thrust defines two vector templates:
- host_vector
- device_vector
These templates hide the logic for allocating memory, copying data, and
deallocating memory on the host and the device. Although these
templates simplify the coding considerably, they also hide the CUDA
calls completely. Consequently, they may not always produce the
most efficient solution. Keep this in mind and use these templates
judiciously.
Examples
Verification
The following program reports the version of the Thrust
library that is installed.
// Thrust Version
// thrust.cu
#include <thrust/version.h>
#include <iostream>
int main(int argc, char* argv[]) {
int major = THRUST_MAJOR_VERSION;
int minor = THRUST_MINOR_VERSION;
std::cout << "Thrust v" << major << "."
<< minor << std::endl;
}
|
>a
Thrust v1.8
|
Sorting
The following program creates a vector of 32M random ints
on the host, copies the vector to a device vector, sorts the
elements on the device, copies the sorted vector back to the host vector
and displays the largest value.
// Thrust Sort
// thrust_sort.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
int main(int argc, char* argv[]) {
if (argc != 2) {
std::cerr << "***Incorrect no of arguments***\n";
return 1;
}
// generate random numbers serially
thrust::host_vector<int> h_vec(std::atoi(argv[1]));
std::generate(h_vec.begin(), h_vec.end(), std::rand);
// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;
// sort data on the device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(),
h_vec.begin());
// output largest value
std::cout << "Largest is\n"
<< h_vec[h_vec.size()-1] << std::endl;
}
|
>a 4194240
Largest is
32767
|
Iterators
Thrust accesses objects within
containers using iterators. Iterators are pointer-like objects that enable movement
through a container structure a single element at a time. Iterators
improve the functionality of indices by allowing insertion of objects
into the container before the current object without invalidating the
iterator. (With indices, the value of the current index
changes with each insertion before that index). Iterators hide
the data structure's internal organization and make the structure appear
as a simple sequence. For more information and detailed examples of C++ STL iterators
see the Containers and Iterators chapter of the
BTP305
course notes.
Dispatching
Iterators, unlike pointers, carry type information. Thrust
inspects the iterator to determine whether to call the host implementation of a
function or the device implementation. This type checking occurs
at compile time. We call this static dispatching.
If we pass a raw pointer to a Thrust function, it
dispatches the host version of the function. To dispatch the device
version, we need to convert the raw pointer to a device pointer:
// Raw Pointers and Iterators - Dispatching
// dispatching.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
int main(int argc, char* argv[]) {
if (argc != 2) {
std::cerr << "***Incorrect no of arguments***\n";
return 1;
}
int n = std::atoi(argv[1]);
// declare a raw pointer
int* raw_ptr;
// allocate memory on the device
cudaMalloc((void**)&raw_ptr, n * sizeof(int));
// create a device pointer from the raw pointer
thrust::device_ptr<int> d_ptr(raw_ptr);
// use an iterator on the device to fill device memory with 0
thrust::fill(d_ptr, d_ptr + n, (int)0);
// extract the raw pointer from the device pointer
raw_ptr = thrust::raw_pointer_cast(d_ptr);
// free the device memory
cudaFree(raw_ptr);
}
|
Algorithms
Thrust supports parallel versions for many
STL algorithms using the same identifiers as the STL. For example,
thrust::sort() and
thrust::copy() are parallel versions of
std::sort() and std::copy().
Thrust algorithms, like Thrust
containers, have host and device versions.
The C/C++ compiler implements the version that matches the
signature of the function call.
Only the thrust::copy()
algorithm can take both host and device iterators as its arguments. All
other algorithms take as arguments either host or device iterators but not combinations of
the two.
Transformations
Transformation algorithms are algorithms that apply an operation to each element in a range
or a set of ranges and store the results in a target range.
The Thrust transformation algorithms include
- thrust::fill() - sets all elements in a range to a specified value
- thrust::sequence() - fills the range with a sequence of numbers
- thrust::replace() - replaces each element in a range that is equal to a specified
value with a new value
- thrust::transform() - applies a unary function to each element in a
range and
stores the result in the corresponding position in an output sequence
Functors
We define the operation that a transformation algorithm performs on the range (or set
of ranges) using a functor. A functor is an object that acts like
a function. Its class definition overloads the call operator (operator())
to perform a specified operation using the object's instance variable(s), where appropriate.
A functor
may be
- a generator (f())
- a unary functor (f(x)) or
- a binary functor (f(x, y)).
Examples include
- thrust::negate() - unary
- thrust::modulus() - binary
- thrust::plus() - binary
- thrust::multiplies() - binary
The thrust/functional.h header file contains their definitions.
Unlike functions, functors contain state. The presence of state allows us
to customize the operation performed on the elements of a container class
depending on the instance value(s).
Custom Functors
We can define custom functors (in addition to those defined in the Thrust
library) for use with thrust::transform(). For example,
to increment the elements in a host array by a constant value (4), we store the
constant in the instance variable shown below and through thrust::transform()
access that variable with each call to the functor.
To increment the elements in the updated array by a constant value (5), we store
the constant in the instance variable and once again through thrust::transform()
access that variable with each call to the functor.
// Functors
// functor.cpp
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <iostream>
class Add {
const int inc;
public:
Add(int c) : inc(c) { }
int operator()(int x) const { return x + inc; }
};
int main() {
int array[] = {1, 2, 3, 4, 5};
thrust::transform(array, array + 5, array, Add(4));
for (int i = 0; i < 5; i++)
std::cout << array[i] << ' ';
thrust::transform(array, array + 5, array, Add(5));
for (int i = 0; i < 5; i++)
std::cout << array[i] << ' ';
}
|
>a
5 6 7 8 9
10 11 12 13 14
|
saxpy
Consider the saxpy operation (y <- a * x + y)
defined in Level 1 of the BLAS and cuBLAS libraries. We can implement this
operation through a single Thrust transformation using a functor:
// saxpy
// saxpy.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <iomanip>
class saxpy {
const int a;
public:
saxpy(int aa) : a(aa) {}
__host__ __device__ int operator()(
const int& x, const int& y) const {
return a * x + y;
}
};
int main(int argc, char* argv[]) {
if (argc != 3) {
std::cerr <<
"***Incorrect number of arguments***\n";
return 1;
}
int n = std::atoi(argv[1]);
int m = std::atoi(argv[2]);
// generate random numbers serially
thrust::host_vector<int> a(n);
thrust::host_vector<int> b(n);
thrust::host_vector<int> c(n);
std::generate(a.begin(), a.end(), std::rand);
std::generate(b.begin(), b.end(), std::rand);
// transfer data to the device
thrust::device_vector<int> d_a = a;
thrust::device_vector<int> d_b = b;
// transform data on the device
thrust::transform(d_a.begin(), d_a.end(),
d_b.begin(), d_b.begin(), saxpy(m));
// transfer data back to host
thrust::copy(d_b.begin(), d_b.end(),
c.begin());
// output largest value
for (int i = 0; i < n; i++ )
std::cout << std::setw(6) << c[i] << " = "
<< std::setw(2) << m
<< "*" << std::setw(5) << a[i]
<< "+" << std::setw(5) << b[i]
<< std::endl;
}
|
>a 20 5
32596 = 5* 41+32391
106939 = 5*18467+14604
35572 = 5* 6334+ 3902
132653 = 5*26500+ 153
96137 = 5*19169+ 292
91002 = 5*15724+12382
74811 = 5*11478+17421
165506 = 5*29358+18716
154528 = 5*26962+19718
142215 = 5*24464+19895
33972 = 5* 5705+ 5447
162451 = 5*28145+21726
131176 = 5*23281+14771
95673 = 5*16827+11538
51674 = 5* 9961+ 1869
22367 = 5* 491+19912
40642 = 5* 2995+25667
86009 = 5*11942+26299
41170 = 5* 4827+17035
37074 = 5* 5436+ 9894
|
The keywords __host__ and
__device__ specify that the call
operation should be defined separately on the host and on the device.
Reductions
Thrust also provides templates for a variety of reduction algorithms:
- int sum = thrust::reduce(vec.begin(), vec.end()) - returns the sum of the elements of vec
- int cnt = thrust::count_if(vec.begin(), vec.end(), predicate) - counts the number of elements in vec for which predicate is true
- iterator itr = thrust::max_element(vec.begin(), vec.end()) - returns the iterator of the maximal element in vec
- iterator itr = thrust::min_element(vec.begin(), vec.end()) - returns the iterator of the minimal element in vec
- type inp = thrust::inner_product(vec1.begin(), vec1.end(), vec2.begin(), type) - returns the inner product of vec1 and vec2
Exercises
|