Part B - CUDA Programming Model

Memory Model

  • Outline the CUDA Memory Model
  • Describe the Parallel Reduction Algorithm
  • Describe the role of shared memory within each multi-processor



The CUDA Programming Model offers a hierarchy of four distinct memories for keeping data as close as possible to the streaming processors.  Each memory has a different access time and a different scope.  Global memory is at the base of this hierarchy and includes a read-only part.  CUDA API calls allocate, populate, retrieve and deallocate this memory from the host code. 

This chapter describes each memory in detail.  It also describes the parallel reduction algorithm and implements it using the different memories.  It concludes with a simple implementation of a ray-tracing algorithm that demonstrates the use of the read-only part of global memory.


Memory Hierarchy

The memories on the device in order of access speed from slowest to fastest are:

  • global - read/write - public to all threads of all blocks
  • constant - read only - public to all threads of all blocks
  • shared - read/write - private to each block - public to all threads within a block
  • register - read/write - private to each thread

Host code can only access global and constant memories directly through CUDA API calls.  Shared and register memories are only accessible from the kernel code. 

[Figure: the CUDA memory model]

Kernel Efficiency

One measure of kernel efficiency is the compute to global memory access (CGMA) ratio.  This ratio compares the number of computations to the number of global memory accesses within a critical part of the kernel. 

Consider a GTX750M.  Since its memory bandwidth is 80 GB/s and each float occupies 4 bytes, we expect to load no more than 20 (= 80/4) gigafloats per second.  Its peak performance is 723 GFLOPS.  If the critical part of the kernel performs one floating-point operation for each float that it loads, its CGMA ratio is 1 and its throughput cannot exceed 20 GFLOPS, a small fraction of the peak.  To approach peak performance, we need to increase this ratio significantly. 

We increase the CGMA ratio for the critical part of the kernel by accessing global memory less often and shared and register memory more often during the computation.  For example, on this card a CGMA ratio of 10 would raise the attainable throughput to roughly 200 GFLOPS.

Global Memory

Global memory is the device memory that is accessible by any thread executing on any streaming processor in the device.  The host code allocates this memory on the device and accesses its contents through cudaMemcpy() calls.

Global memory is the slowest memory on the device. 
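
For example, a host-side sketch of the full cycle (allocate, populate, retrieve, deallocate) might look like the following; the buffer names and size are illustrative only, not part of the examples later in this chapter:

 // host-side management of global memory (sketch)
 int n = 1024;
 float* h_a = new float[n];                                        // host buffer
 for (int i = 0; i < n; i++) h_a[i] = 1.0f;                        // initialize host data
 float* d_a = nullptr;                                             // device buffer in global memory
 cudaMalloc((void**)&d_a, n * sizeof(float));                      // allocate
 cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);  // populate
 // ... launch a kernel that reads and writes d_a ...
 cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);  // retrieve
 cudaFree(d_a);                                                    // deallocate
 delete[] h_a;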

Constant Memory

Part of global device memory is available for read-only access.  Constant memory is faster than global memory. 

Register Memory

Register memory is private to its thread.  We allocate this memory within the kernel using scalar variables. 

Compare the two kernels listed below.  Each kernel calculates a single element of matrix c as the dot product of a row of matrix a with a column of matrix b.  The first kernel accumulates this dot product in global memory.  The second kernel accumulates this dot product in register memory:

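 // version 1: accumulates the dot product directly in global memory (c)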
 __global__ void matMul(const float* a, 
  const float* b, float* c, int n) {
     int i = blockIdx.x * blockDim.x +
             threadIdx.x;
     int j = blockIdx.y * blockDim.y +
             threadIdx.y;
     if (i < n && j < n) {
         c[i * n + j] = 0.0f;
         for (int k = 0; k < n; k++)
             c[i * n + j] +=
              a[i * n + k] *
              b[k * n + j];
     }
 }
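
 // version 2: accumulates the dot product in register memory (sum)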
 __global__ void matMul(const float* a, 
  const float* b, float* c, int n) {
     int i = blockIdx.x * blockDim.x +
             threadIdx.x;
     int j = blockIdx.y * blockDim.y +
             threadIdx.y;
     if (i < n && j < n) {
         float sum = 0.0f;
         for (int k = 0; k < n; k++)
             sum += a[i * n + k] *
             b[k * n + j];
         c[i * n + j] = sum;
     }
 }

Register memory is the fastest memory on the device. 

Local Variables

CUDA allocates memory for local variables in either of two memories:

  • scalar variables - data stored in a register and private to its thread
  • array variables - data stored in global memory and private to its thread

CUDA stores arrays that are declared locally within a kernel in global memory.  There is no speed advantage to declaring an array locally rather than allocating it from the host; both reside in global memory.
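
A minimal kernel sketch contrasting the two kinds of local variables (the function and variable names here are illustrative only):

 __global__ void localVariables(float* a, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x; // scalar - stored in a register
     float history[4];                              // local array - stored in global memory
     if (i < n) {
         for (int k = 0; k < 4; k++)
             history[k] = a[i] * k;                 // each thread uses its own private copy
         a[i] = history[3];
     }
 }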

Shared Memory

Each multi-processor on the device contains its own on-chip memory.  The threads within a block all execute on the same multi-processor and can share this memory with one another. 

Shared memory is faster than global memory but slower than register memory. 


Reduction

A reduction algorithm is an algorithm that extracts a single value from an array of values or a set of arrays of values.  This single reduced value may be the sum of an array's elements, their maximum value, their minimum value, their average value, or some other scalar result. 

A serial reduction algorithm moves through the elements of an array or set of arrays one element at a time.  Its Big-O classification is O(n).  If the reduction operation is associative, this algorithm can be parallelized.  A parallel reduction algorithm processes pairs of elements concurrently and, with enough processors, completes in O(log n) stages. 

Example

The conventional process of eliminating contestants in a tournament is a parallel reduction.  Contestants are paired for the first round; one contestant is eliminated from each pair.  Successful contestants are paired for the second round; one contestant is eliminated from each pair.  Round-by-round elimination continues until only one successful contestant is left as illustrated in the Figure below. 

[Figure: round-by-round elimination in a tournament]

Implementations

We can implement a parallel reduction algorithm in different ways with different degrees of efficiency. 

Let us store the result of each pairwise operation in the array element with the lower index.  In this case, the final result ends up in the first element of the array.

[Figure: parallel reduction storing each result in the lower-indexed element]

Stages and Strides

Each round in a reduction is called a stage.  The difference between the indices of the elements in a pair being evaluated at any stage is called the stride.  As we move from one stage to the next, the stride doubles; that is, it grows as a power of 2.  In any particular stage, we only process those elements whose indices are exact multiples of twice the stride for that stage. 

[Figure: stages and strides in a parallel reduction]
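
For illustration, a serial (host-side) sketch of this indexing pattern, assuming an array a of n elements, looks like:

 // serial sketch of the stage/stride pattern (illustration only)
 for (int stride = 1; stride < n; stride *= 2)        // one pass per stage
     for (int i = 0; i + stride < n; i += 2 * stride) // indices that are multiples of 2 * stride
         a[i] += a[i + stride];                       // store each pair's result in the lower index
 // the reduced value is now in a[0]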

Examples

Consider the reduction of the elements of a one-dimensional array to their arithmetic sum.  In the following set of examples, the host code retrieves the number of elements from the command line and populates the host array with random values.

Since the array is one-dimensional, we design our configuration to consist of a single dimension of threads.  We only use the .x members of the built-in variables: threadIdx.x, blockIdx.x, blockDim.x and gridDim.x.  The other members default to unit values.

A Small Vector

Let us assume that the user is working with a small vector, one with fewer elements than the maximum number of threads in a block.  The execution configuration requires only one block, and the number of threads in that block is the number of elements in the vector.  We obtain the index of each element in the kernel code with the following statement:

     int i = threadIdx.x;

After each stage, we need to ensure that all of the threads in the block have completed their part of the reduction.  For this purpose, we synchronize the threads in the block by calling the intrinsic function:

     __syncthreads();

Once the threads have synchronized, the kernel is ready to perform the reduction for the next stage. 

 // Sum a Small-Sized Vector
 // small_reduction.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 __global__ void reduce(float* a, int n) {
     int i = threadIdx.x;

     // every thread iterates the same number of times, so all of them reach __syncthreads()
     for (int stride = 1; stride < n; stride <<= 1) {
         if (i % (2 * stride) == 0 && i + stride < n)
             a[i] += a[i + stride];
         __syncthreads();
     }
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
             << ": invalid no of arguments\n"
             << "Usage: " << argv[0]
             << "  no_of_elements\n";
         return 1;
     }
     int   n = atoi(argv[1]);

     // allocate host memory
     float* h_a = new float[n];
     // initialize host memory
     for (int i = 0; i < n; i++)
         h_a[i] = float(std::rand()) / RAND_MAX;
     float h_sum = 0.0f;
     for (int i = 0; i < n; i++)
         h_sum += h_a[i];

     // allocate device memory
     float* d_a; // full device vector
     cudaMalloc((void**)&d_a, n * sizeof(float));
     cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

     // reduce to partial sums
     reduce<<<1, n>>>(d_a, n);

     // copy from device to host memory
     float d_sum;
     cudaMemcpy(&d_sum, d_a, sizeof(float), cudaMemcpyDeviceToHost);

     // report sums
     std::cout << std::fixed << std::setprecision(1);
     std::cout << "Host sum   = " << h_sum << std::endl;
     std::cout << "Device sum = " << d_sum << std::endl;

     // deallocate memory
     delete[] h_a;
     cudaFree(d_a);
 }

The results for an array of 500 elements are:

 >a 500
 Host sum   = 248.7
 Device sum = 248.7

Note that this code does not check whether the number of elements (n) exceeds the maximum number of threads in a block (1024 on current devices).  If it does, the kernel launch fails and the reported device sum is incorrect.
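
One possible guard (a sketch only, assuming device 0 is the device in use) queries the device properties before launching the kernel:

 // reject vectors that exceed the block size limit (sketch)
 cudaDeviceProp prop;
 cudaGetDeviceProperties(&prop, 0);   // properties of device 0
 if (n > prop.maxThreadsPerBlock) {
     std::cerr << "n exceeds the maximum of " << prop.maxThreadsPerBlock
               << " threads per block\n";
     return 2;
 }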

A Medium-Sized Vector

Let us assume that the user is working with a medium-sized vector, one with more elements than the maximum number of threads in a block but fewer elements than the maximum number of threads in the grid.  The execution configuration requires more than one block and no more than the maximum number of blocks in the grid.  To determine the element's index, we combine the block index with the thread index:

     int i = blockIdx.x * blockDim.x + threadIdx.x; 

In this example, we allocate enough memory on the host and device to hold values that completely fill the number of blocks required, noting that the last block in the grid may be partly empty.  To account for a partly empty block we add host code that fills the empty elements with 0.  The kernel code below performs the reduction for each block separately and copies the result to an intermediate array (b).  The host code copies that array's elements to a host array and sums them on the host to obtain the final result:

 // Sum a Medium-Sized Vector
 // medium_reduction.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 const int ntpb = 512;

 __global__ void reduce(float* a, float* b, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;

     for (int stride = 1; stride < ntpb; stride <<= 1) {
         if (i % (2 * stride) == 0)
             a[i] += a[i + stride];
         __syncthreads();
     }
     if (threadIdx.x == 0)
         b[blockIdx.x] = a[i];
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
             << ": invalid no of arguments\n"
             << "Usage: " << argv[0]
             << "  no_of_elements\n";
         return 1;
     }
     int   n = atoi(argv[1]);
     // determine the number of blocks required
     int nblks = (n + ntpb - 1) / ntpb;

     // allocate host memory
     float* h_a = new float[nblks * ntpb];
     float* h_b = new float[nblks];
     // initialize host memory
     for (int i = 0; i < n; i++)
         h_a[i] = float(std::rand()) / RAND_MAX;
     for (int i = n; i < nblks * ntpb; i++)
         h_a[i] = 0.0f;
     float h_sum = 0.0f;
     for (int i = 0; i < n; i++)
         h_sum += h_a[i];

     // allocate device memory
     float* d_a; // full device vector
     float* d_b; // device sum per block
     cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
     cudaMalloc((void**)&d_b, nblks * sizeof(float));
     cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice); 

     // reduce to partial sums
     reduce<<<nblks, ntpb>>>(d_a, d_b, n);

     // copy from device to host memory
     cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
     float d_sum = 0.0f;
     for (int i = 0; i < nblks; i++)
         d_sum += h_b[i];

     // report sums
     std::cout << std::fixed << std::setprecision(1);
     std::cout << "Host sum   = " << h_sum << std::endl;
     std::cout << "Device sum = " << d_sum << std::endl;

     // deallocate memory
     delete[] h_a;
     delete[] h_b;
     cudaFree(d_a);
     cudaFree(d_b);
 }

The results of executing the compiled code should look something like:

 >a 2000
 Host sum   = 993.0
 Device sum = 993.0

Shared Memory

Shared memory is device memory that is shared by all of the threads within a single block.  Different threads within the same block can access any value stored in this memory, while threads in other blocks cannot access this memory.  Shared memory is a part of a specific multi-processor, unlike global memory, which is accessible by all of the device's multi-processors.  We call the threads within a block cooperating threads.  All cooperating threads see the same version of any shared variable. 

Shared memory is notably faster than global memory. 

To improve a kernel's CGMA ratio, we copy data from global memory into shared memory, perform the computations on the copy in shared memory rather than on the data in global memory, and copy the result(s) back to global memory only once the computation is complete. 

The keyword for allocating shared memory is __shared__.  For example,

 __global__ void foo(float* a) {
     __shared__ float s[512];   // allocate shared memory for the block
     int tid = threadIdx.x;     // index within the block
     s[tid] = a[tid];           // copy from global to shared memory
     __syncthreads();           // wait until every thread has loaded its element
     ...                        // perform computations on s[tid]
     a[tid] = s[tid];           // copy the result back to global memory
 }

Example

Let us improve the CGMA ratio of the medium_reduction example by introducing shared memory into the implementation.

Reduction with Shared Memory

In the kernel code listed below, each thread in a block copies its element into shared memory and accesses that value (in its updated form) from shared memory during each stage.  Just before leaving the kernel, thread 0 copies the block's reduced value from shared memory to the element of global memory assigned to hold that block's result. 

[Figure: parallel reduction using shared memory]

The following program implements this caching algorithm in shared memory:

 // Sum a Medium-Sized Vector
 // shared_reduction.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 const int ntpb = 512;

 __global__ void reduce(float* a, float* b, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     __shared__ float s[ntpb];
     s[t] = a[i];
     __syncthreads();
     

     for (int stride = 1; stride < blockDim.x; stride <<= 1) {
         if (t % (2 * stride) == 0)
             s[t] += s[t + stride];
         __syncthreads();
     }
     if (t == 0)
         b[blockIdx.x] = s[0];
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
             << ": invalid no of arguments\n"
             << "Usage: " << argv[0]
             << "  no_of_elements\n";
         return 1;
     }
     int   n = atoi(argv[1]);
     // determine the number of blocks required
     int nblks = (n + ntpb - 1) / ntpb;

     // allocate host memory
     float* h_a = new float[nblks * ntpb];
     float* h_b = new float[nblks];
     // initialize host memory
     for (int i = 0; i < n; i++)
         h_a[i] = float(std::rand()) / RAND_MAX;
     for (int i = n; i < nblks * ntpb; i++)
         h_a[i] = 0.0f;
     float h_sum = 0.0f;
     for (int i = 0; i < n; i++)
         h_sum += h_a[i];

     // allocate device memory
     float* d_a; // full device vector
     float* d_b; // device sum per block
     cudaMalloc((void**)&d_a, nblks * ntpb * sizeof(float));
     cudaMalloc((void**)&d_b, nblks * sizeof(float));
     cudaMemcpy(d_a, h_a, nblks * ntpb * sizeof(float), cudaMemcpyHostToDevice);

     // reduce to partial sums
     reduce<<<nblks, ntpb>>>(d_a, d_b, n);

     // copy from device to host memory
     cudaMemcpy(h_b, d_b, nblks * sizeof(float), cudaMemcpyDeviceToHost);
     float d_sum = 0.0f;
     for (int i = 0; i < nblks; i++)
         d_sum += h_b[i];

     // report sums
     std::cout << std::fixed << std::setprecision(1);
     std::cout << "Host sum   = " << h_sum << std::endl;
     std::cout << "Device sum = " << d_sum << std::endl;

     // deallocate memory
     delete[] h_a;
     delete[] h_b;
     cudaFree(d_a);
     cudaFree(d_b);
 } 

The results of executing the compiled code should look something like:

 >a 2000
 Host sum   = 993.0
 Device sum = 993.0

Constant Memory

Constant memory is a dedicated part of global memory that provides read-only access.  Constant memory is cached and its access times are significantly faster than global memory. 

We allocate constant memory statically, outside any function, and copy host data to it from the host code.  The keyword that identifies constant memory is __constant__.

__constant__ float a[N];

The CUDA API function that copies data into constant memory has the following prototype

 cudaError_t
 cudaMemcpyToSymbol(const void* symbol, const void* source, size_t n_bytes,
  size_t offset = 0, cudaMemcpyKind kind = cudaMemcpyHostToDevice);

The first parameter (symbol) receives the address of the allocated device memory, the second parameter (source) receives the address of the host memory that contains the data to be copied to the device, the third parameter (n_bytes) receives the number of bytes to be copied, the fourth parameter (offset) receives the offset to be used in the copying, and the fifth parameter receives the direction of copying.  The fourth and fifth parameters are optional. 
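
For example, a minimal host-side sketch (the array name coeff and its size are illustrative, not part of the ray-tracing example below):

 __constant__ float coeff[256];   // statically allocated constant memory

 void loadCoefficients(const float* h_coeff) {
     // copy 256 floats from host memory into the constant array
     cudaMemcpyToSymbol(coeff, h_coeff, 256 * sizeof(float));
 }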

Ray Tracing Example

The following program draws the parts of the spheres in a set that are visible to a viewpoint located on the z axis.  The spheres are dispersed randomly throughout a three-dimensional space and the viewing direction is perpendicular to the plane of the drawing. 

[Figure: ray-traced spheres viewed from the z axis]

The code determines which sphere, if any, is visible at each pixel in turn.  The algorithm iterates through every sphere for every pixel.  If the ray for the pixel hits a sphere and that sphere is closer than the spheres processed before it, the algorithm marks that sphere as the visible one at that pixel.  To accelerate execution, we store all of the information for each sphere in constant memory.  This information does not change during execution.

We name the function that determines a hit hit() and call that function from within the kernel code.  The keyword __device__ identifies the function as one that executes on the device. 

 // Visible Spheres - after Sanders and Kandrot CUDA by Example
 // raytrace.cu

 #include <iostream>
 #include <cmath>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 #define DIM 64
 #define DIMDIM (DIM * DIM)
 #define NTPB 16
 #define M_SPHERES 500
 #define RADIUS DIM / 10.0f
 #define MIN_RADIUS 2.0f
 #define rnd(x) ((float) (x) * rand() / RAND_MAX)
 #define INF 2e10f

 class Sphere {
     float x, y, z, r;
   public:
     Sphere() {}
     void init() {
         x = rnd(DIM) - DIM / 2.0f;
         y = rnd(DIM) - DIM / 2.0f;
         z = rnd(DIM) - DIM / 2.0f;
         r = rnd(RADIUS) + MIN_RADIUS;
     }
     __device__ float hit(float ox, float oy) {
         float dx = ox - x;
         float dy = oy - y;
         if (dx * dx + dy * dy < r * r)
             return sqrtf(r * r - dx * dx - dy * dy) + z;
         else
             return -INF;
     }
 };

 __constant__ Sphere s[M_SPHERES];

 __global__ void raytrace(bool* a, int n) {
     int x = threadIdx.x + blockIdx.x * blockDim.x; // pixel column
     int y = threadIdx.y + blockIdx.y * blockDim.y; // pixel row
     int k = x + y * blockDim.x * gridDim.x;        // linear pixel index
     float ox = (x - DIM / 2);                      // ray origin relative to the image centre
     float oy = (y - DIM / 2);
     float mz = - INF;                              // depth of the closest hit so far
     for (int i = 0; i < n; i++) {
         float t = s[i].hit(ox, oy);
         if (t > mz)
             mz = t;
     }
     a[k] = mz != - INF;
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0]
                   << ": invalid no of arguments\n"
                   << "Usage: " << argv[0]
                   << "  no_of_spheres\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     if (n > M_SPHERES) n = M_SPHERES;

     // create spheres and store in constant memory
     Sphere* s_temp = new Sphere[n];
     for (int i = 0; i < n; i++)
         s_temp[i].init();
     cudaMemcpyToSymbol(s, s_temp, sizeof(Sphere) * n);
     delete [] s_temp;

     // allocate device memory for hit data
     bool* d_a;
     cudaMalloc((void**)&d_a, DIMDIM);

     // launch the grid of threads
     dim3 dimGrid(DIM/NTPB, DIM/NTPB);
     dim3 dimBlock(NTPB, NTPB);
     raytrace<<<dimGrid, dimBlock>>>(d_a, n);

     // copy hit data to host
     bool* h_a = new bool[DIMDIM];
     cudaMemcpy(h_a, d_a, DIM*DIM, cudaMemcpyDeviceToHost);

     // display results
     int k = 0;
     for (int i = 0; i < DIM; i++) {
         for (int j = 0; j < DIM; j++)
             std::cout << (h_a[k++] ? 'x' : ' ');
         std::cout << std::endl;
     }

     // clean up
     delete [] h_a;
     cudaFree(d_a);
 }

The results of a ray-trace for 40 spheres look something like:

>a 40
                             xxxxxx
                            xxxxxxxx
                           xxxxxxxxxx
                           xxxxxxxxxx
                       xxx xxxxxxxxxx
                      xxxxxxxxxxxxxxx
                      xxxxxxxxxxxxxxx
                      xxxxx xxxxxxxxxx              xxxxxxx
                       xxx   xxxxxxxxx             xxxxxxxxxx
                             xxxxxxxxxx           xxxxxxxxxxxx
                            xxxxxxxxxx           xxxxxxxxxxxxx
                           xxxxxxxxxxx           xxxxxxxxxxxxxx
                           xxxxxxxxxx           xxxxxxxxxxxxxxx
                           xxxxxxxxxx           xxxxxxxxxxxxxxx
                           xxxxxxxxx            xxxxxxxxxxxxxxx
                           xxxxxxxxx            xxxxxxxxxxxxxxx
                          xxxxxxxxxx            xxxxxxxxxxxxxxx
                 x        xxxxxxxxx         xxxxxxxxxxxxxxxxxx
                xxx       xxxxxxxx         xxxxxxxxxxxxxxxxxx
xxx            xxxxx     xxxxxxxxx   xxxx xxxxxxxxxxxxxxxxxxx
xxxxx           xxx     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx         x      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx             xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx            xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxx             xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx              xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx                 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxx                    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxx                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx
xxxx                          xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx
xxxx                          xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  xx
xxxxxx                        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxx                        xxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx                        xxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx                        xxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx                        xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx                        xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxx                         xxxxxxxxxxxxxxxxxxxxxxx   xxxxx
xxxxxxx                          xxxxxxxxxxxxxxxxxxxxxxx xxxxxxx
xxxxxxx                          xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx                         xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxx                         xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxx                          xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xx                             xxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxx
                        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                   xxxxxxxxxxxxxxxxx    xxxxxxxxxxxxxxxxxxxxxxxx
                   xxxxxxxxxxxxxxxxx     xxxxxxxxxxxxxxxxxxxxx
         x          xxxxxxxxxxxxxxxx      xxxx xxxxxxxxxxxxxx
      xxxxxxxx      xxxxxxxxxxxxxxxx            xxxxxxxxxxx
     xxxxxxxxxx     xxxxxxxxxxxxxxx              xxxxxx     xxx
xxxxxxxxxxxxxxxx     xxxxxxxxxxxxxx                       xxxxxx
xxxxxxxxxxxxxxxxx     xxxxxxxxxxxx                        xxxxxx
xxxxxxxxxxxxxxxxx      xxxxxxxxxx                        xxxxxxx
xxxxxxxxxxxxxxxxxx      xxxxxxxx                         xxxxxxx
xxxxxxxxxxxxxxxxxx         xxxxx                         xxxxxxx
xxxxxxxxxxxxxxxxxx         xxxxx                          xxxxxx
xxxxxxxxxxxxxxxxxx         xxxxx                          xxxxxx
xxxxxxxxxxxxxxxxx           xxx                             xxxx

Exercises



