Programming heterogeneous computers involves striking a balance between
coarse-grained solutions executing on serial processors and fine-grained
solutions executing on parallel processors. These notes have focused on
fine-grained solutions that deliver optimized performance.
This chapter summarizes the best practices introduced in these notes
and described in more detail in NVIDIA's CUDA Best Practices Guide.
Heterogeneous Computing
Heterogeneous computers offer a framework for designing performant solutions
that capitalize on the distinguishing features of both serial and parallel
processors. Some algorithms perform better as coarse-grained solutions,
while other algorithms perform better as fine-grained solutions.
We optimize coarse-grained solutions for heavyweight threads on CPUs and
fine-grained solutions for lightweight threads on GPUs.
Designing optimal solutions for heterogeneous computers involves:
- Assessing the application to identify its hotspots
- Determining if the code at the hotspots is parallelizable
- Understanding the workload placed on the algorithm
- Understanding the differences between host and device thread
models and physical memory
  - Threading resources
    - CPU: 16 concurrent threads
    - GPU: 2048 concurrent threads
  - Threading models
    - heavyweight threads minimize latency; context switches are slow and
      expensive
    - lightweight threads maximize throughput; separate registers for each
      thread eliminate the need for register swaps, so context switches are
      essentially free
  - PCIe bus
    - Code for coalesced memory accesses in multi-dimensional configurations
    - Retain data on the device as long as possible, using computational
      complexity to justify the transfer across the PCIe bus: matrix addition
      performs only O(n^2) work for the O(n^2) data transferred, while matrix
      multiplication performs O(n^3) work for the same transfer and so repays
      the copy far better (see the sketch after this list)
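
The data-retention point can be sketched as follows, assuming two
hypothetical kernels, scale and shift: the data crosses the PCIe bus once in
each direction, and every intermediate result stays resident in device memory.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical kernels standing in for any sequence of device-side work.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

__global__ void shift(float *d, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += s;
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // one transfer in

    // Both kernels operate on data already resident on the device;
    // no intermediate results cross the PCIe bus.
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    shift<<<(n + 255) / 256, 256>>>(d, n, 1.0f);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // one transfer out
    cudaFree(d);
    free(h);
    return 0;
}
```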
CUDA Practices
The CUDA Best Practices Guide classifies practices into high, medium, and low
priorities.
Priorities
The Guide suggests implementing higher-priority practices before
lower-priority ones.
High Priorities
High priority practices include:
- To maximize productivity, profile the application
to determine hotspots and bottlenecks
- For maximum benefit from CUDA, focus first on finding
ways to parallelize sequential code
- Use the effective bandwidth of your computation
to measure performance and optimization benefits
(a worked timing example follows this list)
- Minimize data transfer between the host and the device,
even if it means running some kernels on the device that don't
individually show performance gains compared to running on the
host CPU
- Avoid different execution paths within the same warp
(see the divergence sketch after this list)
- Minimize the use of global memory; prefer shared memory
access where possible
- Ensure global memory accesses are coalesced whenever
possible (see the coalescing sketch after this list)
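
The Guide defines effective bandwidth as (bytes read + bytes written) / 10^9,
divided by the elapsed seconds. Below is a minimal sketch of the measurement
using CUDA events; copyKernel and the problem size are illustrative
assumptions, not code from the Guide.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// A bandwidth-bound kernel: one read and one write per element.
__global__ void copyKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main(void) {
    const int n = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth = (bytes read + bytes written) / 10^9 / seconds.
    double gbps = ((double)n * sizeof(float) * 2) / 1e9 / (ms / 1000.0);
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```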
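
On divergence: when threads of one 32-thread warp branch differently, the
warp executes both paths serially. A hedged sketch follows; the two kernels
do different per-element work, and the point is only the granularity of the
branch condition.

```cuda
// Divergent: even and odd threads of every warp take different branches,
// so each warp executes both paths.
__global__ void divergent(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) d[i] *= 2.0f;
        else            d[i] += 1.0f;
    }
}

// Uniform within each warp: the condition is constant across any group of
// 32 consecutive threads, so no warp executes both paths.
__global__ void uniform(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0) d[i] *= 2.0f;
        else                   d[i] += 1.0f;
    }
}
```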
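
On coalescing in two dimensions: let threadIdx.x, the fastest-varying thread
index, walk the fastest-varying (row-major) array dimension so that a warp
touches consecutive addresses. The kernels below assume a width x height
row-major array.

```cuda
// Coalesced: consecutive threadIdx.x values read consecutive addresses,
// so each warp's accesses combine into a few memory transactions.
__global__ void coalesced(float *out, const float *in, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = 2.0f * in[row * width + col];
}

// Strided: mapping threadIdx.x to the row makes consecutive threads access
// addresses `width` elements apart, breaking coalescing.
__global__ void strided(float *out, const float *in, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = 2.0f * in[row * width + col];
}
```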
Medium Priorities
Medium priority practices include:
- Use shared memory to avoid redundant transfers from
global memory (see the stencil sketch after this list)
- Maintain sufficient numbers of active threads to hide
latency arising from register dependencies
- Select numbers of threads per block as multiples of 32
to provide optimal computing efficiency and facilitate
coalescing
- Use the fast math library whenever speed trumps precision
(see the intrinsics example after this list)
- Prefer faster, more specialized math functions over slower, more general
ones when possible
- Use signed integers rather than unsigned integers as
iteration counters
- Design access to shared memory to avoid serializing requests
due to bank conflicts (not covered in these notes)
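
A sketch of shared memory eliminating redundant global loads, assuming a 1D
stencil launched with 256 threads per block and n a multiple of the block
size: each input element is needed by up to 2*RADIUS+1 threads, but is
fetched from global memory only once per block.

```cuda
#define RADIUS 3
#define BLOCK 256   // must match the threads-per-block at launch

__global__ void stencil1d(float *out, const float *in, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index into the tile

    // One global load per thread, plus halo cells at the block edges.
    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
    }
    __syncthreads();

    // All 2*RADIUS+1 reads per output now hit fast on-chip shared memory.
    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];
        out[g] = sum;
    }
}
```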
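
For the fast math item: single-precision intrinsics such as __sinf() and
__expf() map to fast hardware approximations of the more accurate sinf() and
expf(), and compiling with nvcc's -use_fast_math flag applies such
substitutions globally. The kernels below are assumed examples.

```cuda
// Accurate library call: tighter error bounds, more instructions.
__global__ void precise(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = sinf(d[i]);
}

// Fast intrinsic: lower accuracy, much higher throughput.
__global__ void fast(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = __sinf(d[i]);
}
```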
Low Priorities
Low priority practices include:
- Use bit-shifting operations to avoid more expensive division
and modulo calculations (see the sketch after this list)
- Avoid automatic conversions of doubles to floats
(the same sketch shows single-precision literals)
- Make it easy for the compiler to use branch predication in lieu
of loops or control statements
- Use zero-copy operations on integrated GPUs (see the
mapped-memory sketch below)
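
The shift-and-mask idioms below replace division and modulo by powers of two,
and the f suffix keeps a literal in single precision so no double-precision
conversion is generated. Note that the compiler usually performs these
strength reductions itself when the divisor is a literal power of two; the
kernel is an assumed example.

```cuda
__global__ void indexMath(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int q = i >> 4;   // i / 16: shift in place of integer division
        int r = i & 15;   // i % 16: mask in place of modulo
        // 0.5f stays single precision; plain 0.5 is a double literal and
        // would force a double-precision multiply and a conversion back.
        d[i] = (q + r) * 0.5f;
    }
}
```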
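
And a minimal zero-copy sketch using mapped pinned memory: on an integrated
GPU, where host and device share physical memory, the device dereferences the
mapped pointer directly and no cudaMemcpy is issued. The kernel launch is
left hypothetical.

```cuda
#include <cuda_runtime.h>

int main(void) {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned memory

    const int n = 1 << 20;
    float *h_data, *d_data;

    // Pinned host allocation that the device can address directly.
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

    // d_data can be passed to a kernel; its writes land in h_data with no
    // explicit transfer, e.g.:
    //   kernel<<<blocks, threads>>>(d_data, n);   // hypothetical launch
    //   cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}
```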