Part E - Conclusion

Best Practices

List best practices for programming heterogeneous computers
Summarize the CUDA Best Practices Guide

Programming heterogeneous computers involves striking a stable balance between coarse-grained solutions that execute on the serial processors and fine-grained solutions that execute on the parallel processors. These notes have focused on fine-grained solutions that deliver optimized performance.

This chapter summarizes the best practices introduced in these notes and described in more detail in NVIDIA's CUDA Best Practices Guide.

Heterogeneous Computing

Heterogeneous computers offer a framework for designing performant solutions that capitalize on the distinguishing features of both serial and parallel processors.  Some algorithms perform better as coarse-grained solutions, while other algorithms perform better as fine-grained solutions.  We optimize coarse-grained solutions for heavyweight threads on CPUs and fine-grained solutions for lightweight threads on GPUs.

Designing optimal solutions for heterogeneous computers involves:

  1. Assessing the application to identify its hotspots
  2. Determining if the code at the hotspots is parallelizable
  3. Understanding the workload that the algorithm must handle
  4. Understanding the differences between host and device thread models and physical memory
    • Threading Resources
      • CPU - 16 concurrent threads
      • GPU - 2048 concurrent threads per multiprocessor
    • Threading
      • CPU threads are heavyweight - designed to minimize latency - context switches are slow and expensive
      • GPU threads are lightweight - designed to maximize throughput - separate registers for each thread eliminate the need for register swaps - context switches are effectively free
    • PCIe bus
      • host and device memories are physically separate - data moves between them across the PCIe bus
  5. Coding for coalesced memory accesses in multi-dimensional configurations
  6. Retaining data on the device as long as possible - using computational complexity to justify a transfer across the PCIe bus - for example, adding matrices does too little work per element transferred, while multiplying matrices can justify the transfer
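
The trade-off in item 6 can be made concrete by comparing the operations performed with the elements moved across the PCIe bus. The following C++ sketch is a host-side back-of-the-envelope calculation, not a measurement. The function names are hypothetical. It assumes square n x n matrices, two operand matrices copied to the device and one result copied back, one addition per result element for matrix addition, and n multiply-adds per result element for matrix multiplication.

```cpp
#include <cassert>
#include <cstdint>

// Elements crossing the PCIe bus for C = A op B on n x n matrices:
// A and B travel to the device, C travels back (assumed transfer pattern).
std::uint64_t elementsTransferred(std::uint64_t n) {
    assert(n > 0);
    return 3 * n * n;
}

// Matrix addition: one add per result element.
std::uint64_t addOperations(std::uint64_t n) { return n * n; }

// Matrix multiplication: n multiply-adds per result element.
std::uint64_t multiplyOperations(std::uint64_t n) { return n * n * n; }

// Operations performed for each element moved across the bus.
double operationsPerElement(std::uint64_t operations, std::uint64_t n) {
    return static_cast<double>(operations) /
           static_cast<double>(elementsTransferred(n));
}
```

Under these assumptions, addition performs 1/3 of an operation per element transferred regardless of n, so the transfer dominates. Multiplication performs n/3 operations per element transferred, so the larger the matrices, the easier the transfer is to justify.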

CUDA Practices

The CUDA Best Practices Guide classifies practices into high, medium, and low priorities.

The Guide suggests implementing the higher priority practices before any others.

High Priorities

High priority practices include:

  • To maximize productivity, profile the application to determine hotspots and bottlenecks
  • For maximum benefit from CUDA, focus first on finding ways to parallelize sequential code
  • Use the effective bandwidth of your computation to measure performance and optimization benefits
  • Minimize data transfer between the host and the device, even if it means running some kernels on the device that don't individually show performance gains compared to running on the host CPU
  • Avoid different execution paths within the same warp
  • Minimize the use of global memory; prefer shared memory access where possible
  • Ensure global memory accesses are coalesced whenever possible
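
The effective bandwidth named in the third practice is, following the formula in the Best Practices Guide, the number of bytes a kernel reads plus the number of bytes it writes, divided by its elapsed time and expressed in GB/s. A minimal C++ sketch follows. The function and parameter names are hypothetical, and it assumes the elapsed time has already been measured elsewhere (for example, with CUDA events).

```cpp
#include <cassert>

// Effective bandwidth in GB/s:
//   ((bytes read + bytes written) / 10^9) / elapsed seconds
double effectiveBandwidthGBs(double bytesRead, double bytesWritten,
                             double seconds) {
    assert(seconds > 0.0);
    return (bytesRead + bytesWritten) / 1.0e9 / seconds;
}
```

For a kernel that copies a 2048 x 2048 matrix of floats, both bytesRead and bytesWritten would be 2048 * 2048 * 4. Comparing the result against the device's theoretical peak bandwidth indicates how much room for optimization remains.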

Medium Priorities

Medium priority practices include:

  • Use shared memory to avoid redundant transfers from global memory
  • Maintain sufficient numbers of active threads to hide latency arising from register dependencies
  • Select numbers of threads per block as multiples of 32 to provide optimal computing efficiency and facilitate coalescing
  • Use the fast math library whenever speed trumps precision
  • Prefer faster, more specialized math functions over slower, more general ones when possible
  • Use signed integers rather than unsigned integers as loop counters
  • Design access to shared memory to avoid serializing requests due to bank conflicts (not covered in these notes)
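
The practice of selecting threads per block as a multiple of 32 can be applied on the host when computing a launch configuration. The following C++ sketch uses hypothetical helper names and assumes a warp size of 32 threads and one thread per data element. It does not check the device's limit on threads per block.

```cpp
#include <cassert>

constexpr int kWarpSize = 32; // assumed number of threads per warp

// Round a requested block size up to the next multiple of the warp size.
int roundUpToWarpMultiple(int requestedThreads) {
    assert(requestedThreads > 0);
    return (requestedThreads + kWarpSize - 1) / kWarpSize * kWarpSize;
}

// Number of blocks needed to cover n elements with one thread per element.
int blocksFor(int n, int threadsPerBlock) {
    assert(n > 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

A request for 100 threads per block becomes 128, and 1000 elements then need 8 blocks. Because the last block is only partly used, the kernel itself must still skip any thread whose index is not less than n.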

Low Priorities

Low priority practices include:

  • Use bit-shifting and bitwise operations to avoid more expensive division and modulo calculations when the divisor is a power of 2
  • Avoid automatic conversions of doubles to floats
  • Make it easy for the compiler to use branch predication in lieu of loops or control statements
  • Use zero-copy operations on integrated GPUs
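
The first practice in this list relies on two identities that hold when the divisor n is a power of 2: i / n equals i >> log2(n), and i % n equals i & (n - 1). The following C++ sketch illustrates them with hypothetical helper names. It uses unsigned operands so that the identities hold for every value, and a compiler will typically make the same substitution on its own when n is a literal.

```cpp
#include <cassert>
#include <cstdint>

// True if n is a power of 2.
bool isPowerOfTwo(std::uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Base-2 logarithm of n, where n must be a power of 2.
std::uint32_t log2OfPowerOfTwo(std::uint32_t n) {
    assert(isPowerOfTwo(n));
    std::uint32_t shift = 0;
    while ((n >> shift) != 1u) ++shift;
    return shift;
}

// i / n computed with a right shift; n must be a power of 2.
std::uint32_t divideByPowerOfTwo(std::uint32_t i, std::uint32_t n) {
    return i >> log2OfPowerOfTwo(n);
}

// i % n computed with a bitwise AND; n must be a power of 2.
std::uint32_t moduloPowerOfTwo(std::uint32_t i, std::uint32_t n) {
    assert(isPowerOfTwo(n));
    return i & (n - 1);
}
```

In practice the shift amount would be a precomputed constant rather than recomputed on each call. The loop here only keeps the sketch self-contained.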

  Designed by Chris Szalwinski
Creative Commons License