Part B - CUDA Programming Model

Floating-Point Considerations

Describe floating-point representation
Discuss arithmetic accuracy
Describe algorithmic considerations


Achieving accuracy in floating-point operations requires some understanding of the hardware that performs the operations. The circuitry for floating-point operations is different from that for integer operations. The CPU's arithmetic logic unit (ALU) handles integer operations, while its floating-point unit or accelerator (FPA) handles floating-point operations. In the 1980s, the FPA shipped as a physically separate coprocessor to the CPU. With the Pentium processors, Intel incorporated the FPA on the main processor chip. On-chip FPAs still serve as coprocessors in the sense that they perform floating-point operations in parallel with the CPU. GPUs are specialized versions of floating-point coprocessors.

Since a finite number of bits cannot represent all possible floating-point values, most stored values are approximations. To understand the effect of these approximations, we study how floating-point values are stored in memory, how their representation affects accuracy, and how to design algorithms that minimize the errors that compound in parallel computations.

This chapter introduces the IEEE 754 standard for representing floating-point values, describes the range of representable numbers, identifies the special bit patterns, describes the effects of rounding in arithmetic operations, and outlines a technique for maximizing floating-point arithmetic accuracy in parallel computations.

IEEE 754 Standard

In 1985, IEEE published its floating-point standard (IEEE 754-1985) in a concerted effort to convince manufacturers to adopt a common representation of floating-point data.  Today, nearly all manufacturers conform to this standard.  The most recent version is IEEE 754-2008.  Its full name is "The IEEE Standard for Floating-Point Arithmetic". 

IEEE 754-2008 defines:

  • arithmetic formats for storing data
  • rounding rules
  • arithmetic operations
  • advanced operations for trigonometric functions
  • interchange formats for exchanging data efficiently and compactly
  • exception handling

Data Representation

The bit pattern for storing a floating-point number is a set of three contiguous bit groups

  1. a sign group - 1 bit
  2. an exponent group
  3. a mantissa or significand group

Arrangement of these three groups is open to the manufacturer. 

The value of the complete set of bit groups is given by the formula

 value = (-1)^sign x mantissa x 2^exponent

For example, if the groups are stored in the order listed above, if the sign bit is off and if the mantissa and exponent are in base 10, then the triples on the left represent the corresponding values on the right

 sign  exponent  mantissa    value
  0       -1       1.0       0.5 = (-1)^0 x 1.0 x 2^-1
  0       -1       1.8       0.9 = (-1)^0 x 1.8 x 2^-1
  0        0       1.0       1.0 = (-1)^0 x 1.0 x 2^0
  0        0       1.9       1.9 = (-1)^0 x 1.9 x 2^0
  0        1       1.0       2.0 = (-1)^0 x 1.0 x 2^1
  0        1       1.9       3.8 = (-1)^0 x 1.9 x 2^1

Normalized Numbers

Separating values into a mantissa and an exponent admits multiple representations of the same floating-point number (for instance, 2.0 = 2.0 x 10^0 and 0.2 x 10^1 are two possible representations of 2).  To avoid this multiplicity, we place the following constraint on the mantissa

 1.0 <= mantissa < 2.0

Under this constraint, a unique representation exists for each floating-point number.  We refer to representations that satisfy this constraint as normalized numbers.

Since all normalized numbers have mantissa values of the form 1.*, there is no need to use a bit to store the leading 1.  By convention, we assume the leading bit and store only the trailing bits.  The formula for the value in truncated mantissa (M) form is

 value = (-1)^sign x 1.M x 2^exponent

An n-bit stored mantissa is actually an n+1-bit mantissa. 
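
As an aside, not part of the original notes, the following C++ sketch extracts the three bit groups of a 32-bit IEEE 754 float (1 sign bit, 8 exponent bits, 23 stored mantissa bits). The exponent is stored in excess (biased) form, described in the next subsection.

 // Sketch: extract the sign, exponent, and stored mantissa fields of a
 // 32-bit IEEE 754 float; the leading 1 of the mantissa is implied.
 #include <cstdint>
 #include <cstdio>
 #include <cstring>

 int main() {
     float f = 0.9f;
     std::uint32_t bits;
     std::memcpy(&bits, &f, sizeof bits);        // copy the raw bit pattern

     unsigned sign     = bits >> 31;             // 1 sign bit
     unsigned exponent = (bits >> 23) & 0xFFu;   // 8 exponent bits (biased)
     unsigned mantissa = bits & 0x7FFFFFu;       // 23 stored mantissa bits

     std::printf("sign %u, biased exponent %u, mantissa 0x%06X\n",
                 sign, exponent, mantissa);
     return 0;
 }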


The range of representable floating-point numbers depends on the number of bits reserved for the exponent.  If we reserve n bits for the exponent, the exponent ranges over roughly -2^(n-1) to 2^(n-1), so the representable numbers range from about 1.M x 2^(-2^(n-1)) through 1.M x 2^(2^(n-1)).
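
For the 32-bit float type these limits can be queried directly with std::numeric_limits; a quick host-side sketch, not from the original notes, assuming a standard IEEE 754 implementation:

 // Sketch: report the exponent range and extreme normalized values of float.
 #include <cstdio>
 #include <limits>

 int main() {
     using lim = std::numeric_limits<float>;
     std::printf("exponent range  : %d .. %d\n",
                 lim::min_exponent - 1, lim::max_exponent - 1);  // -126 .. 127
     std::printf("smallest normal : %e\n", lim::min());          // 1.0 x 2^-126
     std::printf("largest finite  : %e\n", lim::max());          // (2 - 2^-23) x 2^127
     return 0;
 }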

Encoding Convention

The standard specifies that, in encoding the exponent, we add (2^(n-1) - 1) to its two's complement.  We refer to the result as the excess representation.

For example, if we reserve 6 bits for the exponent and its value is 24 (24 decimal = 011000 binary), then

 2s complement of 24   = 011000
 2^(6-1) - 1 = 31      = 011111 // to be added
 representation        = 110111 // is 011000 + 011111

If we reserve 6 bits for the exponent and its value is -24 (24 decimal = 011000 binary), then

 1s complement of 24   = 100111
 2s complement of 24   = 101000 // is 100111 + 000001
 2^(6-1) - 1 = 31      = 011111 // to be added
 representation        = 000111 // is 011111 + 101000, keeping the low 6 bits
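
A small sketch, with a hypothetical helper not from the notes, that reproduces both examples by adding the bias 2^(n-1) - 1 to the exponent and keeping only the low n bits:

 // Sketch: encode a signed exponent in excess representation.
 #include <cstdio>

 unsigned excessEncode(int exponent, int n) {
     int bias = (1 << (n - 1)) - 1;                            // 2^(n-1) - 1
     return static_cast<unsigned>(exponent + bias) & ((1u << n) - 1u);
 }

 int main() {
     std::printf("%02X\n", excessEncode( 24, 6));   // 37 hex == 110111 binary
     std::printf("%02X\n", excessEncode(-24, 6));   // 07 hex == 000111 binary
     return 0;
 }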

The advantage of excess representation is the absence of a sign bit in the stored exponent: the bit patterns increase monotonically with the exponent value.  Unsigned comparators, which are less expensive than signed comparators, can therefore order all exponent values correctly, as the table below shows for a 3-bit exponent (excess-3).

 value        2s complement   excess representation
  -3              101                000
  -2              110                001
  -1              111                010
   0              000                011
   1              001                100
   2              010                101
   3              011                110
 (reserved)       100                111

Representable Numbers

The standard defines a representable number as a value that the format represents exactly.  In the figure below, the grey bars identify the major intervals of the representable numbers.  The vertical lines identify the representable numbers within each interval.

Normalized Numbers

For a 2-bit mantissa (M occupies 2 bits) there are 4 representable numbers in each major interval.  For an N-bit mantissa (M occupies N bits) there are 2^N representable numbers in each major interval.


The principal features of a representable number are:

  • the exponent bits determine the number of major intervals (the range)
  • the mantissa bits determine the number of representable numbers within each interval
  • 0 is not representable as a normalized number
  • representable numbers become more closely spaced as they approach 0
  • a gap exists in the interval of representable numbers adjacent to zero

Abrupt Underflow

The abrupt underflow convention was a common method of accommodating 0 before the 1985 standard.  Under that convention, any number with an exponent of 0 was treated as 0.

Abrupt Underflow

The difficulty with that method was that it increased the gap in the interval adjacent to 0. 


The standard eliminated the gap in the vicinity of 0 by introducing denormalization.  Denormalization replaces the assumption that the mantissa is of the form 1.M by a local assumption that in this particular range the mantissa is of the form 0.M instead.  Denormalized representation yields a uniformly spaced set of representable numbers in the interval adjacent to 0. 

Denormalized Numbers

The following pseudocode summarizes the two cases: if the exponent field E is 0, the number is denormalized and the first formula applies; otherwise the normalized formula applies:

 if (E == 0)
     value = (-1)^sign x 0.M x 2^(-2^(n-1) + 2)    // where 0.0 <= 0.M < 1.0
 else
     value = (-1)^sign x 1.M x 2^(E - 2^(n-1) + 1) // where 1.0 <= 1.M < 2.0
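
A short sketch, not from the notes, showing the denormalized range of the 32-bit float; it assumes IEEE 754 behaviour, so hardware or compiler modes that flush denormals to zero will print 0 instead:

 // Sketch: the smallest normal and denormal magnitudes of float.
 #include <cstdio>
 #include <limits>

 int main() {
     float smallestNormal   = std::numeric_limits<float>::min();         // 1.0 x 2^-126
     float smallestDenormal = std::numeric_limits<float>::denorm_min();  // 2^-149

     std::printf("smallest normal   : %g\n", smallestNormal);
     std::printf("half of it        : %g (non-zero: denormalized)\n", smallestNormal / 2.0f);
     std::printf("smallest denormal : %g\n", smallestDenormal);
     std::printf("half of that      : %g (underflows to 0)\n", smallestDenormal / 2.0f);
     return 0;
 }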

Special Bit Patterns

Storing an exponent in excess representation leaves the two limiting bit patterns (all 0s and all 1s) for special use.  The standard defines the meaning of these patterns with respect to the value of the mantissa as listed in the table below.

 Exponent   Mantissa   Meaning        Description
 11...1     ≠ 0        NaN            not a number
 11...1     = 0        (-1)^s x ∞     infinity
 00...0     ≠ 0        denormalized   near zero
 00...0     = 0        0              exactly zero

NaN stands for "Not A Number". 


A NaN is the result of an operation that does not make arithmetic sense.  For example, 0/0, 0 * ∞, ∞ / ∞, or ∞ - ∞.  All representable numbers fall between -∞ and +∞. 

The standard defines two types of NaNs:

  • signaling NaN
  • quiet NaN

A signaling NaN raises a hardware exception.

A quiet NaN percolates through the program's execution and appears as NaN if printed.  It does not raise a hardware exception and can expose data corruption that has occurred during a parallel operation.
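
The special values are easy to produce and detect in code; a minimal sketch, not from the notes, assuming IEEE 754 semantics for division by zero:

 // Sketch: producing and detecting infinity and a quiet NaN.
 #include <cmath>
 #include <cstdio>

 int main() {
     float zero = 0.0f;
     float inf  = 1.0f / zero;   // exponent all 1s, mantissa 0
     float qnan = zero / zero;   // exponent all 1s, mantissa non-zero

     std::printf("inf         : %f  isinf = %d\n", inf, std::isinf(inf));
     std::printf("qnan        : %f  isnan = %d\n", qnan, std::isnan(qnan));
     std::printf("qnan + 1.0f : %f  (a quiet NaN propagates silently)\n", qnan + 1.0f);
     return 0;
 }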

Arithmetic Accuracy

The accuracy of arithmetic operations requires attention due to the approximate nature of floating-point representation.  We measure the accuracy of a floating-point operation by the maximal error that the operation introduces.  The larger the error, the lower the accuracy.

The set of representable floating-point numbers is finite.  Any value that falls between two adjacent representable numbers must be stored as one of those two bounds.

Rounding occurs whenever the result of an operation requires more mantissa bits than are available.  This is the most common source of error in an arithmetic operation: the result is not representable and must be rounded.

Consider an addition operation on two operands with different exponents.  Before performing the operation, we equate the exponents by right-shifting the mantissa of the smaller value.  Once the exponents are equal the addition operation is straightforward. 


Consider adding 1.00 x 2^1 to 1.00 x 2^-2 in a 5-bit representation, [0 10 00] + [0 00 01]:

   1.00 x 2^1 + 1.00 x 2^-2
 = 1.00 x 2^1 + 0.001 x 2^1
 = 1.001 x 2^1 (exact result)
 ≈ 1.00 x 2^1 (closest representation) [0 10 00]
   0.001 x 2^1 (error introduced)


The error introduced in arithmetic operations is related to the place value of the least significant bit in the mantissa.  We refer to this value as the unit in last place (ULP).  If the hardware is designed to perform arithmetic and rounding operations perfectly, then the greatest error that an arithmetic operation introduces is no more than 0.5 ULP.
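
A minimal sketch, not from the notes, that measures the ULP of 2.0f with std::nextafterf (32-bit float, 23-bit stored mantissa) and shows a term below half an ULP being rounded away:

 // Sketch: ULP of 2.0f, and loss of a term smaller than half an ULP.
 #include <cmath>
 #include <cstdio>

 int main() {
     float big  = 2.0f;
     float ulp  = std::nextafterf(big, 3.0f) - big;   // 2^-22 for a 23-bit mantissa
     float tiny = ulp / 4.0f;                         // below half an ULP of 'big'

     std::printf("ulp of 2.0f : %g\n", ulp);
     std::printf("2.0f + tiny : %.10f (tiny is lost)\n", big + tiny);
     return 0;
 }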

Algorithmic Considerations

We can improve the accuracy of floating-point operations by designing algorithms that accommodate the approximate nature of floating-point arithmetic. 

While integer addition is associative, floating-point addition is not.  That is, the order in which the operations are performed can change the outcome, and a GPU that parallelizes a computation changes that order.


Consider the addition of 4 numbers in sequence

   1.00000 x 2^0 + 1.00000 x 2^0 + 1.00000 x 2^-5 + 1.00000 x 2^-5
 = 1.00000 x 2^1 + 1.00000 x 2^-5 + 1.00000 x 2^-5
 = 1.00000 x 2^1 + 1.00000 x 2^-5
 = 1.00000 x 2^1

Now consider the same set of 4 numbers in paired addition.  Compare the results

   [1.00000 x 2^0 + 1.00000 x 2^0] + [1.00000 x 2^-5 + 1.00000 x 2^-5]
 = 1.00000 x 2^1 + 1.00000 x 2^-4
 = 1.00001 x 2^1

Note how the rounding of each intermediate result led to a different final result.
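
The same comparison can be reproduced with 32-bit floats (23-bit stored mantissa). In this sketch, not from the original notes, the small term is half an ULP of 2.0f, so it vanishes when added one term at a time but survives when the two small terms are paired first:

 // Sketch: the four-term sum in sequential and paired order.
 #include <cmath>
 #include <cstdio>

 int main() {
     float a = 1.0f;                   // 1.0 x 2^0
     float b = std::ldexp(1.0f, -23);  // 1.0 x 2^-23, half an ULP of 2.0f

     float sequential = ((a + a) + b) + b;   // each b lost against 2.0f (round to even)
     float paired     = (a + a) + (b + b);   // b + b = 2^-22 survives

     std::printf("sequential : %.10f\n", sequential);  // 2.0000000000
     std::printf("paired     : %.10f\n", paired);      // 2.0000002384
     return 0;
 }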


Sorting the numbers in ascending order before performing the addition would clearly improve accuracy.

Consider the sequential addition of the above set of 4 numbers after sorting them in ascending order

   1.00000 x 2^-5 + 1.00000 x 2^-5 + 1.00000 x 2^0 + 1.00000 x 2^0
 = 1.00000 x 2^-4 + 1.00000 x 2^0 + 1.00000 x 2^0
 = 1.00010 x 2^0 + 1.00000 x 2^0
 = 1.00001 x 2^1

Parallel Algorithms

A common technique for maximizing floating-point accuracy in parallel algorithms such as reductions is to presort the data before the operation.  If we divide the numbers into groups with identical or neighbouring exponents, we can guarantee that the operations are performed on numbers close to one another, and greater accuracy results.
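
The following host-side sketch (plain C++ rather than a CUDA kernel, with illustrative values not taken from the notes) shows the ordering effect that presorting exploits: summing terms in ascending order lets the small values accumulate before they meet the large one.

 // Sketch: one large term plus many terms below half an ULP of it.
 #include <algorithm>
 #include <cstdio>
 #include <vector>

 int main() {
     std::vector<float> data;
     data.push_back(1.0f);                    // one large term first
     data.insert(data.end(), 1024, 5.0e-8f);  // 1024 terms below half an ULP of 1.0f

     float inOrder = 0.0f;
     for (float x : data) inOrder += x;       // every small term is rounded away

     std::sort(data.begin(), data.end());     // ascending order
     float sorted = 0.0f;
     for (float x : data) sorted += x;        // small terms accumulate first

     std::printf("large-first order : %.8f\n", inOrder);  // 1.00000000
     std::printf("ascending order   : %.8f\n", sorted);   // close to 1.00005120
     return 0;
 }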


Exercises

  • Read Wikipedia on Coprocessors
  • Read Wikipedia on IEEE 754
  • Discuss why a reduction algorithm designed to minimize thread divergence on an array presorted in ascending order can reduce the accuracy of the final result.  Suggest one way to improve accuracy. 

