Achieving accuracy in floating-point operations requires some
understanding of the hardware that performs the operations.
The circuitry for floating-point operations is different from
that for integer operations. The CPU's arithmetic logic unit
(ALU) handles integer operations, while its floating-point unit or
accelerator (FPA) handles floating-point operations. In the
1980s, the FPA shipped as a physically separate coprocessor to the
CPU. Starting with the 80486DX, Intel incorporated the FPA
on the main processor chip. On-chip FPAs still serve as
coprocessors in the sense that they perform floating-point
operations in parallel with the CPU.
GPUs are specialized versions of floating-point coprocessors.
Since we cannot represent all possible floating-point values
within a finite number of bits, most stored values are approximations.
To understand the effect of these approximations on results, we study how
floating-point values are stored in memory, how their representation affects
accuracy in parallel computations, and how to design algorithms that
minimize the errors that compound in parallel computations.
This chapter introduces the IEEE 754 standard for
representing floating-point values, describes the range
of representable numbers, identifies the special bit patterns,
describes the effects of rounding in arithmetic operations, and
outlines a technique for maximizing floating-point arithmetic
accuracy in parallel computations.
IEEE 754 Standard
In 1985, the IEEE published its floating-point standard (IEEE 754-1985) in a
concerted effort to convince manufacturers to adopt a common representation
of floating-point data. Today, nearly all manufacturers conform to
this standard. The standard was revised in 2008 as IEEE 754-2008; its
full name is "IEEE Standard for Floating-Point Arithmetic".
IEEE 754-2008 defines:
- arithmetic formats for storing data
- rounding rules
- arithmetic operations
- advanced operations for trigonometric functions
- interchange formats for exchanging data efficiently and compactly
- exception handling
Data Representation
The bit pattern for storing a floating-point number is a set of three
contiguous bit groups:
- a sign group (1 bit)
- an exponent group
- a mantissa or significand group
The arrangement of these three groups is open to the manufacturer.
The value of the complete set of bit groups is given by the formula

value = (-1)^{sign} x mantissa x 2^{exponent}

For example, if the groups are stored in the order listed above, if the sign bit
is off, and if the mantissa and exponent are written in base 10, then the triples on the
left represent the corresponding values on the right:
Representation    Value
0 -1 1.0          0.5 = (-1)^{0} x 1.0 x 2^{-1}
0 -1 1.8          0.9 = (-1)^{0} x 1.8 x 2^{-1}
0  0 1.0          1.0 = (-1)^{0} x 1.0 x 2^{0}
0  0 1.9          1.9 = (-1)^{0} x 1.9 x 2^{0}
0 +1 1.0          2.0 = (-1)^{0} x 1.0 x 2^{+1}
0 +1 1.9          3.8 = (-1)^{0} x 1.9 x 2^{+1}

Normalized Numbers
Separating values into a mantissa and an exponent admits
multiple representations of the same floating-point number
(for instance, 2.0 x 10^{0} and 0.2 x 10^{1}
are two possible representations of 2).
To avoid this multiplicity, we place the following constraint on the
mantissa:

1.0 <= mantissa < 2.0

Under this constraint, a unique representation exists for each
floating-point number. We refer to representations that
satisfy this constraint as normalized numbers.
Since all normalized numbers have mantissa values of the form 1.*,
there is no need to use a bit to store the leading 1.
By convention, we assume the leading bit and store only the trailing bits. The
formula for the value in truncated-mantissa (M) form is

value = (-1)^{sign} x 1.M x 2^{exponent}

An n-bit stored mantissa is therefore an (n+1)-bit mantissa.
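
A short C program can confirm this convention on any implementation that
follows the standard: float.h reports one more bit of precision than the
stored mantissa width (23 stored bits for float, 52 for double). A minimal
sketch:

#include <stdio.h>
#include <float.h>

int main(void) {
    // float stores 23 mantissa bits; the assumed leading 1 makes 24
    printf("float  mantissa bits: %d\n", FLT_MANT_DIG);   // prints 24
    // double stores 52 mantissa bits; the assumed leading 1 makes 53
    printf("double mantissa bits: %d\n", DBL_MANT_DIG);   // prints 53
    return 0;
}
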
Exponent
The range of representable floating-point numbers depends on the number
of bits reserved for the exponent. If we reserve n bits for the
exponent, then its two's-complement values range from -2^{n-1} through
2^{n-1}-1, and the representable numbers range from
1.M x 2^{-2^{n-1}} through 1.M x 2^{2^{n-1}-1}.
Encoding Convention
The standard specifies that, in encoding the exponent, we
add (2^{n-1} - 1) to its two's complement.
We refer to the result as the excess representation.
For example, if we reserve 6 bits for the exponent and its value
is 24 (24_{10} = 011000_{2}), then

start with 24:             011000
two's complement:          011000
2^{6-1} - 1 = 31:          011111    // to be added
excess representation:     110111    // is 011000 + 011111

If we reserve 6 bits for the exponent and its value is -24, then

start with 24:             011000
one's complement of 24:    100111
two's complement of -24:   101000    // is 100111 + 000001
2^{6-1} - 1 = 31:          011111    // to be added
excess representation:     000111    // is 011111 + 101000

The advantage of excess representation is the absence of a sign bit:
unsigned comparators can order all exponent values, as shown in the table below,
and unsigned comparators are less expensive than signed comparators.
Two's complement    Decimal     Excess representation
100000              reserved    111111
100001              -31         000000
100010              -30         000001
100...              ...         000...
101000              -24         000111
1...                ...         0...
000000              0           011111
0...                ...         1...
011000              24          110111
011...              ...         111...
011110              30          111101
011111              31          111110
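
The encoding amounts to adding the bias 2^{n-1} - 1 and keeping the low n
bits. The following C sketch (the function and helper names, excess and
print_bits, are ours) reproduces the two worked examples above:

#include <stdio.h>

// excess (biased) representation of an n-bit two's-complement exponent:
// add the bias 2^(n-1) - 1 and keep the low n bits
unsigned excess(int exponent, int n) {
    int bias = (1 << (n - 1)) - 1;              // for n = 6: 31
    return (unsigned)(exponent + bias) & ((1u << n) - 1);
}

// print the low n bits of v, most significant bit first
void print_bits(unsigned v, int n) {
    for (int i = n - 1; i >= 0; i--)
        putchar((v >> i) & 1 ? '1' : '0');
    putchar('\n');
}

int main(void) {
    print_bits(excess(24, 6), 6);    // prints 110111
    print_bits(excess(-24, 6), 6);   // prints 000111
    return 0;
}
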

Representable Numbers
The standard defines a representable number as a floating-point
number that the standard represents exactly.
In the figure below, the grey bars identify the major intervals
for the representable numbers. The vertical lines identify the
representable numbers within each interval.
For a 2-bit mantissa (M occupies 2 bits), there are 4
representable numbers within each interval.
For an N-bit mantissa (M occupies N bits), there are 2^{N}
representable numbers within each interval.
Features
The principal features of representable numbers are:
- the exponent bits identify the range: the number of major intervals
- the mantissa bits identify the number of representable numbers within each interval
- 0 is not representable in normalized form
- representable numbers crowd closer together in the neighbourhood of 0
- a gap exists in the interval of representable numbers adjacent to zero
Abrupt Underflow
The abrupt underflow convention was a common method of accommodating 0
before the 1985 standard. Under that convention, any number with an exponent
field of 0 was treated as 0.
The difficulty with that method was that it widened the gap in the interval
adjacent to 0.
Denormalization
The standard eliminated the gap in the vicinity of 0 by introducing denormalization.
Denormalization replaces the assumption that the mantissa is of the form 1.M with
a local assumption that, in this particular range, the mantissa is of the form 0.M instead.
Denormalized representation yields a uniformly spaced set of representable numbers in the interval adjacent
to 0.
The following snippet summarizes the two cases, where E is the stored (excess)
exponent and n is the number of exponent bits; if the exponent field is 0,
the first formula determines the floating-point value:

if (exponent == 0)    // denormalized
    value = (-1)^{sign} x 0.M x 2^{2-2^{n-1}}         where 0.0 <= 0.M < 1.0
else                  // normalized
    value = (-1)^{sign} x 1.M x 2^{E-(2^{n-1}-1)}     where 1.0 <= 1.M < 2.0

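We can check these formulas against real hardware. The following C sketch
decodes a single-precision value by hand (n = 8 exponent bits, bias =
2^{n-1} - 1 = 127); the function name decode and the sample values are ours,
and the limiting pattern E = 255 (infinity and NaN) is deliberately left out
since the next section covers it:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

// decode a single-precision value using the two formulas above
double decode(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       // reinterpret the stored bits
    uint32_t sign = bits >> 31;
    int      E    = (bits >> 23) & 0xFF;  // stored (excess) exponent
    uint32_t M    = bits & 0x7FFFFF;      // 23 stored mantissa bits

    double mantissa;
    int    exponent;
    if (E == 0) {                         // denormalized: 0.M x 2^(-126)
        mantissa = M / 8388608.0;         // 0.M  (8388608 = 2^23)
        exponent = -126;
    } else {                              // normalized: 1.M x 2^(E-127)
        mantissa = 1.0 + M / 8388608.0;   // 1.M with the hidden bit restored
        exponent = E - 127;
    }
    return (sign ? -1.0 : 1.0) * ldexp(mantissa, exponent);
}

int main(void) {
    float samples[] = { 0.5f, 0.9f, 3.8f, 1.0e-40f };   // the last is denormalized
    for (int i = 0; i < 4; i++)
        printf("%-12g decodes to %g\n", samples[i], decode(samples[i]));
    return 0;
}
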
Special Bit Patterns
Storing an exponent in excess representation leaves some ambiguity
in the limiting bit patterns. The standard defines the meaning
of the limiting bit patterns with respect to the value of the mantissa
as listed in the table below.
Exponent    Mantissa    Meaning           Description
11...1      ≠ 0         NaN               not a number
11...1      = 0         (-1)^{s} x ∞      infinity
00...0      ≠ 0         denormalized      near zero
00...0      = 0         0                 exactly zero
NaN stands for "Not A Number".
NaNs
A NaN is the result of an operation that
does not make arithmetic sense: for example, 0/0,
0 x ∞, ∞/∞, or ∞ - ∞.
All representable numbers fall between -∞ and +∞.
The standard defines two types of NaN:
A signaling NaN raises a hardware exception.
A quiet NaN percolates through
the program's execution and appears as nan
if printed. It does not raise a hardware exception and can expose
data corruption that has occurred during a parallel operation.
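
A minimal C sketch shows a quiet NaN in action; note that a NaN compares
unequal even to itself:

#include <stdio.h>
#include <math.h>

int main(void) {
    float zero = 0.0f;              // stored in a variable to force a run-time division
    float qnan = zero / zero;       // 0/0 yields a quiet NaN
    printf("%f\n", qnan);           // prints nan (or -nan on some platforms)
    printf("%d\n", isnan(qnan));    // nonzero: the value is a NaN
    printf("%d\n", qnan == qnan);   // prints 0: NaN compares unequal to everything
    return 0;
}
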
Arithmetic Accuracy
The accuracy of arithmetic operations requires attention due to
the approximate nature of floating-point representation.
We measure the accuracy of a floating-point operation by the
maximal error that the operation introduces. The
larger the error, the lower the accuracy.
The representable floating-point numbers are quite limited.
Any floating-point value between two adjacent representable
numbers is necessarily stored as one or the other of the two bounds.
Whenever the mantissa requires more bits to represent the result of an
operation than the available number of bits, rounding occurs.
The most common source of error in an arithmetic operation is a
result that is not representable and must be rounded.
Consider an addition operation on two operands with different
exponents. Before performing the operation, we equate the
exponents by right-shifting the mantissa of the smaller value.
Once the exponents are equal, the addition operation is straightforward.
Addition
Consider adding 1.00 x 2^{-2} to 1.00 x 2^{1}
in a 5-bit representation (1 sign bit, 2 exponent bits, 2 mantissa bits),
that is, [0 00 01] + [0 10 00]:

1.00 x 2^{1} + 1.00 x 2^{-2}
= 1.00 x 2^{1} + 0.001 x 2^{1}    // equate the exponents
= 1.001 x 2^{1}                   // accurate result
≈ 1.00 x 2^{1}                    // closest representation [0 10 00]
0.001 x 2^{1}                     // error introduced

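The same swallowing effect is easy to reproduce in single precision, where
the stored mantissa has 23 bits. In this C sketch (the values are ours), a
term smaller than half the place value of the larger operand's last mantissa
bit disappears entirely:

#include <stdio.h>
#include <math.h>

int main(void) {
    float big  = 1.0f;
    float tiny = ldexpf(1.0f, -25);   // 2^-25, below half the last bit of 1.0f
    float sum  = big + tiny;          // tiny is rounded away entirely
    printf("%d\n", sum == big);       // prints 1
    return 0;
}
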
ULP
The error introduced in arithmetic operations is related
to the place value of the least significant bit in the mantissa.
We refer to this value as a unit in the last place (ULP).
If the hardware is designed to perform arithmetic and
rounding operations perfectly, then the greatest error
that an arithmetic operation introduces is no more than 0.5 ULP.
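
We can observe the ULP of 1.0f directly with the standard nextafterf
function; for single precision it equals FLT_EPSILON = 2^{-23}:

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    // the ULP of 1.0f is the gap to the next representable float above it
    float ulp = nextafterf(1.0f, 2.0f) - 1.0f;
    printf("%g\n", ulp);                   // prints 1.19209e-07, i.e. 2^-23
    printf("%d\n", ulp == FLT_EPSILON);    // prints 1
    return 0;
}
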
Algorithmic Considerations
We can improve the accuracy of floating-point operations by designing algorithms
that accommodate the approximate nature of floating-point arithmetic.
While integer addition is associative, floating-point
addition is not, on CPUs and GPUs alike. That is, the order in which
a processor performs the operations determines their outcome.
Example
Consider the addition of 4 numbers in sequence:

1.00000 x 2^{0} + 1.00000 x 2^{0} + 1.00000 x 2^{-5} + 1.00000 x 2^{-5}
= 1.00000 x 2^{1} + 1.00000 x 2^{-5} + 1.00000 x 2^{-5}
= 1.00000 x 2^{1} + 1.00000 x 2^{-5}
= 1.00000 x 2^{1}

Each 2^{-5} term is rounded away because it falls below the place value
of the least significant mantissa bit of 2^{1}.
Now consider the same set of 4 numbers in paired addition. Compare the
results:

[1.00000 x 2^{0} + 1.00000 x 2^{0}] + [1.00000 x 2^{-5} + 1.00000 x 2^{-5}]
= 1.00000 x 2^{1} + 1.00000 x 2^{-4}
= 1.00001 x 2^{1}
Note how the rounding of each intermediate result led to a different final result.
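
The effect reproduces directly in single precision. In this C sketch (the
values are ours, and it assumes the default round-to-nearest-even mode with
float arithmetic performed in single precision), two copies of 2^{-24}
vanish when added one at a time but survive when paired first:

#include <stdio.h>
#include <math.h>

int main(void) {
    float one  = 1.0f;
    float tiny = ldexpf(1.0f, -24);                  // 2^-24, half a ULP of 2.0f
    float sequential = ((one + one) + tiny) + tiny;  // each tiny rounds away
    float paired     = (one + one) + (tiny + tiny);  // tiny + tiny = 2^-23 survives
    printf("%.9g\n", sequential);                    // prints 2
    printf("%.9g\n", paired);                        // prints 2.00000012
    return 0;
}
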
Sorting
Sorting the numbers in ascending order before performing the addition would clearly improve
accuracy.
Consider the sequential addition of the above set of 4 numbers
after sorting them in ascending order:

1.00000 x 2^{-5} + 1.00000 x 2^{-5} + 1.00000 x 2^{0} + 1.00000 x 2^{0}
= 1.00000 x 2^{-4} + 1.00000 x 2^{0} + 1.00000 x 2^{0}
= 1.00010 x 2^{0} + 1.00000 x 2^{0}
= 1.00001 x 2^{1}

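Continuing the single-precision sketch above, ascending order lets the small
terms accumulate before they meet the larger ones, and the sequential sum
recovers the more accurate result:

#include <stdio.h>
#include <math.h>

int main(void) {
    float one  = 1.0f;
    float tiny = ldexpf(1.0f, -24);            // 2^-24
    // ascending order: the two tiny terms combine before meeting 1.0f
    float sorted = ((tiny + tiny) + one) + one;
    printf("%.9g\n", sorted);                  // prints 2.00000012
    return 0;
}
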
Parallel Algorithms
A common technique for maximizing floating-point accuracy in parallel
algorithms such as reductions is to presort the data prior to the
operation. If we divide the numbers into groups with identical
or neighbouring exponents, we can guarantee that the operations are
performed on numbers close to one another, and greater accuracy will
result, as the sketch below illustrates.
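
A minimal serial sketch of the idea follows; the helper names (by_magnitude,
pairwise_sum) are ours, and a real GPU reduction would distribute the
pairwise loop across threads rather than run it sequentially:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// comparator for ascending order of magnitude
static int by_magnitude(const void *a, const void *b) {
    float fa = fabsf(*(const float *)a);
    float fb = fabsf(*(const float *)b);
    return (fa > fb) - (fa < fb);
}

// pairwise (tree) reduction over n elements (n a power of 2 for simplicity):
// the same combining order a parallel reduction would use
float pairwise_sum(float *x, int n) {
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];
    return x[0];
}

int main(void) {
    float data[4] = { 1.0f, ldexpf(1.0f, -24), 1.0f, ldexpf(1.0f, -24) };
    // presort so each pair combines numbers with neighbouring exponents
    qsort(data, 4, sizeof(float), by_magnitude);
    printf("%.9g\n", pairwise_sum(data, 4));   // prints 2.00000012
    return 0;
}
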
Exercises
- Read Wikipedia on Coprocessors
- Read Wikipedia on IEEE 754
- Discuss why a reduction algorithm designed to minimize thread divergence
on an array presorted in ascending order can reduce the accuracy of the final
result. Suggest one way to improve accuracy.
