Achieving accuracy in floating-point operations requires some
understanding of the hardware that performs the operations.
The circuitry for floating-point operations is different from
that for integer operations. The CPU's arithmetic logic unit
(ALU) handles integer operations, while its floating-point unit or
accelerator (FPA) handles floating-point operations. In the
1980s, the FPA shipped as a physically separate coprocessor to the
CPU. Starting with the 80486DX, Intel incorporated the FPA
on the main processor chip. On-chip FPAs still serve as
coprocessors in the sense that they perform floating-point
operations in parallel with the CPU.
GPUs are specialized versions of floating-point coprocessors.
Since we cannot represent all possible floating-point values
within a finite number of bits, most stored values are approximations.
To understand the effect of these approximations on results, we study how
floating-point values are stored in memory, how their representation affects
accuracy in parallel computations, and how to design algorithms that
minimize the errors that compound in parallel computations.
This chapter introduces the IEEE 754 standard for
representing floating-point values, describes the range
of representable numbers, identifies the special bit patterns,
describes the effects of rounding in arithmetic operations, and
outlines a technique for maximizing floating-point arithmetic
accuracy in parallel computations.
IEEE 754 Standard
In 1985, the IEEE published its floating-point standard (IEEE 754-1985) in a
concerted effort to convince manufacturers to adopt a common representation
of floating-point data. Today, nearly all manufacturers conform to
this standard. The standard was revised in 2008 as IEEE 754-2008; its
full name is "IEEE Standard for Floating-Point Arithmetic".
IEEE 754-2008 defines:
- arithmetic formats for storing data
- rounding rules
- arithmetic operations
- advanced operations for trigonometric functions
- interchange formats for exchanging data efficiently and compactly
- exception handling
Data Representation
The bit pattern for storing a floating-point number is a set of three
contiguous bit groups:
- a sign group (1 bit)
- an exponent group
- a mantissa or significand group
The arrangement of these three groups is open to the manufacturer.
The value of the complete set of bit groups is given by the formula

value = (-1)^{sign} x mantissa x 2^{exponent}

For example, if the groups are stored in the order listed above, if the sign bit
is off, and if the mantissa and exponent are written in base 10, then the triples on the
left represent the corresponding values on the right:
Representation    Value
0 -1 1.0          0.5 = (-1)^{0} x 1.0 x 2^{-1}
0 -1 1.8          0.9 = (-1)^{0} x 1.8 x 2^{-1}
0  0 1.0          1.0 = (-1)^{0} x 1.0 x 2^{0}
0  0 1.9          1.9 = (-1)^{0} x 1.9 x 2^{0}
0 +1 1.0          2.0 = (-1)^{0} x 1.0 x 2^{+1}
0 +1 1.9          3.8 = (-1)^{0} x 1.9 x 2^{+1}

Normalized Numbers
Separating values into a mantissa and an exponent admits
multiple representations of the same floating-point number
(for instance, 2.0 x 10^{0} and 0.2 x 10^{1}
are two possible representations of 2).
To avoid this multiplicity, we place the following constraint on the
mantissa:

1.0 <= mantissa < 2.0

Under this constraint, a unique representation exists for each
floating-point number. We refer to representations that
satisfy this constraint as normalized numbers.
Since all normalized numbers have mantissa values of the form 1.*,
there is no need to use a bit to store the leading 1.
By convention, we assume the leading bit and store only the trailing bits. The
formula for the value in truncated-mantissa (M) form is

value = (-1)^{sign} x 1.M x 2^{exponent}

An n-bit stored mantissa is therefore an (n+1)-bit mantissa.
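
A short C program can confirm this convention on any implementation that
follows the standard: float.h reports one more bit of precision than the
stored mantissa width (23 stored bits for float, 52 for double). A minimal
sketch:

#include <stdio.h>
#include <float.h>

int main(void) {
    // float stores 23 mantissa bits; the assumed leading 1 makes 24
    printf("float  mantissa bits: %d\n", FLT_MANT_DIG);   // prints 24
    // double stores 52 mantissa bits; the assumed leading 1 makes 53
    printf("double mantissa bits: %d\n", DBL_MANT_DIG);   // prints 53
    return 0;
}
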
Exponent
The range of representable floating-point numbers depends on the number
of bits reserved for the exponent. If we reserve n bits for the
exponent, then its two's-complement values range from -2^{n-1} through
2^{n-1}-1, and the representable numbers range from
1.M x 2^{-2^{n-1}} through 1.M x 2^{2^{n-1}-1}.
Encoding Convention
The standard specifies that, in encoding the exponent, we
add (2^{n-1} - 1) to its two's complement.
We refer to the result as the excess representation.
For example, if we reserve 6 bits for the exponent and its value
is 24 (24_{10} = 011000_{2}), then

start with 24:             011000
two's complement:          011000
2^{6-1} - 1 = 31:          011111    // to be added
excess representation:     110111    // is 011000 + 011111

If we reserve 6 bits for the exponent and its value is -24, then

start with 24:             011000
one's complement of 24:    100111
two's complement of -24:   101000    // is 100111 + 000001
2^{6-1} - 1 = 31:          011111    // to be added
excess representation:     000111    // is 011111 + 101000

The advantage of excess representation is the absence of a sign bit:
unsigned comparators can order all exponent values, as shown in the table below,
and unsigned comparators are less expensive than signed comparators.
Two's complement    Decimal     Excess representation
100000              reserved    111111
100001              -31         000000
100010              -30         000001
100...              ...         000...
101000              -24         000111
1...                ...         0...
000000              0           011111
0...                ...         1...
011000              24          110111
011...              ...         111...
011110              30          111101
011111              31          111110
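
The encoding amounts to adding the bias 2^{n-1} - 1 and keeping the low n
bits. The following C sketch (the function and helper names, excess and
print_bits, are ours) reproduces the two worked examples above:

#include <stdio.h>

// excess (biased) representation of an n-bit two's-complement exponent:
// add the bias 2^(n-1) - 1 and keep the low n bits
unsigned excess(int exponent, int n) {
    int bias = (1 << (n - 1)) - 1;              // for n = 6: 31
    return (unsigned)(exponent + bias) & ((1u << n) - 1);
}

// print the low n bits of v, most significant bit first
void print_bits(unsigned v, int n) {
    for (int i = n - 1; i >= 0; i--)
        putchar((v >> i) & 1 ? '1' : '0');
    putchar('\n');
}

int main(void) {
    print_bits(excess(24, 6), 6);    // prints 110111
    print_bits(excess(-24, 6), 6);   // prints 000111
    return 0;
}
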

Representable Numbers
The standard defines a representable number as a floating-point
number that the standard represents exactly.
In the figure below, the grey bars identify the major intervals
for the representable numbers. The vertical lines identify the
representable numbers within each interval.
For a 2-bit mantissa (M occupies 2 bits), there are 4
representable numbers within each interval.
For an N-bit mantissa (M occupies N bits), there are 2^{N}
representable numbers within each interval.
Features
The principal features of representable numbers are:
- the exponent bits identify the range: the number of major intervals
- the mantissa bits identify the number of representable numbers within each interval
- 0 is not representable in normalized form
- representable numbers crowd closer together in the neighbourhood of 0
- a gap exists in the interval of representable numbers adjacent to zero
Abrupt Underflow
The abrupt underflow convention was a common method of accommodating 0
before the 1985 standard. Under that convention, any number with an exponent
field of 0 was treated as 0.
The difficulty with that method was that it widened the gap in the interval
adjacent to 0.
Denormalization
The standard eliminated the gap in the vicinity of 0 by introducing denormalization.
Denormalization replaces the assumption that the mantissa is of the form 1.M with
a local assumption that, in this particular range, the mantissa is of the form 0.M instead.
Denormalized representation yields a uniformly spaced set of representable numbers in the interval adjacent
to 0.
The following snippet summarizes the two cases, where E is the stored (excess)
exponent and n is the number of exponent bits; if the exponent field is 0,
the first formula determines the floating-point value:

if (exponent == 0)    // denormalized
    value = (-1)^{sign} x 0.M x 2^{2-2^{n-1}}         where 0.0 <= 0.M < 1.0
else                  // normalized
    value = (-1)^{sign} x 1.M x 2^{E-(2^{n-1}-1)}     where 1.0 <= 1.M < 2.0

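We can check these formulas against real hardware. The following C sketch
decodes a single-precision value by hand (n = 8 exponent bits, bias =
2^{n-1} - 1 = 127); the function name decode and the sample values are ours,
and the limiting pattern E = 255 (infinity and NaN) is deliberately left out
since the next section covers it:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

// decode a single-precision value using the two formulas above
double decode(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       // reinterpret the stored bits
    uint32_t sign = bits >> 31;
    int      E    = (bits >> 23) & 0xFF;  // stored (excess) exponent
    uint32_t M    = bits & 0x7FFFFF;      // 23 stored mantissa bits

    double mantissa;
    int    exponent;
    if (E == 0) {                         // denormalized: 0.M x 2^(-126)
        mantissa = M / 8388608.0;         // 0.M  (8388608 = 2^23)
        exponent = -126;
    } else {                              // normalized: 1.M x 2^(E-127)
        mantissa = 1.0 + M / 8388608.0;   // 1.M with the hidden bit restored
        exponent = E - 127;
    }
    return (sign ? -1.0 : 1.0) * ldexp(mantissa, exponent);
}

int main(void) {
    float samples[] = { 0.5f, 0.9f, 3.8f, 1.0e-40f };   // the last is denormalized
    for (int i = 0; i < 4; i++)
        printf("%-12g decodes to %g\n", samples[i], decode(samples[i]));
    return 0;
}
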
Special Bit Patterns
Storing an exponent in excess representation leaves some ambiguity
in the limiting bit patterns. The standard defines the meaning
of the limiting bit patterns with respect to the value of the mantissa
as listed in the table below.
Exponent    Mantissa    Meaning           Description
11...1      ≠ 0         NaN               not a number
11...1      = 0         (-1)^{s} x ∞      infinity
00...0      ≠ 0         denormalized      near zero
00...0      = 0         0                 exactly zero
NaN stands for "Not A Number".
NaNs
A NaN is the result of an operation that
does not make arithmetic sense: for example, 0/0,
0 x ∞, ∞/∞, or ∞ - ∞.
All representable numbers fall between -∞ and +∞.
The standard defines two types of NaN:
A signaling NaN raises a hardware exception.
A quiet NaN percolates through
the program's execution and appears as nan
if printed. It does not raise a hardware exception and can expose
data corruption that has occurred during a parallel operation.
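
A minimal C sketch shows a quiet NaN in action; note that a NaN compares
unequal even to itself:

#include <stdio.h>
#include <math.h>

int main(void) {
    float zero = 0.0f;              // stored in a variable to force a run-time division
    float qnan = zero / zero;       // 0/0 yields a quiet NaN
    printf("%f\n", qnan);           // prints nan (or -nan on some platforms)
    printf("%d\n", isnan(qnan));    // nonzero: the value is a NaN
    printf("%d\n", qnan == qnan);   // prints 0: NaN compares unequal to everything
    return 0;
}
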
Arithmetic Accuracy
The accuracy of arithmetic operations requires attention due to
the approximate nature of floating-point representation.
We measure the accuracy of a floating-point operation by the
maximal error that the operation introduces. The
larger the error, the lower the accuracy.
The representable floating-point numbers are quite limited.
Any floating-point value between two adjacent representable
numbers is necessarily stored as one or the other of the two bounds.
Whenever the mantissa requires more bits to represent the result of an
operation than the available number of bits, rounding occurs.
The most common source of error in an arithmetic operation is a
result that is not representable and must be rounded.
Consider an addition operation on two operands with different
exponents. Before performing the operation, we equate the
exponents by right-shifting the mantissa of the smaller value.
Once the exponents are equal, the addition operation is straightforward.
Addition
Consider adding 1.00 x 2^{-2} to 1.00 x 2^{1}
in a 5-bit representation (1 sign bit, 2 exponent bits, 2 mantissa bits),
that is, [0 00 01] + [0 10 00]:

1.00 x 2^{1} + 1.00 x 2^{-2}
= 1.00 x 2^{1} + 0.001 x 2^{1}    // equate the exponents
= 1.001 x 2^{1}                   // accurate result
≈ 1.00 x 2^{1}                    // closest representation [0 10 00]
0.001 x 2^{1}                     // error introduced

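The same swallowing effect is easy to reproduce in single precision, where
the stored mantissa has 23 bits. In this C sketch (the values are ours), a
term smaller than half the place value of the larger operand's last mantissa
bit disappears entirely:

#include <stdio.h>
#include <math.h>

int main(void) {
    float big  = 1.0f;
    float tiny = ldexpf(1.0f, -25);   // 2^-25, below half the last bit of 1.0f
    float sum  = big + tiny;          // tiny is rounded away entirely
    printf("%d\n", sum == big);       // prints 1
    return 0;
}
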
ULP
The error introduced in arithmetic operations is related
to the place value of the least significant bit in the mantissa.
We refer to this value as a unit in the last place (ULP).
If the hardware is designed to perform arithmetic and
rounding operations perfectly, then the greatest error
that an arithmetic operation introduces is no more than 0.5 ULP.
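
We can observe the ULP of 1.0f directly with the standard nextafterf
function; for single precision it equals FLT_EPSILON = 2^{-23}:

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    // the ULP of 1.0f is the gap to the next representable float above it
    float ulp = nextafterf(1.0f, 2.0f) - 1.0f;
    printf("%g\n", ulp);                   // prints 1.19209e-07, i.e. 2^-23
    printf("%d\n", ulp == FLT_EPSILON);    // prints 1
    return 0;
}
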
Algorithmic Considerations
We can improve the accuracy of floating-point operations by designing algorithms
that accommodate the approximate nature of floating-point arithmetic.
While integer addition is associative, floating-point
addition is not, on CPUs and GPUs alike. That is, the order in which
a processor performs the operations determines their outcome.
Example
Consider the addition of 4 numbers in sequence:

1.00000 x 2^{0} + 1.00000 x 2^{0} + 1.00000 x 2^{-5} + 1.00000 x 2^{-5}
= 1.00000 x 2^{1} + 1.00000 x 2^{-5} + 1.00000 x 2^{-5}
= 1.00000 x 2^{1} + 1.00000 x 2^{-5}
= 1.00000 x 2^{1}

Each 2^{-5} term is rounded away because it falls below the place value
of the least significant mantissa bit of 2^{1}.
Now consider the same set of 4 numbers in paired addition. Compare the
results:

[1.00000 x 2^{0} + 1.00000 x 2^{0}] + [1.00000 x 2^{-5} + 1.00000 x 2^{-5}]
= 1.00000 x 2^{1} + 1.00000 x 2^{-4}
= 1.00001 x 2^{1}
Note how the rounding of each intermediate result led to a different final result.
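
The effect reproduces directly in single precision. In this C sketch (the
values are ours, and it assumes the default round-to-nearest-even mode with
float arithmetic performed in single precision), two copies of 2^{-24}
vanish when added one at a time but survive when paired first:

#include <stdio.h>
#include <math.h>

int main(void) {
    float one  = 1.0f;
    float tiny = ldexpf(1.0f, -24);                  // 2^-24, half a ULP of 2.0f
    float sequential = ((one + one) + tiny) + tiny;  // each tiny rounds away
    float paired     = (one + one) + (tiny + tiny);  // tiny + tiny = 2^-23 survives
    printf("%.9g\n", sequential);                    // prints 2
    printf("%.9g\n", paired);                        // prints 2.00000012
    return 0;
}
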
Sorting
Sorting the numbers in ascending order before performing the addition would clearly improve
accuracy.
Consider the sequential addition of the above set of 4 numbers
after sorting them in ascending order:

1.00000 x 2^{-5} + 1.00000 x 2^{-5} + 1.00000 x 2^{0} + 1.00000 x 2^{0}
= 1.00000 x 2^{-4} + 1.00000 x 2^{0} + 1.00000 x 2^{0}
= 1.00010 x 2^{0} + 1.00000 x 2^{0}
= 1.00001 x 2^{1}

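Continuing the single-precision sketch above, ascending order lets the small
terms accumulate before they meet the larger ones, and the sequential sum
recovers the more accurate result:

#include <stdio.h>
#include <math.h>

int main(void) {
    float one  = 1.0f;
    float tiny = ldexpf(1.0f, -24);            // 2^-24
    // ascending order: the two tiny terms combine before meeting 1.0f
    float sorted = ((tiny + tiny) + one) + one;
    printf("%.9g\n", sorted);                  // prints 2.00000012
    return 0;
}
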
Parallel Algorithms
A common technique for maximizing floating-point accuracy in parallel
algorithms such as reductions is to presort the data prior to the
operation. If we divide the numbers into groups with identical
or neighbouring exponents, we can guarantee that the operations are
performed on numbers close to one another, and greater accuracy will
result, as the sketch below illustrates.
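
A minimal serial sketch of the idea follows; the helper names (by_magnitude,
pairwise_sum) are ours, and a real GPU reduction would distribute the
pairwise loop across threads rather than run it sequentially:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// comparator for ascending order of magnitude
static int by_magnitude(const void *a, const void *b) {
    float fa = fabsf(*(const float *)a);
    float fb = fabsf(*(const float *)b);
    return (fa > fb) - (fa < fb);
}

// pairwise (tree) reduction over n elements (n a power of 2 for simplicity):
// the same combining order a parallel reduction would use
float pairwise_sum(float *x, int n) {
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];
    return x[0];
}

int main(void) {
    float data[4] = { 1.0f, ldexpf(1.0f, -24), 1.0f, ldexpf(1.0f, -24) };
    // presort so each pair combines numbers with neighbouring exponents
    qsort(data, 4, sizeof(float), by_magnitude);
    printf("%.9g\n", pairwise_sum(data, 4));   // prints 2.00000012
    return 0;
}
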
Exercises
- Read Wikipedia on Coprocessors
- Read Wikipedia on IEEE 754
- Discuss why a reduction algorithm designed to minimize thread divergence
on an array presorted in ascending order can reduce the accuracy of the final
result. Suggest one way to improve accuracy.
