Floating-Point Computing: A Comedy of Errors?
By Gregory Tarsy and Neil Toda, Sun Microsystems, 20.Jan.04
This article explores some of the intricacies inherent in using floating-point arithmetic for numerical computing. It is devoted to analyzing a straightforward example of summing the sequence of square-roots of the first 1,000,000,000 positive integers. Thanks are due to Paul Murphy whose Linuxworld article of 1/29/2003 was the inspiration for this note.
Contents
- The Right Answer
- The Analysis
- Why are some results better than SPARC?
- IEEE-754 and the Results on Sun Platforms
- Notes
- Conclusions
Paul Murphy asked people to run the following code using default
compilers and no special switches, and to report the result:
% cc -o sqr -O sqr.c -lm
% time ./sqr
where sqr.c is:
#include <stdio.h>
#include <math.h>
main()
{
register i;
double f=0.0;
double sqrt();
for(i=0;i<=1000000000;i++)
{
f+=sqrt( (double) i);
}
printf("Finished %20.14f\n",f);
}
The table at the end of the article summarizes some reported
results. Studying these suggests the questions: why are the results
different for different (IEEE754 compliant) computers and different
compilers, what is the "correct" result, how is it found, why are some results apparently more accurate than others, and why are the results on Sun, IBM, and other RISC CPU based computers about the same?
The Right Answer
Murphy discovered that one method of computing a "correct" result was the utilization of the Euler-Maclaurin formula and high precision computation. It gives 21081851083600.37596259382529338 as a good approximation to the correct result.
The Analysis
The main points to be made are:
- The error as measured by the more suitable unit of ULPs (Units in the Last Place) is not really large and it is further controllable by improving the summation algorithm.
- The SPARC and other RISC CPU results are uniform across systems while the IA32 results vary from somewhat better to much worse; uniformity is a virtue.
- The variability of IA32 results stems from improper management of extended precision registers vis-a-vis memory precision, a subtlety that vitiates significantly the potential real advantage of extended precision intermediate computation. Note that IA32 does indeed comply with IEEE754; its differences with SPARC or other RISCs is within IEEE754 latitude.
- The example is typical of kernels in the solution of differential equations, like the three-body problem, which will overwhelm eventually any finite bounded precision representation, so better numerical approaches must be used anyway.
Next is a more detailed discussion. For a good foundation in
floating-point computing see the ACM article by David Goldberg "What Every Computer Scientist Should Know about Floating-Point Computing" especially with the addendum by Douglas Priest that addresses IA32. Also of interest might be Joe Darcy's JavaOne talk on What Everybody Using the Java Programming Language Should Know About Floating-Point Arithmetic.
The Results: Correct vs IA32 versus SPARC
The table presents results from some of the sources/runs
in order of accuracy.
1) Correct
21081851083600.37596259382529338
0x42b32c803ebb5060
2) SPARC-quad
21081851083600.37500000000000
0x42b32c803ebb5060
3) Correct+1ULP
21081851083600.3789
0x42b32c803ebb5061
4) IA32
21081851083600.38281250000000
0x42b32c803ebb5062
5) SPARC double
21081851083600.55859375000000
0x42b32c803ebb508f
Entry number 1 is the result from the Euler-Maclaurin formula as
presented by Murphy. The third column of the table gives the IEEE-754
floating-point double-precision hexadecimal representation of the
result.
Entry number 2, shows a result that in decimal representation, is in
error by -.00096259382529338. However, observe that the hex
representation shows an error of 0 ULPs, an exact match. The reason
for this apparent contradiction is that IEEE-754 double precision
represents only 16- or 17-decimal digits. Additional digits are used in
rounding when converting from decimal to the double precision format.
Entry number 2 presents the double-precision results when the sum is
kept in IEEE-754 128-bit format.
Entry number 3 presents not the result of a run, but rather serves
to illustrate the error in the decimal representation when the
hexadecimal number varies by 1 bit, or unit in the last place, denoted
ULP.
The best computed results are in error by 2 ULPs and are represented
by entry number 4.
The SPARC results which are in error by 47 ULPs, are shown as entry
number 5.
The Magnitude of the Errors
Consider now the errors encountered in the survey, based on the previous table. Consider #1 as perfect. Errors can be considered in terms of decimal representation, and ULPs.
When considering the absolute error alone, SPARC results appear very inaccurate. However, considering the ratio error/correct, the variance between results, in fact, is discovered to be very small.
SPARC-quad
-0.00096259382529338
0
0
Correct+1ULP
0.00293740617470662
1
1.39333E-16
BEST
0.00684990617470662
2
3.24920E-16
SPARC-double
0.18263115617470662
47
86.6296E-16
There are two sources of error in the algorithm as provided in
Murphy's article. The first is due to the sqrt operation, and the
second is due to the addition in summing the square roots.
Sqrt rounding error: the largest rounding errors for the sqrt
operations should occur for the largest numbers. A one ULP error for
sqrt(1,000,000,000) is on the order of 2*10^(-12). This rounding error
is very small compared to the one that is possible due to the addition
that follows the sqrt operations.
The sum just before sqrt(1000000000) is 21081851051977.5977. A 1
ULP error in the sum value results in an error of about .0029. Huge
compared to the rounding error of the sqrt operation.
The SPARC-quad run shows that a lot more precision will yield the
correct result. The cost as Murphy points out, is significantly more
compute time.
The question is, can something be done to minimize the inaccuracy
without pushing precision to quad?
Yes.
Compensated Summation: An Algorithm That Corrects for Errors
The following algorithm uses the notion of compensated summation, which
deals with the rounding error that is associated with each add in the
summation.
#include <stdio.h>
#include <math.h>
main()
{
register i;
double sum=0.0,newsum,f;
double sumerr=0.0;
double err;
double sqrt();
for(i=1;i<=1000000000;i++)
{
f=sqrt( (double) i);
newsum = sum + f;
err = (newsum - sum ) - f;
sumerr += err;
sum = newsum;
}
printf("Finished %20.14f\n",sum);
printf("Finished %20.14f\n",sum-sumerr);
}
This program was compiled/linked with the same command, but
returned results matching Correct while only taking about 20% more compute time.
The entry "SPARC double+e" in the following table shows the result
using compensated summation.
We offer a bit more detail about compensated summation in the Notes
section at the end of this article.
Correct
21081851083600.37596259382529338
0x42b32c803ebb5060
SPARC double+e
21081851083600.37500000000000
0x42b32c803ebb5060
SPARC-quad
21081851083600.37500000000000
0x42b32c803ebb5060
Correct+1ULP
21081851083600.3789
0x42b32c803ebb5061
BEST
21081851083600.38281250000000
0x42b32c803ebb5062
SPARC double
21081851083600.55859375000000
0x42b32c803ebb508f
Why are some results better than SPARC?
There is a clustering of systems in terms of errors, identified as RISC
machines that adhere to IEEE-754 64-bit format for double precision
work.
The best results were from systems based on Intel's Pentium model.
Pentium systems have extended precision registers, 80 bits versus
SPARC's 64-bits in double. In addition, Pentium systems allow
intermediate results to be kept in extended format. An explicit call is
required to force the Pentium processor to keep intermediate results in
64-bit format.
As the preceding section shows, the critical issue of concern here
should be the precision used in the problem. As Murphy points out, the
correct value was calculated using 200 digits of precision.
While the results from the Pentium class machines were better, the
question arguably remains: Are any of these answers bad? We contend
that the answer to this question is no. Given the magnitude of the
number, and the size of the error, the results are all perfectly
reasonable.
Finally, what should be learned from this example is that a
precision level may work for one algorithm and data set, but fail if
the data changes or if the algorithm is modified.
Another interpretation of this is that the compilers of C don't have
strong guarantees about what "double" means. In some versions of gcc
with some compiler options, "double" can mean "80-bits" instead of
64-bits. Some of the gcc compilers might be new enough to have some C99
support so it is conceivable that double_t is being used instead of
double.
If one wants more accurate results without compensated summation,
one should declare f as "long double" or "double_t". If one wants
predictable results one can use Java, even on x86.
IEEE-754 and the Results on Sun Platforms
IEEE 754 aids understanding where errors arise in computations by
enabling us to work out what error is committed in each operation just
by knowing the operands and the destination format; we need not know
such system-dependent details as whether the hardware uses a guard bit
or whether it chops or rounds. But IEEE 754 also obscures understanding
why a particular system computes the result that it does because it
allows implementations to arbitrarily choose which operations are
rounded to the format the programmer specifies and which are rounded to
a wider format.
The one pertinent observation in Murphy's analysis of his example is
that the results on SPARC systems dating back to 1989 are all the same,
whereas the results on x86 systems vary in ways that are not at all
obvious.
Notes
In the original algorithm:
for(i=0;i<=1000000000;i++)
{
f+=sqrt( (double) i);
}
whenever the addition is done, there are two possibilities:
- The result is exact and fits into the 64-bits allotted, or
- the result will not exactly fit, and rounding needs to occur.
With each addition, the rounding error is ignored. After a billion
additions, the results are noticeable in this problem.
The compensating algorithm does the following:
- Do the add, but don't worry about rounding error:
newsum = sum + f;
- Determine the part of f that contributes to the new value:
(newsum - sum )
- f's contribution to the new sum minus f give the amount of f that
is lost, i.e. the rounding error:
err = (newsum - sum ) - f;
- Keep track of the rounding errors separately:
sumerr += err;
- All that is left is to adjust things so that the loop can be used
again.
Linux
PII
400
GCC 2.95 with GMP 4.0
Precision set to 200 digits
21081851083600.37596259382529338
0.0
6 hours, 11 minutes, 45.9 sec
Solaris 2.7
Ultra-II
450
gcc 2.95
cc -o sqr -O2 sqr.c -lm
21081851083600.55859375
-0.18263116
79.0u 0.1s 0:15 96% 0+0k 0+0io 0pf+0w
SunOS 4.1.4 (Sun IPX)
SP 2+
40
K&R
cc -O -o sqr sqr.c -lm
21081851083600.55859375
-0.18263116
6 hours, 57 minutes, 18 seconds
Redhat7.1 Kernel 2.4.17
Intel PIII
933MHz
gcc version 2.96 20000731
(Red Hat Linux 7.1 2.96-81)
gcc -lm -o test test.c
21081851083598.38281250000000
1.99219
176.350u 0.340s 3:07.78 94.0%
Redhat7.1 Kernel 2.4.17
Intel PIII
933MHz
gcc version 2.96 20000731
(Red Hat Linux 7.1 2.96-81)
gcc -march=i686 -O3
-lm -o test test.c
21081851083600.38281250000000
-0.0078125
74.680u 0.110s 1:14.79 100.0% 0+0k 0+0io 109pf+0w
SUSE 7.1
Intel PI
Unknown
/usr/lib/gcc-lib/i486-suse-linux/2.95.2/specs
gcc version
2.95.2 19991024
gcc -o test test.c -lm
gcc -O3 -o test test.c
-lm
21081851083598.38281250000000
1.99219
real 39m57.842s user 39m1.750s sys 0m1.210s
SUSE 7.1
Intel PI
Unknown
/usr/lib/gcc-lib/i486-suse-linux/2.95.2/specs
gcc version
2.95.2 19991024
gcc -O3 -o test test.c
-lm
21081851083600.38281250000000
-0.0078125
real 11m56.810s user 11m48.850s sys 0m0.250s
NetBSD
PIV
1.7GHz
gcc version egcs-2.91.66 19990314
(egcs-1.1.2 release)
gcc -O sq.c -o sq -lm
21081851083600.55859375000000
-0.183594
112.383u 0.000s 1:52.98 99.4% 0+0k 2+75io 2pf+0w
NetBSD
PIV
1.7GHz
/usr/pkg/gcc-2.95.3/lib/gcc-lib/
i386--netbsdelf/2.95.3/specs
gcc -O sq.c -o sq -lm
21081851083600.55859375000000
-0.183594
110.656u 0.000s 1:51.64 99.1% 0+0k 2+110io 1pf+0w
Solaris 8
Ultra III
750
gcc version 2.95.3 20010315
(Not 64 bit as far as I can
tell!)
gcc -o sqrO3 -O3 sqr.c -lm
21081851083600.55859375000000
-0.183594
58.40u 0.01s 1:00.76 96.1%
Darwin
G4
450
Apple Computer, Inc. version gcc-932.1,
based on gcc version
2.95.2
19991024 (release)
gcc -o sqrO -O sqr.c
-lm
21081851083600.55859375000000
-0.183594
270.930u 1.040s 4:47.78 94.5% 0+0k 0+1io 0pf+0w
Linux 2.4.7 (redhat)
AMD
1.6GHz
gcc 2.96-98 (redhat)
cc -o sqr -O sqr.c -lm
21081851083600.38281250000000
-0.0078125
real 0m23.492s user 0m23.170s sys 0m0.000s
Linux 2.4.7
athlon
750
gcc 3.0
21081851083598.38281250000000
1.99219
121.730u 0.030s 2:22.15 85.6% 0+0k 0+0io 104pf+0w
Linux
G3
300
gcc 2.95.2
21081851083600.55859375000000
-0.183594
515.390u 0.030s 8:35.45 99.9% 0+0k 0+0io 154pf+0w
Linux
alpha/21164a
533
gcc 3.0
21081851083600.55859375000000
-0.183594
570.998u 0.049s 10:14.14 92.9% 0+0k 0+0io 0pf+0w
Linux
alpha/21164a
533
ccc here is DEC/Compaq C compiler V6.4, cpml is the
DEC/Compaq Alpha-optimized math library
ccc -lcpml fptest.c -o
fptest
21081851083600.55859375000000
-0.183594
151.426u 0.003s 2:31.44 99.9% 0+0k 0+0io 0pf+0w
Solaris 8
UIII
750
Sun Forte 6.2 Compiler
cc -o sqr -O sqr.c -lm
21081851083600.55859375000000
-0.183594
real 1:59.1 user 1:59.0 sys 0.0
Red hat 7.2
PIV
1.6GHz
gnu c compiler 2.96
cc -o sqr -O sqr.c -lm
21081851083600.38281250000000
-0.0078125
real 0m31.098s user 0m29.250s sys 0m0.040s
Red hat 7.1
PII
400
gcc 2.96
cc -o sqr -O sqr.c -lm
21081851083600.38281250000000
-0.0078125
real 2m54.350s user 2m53.180s sys 0m1.140s
Debian GNU/Linux 2.2.19
Athlon
1.2GH
gcc 2.95
gcc test.c -o test -lm
21081851083598.38281250000000
1.99219
real 2m7.523s user 2m3.160s sys 0m0.020s
Solaris 8
Ultra IIi
440
using Forte 6.1, "-O1" optimization
cc -o sqr -xO2
~/src/sqr.c -lm
21081851083600.55859375000000
-0.183594
real 174.59 user 168.24 sys 0.03
Solaris 8
Ultra IIi
440
using gcc-2.95.3
gcc -o sqr -O2
~/src/sqr.c -lm
21081851083600.55859375000000
-0.183594
real 88.69 user 87.06 sys 0.00
Solaris 8
U III
750
gcc-2.95.3, "-O2" optimization
gcc -o sqr -O2
~/src/sqr.c -lm
21081851083600.55859375000000
-0.183594
real 57.60 user 57.47 sys 0.00
Solaris 8
U III
750
using Forte 6.1, "-O2" optimization
cc -o sqr -xO2
~/src/sqr.c -lm
21081851083600.5585937500000
-0.183594
real 119.15 user 118.89 sys 0.00
HP-UX 11
PA-8600
550
HP Ansi C B.11.11.02,
"-O1" optimization
cc -o sqr -O1 ~/src/sqr.c -lm
21081851083600.55859375000000
-0.183594
real 3:20.8 user 3:20.7 sys 0.1
AIX 4.3.3.0
RS-64-III
500
IBM C (ibmcxx.cmp) 3.6.6.4,
"-O2" optimization
cc -o sqr -O2
~/src/sqr.c -lm
21081851083600.55859375000000
-0.183594
Real 341.19 User 170.95 System 0.01
Conclusions
Even the simple example of square-root summation affords the
opportunity for appreciation of the complexity of floating-point
computation. The complexity is inherent in the attempt to model
arithmetic with real numbers, which in "true" arithmetic have infinite
precision, by a finite precision approximation. Almost every arithmetic
operation, therefore, results in an error and, in fact, it is in
addition/subtraction that the largest errors can occur. Thus
floating-point computation is epitomized by the need to understand and
estimate the error to ensure the validity of what one computes.
About the Authors
Gregory Tarsy is the manager of the Floating Point and Numerical
Computing group at Sun Microsystems. Prior to joining Sun he worked
in scientific
applications on supercomputers and was a professor of
mathematics at the University of California and City University of
New York.
Neil Toda is a member of Sun's Floating Point and Numerical
Computing
group. Prior to joining Sun, he worked in Operating Systems
Engineering,
Database Engineering, and Software Performance Engineering.