Question about floating point

classic Classic list List threaded Threaded
58 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

R Smith-2

On 2018/12/18 1:21 AM, James K. Lowden wrote:

>
>> First, the problem is not storage it's calculation.
>>
>> Second, the thread was started because a floating point calculation
>> in SQLite, exactly as it is run today, led to the following value:
>>
>> 211496.25999999992
>>
>> which is typical of such problems.
> What problem?  Rounded to the number of significant digits -- 2 decimal
> places in the input -- the number is correct.


Exactly, and I would go further to suggest that the problem is much more
to do with the human brain and visual concepts than the numbers.

Sure enough, as a mathematician the numbers
211496.25999999992 and
211496.26

look exactly the same to me (when I consider scales in the sub 15 digit
range), as it does to an IEEE float, but you can see how, to the average
onlooker's brain, those look vastly different - see them as pictures
rather than numbers, which obviously look completely different - and
this is what scares people and why this keeps being a problem.

I'm not even going to touch on silly/stupid programming and calculations
that round along the intermediate steps, those have been mentioned
already, they are evil and it isn't the fault of the storage medium.

The fact that the SQLite programmers and the Python programmers (or
perhaps Python-SQLite-wrapper programmers) did not choose the same
presentation is just one more case of "not the storage method's fault
but indeed the interpretation's fault".


PS, a good video to cure you of "visual number deficit" sickness would
be one of those explaining why 0.999... (recurring) is exactly equal to
1. (And that's not even touching IEEE....)
This be as good as any: https://www.youtube.com/watch?v=G_gUE74YVos


Cheers!
Ryan


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Richard Damon
On 12/18/18 6:21 AM, R Smith wrote:
>
> I'm not even going to touch on silly/stupid programming and
> calculations that round along the intermediate steps, those have been
> mentioned already, they are evil and it isn't the fault of the storage
> medium.

Actually, periodically rounding IS a valid method IF you know the
'precision' of the native numbers. For example, if you know that all the
numbers in a list supposed to be exact values to two decimals, as you
add them up the rounding error grows, and if the list is long enough,
can cause an error i those two decimal digits, but by periodically
rounding to that two decimals resets the rounding error.

The error is in doing this sort of rounding when the input numbers are
NOT known to be 'exact' to those two digits, but you round to that
precision. THAT rounding will increase the error in the calculation.

--
Richard Damon

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Scott Robison-2
In reply to this post by Thomas Kurz
On Mon, Dec 17, 2018 at 2:50 PM Thomas Kurz <[hidden email]> wrote:

> Ok, as there seem to be some experts about floating-point numbers here,
> there is one aspect that I never understood:
>
> floats are stored as a fractional part, which is binary encoded, and an
> integer-type exponent. The first leads to the famous rounding errors as
> there is no exact representation of most fractions.
>
> Can someone explain to me why it has been defined this way? Having 1 bit
> sign, 11 bit exponent, and 52 bit mantissa, I would have stored the (in the
> meantime well known) number 211496.26 as 21149626E-2, i.e. I would have
> stored a 52 bit integer number and appropriate exponent. This way there
> should be no rounding errors and one would always have a guaranteed
> precision of ~15 significant digits.
>

To get the maximum precision possible from a binary floating point number,
the designers of the format took advantage of the fact that all numbers
other than zero would have a 1 bit set somewhere in their representation.
To that end, "normal" floating point numbers actually have a 53 bit
mantissa. "But that equals 65 bits! You can't cram 65 bits into a 64 bit
word." But you can if the most significant set bit of the mantissa is
implied just to the left of the explicitly given 52 bits of the mantissa.
The most significant digit of a decimal number can be any value from 1
through 9, so you can't use this same trick to extend the precision of a
decimal floating point number.

In addition to normal numbers, there are subnormal numbers, where the left
most digit is implicitly a 0 bit. The value zero happens to be a subnormal
number with all bits set to zero.

Even without the implicit bit, many / most schemes for encoding decimal
digits in binary lose some portion of the range that is possible with
binary representations, and the IEEE designers wanted the best of both
worlds, range and precision, so they gave up exact decimal representation
in favor of binary.

Your approach of coding is what the decimal type does in the .net platform,
among other examples, but the available range is smaller than IEEE binary
floating point numbers of the same size.

As far as it goes, you can still have rounding errors that propagate with a
decimal scheme such as you suggest. Simply add 1/3 + 1/3 + 1/3 in a decimal
representation.

333333333333333E-15 + 333333333333333E-15 + 333333333333333E-15 =
999999999999999E-15. But it should be 1000000000000000E-15 (or 1E0). It
doesn't matter how many bits of precision you add, you can never do this
type of math exactly with decimal floating point numbers. Any time the
decimal expansion extends beyond the bit length of the available precision,
rounding choices are going to have to be made at some point, and some
calculation will be inexact.

Note: I am spouting from memory, so my apologies if I've gotten any
terminology wrong (such as subnormal vs denormal, so similar other ideas).

SDR
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
ajm
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

ajm
Although the problem is well known by the readers, may be someone would like remember the basics (somethin written by me some years ago -in spanish-).
http://www.zator.com/Cpp/E2_2_4a.htm
If you want "play" whiths the numbers in IEE754 this page bay be the fun (unfortuately, the original is not longer available)
http://www.zator.com/Cpp/E2_2_4a1.htm

A.J. Millan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Dennis Clarke
In reply to this post by R Smith-2
On 12/18/18 6:01 AM, R Smith wrote:> On 2018/12/17 11:53 PM, Dennis
Clarke wrote:
 >>
 >> This thread is getting out of hand. Firstly there is no such binary
 >> representation ( in this universe ) for a trivial decimal number such as
 >> one tenth ( 0.10 ) and really folks should refer to the text book
 >> recently published ( 2nd Edition actually ) where all this is covered
 >> : //....
 >
 >
 > My good man, did the discussion really irritate you ...
[WARNING : written with a smile ]

Well, I guess the real issue is that I see fairly baseline stuff going
in circles over and over and over and over and very little clarity. I
merely posted that textbook because it really was written by the experts
and I have done a fair amount of emails in life back and forth to a few
of the authors on various topics who always cleared the air. They really
are the experts and the only name missing from that list is the great
and dreaded William Kahan himself.  I say that with a smile as Professor
Emeritus of Mathematics Kahan is well known to write very bluntly about
people who have not a clue about trivial things. Trivial to him. The
rest of us merely try to catch up and get a solid understanding of the
basics which, as I was saying, have been covered over and over and over
and over in circles over and over and it gets ... annoying to walk into
a store and hear Beethoven's Ninth Symphony played over dirty cheap
speakers.  Certainly when I have heard live performances a few times in
my life. Very irritating is the word.

I see this sort of thing happen from time to time and I have to take the
approach of 'care' or 'do not care'.  In the case of the store with bad
speakers the option is 'do not care' and simply accept the noise. In the
case were good people can be led down a wrong path and then fall into a
pitfall or trap I feel 'care' happens.  I am not a sociopath nor some
old greybeard UNIX geek that merely enjoys retirement too much to 'care'
anymore.

So let's play a little game based on an early lesson from William Kahan
which will demonstrate how poorly floating point works when used with
nothing but blind trust in bit games.  Also we will assume that we are
going to play by the rules of his IEEE 754 and I may make a passing
reference to these three documents :

     1 ) IEEE 754-2008 - IEEE Standard for Floating-Point Arithmetic
         https://standards.ieee.org/standard/754-2008.html

     2 ) Formal Verification of Floating-Point Hardware Design
         https://www.springer.com/us/book/9783319955124

     3 ) Handbook of Floating-Point Arithmetic
         https://www.springer.com/us/book/9783319765259

     4 ) The Art of Computer Programming
         https://cs.stanford.edu/~knuth/taocp.html

     5 ) Oral history interview with Donald E. Knuth
         Charles Babbage Institute, 2001
         https://conservancy.umn.edu/handle/11299/107413

     6 ) How Futile are Mindless Assessments of Roundoff
         in Floating-Point Computation ?
         Prof William Kahan
         https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf

Also perhaps ISO/IEC/IEEE 60559:2011 and working group ISO/IEC
JTC1/SC22/WG11 publications.

Let's begin with a quick and incomplete definition of "floating point"
data representation thus :

     Given a radix B with precision p we express a 'floting point'
     number in the format

         [  (+/-)m_0 . m_1 m_2 m_3 ... m_(p-1)  ]  *  B^e

     where we call e the exponent which is always an integer and the
     expression m_0 . m_1  m_2 m_3 ... m_(p-1) will be called the not
     so friendy word "significand" which is expressed in radix B.

A complete and formal definition will be found in chapter 3 of reference
(3) above. This is not a new idea and in fact is quite old. One may find
a fairly nice history of "floating point" in computing machines such as
the Babbage difference engine ( see Charles Babbage et. al. ) and other
machines that performed numerical computation in (4) and (5) above.
Suffice it to say that radix 60 mathematics was common in the Babylonian
history and the Yale Babylonian Collection provides a tablet with an
approximation of the positive square root of two with four sexagesimal
digits 1, 24, 51, 10.  Numerical data representation is not new at all
however I personally fell into the problem in while working on long term
trajectory computations with early Apollo systems. We had not yet been
able to establish a formal method to detect and handle numerical error
conditions and the much loved William Kahan provides us this rather
trivial example to illustrate:

     Given   u_0 = 2  and   u_1 = -4   we then compute

         u_[n] = 111 - 1130/(u[n-1]) + 3000/(u[n-1]*u[n-2])

     where n > 1 clearly.


Trivial :


     #include <stdio.h>
     #include <stdlib.h>
     #include <math.h>

     #define LOOPCNT 30

     int main (int argc, char *argv[])
     {
         long double u[LOOPCNT];
         int j;

         u[0] = (long double)2.0L;
         u[1] = (long double)-4.0L;

         printf("u[00] = %+20.18Lf\n", u[0]);
         printf("u[01] = %+20.18Lf\n", u[1]);

         for (j = 2; j < LOOPCNT; j++){
             u[j] =   (long double)111.0L
                    - (long double)1130.0L/u[j-1]
                    + (long double)3000.0L/(u[j-1]*u[j-2]);
             printf("u[%02i] = %+20.18Lf\n", j, u[j]);
         }
         return EXIT_SUCCESS;
     }

$ $CC $CFLAGS -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -o foo foo.c
$ ./foo
u[00] = +2.000000000000000000
u[01] = -4.000000000000000000
u[02] = +18.500000000000000000
u[03] = +9.378378378378378379
u[04] = +7.801152737752161387
u[05] = +7.154414480975249402
u[06] = +6.806784736923633658
u[07] = +6.592632768704448309
u[08] = +6.449465933790438615
u[09] = +6.348452056656695458
u[10] = +6.274438598253189794
u[11] = +6.218695740390240090
u[12] = +6.175837314378373112
u[13] = +6.142359234424345102
u[14] = +6.115885561182135076
u[15] = +6.094780237887624276
u[16] = +6.078391827712722059
u[17] = +6.074956741464125884
u[18] = +6.234084939357549628
u[19] = +8.953054978972347186
u[20] = +38.535952308620171281
u[21] = +90.372020211576035841
u[22] = +99.357562247636820436
u[23] = +99.961042723675680582
u[24] = +99.997653560600116965
u[25] = +99.999858805729773391
u[26] = +99.999991508101661861
u[27] = +99.999999489474548205
u[28] = +99.999999969317897322
u[29] = +99.999999998156545070
$

So there we have it.  The sequence merely settles in at 100 over time.
However one may sit down with pen and paper and easily arrive at the
correct answer of 6 and then everyone should question how a piece of
computing machiney can be so vast horribly wrong with such a trivial
calculation.  Please note that most x86_64 hardware does a terrible mess
with the above and there are far better floating point implementations
in IBM Power and Fujitsu SPARC and other risc architectures.  We are
working on RISC-V with native 128 bit precision today and perhaps we may
have better hardware than x86_64 to the world shortly.  However that is
a side comment.

Let's look at a bit of code from our friendly Vincent Lefèvre as well as
Jean-Michel Muller ( http://perso.ens-lyon.fr/jean-michel.muller/ ) :


     /*
      * $Id: pb-muller1.c 47490 2011-11-10 17:07:08Z vinc17/ypig $
      *
      * Link with -lmpfr -lgmp *in that order*
      *
      * Example by Jean-Michel Muller:
      *   u_0 = 2
      *   u_1 = -4
      *   u_{n+1} = 111 - 1130 / u_n + 3000 / (u_n u_{n-1})
      * --> 6 (not 100)
      *
      * Arguments: precision, n
      */

     #include <stdio.h>
     #include <stdlib.h>
     #include <gmp.h>
     #include <mpfr.h>

     static const mp_rnd_t rnd[3] = { MPFR_RNDD, MPFR_RNDN, MPFR_RNDU };
     static int nep = 0;

     static void check_sign (mpfr_t x[3])
     {
       if (!nep && MPFR_SIGN (x[0]) != MPFR_SIGN (x[2]))
         nep = 1;
     }

     static void mdiv (mpfr_t tmp, unsigned long c, mpfr_t x[3], int r)
     {
       int neg;

       neg = MPFR_SIGN (x[0]) < 0;
       mpfr_set_ui (tmp, c, rnd[neg ? 2-r : r]);
       mpfr_div (tmp, tmp, x[2-r], rnd[r]);
     }

     static void display (mpfr_t x[3], long k)
     {
       int i;

       printf ("%6ld   ", k);
       for (i = 0; i < 3; i++)
         {
           size_t len;
           len = (i | nep) == i ? mpfr_out_str (stdout, 10, 17, x[i],
rnd[i]) : 0;
           if (i < 2)
             while (len++ < 24)
               putchar (' ');
         }
       putchar ('\n');
     }

     int main (int argc, char *argv[])
     {
       int prec, n, i, k;
       mpfr_t u[3], v[3], p[3], tmp;

       if (argc != 3)
         {
           fprintf (stderr, "Usage: pb-muller1 <prec> <n>\n");
           exit (EXIT_FAILURE);
         }

       prec = atoi (argv[1]);
       if (prec < MPFR_PREC_MIN || prec > MPFR_PREC_MAX)
         {
           fprintf (stderr, "pb-muller1: incorrect precision %d\n", prec);
           exit (EXIT_FAILURE);
         }
       mpfr_set_default_prec (prec);

       n = atoi (argv[2]);
       if (n <= 0)
         {
           fprintf (stderr, "pb-muller1: n must be positive\n");
           exit (EXIT_FAILURE);
         }

       for (i = 0; i < 3; i++)
         {
           mpfr_init (u[i]);
           mpfr_init (v[i]);
           mpfr_init (p[i]);
           mpfr_set_si (u[i], 2, rnd[i]);
           mpfr_set_si (v[i], -4, rnd[i]);
         }
       mpfr_init (tmp);

       printf("     N   ");
       printf("MPFR_RNDD");
       printf("               MPFR_RNDN");
       printf("               MPFR_RNDU\n");

       display (u, 0);
       display (v, 1);

       for (k = 2; k <= n; k++)
         {
           int s;

           check_sign (v);
           s = MPFR_SIGN(u[1]) * MPFR_SIGN(v[1]);
           for (i = 0; i < 3; i++)
             if ((i | nep) == i)
               {
                 mpfr_mul (p[s < 0 ? 2-i : i], u[i], v[i], rnd[i]);
                 mpfr_set (u[i], v[i], rnd[i]);
               }
           check_sign (p);
           for (i = 0; i < 3; i++)
             if ((i | nep) == i)
               {
                 mpfr_set_si (v[i], 111, rnd[i]);
                 mdiv (tmp, 1130, u, 2-i);
                 mpfr_sub (v[i], v[i], tmp, rnd[2-i]);
                 mdiv (tmp, 3000, p, i);
                 mpfr_add (v[i], v[i], tmp, rnd[i]);
               }
           display (v, k);
         }

       for (i = 0; i < 3; i++)
         {
           mpfr_clear (u[i]);
           mpfr_clear (v[i]);
           mpfr_clear (p[i]);
         }
       mpfr_clear (tmp);

       return 0;
     }

     /*
     First 11 decimals of u_{100}: 6.0000000160, with prec >= 437,
     and >= 577 to guarantee the result in interval arithmetic.
     */

$

Here we use extended precision floating point libs to try to reproduce
whatever has happened in our terrible code above :

$ $CC $CFLAGS -I/usr/local/include -L/usr/local/lib
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -o pb-muller1 pb-muller1.c
-lmpfr -lgmp
$ ./pb-muller1
Usage: pb-muller1 <prec> <n>
$ ./pb-muller1 113 30
      N   MPFR_RNDD               MPFR_RNDN               MPFR_RNDU
      0   2.0000000000000000      2.0000000000000000      2.0000000000000000
      1   -4.0000000000000000     -4.0000000000000000
-4.0000000000000000
      2   1.8500000000000000e1    1.8500000000000000e1
1.8500000000000000e1
      3   9.3783783783783783      9.3783783783783784      9.3783783783783784
      4   7.8011527377521613      7.8011527377521614      7.8011527377521614
      5   7.1544144809752493      7.1544144809752494      7.1544144809752494
      6   6.8067847369236329      6.8067847369236330      6.8067847369236330
      7   6.5926327687044383      6.5926327687044384      6.5926327687044384
      8   6.4494659337902879      6.4494659337902880      6.4494659337902880
      9   6.3484520566543571      6.3484520566543571      6.3484520566543572
     10   6.2744385982163279      6.2744385982163279      6.2744385982163280
     11   6.2186957398023977      6.2186957398023978      6.2186957398023978
     12   6.1758373049212301      6.1758373049212301      6.1758373049212302
     13   6.1423590812383559      6.1423590812383559      6.1423590812383560
     14   6.1158830665510803      6.1158830665510808      6.1158830665510812
     15   6.0947394393336623      6.0947394393336811      6.0947394393337000
     16   6.0777223048464167      6.0777223048472427      6.0777223048480687
     17   6.0639403224632851      6.0639403224998087      6.0639403225363323
     18   6.0527217593924258      6.0527217610161519      6.0527217626398778
     19   6.0435520376871057      6.0435521101892642      6.0435521826914195
     20   6.0360286321323002      6.0360318810817798      6.0360351300311756
     21   6.0297013055414662      6.0298473250226262      6.0299933444426353
     22   6.0181710657610797      6.0247496523456908      6.0313281169442034
     23   5.7234455935329964      6.0205399837103304      6.3173863825532198
     24   -7.6979694052129188     6.0170582514956451
1.9224751938450067e1
     25                           6.0141748176010937
     26                           6.0117829758099306
     27                           6.0097744237027211
     28                           6.0077081713888591
     29                           5.9993587346059263
     30                           5.8818446518449744
$

I use 113 bits for obvious reasons here.
Please see
https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format.

The above results illustrate the wildly different results we can get
when we use different methods of "rounding" on the ULP which is the unit
of least precision. We seem to circle the correct result with "rounding
to even" but we don't quite get there.  In fact we can not.  Using 113
bits of binary precision in the significand will not be sufficient. My
own experiments show that we need a minimal 172 bits and I reveal the
state of various floating point flags :


     #include <stdio.h>
     #include <stdlib.h>
     #include <stdint.h>
     #include <gmp.h>
     #include <mpfr.h>

     #define PREC 172    /* based on experimental results */
     #define LOOPCNT 32

     int mpfr_check_flags( int status );

     int main (int argc, char *argv[]) {
         mpfr_t u, v, w, t0, t1, t2, c111, c1130, c3000;
         mpfr_flags_t mpfr_flags;

         int mpfr_underflow_flag, mpfr_overflow_flag, mpfr_divby0_flag,
             mpfr_nanflag_flag, mpfr_inexflag_flag, mpfr_erangeflag_flag;
         int i, max = LOOPCNT;
         int inex; /* ret from mpfr calls */

         printf(" GMP vers : %d.%d.%d\n", __GNU_MP_VERSION,
                   __GNU_MP_VERSION_MINOR, __GNU_MP_VERSION_PATCHLEVEL );

         printf("MPFR vers : %-12s\nMPFR header: %s (based on %d.%d.%d)\n",
               mpfr_get_version (), MPFR_VERSION_STRING, MPFR_VERSION_MAJOR,
               MPFR_VERSION_MINOR, MPFR_VERSION_PATCHLEVEL);

         printf("MPFR_VERSION_MAJOR = %d\n", MPFR_VERSION_MAJOR);
         printf("MPFR_VERSION_MINOR = %d\n", MPFR_VERSION_MINOR);
         printf("MPFR_VERSION_PATCHLEVEL = %d\n", MPFR_VERSION_PATCHLEVEL);

         if (mpfr_buildopt_tls_p()!=0)
             printf("          : compiled as thread safe using TLS\n");

         if (mpfr_buildopt_float128_p()!=0)
             printf("          : __float128 support enabled\n");

         if (mpfr_buildopt_decimal_p()!=0)
             printf("          : decimal float support enabled\n");

         if (mpfr_buildopt_gmpinternals_p()!=0)
             printf("          : compiled with GMP internals\n");

         if (mpfr_buildopt_sharedcache_p()!=0)
             printf("          : threads share cache per MPFR const\n");

         printf("MPFR thresholds file used at compile time : %s\n",
                                           mpfr_buildopt_tune_case () );

         printf("\n\n");

         mpfr_set_default_prec ((mpfr_prec_t) PREC);
         mpfr_inits(u, v, w, t0, t1, t2, c111, c1130, c3000, (mpfr_ptr)
NULL);

         inex = mpfr_set_si(u, (long) 2, MPFR_RNDN);
         if ( inex != 0 ){
             fprintf(stderr,"FAIL : mpfr_set_si(u, (long) 2, MPFR_RNDN)\n");
             goto fail_out;
         }

         inex = mpfr_set_si(v, (long) -4, MPFR_RNDN);
         if ( inex != 0 ){
             fprintf(stderr,"FAIL : mpfr_set_si(v, (long) -4,
MPFR_RNDN)\n");
             goto fail_out;
         }

         inex = mpfr_set_si(c111, (long) 111, MPFR_RNDN);
         if ( inex != 0 ){
             fprintf(stderr,"FAIL : mpfr_set_si(c111, (long) 111,
MPFR_RNDN)\n");
             goto fail_out;
         }

         inex = mpfr_set_si(c1130, (long) 1130, MPFR_RNDN);
         if ( inex != 0 ){
             fprintf(stderr,"FAIL : mpfr_set_si(c1130, (long) 1130,
MPFR_RNDN)\n");
             goto fail_out;
         }

         inex = mpfr_set_si(c3000, (long) 3000, MPFR_RNDN);
         if ( inex != 0 ){
             fprintf(stderr,"FAIL : mpfr_set_si(c3000, (long) 3000,
MPFR_RNDN)\n");
             goto fail_out;
         }

         for (i = 3; i <= max; i++){
             mpfr_clear_flags();
             inex = mpfr_div(t0, c1130, v, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_div(t0, c1130, v,
MPFR_RNDN)\n");
                 fprintf(stderr,"     : returned %i\n", inex );
                 mpfr_printf("t0 = %.30Rg\n", t0);
                 goto fail_out;
             }
             mpfr_printf("t0 = 1130 / v = %.30Rg\n", t0);

             mpfr_clear_flags();
             inex = mpfr_sub(w, c111, t0, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_sub(w, c111, t0,
MPFR_RNDN)\n");
                 goto fail_out;
             }
             mpfr_printf("w = 111 - t0 = %.30Rg\n", w);

             mpfr_clear_flags();
             inex = mpfr_mul(t1, v, u, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_mul(t1, v, u, MPFR_RNDN)\n");
                 goto fail_out;
             }
             mpfr_printf("t1 = v * u = %.30Rg\n", t1);

             mpfr_clear_flags();
             inex = mpfr_div(t2, c3000, t1, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_div(t2, c3000, t1,
MPFR_RNDN)\n");
                 goto fail_out;
             }
             mpfr_printf("t2 = 3000 / t1 = %.30Rg\n", t2);

             mpfr_clear_flags();
             inex = mpfr_add(w, w, t2, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_add(w, w, t2, MPFR_RNDN)\n");
                 goto fail_out;
             }
             mpfr_printf("w = w + t2 = %.30Rg\n", w);

             mpfr_clear_flags();
             inex = mpfr_set(u, v, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_set(u, v, MPFR_RNDN)\n");
                 goto fail_out;
             }
             mpfr_printf ("u = v = %.30Rg\n", u);

             mpfr_clear_flags();
             inex = mpfr_set(v, w, MPFR_RNDN);
             inex = mpfr_check_flags(inex);
             if ( inex != 0 ){
                 fprintf(stderr,"FAIL : mpfr_set(v, w, MPFR_RNDN)\n");
                 goto fail_out;
             }
             mpfr_printf ("v = w = %.30Rg\n\n", v);
             printf("u%02i = ", i);
             mpfr_printf ("%.30Rg\n", v);
         }
         mpfr_clears( u, v, w, t0, t1, t2, c111, c1130, c3000,
(mpfr_ptr) NULL);
         return EXIT_SUCCESS;
     fail_out:
         mpfr_clears( u, v, w, t0, t1, t2, c111, c1130, c3000,
(mpfr_ptr) NULL);
         return EXIT_FAILURE;
     }

$ $CC $CFLAGS -I/usr/local/include -L/usr/local/lib
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -o hfpa_page_11
hfpa_page_11.c -lmpfr -lgmp

$ ./hfpa_page_11
  GMP vers : 6.1.2
MPFR vers : 4.0.1-p13
MPFR header: 4.0.1-p13 (based on 4.0.1)
MPFR_VERSION_MAJOR = 4
MPFR_VERSION_MINOR = 0
MPFR_VERSION_PATCHLEVEL = 1
           : compiled as thread safe using TLS
MPFR thresholds file used at compile time : src/amd/k8/mparam.h


t0 = 1130 / v = -282.5
w = 111 - t0 = 393.5
t1 = v * u = -8
t2 = 3000 / t1 = -375
w = w + t2 = 18.5
u = v = -4
v = w = 18.5

u03 = 18.5
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 61.0810810810810810810810810811
w = 111 - t0 = 49.9189189189189189189189189189
t1 = v * u = -74
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = -40.5405405405405405405405405405
w = w + t2 = 9.37837837837837837837837837838
u = v = 18.5
v = w = 9.37837837837837837837837837838

u04 = 9.37837837837837837837837837838
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 120.489913544668587896253602305
w = 111 - t0 = -9.48991354466858789625360230548
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 173.5
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 17.2910662824207492795389048991
w = w + t2 = 7.80115273775216138328530259366
u = v = 9.37837837837837837837837837838
v = w = 7.80115273775216138328530259366

u05 = 7.80115273775216138328530259366
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 144.850387883265607683782785371
w = 111 - t0 = -33.8503878832656076837827853713
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 73.1621621621621621621621621622
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 41.0048023642408570373106760251
w = w + t2 = 7.15441448097524935352789065386
u = v = 7.80115273775216138328530259366
v = w = 7.15441448097524935352789065386

u06 = 7.15441448097524935352789065386
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 157.944441575876490938193834874
w = 111 - t0 = -46.9444415758764909381938348738
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 55.8126801152737752161383285303
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 53.75122631280012392213559147
w = w + t2 = 6.80678473692363298394175659627
u = v = 7.15441448097524935352789065386
v = w = 6.80678473692363298394175659627

u07 = 6.80678473692363298394175659627
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 166.010832378799487206717895424
w = 111 - t0 = -55.0108323787994872067178954235
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 48.6985592907277428888067971925
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 61.6034651475039255994598981999
w = w + t2 = 6.59263276870443839274200277637
u = v = 6.80678473692363298394175659627
v = w = 6.59263276870443839274200277637

u08 = 6.59263276870443839274200277637
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 171.403449827232486505953949374
w = 111 - t0 = -60.4034498272324865059539493745
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 44.874632106159962823359322559
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 66.8529157610227744748224285946
w = w + t2 = 6.44946593379028796886847922015
u = v = 6.59263276870443839274200277637
v = w = 6.44946593379028796886847922015

u09 = 6.44946593379028796886847922015
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 175.208305866019214125873951209
w = 111 - t0 = -64.2083058660192141258739512095
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 42.51896045574882232016203054
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 70.5567579226735712729746427703
w = w + t2 = 6.34845205665435714710069156081
u = v = 6.44946593379028796886847922015
v = w = 6.34845205665435714710069156081

u10 = 6.34845205665435714710069156081
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 177.996146133851648579093411262
w = 111 - t0 = -66.996146133851648579093411262
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 40.9441252716931676575532714216
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 73.270584732067976492922789724
w = w + t2 = 6.27443859821632791382937846207
u = v = 6.34845205665435714710069156081
v = w = 6.27443859821632791382937846207

u11 = 6.27443859821632791382937846207
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 180.095793800776349973236624212
w = 111 - t0 = -69.0957938007763499732366242119
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 39.8329726231979286181076071689
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 75.3144895405787477615577757815
w = w + t2 = 6.21869573980239778832115156959
u = v = 6.27443859821632791382937846207
v = w = 6.21869573980239778832115156959

u12 = 6.21869573980239778832115156959
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 181.71012818130033215181419855
w = 111 - t0 = -70.7101281813003321518141985497
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 39.0188245803796070521231630828
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 76.8859654862215622716775384113
w = w + t2 = 6.17583730492123011986333986151
u = v = 6.21869573980239778832115156959
v = w = 6.17583730492123011986333986151

u13 = 6.17583730492123011986333986151
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 182.971141273355259617962054216
w = 111 - t0 = -71.9711412733552596179620542163
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 38.4056531378263756715326672655
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 78.1135003545936155573081943698
w = w + t2 = 6.14235908123835593934614015355
u = v = 6.17583730492123011986333986151
v = w = 6.14235908123835593934614015355

u14 = 6.14235908123835593934614015355
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 183.968404493242624565635107533
w = 111 - t0 = -72.9684044932426245656351075329
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 37.9342103541335313184967384766
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 79.0842875597937053293793082179
w = w + t2 = 6.11588306655108076374420068497
u = v = 6.14235908123835593934614015355
v = w = 6.11588306655108076374420068497

u15 = 6.11588306655108076374420068497
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 184.764814451764677499945188349
w = 111 - t0 = -73.7648144517646774999451883494
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 37.5659498936219153328075416891
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 79.8595538910983586282652275967
w = w + t2 = 6.09473943933368112832003924736
u = v = 6.11588306655108076374420068497
v = w = 6.09473943933368112832003924736

u16 = 6.09473943933368112832003924736
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 185.405793184087190266717806473
w = 111 - t0 = -74.405793184087190266717806473
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 37.2747137320618884011862075346
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 80.4835154889344330029996346197
w = w + t2 = 6.07772230484724273628182814674
u = v = 6.09473943933368112832003924736
v = w = 6.07772230484724273628182814674

u17 = 6.07772230484724273628182814674
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 185.924914519173870218307712212
w = 111 - t0 = -74.9249145191738702183077122119
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 37.042133832670492411520431721
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 80.9888548416736789735738791443
w = w + t2 = 6.06394032249980875526616693243
u = v = 6.07772230484724273628182814674
v = w = 6.06394032249980875526616693243

u18 = 6.06394032249980875526616693243
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 186.347480335058267383579967778
w = 111 - t0 = -75.347480335058267383579967778
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.8549453533196700991001096141
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 81.4002020960744195769362518193
w = w + t2 = 6.05272176101615219335628404129
u = v = 6.06394032249980875526616693243
v = w = 6.05272176101615219335628404129

u19 = 6.05272176101615219335628404129
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 186.692870516204205980381685952
w = 111 - t0 = -75.6928705162042059803816859523
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.7033435474978963079278362567
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 81.7364226263934748481591633164
w = w + t2 = 6.0435521101892688677774773641
u = v = 6.05272176101615219335628404129
v = w = 6.0435521101892688677774773641

u20 = 6.0435521101892688677774773641
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 186.976132479250061286265756921
w = 111 - t0 = -75.9761324792500612862657569211
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.5799393711776741269191244542
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.0121643603319180662764005427
w = w + t2 = 6.03603188108185678001064362156
u = v = 6.0435521101892688677774773641
v = w = 6.03603188108185678001064362156

u21 = 6.03603188108185678001064362156
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.209084090766363397143239031
w = 111 - t0 = -76.2090840907663633971432390312
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.4790732120819575455522510051
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.2389314157902652538562503843
w = w + t2 = 6.02984732502390185671301135315
u = v = 6.03603188108185678001064362156
v = w = 6.02984732502390185671301135315

u22 = 6.02984732502390185671301135315
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.401096427515395813994063809
w = 111 - t0 = -76.4010964275153958139940638091
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.3963506919004245801170798372
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.4258460798822437127375842389
w = w + t2 = 6.02474965236684789874352042985
u = v = 6.02984732502390185671301135315
v = w = 6.02474965236684789874352042985

u23 = 6.02474965236684789874352042985
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.559660600349561648427131321
w = 111 - t0 = -76.5596606003495616484271313206
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.3283205752629204238431248847
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.5802005844110777108582694273
w = w + t2 = 6.02053998406151606243113810664
u = v = 6.02474965236684789874352042985
v = w = 6.02053998406151606243113810664

u24 = 6.02053998406151606243113810664
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.690805640608133312326116058
w = 111 - t0 = -76.6908056406081333123261160582
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.272246176035326886178724729
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.7078638979371209235032988195
w = w + t2 = 6.0170582573289876111771827613
u = v = 6.02053998406151606243113810664
v = w = 6.0170582573289876111771827613

u25 = 6.0170582573289876111771827613
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.799411551919152411333134492
w = 111 - t0 = -76.7994115519191524113331344923
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.2259398246766766867425191838
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.8135864664699713738641132522
w = w + t2 = 6.01417491455081896253097875987
u = v = 6.0170582573289876111771827613
v = w = 6.01417491455081896253097875987

u26 = 6.01417491455081896253097875987
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.889447190179764969201236571
w = 111 - t0 = -76.8894471901797649692012365712
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.1876408306188637229490105522
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.9012317780510986425852750301
w = w + t2 = 6.01178458787133367338403845895
u = v = 6.01417491455081896253097875987
v = w = 6.01178458787133367338403845895

u27 = 6.01178458787133367338403845895
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 187.964153319757081280027483314
w = 111 - t0 = -76.9641533197570812800274833137
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.1559240600590085878407693164
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 82.9739545590555658478143724685
w = w + t2 = 6.00980123929848456778688915479
u = v = 6.01178458787133367338403845895
v = w = 6.00980123929848456778688915479

u28 = 6.00980123929848456778688915479
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 188.026185060972710385415447864
w = 111 - t0 = -77.0261850609727103854154478637
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.1296304665846704072244722298
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 83.0343394398849393132364128061
w = w + t2 = 6.00815437891222892782096494236
u = v = 6.00980123929848456778688915479
v = w = 6.00815437891222892782096494236

u29 = 6.00815437891222892782096494236
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 188.0777238291579164286779791
w = 111 - t0 = -77.0777238291579164286779790997
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.1078136322833302456565987844
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 83.0845099221891221872109178848
w = w + t2 = 6.00678609303120575853293878513
u = v = 6.00815437891222892782096494236
v = w = 6.00678609303120575853293878513

u30 = 6.00678609303120575853293878513
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 188.120566056276503242641451445
w = 111 - t0 = -77.1205660562765032426414514453
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.0896981680345182060442268243
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 83.1262147450479235105736445903
w = w + t2 = 6.00564868877142026793219314497
u = v = 6.00678609303120575853293878513
v = w = 6.00564868877142026793219314497

u31 = 6.00564868877142026793219314497
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t0 = 1130 / v = 188.156194036578734087817277753
w = 111 - t0 = -77.1561940365787340878172777526
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t1 = v * u = 36.0746470233432633440888930245
INFO : mpfr_inexflag_flag is set.
      : clear flags and carry on.
t2 = 3000 / t1 = 83.1608968497669092896537174097
w = w + t2 = 6.0047028131881752018364396571
u = v = 6.00564868877142026793219314497
v = w = 6.0047028131881752018364396571

u32 = 6.0047028131881752018364396571
$
$ echo $?
0

We get the exact same result on IBM Power hardware and Oracle SPARC
given that the implementation uses cross platform floating point libs
libgmp and libmpfr.

However why are we seeing such horrific results from trivial code ?

Let me quote one of the authors of (3) Vincent Lefèvre :

     ... for any computation system on approximated numbers with a format
     of bounded length, at least one operation will be inexact (because
     if the sequence is computed exactly, you will get an infinity of
     different numbers), and as soon as you get an error, the limit will
     change from 6 to 100.  Thus MPFR cannot help you to find the limit
     (and obviously unums won't help you either). Well, with arbitrary
     precision, you can increase the precision and observe the various
     values, so that you can conjecture that the limit is 6, but this is
     not a proof. Interval arithmetic in various precisions can be useful
     to see what's going on with some additional details.
     On this ( trivial ) example, MPFR can be used too to get bounds,
     without needing an interval arithmetic package ...

However that is not in any book. Let's try to sum up here and merely say
that some serious reading and experiments are needed to get a good
handle on why numerical computation is as much art as it is science. If
we wander into the problem without sufficient study and VERY careful
consideration then we are doomed to repeat the errors of the past. I want
to point out that this has been quite long enough and I did not bother
to drag in output data from the IBM Power or RISC-V systems but all one
needs to know is that the results are poor if rounding errors are not
trapped and handled.  Also, there will ALWAYS be inexact flags set and
computation rounding errors thrown in every language we have. People are
often "mindless" about how modern computing machinary works and have no
clue about roundoff errors. However I didn't say that. Kahan did. See
(6) above.

Dennis Clarke
ye old UNIX/Linux greybeard

ps: https://en.wikipedia.org/wiki/Minimax_approximation_algorithm


lastly just for fun :

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

size_t bin_printf ( uint8_t* f, size_t n );

int
main (int argc, char* argv[])
{

     float foo = 1.100f;
     float epsilon = 1.0E-7f; /* this is _about_ 0.000001 */
     int fpe;
     int j;

     feclearexcept(FE_ALL_EXCEPT);
     if (fesetround(FE_TONEAREST)!=0){
         fprintf(stderr,"ERROR : can not set floating point rounding!\n");
         return EXIT_FAILURE;
     }

     printf("starting value of foo = %8.7e\n", (double)foo );
     bin_printf((uint8_t*)&foo, (size_t)sizeof(foo));
     for ( j=0; fabsf(foo)>epsilon; j++ ){
         /* do a trivial subtraction */
         foo = foo - 0.10f;
         fpe = fetestexcept(FE_ALL_EXCEPT);
         if (fpe!=0){
             printf("INFO : Exception raised was");
             if(fpe & FE_INEXACT) printf(" FE_INEXACT");
             if(fpe & FE_DIVBYZERO) printf(" FE_DIVBYZERO");
             if(fpe & FE_UNDERFLOW) printf(" FE_UNDERFLOW");
             if(fpe & FE_OVERFLOW) printf(" FE_OVERFLOW");
             if(fpe & FE_INVALID) printf(" FE_INVALID");
             printf("\n");
         }
         printf("foo = %8.7e\n", (double)foo );
         bin_printf((uint8_t*)&foo, (size_t)sizeof(foo));
         feclearexcept(FE_ALL_EXCEPT);
     }
     printf("\nfinal value of foo = %8.7e\n", (double)foo );
     bin_printf((uint8_t*)&foo, (size_t)sizeof(foo));
     return EXIT_SUCCESS;
}

$ ./subtraction
starting value of foo = 1.1000000e+00

  00111111 10001100 11001100 11001101
INFO : Exception raised was FE_INEXACT
foo = 1.0000000e+00

  00111111 10000000 00000000 00000000
INFO : Exception raised was FE_INEXACT
foo = 8.9999998e-01

  00111111 01100110 01100110 01100110
INFO : Exception raised was FE_INEXACT
foo = 7.9999995e-01

  00111111 01001100 11001100 11001100
INFO : Exception raised was FE_INEXACT
foo = 6.9999993e-01

  00111111 00110011 00110011 00110010
INFO : Exception raised was FE_INEXACT
foo = 5.9999990e-01

  00111111 00011001 10011001 10011000
INFO : Exception raised was FE_INEXACT
foo = 4.9999991e-01

  00111110 11111111 11111111 11111101
INFO : Exception raised was FE_INEXACT
foo = 3.9999992e-01

  00111110 11001100 11001100 11001010
INFO : Exception raised was FE_INEXACT
foo = 2.9999992e-01

  00111110 10011001 10011001 10010111
INFO : Exception raised was FE_INEXACT
foo = 1.9999993e-01

  00111110 01001100 11001100 11001000
foo = 9.9999927e-02

  00111101 11001100 11001100 11000011
foo = -7.4505806e-08

  10110011 10100000 00000000 00000000

final value of foo = -7.4505806e-08

  10110011 10100000 00000000 00000000
$


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Dennis Clarke

Apologies ... I should have included a link to Jean-Michel Muller's work
on "Elementary Functions" and on preserving monotonicity and always
getting correctly rounded results when implementing the elementary
functions in floating-point arithmetic.

https://link.springer.com/book/10.1007/978-1-4899-7983-4


Also an interesting read in IEEE Transactions on Computers, Vol 66,
Issue 12 : Exponential Sums and Correctly-Rounded Functions.

https://ieeexplore.ieee.org/document/7891945




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Gerry Snyder-4
In reply to this post by Larry Brasfield
On Mon, Dec 17, 2018 at 10:04 AM Larry Brasfield <[hidden email]>
wrote:

> Gerry Snyder wrote:
> < I don't think anyone has pointed out that the "evil" is not floating
> point, it is the binary exponent.
>
> Disregarding the “evil” appellation, the fundamental fact is that, with
> modern floating point hardware (implementing the IEEE-754 standard), only
> that subset of rational numbers having a denominator which is a power of 2
> can be represented.  If that is what you were trying to say, I would point
> out that it is not the representation of the exponent (binary or otherwise)
> that creates the mismatch with (many) rational numbers having a denominator
> which is a power of 10; it is that many such denominators cannot be
> represented at all when the interpretation of the exponent Ne is as 2 ^ Ne.


All I meant was that with a decimal exponent, the units could be dollars,
and additions and subtractions of cents would be exact (assuming the
mantissa has enough bits), with no worries about rounding. Which is the
basis for this whole discussion.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Keith Medcalf


>All I meant was that with a decimal exponent, the units could be
>dollars,
>and additions and subtractions of cents would be exact (assuming the
>mantissa has enough bits), with no worries about rounding. Which is
>the
>basis for this whole discussion.

This is called fixed point.  All that is required is that you keep track of the decimal point yourself...sort of like using a slide rule.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Gerry Snyder-4
On Wed, Dec 19, 2018 at 4:57 PM Keith Medcalf <[hidden email]> wrote:

>
>
> >All I meant was that with a decimal exponent, the units could be
> >dollars,
> >and additions and subtractions of cents would be exact (assuming the
> >mantissa has enough bits), with no worries about rounding. Which is
> >the
> >basis for this whole discussion.
>
> This is called fixed point.  All that is required is that you keep track
> of the decimal point yourself...sort of like using a slide rule.
>

Er, no. Not at all.

It's called floating point with a decimal exponent. Calculations in cents
come out exact*, calculations in mils come out exact*, and the machine
keeps track of the decimal point.

* Assuming that the input numbers are integer cents (exponent -2) or mils
(exponent -3)
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

James K. Lowden
In reply to this post by Dennis Clarke
On Tue, 18 Dec 2018 17:34:29 -0500
Dennis Clarke <[hidden email]> wrote:

> some serious reading and experiments are needed to get a good
> handle on why numerical computation is as much art as it is science.
> If we wander into the problem without sufficient study and VERY
> careful consideration then we are doomed to repeat the errors of the
> past.

I think perhaps you left out "Numerical Methods for Scientists and
Engineers", by Richard Hamming.  :-)  

But when you boil it down, the answer is there is no answer, is there?
The best advice is to understand where things can go wrong, and stay
away from them.  

The truth is that any system for representing numbers is forced to
represent some numbers approximately.  

We think "pen and paper" and the good old decimal system is the gold
standard, but what of ? ?  Even bankers, ever counting pennies,
approximate to compute interest and averages.  Little known fact:
sometimes they compute interest on the basis of a 360-day year.  

--jkl


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Dennis Clarke
On 12/19/18 7:51 PM, James K. Lowden wrote:

> On Tue, 18 Dec 2018 17:34:29 -0500
> Dennis Clarke <[hidden email]> wrote:
>
>> some serious reading and experiments are needed to get a good
>> handle on why numerical computation is as much art as it is science.
>> If we wander into the problem without sufficient study and VERY
>> careful consideration then we are doomed to repeat the errors of the
>> past.
>
> I think perhaps you left out "Numerical Methods for Scientists and
> Engineers", by Richard Hamming.  :-)

I figure that if you are in the industry and have any experience at all
then you know that old gem by heart.

A more interesting topic of discussion would be the speed and complexity
of circuitry designed for another number base such as 5 or even decimal.
All of our collective experience over the past fifty years has been
marching towards ever more effective digital logic for our de facto
standard binary using the usual circuit assumptions and Moore's law has
been profitable to so many corporations. Anyone can manufacture a 15nm
product with $15M thus here we are in 2018 looking at verilog designs
and 9nm multi-layer manufacturing for the next decade.  We are caught
collectively in the economics and thinking of binary. So we should look
away from that for a moment. I do NOT mean the usual undergrad 4 bit
adder logic which we all know entirely too well. Let's go lower down
into the switch design that shall render sensor logic with at least ten
voltage levels and perhaps, in my own ideas, twelve for guard logic at
either end of the range.  The real issue on the table initially would be
transducer slew rates and noise control. Anyway, this is all off topic
at this point.

Dennis Clarke

ps: feel free to see https://www.youtube.com/watch?v=AOC7KmHvx9w
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Igor Tandetnik-2
On 12/20/2018 1:34 PM, Dennis Clarke wrote:
> A more interesting topic of discussion would be the speed and complexity
> of circuitry designed for another number base such as 5 or even decimal.

https://en.wikipedia.org/wiki/Ternary_computer

--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Warren Young
On Dec 20, 2018, at 3:38 PM, Igor Tandetnik <[hidden email]> wrote:
>
> On 12/20/2018 1:34 PM, Dennis Clarke wrote:
>> A more interesting topic of discussion would be the speed and complexity
>> of circuitry designed for another number base such as 5 or even decimal.
>
> https://en.wikipedia.org/wiki/Ternary_computer

Several of the early electronic stored program computers used decimal representation, including the ENIAC:

https://en.wikipedia.org/wiki/Decimal_computer#Early_computers
https://en.wikipedia.org/wiki/ENIAC#Comparison_with_other_early_computers
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Rowan Worth-2
In reply to this post by Frank Millman
On Sat, 15 Dec 2018 at 15:10, Frank Millman <[hidden email]> wrote:

> On Dec 15, 2018, at 08.58, Jay Kreibich wrote:
>
> > > On Dec 15, 2018, at 12:49 AM, Frank Millman <[hidden email]>
> wrote:
> > >
> > > I know that floating point is not precise and not suitable for
> financial uses. Even so, I am curious about the following -
> > >
> [...]
> > >
> > > With the same version of sqlite3 and the same select statement, why
> does python return a different result from sqlite3.exe?
> >
> > Because the shell is altering the output to make it easier to read.
> Consider:
> >
> > sqlite> select 211496.25999999992;
> > 211496.26
>

I just wanted to point out that python does the same thing (as does
basically every floating point display routine):

$ python
Python 2.7.5 (default, Aug  4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
>>> a=211496.25999999992
>>> a
211496.25999999992
>>> print a
211496.26
>>> repr(a)
'211496.25999999992'
>>> str(a)
'211496.26'

It's just that the python interpreter outputs the "representation" by
default.

(Interesting thread; wow at the uₙ = 111 - 1130/uₙ₋₁ + 3000/(uₙ₋₁·uₙ₋₂)
sequence!)
-Rowan
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Pavlos Christoforou-3
In reply to this post by Keith Medcalf
Happy new year all,

Have not followed the full thread, in case it has not been mentioned
already:

http://dec64.com/


Cheers


On Tue, 18 Dec 2018 at 02:42, Keith Medcalf <[hidden email]> wrote:

> >This thread is getting out of hand. Firstly there is no such binary
> >representation ( in this universe ) for a trivial decimal number such
> >as one tenth ( 0.10 ) and really folks should refer to the text book
> >recently published ( 2nd Edition actually ) where all this is covered
> >:
> >     Handbook of Floating-Point Arithmetic
> >     Authors: Muller, J.-M., Brunie, N., de Dinechin, F.,
> >              Jeannerod, C.-P., Joldes, M., Lefèvre, V.,
> >              Melquiond, G., Revol, N., Torres, S.
> >
> >     This handbook is a definitive guide to the effective use of
> >     modern floating-point arithmetic, which has considerably
> >     evolved, from the frequently inconsistent floating-point number
> >     systems of early computing to the recent IEEE 754-2008 standard.
>
>
> https://doc.lagout.org/science/0_Computer%20Science/3_Theory/Handbook%20of%20Floating%20Point%20Arithmetic.pdf
>
> While it is true there is no exact representation of 1/10th in binary
> floating point, at double precision the epsilon is 1.3877787807814457e-17
> which means that for all intents and purposes 1/10th is exact to 16.9
> decimal places.  Which is pretty damn good for a format that is only
> purported to be accurate to 15 decimal digits.
>
> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says
> a lot about anticipated traffic volume.
>
>
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


--
Pavlos Christoforou

Point Nine Limited
Mobile:     +357 99 160960

[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Peter da Silva-2
Why is the exponent in the low bits, since it forces unnecessary shifts for
integer operations?

On Thu., 3 Jan. 2019, 03:34 Pavlos Christoforou <[hidden email] wrote:

> Happy new year all,
>
> Have not followed the full thread, in case it has not been mentioned
> already:
>
> http://dec64.com/
>
>
> Cheers
>
>
> On Tue, 18 Dec 2018 at 02:42, Keith Medcalf <[hidden email]> wrote:
>
> > >This thread is getting out of hand. Firstly there is no such binary
> > >representation ( in this universe ) for a trivial decimal number such
> > >as one tenth ( 0.10 ) and really folks should refer to the text book
> > >recently published ( 2nd Edition actually ) where all this is covered
> > >:
> > >     Handbook of Floating-Point Arithmetic
> > >     Authors: Muller, J.-M., Brunie, N., de Dinechin, F.,
> > >              Jeannerod, C.-P., Joldes, M., Lefèvre, V.,
> > >              Melquiond, G., Revol, N., Torres, S.
> > >
> > >     This handbook is a definitive guide to the effective use of
> > >     modern floating-point arithmetic, which has considerably
> > >     evolved, from the frequently inconsistent floating-point number
> > >     systems of early computing to the recent IEEE 754-2008 standard.
> >
> >
> >
> https://doc.lagout.org/science/0_Computer%20Science/3_Theory/Handbook%20of%20Floating%20Point%20Arithmetic.pdf
> >
> > While it is true there is no exact representation of 1/10th in binary
> > floating point, at double precision the epsilon is 1.3877787807814457e-17
> > which means that for all intents and purposes 1/10th is exact to 16.9
> > decimal places.  Which is pretty damn good for a format that is only
> > purported to be accurate to 15 decimal digits.
> >
> > ---
> > The fact that there's a Highway to Hell but only a Stairway to Heaven
> says
> > a lot about anticipated traffic volume.
> >
> >
> >
> >
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
>
>
> --
> Pavlos Christoforou
>
> Point Nine Limited
> Mobile:     +357 99 160960
>
> [hidden email]
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Gary R. Schmidt
On 03/01/2019 22:22, Peter da Silva wrote:
> Why is the exponent in the low bits, since it forces unnecessary shifts for
> integer operations?
>
That's easy, because the high bits are closer to the barrel shifter, so
it takes less time for the electron to get there![1][2]

        Cheers,
                Gary B-)

1 - Oh ghod, bit-order arguments, worse than the shell wars and the
editor wars combined!!!

2 - Sarcasm intended, for those who may not be certain.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Question about floating point

Peter da Silva-2
That wasn't "endian" argument, this is an arithmetic operation question,
avoiding extra operations in the common case of small (mag < 2^56)
operands. Since I posted that I've figured out some other optimizations
that work better the way they laid it out, and it makes more sense now.



On Thu, Jan 3, 2019 at 5:54 AM Gary R. Schmidt <[hidden email]> wrote:

> On 03/01/2019 22:22, Peter da Silva wrote:
> > Why is the exponent in the low bits, since it forces unnecessary shifts
> for
> > integer operations?
> >
> That's easy, because the high bits are closer to the barrel shifter, so
> it takes less time for the electron to get there![1][2]
>
>         Cheers,
>                 Gary    B-)
>
> 1 - Oh ghod, bit-order arguments, worse than the shell wars and the
> editor wars combined!!!
>
> 2 - Sarcasm intended, for those who may not be certain.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
123