Floatsum – Nuances Behind IMPRECISE FLOATING POINT PRECISION

CBAChellam | Atlanta, GA | May 25, 2016

Math

Primer:

This blog post explores the idiosyncrasies of floating point arithmetic in high-level languages. The more I read and researched, the more it seemed I had bitten off more than I could chew, but the adventure kept me going. This post also takes a shot at simplifying (perhaps oversimplifying) the theory.

Problem Statement:

To get to the crux of the issue: there is an inherent limitation in how floating point numbers are represented in computer hardware. We will look at the imprecise nature of floating point arithmetic through a code walkthrough, discuss the hardware limitations that cause this odd behavior, and finally cover the pitfalls of rounding error and the resulting imprecision. I am drawing inferences and references from various sources, so please bear with me.


Code:
// PRIMARY LANGUAGE TO ELUCIDATE (picked at random): C#

/* Precision is the main difference between the floating point data types.

Float   – ~7 significant digits (32 bit)
Double  – 15-16 significant digits (64 bit)
Decimal – 28-29 significant digits (128 bit) – a base-10 floating point type;
note that this is not the same as the long double / extended precision types
found in other popular languages. */

float flt = 1f/3;
double dbl = 1d/3;
decimal dcm = 1m/3;

Console.WriteLine("float: {0} double: {1} decimal: {2}", flt, dbl, dcm);
/* Output : float: 0.3333333 double: 0.333333333333333 decimal: 0.3333333333333333333333333333 */

Console.WriteLine(DoubleConverter.ToExactString(0.1d));
/* Output : 0.1000000000000000055511151231257827021181583404541015625*/
Console.WriteLine(DoubleConverter.ToExactString(0.2d));
/* Output : 0.200000000000000011102230246251565404236316680908203125*/
Console.WriteLine(DoubleConverter.ToExactString(0.3d));
/* Output : 0.299999999999999988897769753748434595763683319091796875*/

//A decimal is just a floating-point in base 10 rather than base 2.
//float and double are floating binary point types.
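
// A side sketch (not part of the original post): decimal stores a base-10
// coefficient plus a power-of-ten scale, so 0.1m is held exactly as 1 x 10^-1.
int[] parts = decimal.GetBits(0.1m);
Console.WriteLine(parts[0]);                /* Output : 1 (low 32 bits of the coefficient) */
Console.WriteLine((parts[3] >> 16) & 0xFF); /* Output : 1 (the power-of-ten scale) */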

float i = 0.1f + 0.1f;
Console.WriteLine(DoubleConverter.ToExactString(i));
/* Output : 0.20000000298023223876953125 */

Console.WriteLine(DoubleConverter.ToExactString(0.1));
/* Output : 0.1000000000000000055511151231257827021181583404541015625*/

Console.WriteLine(.1d + .2d == 0.3d); /* Output : false */
Console.WriteLine(.1f + .2f == 0.3f); /* Output : false */
Console.WriteLine(.1m + .2m == 0.3m); /* Output : true*/
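
// A common mitigation (my own sketch, not from the original post): rather than ==,
// compare binary floating point values within a small tolerance.
Console.WriteLine(Math.Abs((.1d + .2d) - .3d) < 1e-10);
/* Output : true */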

double d = (1.0 / 103.0) * 103.0;
Console.WriteLine(DoubleConverter.ToExactString(d));
/* Output : 0.99999999999999988897769753748434595763683319091796875*/

Console.WriteLine(((double)(1 / 110.0)) * 110.0 < 1);
/* Output : true*/

Console.WriteLine(((float)(1 / 110.0F)) * 110.0 < 1);
/* Output : true*/

Console.WriteLine(DoubleConverter.ToExactString(6.9));
/* Output : 6.9000000000000003552713678800500929355621337890625*/


Why the Inexactitude/Fuzziness in Floating Point Data Types?

In a computer, the number of digits is limited by the nature of its memory and CPU registers. Rounding errors come primarily from the fact that the infinity of all real numbers cannot possibly be represented in the finite memory of a computer, let alone in a tiny slice of memory such as a single floating point variable, so many stored numbers are just approximations of the numbers they are meant to represent. [link]
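
To make that finiteness concrete, here is a minimal sketch of my own (reusing the DoubleConverter helper from the code above) that inspects the 64-bit pattern actually stored for 0.1 and its immediate neighbours; the real number 0.1 falls between representable doubles, so the hardware has to pick the closest one.

long bits = BitConverter.DoubleToInt64Bits(0.1);          // raw IEEE 754 encoding of the stored value
double below = BitConverter.Int64BitsToDouble(bits - 1);  // previous representable double
double above = BitConverter.Int64BitsToDouble(bits + 1);  // next representable double

Console.WriteLine(DoubleConverter.ToExactString(below));
Console.WriteLine(DoubleConverter.ToExactString(0.1));
Console.WriteLine(DoubleConverter.ToExactString(above));
// None of the three printed values is exactly 0.1; the stored one is merely the closest.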

“There are a fixed number of bits with which to represent the mantissa. We can illustrate the problem by considering decimal notation. Say we restrict ourselves to 4 figures after the decimal point. Assuming that we have chosen the closest number in this representation, x, to a given number, we can only say that its true value lies somewhere within x ± 5×10^-5. For example, given π to 4 decimal places, 3.1416, we can only state with certainty that it lies between 3.14155 and 3.14165. [link]”

Since there are only a limited number of values which are not an approximation, and any operation between an approximation and another number results in an approximation, rounding errors are almost inevitable.
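
As a quick illustration of that inevitability (a sketch of my own, not from the quoted source): adding the double closest to 0.1 ten times does not land exactly on 1.0, because each addition rounds the running total again.

double sum = 0d;
for (int i = 0; i < 10; i++)
    sum += 0.1;   // each += rounds the running total to the nearest double

Console.WriteLine(sum == 1.0);                          /* Output : false */
Console.WriteLine(DoubleConverter.ToExactString(sum));  /* slightly below 1 */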

Pitfalls

[You’re Going To Have To Think!] – “The dragon of numerical error is not often roused from his slumber, but if incautiously approached he will occasionally inflict catastrophic damage upon the unwary programmer’s calculations.”

Test machine: x86-64 Intel Haswell (Core i7 4770, 2013, 3.40 GHz); compiler: gcc 4.8 on Ubuntu 14.04.

The graph below plots millions of operations per second (normalized to 1 GHz) against the type of operation; it depicts floating point arithmetic speed on Intel x86-64 processors.

[Graph: Intel Core i7 4770 – 2013 – 3.40GHz – Floating Point Results]

“[The larger types] (80 and 128 bits) have terrible performance, probably because both need to be implemented in software, as there is no hardware instruction for these operations:

  • the sum is 30% slower in double precision;
  • the product is twice as fast in double precision;
  • division has the same speed in single and double precision. [link]”
