Not all real numbers can be stored exactly as floating point numbers. Consider a real number in the normalized floating point form:
$$x = \pm 1.b_1 b_2 b_3 \ldots b_n \ldots \times 2^m,$$

where $n$ is the number of bits in the significand and $m$ is the exponent for a given floating point system. If $x$ does not have an exact representation as a floating point number, it will instead be represented by one of the two nearest floating point numbers, $x_-$ or $x_+$.
Without loss of generality, let us assume $x$ is a positive number. In this case, we have:
$$x_- = 1.b_1 b_2 b_3 \ldots b_n \times 2^m$$

and

$$x_+ = 1.b_1 b_2 b_3 \ldots b_n \times 2^m + \underbrace{0.00\ldots01}_{n \text{ bits}} \times 2^m.$$

The process of replacing a real number $x$ by a nearby machine number (either $x_-$ or $x_+$) is called rounding, and the error involved is called roundoff error.
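As a concrete illustration, here is a small Python sketch (Python 3.9+ for `math.nextafter`) that finds the two double precision neighbors $x_-$ and $x_+$ of the real number $1/10$, which has no exact binary representation. Holding the exact value in a `Fraction` is just one way to measure the roundoff error; the choice of $1/10$ is arbitrary.

```python
import math
from fractions import Fraction

x = Fraction(1, 10)   # the real number 1/10, held exactly as a ratio of integers
f = float(x)          # fl(1/10): 1/10 rounded to the nearest double

# f is one of the two neighboring machine numbers; its neighbor on the
# other side of 1/10 is the other.
if Fraction(f) > x:
    x_plus, x_minus = f, math.nextafter(f, -math.inf)
else:
    x_minus, x_plus = f, math.nextafter(f, math.inf)

print(Fraction(x_minus) < x < Fraction(x_plus))  # True: 1/10 lies strictly between
print(float(x - Fraction(x_minus)))              # roundoff error if we round down
print(float(Fraction(x_plus) - x))               # roundoff error if we round up
```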
The IEEE-754 standard does not prescribe a single rounding rule; instead, it defines several rounding modes:

- round towards zero (chopping): simply drop the bits beyond the $n$-th, regardless of sign;
- round down (towards $-\infty$);
- round up (towards $+\infty$);
- round to nearest: take whichever of the two neighboring machine numbers is closer to $x$, breaking ties by choosing the one whose last bit $b_n$ is zero. This is the default mode in IEEE-754.
We will denote the floating point representation of $x$ as $\text{fl}(x)$. For positive $x$, the rounding rules above can be summarized as follows: round towards zero and round down both give $\text{fl}(x) = x_-$; round up gives $\text{fl}(x) = x_+$; and round to nearest gives whichever of $x_-$ and $x_+$ is closer to $x$.
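Binary rounding modes are awkward to poke at directly from pure Python, but the standard `decimal` module exposes the same four modes for base-10 arithmetic, which makes for a quick (if only analogous) demonstration. The value `1.0105` and the 4-digit precision below are arbitrary choices:

```python
from decimal import (Decimal, localcontext, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN)

x = Decimal("1.0105")
modes = {
    "towards zero": ROUND_DOWN,
    "down (-inf)":  ROUND_FLOOR,
    "up (+inf)":    ROUND_CEILING,
    "to nearest":   ROUND_HALF_EVEN,
}
for name, mode in modes.items():
    with localcontext() as ctx:
        ctx.prec = 4                 # keep 4 significant digits
        ctx.rounding = mode
        print(f"{name:>13}: {+x}")   # unary + re-rounds x under the context
```

Note that `1.0105` sits exactly halfway between `1.010` and `1.011`, so round to nearest resolves the tie by choosing the candidate with an even last digit.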
Note that the gap between two adjacent machine numbers at this exponent is

$$|x_+ - x_-| = \underbrace{0.00\ldots01}_{n \text{ bits}} \times 2^m = \epsilon_m \times 2^m,$$

where $\epsilon_m = 2^{-n}$ is machine epsilon. Hence we can use machine epsilon to bound the error in representing a real number as a machine number: for any of the rounding rules, $|\text{fl}(x) - x| \le \epsilon_m \times 2^m$, and since $|x| \ge 2^m$, the relative error satisfies $\frac{|\text{fl}(x) - x|}{|x|} \le \epsilon_m$ (or $\epsilon_m/2$ for round to nearest, which is off by at most half a gap).
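This bound is easy to spot-check in double precision, where $\epsilon_m = 2^{-52}$. A minimal sketch, using exact rational arithmetic to measure the representation error of a few random rationals (the specific values are arbitrary):

```python
import random
import sys
from fractions import Fraction

eps = sys.float_info.epsilon        # machine epsilon for doubles, 2**-52

random.seed(0)
for _ in range(5):
    x = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    fx = Fraction(float(x))         # exact value of the rounded double fl(x)
    rel_err = abs(fx - x) / x
    print(rel_err <= Fraction(eps) / 2)   # True: round to nearest stays within eps/2
```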
Adding two floating point numbers is straightforward. The basic idea is:

1. Bring both numbers onto a common exponent (rewrite the one with the smaller exponent in terms of the larger).
2. Add the significands digit by digit, as in grade-school arithmetic.
3. Normalize and round the result to $n$ significant bits.
For example, in order to add $a = (1.101)_2 \times 2^1$ and $b = (1.001)_2 \times 2^{-1}$ in a system with 4 significant digits, we first rewrite $b$ with the same exponent as $a$: $b = (0.01001)_2 \times 2^1$. Adding the significands gives

$$a + b = (1.101)_2 \times 2^1 + (0.01001)_2 \times 2^1 = (1.11101)_2 \times 2^1,$$

which rounds (to nearest) to $\text{fl}(a + b) = (1.111)_2 \times 2^1$.
You’ll notice that we added two numbers with 4 significant digits, and our result also has 4 significant digits. There is no loss of significant digits with floating point addition.
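The same align-add-round behavior is what IEEE double precision arithmetic performs: the standard requires addition to return the exact sum rounded to the nearest machine number. A short sketch confirming this with exact rational arithmetic:

```python
from fractions import Fraction

a, b = 0.1, 0.2
exact = Fraction(a) + Fraction(b)   # exact sum of the two machine numbers
computed = a + b                    # hardware add: align, add, round to nearest

print(computed == float(exact))     # True: the result is fl(exact sum)
print(computed == 0.3)              # False: 0.3 is a different machine number
```

The low-order bits of the exact sum are rounded away, but the result still carries a full significand.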
Floating point subtraction works much the same way that addition does. However, problems occur when you subtract two numbers of similar magnitude.
For example, in order to subtract $b = (1.1010)_2 \times 2^1$ from $a = (1.1011)_2 \times 2^1$, the exponents already match, so we subtract the significands directly:

$$a - b = (1.1011)_2 \times 2^1 - (1.1010)_2 \times 2^1 = (0.0001)_2 \times 2^1.$$

When we normalize the result, we get $(1.????)_2 \times 2^{-3}$: the machine pads the significand with zeros, giving $(1.000)_2 \times 2^{-3}$, but only the leading digit is actually significant. We started with five significant digits in each operand and ended with only one. This loss of significance is known as catastrophic cancellation.
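Catastrophic cancellation shows up in practice whenever a formula subtracts nearly equal quantities. As a sketch: computing $1 - \cos x$ for small $x$ directly loses every significant digit, while the algebraically identical form $2\sin^2(x/2)$ does not (the value of $x$ is an arbitrary choice):

```python
import math

x = 1e-8
naive = 1.0 - math.cos(x)          # cos(x) rounds to exactly 1.0: total cancellation
stable = 2.0 * math.sin(x / 2)**2  # same quantity, rewritten to avoid subtraction

print(naive)    # 0.0
print(stable)   # ~5.0e-17, close to the true value x**2 / 2
```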