Rounding

Learning Objectives

Measure the error in rounding numbers using the IEEE-754 floating point standard
Predict the outcome of loss of significance in floating point arithmetic

Rounding Options in IEEE-754

Not all real numbers can be stored exactly as floating point numbers. Consider a real number in the normalized floating point form:

$$x = \pm 1.b_1 b_2 b_3 \ldots b_n \ldots \times 2^m$$

where $n$ is the number of bits in the significand and $m$ is the exponent for a given floating point system. If $x$ does not have an exact representation as a floating point number, it will instead be represented as either $x_-$ or $x_+$, the two nearest floating point numbers.

Without loss of generality, let us assume $x$ is a positive number. In this case, we have:

$$x_- = 1.b_1 b_2 b_3 \ldots b_n \times 2^m$$

and

$$x_+ = 1.b_1 b_2 b_3 \ldots b_n \times 2^m + 0.\underbrace{000000\ldots0001}_{n \text{ bits}} \times 2^m$$

The process of replacing a real number $x$ by a nearby machine number (either $x_-$ or $x_+$) is called rounding, and the error involved is called roundoff error.
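As a small illustration, the two neighbors $x_-$ and $x_+$ of the real number $1/10$ can be inspected in Python. This is only a sketch: it assumes IEEE-754 double precision and Python 3.9 or later (for `math.nextafter`), and the variable names are chosen here for illustration.

```python
import math
from fractions import Fraction

x = Fraction(1, 10)      # the exact real number 1/10
fl_x = float(x)          # the machine number Python stores for it

# The stored value is one of the two neighbors; step one machine
# number in the appropriate direction to find the other one.
if Fraction(fl_x) > x:
    x_plus = fl_x
    x_minus = math.nextafter(fl_x, -math.inf)
else:
    x_minus = fl_x
    x_plus = math.nextafter(fl_x, math.inf)

# Exact roundoff error, computed with rational arithmetic.
roundoff = abs(Fraction(fl_x) - x)

print(x_minus, x_plus)
print(float(roundoff))
```

Working with `Fraction` keeps the comparison exact, so the roundoff error itself is not contaminated by further rounding.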

IEEE-754 does not impose a single rounding rule; there are several different options (round to nearest is the default mode):

round towards zero
round towards infinity
round up
round down
round to the nearest floating point number (round up or down, whichever is closer)
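The options listed here can be sketched on a toy binary system with $n$ fractional bits, using exact rational arithmetic. The function names and mode strings below are hypothetical, "round towards infinity" is interpreted as rounding away from zero, and ties in round-to-nearest are broken towards the even significand (the IEEE-754 default); these are assumptions of the sketch, not definitions taken from the text.

```python
from fractions import Fraction

def neighbors(x, n):
    """Return (lo, hi, k): the magnitudes of the two machine numbers with
    n fractional bits that bracket |x|, and the integer k with lo = k * gap."""
    ax = abs(x)
    # Exponent m with 2**m <= ax < 2**(m + 1), computed exactly.
    m = ax.numerator.bit_length() - ax.denominator.bit_length()
    if Fraction(2) ** m > ax:
        m -= 1
    spacing = Fraction(2) ** (m - n)   # gap between machine numbers
    k = ax // spacing                  # whole gaps that fit below ax
    lo = k * spacing
    hi = lo if lo == ax else lo + spacing
    return lo, hi, k

def round_fl(x, n, mode):
    """Round a nonzero rational x to n fractional bits under the given mode."""
    x = Fraction(x)
    lo, hi, k = neighbors(x, n)
    sign = 1 if x > 0 else -1
    if mode == "toward_zero":
        mag = lo
    elif mode == "toward_infinity":    # assumed: away from zero
        mag = hi
    elif mode == "up":                 # towards +infinity
        mag = hi if sign > 0 else lo
    elif mode == "down":               # towards -infinity
        mag = lo if sign > 0 else hi
    elif mode == "nearest":
        ax = abs(x)
        if ax - lo < hi - ax:
            mag = lo
        elif ax - lo > hi - ax:
            mag = hi
        else:                          # tie: pick the even significand
            mag = lo if k % 2 == 0 else hi
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return sign * mag

x = Fraction(53, 32)                   # (1.10101)_2, not representable with 3 bits
for mode in ["toward_zero", "toward_infinity", "up", "down", "nearest"]:
    print(mode, round_fl(x, 3, mode), round_fl(-x, 3, mode))
```

For a positive $x$ the modes only ever return $x_-$ or $x_+$; the sign handling shows where "round up" and "round towards infinity" differ for negative inputs.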

We will denote the floating point number as $fl(x)$. The rounding rules above can be summarized in the table below:

Roundoff Errors

Note that the gap between two machine numbers is:

$$|x_+ - x_-| = 0.\underbrace{000000\ldots0001}_{n \text{ bits}} \times 2^m = \epsilon_m \times 2^m$$

Hence we can use machine epsilon to bound the error in representing a real number as a machine number.
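The gap formula can be checked numerically for double precision. The sketch below assumes Python 3.9 or later (for `math.ulp`) and positive normalized inputs; the sample values are arbitrary.

```python
import math
import sys

eps = sys.float_info.epsilon     # machine epsilon for doubles, 2**-52

samples = [1.0, 6.5, 1000.0, 0.001]
for x in samples:
    # math.frexp returns (f, e) with x = f * 2**e and 0.5 <= f < 1,
    # so the exponent of the normalized form 1.b1b2... * 2**m is m = e - 1.
    m = math.frexp(x)[1] - 1
    gap = math.ulp(x)            # spacing between x and the next larger double
    print(x, m, gap, gap == eps * 2.0**m)
```

The gap grows with the exponent $m$, while the gap relative to $2^m$ stays fixed at machine epsilon.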

Absolute error:

$$|fl(x) - x| \leq |x_+ - x_-| = \epsilon_m \times 2^m$$

$$|fl(x) - x| \leq \epsilon_m \times 2^m$$

Relative error:

$$\frac{|fl(x) - x|}{|x|} \leq \frac{\epsilon_m \times 2^m}{|x|}$$

Since $x$ is normalized, $|x| \geq 1.0 \times 2^m = 2^m$, so the right-hand side is at most $\epsilon_m$:

$$\frac{|fl(x) - x|}{|x|} \leq \epsilon_m$$
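The relative error bound can be verified exactly for a few rational numbers. This is a sketch assuming IEEE-754 doubles, where `float(x)` on a `Fraction` plays the role of $fl(x)$; the helper name `relative_roundoff` and the sample values are illustrative.

```python
import sys
from fractions import Fraction

eps = Fraction(sys.float_info.epsilon)   # machine epsilon as an exact rational

def relative_roundoff(x):
    """Exact relative error |fl(x) - x| / |x| for a nonzero rational x."""
    fl_x = Fraction(float(x))            # float(x) rounds x to a double
    return abs(fl_x - x) / abs(x)

samples = [Fraction(1, 10), Fraction(1, 3), Fraction(2, 7), Fraction(10**20 + 1)]
errors = [relative_roundoff(x) for x in samples]

for x, err in zip(samples, errors):
    print(x, float(err), err <= eps)
```

The bound holds regardless of the magnitude of $x$, which is why relative error is the natural measure for floating point rounding.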

Floating Point Addition

Adding two floating point numbers is easy. The basic idea is:

Bring both numbers to a common exponent
Do grade-school addition from the front, until you run out of digits in your system
Round the result

For example, in order to add $a = (1.101)_2 \times 2^1$ and $b = (1.001)_2 \times 2^{-1}$ in a floating point system with only 3 bits in the fractional part, this would look like:

$$\begin{aligned} a &= 1.101 \times 2^1 \\ b &= 0.01001 \times 2^1 \\ a + b &= 1.111 \times 2^1 \end{aligned}$$

You’ll notice that we added two numbers with 4 significant digits, and our result also has 4 significant digits. There is no loss of significant digits with floating point addition.
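The three steps can be sketched for a toy system with $n$ fractional bits. In this hypothetical representation (not taken from the text) a number is a pair `(sig, exp)` meaning `sig * 2**(exp - n)`, where `sig` is the integer formed by the bits $1b_1 \ldots b_n$. The sketch aligns to the smaller exponent so that no bits are dropped before the final rounding, and it rounds to nearest with ties to even.

```python
def add_fp(sig_a, exp_a, sig_b, exp_b, n):
    """Add two positive normalized numbers sig * 2**(exp - n) and
    return the rounded, normalized result as (sig, exp)."""
    # Step 1: bring both numbers to a common exponent.
    e_min = min(exp_a, exp_b)
    total = (sig_a << (exp_a - e_min)) + (sig_b << (exp_b - e_min))

    # Step 2: the integer sum above is the exact grade-school addition.
    # Step 3: round back to n + 1 significant bits.
    shift = total.bit_length() - (n + 1)
    if shift > 0:
        q, r = divmod(total, 1 << shift)
        half = 1 << (shift - 1)
        if r > half or (r == half and q % 2 == 1):
            q += 1                      # round up (ties go to even)
        if q == 1 << (n + 1):           # rounding carried into a new bit
            q >>= 1
            shift += 1
    else:
        q = total << -shift
    return q, e_min + shift

# a = (1.101)_2 * 2**1 and b = (1.001)_2 * 2**-1 with 3 fractional bits
sig, exp = add_fp(0b1101, 1, 0b1001, -1, 3)
print(bin(sig), exp)
```

With the inputs from the example, the low-order bits of $b$ fall below the last retained bit of the result and are rounded away.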

Floating Point Subtraction and Catastrophic Cancellation

Floating point subtraction works much the same way that addition does. However, problems occur when you subtract two numbers of similar magnitude.

For example, in order to subtract $b = (1.1010)_2 \times 2^1$ from $a = (1.1011)_2 \times 2^1$, this would look like:

$$\begin{aligned} a &= 1.1011???? \times 2^1 \\ b &= 1.1010???? \times 2^1 \\ a - b &= 0.0001???? \times 2^1 \end{aligned}$$

When we normalize the result, we get $1.???? \times 2^{-3}$. There is no data to indicate what the missing digits should be. Although the floating point number will be stored with 4 digits in the fractional part, it will only be accurate to a single significant digit. This loss of significant digits is known as catastrophic cancellation.
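The same effect shows up in double precision. In the sketch below (the value of `delta` is an arbitrary choice for illustration), the sum `1.0 + delta` keeps only the leading bits of `delta`, so subtracting `1.0` exposes a result with very few accurate digits.

```python
delta = 1e-15

a = 1.0 + delta        # rounding the sum discards most of delta's digits
b = 1.0

computed = a - b       # the subtraction itself is exact; the damage was done above
rel_err = abs(computed - delta) / delta

print(computed)
print(rel_err)
```

Both `a` and `b` are accurate to roughly 16 significant digits, yet the relative error of the difference is many orders of magnitude larger than machine epsilon.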

Review Questions

See this review link

ChangeLog

2020-04-28 Mariana Silva mfsilva@illinois.edu: started from content out of FP page; added new rounding text