

codes abort when exceptions occur. We can install trap handlers that abort the program when exceptions occur.

IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped around for overflow is that the result is computed as if to infinite precision, then divided by 2^α, and then rounded to the relevant precision. For underflow, the result is multiplied by 2^α. The exponent α is 192 for single precision and 1536 for double precision. For details of exceptions, flags, and trap handlers, see [4].

Example 11 The computation of the product ∏_{i=1}^{n} x_i can potentially overflow or underflow. One solution is to compute exp(∑_{i=1}^{n} log x_i), but this solution is less accurate and less efficient. Another solution is to use a trap handler. A global counter is initialized to 0. If p_k = ∏_{i=1}^{k} x_i overflows, the counter is increased by one and the result is divided by 2^α. If p_k underflows, the counter is decreased by one and the result is multiplied by 2^α. Thus the result is wrapped around back into range. When all multiplications are done, if the counter is zero, the final result is p_n; if the counter is positive, p_n overflows; if the counter is negative, p_n underflows.
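The counter technique can be sketched in C. Portable C has no trap-handler interface, so the sketch below (the function name wrapped_product and the flag-based detection are our own illustration, not part of the notes) uses the IEEE exception flags instead: a multiplication that overflows or underflows is redone with both factors scaled by 2^(α/2), and the counter records how many times the partial product was wrapped.

#include <fenv.h>
#include <math.h>

/* Sketch of Example 11 with exception flags in place of a trap handler.
   alpha = 1536 is the IEEE 754 double-precision wrap-around exponent;
   splitting the scaling across the two factors keeps the intermediate
   scalbn results in range. */
double wrapped_product(const double *x, int n, int *counter) {
    const int alpha = 1536;
    double p = 1.0;
    *counter = 0;
    for (int i = 0; i < n; i++) {
        feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
        double q = p * x[i];
        if (fetestexcept(FE_OVERFLOW)) {
            (*counter)++;                                  /* wrapped down by 2^alpha */
            q = scalbn(p, -alpha / 2) * scalbn(x[i], -alpha / 2);
        } else if (fetestexcept(FE_UNDERFLOW)) {
            (*counter)--;                                  /* wrapped up by 2^alpha */
            q = scalbn(p, alpha / 2) * scalbn(x[i], alpha / 2);
        }
        p = q;
    }
    /* counter == 0: p is the product; counter > 0: the product overflows;
       counter < 0: it underflows. */
    return p;
}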

Why does IEEE 754 specify a flag for each of these kinds of exception? Without flags, detecting rare creations of ∞ and NaN before they disappear requires programmed tests and branches that, besides duplicating tests already performed by the hardware, slow down the program and impel a programmer to make decisions prematurely in many cases. With flags, fewer tests and branches are necessary because they can be postponed to propitious points in the program. They almost never have to appear in innermost loops.

Using flags is the only way to distinguish 1/0, a genuine infinity, from an infinity created by overflow.

We show two examples of using flags [4].

Example 12 To compute x^n where n is an integer, we write the following function PositivePower(x, n) for positive n.

PositivePower(x, n) {
    while (n is even) {
        x = x * x;
        n = n / 2;
    }
    u = x;
    while (true) {
        n = n / 2;
        if (n == 0) return u;
        x = x * x;
        if (n is odd) u = u * x;
    }
}

When n < 0, computing PositivePower(1/x, −n) is not accurate. Instead, 1/PositivePower(x, −n) should be used. The problem is that when x^−n underflows, either the underflow trap handler is called or the underflow flag is set, and either is incorrect, because when x^−n underflows, x^n may overflow or may be in range. The solution is as follows. We first disable the underflow and overflow traps and save the underflow and overflow flag status. Then we compute 1/PositivePower(x, −n). If neither the underflow nor the overflow flag is set, we restore those flags and re-enable the traps; otherwise, we restore those flags and compute PositivePower(1/x, −n), which causes the correct exceptions to occur.
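A sketch of this strategy using the C99 fenv.h flag interface follows; traps are assumed disabled (the C default), and the names power_pos and power_int are our own illustration rather than anything from the notes.

#include <fenv.h>

/* PositivePower from Example 12, for n >= 1. */
static double power_pos(double x, int n) {
    while (n % 2 == 0) { x = x * x; n = n / 2; }
    double u = x;
    for (;;) {
        n = n / 2;
        if (n == 0) return u;
        x = x * x;
        if (n % 2 != 0) u = u * x;
    }
}

/* Sketch of the flag-based strategy for x^n with n < 0. */
double power_int(double x, int n) {
    if (n == 0) return 1.0;
    if (n > 0)  return power_pos(x, n);
    fexcept_t saved;
    fegetexceptflag(&saved, FE_OVERFLOW | FE_UNDERFLOW);   /* save flag status   */
    feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
    double r = 1.0 / power_pos(x, -n);
    int bad = fetestexcept(FE_OVERFLOW | FE_UNDERFLOW);
    fesetexceptflag(&saved, FE_OVERFLOW | FE_UNDERFLOW);   /* restore flag status */
    if (!bad) return r;
    return power_pos(1.0 / x, -n);   /* raises the correct exceptions */
}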

Example 13 We compute

arccos x = 2 arctan √( (1 − x)/(1 + x) ).

 

When x = −1, if arctan(∞) returns π/2, then we get the correct result arccos(−1) = π. The problem, however, is that when x = −1, (1 − x)/(1 + x) causes a divide-by-zero and raises the divide-by-zero flag. The solution is simple: we save the divide-by-zero flag before the computation and restore it after the computation.
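A minimal C sketch of this save/restore pattern (the function name is ours; atan(∞) is assumed to return π/2, as IEEE-conforming libraries do):

#include <fenv.h>
#include <math.h>

/* Sketch of Example 13: hide the spurious divide-by-zero raised at x = -1
   by saving and restoring the flag around the computation. */
double arccos_via_arctan(double x) {
    fexcept_t saved;
    fegetexceptflag(&saved, FE_DIVBYZERO);    /* save the flag    */
    double r = 2.0 * atan(sqrt((1.0 - x) / (1.0 + x)));
    fesetexceptflag(&saved, FE_DIVBYZERO);    /* restore the flag */
    return r;
}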

4 Error Measurements

Suppose xˆ is an approximation of x, for example, xˆ is the computed value and x is the exact answer. How do we measure the error in xˆ?

The absolute error is defined as Eabs(ˆx) = |x − xˆ|. Obviously, the size of absolute error depends on the size of x. Thus the relative error defined as

Erel(xˆ) = |x − xˆ|/|x| is independent of the size of x. From this definition, we can see that if xˆ = x(1 + ρ) then |ρ| = Erel(xˆ) and |ρ| · |x| = Eabs(xˆ). Relative error can be used to determine the number of correct significant digits. For example, if xˆ = 1.0049 is an approximation of x = 1.0000, then Erel = 4.9 × 10^−3, which indicates that xˆ agrees with x to three but not four digits.

Unit roundoff, usually denoted by u, is the most useful quantity associated with a floating-point number system and is ubiquitous in the world of rounding error analysis. The unit roundoff is given by

u = (1/2) β^(1−t),

recalling that t is the machine precision in terms of the number of digits. Suppose f l(x + y) is the floating-point addition of x and y, then the IEEE standard requires that f l(x + y) is the same as x + y rounded to the nearest floating-point number. In other words,

f l(x + y) = (x + y)(1 + δ) |δ| ≤ u.

Another error measurement useful in measuring the error in a computed result is the unit in the last place (ulp). The name itself explains its meaning. For example, if xˆ = d_0.d_1 · · · d_{t−1} × β^e is a computed result, then one ulp of xˆ is β^(e−t+1).
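For IEEE double precision (β = 2, t = 53) both quantities are easy to examine; a minimal sketch, assuming only that DBL_EPSILON = β^(1−t) as defined in float.h and that nextafter gives the adjacent floating-point number:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    double u = DBL_EPSILON / 2.0;              /* unit roundoff, about 1.1e-16 */
    double x = 1.0049;
    double ulp = nextafter(x, INFINITY) - x;   /* one ulp of x, here 2^(-52) */
    printf("u = %.3e  ulp(%g) = %.3e\n", u, x, ulp);
    return 0;
}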

5 Sources of Errors

Due to finite precision arithmetic, a computed result must be rounded to fit the storage format. Consequently, rounding errors are unavoidable. The IEEE standard requires that for an arithmetic operation op = +, −, ×, /, we have

f l(x op y) = (x op y)(1 + δ) |δ| ≤ u.

When an infinite series is approximated by a finite sum, truncation error is introduced. For example, if we use

 

 

1 + x + x^2/2! + x^3/3! + · · · + x^n/n!

to approximate

e^x = 1 + x + x^2/2! + x^3/3! + · · · + x^n/n! + · · · ,

then the truncation error is

 

 

 

 

 

 

 

 

 

 

 

 

x^(n+1)/(n+1)! + x^(n+2)/(n+2)! + · · · .
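A small experiment makes the truncation error visible. The sketch below (our own illustration; the name taylor_exp is ours) compares the degree-n partial sum at x = 1 with the library exp, whose own error is negligible by comparison:

#include <math.h>
#include <stdio.h>

/* Degree-n Taylor polynomial of e^x, summed term by term. */
static double taylor_exp(double x, int n) {
    double term = 1.0, sum = 1.0;
    for (int i = 1; i <= n; i++) {
        term *= x / i;      /* term is now x^i / i! */
        sum += term;
    }
    return sum;
}

int main(void) {
    double x = 1.0;
    for (int n = 2; n <= 10; n += 2)
        printf("n = %2d  |e^x - partial sum| = %.3e\n",
               n, fabs(exp(x) - taylor_exp(x, n)));
    return 0;
}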


 

 

 

 

 

 

h         yh(1)                 error
10^−1     2.85884195487388      1.40560126415e−1
10^−2     2.73191865578708      1.36368273280e−2
10^−3     2.71964142253278      1.35959407373e−3
10^−4     2.71841774707848      1.35918619434e−4
10^−5     2.71829541991231      1.35914532646e−5
10^−6     2.71828318698653      1.35852748429e−6
10^−7     2.71828196396484      1.35505794585e−7
10^−8     2.71828177744737      −5.10116753283e−8
10^−9     2.71828159981169      −2.28647355716e−7
10^−10    2.71827893527643      −2.89318261570e−6
10^−11    2.71827005349223      −1.17749668154e−5
10^−12    2.71827005349223      −1.17749668154e−5
10^−13    2.71338507218388      −4.89675627517e−3
10^−14    2.66453525910038      −5.37465693587e−2
10^−15    2.66453525910038      −5.37465693587e−2

 

Table 6: Values of yh(1) and errors using various sizes of h

When a continuous problem is approximated by a discrete one, discretization error is introduced. For example, from the expansion

f(x + h) = f(x) + h f′(x) + (h^2/2!) f″(ξ), for some ξ ∈ [x, x + h],

we can use the following approximation:

yh(x) = (f(x + h) − f(x))/h ≈ f′(x).

The discretization error is Edis = (h/2) |f″(ξ)|.

Note that both truncation error and discretization error have nothing to do with computation. If the arithmetic is perfect (no rounding errors), the discretization error decreases as h decreases. In practice, however, rounding errors are unavoidable. Consider the above example and let f(x) = e^x. We computed yh(1) on a SUN Sparc V in MATLAB 5.20. Table 6 shows that as h decreases the error first decreases and then increases. This is because of the combination of discretization error and rounding errors. In this example, the discretization error is

Edis = (h/2) |f″(ξ)| ≤ (h/2) e^(1+h) ≈ (h/2) e for small h.
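The experiment is easy to repeat. A C sketch follows; the original table was produced in MATLAB on different hardware, so the digits will not agree with Table 6 exactly, but the same pattern of decreasing and then increasing error appears:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 1.0, exact = exp(1.0);          /* f'(1) = e */
    for (int k = 1; k <= 15; k++) {
        double h = pow(10.0, -k);
        double yh = (exp(x + h) - exp(x)) / h; /* forward difference yh(1) */
        printf("h = 1e-%02d  yh(1) = %.14f  error = % .11e\n",
               k, yh, yh - exact);
    }
    return 0;
}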


Now, we consider the rounding errors. Let the computed yh(x) be

 

 

yˆh(x) = fl( (e^(x+h) − e^x)/h )                                                (1)
       = ( (e^((x+h)(1+δ0)) (1 + δ1) − e^x (1 + δ2)) (1 + δ3) / h ) (1 + δ4)    (2)
       ≈ ( e^(x+h) (1 + δ0 + δ1 + δ3 + δ4) − e^x (1 + δ2 + δ3 + δ4) ) / h,      (3)

 

 

 

for |δi| ≤ u (i = 0, 1, 2, 3, 4). In the above derivation, we assume that the δi are small, so that we may ignore terms like δiδj and of higher order. We also assume that e^x is computed accurately, i.e., fl(e^x) = e^x(1 + δ) where |δ| ≤ u. Thus

we have the rounding error

Eround = |yh(x) − yˆh(x)| ≈ |ξ1 e^(x+h) − ξ2 e^x| / h,   for |ξ1| ≤ 4u and |ξ2| ≤ 3u.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

When x = 1, we have

Eround ≈ (7u/h) e.

 

So the rounding error increases as h decreases. Combining both errors, we get the total error:

Etotal = Edis + Eround ≈ (h/2 + 7u/h) e.

Figure 3 plots Etotal.

To minimize the total error, we differentiate Etotal with respect to h, set the derivative to zero (1/2 − 7u/h^2 = 0), and get the optimal h:

hopt = √(14u) ≈ √u.

6 Forward and Backward Errors

Suppose a program takes an input x and computes y. We can view the output y as a function of the input x, y = f(x). Denote the computed result by yˆ; then the absolute error |y − yˆ| and the relative error |y − yˆ|/|y| are called forward errors. Alternatively, we can ask: "For what set of data have we solved our problem?" That is, the computed result yˆ is the exact result for the input x + Δx, i.e., yˆ = f(x + Δx). In general, there may be many such Δx, so we are interested in the minimal such Δx and a bound for |Δx|. This bound, possibly divided by |x|, is called the backward error.


 

Figure 3: Total Error (the total error, on a scale of 10^−5, plotted against h for 10^−10 ≤ h ≤ 10^−5)

For example, the IEEE standard requires that

fl(√x) = √x (1 + δ),  |δ| ≤ u.

Then the relative error or the forward error is |δ|, which is bounded by u.

What is the backward error? Let

fl(√x) = √x (1 + δ) = √(x + Δx),

then Δx = 2xδ + xδ^2. Thus, ignoring δ^2, we have the backward error:

|Δx|/|x| ≈ 2|δ| ≤ 2u.
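A quick numerical check in C: the single-precision square root is measured against the double-precision one as a reference, and the backward error is obtained from Δx = (fl(√x))^2 − x. The choice of test value and the variable names are ours.

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 2.0;
    double exact  = sqrt(x);                    /* reference (double)        */
    double approx = (double)sqrtf((float)x);    /* fl(sqrt(x)) in single     */
    double fwd = fabs(approx - exact) / exact;  /* forward (relative) error  */
    double bwd = fabs(approx * approx - x) / x; /* backward error |dx|/|x|   */
    printf("forward error  = %.2e\n", fwd);
    printf("backward error = %.2e\n", bwd);     /* about twice the forward error */
    return 0;
}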

The process of bounding the backward error is called backward error analysis. The motivation is to interpret rounding errors as perturbations in the data. Consequently, it reduces the problem of estimating the forward error to perturbation theory. We will see its significance in the following sections.

To illustrate forward and backward errors, let us consider the computation of xˆ − yˆ, where xˆ and yˆ may be previously computed results. Assume that x and y are the exact results and xˆ = x(1 + δx) and yˆ = y(1 + δy); then

f l(ˆx − yˆ) = (ˆx − yˆ)(1 + δ) |δ| ≤ u.

It then follows that f l(ˆx − yˆ) = x(1 + δx)(1 + δ) − y(1 + δy )(1 + δ). Ignoring the second order terms δxδ and δy δ and letting δ1 = δx + δ and δ2 = δy + δ, we get

f l(ˆx − yˆ) = x(1 + δ1) − y(1 + δ2).


If |δx| and |δy| are small, then |δ1|, |δ2| ≤ u + max(|δx|, |δy|) are also small, i.e., the backward errors are small. However, the forward error (relative error)

Erel = |fl(xˆ − yˆ) − (x − y)| / |x − y| = |xδ1 − yδ2| / |x − y|.

If δ1 ≠ δ2, i.e., δx ≠ δy, it is possible that Erel is large when |x − y| is small, i.e., when x and y are close to each other. This is called catastrophic cancellation.

If δx = δy, in particular, if both x and y are original data (δx = δy = 0), then Erel = |δ|. This is called benign cancellation. The following example illustrates the difference between the two cancellations.

Example 14 Consider the computation of x^2 − y^2 in our small floating-point number system. Suppose x = 1.11 and y = 1.10 and assume round-to-nearest (ties to even); then fl(x × x) = 1.10 × 2^1 (error 2^−4) and fl(y × y) = 1.00 × 2^1 (error 2^−2). Thus fl(x × x − y × y) = 1.00. The exact result is 1.101 × 2^−1, so the error is 0.0011 and Erel = 0.00111. However, fl((x − y) × (x + y)) = fl(1.00 × 2^−2 × 1.10 × 2^1) = 1.10 × 2^−1. The error in fl(x − y) is 0 and the error in fl(x + y) is 2^−2. Now the total error is 0.0001 and Erel = 0.000101.
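The same contrast appears in IEEE double precision. In the sketch below (our own illustration), fma is used only to build an accurate reference value for x^2 − y^2; x*x − y*y then suffers catastrophic cancellation, while (x − y)*(x + y) subtracts exact data first and stays accurate.

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 1.0 + 1e-8, y = 1.0;
    /* Split the squares into exact sums hi + lo with fma, giving an
       accurate reference for x*x - y*y. */
    double hix = x * x, lox = fma(x, x, -hix);
    double hiy = y * y, loy = fma(y, y, -hiy);
    double ref = (hix - hiy) + (lox - loy);
    double bad  = x * x - y * y;        /* catastrophic cancellation */
    double good = (x - y) * (x + y);    /* benign cancellation       */
    printf("x*x - y*y   : relative error %.2e\n", fabs((bad  - ref) / ref));
    printf("(x-y)*(x+y) : relative error %.2e\n", fabs((good - ref) / ref));
    return 0;
}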

7 Instability of Certain Algorithms

A method for computing y = f(x) is called backward stable if, for any x, it produces a computed yˆ with a small backward error, that is, yˆ = f(x + Δx) for some small Δx. Usually there exist many such Δx. We are interested in the smallest. If Δx turns out to be large, then the algorithm is unstable.

Example 15 Suppose β = 10 and t = 3. Consider the following system: Ax = b, where

A = [ .001   1.00 ]          b = [  1.00 ]
    [ 1.00   .200 ]   and        [ −3.00 ] .

 

 

 

 

 

 

Applying Gaussian elimination (without pivoting), we get the computed decomposition

LˆUˆ = [ 1.00    0   ] [ .001    1.00  ]
       [ 1000   1.00 ] [   0    −1000  ] .

The computed solution

xˆ = [   0  ]
     [ 1.00 ]

 

 

 

 


is the exact solution of the perturbed system

(A + ΔA)xˆ = b.

Solving for ΔA, we get

ΔA = [ ×    0   ]
     [ ×  −3.2  ] ,

where × denotes an arbitrary entry. The smallest ΔA is

ΔA = [ 0    0   ]
     [ 0  −3.2  ] ,

which is of the same size as A. This means that Gaussian elimination (without pivoting) is unstable. Note that the exact solution is

x = [ −3.20···   ]
    [  1.0032··· ] .

 

8 Sensitivity of Certain Problems

Let us start with the problem of solving a system of linear equations:

Ax = b

where A and b are known data and x is the result. The question is: How sensitive is x to changes in A and/or b? We can assume that the change is only in b, since a change in A can be transformed into a change in b. Let x˘ be the solution of the perturbed system:

Ax˘ = b + Δb.

The change in x (relative error) is ‖x˘ − x‖/‖x‖ and the change in b is ‖Δb‖/‖b‖. We use the ratio of the two errors as the measurement of the sensitivity, called the condition number:

cond = (‖x˘ − x‖/‖x‖) / (‖Δb‖/‖b‖) = (‖A^−1 Δb‖/‖x‖) · (‖Ax‖/‖Δb‖) ≤ ‖A^−1‖ ‖A‖.

So ‖A^−1‖ ‖A‖ is the condition number of the problem of solving a linear system. In general, we can view a problem with data x and result y as a function y = f(x). The result of the perturbed problem is y˘ = f(x + Δx).

The sensitivity is measured by

 

 

 

 

 

 

 

cond = (|y˘ − y|/|y|) / (|Δx|/|x|) = (|f(x + Δx) − f(x)| / |Δx|) · (|x| / |f(x)|) ≈ |f′(x)| |x| / |f(x)|.   (4)
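For a concrete scalar example (our own illustration), take f(x) = log x: since f′(x) = 1/x, formula (4) gives cond ≈ 1/|log x|, which blows up as x approaches 1, so evaluating log x near 1 is an ill-conditioned problem regardless of the algorithm used.

#include <math.h>
#include <stdio.h>

int main(void) {
    double xs[] = { 2.0, 1.1, 1.001, 1.000001 };
    for (int i = 0; i < 4; i++) {
        double x = xs[i];
        /* condition number estimate |f'(x)| |x| / |f(x)| for f = log */
        double cond = fabs(1.0 / x) * fabs(x) / fabs(log(x));
        printf("x = %-10g  cond = %.2e\n", x, cond);
    }
    return 0;
}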

 

Note that the conditioning of a problem is independent of rounding errors and of the algorithm used to solve the problem. The following example is due to Wilkinson (see [8]).

Example 16 Let p(x) = (x − 1)(x − 2) · · · (x − 19)(x − 20) = x^20 − 210x^19 + · · · .

The zeros of p(x) are 1, 2, . . . , 19, 20 and are well separated. With a floating-point number system of β = 2, t = 30, when we enter a typical coefficient into the computer it is necessary to round it to 30 significant base-2 digits. Suppose we make a change in the 30th significant base-2 digit of only one of the twenty coefficients: the coefficient of x^19 is changed from −210 to −210 + 2^−23. Let us see how much effect this small change has on the zeros of the polynomial. Here we list (using β = 2, t = 90) the roots of the equation p(x) + 2^−23 x^19 = 0, correctly rounded to the number of digits shown:

1.00000 0000      10.09526 6145 ± 0.64350 0904i
2.00000 0000      11.79363 3881 ± 1.65232 9728i
3.00000 0000      13.99235 8137 ± 2.51883 0070i
4.00000 0000      16.73073 7466 ± 2.81262 4894i
4.99999 9928      19.50243 9400 ± 1.94033 0347i
6.00000 6944      20.84690 8101
6.99969 7234
8.00726 7603
8.91725 0249

Note that the small change in the coefficient −210 has caused ten of the zeros to become complex, and that two have moved more than 2.81 units off the real axis. This means that the zeros of p(x) are very sensitive to changes in the coefficients. The roots above were computed with very high accuracy, so they are not distorted by rounding errors, nor by any ill effect of the algorithm used to solve the problem; the trouble is the sensitivity of the problem itself.

As discussed before, backward error analysis transforms rounding errors into perturbations of the data. Thus we can establish a relation between forward and backward errors and the conditioning of the problem. Clearly, (4) shows that

Eforward ≤ cond · Ebackward.

This inequality tells us that large forward errors can be caused by either ill-conditioning of the problem or an unstable algorithm, or both. The significance of backward error analysis is that it allows us to determine whether an algorithm is stable (has small backward errors). If we can prove that the algorithm is stable, then we know that large forward errors are due to the ill-conditioning of the problem. On the other hand, if we know the problem is well-conditioned, then large forward errors must be caused by an unstable algorithm.

9 Machine Parameters

As shown in the previous sections, the behavior of numerical software depends on a set of machine parameters such as the base β, the precision t, the minimum exponent emin, and the maximum exponent emax. A program paranoia, originally written by Kahan, investigates a computer's floating-point arithmetic. There are Basic, C, Modula, Pascal, and Fortran versions available [2]. The following is a list of parameters tested by paranoia.

Radix: The base of the computer number system, such as 2, 10, 16.

Precision: The number of significant radix digits.

Closest relative separation: U1 = Radix^(−Precision) = one ulp of numbers a little less than 1.0.

Adequacy of guard digits for multiplication, division, subtraction and addition: In IEEE 754 there is an extra hardware bit, called a guard digit, kept on the right of the fraction during intermediate calculations to help with accurate rounding; see [3] for details.

Is rounding on multiply, divide, and add/subtract correct?

Is the sticky bit used correctly for rounding? The sticky bit allows the computer to see the difference between 0.50···00 and 0.50···01 (in base ten) when rounding; see [3] for details.

Seeking the underflow threshold Ufhold: it is related to 2^emin. Below this value a calculation may suffer a larger relative error than mere roundoff. Also, seeking the smallest strictly positive number E0.
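As an illustration of the kind of probing paranoia performs to discover the radix, here is a Malcolm-style sketch in C (this is our own sketch, not paranoia's actual code; volatile is used so the compiler does not keep intermediate values in extended-precision registers):

#include <stdio.h>

int main(void) {
    volatile double a = 1.0, b = 1.0, t;
    /* Grow a by doubling until adding 1 no longer registers. */
    do { a += a; t = a + 1.0; } while (t - a == 1.0);
    /* Grow b by doubling until a + b differs from a; the difference is the radix. */
    do { b += b; t = a + b;   } while (t - a == 0.0);
    printf("radix = %g\n", t - a);
    return 0;
}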