Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

FSCALE is often used in the calculation of exponential functions. The following code shows an exponential function without the slow FRNDINT and FSCALE instructions:

; extern "C" long double _cdecl exp (double x);

_exp    PROC    NEAR
PUBLIC  _exp
        FLDL2E
        FLD     QWORD PTR [ESP+4]             ; x
        FMUL                                  ; z = x*log2(e)
        FIST    DWORD PTR [ESP+4]             ; round(z)
        SUB     ESP, 12
        MOV     DWORD PTR [ESP], 0
        MOV     DWORD PTR [ESP+4], 80000000H
        FISUB   DWORD PTR [ESP+16]            ; z - round(z)
        MOV     EAX, [ESP+16]
        ADD     EAX,3FFFH
        MOV     [ESP+8],EAX
        JLE     SHORT UNDERFLOW
        CMP     EAX,8000H
        JGE     SHORT OVERFLOW
        F2XM1
        FLD1
        FADD                                  ; 2^(z-round(z))
        FLD     TBYTE PTR [ESP]               ; 2^(round(z))
        ADD     ESP,12
        FMUL                                  ; 2^z = e^x
        RET
UNDERFLOW:
        FSTP    ST
        FLDZ                                  ; return 0
        ADD     ESP,12
        RET
OVERFLOW:
        PUSH    07F800000H                    ; +infinity
        FSTP    ST
        FLD     DWORD PTR [ESP]               ; return infinity
        ADD     ESP,16
        RET
_exp    ENDP
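The key trick in the listing above is that 2^(round(z)) is never computed: the biased exponent is written straight into the bit pattern of a floating-point number. The following C sketch illustrates that one step only. It assumes a 64-bit IEEE 754 double (bias 1023, implicit leading 1), whereas the assembly builds an 80-bit value (bias 3FFFH, explicit leading 1). The name pow2_int is hypothetical and not part of the original code.

```c
#include <stdint.h>
#include <string.h>

/* Build 2^n by storing the biased exponent directly in the bit pattern,
   instead of scaling with FSCALE.
   Assumption: 64-bit IEEE 754 double, bias 1023, implicit leading 1. */
static double pow2_int(int n)
{
    int64_t biased = (int64_t)n + 1023;    /* biased exponent */
    uint64_t bits;
    double result;

    if (biased <= 0) {
        return 0.0;                        /* underflow: return 0, denormals ignored */
    }
    if (biased >= 0x7FF) {
        bits = 0x7FF0000000000000ULL;      /* overflow: +infinity */
    } else {
        bits = (uint64_t)biased << 52;     /* exponent field set, mantissa zero */
    }
    memcpy(&result, &bits, sizeof result); /* reinterpret the bits as a double */
    return result;
}
```

Multiplying 2^(z - round(z)) by pow2_int(round(z)) then gives 2^z, which mirrors the final FMUL in the assembly version.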

 

 

18.14 FPTAN (all processors)

According to the manuals, FPTAN returns two values, X and Y, and leaves it to the programmer to divide Y by X to get the result; but in fact it always returns 1 in X, so you can save the division. My tests show that on all 32-bit Intel processors with a floating-point unit or coprocessor, FPTAN always returns 1 in X regardless of the argument. If you want to be absolutely sure that your code will run correctly on all processors, you may test whether X is 1, which is faster than dividing by X. The Y value may be very large, but never infinity, so you don't have to test whether Y contains a valid number if you know that the argument is valid.

18.15 FSQRT (P3 and P4)

A fast way of calculating an approximate square root on the P3 and P4 is to multiply the reciprocal square root of x by x:

SQRT(x) = x * RSQRT(x)

The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:

x0 = RSQRTSS(a)

x1 = 0.5 * x0 * (3 - (a * x0) * x0)

where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important. You must apply this formula before multiplying by a to get the square root.
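As an illustration, the refinement step can be sketched in portable C. This is only a sketch: the initial approximation x0 is passed in as a parameter rather than obtained from RSQRTSS, and the function names rsqrt_refine and sqrt_from_rsqrt are hypothetical.

```c
/* One Newton-Raphson step for the reciprocal square root of a.
   x0 is an initial approximation of 1/sqrt(a), for example the
   12-bit result of RSQRTSS (here supplied by the caller). */
static float rsqrt_refine(float a, float x0)
{
    /* Same evaluation order as the formula: (a * x0) first, then * x0. */
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}

/* Approximate sqrt(a): refine the reciprocal square root first,
   then multiply by a. */
static float sqrt_from_rsqrt(float a, float x0)
{
    return a * rsqrt_refine(a, x0);
}
```

For example, with a = 4 and a rough first guess x0 = 0.49, one step moves the estimate to about 0.4997, much closer to the exact value 0.5.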

18.16 FLDCW (PPro, P2, P3, P4)

The PPro, P2 and P3 have a serious stall after the FLDCW instruction if followed by any floating-point instruction which reads the control word (which almost all floating-point instructions do).

When C or C++ code is compiled, the compiler often generates a lot of FLDCW instructions because conversion of floating-point numbers to integers is done with truncation while other floating-point instructions use rounding. After translation to assembly, you can improve this code by using rounding instead of truncation where possible, or by moving the FLDCW out of a loop where truncation is needed inside the loop.

On the P4, this stall is even longer, approximately 143 clocks. But the P4 has made a special case out of the situation where the control word is alternating between two different values. This is the typical case in C++ programs where the control word is changed to specify truncation when a floating-point number is converted to integer, and changed back to rounding after this conversion. The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW. The latency is still 143, however, when loading the same value into the control word as it already has, if this is not the same as the value it had one time earlier.

See page 127 on how to convert floating-point numbers to integers without changing the control word. On P3 and P4, use truncation instructions such as CVTTSS2SI instead.

18.17 Bit scan (P1 and PMMX)

BSF and BSR are the most poorly optimized instructions on the P1 and PMMX, taking approximately 11 + 2*n clock cycles, where n is the number of zeros skipped.

The following code emulates BSR ECX,EAX:

        TEST    EAX,EAX
        JZ      SHORT BS1
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT                            ; WAIT only needed for compatibility with old 80287
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20                  ; isolate exponent
        SUB     ECX,3FFH                ; adjust
        TEST    EAX,EAX                 ; clear zero flag
BS1:
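The same idea can be sketched in C: convert the integer to a double and read the position of the highest set bit from the exponent field. This sketch assumes a 64-bit IEEE 754 double. The function name bsr_via_double and the return value -1 for a zero input are hypothetical conventions; the assembly version instead leaves ECX unchanged and the zero flag set.

```c
#include <stdint.h>
#include <string.h>

/* Return the index of the highest set bit of x, using the exponent
   of the double-precision representation of x.
   Assumption: 64-bit IEEE 754 double with exponent bias 1023 (3FFH). */
static int bsr_via_double(uint32_t x)
{
    double d;
    uint64_t bits;

    if (x == 0) {
        return -1;                   /* hypothetical convention for "no bit set" */
    }
    d = (double)x;                   /* exact: any 32-bit integer fits in a double */
    memcpy(&bits, &d, sizeof bits);  /* reinterpret the double as raw bits */
    /* The sign bit is 0, so the top 12 bits are just the biased exponent. */
    return (int)(bits >> 52) - 1023;
}
```

Shifting the full 64-bit pattern right by 52 corresponds to shifting the upper doubleword right by 20 in the assembly code.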

The following code emulates BSF ECX,EAX:

        TEST    EAX,EAX
        JZ      SHORT BS2
        XOR     ECX,ECX
        MOV     DWORD PTR [TEMP+4],ECX
        SUB     ECX,EAX