Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf


FSCALE is often used in the calculation of exponential functions. The following code shows an exponential function without the slow FRNDINT and FSCALE instructions:

; extern "C" long double _cdecl exp (double x);

_exp	PROC	NEAR
PUBLIC	_exp
	FLDL2E
	FLD	QWORD PTR [ESP+4]	; x
	FMUL		; z = x*log2(e)
	FIST	DWORD PTR [ESP+4]	; round(z)
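	SUB	ESP, 12	; reserve stack space for a 10-byte temporary
	MOV	DWORD PTR [ESP], 0	; mantissa bits 0-31 = 0
	MOV	DWORD PTR [ESP+4], 80000000H	; mantissa bits 32-63: only the integer bit set (1.0)
	FISUB	DWORD PTR [ESP+16]	; z - round(z)
	MOV	EAX, [ESP+16]	; EAX = round(z)
	ADD	EAX,3FFFH	; add the exponent bias
	MOV	[ESP+8],EAX	; store biased exponent (MOV leaves the flags from ADD intact)
	JLE	SHORT UNDERFLOW	; biased exponent <= 0
	CMP	EAX,8000H
	JGE	SHORT OVERFLOW	; biased exponent >= 8000H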
	F2XM1
	FLD1
	FADD		; 2^(z-round(z))
	FLD	TBYTE PTR [ESP]	; 2^(round(z))
	ADD	ESP,12
	FMUL		; 2^z = e^x
	RET
UNDERFLOW:
	FSTP	ST
	FLDZ		; return 0
	ADD	ESP,12
	RET
OVERFLOW:
	PUSH	07F800000H	; +infinity
	FSTP	ST
	FLD	DWORD PTR [ESP]	; return infinity
	ADD	ESP,16
	RET
_exp	ENDP
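
The step that replaces FSCALE in the listing above is the construction of 2^(round(z)) by writing the biased exponent straight into a floating-point bit pattern. The following C fragment is only a sketch of that one step, under the assumption of IEEE 754 64-bit doubles (bias 1023) rather than the 80-bit format (bias 3FFFH) used in the assembly; the name pow2_int is hypothetical and not from the source.

```c
#include <stdint.h>
#include <string.h>

/* Returns 2^n as a double by writing the exponent field directly.
   Assumption: IEEE 754 binary64 (bias 1023), not the 80-bit format
   (bias 3FFFH) that the assembly listing builds on the stack. */
double pow2_int(int n)
{
    int64_t biased = (int64_t)n + 1023;   /* add the exponent bias */
    uint64_t bits;
    double result;

    if (biased <= 0)
        return 0.0;                       /* underflow: return 0 */
    if (biased >= 2047)
        bits = 0x7FF0000000000000ULL;     /* overflow: +infinity */
    else
        bits = (uint64_t)biased << 52;    /* zero mantissa field means 1.0 */
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

As in the assembly, the result would then be multiplied by 2^(z-round(z)) to give 2^z. Denormal results are simply flushed to zero here, which mirrors the UNDERFLOW branch of the listing.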

18.14 FPTAN (all processors)

According to the manuals, FPTAN returns two values, X and Y, and leaves it to the programmer to divide Y by X to get the result; in fact it always returns 1 in X, so you can save the division. My tests show that on all 32-bit Intel processors with floating-point unit or coprocessor, FPTAN always returns 1 in X regardless of the argument. If you want to be absolutely sure that your code will run correctly on all processors, you may test whether X is 1, which is faster than dividing by X. The Y value may be very high, but never infinity, so you don't have to test whether Y contains a valid number if you know that the argument is valid.

18.15 FSQRT (P3 and P4)

A fast way of calculating an approximate square root on the P3 and P4 is to multiply the reciprocal square root of x by x:

SQRT(x) = x * RSQRT(x)

The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:

x0 = RSQRTSS(a)

x1 = 0.5 * x0 * (3 - (a * x0) * x0)

where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important. You must apply this formula before multiplying by a to get the square root.
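
The refinement step can be sketched in portable C as follows. This is an illustration only: the first approximation x0 is passed in by the caller as a stand-in for the result of RSQRTSS, and the function names are hypothetical.

```c
/* One Newton-Raphson step for the reciprocal square root of a.
   x0 is a first approximation, e.g. the 12-bit result of RSQRTSS;
   here it is supplied by the caller (an assumption of this sketch). */
float rsqrt_refine(float a, float x0)
{
    /* Evaluate (a * x0) * x0 in this order, as the text requires. */
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}

/* Approximate sqrt(a) = a * rsqrt(a). The refinement is applied
   to the reciprocal square root before the final multiplication. */
float approx_sqrt(float a, float x0)
{
    return a * rsqrt_refine(a, x0);
}
```

For example, approx_sqrt(2.0f, 0.7073f) starts from an estimate of 1/sqrt(2) that is correct to about 12 bits and returns a value close to 1.41421.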

18.16 FLDCW (PPro, P2, P3, P4)

The PPro, P2 and P3 have a serious stall after the FLDCW instruction if followed by any floating-point instruction which reads the control word (which almost all floating-point instructions do).

When C or C++ code is compiled, it often generates a lot of FLDCW instructions because conversion of floating-point numbers to integers is done with truncation while other floating-point instructions use rounding. After translation to assembly, you can improve this code by using rounding instead of truncation where possible, or by moving the FLDCW out of a loop where truncation is needed inside the loop.

On the P4, this stall is even longer, approximately 143 clocks. But the P4 has made a special case out of the situation where the control word is alternating between two different values. This is the typical case in C++ programs where the control word is changed to specify truncation when a floating-point number is converted to integer, and changed back to rounding after this conversion. The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW. The latency is still 143, however, when loading the same value into the control word as it already has, if this is not the same as the value it had one time earlier.

See page 127 on how to convert floating-point numbers to integers without changing the control word. On P3 and P4, use truncation instructions such as CVTTSS2SI instead.
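
One widely known way to convert with rounding, without touching the control word, is to add a large constant so that the rounded integer lands in the low mantissa bits. The C sketch below illustrates that idea; it is not necessarily the method described on the page referenced above. It assumes IEEE 754 64-bit doubles, the default round-to-nearest mode, and |x| < 2^31. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Rounds x to the nearest 32-bit integer using only an addition.
   Assumptions: IEEE 754 binary64, round-to-nearest mode, |x| < 2^31. */
int32_t round_to_int32(double x)
{
    /* 1.5 * 2^52: after the addition the integer part of x occupies
       the low bits of the mantissa. volatile forces the sum to be
       rounded to double precision. */
    volatile double biased = x + 6755399441055744.0;
    double tmp = biased;
    uint64_t bits;

    memcpy(&bits, &tmp, sizeof bits);
    /* Low 32 bits hold the result in two's complement; the narrowing
       cast wraps on common compilers. */
    return (int32_t)(uint32_t)bits;
}
```

Halfway cases follow the current rounding mode, so 2.5 becomes 2 and 3.5 becomes 4 under round-to-nearest-even. This differs from the truncation that a C cast performs.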

18.17 Bit scan (P1 and PMMX)

BSF and BSR are the most poorly optimized instructions on the P1 and PMMX, taking approximately 11 + 2*n clock cycles, where n is the number of zeros skipped.

The following code emulates BSR ECX,EAX:

TEST	EAX,EAX
JZ	SHORT BS1
MOV	DWORD PTR [TEMP],EAX
MOV	DWORD PTR [TEMP+4],0
FILD	QWORD PTR [TEMP]
FSTP	QWORD PTR [TEMP]
WAIT	; WAIT only needed for compatibility with old 80287
MOV	ECX, DWORD PTR [TEMP+4]
SHR	ECX,20	; isolate exponent
SUB	ECX,3FFH	; adjust
TEST	EAX,EAX	; clear zero flag

BS1:
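
The same idea can be sketched in C: convert the integer to a double, which is exact for 32-bit values, and read the position of the highest set bit from the exponent field. This assumes IEEE 754 64-bit doubles; the function name and the return value of -1 for a zero input are choices of this sketch (the assembly version skips the conversion and leaves the zero flag set in that case).

```c
#include <stdint.h>
#include <string.h>

/* Returns the index of the highest set bit of x, or -1 if x is 0.
   Assumption: IEEE 754 binary64 representation of double. */
int bsr_emulated(uint32_t x)
{
    double d;
    uint64_t bits;

    if (x == 0)
        return -1;                    /* no bit set */
    d = (double)x;                    /* exact: 53-bit mantissa */
    memcpy(&bits, &d, sizeof bits);
    /* Bits 52-62 hold the biased exponent; the sign bit is 0. */
    return (int)(bits >> 52) - 0x3FF; /* remove the bias */
}
```

For example, bsr_emulated(0x12345678) returns 28, because bit 28 is the highest bit set in that value.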

The following code emulates BSF ECX,EAX:

TEST	EAX,EAX
JZ	SHORT	BS2
XOR	ECX,ECX
MOV	DWORD	PTR [TEMP+4],ECX
SUB	ECX,EAX
