- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
FSCALE is often used in the calculation of exponential functions. The following code shows an exponential function without the slow FRNDINT and FSCALE instructions:
; extern "C" long double _cdecl exp (double x);
_exp    PROC    NEAR
PUBLIC  _exp
        FLDL2E
        FLD     QWORD PTR [ESP+4]             ; x
        FMUL                                  ; z = x*log2(e)
        FIST    DWORD PTR [ESP+4]             ; round(z)
        SUB     ESP, 12
        MOV     DWORD PTR [ESP], 0
        MOV     DWORD PTR [ESP+4], 80000000H
        FISUB   DWORD PTR [ESP+16]            ; z - round(z)
        MOV     EAX, [ESP+16]
        ADD     EAX,3FFFH
        MOV     [ESP+8],EAX
        JLE     SHORT UNDERFLOW
        CMP     EAX,8000H
        JGE     SHORT OVERFLOW
        F2XM1
        FLD1
        FADD                                  ; 2^(z-round(z))
        FLD     TBYTE PTR [ESP]               ; 2^(round(z))
        ADD     ESP,12
        FMUL                                  ; 2^z = e^x
        RET
UNDERFLOW:
        FSTP    ST
        FLDZ                                  ; return 0
        ADD     ESP,12
        RET
OVERFLOW:
        PUSH    07F800000H                    ; +infinity
        FSTP    ST
        FLD     DWORD PTR [ESP]               ; return infinity
        ADD     ESP,16
        RET
_exp    ENDP
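The step that replaces FSCALE in the exponential function is the construction of 2^round(z) directly from its bit pattern: the biased exponent is written into the exponent field of a floating-point number and loaded. The following C sketch illustrates only that step. It assumes IEEE 754 double precision (bias 1023, implicit leading mantissa bit) rather than the 80-bit format (bias 3FFFH, explicit leading bit) used by the assembly code, and the name pow2_int is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: return 2^n by writing the biased exponent into the bit
   pattern of a double. Assumes IEEE 754 double (bias 1023).
   Results too small for a normal number are flushed to 0 and
   results too large become +infinity, mirroring the UNDERFLOW
   and OVERFLOW branches of the assembly code. */
static double pow2_int(int n)
{
    long long biased = (long long)n + 1023;   /* biased exponent */
    uint64_t bits;
    double r;

    if (biased <= 0)
        return 0.0;                           /* underflow: return 0 */
    if (biased >= 2047)
        bits = 0x7FF0000000000000ULL;         /* overflow: +infinity */
    else
        bits = (uint64_t)biased << 52;        /* mantissa bits all zero */

    memcpy(&r, &bits, sizeof r);              /* reinterpret as double */
    return r;
}
```

With this helper, e^x would be 2^(z-round(z)) multiplied by pow2_int(round(z)), where the fractional power still has to come from F2XM1 or an equivalent.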
18.14 FPTAN (all processors)
According to the manuals, FPTAN returns two values, X and Y, and leaves it to the programmer to divide Y by X to get the result; in fact it always returns 1 in X, so you can save the division. My tests show that on all 32-bit Intel processors with a floating-point unit or coprocessor, FPTAN always returns 1 in X regardless of the argument. If you want to be absolutely sure that your code will run correctly on all processors, you can test whether X is 1, which is faster than dividing by X. The Y value may be very high, but never infinity, so you don't have to test whether Y contains a valid number if you know that the argument is valid.
18.15 FSQRT (P3 and P4)
A fast way of calculating an approximate square root on the P3 and P4 is to multiply the reciprocal square root of x by x:
SQRT(x) = x * RSQRT(x)
The instruction RSQRTSS or RSQRTPS gives the reciprocal square root with a precision of 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson formula described in Intel's application note AP-803:
x0 = RSQRTSS(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation. The order of evaluation is important. You must apply this formula before multiplying by a to get the square root.
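The refinement step can be sketched in C as follows. To keep the sketch portable, the first approximation x0 is passed in as a parameter; in real code it would come from RSQRTSS (for example through the _mm_rsqrt_ss intrinsic on compilers that provide it). The function names rsqrt_refine and sqrt_from_rsqrt are hypothetical.

```c
/* One Newton-Raphson step for the reciprocal square root of a.
   x0 is a first approximation to 1/sqrt(a), e.g. from RSQRTSS. */
static float rsqrt_refine(float a, float x0)
{
    /* Same evaluation order as the formula: (a * x0) first, then * x0. */
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}

/* Approximate square root: refine the reciprocal square root first,
   then multiply by a. Not valid for a == 0. */
static float sqrt_from_rsqrt(float a, float x0)
{
    return a * rsqrt_refine(a, x0);
}
```

For a = 2 and a starting value of x0 = 0.7070, which is accurate to about 12 bits, a single step gives approximately 0.7071068 for the reciprocal square root and 1.414214 for the square root.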
18.16 FLDCW (PPro, P2, P3, P4)
The PPro, P2 and P3 have a serious stall after the FLDCW instruction if it is followed by any floating-point instruction that reads the control word (which almost all floating-point instructions do).
Compiled C or C++ code often contains many FLDCW instructions because conversion of floating-point numbers to integers is done with truncation while other floating-point instructions use rounding. After translation to assembly, you can improve this code by using rounding instead of truncation where possible, or by moving the FLDCW out of a loop where truncation is needed inside the loop.
On the P4, this stall is even longer, approximately 143 clocks. But the P4 has made a special case out of the situation where the control word is alternating between two different values. This is the typical case in C++ programs where the control word is changed to specify truncation when a floating-point number is converted to integer, and changed back to rounding after this conversion. The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW. The latency is still 143, however, when loading the same value into the control word as it already has, if this is not the same as the value it had one time earlier.
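The rule for the P4 can be written down as a small model. The C sketch below is only an illustration of the latency rule as stated above, not a description of the hardware. The type CwHistory, the function fldcw_latency_p4 and the initial state are assumptions of the sketch; the latencies 3 and 143 are the measured values quoted in the text.

```c
/* Model of the P4 FLDCW rule: the load is fast only when the new
   value equals the value the control word had before the preceding
   FLDCW. */
typedef struct {
    unsigned short current;   /* control word now */
    unsigned short previous;  /* control word before the preceding FLDCW */
} CwHistory;

static int fldcw_latency_p4(CwHistory *h, unsigned short new_cw)
{
    int latency = (new_cw == h->previous) ? 3 : 143;
    h->previous = h->current;  /* the old value becomes the history */
    h->current  = new_cw;
    return latency;
}
```

Starting from a control word of 037FH (rounding), the first switch to 0F7FH (truncation) costs 143 clocks in this model, and each later switch back and forth costs 3. Reloading 0F7FH while it is already the current value costs 143 again, because the value one step earlier was 037FH.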
See page 127 on how to convert floating-point numbers to integers without changing the control word. On P3 and P4, use truncation instructions such as CVTTSS2SI instead.
18.17 Bit scan (P1 and PMMX)
BSF and BSR are the most poorly optimized instructions on the P1 and PMMX, taking approximately 11 + 2*n clock cycles, where n is the number of zeros skipped.
The following code emulates BSR ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS1
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT                            ; WAIT only needed for compatibility with old 80287
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20                  ; isolate exponent
        SUB     ECX,3FFH                ; adjust
        TEST    EAX,EAX                 ; clear zero flag
BS1:
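The BSR emulation works because converting an integer to floating point normalizes it, so the exponent field ends up holding the index of the highest set bit. The same idea can be sketched in C, assuming IEEE 754 double precision. The name bsr_via_double and the return value -1 for a zero input are conventions of this sketch; the assembly code instead leaves ECX unchanged and the zero flag set.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: index of the highest set bit of x, found by converting
   x to double and reading the exponent field.
   Returns -1 when x is 0. */
static int bsr_via_double(uint32_t x)
{
    double d;
    uint64_t bits;

    if (x == 0)
        return -1;                      /* no bit set */

    d = (double)x;                      /* exact: 32 bits fit in the mantissa */
    memcpy(&bits, &d, sizeof bits);     /* reinterpret as integer */
    return (int)(bits >> 52) - 0x3FF;   /* isolate exponent and remove bias */
}
```

The shift by 52 on the 64-bit pattern corresponds to SHR ECX,20 on the upper doubleword in the assembly code.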
The following code emulates BSF ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS2
        XOR     ECX,ECX
        MOV     DWORD PTR [TEMP+4],ECX
        SUB     ECX,EAX
