
APPLICATION PROGRAMMING MODEL

The first position in the three digits of an FMA mnemonic refers to the operand position of the first FP data expressed in the arithmetic equation of the FMA operation, the multiplicand.

The second position in the three digits of an FMA mnemonic refers to the operand position of the second FP data expressed in the arithmetic equation of the FMA operation, the multiplier.

The third position in the three digits of an FMA mnemonic refers to the operand position of the FP data being added to or subtracted from the multiplication result.

Note that the non-numerical result of an FMA operation does not follow the mathematically defined commutative property between the multiplicand and the multiplier values (see Table 2-2). Consequently, software tools (such as an assembler) may support a complementary set of FMA mnemonics for each FMA instruction, so that programmers can take advantage of the commutativity of multiplication. For example, an assembler may optionally support the complementary mnemonic “VFMADD312PD” in addition to the true mnemonic “VFMADD132PD”. The assembler will generate the same instruction opcode sequence corresponding to VFMADD132PD. The processor executes VFMADD132PD and reports any NAN conditions based on the definition of VFMADD132PD. Similarly, if the complementary mnemonic VFMADD123PD is supported by an assembler at source level, it must generate the opcode sequence corresponding to VFMADD213PD; the complementary mnemonic VFMADD321PD must produce the opcode sequence defined by VFMADD231PD. In the absence of FMA operations reporting a NAN result, the numerical results of using either mnemonic with an assembler supporting both mnemonics will match the behavior defined in Table 2-2. Support for the complementary FMA mnemonics by software tools is optional.
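The digit convention and the complementary mnemonics described above can be illustrated with a small Python model. This is only a sketch: the names canonical_form and fma_by_form are hypothetical and do not belong to any assembler, and ordinary Python arithmetic rounds twice whereas a hardware FMA rounds once, so the model shows operand selection only.

```python
# The three operand orders that exist as real FMA encodings.
CANONICAL_FORMS = {"132", "213", "231"}


def canonical_form(digits):
    """Return the operand order an assembler would actually encode.

    A complementary form swaps the multiplicand and multiplier digits,
    for example "312" maps to "132". (Hypothetical helper.)
    """
    if digits in CANONICAL_FORMS:
        return digits
    swapped = digits[1] + digits[0] + digits[2]
    if swapped in CANONICAL_FORMS:
        return swapped
    raise ValueError(f"not an FMA operand order: {digits!r}")


def fma_by_form(digits, op1, op2, op3):
    """Evaluate multiplicand * multiplier + addend for a digit form.

    Digit 1 selects the multiplicand, digit 2 the multiplier, and
    digit 3 the value added to the product. Rounding is NOT modeled.
    """
    operands = {"1": op1, "2": op2, "3": op3}
    form = canonical_form(digits)
    return operands[form[0]] * operands[form[1]] + operands[form[2]]
```

With operands 2.0, 3.0, and 4.0, the form "132" multiplies the first and third operands and adds the second, and the complementary form "312" gives the same numerical result because it resolves to the same encoding.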

2.4 ACCESSING YMM REGISTERS

The lower 128 bits of a YMM register are aliased to the corresponding XMM register. Legacy SSE instructions (i.e., SIMD instructions operating on XMM state but not using the VEX prefix, also referred to as non-VEX-encoded SIMD instructions) do not access the upper bits (255:128) of the YMM registers. AVX and FMA instructions with a VEX prefix and a vector length of 128 bits zero the upper 128 bits of the YMM register. See Chapter 2, “Programming Considerations with 128-bit SIMD Instructions” for more details.

Upper bits of YMM registers (255:128) can be read and written by many instructions with a VEX.256 prefix.

XSAVE and XRSTOR may be used to save and restore the upper bits of the YMM registers.
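The aliasing rules in this section can be sketched by treating a YMM register as a 256-bit Python integer. The function names below are hypothetical and exist only for illustration; the model covers the register-write behavior described above and nothing else.

```python
MASK_128 = (1 << 128) - 1   # bits 127:0, the XMM alias
MASK_256 = (1 << 256) - 1   # bits 255:0, the full YMM register


def xmm_view(ymm):
    """Read the XMM register aliased to the low 128 bits of a YMM value."""
    return ymm & MASK_128


def legacy_sse_write(ymm, xmm_value):
    """Non-VEX write: replace bits 127:0, leave bits 255:128 untouched."""
    upper = ymm & MASK_256 & ~MASK_128
    return upper | (xmm_value & MASK_128)


def vex128_write(ymm, xmm_value):
    """VEX.128 write: replace bits 127:0 and zero bits 255:128.

    The previous register value is accepted only for symmetry with
    legacy_sse_write; it does not influence the result.
    """
    return xmm_value & MASK_128
```

Starting from a register whose upper half is nonzero, legacy_sse_write keeps that upper half while vex128_write clears it, which is the difference between the two encodings that the text describes.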


Ref. # 319433-011

2.5 MEMORY ALIGNMENT

Memory alignment requirements on VEX-encoded instructions differ from those on non-VEX-encoded instructions. Memory alignment applies to non-VEX-encoded SIMD instructions in three categories:

Explicitly-aligned SIMD load and store instructions accessing 16 bytes of memory (e.g. MOVAPD, MOVAPS, MOVDQA, etc.). These instructions always require the memory address to be aligned on a 16-byte boundary.

Explicitly-unaligned SIMD load and store instructions accessing 16 bytes or less of data from memory (e.g. MOVUPD, MOVUPS, MOVDQU, MOVQ, MOVD, etc.). These instructions do not require the memory address to be aligned on a 16-byte boundary.

The vast majority of arithmetic and data processing instructions in legacy SSE instructions (non-VEX-encoded SIMD instructions) support memory access semantics. When these instructions access 16 bytes of data from memory, the memory address must be aligned on a 16-byte boundary.

Most arithmetic and data processing instructions encoded using the VEX prefix and performing memory accesses have more flexible memory alignment requirements than instructions that are encoded without the VEX prefix. Specifically,

With the exception of explicitly aligned 16- or 32-byte SIMD load/store instructions, most VEX-encoded arithmetic and data processing instructions operate in a flexible environment regarding memory address alignment, i.e. VEX-encoded instructions with 32-byte or 16-byte load semantics support unaligned load operations by default. Memory arguments for most instructions with a VEX prefix operate normally without causing #GP(0) on any byte-granularity alignment (unlike legacy SSE instructions). The instructions that require explicit memory alignment are listed in Table 2-4.

Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued.

Atomic memory operation in Intel 64 and IA-32 architectures is guaranteed only for a subset of memory operand sizes and alignment scenarios. The list of guaranteed atomic operations is given in Section 7.1.1 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A. AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.

AVX and FMA will generate an #AC(0) fault on misaligned 4 or 8-byte memory references in Ring-3 when CR0.AM=1. 16 and 32-byte memory references will not generate #AC(0) fault. See Table 2-3 for details.

Certain AVX instructions always require 16- or 32-byte alignment (see the complete list of such instructions in Table 2-4). These instructions will #GP(0) if not aligned to 16-byte boundaries (for 16-byte granularity loads and stores) or 32-byte boundaries (for 32-byte loads and stores).
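The alignment rules of this section can be summarized as a small decision function. The sketch below is a simplified model, not a description of processor internals: the names alignment_fault and crosses_cache_line and the string labels are hypothetical, and the 64-byte cache line size is an assumption that the text does not state.

```python
def alignment_fault(encoding, access, size, address, ac_enabled=False):
    """Return "#GP(0)", "#AC(0)", or None for one memory access.

    encoding:   "vex" (AVX, FMA, AVX2) or "legacy" (non-VEX SSE)
    access:     "aligned"   explicitly aligned load/store (Table 2-4)
                "unaligned" explicitly unaligned load/store
                "op"        arithmetic or data processing instruction
                            with a memory operand
    size:       access size in bytes (2, 4, 8, 16, or 32)
    ac_enabled: True when EFLAGS.AC==1 && Ring-3 && CR0.AM==1
    """
    if encoding not in ("vex", "legacy"):
        raise ValueError("unknown encoding")
    if access not in ("aligned", "unaligned", "op"):
        raise ValueError("unknown access type")
    if size not in (2, 4, 8, 16, 32):
        raise ValueError("unsupported access size")
    if encoding == "legacy" and size == 32:
        raise ValueError("legacy SSE has no 32-byte accesses")

    if address % size == 0:
        return None                     # naturally aligned: never faults

    if size in (16, 32):
        if access == "aligned":
            return "#GP(0)"             # explicitly aligned loads/stores
        if access == "op" and encoding == "legacy":
            return "#GP(0)"             # legacy "op XMM, m128"
        return None                     # VEX ops and explicitly unaligned
    # Misaligned 2-, 4-, or 8-byte references fault only under alignment checking.
    return "#AC(0)" if ac_enabled else None


def crosses_cache_line(address, size, line_size=64):
    """True if the access spans two cache lines (line size is assumed)."""
    return address // line_size != (address + size - 1) // line_size
```

For example, a misaligned 32-byte memory operand of a VEX-encoded arithmetic instruction returns None, while the same misalignment on a legacy SSE 16-byte operand returns "#GP(0)". An access that does not fault may still be slower if crosses_cache_line reports that it spans two lines.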

Table 2-3. Alignment Faulting Conditions when Memory Access is Not Aligned

Columns 0 and 1 give the value of the condition EFLAGS.AC==1 && Ring-3 && CR0.AM==1.

Instruction Type   Memory access                                                            0          1
AVX, FMA, AVX2     16- or 32-byte “explicitly unaligned” loads and stores (see Table 2-5)   no fault   no fault
                   VEX op YMM, m256                                                         no fault   no fault
                   VEX op XMM, m128                                                         no fault   no fault
                   “explicitly aligned” loads and stores (see Table 2-4)                    #GP(0)     #GP(0)
                   2, 4, or 8-byte loads and stores                                         no fault   #AC(0)
SSE                16-byte “explicitly unaligned” loads and stores (see Table 2-5)          no fault   no fault
                   op XMM, m128                                                             #GP(0)     #GP(0)
                   “explicitly aligned” loads and stores (see Table 2-4)                    #GP(0)     #GP(0)
                   2, 4, or 8-byte loads and stores                                         no fault   #AC(0)

Table 2-4. Instructions Requiring Explicitly Aligned Memory

Require 16-byte alignment   Require 32-byte alignment
(V)MOVDQA xmm, m128         VMOVDQA ymm, m256
(V)MOVDQA m128, xmm         VMOVDQA m256, ymm
(V)MOVAPS xmm, m128         VMOVAPS ymm, m256
(V)MOVAPS m128, xmm         VMOVAPS m256, ymm
(V)MOVAPD xmm, m128         VMOVAPD ymm, m256
(V)MOVAPD m128, xmm         VMOVAPD m256, ymm
(V)MOVNTPS m128, xmm        VMOVNTPS m256, ymm
(V)MOVNTPD m128, xmm        VMOVNTPD m256, ymm
(V)MOVNTDQ m128, xmm        VMOVNTDQ m256, ymm
(V)MOVNTDQA xmm, m128       VMOVNTDQA ymm, m256
