- •Chapter 1 Intel® Advanced Vector Extensions
- •1.1 About This Document
- •1.2 Overview
- •1.3.2 Instruction Syntax Enhancements
- •1.3.3 VEX Prefix Instruction Encoding Support
- •1.4 Overview AVX2
- •1.5 Functional Overview
- •1.6 General Purpose Instruction Set Enhancements
- •2.1 Detection of PCLMULQDQ and AES Instructions
- •2.2 Detection of AVX and FMA Instructions
- •2.2.1 Detection of FMA
- •2.2.3 Detection of AVX2
- •2.3.1 FMA Instruction Operand Order and Arithmetic Behavior
- •2.4 Accessing YMM Registers
- •2.5 Memory alignment
- •2.7 Instruction Exception Specification
- •2.7.1 Exceptions Type 1 (Aligned memory reference)
- •2.7.2 Exceptions Type 2 (>=16 Byte Memory Reference, Unaligned)
- •2.7.3 Exceptions Type 3 (<16 Byte memory argument)
- •2.7.5 Exceptions Type 5 (<16 Byte mem arg and no FP exceptions)
- •2.7.7 Exceptions Type 7 (No FP exceptions, no memory arg)
- •2.7.8 Exceptions Type 8 (AVX and no memory argument)
- •2.8.1 Clearing Upper YMM State Between AVX and Legacy SSE Instructions
- •2.8.3 Unaligned Memory Access and Buffer Size Management
- •2.9 CPUID Instruction
- •3.1 YMM State, VEX Prefix and Supported Operating Modes
- •3.2 YMM State Management
- •3.2.1 Detection of YMM State Support
- •3.2.2 Enabling of YMM State
- •3.2.4 The Layout of XSAVE Area
- •3.2.5 XSAVE/XRSTOR Interaction with YMM State and MXCSR
- •3.2.6 Processor Extended State Save Optimization and XSAVEOPT
- •3.2.6.1 XSAVEOPT Usage Guidelines
- •3.3 Reset Behavior
- •3.4 Emulation
- •4.1 Instruction Formats
- •4.1.1 VEX and the LOCK prefix
- •4.1.2 VEX and the 66H, F2H, and F3H prefixes
- •4.1.3 VEX and the REX prefix
- •4.1.4 The VEX Prefix
- •4.1.4.1 VEX Byte 0, bits[7:0]
- •4.1.4.2 VEX Byte 1, bit [7] - ‘R’
- •4.1.5 Instruction Operand Encoding and VEX.vvvv, ModR/M
- •4.1.6 The Opcode Byte
- •4.1.7 The MODRM, SIB, and Displacement Bytes
- •4.1.8 The Third Source Operand (Immediate Byte)
- •4.1.9.1 Vector Length Transition and Programming Considerations
- •4.1.10 AVX Instruction Length
- •4.2 Vector SIB (VSIB) Memory Addressing
- •4.3 VEX Encoding Support for GPR Instructions
- •5.1 Interpreting InstructIon Reference Pages
- •5.1.1 Instruction Format
- •5.1.2 Opcode Column in the Instruction Summary Table
- •5.1.3 Instruction Column in the Instruction Summary Table
- •5.1.4 Operand Encoding column in the Instruction Summary Table
- •5.1.5 64/32 bit Mode Support column in the Instruction Summary Table
- •5.1.6 CPUID Support column in the Instruction Summary Table
- •5.2 Summary of Terms
- •5.3 Instruction SET Reference
- •MPSADBW - Multiple Sum of Absolute Differences
- •PALIGNR - Byte Align
- •PBLENDW - Blend Packed Words
- •PHADDW/PHADDD - Packed Horizontal Add
- •PHADDSW - Packed Horizontal Add with Saturation
- •PHSUBW/PHSUBD - Packed Horizontal Subtract
- •PHSUBSW - Packed Horizontal Subtract with Saturation
- •PMOVSX - Packed Move with Sign Extend
- •PMOVZX - Packed Move with Zero Extend
- •PMULDQ - Multiply Packed Doubleword Integers
- •PMULHRSW - Multiply Packed Unsigned Integers with Round and Scale
- •PMULHUW - Multiply Packed Unsigned Integers and Store High Result
- •PMULHW - Multiply Packed Integers and Store High Result
- •PMULLW/PMULLD - Multiply Packed Integers and Store Low Result
- •PMULUDQ - Multiply Packed Unsigned Doubleword Integers
- •POR - Bitwise Logical Or
- •PSADBW - Compute Sum of Absolute Differences
- •PSHUFB - Packed Shuffle Bytes
- •PSHUFD - Shuffle Packed Doublewords
- •PSHUFLW - Shuffle Packed Low Words
- •PSIGNB/PSIGNW/PSIGND - Packed SIGN
- •PSLLDQ - Byte Shift Left
- •PSLLW/PSLLD/PSLLQ - Bit Shift Left
- •PSRAW/PSRAD - Bit Shift Arithmetic Right
- •PSRLDQ - Byte Shift Right
- •PSRLW/PSRLD/PSRLQ - Shift Packed Data Right Logical
- •PSUBB/PSUBW/PSUBD/PSUBQ -Packed Integer Subtract
- •PSUBSB/PSUBSW -Subtract Packed Signed Integers with Signed Saturation
- •PSUBUSB/PSUBUSW -Subtract Packed Unsigned Integers with Unsigned Saturation
- •PXOR - Exclusive Or
- •VPBLENDD - Blend Packed Dwords
- •VPERMD - Full Doublewords Element Permutation
- •VPERMPD - Permute Double-Precision Floating-Point Elements
- •VPERMPS - Permute Single-Precision Floating-Point Elements
- •VPERMQ - Qwords Element Permutation
- •VPSLLVD/VPSLLVQ - Variable Bit Shift Left Logical
- •VPSRAVD - Variable Bit Shift Right Arithmetic
- •VPSRLVD/VPSRLVQ - Variable Bit Shift Right Logical
- •VGATHERDPD/VGATHERQPD - Gather Packed DP FP values Using Signed Dword/Qword Indices
- •VGATHERDPS/VGATHERQPS - Gather Packed SP FP values Using Signed Dword/Qword Indices
- •VPGATHERDD/VPGATHERQD - Gather Packed Dword values Using Signed Dword/Qword Indices
- •VPGATHERDQ/VPGATHERQQ - Gather Packed Qword values Using Signed Dword/Qword Indices
- •6.1 FMA InstructIon SET Reference
- •Chapter 7 Instruction Set Reference - VEX-Encoded GPR Instructions
- •7.1 Instruction Format
- •7.2 INSTRUCTION SET REFERENCE
- •BZHI - Zero High Bits Starting with Specified Bit Position
- •INVPCID - Invalidate Processor Context ID
- •Chapter 8 Post-32nm Processor Instructions
- •8.1 Overview
- •8.2 CPUID Detection of New Instructions
- •8.4 Vector Instruction Exception Specification
- •8.6 Using RDRAND Instruction and Intrinsic
- •8.7 Instruction Reference
- •A.1 AVX Instructions
- •A.2 Promoted Vector Integer Instructions in AVX2
- •B.1 Using Opcode Tables
- •B.2 Key to Abbreviations
- •B.2.1 Codes for Addressing Method
- •B.2.2 Codes for Operand Type
- •B.2.3 Register Codes
- •B.2.4 Opcode Look-up Examples for One, Two, and Three-Byte Opcodes
- •B.2.4.1 One-Byte Opcode Instructions
- •B.2.4.2 Two-Byte Opcode Instructions
- •B.2.4.3 Three-Byte Opcode Instructions
- •B.2.4.4 VEX Prefix Instructions
- •B.2.5 Superscripts Utilized in Opcode Tables
- •B.3 One, Two, and THREE-Byte Opcode Maps
- •B.4.1 Opcode Look-up Examples Using Opcode Extensions
- •B.4.2 Opcode Extension Tables
- •B.5 Escape Opcode Instructions
- •B.5.1 Opcode Look-up Examples for Escape Instruction Opcodes
- •B.5.2 Escape Opcode Instruction Tables
- •B.5.2.1 Escape Opcodes with D8 as First Byte
- •B.5.2.2 Escape Opcodes with D9 as First Byte
- •B.5.2.3 Escape Opcodes with DA as First Byte
- •B.5.2.4 Escape Opcodes with DB as First Byte
- •B.5.2.5 Escape Opcodes with DC as First Byte
- •B.5.2.6 Escape Opcodes with DD as First Byte
- •B.5.2.7 Escape Opcodes with DE as First Byte
- •B.5.2.8 Escape Opcodes with DF As First Byte
INTEL® ADVANCED VECTOR EXTENSIONS
1.5FUNCTIONAL OVERVIEW
Intel AVX and FMA provide comprehensive functional improvements over previous generations of SIMD instruction extensions. The functional improvements include:
•256-bit floating-point arithmetic primitives: AVX enhances existing 128-bit floating-point arithmetic instructions with 256-bit capabilities for floating-point processing. FMA provides additional set of 256-bit floating-point processing capabilities with a rich set of fused-multiply-add and fused multiply-subtract primitives.
•Enhancements for flexible SIMD data movements: AVX provides a number of new data movement primitives to enable efficient SIMD programming in relation to loading non-unit-strided data into SIMD registers, intra-register SIMD data manipulation, conditional expression and branch handling, etc. Enhancements for SIMD data movement primitives cover 256-bit and 128-bit vector floatingpoint data, and 128-bit integer SIMD data processing using VEX-encoded instructions.
Several key categories of functional improvements in AVX and FMA are summarized in the following subsections.
1.5.1256-bit Floating-Point Arithmetic Processing Enhancements
Intel AVX provides 35 256-bit floating-point arithmetic instructions. The arithmetic operations cover add, subtract, multiply, divide, square-root, compare, max, min, round, etc., on single-precision and double-precision floating-point data.
The enhancement in AVX on floating-point compare operation provides 32 conditional predicates to improve programming flexibility in evaluating conditional expressions.
FMA provides 36 256-bit floating-point instructions to perform computation on 256bit vectors. The arithmetic operations cover fused multiply-add, fused multiplysubtract, fused multiply add/subtract interleave, signed-reversed multiply on fused multiply-add and multiply-subtract.
1.5.2256-bit Non-Arithmetic Instruction Enhancements
Intel AVX provides new primitives for handling data movement within 256-bit floating-point vectors and promotes many 128-bit floating data processing instructions to handle 256-bit floating-point vectors.
AVX includes 39 256-bit data processing instructions that are promoted from previous generations of SIMD instruction extensions, ranging from logical, blend, convert, test, unpacking, shuffling, load and stores.
AVX introduces 18 new data processing instructions that operate on 256-bit vectors. These new primitives cover the following operations:
Ref. # 319433-011 |
1-5 |
INTEL® ADVANCED VECTOR EXTENSIONS
•Non-unit-stride fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching primitives:
—broadcast of single or multiple data elements into a 256-bit destination,
—masked move primitives to load or store SIMD data elements conditionally,
•Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data manipulation primitives:
—insert/extract multiple SIMD floating-point data elements to/from 256-bit SIMD registers
—permute primitives to facilitate efficient manipulation of floating-point data elements in 256-bit SIMD registers
•Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
—new variable blend instructions supports four-operand syntax with nondestructive source syntax. This is more flexible than the equivalent SSE4 instruction syntax which uses the XMM0 register as the implied mask for blend selection.
—Packed TEST instructions for floating-point data.
1.5.3Arithmetic Primitives for 128-bit Vector and Scalar processing
Intel AVX provides 131 128-bit numeric processing instructions that employ VEXprefix encoding. These VEX-encoded instructions generally provide the same functionality over instructions operating on XMM register that are encoded using SIMD prefixes. The 128-bit numeric processing instructions in AVX cover floating-point and integer data processing across 128-bit vector and scalar processing.
The enhancement in AVX on 128-bit floating-point compare operation provides 32 conditional predicates to improve programming flexibility in evaluating conditional expressions. This contrasts with floating-point SIMD compare instructions in SSE and SSE2 supporting only 8 conditional predicates.
FMA provides 60 128-bit floating-point instructions to process 128-bit vector and scalar data. The arithmetic operations cover fused multiply-add, fused multiplysubtract, signed-reversed multiply on fused multiply-add and multiply-subtract.
1.5.4Non-Arithmetic Primitives for 128-bit Vector and Scalar Processing
Intel AVX provides 126 data processing instructions that employ VEX-prefix encoding. These VEX-encoded instructions generally provide the same functionality over instructions operating on XMM register that are encoded using SIMD prefixes.
1-6 |
Ref. # 319433-011 |
INTEL® ADVANCED VECTOR EXTENSIONS
The 128-bit data processing instructions in AVX cover floating-point and integer data movement primitives.
Additional enhancements in AVX on 128-bit data processing primitives include 16 new instructions with the following capabilities:
•Non-unit-strided fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching primitives:
—broadcast of single data element into a 128-bit destination,
—masked move primitives to load or store SIMD data elements conditionally,
•Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data manipulation primitives:
—permute primitives to facilitate efficient manipulation of floating-point data elements in 128-bit SIMD registers
•Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
—new variable blend instructions supports four-operand syntax with nondestructive source syntax. Branching conditions dependent on floating-point data or integer data can benefit from Intel AVX. This is more flexible than non-VEX encoded instruction syntax that uses the XMM0 register as implied mask for blend selection. While variable blend with implied XMM0 syntax is supported in SSE4 using SIMD prefix encoding, VEX-encoded 128-bit variable blend instructions only support the more flexible four-operand syntax.
—Packed TEST instructions for floating-point data.
1.5.5AVX2 and 256-bit Vector Integer Processing
AVX2 promotes the vast majority of 128-bit integer SIMD instruction sets to operate with 256-bit wide YMM registers. AVX2 instructions are encoded using the VEX prefix and require the same operating system support as AVX. Generally, most of the promoted 256-bit vector integer instructions follow the 128-bit lane operation, similar to the promoted 256-bit floating-point SIMD instructions in AVX.
Newer functionalities in AVX2 generally fall into the following categories:
•Fetching non-contiguous data elements from memory using vector-index memory addressing. These “gather” instructions introduce a new memoryaddressing form, consisting of a base register and multiple indices specified by a vector register (either XMM or YMM). Data elements sizes of 32 and 64-bits are supported, and data types for floating-point and integer elements are also supported.
•Cross-lane functionalities are provided with several new instructions for broadcast and permute operations. Some of the 256-bit vector integer instructions promoted from legacy SSE instruction sets also exhibit cross-lane behavior, e.g. VPMOVZ/VPMOVS family.
Ref. # 319433-011 |
1-7 |