Добавил:

Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Национальный Технический Университет Харьковский Политехнический Институт

Предмет:

[НЕСОРТИРОВАННОЕ]

Файл:

Лаб2012 / 253665.pdf

Скачиваний:

Добавлен:

02.02.2015

Размер:

3.31 Mб

Скачать

☆

<<< < Предыдущая 1 2 3 4 5 6 78 / 1448 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 > Следующая >>>

INTEL® 64 AND IA-32 ARCHITECTURES

System Bus		Frequently used paths
		Frequently used paths
		Less frequently used
		paths
Bus Unit
3rd Level	Cache
Optional
2nd Level Cache		1st Level Cache
8-Way		4-way
Front End
Fetch/Decode	Trace Cache	Execution	Retirement
	Trace Cache	Out-Of-Order
	Microcode ROM	Out-Of-Order
	Microcode ROM	Core
		Core
BTBs/Branch Prediction		Branch History Update
BTBs/Branch Prediction
			OM16521

Figure 2-2. The Intel NetBurst Microarchitecture

2.2.2.1The Front End Pipeline

The front end supplies instructions in program order to the out-of-order execution core. It performs a number of functions:

•Prefetches instructions that are likely to be executed

•Fetches instructions that have not already been prefetched

•Decodes instructions into micro-operations

•Generates microcode for complex instructions and special-purpose code

•Delivers decoded instructions from the execution trace cache

•Predicts branches using highly advanced algorithm

The pipeline is designed to address common problems in high-speed, pipelined microprocessors. Two of these problems contribute to major sources of delays:

•time to decode instructions fetched from the target

Vol. 1 2-11

INTEL® 64 AND IA-32 ARCHITECTURES

•wasted decode bandwidth due to branches or branch target in the middle of cache lines

The operation of the pipeline’s trace cache addresses these issues. Instructions are constantly being fetched and decoded by the translation engine (part of the fetch/decode logic) and built into sequences of µops called traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache. The trace cache is searched for the instruction that follows the active branch. If the instruction also appears as the first instruction in a pre-fetched branch, the fetch and decode of instructions from the memory hierarchy ceases and the prefetched branch becomes the new source of instructions (see Figure 2-2).

The trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear addresses using branch target buffers (BTBs) and fetched as soon as possible.

2.2.2.2Out-Of-Order Execution Core

The out-of-order execution core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed, other µops may proceed around it. The processor employs several buffers to smooth the flow of µops.

The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle (this exceeds trace cache and retirement µop bandwidth). Most pipelines can start executing a new µop every cycle, so several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions can start once every two cycles.

2.2.2.3Retirement Unit

The retirement unit receives the results of the executed µops from the out-of-order execution core and processes the results so that the architectural state updates according to the original program order.

When a µop completes and writes its result, it is retired. Up to three µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the BTB. The BTB then purges pre-fetched traces that are no longer needed.

2-12 Vol. 1

INTEL® 64 AND IA-32 ARCHITECTURES

2.2.3Intel® Core™ Microarchitecture

Intel Core microarchitecture introduces the following features that enable high performance and power-efficient performance for single-threaded as well as multithreaded workloads:

•Intel® Wide Dynamic Execution enable each processor core to fetch, dispatch, execute in high bandwidths to support retirement of up to four instructions per cycle.

—Fourteen-stage efficient pipeline

—Three arithmetic logical units

—Four decoders to decode up to five instruction per cycle

—Macro-fusion and micro-fusion to improve front-end throughput

—Peak issue rate of dispatching up to six micro-ops per cycle

—Peak retirement bandwidth of up to 4 micro-ops per cycle

—Advanced branch prediction

—Stack pointer tracker to improve efficiency of executing function/procedure entries and exits

•Intel® Advanced Smart Cache delivers higher bandwidth from the second level cache to the core, and optimal performance and flexibility for singlethreaded and multi-threaded applications.

—Large second level cache up to 4 MB and 16-way associativity

—Optimized for multicore and single-threaded execution environments

—256 bit internal data path to improve bandwidth from L2 to first-level data cache

•Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces cache-miss exposure of out-of-order execution.

—Hardware prefetchers to reduce effective latency of second-level cache misses

—Hardware prefetchers to reduce effective latency of first-level data cache misses

—Memory disambiguation to improve efficiency of speculative execution execution engine

•Intel® Advanced Digital Media Boost improves most 128-bit SIMD instruction with single-cycle throughput and floating-point operations.

—Single-cycle throughput of most 128-bit SIMD instructions

—Up to eight floating-point operation per cycle

—Three issue ports available to dispatching SIMD instructions for execution

Vol. 1 2-13

INTEL® 64 AND IA-32 ARCHITECTURES

Intel Core 2 Extreme, Intel Core 2 Duo processors and Intel Xeon processor 5100 series implement two processor cores based on the Intel Core microarchitecture, the functionality of the subsystems in each core are depicted in Figure 2-3.

	Instruction Fetch and PreDecode
	Instruction Queue
M icro-		Decode
code		Decode
ROM
				Shared L2 Cache
	Rename/Alloc			Up to 10.7 GB/s
				FSB
	Retirement Unit
	(Re-Order Buffer)
		Scheduler
ALU	ALU	ALU
Branch	FAdd	FM ul	Load	Store
M M X/SSE/FP	M M X/SSE	M M X/SSE
M ove
		L1D Cache and DTLB

Figure 2-3. The Intel Core Microarchitecture Pipeline Functionality

2.2.3.1The Front End

The front end of Intel Core microarchitecture provides several enhancements to feed the Intel Wide Dynamic Execution engine:

•Instruction fetch unit prefetches instructions into an instruction queue to maintain steady supply of instruction to the decode units.

•Four-wide decode unit can decode 4 instructions per cycle or 5 instructions per cycle with Macrofusion.

•Macrofusion fuses common sequence of two instructions as one decoded instruction (micro-ops) to increase decoding throughput.

•Microfusion fuses common sequence of two micro-ops as one micro-ops to improve retirement throughput.

2-14 Vol. 1

<<< < Предыдущая 1 2 3 4 5 6 78 / 1448 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 > Следующая >>>

Соседние файлы в папке Лаб2012

#
02.02.20153.31 Mб33253665.pdf
#
02.02.20153.33 Mб6725366517.pdf
#
02.02.20152.52 Mб25253666.pdf
#
02.02.20152.7 Mб2725366617.pdf
#
02.02.20152.09 Mб31253667.pdf
#
02.02.20152.19 Mб2625366717.pdf