Добавил:

Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Национальный Технический Университет Харьковский Политехнический Институт

Предмет:

[НЕСОРТИРОВАННОЕ]

Файл:

Лаб2012 / 25366517.pdf

Скачиваний:

Добавлен:

02.02.2015

Размер:

3.33 Mб

Скачать

☆

<<< < Предыдущая 83 84 85 86 87 88 89 90 91 92 93 9495 / 13995 96 97 98 99 100 101 102 103 104 105 106 107 > Следующая >>>

PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

The PMULHUW (multiply packed unsigned word integers and store high result) instruction performs an SIMD unsigned multiply of the words in the two source operands and returns the high word of each result to an MMX register.

The PSADBW (compute sum of absolute differences) instruction computes the SIMD absolute differences of the corresponding unsigned byte integers in two source operands, sums the differences, and stores the sum in the low word of the destination operand.

The PSHUFW (shuffle packed word integers) instruction shuffles the words in the source operand according to the order specified by an 8-bit immediate operand and returns the result to the destination operand.

10.4.5MXCSR State Management Instructions

The MXCSR state management instructions (LDMXCSR and STMXCSR) load and save the state of the MXCSR register, respectively. The LDMXCSR instruction loads the MXCSR register from memory, while the STMXCSR instruction stores the contents of the register to memory.

10.4.6Cacheability Control, Prefetch, and Memory Ordering Instructions

SSE extensions introduce several new instructions to give programs more control over the caching of data. They also introduces the PREFETCHh instructions, which provide the ability to prefetch data to a specified cache level, and the SFENCE instruction, which enforces program ordering on stores. These instructions are described in the following sections.

10.4.6.1Cacheability Control Instructions

The following three instructions enable data from the MMX and XMM registers to be stored to memory using a non-temporal hint. The non-temporal hint directs the processor to when possible store the data to memory without writing the data into the cache hierarchy. See Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data” for information about non-temporal stores and hints.

The MOVNTQ (store quadword using non-temporal hint) instruction stores packed integer data from an MMX register to memory, using a non-temporal hint.

The MOVNTPS (store packed single-precision floating-point values using non-temporal hint) instruction stores packed floating-point data from an XMM register to memory, using a nontemporal hint.

The MASKMOVQ (store selected bytes of quadword) instruction stores selected byte integers from an MMX register to memory, using a byte mask to selectively write the individual bytes. This instruction also uses a non-temporal hint.

Vol. 1 10-17

PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

10.4.6.2Caching of Temporal vs. Non-Temporal Data

Data referenced by a program can be temporal (data will be used again) or non-temporal (data will be referenced once and not reused in the immediate future). For example, program code is generally temporal, whereas, multimedia data, such as the display list in a 3-D graphics application, is often non-temporal. To make efficient use of the processor’s caches, it is generally desirable to cache temporal data and not cache non-temporal data. Overloading the processor’s caches with non-temporal data is sometimes referred to as “polluting the caches.” The SSE and SSE2 cacheability control instructions enable a program to write non-temporal data to memory in a manner that minimizes pollution of caches.

These SSE and SSE2 non-temporal store instructions minimize cache pollutions by treating the memory being accessed as the write combining (WC) type. If a program specifies a nontemporal store with one of these instructions and the destination region is mapped as cacheable memory (write back [WB], write through [WT] or WC memory type), the processor will do the following:

•If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.

•The non-temporal data is written to memory with WC semantics.

See also: Chapter 10, Memory Cache Control in the IA-32 Intel Architecture Software Developer’s Manual, Volume 3.

Using the WC semantics, the store transaction will be weakly ordered, meaning that the data may not be written to memory in program order, and the store will not write allocate (that is, the processor will not fetch the corresponding cache line into the cache hierarchy, prior to performing the store). Also, different processor implementations may choose to collapse and combine these stores.

The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in uncacheable memory. Uncacheable as referred to here means that the region being written to has been mapped with either an uncacheable (UC) or write protected (WP) memory type.

In general, WC semantics require software to ensure coherence, with respect to other processors and other system agents (such as graphics cards). Appropriate use of synchronization and fencing must be performed for producer-consumer usage models. Fencing ensures that all system agents have global visibility of the stored data; for instance, failure to fence may result in a written cache line staying within a processor and not being visible to other agents.

For processors that implement non-temporal stores by updating data in-place that already resides in the cache hierarchy, the destination region should also be mapped as WC. If mapped as WB or WT, there is the potential for speculative processor reads to bring the data into the caches; in this case, non-temporal stores would then update in place, and data would not be flushed from the processor by a subsequent fencing operation.

The memory type visible on the bus in the presence of memory type aliasing is implementation specific. As one possible example, the memory type written to the bus may reflect the memory type for the first store to this line, as seen in program order; other alternatives are possible. This

10-18 Vol. 1

PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.

10.4.6.3PREFETCHh Instructions

The PREFETCHh instructions permit programs to load data into the processor at a suggested cache level, so that the data is closer to the processor’s load and store unit when it is needed. These instructions fetch 32 aligned bytes (or more, depending on the implementation) containing the addressed byte to a location in the cache hierarchy specified by the temporal locality hint (see Table 10-1). In this table, the first-level cache is closest to the processor and second-level cache is farther away from the processor than the first-level cache. The hints specify a prefetch of either temporal or non-temporal data (see Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data”). Subsequent accesses to temporal data are treated like normal accesses, while those to non-temporal data will continue to minimize cache pollution. If the data is already present at a level of the cache hierarchy that is closer to the processor, the PREFETCHh instruction will not result in any data movement. The PREFETCHh instructions do not affect functional behavior of the program.

See Section 11.6.13, “Cacheability Hint Instructions” for additional information about the PREFETCHh instructions.

Table 10-1. PREFETCHh Instructions Caching Hints

PREFETCHh
Instruction Mnemonic	Actions

PREFETCHT0	Temporal data—fetch data into all levels of cache hierarchy:
	• Pentium III processor—1st-level cache or 2nd-level cache
	• Pentium 4 and Intel Xeon processor—2nd-level cache
PREFETCHT1	Temporal data—fetch data into level 2 cache and higher
	• Pentium III processor—2nd-level cache
	• Pentium 4 and Intel Xeon processor—2nd-level cache
PREFETCHT2	Temporal data—fetch data into level 2 cache and higher
	• Pentium III processor—2nd-level cache
	• Pentium 4 and Intel Xeon processor—2nd-level cache
PREFETCHNTA	Non-temporal data—fetch data into location close to the processor, minimizing
	cache pollution
	• Pentium III processor—1st-level cache
	• Pentium 4 and Intel Xeon processor—2nd-level cache

10.4.6.4SFENCE Instruction

The SFENCE (Store Fence) instruction controls write ordering by creating a fence for memory store operations. This instruction guarantees that the result of every store instruction that precedes the store fence in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.

Vol. 1 10-19

<<< < Предыдущая 83 84 85 86 87 88 89 90 91 92 93 9495 / 13995 96 97 98 99 100 101 102 103 104 105 106 107 > Следующая >>>

Соседние файлы в папке Лаб2012

#
02.02.20153.31 Mб43253665.pdf
#
02.02.20153.33 Mб8025366517.pdf
#
02.02.20152.52 Mб33253666.pdf
#
02.02.20152.7 Mб3525366617.pdf
#
02.02.20152.09 Mб40253667.pdf
#
02.02.20152.19 Mб3325366717.pdf
#
02.02.20152.31 Mб47319433-011.pdf