
Cycle Timings and Interlock Behavior

16.1 About cycle timings and interlock behavior

Complex instruction dependencies and memory system interactions make it impossible to describe briefly the exact cycle timing behavior for all instructions in all circumstances. The timings that this chapter describes are accurate in most cases. If precise timings are required, you must use a cycle-accurate model of the processor.

Unless otherwise stated, cycle counts and result latencies that this chapter describes are best case numbers. They assume:

no outstanding data dependencies between the current instruction and a previous instruction

the instruction does not encounter any resource conflicts

all data accesses hit in the MicroTLB and Data Cache, and do not cross protection region boundaries

all instruction accesses hit in the Instruction Cache.

This section describes:

Changes in instruction flow overview

Instruction execution overview

Conditional instructions

Opposite condition code checks

Definition of terms.

16.1.1 Changes in instruction flow overview

To minimize the number of cycles lost to changes in instruction flow, the processor includes a:

dynamic branch predictor

static branch predictor

return stack.

The dynamic branch predictor is a 128-entry direct-mapped branch predictor using VA bits [9:3]. The prediction scheme uses a two-bit saturating counter for predictions that are:

Strongly Not Taken

Weakly Not Taken

Weakly Taken

Strongly Taken.

Only branches with a constant offset are predicted. Branches with a register-based offset are not predicted. A dynamically predicted branch can be folded out of the instruction stream if the following instruction arrives while the branch is within the prefetch instruction buffer. A dynamically predicted branch takes one cycle or zero cycles if folded out.
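The indexing and counter behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the processor's implementation: the class and function names are hypothetical, the reset state of the counters is an assumption, and tags, branch targets, and branch folding are not modeled.

```python
# Illustrative model of a 128-entry direct-mapped predictor that uses
# two-bit saturating counters. Names and reset state are assumptions.

STATES = (
    "Strongly Not Taken",  # counter value 0
    "Weakly Not Taken",    # counter value 1
    "Weakly Taken",        # counter value 2
    "Strongly Taken",      # counter value 3
)
NUM_ENTRIES = 128


def predictor_index(va: int) -> int:
    """Select one of the 128 entries using VA bits [9:3]."""
    return (va >> 3) & 0x7F


class DynamicPredictor:
    def __init__(self) -> None:
        # Assumption: every counter starts at Weakly Not Taken.
        self.counters = [1] * NUM_ENTRIES

    def state(self, va: int) -> str:
        return STATES[self.counters[predictor_index(va)]]

    def predict_taken(self, va: int) -> bool:
        # The two upper states predict taken.
        return self.counters[predictor_index(va)] >= 2

    def update(self, va: int, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        i = predictor_index(va)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
```

Because only VA bits [9:3] select the entry, two branches whose addresses differ only above bit 9 share a counter in this model, for example addresses 0x008 and 0x408.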

The static branch predictor operates on branches with a constant offset that are not predicted by the dynamic branch predictor. Static predictions are issued from the Iss stage of the main pipeline; consequently, a statically predicted branch takes four cycles.

The return stack consists of three entries, and as with static predictions, issues a prediction from the Iss stage of the main pipeline. The return stack mispredicts if the value taken from the return stack is not the value that is returned by the instruction. Only unconditional returns are predicted. A conditional return pops an entry from the return stack but is not predicted. If the return stack is empty a return is not predicted. Items are placed on the return stack from the following instructions:

BL #<immed>

BLX #<immed>

BLX Rx

Items are popped from the return stack by the following types of instruction:

BX lr

MOV pc, lr

LDR pc, [sp], #cns

LDMIA sp!, {….,pc}

A correctly predicted return stack pop takes four cycles.
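The push and pop rules above can be modeled roughly as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, and the behavior when a fourth entry is pushed onto a full stack is not described in this section, so discarding the oldest entry is an assumption.

```python
from typing import List, Optional


class ReturnStack:
    """Illustrative three-entry return stack."""

    CAPACITY = 3

    def __init__(self) -> None:
        self.entries: List[int] = []

    def push(self, return_address: int) -> None:
        # Called for BL #<immed>, BLX #<immed>, and BLX Rx.
        # Assumption: when the stack is full, the oldest entry is dropped.
        if len(self.entries) == self.CAPACITY:
            self.entries.pop(0)
        self.entries.append(return_address)

    def pop_for_return(self, conditional: bool) -> Optional[int]:
        """Return the predicted address, or None if no prediction is made."""
        if not self.entries:
            return None  # an empty stack gives no prediction
        predicted = self.entries.pop()
        # A conditional return still pops an entry but is not predicted.
        return None if conditional else predicted


def mispredicted(predicted: int, actual: int) -> bool:
    # A misprediction occurs when the popped value differs from the
    # address the return instruction actually produces.
    return predicted != actual
```

A call to `pop_for_return(conditional=True)` consumes an entry without producing a prediction, which is why a conditional return can cause a later unconditional return to see an empty stack.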

16.1.2 Instruction execution overview

The instruction execution pipeline is constructed from three parallel four-stage pipelines. See Table 16-1. For a complete description of these pipeline stages see Pipeline stages on page 1-24.

Table 16-1 Pipeline stages

Pipeline      Stages
ALU           Sh      ALU     Sat     WBex
Multiply      MAC1    MAC2    MAC3
Load/Store    ADD     DC1     DC2     WBls

The ALU and multiply pipelines operate in a lock-step manner, causing all instructions in these pipelines to retire in order. The load/store pipeline is a decoupled pipeline enabling subsequent instructions in the ALU and multiply pipeline to complete underneath outstanding loads.

Extensive forwarding to the Sh, MAC1, ADD, ALU, MAC2, and DC1 stages enables many dependent instruction sequences to run without pipeline stalls. General forwarding occurs from the ALU, Sat, WBex and WBls pipeline stages. In addition, the multiplier contains an internal multiply accumulate forwarding path. Most instructions do not require a register until the ALU stage. All result latencies are given as the number of cycles until the register is required by a following instruction in the ALU stage.

The following sequence takes four cycles:

LDR R1, [R2]      ;Result latency three
ADD R3, R3, R1    ;Register R1 required by ALU

If a subsequent instruction requires the register at the start of the Sh, MAC1, or ADD stage then an extra cycle must be added to the result latency of the instruction producing the required register. Instructions that require a register at the start of these stages are specified by describing that register as an Early Reg. The following sequence, requiring an Early Reg, takes five cycles:

LDR R1, [R2]            ;Result latency three plus one
ADD R3, R3, R1 LSL#6    ;plus one because Register R1 is required by Sh


Finally, some instructions do not require a register until their second execution cycle. If a register is not required until the ALU, MAC1, or DC1 stage for the second execution cycle, then a cycle can be subtracted from the result latency for the instruction producing the required register. If a register is not required until this later point, it is specified as a Late Reg. The following sequence, where R1 is a Late Reg, takes four cycles:

LDR R1, [R2]                ;Result latency three minus one
ADD R3, R3, R1, R4 LSL#5    ;minus one because Register R1 is a Late Reg
                            ;This ADD is a two issue cycle instruction
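The Early Reg and Late Reg adjustments can be summarized as a small calculation. The sketch below is a simplification, not a cycle-accurate model: the function names are hypothetical, and it assumes that a two-instruction sequence takes the adjusted result latency of the producer plus the cycle count of the consumer, which reproduces the three sequences above.

```python
# Adjustment applied to the producer's result latency, keyed by when the
# consuming instruction needs the register.
LATENCY_ADJUSTMENT = {
    "early": 1,    # required at the start of Sh, MAC1, or ADD
    "normal": 0,   # required at the start of the ALU stage
    "late": -1,    # not required until the second execution cycle
}


def effective_result_latency(result_latency: int, timing: str = "normal") -> int:
    """Result latency as seen by a consumer with the given register timing."""
    return result_latency + LATENCY_ADJUSTMENT[timing]


def two_instruction_cycles(result_latency: int, consumer_cycles: int,
                           timing: str = "normal") -> int:
    # Assumption: the consumer starts as soon as the adjusted latency has
    # elapsed, and no other interlocks or resource conflicts apply.
    return effective_result_latency(result_latency, timing) + consumer_cycles


# The three sequences from the text: LDR (result latency three) then ADD.
normal_case = two_instruction_cycles(3, consumer_cycles=1)                # 4
early_case = two_instruction_cycles(3, consumer_cycles=1, timing="early")  # 5
late_case = two_instruction_cycles(3, consumer_cycles=2, timing="late")    # 4
```

The Late Reg sequence still takes four cycles overall because the ADD in that example is itself a two-cycle instruction.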

16.1.3 Conditional instructions

Most instructions execute in one or two cycles. If these instructions fail their condition codes, they still take one and two cycles respectively.

Multiplies, MSR, and some CP14 and CP15 coprocessor instructions are the only instructions that require more than two cycles to execute. If one of these instructions fails its condition codes, then it takes a variable number of cycles to execute. The number of cycles is dependent on:

the length of the operation

the number of cycles between the setting of the flags and the start of the dependent instruction.

The worst-case number of cycles for a multi-cycle instruction that fails its condition codes is five.

The following algorithm describes the number of cycles taken by multi-cycle instructions that fail their condition codes:

Min(NonFailingCycleCount, Max(5 - FlagCycleDistance, 3))

Where:

 

Max (a,b)

Returns the maximum of the two values a,b.

Min (a,b)

Returns the minimum of the two values a,b.

NonFailingCycleCount

Is the number of cycles that the failing instruction would have taken had it passed.

FlagCycleDistance

Is the number of cycles between the instruction that sets the flags and the conditional instruction, including interlocking cycles. For example:

The following sequence has a FlagCycleDistance of zero because the instructions are back-to-back with no interlocks:

ADDS R1, R2, R3

MULEQ R4, R5, R6

The following sequence has a FlagCycleDistance of one:

ADDS R1, R2, R3

MOV R0, R0

MULEQ R4, R5, R6
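The formula above translates directly into code. In this sketch the function name is hypothetical and the cycle counts passed in are made-up inputs chosen to exercise each branch of the formula, not timings taken from the instruction tables.

```python
def cc_fail_cycles(non_failing_cycle_count: int, flag_cycle_distance: int) -> int:
    """Cycles taken by a multi-cycle instruction that fails its condition codes.

    Implements Min(NonFailingCycleCount, Max(5 - FlagCycleDistance, 3)).
    """
    return min(non_failing_cycle_count, max(5 - flag_cycle_distance, 3))


# Hypothetical inputs, for illustration only.
back_to_back = cc_fail_cycles(7, 0)   # limited by the worst case of five
one_apart = cc_fail_cycles(7, 1)      # 5 - 1 = 4
far_apart = cc_fail_cycles(7, 4)      # the Max() floor of three applies
short_op = cc_fail_cycles(2, 0)       # never more than the passing count
```

The Max term keeps the result between three and five cycles, and the Min term ensures that a failing instruction never takes longer than it would have taken had it passed.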

16.1.4 Opposite condition code checks

If instruction A and instruction B both write the same register, the pipeline must ensure that the register is written in the correct order. Therefore, interlocks might be required to correctly resolve this pipeline hazard.


The only useful sequences where two instructions write the same register without an instruction reading its value in between are when the two instructions have opposite sets of condition codes. The processor optimizes these sequences to prevent unnecessary interlocks. For example:

The following sequences take two cycles to execute:

ADDNE R1, R5, R6
LDREQ R1, [R8]

LDREQ R1, [R8]
ADDNE R1, R5, R6

The following sequence also takes two cycles to execute, because the STR instruction does not store the value of R1 produced by the QDADDNE instruction:

QDADDNE R1, R5, R6

STREQ R1, [R8]
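Whether two condition codes are opposites can be checked from their standard ARM 4-bit encodings, where each opposite pair differs only in the lowest bit. The sketch below only classifies pairs of condition mnemonics; the function name is hypothetical, and this section does not state how the processor itself detects such pairs.

```python
# Standard ARM condition code encodings (4-bit condition field).
CONDITION_ENCODING = {
    "EQ": 0x0, "NE": 0x1,
    "CS": 0x2, "CC": 0x3,
    "MI": 0x4, "PL": 0x5,
    "VS": 0x6, "VC": 0x7,
    "HI": 0x8, "LS": 0x9,
    "GE": 0xA, "LT": 0xB,
    "GT": 0xC, "LE": 0xD,
    "AL": 0xE,
}


def are_opposite_conditions(a: str, b: str) -> bool:
    """True if exactly one of the two conditions passes for any flag state."""
    code_a = CONDITION_ENCODING[a.upper()]
    code_b = CONDITION_ENCODING[b.upper()]
    # AL (always) has no opposite condition.
    if code_a >= 0xE or code_b >= 0xE:
        return False
    # Opposite pairs differ only in bit 0 of the encoding.
    return (code_a ^ 1) == code_b
```

For the sequences above, `are_opposite_conditions("NE", "EQ")` is true, so at most one of the two instructions writes R1.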

16.1.5 Definition of terms

Table 16-2 lists descriptions of cycle timing terms used in this chapter.

Table 16-2 Definition of cycle timing terms

Term                     Description
Cycles                   This is the minimum number of cycles required by an instruction.
Result Latency           This is the number of cycles before the result of this instruction is available for a following instruction requiring the result at the start of the ALU, MAC2, or DC1 stage. This is the normal case. Exceptions to this mark the register as an Early Reg.
                         Note: The result latency is the number of cycles from the first cycle of an instruction.
Register Lock Latency    For STM and STRD instructions only. This is the number of cycles that a register is write locked for by this instruction, preventing subsequent instructions that want to write the register from starting. This lock is required to prevent a following instruction from writing to a register before it has been read.
Early Reg                The specified registers are required at the start of the Sh, MAC1, or ADD stage. Add one cycle to the result latency of the instruction producing this register for interlock calculations.
Late Reg                 The specified registers are not required until the start of the ALU, MAC1, or DC1 stage for the second execution cycle. Subtract one cycle from the result latency of the instruction producing this register for interlock calculations.
FlagCycleDistance        The number of cycles between an instruction that sets the flags and the conditional instruction.

ARM DDI 0333H

Copyright © 2004-2009 ARM Limited. All rights reserved.

16-5

ID012410

Non-Confidential, Unrestricted Access

 
