ARM1176JZ-S Technical Reference Manual

Program Flow Prediction

5.2 Branch prediction

In ARM processors that have no Prefetch Unit (PU), the target of a branch is not known until the end of the Execute stage, which is also when it becomes known whether or not the branch is taken. On such processors the best performance is obtained by predicting all branches as not taken and filling the pipeline with the instructions that follow the branch in the current sequential path. In ARM processors without a PU, an untaken branch requires one cycle and a taken branch requires three or more cycles.
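The cycle counts above give a rough feel for what branches cost without a PU. The following is a minimal sketch, assuming exactly three cycles for a taken branch (the text says "three or more", so this is a lower bound); the function name and the probability parameter are illustrative only.

```python
def average_branch_cycles(p_taken, untaken_cycles=1, taken_cycles=3):
    """Expected cycles per branch when every branch is predicted not taken.

    p_taken is the fraction of branches that are actually taken.
    taken_cycles=3 is the minimum quoted in the text; real values may be higher.
    """
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be between 0 and 1")
    return (1.0 - p_taken) * untaken_cycles + p_taken * taken_cycles
```

With half of all branches taken, this model gives an average of two cycles per branch, which shows why a longer pipeline makes better prediction worthwhile.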

Branch prediction enables the detection of branch instructions before they enter the integer core. This permits the use of a branch prediction scheme that closely models actual conditional branch behavior.

The increased pipeline length of the ARM1176JZ-S processor makes the performance penalty of any changes in program flow, such as branches or other updates to the PC, more significant than was the case on the ARM9TDMI or ARM1020T processors. Therefore, a significant amount of hardware is dedicated to prediction of these changes. Two major classes of program flow are addressed in the ARM1176JZ-S prediction scheme:

1. Branches, including BL and BLX immediate, where the target address is a fixed offset from the program counter. The prediction amounts to an examination of the probability that a branch passes its condition codes. These branches are handled in the Branch Predictors.

2. Loads, Moves, and ALU operations writing to the PC that can be identified as likely to be a return from a procedure call. Two identifiable cases are Loads to the PC from an address derived from R13, the stack pointer, and Moves or ALU operations to the PC derived from R14, the Link Register. In these cases, if the calling operation can also be identified, the likely return address can be stored in a hardware-implemented stack, termed a Return Stack (RS). Typical calling operations are BL and BLX instructions. In addition, Moves or ALU operations to the Link Register from the PC are often preludes to a branch that serves as a calling operation. The Link Register value derived is the value required for the RS. This was most commonly done on ARMv4T, before the BLX <register> instruction was introduced in ARMv5T.
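The Return Stack idea in item 2 can be illustrated with a small software model. This is only a sketch: the class name, the default depth of three entries, and the policy of dropping the oldest entry on overflow are assumptions made for illustration, not details stated in this section.

```python
class ReturnStack:
    """Toy model of a hardware return stack (RS)."""

    def __init__(self, depth=3):
        # depth is an assumed value for this sketch
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # A calling operation (for example BL or BLX) records the address
        # of the instruction that follows the call.
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # assumed policy: the oldest entry is lost
        self.entries.append(return_address)

    def on_return(self):
        # A load to the PC via R13, or a move to the PC from R14, consumes
        # the most recent entry as the predicted return address.
        # None means the stack is empty and no prediction is available.
        return self.entries.pop() if self.entries else None
```

Nested calls return in last-in, first-out order, which is why a small stack predicts most procedure returns correctly.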

Branch prediction is required in the design to reduce the integer core CPI loss that arises from the longer pipeline. To improve the branch prediction accuracy, a combination of static and dynamic techniques is employed. It is possible to disable each of the predictors separately.

5.2.1 Enabling program flow prediction

The enabling of program flow prediction is controlled by the CP15 Register c1 Z bit, bit 11, that is set to 0 on Reset. See c1, Control Register on page 3-44. The return stack, dynamic predictor, and static predictor can also be individually controlled using the Auxiliary Control Register. See c1, Auxiliary Control Register on page 3-49.
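As an illustration of the Z bit described above, the sketch below manipulates a 32-bit Control Register value in software. On real hardware the value is read and written with the MRC and MCR instructions (p15, 0, <Rd>, c1, c0, 0) in a privileged mode; the helper names here are hypothetical and the code only models the bit arithmetic.

```python
Z_BIT = 1 << 11  # CP15 c1 Control Register, bit 11 (program flow prediction)

def enable_flow_prediction(control_reg):
    """Return the register value with the Z bit set."""
    return (control_reg | Z_BIT) & 0xFFFFFFFF

def disable_flow_prediction(control_reg):
    """Return the register value with the Z bit cleared."""
    return control_reg & ~Z_BIT & 0xFFFFFFFF

def flow_prediction_enabled(control_reg):
    """True when the Z bit is set in the given register value."""
    return bool(control_reg & Z_BIT)
```

Because the Z bit is 0 on Reset, boot code has to perform this read-modify-write before any prediction takes place.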

5.2.2 Dynamic branch predictor

The first line of branch prediction in the processor is dynamic, through a simple Branch Target Address Cache (BTAC). It is virtually addressed and holds virtual target addresses. In addition, a two-bit value holds the prediction history of the branch. If the address mappings change, this cache must be flushed. A dynamic branch predictor flush is included in the CP15 coprocessor control instructions. Direct dynamic branch predictor flushes from the main TLB and the integer core are also included.

A BTAC works by storing the existence of branches at particular locations in memory. The branch target address and a prediction of whether or not the branch might be taken are also stored.

ARM DDI 0333H

Copyright © 2004-2009 ARM Limited. All rights reserved.

5-4

ID012410

Non-Confidential, Unrestricted Access

 

The BTAC provides dynamic prediction of branches, including BL and BLX instructions, in ARM, Thumb, and Jazelle states. The BTAC is a 128-entry direct-mapped cache structure used for allocation of Branch Target Addresses for resolved branches. The BTAC uses a 2-bit saturating prediction history scheme to provide the dynamic branch prediction. When a branch has been allocated into the BTAC, it is evicted only in the case of a capacity clash, that is, by another branch at the same index.

The prediction is based on the previous behavior of this branch. The four possible states of the prediction bits are:

strongly predict branch taken

weakly predict branch taken

weakly predict branch not taken

strongly predict branch not taken.

The history is updated for each occurrence of the branch. This updating is scheduled by the integer core when the branch has been resolved.
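The four-state history can be modeled as a 2-bit saturating counter. The sketch below assumes a numeric encoding from 0 (strongly not taken) to 3 (strongly taken); the encoding and the function names are illustrative and are not taken from the manual.

```python
# Assumed encoding of the four prediction states.
STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = range(4)

def predict_taken(state):
    """Predict taken in either of the two 'taken' states."""
    return state >= WEAK_TAKEN

def update(state, taken):
    """Move one step toward the resolved outcome, saturating at both ends."""
    if taken:
        return min(state + 1, STRONG_TAKEN)
    return max(state - 1, STRONG_NOT_TAKEN)
```

The saturation means that a single atypical outcome, such as the final iteration of a loop, moves a strongly taken branch only to weakly taken, so the next prediction for that branch is still taken.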

Branch entries are allocated into the BTAC after having been resolved at Execute. BTAC hits enable branch prediction with zero cycle delay. When a BTAC hit occurs, the Branch Target Address stored in the BTAC is used as the Program Counter for the next Fetch. Both branches resolved taken and not taken are allocated into the BTAC. This enables the BTAC to do the most useful amount of work and improves performance for tight backward branching loops.
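The allocation and lookup behavior described above can be sketched as a small direct-mapped table. Several details are assumptions made for illustration: the index function (word address modulo 128), the use of the full branch address as the tag, and the initial history state given to a newly allocated entry.

```python
BTAC_ENTRIES = 128  # 128-entry direct-mapped structure, as stated in the text

class Btac:
    """Toy model of a direct-mapped Branch Target Address Cache."""

    def __init__(self):
        self.entries = [None] * BTAC_ENTRIES

    @staticmethod
    def index(address):
        # Assumed index function: word address modulo the number of entries.
        return (address >> 2) % BTAC_ENTRIES

    def lookup(self, address):
        """Return (predicted_taken, target) on a hit, or None on a miss."""
        entry = self.entries[self.index(address)]
        if entry is None or entry["tag"] != address:
            return None
        return (entry["state"] >= 2, entry["target"])

    def resolve(self, address, target, taken):
        """Update or allocate an entry once the branch is resolved at Execute."""
        i = self.index(address)
        entry = self.entries[i]
        if entry is not None and entry["tag"] == address:
            # Hit: adjust the 2-bit saturating history (0..3).
            step = 1 if taken else -1
            entry["state"] = max(0, min(3, entry["state"] + step))
        else:
            # Miss: allocate, evicting any other branch at the same index.
            # Starting in a weak state is an assumption of this sketch.
            self.entries[i] = {"tag": address, "target": target,
                               "state": 2 if taken else 1}
```

In this model two branches whose addresses differ by a multiple of 512 bytes share an index, so resolving one evicts the other. That is the clash case in which an allocated branch is replaced.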

5.2.3 Static branch predictor

The second level of branch prediction in the processor uses static branch prediction that is based solely on the characteristics of a branch instruction. It does not make use of any history information. The scheme used in the ARM1176JZ-S processor predicts that all forward conditional branches are not taken and all backward branches are taken. Around 65% of all branches are preceded by enough non-branch cycles to be completely predicted.

Branch prediction is performed only when the Z bit in CP15 Register c1 is set to 1. See c1, Control Register on page 3-44 for details of this register. Dynamic prediction works by caching previously seen branches in the BTAC and, like all caches, suffers from the compulsory miss that occurs the first time the predictor encounters a branch. A second, static predictor is added to the design to counter these misses, and to deal with any capacity and conflict misses in the BTAC. The static predictor amounts to an early evaluation of branches in the pipeline, combined with a predictor based on the direction of the branches to handle condition codes that are not yet known when these branches are handled. Only items that have not been predicted in the dynamic predictor are handled by the static predictor.

The Static Branch Predictor (SBP) is hard-wired, with backward branches predicted as taken and forward branches as not taken. The SBP looks at the MSB of the branch offset to determine the branch direction. Statically predicted taken branches incur a one-cycle delay before the target instructions start refilling the pipeline. The SBP works in both ARM and Thumb states. The SBP does not function in Jazelle state.
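The direction test can be illustrated for an ARM-state B instruction, whose encoding holds a signed 24-bit word offset in bits [23:0], so bit 23 is the MSB of the offset. The sketch below covers only that encoding; the function names are illustrative, and Thumb encodings are not modeled.

```python
def branch_offset(instruction):
    """Byte offset encoded in an ARM-state B/BL instruction (relative to PC + 8)."""
    imm24 = instruction & 0x00FFFFFF
    if imm24 & 0x00800000:       # sign bit of the 24-bit field
        imm24 -= 1 << 24         # sign-extend
    return imm24 << 2            # the offset is stored in words

def static_predict_taken(instruction):
    """Backward branch (offset MSB set) is predicted taken, forward not taken."""
    return bool(instruction & 0x00800000)
```

For example, 0x1AFFFFFC encodes a BNE with a negative offset, the typical shape of a loop-closing branch, so it is predicted taken, while 0x0A000002 is a forward BEQ and is predicted not taken.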

5.2.4 Branch folding

Branch folding is a technique where, on the prediction of most branches, the branch instruction is completely removed from the instruction stream presented to the execution pipeline. Branch folding can significantly improve the performance of branches, taking the CPI for branches significantly lower than 1.

Branch folding only operates in ARM and Thumb states.

Branch folding is done for all dynamically predicted branches, except for:

BL and BLX instructions, to avoid losing the link

predicted branches onto branches

branches that are breakpointed or have generated an abort when fetched.
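The folding rules listed above can be collected into a single predicate. This is a sketch of one reading of those rules: the parameter names are hypothetical, and `target_is_branch` reflects our interpretation of "predicted branches onto branches" as a branch whose predicted target is itself a branch.

```python
def can_fold(state, dynamically_predicted, is_link_branch=False,
             target_is_branch=False, breakpointed=False, fetch_aborted=False):
    """Return True if a branch is eligible for folding under the listed rules."""
    if state not in ("ARM", "Thumb"):
        return False              # folding only operates in ARM and Thumb states
    if not dynamically_predicted:
        return False              # only dynamically predicted branches are folded
    # BL and BLX keep their place in the stream so the link is not lost;
    # the remaining exclusions follow the list above.
    return not (is_link_branch or target_is_branch
                or breakpointed or fetch_aborted)
```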

5.2.5 Incorrect predictions and correction

Branches are resolved at or before the Ex3 stage of the integer core pipeline. A misprediction causes the pipeline to be flushed, and the correct instruction stream to be fetched. If branch folding is implemented, the failure of the condition codes of a folded branch causes the instruction that follows the folded branch to fail. Whenever a potentially incorrect prediction is made, the following information, necessary for recovering from the error, is stored:

a fall-through address in the case of a predicted taken branch instruction

the branch target address in the case of a predicted not taken branch instruction.

The PU passes the conditional part of any optimized branch into the integer core. This enables the integer core to compare these bits with the processor flags and determine if the prediction was correct or not. If the prediction was incorrect, the integer core flushes the PU and requests that prefetching begins from the stored recovery address.
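The recovery scheme described above can be sketched as follows. The function names and the default instruction size of four bytes (ARM state) are assumptions made for illustration; the sketch models only which address is stored and when it is used.

```python
def recovery_address(branch_address, target, predicted_taken, instruction_size=4):
    """Address stored at prediction time for use if the prediction is wrong."""
    if predicted_taken:
        return branch_address + instruction_size  # fall-through address
    return target                                 # branch target address

def resolve(branch_address, target, predicted_taken, condition_passed,
            instruction_size=4):
    """Return the refetch address after a misprediction, or None if correct."""
    if predicted_taken == condition_passed:
        return None  # prediction was correct, no flush needed
    return recovery_address(branch_address, target, predicted_taken,
                            instruction_size)
```

A branch at 0x8000 predicted taken whose condition then fails resumes at 0x8004, while a branch predicted not taken whose condition passes resumes at its target.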
