Branch and Loop Reorganization to Prevent Mispredicts

By Jeff Andrews, published on May 6, 2011

Translating...

This is a computer translation of the original content. It is provided for general information only and should not be relied upon as complete or accurate.

by Jeff Andrews

Introduction

Proven techniques and Intel tools enable developers to minimize branch mispredictions and keep deep-pipelined processors fully utilized.

Modern microprocessors are pipelined in order to get more instructions completed faster. This means that instructions do not wait for the previous ones to complete before their execution begins. A problem with this approach arises, however, due to conditional branches. If the microprocessor encounters a conditional branch and the result for the condition has not yet been calculated, how does it know whether to take the branch or not? This is where branch prediction comes in.

Branch prediction is what the processor uses to decide whether to take a conditional branch or not. Getting this information as accurately as possible is important, as an incorrect prediction (mispredict) will cause the microprocessor to throw out all the instructions that did not need to be executed and start over with the correct set of instructions. This process is particularly expensive with deeply pipelined processors.

This article introduces the various branch-prediction methods used by the microprocessor and provides some tips about how to avoid costly mispredicts. The paper assumes that the reader is familiar with programming in C and with IA32 assembly-language instructions.

Branch Examples

A pipelined processor makes it possible to begin execution of instructions before their predecessors are completed by breaking the instruction execution up into stages. When a conditional branch is encountered, the microprocessor uses branch prediction to determine which direction the branch will take. The following are examples of C language commands that cause conditional branches.

The first type of construction considered here that causes conditional branches is if-else:

//

// C code

//

if ( data == 0 )

…

else if ( data == 1 )

…

else

…



;

; Assembly code

;

mov    eax, DWORD PTR data

cmp    eax, 0

jne    NextIfElse1

…

jmp    EndIfElse

NextIfElse1:

cmp    eax, 1

jne    NextIfElse2

…

jmp    EndIfElse

NextIfElse2:

…

EndIfElse:

//

// C code

//

switch ( data )

{

case 0:

…

break;

case 1:

…

break;

default:

…

}



;

; Assembly code

;

mov    eax, DWORD PTR data

cmp    eax, 0

jne    NextCase1



…

jmp    EndSwitch

NextCase1:

cmp    eax, 1

jne    NextCase2

…

jmp    EndSwitch

NextCase2:

…

EndSwitch:

//

// C code

//

for ( i=0; i < 10; i++ )

{

…

}



;

; Assembly code

;

mov    esi, data

mov    ecx, 0

ForLoop:

cmp    ecx, 10

jge    EndForLoop

…

add    ecx, 1

jmp    ForLoop

EndForLoop:

//

// C code

//

while ( data > 0 )

{

…

data--;

}



;

; Assembly code

;

mov    eax, data

WhileLoop:

cmp    eax, 0

jle    EndWhile

…

sub    eax, 1

jmp    WhileLoop

EndWhile:

//

// C code

//

do

{

…

data--;

}

while ( data > 0 );



;

; Assembly code

;

mov    eax, data

DoWhileLoop:

…

sub    eax, 1

cmp    eax, 0

jg     DoWhileLoop

EndDoWhile:

Static Branch Prediction

There are two kinds of branch prediction: static and dynamic. Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.

Static branch prediction is used when there is no data collected by the microprocessor when it encounters a branch, which is typically the first time a branch is encountered. The rules are simple:

A forward branch defaults to not taken
A backward branch defaults to taken

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common. Loops do not necessarily require any special ordering of code for static branch prediction, as only the condition of the loop iterator is normally used.

The Pentium® 4 Processor introduced new instructions for adding static hints to branches. It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.

In the event that a branch hint is necessary, the following instruction prefixes can be added before a branch instruction to change the way the static predictor behaves:

0x3E – statically predict a branch as taken
0x2E – statically predict a branch as not taken

For example:

mov    eax, data

cmp    eax, 5

__emit 0x3E

jne    branch

add    eax, 1

branch:

Dynamic Branch Prediction

Dynamic branch prediction is done in the microprocessor by using a history log of previously encountered branches containing data for each branch, noting whether or not it was taken. This branch-history log is known as the Branch Target Buffer (BTB). Every time a branch is encountered and the microprocessor knows which direction the branch has taken, the BTB is updated to reflect that information.

The BTB is a buffer that holds a branch’s address and a brief history of the direction that a branch has taken. The address field is used somewhat like an index to the BTB, where it looks up whether a branch should be taken or not taken. There are 16 bits in the Pentium 4 processor to signify whether a branch should be taken or not. The bits work like a circular buffer, with each bit being checked for each check into the BTB.

The following is an example of the BTB entry for a backward branch (i.e., a do-while loop in C++) doing four iterations with all its entries already filled in:

Figure 1. Full BTB Branch Entry

In this example, the do-while loop has been executed multiple times, with each execution of the loop containing a fixed amount of four iterations. Now that the history for this loop is in the BTB, whenever this code is executed again, it will not cause any branch mispredicts and the accompanying penalty.

One thing to remember is that the BTB is finite. Once all the entries in the BTB have been consumed, an older entry will need to be used for a new branch that is encountered.

Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an exam ple of loop unrolling:

//

// Not-unrolled

//

for (i=0; i < 32; i++)

{

some_val += some_array[i];

}



//

//Unrolled

//

for (i=0; i < 32; i+=2)

{

some_val += some_array[i] * 20;

some_val += some_array[i+1] * 20;

}

Another benefit of loop unrolling is that dependence chains are stretched, allowing for deep pipelines to get more utilization.

There are some rules that need to be followed:

Do not go beyond 16 iterations for the Pentium 4 processor and four iterations for the Pentium® II processor and Pentium® III processor.
Branches within a loop can place a heavy demand on the BTB, since they are now multiplied the same number of times the loop is unrolled.

It is also best to remove branches from within a loop, if possible. By doing so, the branch is only taken once, rather than for each iteration of the loop. This is only possible when the conditional does not change during the entire duration of the loop. The following is an example of how to do this:

//

//original version

//

for ( i=0; i < 10; i++ )

{

if ( some_value > 10 )

data++;

else

data--;

}





//

// conditional extracted version

//

if ( some_value > 10 )

{

for ( i=0; i < 10; i++ )

data++;

}

else

{

for ( i=0; i < 10; i++ )

data--;

}

Conditional Instructions

The Pentium® Pro processor introduced new instructions that help eliminate branches. These instructions behave differently, depending on whether the condition is met or not. The instructions are as follows:

SETcc – sets the destination register to 0 if the condition is not met and to 1 if the condition is met.
CMOVcc – moves the data from the source register to the destination register if the condition is met; otherwise, it is treated as a NOP by the microprocessor.
FCMOVcc – moves the floating point data from the source FP register to the destination FP register if the condition is met; otherwise, it is treated as a NOP by the microprocessor.

The cc stands for the different conditional expressions available (e.g., ne – not equal, z – zero). Given that, the following condi tional instruction moves ebx into eax if the value of ecx is not equal to 10:

cmp    ecx, 10

cmovne    eax, ebx

;

; original version

;

add    eax, somevalue

jcc    branch

add    edx, 1

branch:



;

; setcc version

;

add    eax, somevalue

setc   ebx, 1

add    edx, ebx

;

; original version

;

cmp    eax, somevalue

jge    branch

mov    ebx, constant0

branch:



;

; cmovcc version

;

cmp    eax, somevalue

mov    ecx, constant0

movl   ebx, ecx

;

; original version

;

fld      comp1

fcomp    comp2

fnstsw   ax

test     ah, 1

jne      branch

fld      data

fadd     dataadjust

fstp     data

branch:



;

; fcmovcc version

;

fld        dataadjust

fldz

fld        comp1

fcomp      comp2

fcmovnbe   st(1)

fadd       data

fstp       data

fdecstp

When using conditional instructions, verify that they execute faster than conditional branches, as they can incur some additional instruction overhead for the more complex branches.

For more information on conditional instructions, please refer to the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2: Instructions Set Reference.

Intel® Software Development Tools

The VTune™ Performance Analyzer is an excellent tool for determining if branch mispredictions are occurring and how much of an impact they are causing on an executable. The VTune analyzer can sample many different processor events, including mispredicted branches. The simplest way to do this is to create an "Advanced Activity Configuration" and create a sampling collector with all the "Primary performance tuning events" selected. This will give you more than the needed number of events, but it will also alert you to some other possible issues.

The Intel® Compilers can automatically unroll loops and insert branch-removing instructions. Using the Intel Compilers is simple, as they are flag compatible with the products they integrate with (e.g., Microsoft Visual Studio*). When the capability is enabled, the compilers will insert conditional instructions and branch hints.

The compilers can also enable application profiling, so that elements like case statements will be reordered to have the most common cases appear first, taking better advantage of static branch prediction.

Evaluation copies of the VTune Performance Analyzer and the Intel compilers can be downloaded from the links given above.

Conclusion

Deep-pipelined processors give the ability to boost clock frequencies up to several gigahertz. In order to take advantage of deep pipelines, however, branch mispredictions should be avoided so that the pipelines remain fully utilized.

The techniques described in this article, which range to the simple to the more complex, allow developers to avoid mispredictions. Using Intel software-development tools simplifies the removal of branch mispredictions and provides efficient solutions. If hand tuning of the code is required, using the methods given here will certainly help to minimize branch mispredictions.

Additional Resources

Articles

Avoiding the Cost of Branch Misprediction
Performance Analysis of Applications Running on Itanium® Processor

Community

Intel® Software Development Products

About the Author

Jeff Andrews is an Application Engineer with Intel Corporation specializing in optimizing code for ISVs. He also researches software technologies that enhance the performance of applications on Intel® processors.

For more complete information about compiler optimizations, see our Optimization Notice.

3 comments

Top

Anonymous said on Sep 16,2011

i am doing my research work in this area.i had some idea,can you give some comment on that.

Anonymous said on Sep 1,2010

ITS VERY NICE
FOR UNDERSTANDING PURPOSE IT VERY EASY TO UNDERSTAND TO ME

Anonymous said on Sep 27,2009

Hi Jeff,

I am taking Parallel Computer Architecture course this semester. And above description is very helpful for branch prediction. But I still have a doubt on unconditional branch that appears for the first time in code and unconditional branch for the second or more. Can you please elaborate on this point? I would appreciate if you can e-mail me your point of view.

Thanks,
Saumil Patel

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.