Branch and Loop Reorganization to Prevent Mispredicts

Submit New Article

Published On :   November 1, 2009 11:00 PM PST
Rate
 


by Jeff Andrews

Introduction

Proven techniques and Intel tools enable developers to minimize branch mispredictions and keep deep-pipelined processors fully utilized.

Modern microprocessors are pipelined in order to get more instructions completed faster. This means that instructions do not wait for the previous ones to complete before their execution begins. A problem with this approach arises, however, due to conditional branches. If the microprocessor encounters a conditional branch and the result for the condition has not yet been calculated, how does it know whether to take the branch or not? This is where branch prediction comes in.

Branch prediction is what the processor uses to decide whether to take a conditional branch or not. Getting this information as accurately as possible is important, as an incorrect prediction (mispredict) will cause the microprocessor to throw out all the instructions that did not need to be executed and start over with the correct set of instructions. This process is particularly expensive with deeply pipelined processors.

This article introduces the various branch-prediction methods used by the microprocessor and provides some tips about how to avoid costly mispredicts. The paper assumes that the reader is familiar with programming in C and with IA32 assembly-language instructions.


Branch Examples

A pipelined processor makes it possible to begin execution of instructions before their predecessors are completed by breaking the instruction execution up into stages. When a conditional branch is encountered, the microprocessor uses branch prediction to determine which direction the branch will take. The following are examples of C language commands that cause conditional branches.

The first type of construction considered here that causes conditional branches is if-else:

 

//

// C code

//

if ( data == 0 )



else if ( data == 1 )



else





;

; Assembly code

;

mov eax, DWORD PTR data

cmp eax, 0

jne NextIfElse1



jmp EndIfElse

NextIfElse1:

cmp eax, 1

jne NextIfElse2



jmp EndIfElse

NextIfElse2:



EndIfElse:

 

 

//

// C code

//

switch ( data )

{

case 0:



break;

case 1:



break;

default:



}



;

; Assembly code

;

mov eax, DWORD PTR data

cmp eax, 0

jne NextCase1





jmp EndSwitch

NextCase1:

cmp eax, 1

jne NextCase2



jmp EndSwitch

NextCase2:



EndSwitch:

 

 

//

// C code

//

for ( i=0; i < 10; i++ )

{



}



;

; Assembly code

;

mov esi, data

mov ecx, 0

ForLoop:

cmp ecx, 10

jge EndForLoop



add ecx, 1

jmp ForLoop

EndForLoop:

 

 

//

// C code

//

while ( data > 0 )

{



data--;

}



;

; Assembly code

;

mov eax, data

WhileLoop:

cmp eax, 0

jle EndWhile



sub eax, 1

jmp WhileLoop

EndWhile:

 

 

//

// C code

//

do

{



data--;

}

while ( data > 0 );



;

; Assembly code

;

mov eax, data

DoWhileLoop:



sub eax, 1

cmp eax, 0

jg DoWhileLoop

EndDoWhile:

 


Static Branch Prediction

There are two kinds of branch prediction: static and dynamic. Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.

Static branch prediction is used when there is no data collected by the microprocessor when it encounters a branch, which is typically the first time a branch is encountered. The rules are simple:

  • A forward branch defaults to not taken
  • A backward branch defaults to taken

 

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common. Loops do not necessarily require any special ordering of code for static branch prediction, as only the condition of the loop iterator is normally used.

The Pentium® 4 Processor introduced new instructions for adding static hints to branches. It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.

In the event that a branch hint is necessary, the following instruction prefixes can be added before a branch instruction to change the way the static predictor behaves:

  • 0x3E – statically predict a branch as taken
  • 0x2E – statically predict a branch as not taken

 

For example:

 

mov    eax, data

cmp eax, 5

__emit 0x3E

jne branch

add eax, 1

branch:

 


Dynamic Branch Prediction

Dynamic branch prediction is done in the microprocessor by using a history log of previously encountered branches containing data for each branch, noting whether or not it was taken. This branch-history log is known as the Branch Target Buffer (BTB). Every time a branch is encountered and the microprocessor knows which direction the branch has taken, the BTB is updated to reflect that information.

The BTB is a buffer that holds a branch’s address and a brief history of the direction that a branch has taken. The address field is used somewhat like an index to the BTB, where it looks up whether a branch should be taken or not taken. There are 16 bits in the Pentium 4 processor to signify whether a branch should be taken or not. The bits work like a circular buffer, with each bit being checked for each check into the BTB.

The following is an example of the BTB entry for a backward branch (i.e., a do-while loop in C++) doing four iterations with all its entries already filled in:

Figure 1. Full BTB Branch Entry

 

In this example, the do-while loop has been executed multiple times, with each execution of the loop containing a fixed amount of four iterations. Now that the history for this loop is in the BTB, whenever this code is executed again, it will not cause any branch mispredicts and the accompanying penalty.

One thing to remember is that the BTB is finite. Once all the entries in the BTB have been consumed, an older entry will need to be used for a new branch that is encountered.

Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an exam ple of loop unrolling:

 

//

// Not-unrolled

//

for (i=0; i < 32; i++)

{

some_val += some_array[i];

}



//

//Unrolled

//

for (i=0; i < 32; i+=2)

{

some_val += some_array[i] * 20;

some_val += some_array[i+1] * 20;

}

 

Another benefit of loop unrolling is that dependence chains are stretched, allowing for deep pipelines to get more utilization.

There are some rules that need to be followed:

  • Do not go beyond 16 iterations for the Pentium 4 processor and four iterations for the Pentium® II processor and Pentium® III processor.
  • Branches within a loop can place a heavy demand on the BTB, since they are now multiplied the same number of times the loop is unrolled.

 

It is also best to remove branches from within a loop, if possible. By doing so, the branch is only taken once, rather than for each iteration of the loop. This is only possible when the conditional does not change during the entire duration of the loop. The following is an example of how to do this:

 

//

//original version

//

for ( i=0; i < 10; i++ )

{

if ( some_value > 10 )

data++;

else

data--;

}





//

// conditional extracted version

//

if ( some_value > 10 )

{

for ( i=0; i < 10; i++ )

data++;

}

else

{

for ( i=0; i < 10; i++ )

data--;

}

 


Conditional Instructions

The Pentium® Pro processor introduced new instructions that help eliminate branches. These instructions behave differently, depending on whether the condition is met or not. The instructions are as follows:

  • SETcc – sets the destination register to 0 if the condition is not met and to 1 if the condition is met.
  • CMOVcc – moves the data from the source register to the destination register if the condition is met; otherwise, it is treated as a NOP by the microprocessor.
  • FCMOVcc – moves the floating point data from the source FP register to the destination FP register if the condition is met; otherwise, it is treated as a NOP by the microprocessor.

 

The cc stands for the different conditional expressions available (e.g., ne – not equal, z – zero). Given that, the following condi tional instruction moves ebx into eax if the value of ecx is not equal to 10:

 

cmp    ecx, 10

cmovne eax, ebx

 

 

;

; original version

;

add eax, somevalue

jcc branch

add edx, 1

branch:



;

; setcc version

;

add eax, somevalue

setc ebx, 1

add edx, ebx

 

 

;

; original version

;

cmp eax, somevalue

jge branch

mov ebx, constant0

branch:



;

; cmovcc version

;

cmp eax, somevalue

mov ecx, constant0

movl ebx, ecx

 

;

; original version

;

fld comp1

fcomp comp2

fnstsw ax

test ah, 1

jne branch

fld data

fadd dataadjust

fstp data

branch:



;

; fcmovcc version

;

fld dataadjust

fldz

fld comp1

fcomp comp2

fcmovnbe st(1)

fadd data

fstp data

fdecstp

 

When using conditional instructions, verify that they execute faster than conditional branches, as they can incur some additional instruction overhead for the more complex branches.

For more information on conditional instructions, please refer to the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2: Instructions Set Reference.


Intel® Software Development Tools

The VTune™ Performance Analyzer is an excellent tool for determining if branch mispredictions are occurring and how much of an impact they are causing on an executable. The VTune analyzer can sample many different processor events, including mispredicted branches. The simplest way to do this is to create an "Advanced Activity Configuration" and create a sampling collector with all the "Primary performance tuning events" selected. This will give you more than the needed number of events, but it will also alert you to some other possible issues.

The Intel® Compilers can automatically unroll loops and insert branch-removing instructions. Using the Intel Compilers is simple, as they are flag compatible with the products they integrate with (e.g., Microsoft Visual Studio*). When the capability is enabled, the compilers will insert conditional instructions and branch hints.

The compilers can also enable application profiling, so that elements like case statements will be reordered to have the most common cases appear first, taking better advantage of static branch prediction.

Evaluation copies of the VTune Performance Analyzer and the Intel compilers can be downloaded from the links given above. Training for Intel® Software Development Tools is available from the Intel® Software College.


Conclusion

Deep-pipelined processors give the ability to boost clock frequencies up to several gigahertz. In order to take advantage of deep pipelines, however, branch mispredictions should be avoided so that the pipelines remain fully utilized.

The techniques described in this article, which range to the simple to the more complex, allow developers to avoid mispredictions. Using Intel software-development tools simplifies the removal of branch mispredictions and provides efficient solutions. If hand tuning of the code is required, using the methods given here will certainly help to minimize branch mispredictions.


Additional Resources

 

Articles

 

Community

 


About the Author

Jeff Andrews is an Application Engineer with Intel Corporation specializing in optimizing code for ISVs. He also researches software technologies that enhance the performance of applications on Intel® processors.