The Secrets of High Performance CPUs, Part 1
By Johan De Gelas
Wednesday, September 29, 1999 2:00 AM EDT

Have you ever wondered why an Alpha CPU is able to reach such high clock speeds? Why Intel's P6 core will be able to reach 800 MHz and more, while the K6 core (most probably) never will? Why AMD claims that the K7 is designed for high clock speeds?

Well, it is time to find out. This article is the first in a series that will explain what makes a fast processor. We'll take a good look under the hood of all those number-crunching beasts, and let me tell you, there is a lot to discover. Don't worry: the purpose of this article is to explain, not to bombard you with technical terms. Let us get on with it!
 

CPUs of today: so much difference.

It is quite intriguing: Alpha CPUs run at an earth-scorching 700 MHz while AMD is still at 450 MHz and Intel was finally able to reach 500 MHz. The difference is even more remarkable when you realize that Alpha reached 533-600 MHz using a .35 micron process! AMD and Cyrix were never able to squeeze more than 233 MHz out of their .35 chips. So don't let anyone tell you that the manufacturing process is the only thing that counts. There is more!
 
Different ways to obtain the same high MHz! 

If you ask why some CPUs are able to reach high MHz you usually get three answers: 
1) Better manufacturing capabilities (better manufacturing technology, smaller process)
2) Small die size and lower power dissipation.
3) Deeper pipelines.

Let us take a look at the CPUs out there: 
 
                     Year     Process  Die size  Max. freq.  SPECint95  SPECfp95
                     shipped  (µm)     (mm²)     (MHz)
Intel PII            97       0.35     203       300         11.9       8.6
AMD K6               97       0.35     168       233         7.1*       3.9*
Sun UltraSPARC IIi   98       0.35     156       360         15.2       19.9
DEC Alpha 21164      95       0.5      299       333         9.8        13.4
DEC Alpha 21164a     97       0.35     209       600         18.4       21.4
DEC Alpha 21264      98       0.25     302       667         40         60
HP PA-RISC 8200      97       0.5      345       220         15.5       25

* Estimates

Although DEC used to make its own CPUs, Intel has been fabbing the Alpha for some time, and Intel has the better manufacturing processes: Intel had .25 micron CPUs long before DEC. Still, as we pointed out before, the Alpha CPUs run at much higher speeds.

Small die size? The Alpha and HP chips are huge! But that does not keep them from beating the much smaller Intel and AMD chips in MHz (on the same process, of course).

IDT's WinChip has a very small die, and AMD always made smaller chips than their Intel equivalents, but that did not help them beat Intel in the MHz race. No, really: there is more to it than manufacturing technology.

Deeper pipelines? That is a pretty confusing answer. How in the world can a deeper pipeline help us reach higher MHz? And what is a deep pipeline anyway? Read on!
 

Your first microprocessor. 
 

To answer our questions, we need to build a small microprocessor ourselves. First of all, you must know that the place where it all happens is the ALU (Arithmetic Logic Unit). This part of the CPU really processes the data, and it works on the data in the registers.

It reads the values of the registers, performs some calculation on those values (adding two registers, incrementing a register by one, a logical OR between the values of two registers, etc.) and writes the result back into a register. Later, those registers are written back to main memory (via the cache).

So the registers must be very fast to read and to write; otherwise our ALU will be waiting for its input, and waiting again to write its output back to the registers.

Well, the registers can be read and written in one clock cycle! In other words, if you have value X in register Z and the instruction given to the ALU says "add 1 to X", it is possible to write the result, X+1, into the same register Z within that cycle. Won't that cause big problems? Not really. Let us take a look at what our CPU does each clock cycle.

 

As you know, a clock signal is used in every microprocessor. Ideally, this clock signal is an alternation of pulses (the blocks) and periods of no voltage (the lines).
On the falling edge of the pulse, we set up all the gates of our registers (the black and blue arrows). In other words, we make sure that the values coming from the L1 cache will be written into the right registers (the blue arrows). Also, notice the 4 horizontal arrows to the ALU: these are the bit lines that encode an instruction (with 4 lines, this CPU can express 2^4 = 16 different instructions), and they determine what sort of calculation the ALU will perform. Setting up all those signals takes a small amount of time, Dsig, a small part of the CPU's clock cycle.

Then the right values must be driven onto the bus to the ALU (the red/yellow arrow). The values travel to the ALU, and that also takes a bit of time, because the values must be stable before the ALU can work on them (DB).

Once that is done, the ALU can start working. After calculating (DALU), the result must travel over the input bus of the registers; this happens during DR. The registers are loaded with the new result on the rising edge of the next pulse.

 

This is the basic concept of CPU technology: each clock cycle has a pulse that triggers certain gates. For example, the registers can only be loaded with new values (the results) on the rising edge of a pulse, because at that moment the registers are not driving their contents onto the bus that feeds the ALU.

In other words, on the falling edge of the pulse there is no input to the registers, only output. On the rising edge of the next pulse only input is possible, while the output is blocked. That way the ALU can read from and write to the registers in one clock cycle.
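As a toy illustration (a Python sketch of the idea, not a real hardware description), we can model a register whose contents are read on the falling edge and whose input gate only opens on the rising edge:

```python
# Toy edge-triggered register: outputs drive the ALU on the falling edge,
# new results are latched only on the rising edge. This is why "add 1 to
# the value in Z and store it back in Z" works within a single cycle.
class Register:
    def __init__(self, value=0):
        self.value = value    # the stored, stable value
        self.pending = None   # result waiting at the input for the rising edge

    def read(self):
        # Falling edge: the register drives its contents onto the bus.
        return self.value

    def latch(self, result):
        # The ALU's result arrives at the register input and waits there.
        self.pending = result

    def rising_edge(self):
        # Rising edge: only now is the input gate opened and the value stored.
        if self.pending is not None:
            self.value = self.pending
            self.pending = None

z = Register(41)
x = z.read()       # falling edge: the ALU reads X = 41 from register Z
z.latch(x + 1)     # the ALU computes X + 1; the result waits at the input
z.rising_edge()    # rising edge: 42 is written back into the same register Z
print(z.value)     # -> 42
```

Reading and writing never happen on the same edge, so the old value is stable for the whole calculation.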

Back to our main issue: we want to attain lightning-fast clock speeds. A high clock speed means that the time between two pulses must be as small as possible. As you can see in fig. 2, we cannot make our clock cycle shorter than the sum of Dsig, DB, DALU and DR (plus some extra time as a safety margin). Otherwise we would be loading new values before the results were properly written to the registers.

So if we can make the different delays (the Ds) smaller, we can make our clock cycle shorter.
Three possible solutions: 

1) Values (bits) should travel faster through the CPU. 
2) The ALU should calculate faster. 
3) We should try doing less work in one clock cycle.
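A back-of-the-envelope sketch in Python, using made-up delay values (purely illustrative, not measurements of any real chip), shows how the sum of these serial delays puts a ceiling on the clock speed:

```python
# Hypothetical per-phase delays in nanoseconds (illustrative values only).
D_SIG  = 20.0   # setting up the register gates and instruction lines
D_B    = 25.0   # driving stable values onto the bus to the ALU
D_ALU  = 40.0   # the ALU computing the result
D_R    = 15.0   # driving the result back to the register inputs
MARGIN = 5.0    # safety margin

# The clock cycle cannot be shorter than the sum of the serial delays.
cycle_ns = D_SIG + D_B + D_ALU + D_R + MARGIN
freq_mhz = 1000.0 / cycle_ns   # convert ns per cycle to MHz

print(f"Cycle time: {cycle_ns} ns -> max clock: {freq_mhz:.1f} MHz")
# -> Cycle time: 105.0 ns -> max clock: 9.5 MHz
```

Shrink any of the Ds and the maximum frequency goes up; the whole rest of this article is about the different ways to do that.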
 

Faster traveling. 

Making the bit streams travel faster through the CPU is not easy. First, we should make sure that the electrons meet as few obstacles as possible on their way. This depends on the manufacturing process the chip company uses and on the temperature of the silicon. The better the manufacturing process, the higher the yields will be. Intel, for example, had excellent yields this past year; most of their CPUs were able to run at 450 MHz.

The temperature of the silicon? Well, yes: atoms have a certain place in the silicon die and vibrate around that place. As the chip gets warmer, the atoms vibrate more, and the chance that a traveling electron collides with, or is pushed off its ideal path by, those vibrating atoms gets higher. If you get the temperature down, the electrons, and hence the bit streams, travel more smoothly. That is why those Kryotech-cooled CPUs can run faster.

Next, we should shorten the distances the bits travel. We do this by shrinking the transistors and hence the distance between them. That is what will enable AMD and Intel to produce 600 MHz CPUs: they will use the smaller .18 micron process instead of the fatter .25 micron process.
 

Faster calculating. 

Hardware is good at doing things in parallel. A sophisticated addition unit will be able to add faster than a simple design, so we can reduce DALU. It is another trade-off: sophisticated components are faster but bigger, and they need more die space.
 

Doing less in a clock cycle?

Using better process technology is easier said than done: it costs truckloads of money to change the manufacturing process. It is a brute-force approach. Isn't there a more subtle, intelligent way of doing things? After all, the Alpha engineers laugh at 500 MHz (.25 micron); they attained 600 MHz with the older .35 micron process!

OK, let us try to alter our first CPU, because the first Ace's Hardware CPU would be running at 10 MHz or less! (If you find it in your local shop, don't buy it!)

Another way is to do less work in a clock cycle. Instead of doing 7 steps in one clock cycle, let us do 1 step per clock cycle. If we do less in one clock cycle, some of the Ds disappear, and we can use a much shorter clock cycle.

That seems plain stupid! It would mean that (in this example) the same instruction takes 7 times longer to deliver its result. Isn't that pure nonsense? Nope: this is where pipelining comes in (more info here)!

Pipelining makes each component of your CPU work simultaneously, every clock cycle. For example, let us say we have 4 main components in our CPU, each with its own task in the data flow:
A fetcher, which looks up the next instruction.
A decoder, which makes sense of the instruction.
An ALU, which executes the instruction.
A retire unit, which writes the results back to memory.

Well, if our design is pipelined, then during the same clock cycle the fetcher will be fetching instruction 4, the decoder will be decoding instruction 3, the ALU will be executing instruction 2, and the retire unit will be writing the results of instruction 1 back to memory. In other words, every clock cycle an instruction is finished.
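A toy simulation of this four-stage pipeline (a sketch of the principle, not of any real CPU) makes the overlap visible:

```python
STAGES = ["Fetch", "Decode", "Execute", "Retire"]
todo = ["i1", "i2", "i3", "i4", "i5"]   # instructions waiting to enter

pipeline = [None] * len(STAGES)          # pipeline[s] = instruction in stage s
finished = []
cycle = 0

while todo or any(pipeline):
    cycle += 1
    # Every instruction moves one stage forward; a new one enters Fetch.
    for s in range(len(STAGES) - 1, 0, -1):
        pipeline[s] = pipeline[s - 1]
    pipeline[0] = todo.pop(0) if todo else None
    print(f"cycle {cycle}: " + "  ".join(
        f"{st}:{i or '-'}" for st, i in zip(STAGES, pipeline)))
    # The instruction in the Retire stage completes this cycle.
    if pipeline[-1] is not None:
        finished.append(pipeline[-1])
        pipeline[-1] = None

# 5 instructions complete in 8 cycles: 4 cycles to fill the pipeline,
# then one instruction finishes every cycle.
print(f"{len(finished)} instructions in {cycle} cycles")
```

Once the pipeline is full, one instruction retires per clock cycle, even though each individual instruction still spends four cycles inside the CPU.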

 

Superpipelining, or pipelining with a lot of stages, enables us to do less in each clock cycle, while to the outside world the CPU still appears to execute those instructions at one per clock cycle. When we do less in a clock cycle, the clock speed can be much higher.
 

So why not increase the number of stages significantly?

So why don't chip makers build CPUs with 20-stage superpipelines? Well, deep pipelines increase the likelihood that a pipeline stall occurs. We will explain this in greater detail shortly.

For now, imagine that instruction 1 is A = C*2 and instruction 2 is B = A + 1. If it takes 20 cycles before A is calculated, instruction 2 has to wait all those cycles before it can execute! The deeper the pipeline, the worse the stall will be when a needed value is still being calculated.

Deep pipelines are not the only way to get high clock speeds; you can also try to do more in parallel. Remember, it is the number of operations that must be done serially that determines the length of a single cycle.

Putting it all together. 

You may recall from Understanding the performance of the K6-3 that the K6 family has a 6-stage pipeline. The PII, however, has a 12-stage pipeline! Although the K6 has better ALUs (a smaller DALU) and better decoders (they do more in parallel), it is very difficult for the K6-3 to reach the same clock speeds as the PII on the same production process. The .35µ K6 maxed out at 233 MHz, while the .35µ PII maxed out at 300 MHz.

The K6 simply does more each clock cycle. Luckily, the components of the K6 can do more in parallel than those of the PII, so the difference is not 100%. What counts is the number of serial operations (the Ds), remember?

Mind you, as I said before, this is not the only explanation for the difference between the maximum clock speeds of the K6 and the PII, but it does explain why some CPUs can reach much higher speeds than others.

The UltraSPARC II (Brian's favorite) uses a 9-stage pipeline, but consider that RISC CPUs need much less decoding work. Fetching and decoding alone take 7 stages on the PII, but only 3 on the UltraSPARC. So in a way, the UltraSPARC is pipelined more deeply than the PII (compare the two RISC cores).
 
The Alpha CPU has a 7-stage pipeline (10 for the FPU), but it also has fewer problems decoding its (simple RISC) instructions than the PII. On top of that, the Alpha engineers are masters at keeping each stage as simple as possible, making sure that each component does as much as it can in parallel. That keeps the Ds we discussed to a minimum, but it comes at a price: die size! Look at the huge die of the Alpha 21264: 302 mm².

Do you now understand why the latencies of the K7's pipelined FPU (4 clock cycles for an add, 4 for a multiply) have risen compared to the K6-2 (2 for an add, 2 for a multiply)? The K7 FPU does less work in one clock cycle than the K6-2, making much higher clock speeds possible. The 10-stage integer pipeline (15-stage FPU pipeline) will enable the K7 to reach clock speeds the K6 core could only dream of... Cyrix is doing the same thing with the Jalapeno (M7?).
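A rough throughput comparison makes the trade-off concrete. The clock speeds below are illustrative assumptions, not real product numbers, and both FPUs are assumed to be fully pipelined working on independent operations:

```python
# Time (in microseconds) to push n independent FP adds through a pipelined
# FPU: `latency` cycles for the first result, then one result per cycle.
def time_for_adds(n, clock_mhz, latency):
    cycles = latency + (n - 1)
    return cycles / clock_mhz

# Hypothetical clocks: a short-latency FPU at a lower clock vs. a deeper,
# longer-latency FPU at the higher clock its shorter stages allow.
shallow = time_for_adds(1000, clock_mhz=450, latency=2)   # K6-2-like
deep    = time_for_adds(1000, clock_mhz=700, latency=4)   # K7-like

print(f"shallow pipeline: {shallow:.2f} us, deep pipeline: {deep:.2f} us")
```

On a long stream of independent operations, the two extra cycles of latency are paid only once, while the higher clock pays off on every single operation.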

Of course, clock speed is not the only way to make a CPU fast, so expect more articles about all those CPU speed tricks... 

Discuss this article here! 

Back to Ace's hardware home page for more articles. 

All Content is Copyright (C) 1998-2001 Ace's Hardware. All Rights Reserved.