Have you ever wondered why an Alpha CPU is able to reach such high speeds? Why the P6 core of Intel will be able to reach 800 MHz and more, and why
the K6 core (most probably) never will? Why does AMD claim that the K7
is designed for high clock speeds?
Well, it is time to find out. This article is the beginning of a series
of articles that will explain to you what makes a fast processor. We'll take
a good look under the hood of all those number crunching beasts. Let me
tell you, there is a lot to discover. And don't worry, the purpose of this
article is to explain, not to bombard you with technical terms. Let us
get on with it!
CPUs of today: So much difference.
It is quite intriguing: Alpha CPUs run at an earth-scorching speed
of 700 MHz while AMD is still at 450 MHz and Intel was finally able to reach
500 MHz. The difference is even more remarkable if you realize that Alpha
was able to reach 533-600 MHz using a .35 process! AMD and Cyrix were
never able to squeeze more than 233 MHz out of their .35 chips. So don't
let them tell you that the manufacturing process is the only thing that
counts. There is more!
Different ways to obtain the same high MHz!
If you ask why some CPUs are able to reach high MHz, you usually get
these answers:
1) Better manufacturing capabilities (better manufacturing technology).
2) Small die size and lower power dissipation.
3) Deeper pipelines.
Let us take a look at the CPUs out there:
Although DEC used to make its own CPUs, Intel has been fabbing the
Alpha for some time, and they have better manufacturing processes.
Intel had .25 CPUs long before DEC. Still, as we pointed out before,
the Alpha CPUs run at much higher speeds.
Small die size? The Alpha and the HP Chips are huge! But that does
not keep them from beating the much smaller Intel and AMD chips in MHz
(using the same process of course).
IDT's WinChip has a very small die size, and AMD always made smaller
chips than their Intel equivalents, but that did not help them beat
Intel in the MHz race. No really, there is more to it than manufacturing.
Deeper pipelines? That is a pretty confusing answer. How in the world
can a deeper pipeline help us to reach higher MHz? What is a deep pipeline
anyway? Read on!
Your first microprocessor.
To answer our questions, we need to build a small microprocessor ourselves.
First of all you must know that the place where it all happens is the ALU
(Arithmetic Logic Unit). This is the part of the CPU that really
processes the data. The ALU works on the data in the registers.
It reads the values of the registers, does some calculations with
those values (like adding two registers, incrementing a register by one,
a logical OR between the values of two registers, etc.) and writes the
results back into the registers. Then those registers will be written back
into the main memory (via the cache).
So the registers must be very fast to read from and to write to, otherwise
our ALU will be waiting for the input it needs, and will also be waiting
to write its output to the registers.
Well, the registers can be read and written in one clock cycle! In other
words, if you have value X in register Z and the instruction given to the
ALU says add 1 to X, it is possible to write the result, X+1, into the same
register Z. Huh? Won't that give big problems? Not really, let
us take a look at what our CPU does each clock cycle.
As you know, a clock signal is used in every microprocessor. Ideally, this clock
signal resembles an alternation of pulses (the blocks)
and periods of no voltage (the lines).
On the falling edge of the pulse, we will set up all the gates of our
registers (the black and blue arrows). In other words we make sure that
the values coming from the L1-cache will be written (the blue arrows to
the registers) into the right register. Also, did you notice the 4 horizontal
arrows to the ALU? These arrows are the bit lines that make up an instruction
(with this CPU, we can give 2^4 or 16 different instructions)
and these instruction lines will determine what sort of calculation the
ALU will do. Setting up all those signals takes a very small amount of time, Dsig.
It takes a small part of the CPU's clock cycle.
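To make those instruction lines a bit more concrete, here is a minimal Python sketch of a toy ALU that is steered by a 4-bit opcode. The names alu and OPCODE_BITS and the opcode numbers are made up for illustration; nothing here says which of the 16 codes means what on a real chip.

```python
# Toy model: 4 instruction lines give 2**4 = 16 possible opcodes.
# The opcode numbers below are hypothetical, chosen only for illustration.
OPCODE_BITS = 4

def alu(opcode, a, b=0):
    """Apply the operation selected by `opcode` to register values a and b."""
    if not 0 <= opcode < 2 ** OPCODE_BITS:
        raise ValueError("opcode does not fit on the instruction lines")
    operations = {
        0b0000: lambda: a + b,  # add two registers
        0b0001: lambda: a + 1,  # increment a register by one
        0b0010: lambda: a | b,  # logical OR of two registers
    }
    if opcode not in operations:
        raise NotImplementedError("this toy ALU only defines three opcodes")
    return operations[opcode]()

registers = {"Y": 3, "Z": 5}
# "Add 1 to X", where X lives in register Z: the result goes back into Z.
registers["Z"] = alu(0b0001, registers["Z"])
print(registers["Z"])  # 6
```

Only three of the sixteen slots are filled in; the point is simply that four bit lines are enough to pick one operation out of sixteen.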
Then we need to get the right values onto the bus (the red yellow arrow) to
the ALU. The values of the right register must be driven onto the bus. The
values travel to the ALU, and that also takes a bit of time, because
the value must be stable before the ALU calculates on it (DB).
Once that is done, the ALU can start working on it. After calculating
(DALU), the results must travel through the
input bus of the registers. This happens during DR.
The registers will be loaded with the new results on the rising edge
of the next pulse.
This is the basic concept of CPU technology: each clock cycle
has a pulse that triggers some gates. For example, the registers
can only be loaded with the new values (the results) on the rising edge
of a pulse, because at that moment the registers do not drive their contents
onto the bus that is the input of the ALU.
In other words, there is no input to the registers on the falling
edge of the pulse, only output. On the rising edge of the next pulse only
input is possible, while the output is blocked. That way the ALU can read
and write from/to the registers in one clock cycle.
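The read-then-write trick can be sketched in a few lines of Python. This is only a rough model under the assumption that one edge is "output only" and the other "input only", as described above; the Register class and run_clock_cycle function are hypothetical names.

```python
class Register:
    """A register that outputs on one clock edge and accepts input on the other."""
    def __init__(self, value=0):
        self.value = value

def run_clock_cycle(register, operation):
    # Falling edge: the register drives its content onto the bus (output only).
    bus = register.value
    # Between the edges: the ALU computes on the stable bus value.
    result = operation(bus)
    # Rising edge of the next pulse: the register latches the result (input only).
    register.value = result

z = Register(7)
run_clock_cycle(z, lambda x: x + 1)  # "add 1 to X", written back into Z
print(z.value)  # 8
```

Because the old value has already been copied onto the bus before the new one is latched, the same register can safely be both source and destination within one cycle.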
Back to our main issue: we want to attain lightning fast clock speeds.
High clock speed means that the time between two pulses must be as small
as possible. As you can see in fig. 2, we cannot make our clock cycle
shorter than the sum of the times that Dsig,
DB, DALU and DR
(plus some extra time, a safety margin) need. Otherwise we would be loading
new values before the results were properly written to the registers.
So if we can make the different Ds smaller,
we can make our clock cycle shorter.
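The sum of the Ds translates directly into a maximum clock speed. Here is a small Python sketch of that arithmetic; the delay values and the safety margin are invented numbers, picked only to show the calculation.

```python
# Hypothetical delays in nanoseconds; these are illustration values only.
delays_ns = {"Dsig": 1.0, "DB": 2.0, "DALU": 5.0, "DR": 2.0}
SAFETY_MARGIN_NS = 1.0

def min_cycle_ns(delays, margin_ns):
    """Shortest safe clock cycle: all delays in series plus a safety margin."""
    return sum(delays.values()) + margin_ns

def max_clock_mhz(cycle_ns):
    """Convert a cycle time in ns to a clock speed in MHz."""
    return 1000.0 / cycle_ns

cycle = min_cycle_ns(delays_ns, SAFETY_MARGIN_NS)
print(cycle, round(max_clock_mhz(cycle), 1))  # 11.0 90.9

# A faster ALU (smaller DALU) shortens the cycle and raises the clock speed.
faster = dict(delays_ns, DALU=3.0)
print(round(max_clock_mhz(min_cycle_ns(faster, SAFETY_MARGIN_NS)), 1))  # 111.1
```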
Three possible solutions:
1) Values (bits) should travel faster through the CPU.
2) The ALU should calculate faster.
3) We should try doing less work in one clockcycle.
Making the bit streams travel faster through the CPU is not easy. First
we should make sure that there are as few obstacles as possible when the
electrons travel. This will depend on the manufacturing process the chip
company uses and the temperature of the silicon. The better the manufacturing
process, the higher the yields will be. Intel, for example, had excellent
yields this past year: most of their CPUs were able to run at 450 MHz.
The temperature of the silicon? Well yes, atoms have a certain place
in the silicon die and they vibrate at that place. As the chip gets warmer,
the atoms vibrate more and the chance that a traveling electron
collides with, or is pushed away from its ideal path by those vibrating
atoms gets higher. If you can get the temperature down, the electrons and
hence the bit streams will travel more smoothly. That is why those Kryotech-cooled
CPUs can run faster.
Next we should take care of the distances the bits travel. We try to shrink
the transistors and hence the distances between them. That is what will enable
AMD and Intel to produce 600 MHz CPUs. They will use the smaller .18 micron
process instead of the fatter .25 micron process.
Hardware is good at doing things in parallel. So a sophisticated addition
unit will be able to add faster than a simple design. Hence, we will be able
to reduce DALU. It is another
trade-off: sophisticated components are faster but bigger, they need
more die space.
Doing less in a clockcycle?
Using better process technology is easier said than done. It costs
truckloads of money to change the manufacturing process. It is a brute
force approach. Isn't there a more subtle, intelligent way of doing things?
After all, the Alpha engineers laugh at 500 MHz (.25); they attained 600
MHz with the older .35 process!
OK, let us try to alter our first CPU, because the first Ace's Hardware
CPU would be running at 10 MHz or less! (If you find it in your local
shop, don't buy it!)
Another way could be to do less work in a clock cycle. Instead of doing
7 steps in one clock cycle, let us do 1 step in one clock cycle. If we do
less in one clock cycle, some of the Ds will disappear,
and we will be able to use a much shorter clock cycle.
That seems to be plain stupid! That would mean that (in this example)
the same instruction would take 7 times longer to deliver results. Isn't
that pure nonsense? Nope, this is where pipelining comes in.
Pipelining enables you to make each component of your CPU work simultaneously
each clock cycle. For example, let us say we have 4 main components in our
CPU, each with a different task in the data flow:
A fetcher, which looks up the next instruction.
A decoder, which makes sense of the instructions.
An ALU, which executes the instructions.
A retire unit, which writes the results back to the memory.
Well, if our design is pipelined, the fetcher will be fetching instruction
4, the decoder will be decoding instruction 3, the ALU will be executing instruction
2, and the results of instruction 1 will be written back to the memory
by the retire unit during the same clock cycle. In other words, each clock cycle
an instruction will be finished.
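If you want to see who is doing what in each clock cycle, the following Python sketch prints the occupancy of the four stages. It assumes an ideal pipeline with no stalls, and the function name pipeline_schedule is just an illustrative choice.

```python
STAGES = ["fetch", "decode", "execute", "retire"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Return, per clock cycle, which instruction number sits in each stage.

    Assumes an ideal pipeline: one new instruction enters every cycle
    and nothing ever stalls. None means the stage is idle.
    """
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for position, name in enumerate(stages):
            instr = cycle - position + 1  # instruction numbers start at 1
            row[name] = instr if 1 <= instr <= num_instructions else None
        schedule.append(row)
    return schedule

sched = pipeline_schedule(6)
for cycle_number, row in enumerate(sched, start=1):
    print(cycle_number, row)
```

In the fourth cycle the row reads fetch 4, decode 3, execute 2, retire 1, and from then on one instruction retires every cycle until the pipeline drains.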
Superpipelining, or pipelining with a lot of stages, enables us to do less
in a clock cycle, and to the outside it will still appear as if the CPU is
able to execute those instructions in one clock cycle. When we do less in
a clock cycle, the clock speed can be much higher.
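A rough feeling for what more stages buy you can be had from the Python sketch below. It assumes, quite optimistically, that the total work can be split into perfectly equal stages and that every stage needs the same fixed safety margin; the 99 ns of work and the 1 ns margin are invented numbers.

```python
def cycle_ns(total_work_ns, stages, margin_ns=1.0):
    """Cycle time if the work is split evenly over `stages` (an idealization)."""
    return total_work_ns / stages + margin_ns

def clock_mhz(total_work_ns, stages, margin_ns=1.0):
    """Clock speed in MHz for the given amount of work and pipeline depth."""
    return 1000.0 / cycle_ns(total_work_ns, stages, margin_ns)

# Hypothetical CPU with 99 ns of serial work per instruction.
for stages in (1, 4, 6, 12):
    print(stages, "stages:", round(clock_mhz(99.0, stages), 1), "MHz")
```

Real designs never split this evenly, so the slowest stage sets the pace, but the trend is the same: less work per stage means a shorter cycle.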
So why not increase the number of stages significantly?
So why don't chip bakers make CPUs with 20-stage
superpipelining? Well, deep pipelines increase the likelihood that a pipeline stall
may occur. We will explain this in greater detail shortly.
For now, imagine that instruction 2 consists of B = A + 1 and instruction
1 consists of
A = C*2. If it takes 20 cycles to calculate A, instruction 2 cannot
begin until after 29 clock cycles!
The deeper the pipeline the worse the stall will be if a value that
is needed is still being calculated.
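Here is a deliberately simplified Python model of such a stall. It assumes the dependent instruction enters the pipeline one cycle after the producer and must wait until the producer's result is ready; the latencies of 4 and 12 cycles are hypothetical and do not describe any real CPU.

```python
def dependent_start(producer_issue, producer_latency, consumer_issue):
    """Earliest cycle at which the dependent instruction can really begin."""
    result_ready = producer_issue + producer_latency
    return max(consumer_issue, result_ready)

def stall_cycles(producer_latency, issue_gap=1):
    """Cycles the consumer waits when issued `issue_gap` cycles after the producer."""
    start = dependent_start(0, producer_latency, issue_gap)
    return start - issue_gap

# Instruction 1: A = C*2, instruction 2: B = A + 1 (needs A).
print(stall_cycles(4))   # 3
print(stall_cycles(12))  # 11
```

The longer the result takes to come out of the pipeline, the more cycles the dependent instruction sits idle.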
Deep pipelines are not the only way to get high clock speeds. You can
also try to do more in parallel. Remember, it is actually the number of
operations that must be done serially that increases the length of a single
clock cycle.
Putting it all together.
You may recall from Understanding the performance of the K6-3 that
the K6 family has a 6 stage pipeline. The PII, however, has a 12 stage pipeline!
Although the K6 has better ALUs (a smaller DALU)
and better decoders (they do more in parallel), it is very difficult
for the K6-3 to reach the same clock speeds as the PII if they are using
the same production process. The .35µ K6 maxed out at 233 MHz, while
the .35µ PII maxed out at 300 MHz.
The K6 does more each clock cycle. Luckily the components of the K6 can
do more in parallel than the PII, and so the difference is not 100%. What
counts is the amount of serial operations (Ds).
Mind you, as I said before, it is not the only explanation for the difference
between the K6 and PII maximum clock speeds, but it does explain why some
CPUs can reach much higher speeds than others.
The UltraSPARC II (Brian's favorite) uses a 9 stage pipeline,
but you should consider the fact that RISC CPUs do not need as much decoding
work. Fetching and decoding alone takes 7 stages on the PII, but only 3 on
the UltraSPARC. So in a way, the UltraSPARC is pipelined deeper
than the PII (compare the two RISC cores).
The Alpha CPU has a 7-stage deep pipeline (10 for the FPU),
but it also has fewer problems decoding the (simple RISC) instructions,
compared to the PII. On top of that, the engineers of Alpha are masters
at keeping each stage as simple as possible by making sure that each component
does as much as it can in parallel. That keeps the Ds
we discussed down to a minimum, but it comes with a price: die size! Look at the
huge die size of the Alpha 21264: 302 mm².
Do you now understand why the latencies (4 clock cycles for add, 4 for
multiply) of the pipelined FPU of the K7 have risen compared to
the K6-2 (2 for add, 2 for multiply)?
The K7 FPU does less work in one clock cycle than the K6-2, making
much higher clock speeds possible. The 10 stage integer pipeline (15 stage
FPU pipeline) will enable the K7 to reach clock speeds of which the K6 core
could only dream... Cyrix is doing the same thing with the Jalapeno.
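Why a higher latency can still pay off is easy to check with a little arithmetic. The Python sketch below compares two made-up FPU designs, one with a 2-cycle add at 400 MHz and one with a 4-cycle add at 800 MHz. Both are assumed to be fully pipelined, and the clock speeds are hypothetical, not figures for the K6-2 or K7.

```python
def total_ns(cycles, clock_mhz):
    """Wall-clock time in nanoseconds for a number of cycles at a clock speed."""
    return cycles * 1000.0 / clock_mhz

def independent_ops_cycles(n_ops, latency):
    """Fully pipelined unit: a new independent operation starts every cycle."""
    return latency + n_ops - 1

def dependent_ops_cycles(n_ops, latency):
    """Each operation needs the previous result, so nothing overlaps."""
    return n_ops * latency

N = 100
short_design = {"latency": 2, "mhz": 400.0}  # hypothetical
long_design = {"latency": 4, "mhz": 800.0}   # hypothetical

for name, d in (("short", short_design), ("long", long_design)):
    indep = total_ns(independent_ops_cycles(N, d["latency"]), d["mhz"])
    dep = total_ns(dependent_ops_cycles(N, d["latency"]), d["mhz"])
    print(name, "independent:", indep, "ns", "dependent:", dep, "ns")
```

With these numbers the deeper design finishes 100 independent adds in roughly half the time, while a chain of 100 dependent adds takes equally long on both.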
Of course, clock speed is not the only way to make a CPU fast, so expect
more articles about all those CPU speed tricks...