Next: Intel Pentium 4 Up: The Main Architectural Classes Previous: IBM POWER3/4

Intel Itanium IA-64

The discussion of the Intel IA-64 chip has a somewhat different background for two reasons. First, it is a family of processors with characteristics that differ from those of the RISC chips presented elsewhere in this section. Second, its first implementation, the Itanium processor, is not yet available in regular products but will be by the second quarter of this year. Many vendors have announced that they will market systems with the Itanium processor, like NEC and Fujitsu Siemens, while HP and SGI will offer it as an alternative processor in their high-end systems like the HP SuperDome and the SGI Origin3000. A block diagram of the Itanium is shown in Figure 11.

Figure 11: Block diagram of the Intel IA-64 Itanium processor.

The clock frequency of the Itanium in the products to be shipped this year will probably be about 800 MHz (evaluation systems are now available in small numbers at 666 MHz). Figure 11 shows a large number of functional units that must be kept busy. This is done by large instruction words of 128 bits that contain three 41-bit instructions and a 5-bit template that aids in steering and decoding the instructions. This idea is inherited from the Very Large Instruction Word (VLIW) machines that were on the market for some time about ten years ago. The two load/store units fetch two instruction words per cycle, so six instructions per cycle are dispatched. The Itanium also has in common with these systems that the scheduling of instructions, unlike in RISC processors, is not done dynamically at run time but rather by the compiler. The VLIW-like operation is enhanced with predicated execution, which makes it possible to execute instructions in parallel that normally would have to wait for the result of a branch test. Intel calls this refreshed VLIW mode of operation EPIC, Explicit Parallel Instruction Computing.
Furthermore, a load instruction can be moved ahead of a branch or a store, and the loaded variable used early, by replacing the load at the place it originally came from with a test that checks whether the operations have been valid. To keep track of these advanced loads an Advanced Load Address Table (ALAT) records them. When the validity of an operation that depends on an advanced load is checked, the ALAT is searched; when no entry is present, the operation chain leading to the check is invalidated and the appropriate fix-up code is executed. Note that this fix-up code is generated at compile time, so no control speculation hardware is needed for this kind of speculative execution. Such hardware would become exceedingly complex for the many functional units that may be in operation simultaneously.
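
The 128-bit instruction word described above (three 41-bit instructions plus a 5-bit template) can be illustrated with a minimal Python sketch. The text does not say where the fields sit inside the word; the sketch assumes the template occupies the five lowest bits with the three instruction slots above it. The names pack_bundle and unpack_bundle are hypothetical and serve only to make the bit arithmetic concrete.

```python
# Field widths as given in the text: 3 x 41-bit instructions + 5-bit template.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS

# ASSUMPTION (not stated in the text): the template sits in the lowest
# 5 bits and slot i starts at bit 5 + 41 * i.


def pack_bundle(template, slots):
    """Combine a template and three instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


print(BUNDLE_BITS)  # 128

# Round trip with arbitrary (made-up) field values.
word = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(unpack_bundle(word) == (0x10, [0x123, 0x456, 0x789]))  # True
```

The field values in the usage lines are arbitrary numbers, not real IA-64 encodings. With two such words fetched per cycle, the six dispatched instructions mentioned above follow directly from 2 x 3.
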
As can be seen from Figure 11 there are four floating-point units capable of performing Fused Multiply Accumulate (FMAC) operations. However, only two of these work at the full 82-bit precision that is the internal standard on Itanium processors, while the other two can only be used for 32-bit precision operations. When working in the customary 64-bit precision the Itanium has a theoretical peak performance of 3.2 Gflop/s at a clock frequency of 800 MHz. Using 32-bit floating-point arithmetic, the peak is doubled. In addition to the floating-point units there are four integer units for integer arithmetic and other integer or character manipulations, and four MMX units to accommodate instructions for multi-media operations, an inheritance from the Intel Pentium processor family. For compatibility with this Pentium family a special IA-32 decode and control unit is present.
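
The peak figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes that every FMAC unit completes one fused multiply-accumulate per cycle, counted as two floating-point operations, and that in 32-bit mode all four units contribute. The function name peak_gflops is hypothetical.

```python
def peak_gflops(clock_mhz, fmac_units, flops_per_fmac=2):
    """Theoretical peak in Gflop/s.

    ASSUMPTION: each unit finishes one fused multiply-accumulate per
    cycle, and one such operation counts as two flops (multiply + add).
    """
    return clock_mhz * fmac_units * flops_per_fmac / 1000.0


# 64-bit precision: only the two full-precision units can be used.
peak_64 = peak_gflops(800, fmac_units=2)

# 32-bit precision: assumed here that all four units are active,
# which reproduces the doubling stated in the text.
peak_32 = peak_gflops(800, fmac_units=4)

print(peak_64, peak_32)
```

Under these assumptions the 64-bit case gives 2 x 2 x 800 Mflop/s = 3.2 Gflop/s, and the 32-bit case twice that.
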
The register files for integers and floating-point numbers are large: 128 registers each. However, only the first 32 entries of these register files are fixed, while entries 33--128 are implemented as a register stack. The primary data and instruction caches are 4-way set associative and rather small: 16 KB each. This reflects the long development time of the Itanium: at the time the chip was designed, 16 KB was considered large. Likewise, the L2 cache of 96 KB is now considered small, in contrast to the L3 cache of 4 MB.

The introduction of the Itanium has been deferred time and again, but it will be available in reasonable quantities from the second quarter of 2001 on. Its successor, code-named McKinley, will be built along the same principles but with a clock frequency of > 1 GHz and larger caches. At the time of writing this report the definitive design of the McKinley chip (the "tape-out") has been completed; it is expected to be available in the first half of 2002 and so to replace the Itanium processor quite quickly.





Aad van der Steen
Mon Jul 16 11:46:26 MDT 2001