The Mips R8000 takes microprocessors where they have never been before: Into the realm of supercomputing
The new Mips R8000 chip set signals a number of trends in processors. First and foremost, it shows the determination of Silicon Graphics, which acquired Mips in 1992, to take a piece of the supercomputer market. Conventional wisdom has thus far held that microprocessors can't cut it in supercomputing, despite what they've done for mainframes and minicomputers.
The launch of the R8000--currently in limited production--is also further confirmation that the RISC processor market has matured to a point where products need to be differentiated. Applying every trick in the design textbook and announcing the world's fastest CPU is no longer enough; now you need to answer the question ``Fastest at what?'' and then demonstrate the proper design trade-offs to make your answer credible.
Mips's answer is that the R8000 is fastest for technical and scientific computing tasks involving huge data sets and lots of floating-point math. The R8000's design emphasizes an external cache of up to 16 MB in size and with gigabyte-per-second bandwidth, so it's not intimidated by applications that won't fit into on-chip cache. Combined with a loosely coupled superscalar floating-point processor, this cache enables the R8000 to claim a peak performance of 310 SPECfp92 and 108 SPECint92 (faster than IBM's Power2 on floating-point operations), or 300 MFLOPS, which is equivalent to the performance of the Cray Y-MP.
The cost of this performance is a four-chip set with over 1000 pins, high power consumption, and a high price, so you definitely won't want to use it in a PDA (personal digital assistant). But because the R8000 is binary-compatible with existing Mips chips, you can run a network of Silicon Graphics workstations driven by less expensive R4400s, all accessing an R8000-based departmental compute server. Mips is not alone in moving to this high-end/low-end strategy; the PowerPC group (with the yet-to-appear 620 at the high end) and Hewlett-Packard (with the PA-RISC 7200) are blazing similar trails.
The R8000 chip contains the integer-register file and multiple execution units, primary caches, and the branch-prediction unit, while its sister FPU--the R8010--contains the floating-point registers and two full floating-point execution units. Both chips are fabricated in a 0.5-micron CMOS process operating at 3.3 V and are designed to run at 75 MHz in the first generation. The R8000 and R8010 are large chips (17.3 mm by 17.2 mm each), although between them they contain only 3.4 million transistors, which is fewer than the PowerPC 604 crams onto a single, much smaller die. Both chips are packaged as 595-pin PGAs (pin-grid arrays).
To build an R8000 system, you also need a pair of custom dual-ported RAM chips, which act as tag RAM for the external cache; this external cache must be built from SSRAM (synchronous static RAM) chips, which are more expensive than ordinary SRAMs. The SSRAMs and the huge die will probably make the R8000 the most expensive microprocessor chip set ever built, but the supercomputer market is not known for bargain pricing.
Mips Goes Superscalar
The R8000 is a 64-bit superscalar processor that issues up to four instructions per cycle. This is a departure for Mips, which was previously the champion of superpipelining (i.e., deep pipelines and fast clocks) as an alternative to multiple-instruction issue. However, Mips's designers realized that their floating-point performance goals could be met only by a vector processor or by going superscalar, and they preferred the latter as a more general-purpose solution.
The R8000 integer unit contains four integer-execution units (see the figure ``R8000 Microarchitecture''). Three of these units--two ALUs and a shifter--operate in one cycle; the fourth, a multiply/divide unit, is not pipelined and takes four to six cycles for a multiply and 21 to 73 cycles for a divide. There are four parallel pipelines within the R8000, each five stages deep (i.e., fetch, decode, address generate, execute, and write back). Two pipelines feed the integer ALUs, and two are for loads/stores, allowing ALU operations to occur in parallel with data-cache accesses. Of the four instructions that the R8000 can dispatch each cycle, two are integer or load/store operations, and two are floating-point operations.
With this degree of parallel issue, Mips had to find a cure for the ``load shadow'' problem of earlier RISC pipeline designs: The cycle immediately following (and, hence, shadowing) a load instruction can't use the result of that load. As with branch-delay slots, the compiler tries to fill this dead slot with an independent instruction, but no compiler can be expected to find four such instructions.
Mips's solution was to place the ALU one stage later in the pipeline than usual (after address generation), which removes the load-to-ALU shadow but instead introduces an ALU shadow over the load addresses: Whenever a base address is calculated using an ALU operation, there is a one-cycle delay before the address can be used by a following load or store. Mips considers this a good trade-off for two reasons: because load-use dependencies occur far more often than compute-load dependencies and because extensions to the R8000 instruction set include a new register+register addressing mode for floating-point loads/stores that reduces the need to precalculate addresses.
The integer-unit pipelines incorporate several other tricks to increase internal parallelism. For historical reasons, the Mips instruction set requires all branches to be followed by a one-cycle delay slot. The R8000 has to retain these delay slots for backward compatibility, but it executes them in parallel with their branch instruction. The on-chip data cache is dual-ported to support two memory accesses per cycle, and it incorporates a bypass so that a store followed by a dependent load can still be issued in parallel.
Floating-point instructions are executed in the R8010, so the R8000's dispatcher puts floating-point operations into a queue where they can wait, without holding up the integer pipelines, until the R8010 is ready for them. Floating-point operations can, therefore, execute out of order in relation to any following integer instructions, but the R8010 uses FIFO (first-in/first-out) queues at both input and output to ensure they execute in order among themselves. This decoupling of the FPU from the integer pipelines improves integer performance by hiding not only the delay of floating-point execution, but also the latency of the external cache pipeline, which is mostly used for floating-point data (more about this later).
The downside of this decoupling is that while integer exceptions remain precise, floating-point exceptions become imprecise, being reported as asynchronous interrupts some time after the causing event. You can confine the extent of this lag, at a price, by writing code that repeatedly reads the floating-point status register (which flushes all pending exceptions), or you can enter a precise-exception mode, in which the integer pipeline is stalled for the whole duration of a floating-point operation. This provides backward compatibility with earlier Mips CPUs, but at the cost of a large hit in integer performance.
The R8010 contains two identical floating-point data paths that are completely indistinguishable to software. These data paths can perform double-precision multiplies, adds, divides, square roots, and conversions, as well as a new, fused multiply-add (i.e., multiply A by B and then add C without intermediate rounding), which is especially useful in image processing and similar applications.
It's the execution of two such multiply-adds per cycle at 75 MHz that allows Mips to claim 300-MFLOPS peak performance. Floating-point compares and moves take one cycle; multiplies and multiply-adds take four cycles and are fully pipelined; long operations, such as divides and square roots, are not pipelined and can take up to 20 cycles. However, the data paths are fully independent, so while one unit is executing a long, nonpipelined operation, the other can still execute one pipelined operation per cycle.
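The 300-MFLOPS claim is straightforward arithmetic from the figures above; here it is as a quick Python check (all constants come from the article, nothing is measured):

```python
# Peak floating-point rate of the R8000/R8010 pair: two FP data
# paths, each retiring one fused multiply-add (which counts as
# 2 FLOPs) every cycle at 75 MHz.
CLOCK_MHZ = 75
FP_DATA_PATHS = 2
FLOPS_PER_MADD = 2   # one multiply plus one add, no intermediate rounding

peak_mflops = CLOCK_MHZ * FP_DATA_PATHS * FLOPS_PER_MADD
print(peak_mflops)   # 300
```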
Dispatch and Branch Prediction
A major design goal for the R8000 was to achieve superscalar instruction issue that's not critically sensitive to instruction alignment. Some superscalar RISC processors can issue instructions in parallel only if they're aligned on 64-bit or 128-bit boundaries; otherwise, the time taken to align the instructions would consume all the advantage gained. For example, the DEC Alpha can waste up to three cycles when branching to a nonaligned target.
As is so often the case in the RISC world, it's up to the compiler (and, hence, the wretched compiler writer) to solve this problem by padding the instruction stream with NOPs to achieve the correct alignment. This reduces code density and can cause secondary performance losses by increasing the cache-miss rate. Mips solved this problem by implementing an instruction buffer that takes blocks of instructions from the cache aligned on 128-bit boundaries and issues up to four aligned instructions per cycle into the execution units. The text box ``Aligning Instructions for Multiple Dispatch'' explains how the instruction buffer does its job.
To dispatch four instructions per cycle, you have to fetch four instructions per cycle, and that makes the R8000 very vulnerable to breaks in the instruction stream caused by program branches. To reduce this impact, the R8000's designers implemented a dynamic branch-prediction scheme. The branch cache holds 1 K (1024) entries, and, for the sake of speed and economy of silicon space, it's direct-mapped, allowing it to be physically laid out as just an appendage of the single-ported instruction-cache RAM. The branch cache employs a 1-bit prediction scheme, which has a rather low accuracy. However, its large size--one entry per line (i.e., per four instructions) in the instruction cache--compensates for this by eliminating almost all conflicts.
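A 1-bit, direct-mapped predictor of this kind can be sketched in a few lines of Python. This is a behavioral toy, not the R8000's actual logic; the indexing (one entry per 16-byte, four-instruction block) is my reading of the description above:

```python
class OneBitPredictor:
    """Direct-mapped branch cache with a single prediction bit per
    entry: predict that a branch does whatever it did last time."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.taken = [False] * entries   # one prediction bit per entry

    def index(self, pc):
        # one entry per 16-byte block of four instructions (assumed)
        return (pc >> 4) % self.entries

    def predict(self, pc):
        return self.taken[self.index(pc)]

    def update(self, pc, was_taken):
        self.taken[self.index(pc)] = was_taken
```

A 1-bit scheme mispredicts twice per loop (once on exit and again on re-entry), which is why its raw accuracy trails multibit counters; the R8000 compensates with sheer capacity, so two branches rarely share an entry.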
The fetch stage of the integer pipeline accesses the branch and instruction caches in parallel, reading from the branch cache a single prediction bit (i.e., taken/not taken), the instruction-cache index of the branch target, and the alignments of both the last instruction in the source block and the branch target. These latter two quantities are used to mask off the unwanted instructions from the quadwords that contain the source and target. In the interest of space, only 10 bits of the instruction-cache index are stored in each branch-cache entry; the remaining high-order bits of the target address are recovered on the next cycle from the virtual instruction-cache tag. This scheme assumes that all predicted branch targets will hit in the instruction cache; consequently, it increases the penalty for instruction-cache misses.
Because each branch-prediction entry is shared among four instructions, the possibility of cache thrashing arises if two branch instructions occur in the same four-word block (no more than two could occur because of Mips's compulsory branch-delay slots). The answer to this problem is to design compilers that avoid creating such small basic blocks (i.e., branches so close together).
Another extension to the R8000 instruction set helps with this task; the new conditional move instructions can often implement IF-THEN-ELSE structures using either a single branch or none. For example, one such instruction says, ``move the contents of register r2 into r1, but only if condition code cc1 is set.'' Conditional instructions always execute, consuming one cycle even when the move doesn't happen, so they don't cause a discontinuity in the instruction stream.
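The transformation a compiler performs with conditional moves looks roughly like this sketch (Python stand-ins for illustration, not Mips syntax):

```python
def signum_with_branches(x):
    # the naive IF-THEN-ELSE: two branches the predictor must guess
    if x > 0:
        return 1
    if x < 0:
        return -1
    return 0

def signum_with_cmov(x):
    # conditional-move style: start with a default, then overwrite it
    # only when a "condition code" is set; control flow never diverges,
    # so the instruction stream has no discontinuity to predict
    r = 0
    cc_pos = x > 0
    cc_neg = x < 0
    r = 1 if cc_pos else r    # stands in for a conditional move
    r = -1 if cc_neg else r
    return r
```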
Mips claims that, although this branch-prediction scheme is somewhat less efficient than multibit ones, it works uniformly with respect to the taken and fall-through cases and for jump-to-register instructions, so the same hardware can predict branches, jumps, and subroutine calls/returns. It turns out to be very effective for dynamic object-oriented programs that call many small procedures (i.e., methods) indirectly via pointers. The code generated by such programs is full of jump-to-register instructions with invariant targets, which get predicted very well.
The Cache Hierarchy
The biggest problem faced by the R8000's designers was getting sufficient memory bandwidth to keep such a fast floating-point processor fed. The answer, as usual, lay in the cache architecture. The integer R8000 has separate 16-KB on-chip instruction and data caches, both with a 32-byte line size filled from an external secondary cache (which Mips calls the global cache) in two 16-byte chunks. The instruction cache is direct-mapped and is both virtually addressed and virtually tagged.
Virtual tagging confers two advantages here. First, it dispenses with an instruction TLB (translation look-aside buffer) and the associated speed penalty for TLB misses. Also, it means that the instruction cache's contents need not always be a subset of the global cache, so the loading of huge floating-point data sets into the global cache doesn't have to displace still-useful instructions.
The data cache is direct-mapped and virtually addressed, but it's physically tagged and is used only for integer loads and stores. It's dual-ported to support two loads (or one load and one store) per cycle. Unlike the instruction cache, the data cache's contents are always a proper subset of the global cache, with coherency maintained by hardware. All floating-point loads and stores bypass the on-chip data cache and go directly to the off-chip global cache after they're translated to physical addresses in the TLB.
The global cache, which can be anywhere from 1 to 16 MB in size, directly feeds the R8010 FPU and acts as local memory in multiprocessor systems. To reduce thrashing problems (such as during repetitive matrix processing), the global cache is four-way set-associative, with a sector size configurable to between 32 and 128 bytes.
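Why associativity helps matrix code can be seen with a toy set-index calculation. The configuration below is hypothetical (a 2-MB cache with four ways of 128-byte sectors), chosen only to make the arithmetic concrete:

```python
WAYS = 4
SECTOR_BYTES = 128
CACHE_BYTES = 2 * 1024 * 1024
SETS = CACHE_BYTES // (WAYS * SECTOR_BYTES)   # 4096 sets

def set_index(addr):
    # which cache set a physical address maps to
    return (addr // SECTOR_BYTES) % SETS

# Matrix rows a power-of-two stride apart all land in the same set:
stride = SETS * SECTOR_BYTES
rows = [set_index(r * stride) for r in range(4)]
# A direct-mapped cache would thrash as each row evicts the last;
# four ways can hold all four rows at once.
```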
To meet its floating-point performance target, Mips needed to achieve a bandwidth of over 1 GBps from the global cache; this required some drastic steps. The first was to implement the cache in what was at design time (1990) an exotic new breed of memory, the SSRAM. These chips integrate input and output registers onto the chip, so access is internally pipelined into three cycles: address setup, RAM access, and data output.
The R8000 global cache is interleaved in two banks of 64-bit-wide SSRAM cycling at 10 to 12 nanoseconds, for a total bandwidth of 1.2 GBps. This, however, does not take into account set associativity, which, if implemented in the most straightforward way (i.e., reading all four sets in parallel), could quadruple the bandwidth requirement again. To avoid this, Mips designed a custom four-way-associative tag RAM, in which addresses must be looked up before the SSRAMs are accessed.
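The quoted 1.2 GBps follows from the bank organization, assuming each bank delivers one 64-bit access per 75-MHz processor cycle (the 10-to-12-ns SSRAM cycle comfortably covers the 13.3-ns clock period):

```python
BANKS = 2
BYTES_PER_ACCESS = 8         # each bank is 64 bits wide
CLOCK_HZ = 75_000_000

bandwidth_gbps = BANKS * BYTES_PER_ACCESS * CLOCK_HZ / 1e9
print(bandwidth_gbps)        # 1.2
```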
The end result is a five-stage pipeline between the R8000 and the global cache: cycle 1, address into tag RAM; cycle 2, tag lookup; cycle 3, signal cross from tag RAM chip to SSRAM chip; cycle 4, SSRAM access; and cycle 5, data out to the R8000 or R8010. The cost of a miss in the on-chip data cache is, therefore, seven cycles (five for the pipeline plus two loads to fill a line), while an instruction-cache miss costs 11 cycles, as it must also pass through the TLB and branch-prediction cache.
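The data-cache miss cost decomposes as follows (the five-stage breakdown is from the text; treating the two 16-byte fill chunks as two extra cycles is my reading of it):

```python
CACHE_PIPELINE = 5   # tag address, tag lookup, chip crossing, SSRAM access, data out
FILL_BEATS = 2       # a 32-byte line arrives as two 16-byte chunks

data_cache_miss = CACHE_PIPELINE + FILL_BEATS
print(data_cache_miss)   # 7 cycles, matching the figure in the text
```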
The main motivation for uncoupling the R8000 from its R8010 FPU was to hide this five-cycle external cache latency from integer code. In a tightly coupled design, each floating-point load could cast a 20-instruction shadow (5 cycles at 4 instructions per cycle), whereas in the actual design it casts none.
One problem with interleaved caches is that two data references might attempt to access the same bank during the same cycle. Even the smartest compiler cannot always foresee such conflicts, which threaten to halve the available bandwidth by stalling one of the references. Mips provides special hardware--a one-entry queue coupled to a crossbar called the address bellow--that has the ability to delay one of each pair of cache references to improve the chances of the ideal odd-even, odd-even sequence.
For example, a pathological sequence such as odd-odd, even-even, odd-odd, even-even would normally stall on one reference of each pair and run at 50 percent efficiency, but the address bellow delays one reference (the one marked with an asterisk) to yield odd-stall, *odd-even, odd-even, odd-even, in which only a single cycle is lost. The address bellow can resolve only local conflicts, however, and it relies on the compiler to generate a sensible global mix of odd-even references.
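The effect of the bellow is easy to model. The sketch below is a behavioral toy, not the real hardware: it assumes references arrive two per program-order pair and that the bellow holds exactly one delayed reference:

```python
def cycles_without_bellow(refs):
    """Issue refs strictly in program-order pairs; a same-bank pair
    stalls one reference, costing an extra cycle."""
    cycles = 0
    for i in range(0, len(refs), 2):
        pair = refs[i:i + 2]
        cycles += 2 if len(pair) == 2 and pair[0] == pair[1] else 1
    return cycles

def cycles_with_bellow(refs):
    """One-entry delay queue: a conflicting reference is held one
    cycle and paired with the next reference to the other bank."""
    queue = None     # the single delayed reference, if any
    i = 0            # next reference in program order
    cycles = 0
    while i < len(refs) or queue is not None:
        # gather up to two candidates: the queued ref plus stream refs
        cands = []
        if queue is not None:
            cands.append(queue)
            queue = None
        while len(cands) < 2 and i < len(refs):
            cands.append(refs[i])
            i += 1
        # at most one reference per bank can issue this cycle
        if len(cands) == 2 and cands[0] == cands[1]:
            queue = cands[1]   # delay the second into the bellow
        cycles += 1
    return cycles
```

For the pathological stream above ('o' and 'e' marking odd and even banks), the model counts 8 cycles without the bellow and 5 with it, against an ideal of 4--one cycle lost, as the article describes.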
The R8000 design uses a relatively simple scheme to maintain consistency between the on-chip contents and the global cache's contents. First, the on-chip integer-data cache was made write-through with respect to the global cache; there is enough write bandwidth for the global cache to absorb write-through at full speed without buffering. This reduces the amount of chip logic and also simplifies multiprocessor support, since you can invalidate a data-cache line at any time with no pending write-backs to worry about.
Integer-cache write-throughs can potentially steal bandwidth that is needed for floating-point loads and stores, but Mips claims this is seldom a problem with real-world code. The global cache itself is write-back and uses a snooping protocol to support multiprocessor buses.
This split-level cache poses a special problem of consistency between integer and floating-point data. Suppose, for example, that location A is first written by an integer operation in both caches and then rewritten by a floating-point operation in the global cache only; a subsequent integer load would fetch stale data from the on-chip cache. Simply invalidating the on-chip cache line whenever a floating-point store occurs would provide a cure, but at the cost of causing severe cache thrashing during the processing of data structures that contain both floating-point and integer data within the same cache line. This is quite common, especially in graphics applications.
Instead, Mips employed an invalidation scheme whose granularity is finer than a whole cache line. Every 32-bit word in the on-chip integer-data cache has a valid bit attached. In a newly filled cache line, all these bits are set, but a floating-point store to the global cache causes the corresponding bits in the on-chip cache to be cleared. Integer stores of 32 or 64 bits set their corresponding valid bits, while integer stores of smaller size (e.g., bytes) and integer loads cause a cache miss if the appropriate bit is not set.
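A behavioral model of the per-word valid bits might look like this (the line geometry follows the text; the class and method names are inventions for illustration):

```python
WORDS_PER_LINE = 8   # a 32-byte line holds eight 32-bit words

class IntDataCacheLine:
    """Toy model of one on-chip integer-data-cache line with a
    valid bit per 32-bit word."""

    def __init__(self, data):
        self.data = list(data)
        self.valid = [True] * WORDS_PER_LINE   # all bits set on fill

    def fp_store(self, word):
        # FP stores bypass this cache and write the global cache only;
        # the now-stale on-chip word is just marked invalid
        self.valid[word] = False

    def int_store_word(self, word, value):
        # a full 32-bit integer store rewrites the whole word,
        # so its valid bit can be set again
        self.data[word] = value
        self.valid[word] = True

    def int_load(self, word):
        # loads (and sub-word stores) of a stale word must miss
        # and refill the line from the global cache
        if not self.valid[word]:
            raise LookupError("miss: refill from global cache")
        return self.data[word]
```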
Like the Hewlett-Packard PA-RISC 7200 I wrote about last month (see ``A Different Kind of RISC,'' August BYTE), the Mips R8000 design is evidence that RISC designers are looking beyond benchmark performance to the difficult real-world computing problems that are, at present, fodder for supercomputers. The R8000 looks like it's set to be the floating-point performance winner for a while, but it's hard for me to shake off the nagging feeling that the multichip implementation raises some of the very problems it solves and that in the future we will see the R8000 chip set shrunk down to a single die, just as the IBM RS/6000 shrank to become the PowerPC.
Figure: R8000 Microarchitecture
Consisting of two logic chips and two dual-ported RAM chips, the R8000 has two high-powered floating-point pipelines and enough bandwidth to keep them fully stocked with instructions and data. The external global cache must be implemented in SSRAMs.
Dick Pountain is a BYTE contributing editor based in London. You can reach him on the Internet or BIX at