HP's elegant new implementation of its PA-RISC architecture delivers world-class performance
Dick Pountain
When people argue about RISC architectures nowadays, Hewlett-Packard's PA-RISC is unlikely to figure prominently in the discussion. PA-RISC chips have a lower profile than the PowerPC, Mips, or DEC Alpha chips, because HP has so far kept them almost to itself. The company doesn't sell its PA-RISC chips on the open merchant market; instead, it sells only to partners in its PRO (Precision RISC Organization). HP has also been relatively slow in licensing second sources.
The irony of this situation is that the recently announced PA-RISC 7200 (HP's ninth implementation of the architecture) is likely to hold the "fastest RISC in town" title for the immediate future, at least until the PowerPC 620 and Mips T5 come on stream next year. This becomes even more impressive when you realize that the 7200's superscalar design is far less aggressive than that of its competitors. Nevertheless, it is expected to top 175 SPECint92 and 250 SPECfp92, just bettering the Alpha 21064A's 170 SPECint92 rating. But raw SPECmarks are perhaps less appropriate than usual for measuring the 7200, because HP has clearly stated that its aim is to optimize the PA-RISC architecture for the real-world applications that its workstation customers run--mainly scientific and commercial transaction processing on huge data sets--rather than for the best benchmark figures.
A splendid sentiment, and one that can't be dismissed as mere manufacturer's hype because the technical details support it. In the 7200 implementation, HP's design team has concentrated on an artful cache design and a fast new memory bus, rather than on the multiple instruction issue and fancy branch prediction that the competition focuses on. Combined, the new design and faster bus will tend to accelerate large programs and data sets that don't fit in the cache.
Inside the 7200
Fabricated in HP's new three-metal 0.55-micron CMOS process, the 7200 is designed to run at up to 140 MHz. Its 540-pin ceramic PGA (Pin Grid Array) package is truly gigantic. This pin count reflects the fact that like its predecessors, the 7200 supports external data and instruction caches with separate 64-bit interfaces. It also includes a 64-bit interface to the new high-bandwidth Runway bus. The chip's RISC core operates at an unusual 4.4 V but the I/O circuitry works at 3.3 V; power dissipation is expected to be up to 29 W at 140 MHz.
By current standards, the 7200 is only a modestly superscalar design. It can issue two operations per cycle to its two integer units and one FPU. The instructions are classified into three groups: integer, load/store, and floating point. You can pair any two instructions from different groups, or two from the integer group. Branches are treated as special integer operations that may be paired with their predecessor but not with their successor. Branch instructions employ static branch prediction.
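The pairing rules above can be sketched as a small predicate. This is only an illustration of the grouping rules as the text states them; the names `Group` and `can_pair` are hypothetical, and the sketch deliberately ignores the data-dependency and resource-conflict checks that the hardware also performs.

```python
from enum import Enum


class Group(Enum):
    INTEGER = "integer"
    LOAD_STORE = "load/store"
    FLOAT = "floating point"


def can_pair(first, second, first_is_branch=False):
    """Return True if two adjacent instructions may issue in the same cycle.

    `first` and `second` are the groups of the two instructions in program
    order. A branch is treated as a special INTEGER operation.
    """
    # A branch may pair with its predecessor but not its successor,
    # so it can only ever occupy the second slot of a pair.
    if first_is_branch:
        return False
    # Two integer operations can share a cycle (there are two integer units).
    if first is Group.INTEGER and second is Group.INTEGER:
        return True
    # Otherwise the two instructions must come from different groups.
    return first is not second


print(can_pair(Group.INTEGER, Group.FLOAT))          # True
print(can_pair(Group.LOAD_STORE, Group.LOAD_STORE))  # False
```

Passing `first_is_branch=True` models the branch restriction: a branch in the first slot always issues alone.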
The 7200's five-stage execution pipeline is designed to minimize the stall penalties caused by data, control, and fetch dependencies between instructions; you incur only a one-cycle penalty for a mispredicted branch, for immediately using a floating-point result, and for store/load or load/use combinations. Unlike in previous PA-RISC chips, store/store incurs no penalty, as the off-chip SRAM (static RAM) cache now cycles at full processor frequency.
To keep the pipeline flowing as smoothly as possible, instructions with data dependencies and resource conflicts should not be paired. The 7200 uses hardware checking for dependencies, but to save time, it performs some of this work as the instructions are loaded from memory into the instruction cache. Six extra predecode bits are stored with each pair of instructions in the cache to encode this information. On their own, these predecode bits don't completely specify whether the instructions can be paired, but they enable the final checks made in the pipeline to be fast enough so that instruction decode/issue is never prolonged beyond one cycle. The predecode bits add about 10 percent to the SRAM overhead.
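The 10 percent figure can be checked with a line of arithmetic. The only fact assumed beyond the text is that PA-RISC instructions are fixed-length 32-bit words, so a pair occupies 64 bits.

```python
INSTRUCTION_BITS = 32        # PA-RISC instructions are fixed-length 32-bit words
PREDECODE_BITS_PER_PAIR = 6  # from the text: six bits per pair of instructions

pair_bits = 2 * INSTRUCTION_BITS
overhead = PREDECODE_BITS_PER_PAIR / pair_bits

print(f"{overhead:.1%}")  # 9.4%
```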
As with its PA-RISC predecessors, the 7200 uses off-chip caching; however, its main innovation is an on-chip assist cache that makes the caching system much more efficient. The 7200 also separates its instruction and data caches (up to 1 MB each) in place of the single unified cache that the 7100 uses. These caches have to be built from the fastest SRAM and must be able to cycle at full processor speed, which means a 6-nanosecond access time at speeds of greater than 120 MHz. Because such memory is expensive (and hard to source), it increases system costs.
The 7200's assist cache is a 2-KB on-chip memory that holds 64 32-byte cache lines and is fully associative, storing the full address of the last 64 memory accesses. Full associativity requires a lot of lookup logic and is too expensive for all but the smallest of caches. In contrast, both off-chip caches are direct-mapped, which means that many main memory locations map to the same cache line. Direct mapping is inexpensive and fast, because the logic need only inspect one line to look for a hit. But it suffers badly from "thrashing" if your program continually accesses several different addresses that all happen to map to the same cache index, which can happen easily in vector calculations.
For example, in the following vector calculation
FOR i := 0 TO n DO
    A[i] := B[i] + C[i] + D[i]
it is possible for elements A[i], B[i], C[i], and D[i] to map to the same physical cache location. A direct-mapped cache will thrash by reloading the same line as each element is accessed, with a devastating performance penalty of four cache misses per iteration of the loop. Larger cache size can't help this problem but greater associativity can.
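A toy simulation makes the penalty concrete. This is a minimal sketch, not a model of the real hardware: the 256-KB cache size, the 8-byte element size, and the array base addresses are assumptions chosen for illustration, and reads and writes are treated alike (every access allocates a line). Only the 32-byte line size comes from the article.

```python
LINE_BYTES = 32             # line size quoted for the 7200's caches
CACHE_BYTES = 256 * 1024    # assumed cache size, for illustration only
NUM_LINES = CACHE_BYTES // LINE_BYTES
ELEM_BYTES = 8              # assumed double-precision elements


def count_misses(bases, n):
    """Run the vector loop on a direct-mapped cache and return total misses."""
    tags = [None] * NUM_LINES           # the one line resident at each index
    misses = 0
    for i in range(n):
        for base in bases:              # one access per array per iteration
            line = (base + i * ELEM_BYTES) // LINE_BYTES
            index = line % NUM_LINES    # direct mapping: one possible slot
            if tags[index] != line:
                misses += 1
                tags[index] = line      # evict whatever was there
    return misses


n = 1024
# Worst case: the arrays sit an exact multiple of the cache size apart,
# so element i of every array maps to the same cache index.
aliased = [k * CACHE_BYTES for k in range(4)]
# Shifting each array by one extra line removes the collisions.
staggered = [k * (CACHE_BYTES + LINE_BYTES) for k in range(4)]

print(count_misses(aliased, n) / n)    # 4.0 misses per iteration
print(count_misses(staggered, n) / n)  # 1.0 misses per iteration
```

In the staggered case each 32-byte line is fetched once and then serves four consecutive elements, which is the best a cold cache can do. In the aliased case every single access misses, whatever the cache size.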
The assist cache sits between main memory and the off-chip primary data cache. Lines from memory move through the assist cache in FIFO (first-in/first-out) order into the data cache. In effect, the assist cache acts as an overflow queue for the primary cache. It would eliminate the thrashing described above because each line can move into the assist cache without displacing the others. Both the primary and assist caches respond in a single cycle, and they behave like a single logical cache whose associativity varies dynamically with the data. The assist cache might hold 64 lines that map to the same primary cache line, or 64 different primary cache lines, or anything in between. When a processing unit requests data from the cache, 65 entries (i.e., 64 assist cache entries plus one main cache entry) get searched for a match. This work needs to be done inside one cycle, and HP had to use the fastest self-timed logic for the assist cache's lookup circuitry. In effect, the assist cache combines the high associativity of an on-chip cache with the large size of an off-chip cache. HP is so pleased with the result that it's patenting the assist cache.
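The behavior described above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not HP's implementation: the class name `AssistedCache`, the 256-KB primary cache size, the 8-byte elements, and the worst-case array placement are all hypothetical, and the "spatial locality only" hint is not modeled. New lines enter a 64-entry FIFO; the oldest line spills into the direct-mapped primary cache when the FIFO is full; a lookup checks both.

```python
from collections import OrderedDict

LINE_BYTES = 32
ASSIST_LINES = 64           # 64 fully associative entries, as in the text
CACHE_BYTES = 256 * 1024    # assumed primary cache size, for illustration
NUM_LINES = CACHE_BYTES // LINE_BYTES
ELEM_BYTES = 8              # assumed double-precision elements


class AssistedCache:
    """Direct-mapped primary cache fronted by an optional FIFO assist cache."""

    def __init__(self, use_assist=True):
        self.primary = [None] * NUM_LINES
        self.assist = OrderedDict()     # insertion order doubles as FIFO order
        self.use_assist = use_assist
        self.misses = 0

    def access(self, address):
        line = address // LINE_BYTES
        index = line % NUM_LINES
        # One logical lookup: the primary entry plus every assist entry.
        if self.primary[index] == line or line in self.assist:
            return
        self.misses += 1
        if not self.use_assist:
            self.primary[index] = line
            return
        if len(self.assist) == ASSIST_LINES:
            # The oldest assist line moves on into the primary cache.
            oldest, _ = self.assist.popitem(last=False)
            self.primary[oldest % NUM_LINES] = oldest
        self.assist[line] = True


def misses_per_iteration(use_assist, n=1024):
    """Run A[i] := B[i] + C[i] + D[i] with all four arrays aliased."""
    cache = AssistedCache(use_assist)
    bases = [k * CACHE_BYTES for k in range(4)]   # worst-case placement
    for i in range(n):
        for base in bases:
            cache.access(base + i * ELEM_BYTES)
    return cache.misses / n


print(misses_per_iteration(use_assist=False))  # 4.0
print(misses_per_iteration(use_assist=True))   # 1.0
```

With the assist cache enabled, the four conflicting lines coexist in the FIFO, so each line misses only once, when it is first touched.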
Another twist is a new "spatial locality only" hint bit you can incorporate into the encoding of load/store instructions. The hint bit tells the assist cache that the data will be used only once, and that when the line needs to be replaced, it should write the data straight back to main memory (bypassing the off-chip cache). This enables efficient processing of long sequences of contiguous data without polluting the primary cache's temporally local data (i.e., variables that are being used repeatedly).
The 7200 uses simple but effective prefetch strategies for both instructions and data, which can often hide the penalties caused by cache misses and memory latency. When the instruction cache misses, it fetches not just the missing line but the next line, too. When such a prefetched line is accessed for the first time, the line following it is fetched in turn, even if another prefetch is still in progress--up to four prefetches can be outstanding. This results in significant speedups on long linear code sequences, but you can turn it off for programs with short routines and many branches.
Data is prefetched explicitly (i.e., by instructing a load to register zero) or automatically whenever an instruction that modifies a base register address is executed. For example, the load-word-indexed instruction LDWX,m R1(R2),R3 loads R3 from the address held in R2 and then post-increments R2 by adding R1 to it. If this instruction causes a data-cache miss, then the 7200 is smart enough to prefetch from R2+R1 (rather than from R2+1) after it fills the missing line; it takes note of the "stride" of the indexed load.
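The register effects of that instruction, and the prefetch address the text describes, can be sketched as follows. The function `ldwx_m` and the dictionary-based register file and memory are hypothetical illustrations. The sketch reads "R2+R1" as the old base plus the index, which is the address the next execution of the same load will use. It does not model whether the access actually missed.

```python
def ldwx_m(regs, memory, index, base, target):
    """Model LDWX,m index(base),target as described in the text.

    Loads `target` from the address in `base`, then post-increments
    `base` by the value in `index`. Returns the stride-aware prefetch
    candidate (old base + index).
    """
    address = regs[base]                # the load uses the unmodified base
    stride = regs[index]
    regs[target] = memory.get(address, 0)
    regs[base] = address + stride       # post-increment by the index register
    return address + stride             # where the next such load will land


regs = {"R1": 4096, "R2": 0x10000, "R3": 0}
memory = {0x10000: 42}

prefetch = ldwx_m(regs, memory, index="R1", base="R2", target="R3")
print(regs["R3"], hex(regs["R2"]), hex(prefetch))  # 42 0x11000 0x11000
```

With a 4,096-byte stride, a simple next-line prefetch would fetch data the loop never touches. Following the stride fetches the line the next iteration needs.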
The Runway Bus
To make full use of its efficient caches, the 7200 needed a high-bandwidth data path into memory--hence, the new Runway bus. This proprietary synchronous 64-bit bus runs at 120 MHz; however, it supports CPU-to-bus clock ratios of 1-to-1, 3-to-2, and 4-to-3, so that the CPU can be clocked faster than the bus. It employs a distributed arbitration scheme where each device attached to the bus contains its own arbiter logic, and arbitration proceeds in parallel with data transfer along separate wires.
The Runway bus uses a split transaction protocol in which up to six transactions can be pending at once, so the bus is available even while waiting for memory to deliver. Each transaction is labeled with an identification code--carried via yet another set of signal wires--so each device can sort out its own return data from the stream. The Runway bus multiplexes address and data at the cost of one address cycle for every four data cycles, making for a total sustainable bandwidth of 768 MBps. That's an impressive figure, not only three times faster than HP's own previous processor bus but faster than Sun Microsystems' advanced XDbus and pushing up into supercomputer territory.
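The sustained figure follows directly from the numbers given: a 64-bit bus at 120 MHz that spends one cycle in five on addresses. A quick check, using only those values:

```python
BUS_WIDTH_BYTES = 8          # 64-bit multiplexed address/data path
BUS_CLOCK_HZ = 120_000_000   # 120 MHz
DATA_CYCLES = 4              # four data cycles ...
ADDRESS_CYCLES = 1           # ... for every address cycle

peak = BUS_WIDTH_BYTES * BUS_CLOCK_HZ
sustained = peak * DATA_CYCLES // (DATA_CYCLES + ADDRESS_CYCLES)

print(peak // 10**6, sustained // 10**6)  # 960 768
```

The peak rate is 960 MBps, and four-fifths of that is 768 MBps.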
More to the point, it's sufficient to support four 7200 chips in an SMP (symmetric multiprocessing) system without becoming a bottleneck. The bus interface supports a snooping cache coherency protocol, and to minimize the penalties for snooping on processor-to-cache bandwidth, the interface maintains deep coherency queues (up to 10 transactions for the main cache and three for the translation look-aside buffer, or TLB).
By building the bus interface onto the PA-RISC 7200 chip, HP will be able to build multiprocessor systems with a minimum of glue logic. In doing so, the company will keep the price and performance of its SMP workstations and servers highly competitive.
Illustration: The PA-RISC 7200. Unique in a number of ways, the PA-RISC architecture is best exemplified by its use of off-chip primary instruction and data caches. It integrates 1.3 million transistors onto a 210-mm² die.
Dick Pountain is a BYTE contributing editor based in London. You can reach him on the Internet or BIX at email@example.com.