Getting higher performance through greater instruction parallelism
The ultimate expression of the RISC philosophy must be the TTA (Transport-Triggered Architecture), in which there's just one instruction: MOVE. As the name suggests, in a TTA design it's the movement of data from one unit to another that triggers computation. TTAs rely on smart compilers rather than hardware to schedule instructions, allocate resources, and handle dependencies. A research team led by Henk Corporaal at the Delft University of Technology in the Netherlands recently built an experimental 32-bit TTA processor.
You program a traditional processor by specifying operations that move data between registers and modify it in ALUs--for example, an ADD operation typically moves two operands and one result. Such machines (which include both CISCs and RISCs) can be described as having OTAs (Operation-Triggered Architectures). A TTA machine breaks these operations down further into individual data moves, so that an ADD becomes three separate instructions. This finer-grained approach can offer higher performance through greater instruction parallelism.
A TTA processor consists of several FUs (function units), a bank of GPRs (general-purpose registers), and some transports (i.e., buses) that move data from unit to unit (see ``A Transport-Triggered Processor''). The transports constitute a communications network between the units, which might be fully or only partially connected.
The registers of a TTA fall into four categories: operand registers, trigger registers, result registers, and GPRs. The first three types belong to function units--in an OTA machine they would be called latches, but in a TTA they have names and are visible to software. Only the GPRs correspond exactly to the register file of an OTA machine.
An operation on a TTA begins by loading operands into an FU's operand registers. The last operand goes into the trigger register, which signals the FU that all the operands are ready, and the operation starts. The result goes into the FU's result register. You must store the result in a GPR if you need to keep it for more than one cycle. For example, the TTA code to add two numbers might look like this:
r2 -> add_o
r3 -> add_t
add_r -> r4
The arrows represent the TTA's single MOVE instruction. The first statement moves an operand from the GPR called r2 into the add unit's operand register. The second line puts the second operand into add's trigger register (add_t) and causes the addition to occur, leaving the result in the add unit's result register (add_r). The last line saves this result into GPR r4.
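The trigger semantics described above can be sketched in a few lines of Python. This is purely an illustrative model--not the Delft hardware--and the class, method, and register names are invented for the example:

```python
# Illustrative model of a TTA function unit: writing the trigger
# register is what starts the operation (names invented for the sketch).
class FunctionUnit:
    """One FU with an operand register, a trigger register, and a result register."""
    def __init__(self, operation):
        self.operation = operation
        self.operand = None              # corresponds to add_o
        self.result = None               # corresponds to add_r

    def move_operand(self, value):       # r2 -> add_o
        self.operand = value

    def move_trigger(self, value):       # r3 -> add_t: this move fires the FU
        self.result = self.operation(self.operand, value)

# A small GPR file and the three moves of the ADD example.
gpr = {"r2": 5, "r3": 7, "r4": None}
add = FunctionUnit(lambda a, b: a + b)

add.move_operand(gpr["r2"])   # r2 -> add_o
add.move_trigger(gpr["r3"])   # r3 -> add_t (the addition happens here)
gpr["r4"] = add.result        # add_r -> r4
print(gpr["r4"])              # -> 12
```

The point of the model is that no explicit ADD instruction exists; the computation is a side effect of the second move.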
Two points need to be made here. First, though the trigger move can't be performed before the other operand moves, it can be made in the same cycle. In principle, you can perform as many independent moves per cycle as there are transports. Second, the time interval between the trigger and result moves can't be less than the add unit's latency. However, unlike a superscalar RISC in which units share pipelines, each unit in a TTA has its own internal pipeline, and it's the compiler's job to space out the instructions to account for the latency of each unit.
Control flow in a TTA is managed by ``guards'' that determine whether or not a move will actually be executed, based on the values of special 1-bit predicate registers (like status bits in an OTA). Comparison FUs set the values of these bits. The program line if r2 = r3 then goto label compiles into the following moves:
r2 -> eq_o ; load operand into `equals' unit
r3 -> eq_t ; second operand triggers comparison
eq_r -> b4 ; result into predicate register b4
b4 ? label -> pc ; if b4 then jump to label
Jumps are performed by moving an address directly into the program counter, which is exposed to software in a TTA. The last line shows b4 guarding such a move. TTAs also make the instruction register visible, to allow the fetching of long immediate values from the instruction stream. Short immediate values are specified directly in a move, as in 48 -> r2.
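A guarded move amounts to a conditional write: the move happens only if the predicate bit is set. A minimal sketch in Python, with the machine state reduced to a dictionary and the addresses invented for the example:

```python
# Sketch of a guarded move: the 1-bit predicate gates whether the
# destination (here the program counter) is written at all.
def guarded_move(predicate, value, dest, state):
    """Perform dest <- value only when the predicate bit is set."""
    if predicate:
        state[dest] = value
    return state

state = {"pc": 100, "b4": True}   # b4 was set earlier by the comparison FU
label = 200                        # jump target address (made up for the sketch)

guarded_move(state["b4"], label, "pc", state)   # b4 ? label -> pc
print(state["pc"])                 # -> 200: the jump was taken
```

If b4 had been clear, the move would simply not occur and execution would fall through to the next instruction.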
The TTA concept offers several advantages from a hardware designer's point of view, the most important of which are faster clocks and the potential to increase the utilization of the FUs compared to superscalar RISC designs.
Superscalar RISC designs are already showing signs of diminishing returns; as they issue more instructions per cycle, the hardware required to detect dependencies and resource conflicts between operations becomes more complex and consumes more chip area. A TTA design dispenses with all this complex scheduling and issue logic. Furthermore, the separation of the FUs from the transport network means each FU's pipeline can be optimized for the best cycle time. The transport network itself may be superpipelined, in which case the only lower limit on the chip's cycle time is the register-to-register transfer time across the bus.
Superscalar OTA machines also have to provide abundant transport bandwidth because they must cater to the worst case of three buses (two operands and one result) and three register ports for each operation issued per instruction cycle, even though most instructions will not use them all. In a TTA machine, the separation of transport from operation means that the FUs share a common pool of transports, so the total bandwidth required is closer to the actual utilization than to the worst case. Put another way, the same amount of metal spent on the transport network of a TTA serves more FUs than it would in a superscalar RISC machine. This separation also makes TTAs more inherently scalable than OTAs because adding more FUs and transports becomes an almost mechanical copying process, which adds little to the complexity of the design.
In addition to these hardware advantages, TTAs make possible several new classes of compiler optimization that are not available to OTAs. The finer-grained instructions allow greater scheduling freedom, so that the compiler might interleave whole operations by bringing forward an operand move from a future operation to fill the latency gap before the current operation's result move.
Register bypassing, which current OTA RISC designs perform in hardware, becomes a software issue in a TTA. The compiler can move the result of an operation directly from the FU that produced it without waiting for it to be stored in a GPR, as in the following:
add_r -> mul_o
add_r -> r2
If the compiler can see that all the uses of a result are bypassed in this way, then it will eliminate that second result-to-register move altogether, a form of dead-code elimination. Similarly, if an FU uses the same value in a series of operations (e.g., within a loop), the compiler can eliminate all but the first operand move. This is called operand sharing, a special case of common subexpression elimination.
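Operand sharing is easy to picture with the same kind of toy FU model: the operand register is loaded once before the loop, and only trigger moves occur inside it. Again, this is an invented sketch, not the Delft compiler's output:

```python
# Sketch of operand sharing: one operand move before the loop,
# then only trigger moves inside it (model invented for illustration).
class FunctionUnit:
    """Minimal FU: operand and result registers; the trigger starts the op."""
    def __init__(self, operation):
        self.operation = operation
        self.operand = None
        self.result = None

    def move_operand(self, value):
        self.operand = value

    def move_trigger(self, value):
        self.result = self.operation(self.operand, value)

mul = FunctionUnit(lambda a, b: a * b)
scale = 10                       # a loop-invariant value, shared by every iteration
results = []

mul.move_operand(scale)          # single operand move, hoisted out of the loop...
for x in (1, 2, 3):
    mul.move_trigger(x)          # ...so the loop body needs only trigger moves
    results.append(mul.result)
print(results)                   # -> [10, 20, 30]
```

The shared value never touches a GPR inside the loop, which is exactly why a TTA can make do with a smaller register file.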
As a result of these optimizations, a TTA machine can get by with fewer GPRs than an equivalent OTA, because it will frequently store temporary results in operand registers or pipeline stages without occupying a GPR.
As you might have guessed by now, compilers for TTA machines need to be very clever indeed in order to find sufficient instruction-level parallelism and take advantage of all those optimization opportunities. TTA machines represent the end point of a historical sequence from CISC to RISC to VLIW (Very Long Instruction Word) computers, in which tasks once performed by microcode or hardware on the chip (i.e., instruction decode, scheduling, register allocation, and dependence checks) have been progressively off-loaded onto the compiler.
In particular, compilers for TTA machines need to examine a much wider scope than the single basic block inspected by many conventional compilers. (A basic block is a sequence of consecutive statements that flow of control enters at the beginning and leaves at the end, with no possibility of halting or branching except at the end.) The modified Gnu C/C++ compiler under development at Delft works at the level of ``regions,'' which are groups of basic blocks corresponding to higher-level structures such as loops. The scheduler visits each basic block in the region in topological order, scheduling each block with a simple list algorithm and then inspecting all the other basic blocks reachable from the current one to see whether any of their operations can be pulled back into it.
The modified Gnu C/C++ compiler begins scheduling by placing a trigger move in the earliest cycle in which it finds a suitable FU and a move bus available. The operand and result moves are then placed, subject to checks on pipeline depth, register availability, and instruction ordering (i.e., operand, then trigger, then result). Should a check fail, a move may have to be shifted to an earlier (for operands) or later (for results) cycle. If this doesn't work, then the scheduler must abandon the attempt and backtrack by rescheduling the original trigger move one cycle later.
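The shape of that search can be sketched with a toy scheduler. This is a drastic simplification invented for illustration--one bus, and only the trigger and result moves are modeled--but it shows the backtracking idea: if a placement fails its checks, retry with the trigger one cycle later.

```python
# Toy sketch of backtracking placement (not the Delft algorithm):
# place the trigger in the earliest cycle with a free bus slot, put the
# result 'latency' cycles later, and if either slot is full, try the
# trigger one cycle later and repeat.
def schedule_op(busy, latency, n_buses=1, max_cycle=32):
    """Return (trigger_cycle, result_cycle), or None if the op won't fit."""
    for t in range(max_cycle):
        r = t + latency
        # both the trigger move and the result move need a free bus slot
        if busy.get(t, 0) < n_buses and busy.get(r, 0) < n_buses:
            busy[t] = busy.get(t, 0) + 1
            busy[r] = busy.get(r, 0) + 1
            return t, r
    return None   # backtracking exhausted within the window

busy = {0: 1}                         # cycle 0's single bus is already taken
print(schedule_op(busy, latency=2))   # -> (1, 3): trigger at cycle 1, result at 3
```

The real scheduler must also check pipeline depth, register availability, and operand placement, but the retry-one-cycle-later loop is the essential move.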
This sort of complex behavior belongs in the territory of AI programming techniques, so it's no coincidence that I first heard of TTAs from a compiler writer at a workshop for functional-language developers.
The Future of TTA
The team at Delft has produced an experimental 32-bit integer TTA processor with eight function units and four buses, called MOVE32INT. Even implemented in a conservative 2-micron process, MOVE32INT runs at 80 MHz, and benchmarking suggests that it's between 25 and 50 percent faster than a superscalar OTA with equivalent FUs.
However, speed is not the only consideration, and many of the same objections leveled at VLIW also apply to TTA designs. Principally, it's hard to preserve binary-code compatibility between successive processor generations because the code is so intimately tied to the hardware. MOVE32INT is actually part of ongoing research into the automatic generation of ASIC (application-specific IC) layouts from software specifications, and was never intended as a general-purpose CPU.
Nevertheless, it's known that Intel and H-P have both done research into TTAs, and that one of the Delft team, Dr. J. M. Mulder, now works at Intel on the P7 project, so it's not inconceivable that some TTA ideas might find their way into the joint Intel/H-P VLIW products.
[Figure: A Transport-Triggered Processor]
A TTA processor consists of wholly independent FUs connected by a transport network. Scheduling moves between units becomes a job for the compiler, not for hardware.
Dick Pountain is a BYTE contributing editor based in London. You can reach him on the Internet or BIX at