# Parallel vs. Serial On-Chip Communication

Rostislav (Reuven) Dobkin rostikd@tx.technion.ac.il Arkadiy Morgenshtein arkadiy@tx.technion.ac.il Avinoam Kolodny kolodny@ee.technion.ac.il Ran Ginosar ran@ee.technion.ac.il

VLSI Systems Research Center, Electrical Engineering Department Technion – Israel Institute of Technology, Haifa, Israel

## ABSTRACT

Synchronous parallel links are widely used in modern VLSI designs for on-chip inter-module communication. Long-range parallel links occupy large area and incur high capacitive load, high leakage power and cross-coupling noise. These problems are exacerbated in applications that have low link utilization or that suffer from interconnect congestion. While standard synchronous serial links are unattractive due to limited bit-rate, novel high-performance serial links may change the balance. In this paper we show that novel serial links provide better performance than parallel links for long-range communication, beyond several millimeters. We analyze the technology dependence of link performance. We present an example for 65 nm technology and compare wave-pipelined and register-pipelined parallel links to a high-performance serial link in terms of bit-rate, power, area and latency.

## **Categories and Subject Descriptors**

B.4.3 [Interconnections (Subsystems)]: Topology (e.g., bus, point-to-point); Asynchronous/synchronous operation

General Terms: Performance, Design

## **1. INTRODUCTION**

Transistor size scaling drastically improves on-chip clock rates, practically doubling the performance every five years [1]. While local interconnect follows transistor scaling, global lines do not, challenging long range on-chip data communications in terms of latency, throughput and power [1]. In addition, as Systems-on-Chip (SoC) integrate an ever growing number of modules, on-chip inter-modular communications become congested and the modules must turn to serial interfaces, similar to the trend from parallel to serial inter-chip interconnects.

Long-range bit-parallel data links provide high data rates at the cost of large chip area, routing difficulty, noise and power. In addition, such links are often utilized only a small portion of the time, but dissipate leakage power at all times. Leakage is incurred at the line drivers and at the repeaters, which are often necessary for long interconnects [2][3]. Parallel link performance is bounded by available clock rate and by clock skew, delay uncertainty due to process variations, cross-talk noise, and layout geometries.

Bit-serial communications offer an alternative to bit-parallel interconnects, mitigating the issues of area, routability, and leakage power, since there are fewer wires, fewer line drivers, and fewer repeaters. However, to provide the same throughput as an *N*-bit parallel interconnect, the serial link must operate *N* times faster. Simple synchronous serial links that employ the system clock are incapable of providing the required throughput. Recently proposed novel wide-bandwidth serial link circuits [4]-[14], which operate faster than the system clock, may deliver the required bandwidth.

SLIP'08, April 5-6, 2008, Newcastle, United Kingdom.

Copyright 2008 ACM 978-1-59593-918-0/08/04...\$5.00.

Synchronous serial links are typically employed for off-chip communications, where pin-out limitations call for a minimal number of wires per link. Source-synchronous protocols are often used for these applications [15]-[20]. A common timing mechanism for serial interconnects injects a clock into the data stream at the transmitting side and recovers the clock at the receiver. Such clock-data recovery (CDR) circuits often require a power-hungry PLL, which may also take a long time to converge on the proper clock frequency and phase at the beginning of each transmission. If the receiver and transmitter operate in different clock domains, the transaction must also be synchronized at both ends, incurring additional delay and power. Alternatively, an asynchronous data link employs handshake instead of clocks. Traditional asynchronous protocols are relatively slow due to the need to acknowledge transitions [14][21]. In [22] asynchronous protocols share data lines, but their performance depends on wire delays.

High-speed serial schemes with a data cycle of a few gate delays (down to a single gate-delay cycle) have recently been proposed [4]-[14]. These fast schemes exploit wave-pipelining, low-swing differential signaling, fast clock generators and asynchronous protocols. In addition, these schemes require channel optimization to support wide-bandwidth data transmission over the link wires. A wave-front train serialization scheme was presented in [11]. The serializer is based on a chain of MUXes (similar to [23]). The link is single-ended and employs wave-pipelining. The link data cycle is approximately  $7 \cdot d_4$  (3 Gbps at 180 nm), where  $d_4$  is an inverter FO4 delay. A wave-pipelined multiplexed (WPM) routing technique was presented in [12][13]. WPM routing employs source-synchronous communication and its performance is limited by clock skew and delay variations. Employing low-voltage differential pairs for on-chip serial interconnect was discussed in [9][10], where data was sampled at the receiver without any attention to synchronization issues. A three-level voltage swing was presented in [24], requiring non-standard amplifiers.

Circuits that had originally been designed for off-chip communications [15][20] were adopted for on-chip serial link in [8]. An output-multiplexed transmitter is connected to a multiplexed receiver, requiring clock calibration at the receiver side. Both transmitter and receiver use multi-phase DLL circuits. The link employs low-swing differential signaling and transfers eight-bit words. The output-multiplexed architecture delivers better performance than input-multiplexing (down to  $2 \cdot d_4$  data cycle), but at the expense of much higher output capacitance (that grows linearly with the word-width). A fabricated chip demonstrated an operational 3mm link.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

In this paper we consider a different novel architecture [4][5] achieving a data cycle of a single gate delay  $(d_4)$  and throughput that is independent of the word width. This novel link is studied in comparison with parallel links that provide the same throughput (a preliminary comparison was presented in [25]). Such comparative analysis is of great importance for predicting system-level interconnect performance and is the main subject of this paper. We analyze the various costs of serial versus parallel links. Keeping bandwidth the same, we compare area, power and latency. We show that long range serial links outperform parallel links. The rest of the paper is structured as follows. In Section 2 we define the parallel and serial links under study. Section 3 provides analytical models for bit-rate, area, power and latency, and comparison results are presented in Section 4.

## 2. HIGH BIT-RATE PARALLEL AND SERIAL COMMUNICATION

In this section we define the parallel and serial links under study and explain why these specific architectures were selected.

A *Parallel Link* comprises at least *N* wires that can carry *N* bits simultaneously. Either no data or *N* bits traverse the link together and pass through any given cross section of the link at the same time. The data rate is  $F_{PAR} \cdot N$, where  $F_{PAR}$  is the rate at which the words are presented at the input of the link.

A *Serial Link* is one or more wires that are able to carry single-bit words. The bit is presented to the link at the transmitter side and is sampled subsequently at the receiver side. Either no data or a single bit traverses the link at any given cross section of the link. The bandwidth of the serial link is  $F_{SER}$ , where  $F_{SER}$  is the rate at which one-bit words are presented at the input of the link.

The serial link is a special case of the parallel link for N=1. In this study  $8 \le N \le 128$ .

Different implementations of parallel and serial links exist. Some are used more than others in actual chips. In this paper we study only a few representative architectures, as defined below.

## 2.1. Parallel Links

We consider two types of parallel links, *register-pipelined* and *wave-pipelined*.

The widely used "*register-pipelined*" parallel link is fully synchronous where the interconnect is considered as combinational logic between two registers. When the interconnect delay exceeds the clock cycle, the link is pipelined to yield the required bit-rate (Figure 1). The single clock is either generated globally or sent with the data from the transmitter (*source synchronous* communication). Interconnect delay is usually optimized by means of buffering (repeaters).

A primary drawback of the register-pipelined parallel link is the high cost of pipelining that is incurred when a high bit-rate is desired over a long range. Wave-pipelining [26]-[29] exploits buffers and wire delays instead of flip-flops. In a source-synchronous *wave-pipelined* parallel link (Figure 2), the bit rate is limited by the relative skew of the link wires rather than by the clock cycle. Multiple *N*-bit words can traverse the link simultaneously. The data is presented to the bus on each rising edge of CLKR and is sampled by a receiver register on each falling edge of CLKR. Wave-pipelining may improve the bit-rate relative to register-pipelined links [27].

Several enhancements and variations may be applied to the basic architectures of Figure 1 and Figure 2 to mitigate crosstalk and reduce power. Examples include shielding, interleaved bidirectional lines, asynchronous signaling, data encoding, staggered repeaters, special handling of worst-case transition patterns [30] and power-saving techniques [2]. While most variations result in minor performance differences, shielding may significantly affect performance. We consider the two extremes of shielding (in terms of achieved bit rates and required area): unshielded and fully-shielded wires (Figure 3).



Figure 1: Register-pipelined parallel link (no wave-pipelining)



Figure 3: Differently shielded links: (a) unshielded, (b) fully-shielded

## 2.2. A High Performance Serial Link

Standard on-chip serial links are unattractive due to their inferior bit-rates relative to the parallel link. With the same clock as the parallel link, the bit-rate of a standard synchronous serial link is limited to N times lower than the parallel link.



Figure 4: Serial communication scheme

A novel high bit-rate asynchronous serial link was presented in [4]-[7]. The link (Figure 4) employs low-latency synchronizers at the source and sink [31], two-phase NRZ Level Encoded Dual Rail (LEDR) data/strobe (DS) encoding, an asynchronous handshake protocol (allowing non-uniform delay intervals between successive bits) [32]-[34], a serializer and de-serializer, and line drivers and receivers. Acknowledgment is returned only once per word, rather than bit by bit, enabling multiple bits to travel in a wave-pipelined manner over the serial channel. The wires (D and S) employ (fully shielded) wave-guides, enabling multiple traveling signals. On a well-designed wave-guide, long wires may carry a number of successive bits simultaneously.

The minimal data cycle of the serial link is bounded by one  $d_4$  gate delay [4][5] due to the digital logic forming the serializer and de-serializer circuits, which consist of fast shift-registers that can deliver and consume one bit every  $d_4$  [4]. The *N*-bit shift-register consists of *N*-1 Transition-Latch (XL) stages [5]. The serial link channel consists of either two or four lines for single-ended or differential signaling, respectively. Although differential signaling is preferred for lower power and higher rates, in this paper we analyze only the single-ended case since all other circuits that are compared below are single-ended. For 65nm technology, the typical  $d_4$  gate delay is 15ps, resulting in a link data rate of 67 Gbps.
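As a quick numeric check of the figures quoted above, the following sketch converts an FO4 delay into a bit rate. The function name and the 8·$d_4$ comparison cycle are illustrative choices, not taken from the paper; only the 15 ps value for 65nm comes from the text.

```python
D4_65NM_S = 15e-12  # FO4 inverter delay for 65nm, as quoted in the text


def serial_bit_rate(d4_s: float, cycle_in_d4: float = 1.0) -> float:
    """Bit rate (bit/s) of a serial link whose data cycle lasts cycle_in_d4 FO4 delays."""
    if d4_s <= 0 or cycle_in_d4 <= 0:
        raise ValueError("delays must be positive")
    return 1.0 / (cycle_in_d4 * d4_s)


fast = serial_bit_rate(D4_65NM_S)          # one bit every d4
clocked = serial_bit_rate(D4_65NM_S, 8.0)  # one bit every 8*d4 (illustrative clocked link)
print(f"{fast / 1e9:.1f} Gbps vs {clocked / 1e9:.1f} Gbps")  # prints: 66.7 Gbps vs 8.3 Gbps
```

Under these assumptions a link clocked at 8·$d_4$ would need eight times as many wires to match the single-$d_4$ serial rate.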

## **3. ANALYTICAL MODELS**

This section presents an analytical study of the performance and cost functions of the parallel and serial links.

## 3.1. Bit-Rates

#### 1) Parallel Link Bit-Rate

The following factors bound the bit-rate B of the synchronous parallel link:

(a) Fastest available clock. The shortest clock cycle that can be generated using a ring oscillator is typically bounded by  $8 \cdot d_4$  [20], resulting in 8 GHz for 65nm technology [4]. However, most SoC modules operate at slower clock rates, e.g.  $11 \cdot d_4$  for modern fast processors [35] and  $100-400 \cdot d_4$  for standard SoC/ASIC (IC based on standard cells and designed using standard EDA tools) [36].

(b) Synchronization Latency. Data synchronization is required at the receiver side of the link. Synchronization latency depends on the relationship between transmitter and receiver clocks and on synchronizer architecture. The worst case relates to mutually asynchronous clocks, when synchronization may take several cycles. Faster synchronization is possible when using high performance synchronizers [37][38].

(c) Clock uncertainty. This is typically added to the design critical path. In source synchronous communication, clock uncertainty extends the minimal data cycle on the link, as explained below.

(d) Delay and Delay Uncertainty of the link. In register-pipelined links (Figure 1), both global clock cycle and delay uncertainty bound the link performance. In wave-pipelined links (Figure 2), the data rate is not bounded any more by the clock cycle, but only by delay uncertainty. A long parallel wave-pipelined link may carry multiple words simultaneously when its delay is longer than the transmitter clock cycle. The delay uncertainty of the link results from the following factors:

- i. The skew and jitter of the clock.
- ii. Repeater delay variations. Uncertainty grows monotonically with the number of repeaters [29].
- iii. Wire delay variations, mostly due to variations in metal thickness that affect resistance [39][40].
- iv. Via variations [39].
- v. Cross-Coupling. Unknown bit-patterns sent through the parallel link may result in cross-talk noise that affects the delay of the victim lines in an unpredictable way. Links should be optimized for worst-case switching patterns that cause worst cross-coupling noise. Cross-coupling is usually mitigated by means of shielding and spacing [3][30].
- vi. Geometry. Wide busses may encounter routing limitations, resulting in different geometries for different link lines (even in the same metal layer). This, of course, changes the worst-case link delay. In a multi-layer link structure the link delay is bounded by its slowest (lowest) metal layer (this paper analyzes only single layer interconnects).

Seeking to achieve maximal bit-rate, we first analyze the delay uncertainty of the wave-pipelined parallel link and then extend the results to register-pipelined links. The following two worst cases bound the minimal clock cycle of the link:

(a) Latest data clocking: The latest signal should arrive early enough to be clocked by the sampling register at the receiver (namely the signal should arrive before CLKR in Figure 2).

(b) Earliest data clocking: The first arrival of the next signal should not interfere with sampling of the previous word.

We adopt the notation of [27] and draw the delay uncertainty for source-synchronous communication in Figure 5. The clock cycle is restricted as follows:

$$T_{CLK} > 2 \cdot (\delta_{MAX} - \delta_{MIN}) + 4 \cdot \Delta_{CLK} + T_{SU} + T_H \tag{1}$$

where  $\delta_{MAX}$  and  $\delta_{MIN}$  are the max and min data delays (which are also the clock uncertainty in source-synchronous communication),  $\Delta_{CLK}$  is the one side clock skew, and  $T_{SU}$ ,  $T_H$  are the setup and hold times of a flip-flop. Below we explore the dependency of  $T_{CLK}$  on other parameters: the link width N and the link length L.



Figure 5: Parallel link minimal clock cycle is limited by clock jitter and skew and by link-length-dependent delay differences among the parallel wires due to variations and cross-talk

There are two types of in-die variations: random variations of closely placed devices and "systematic" variations, which typically depend on location on the die [41][42]. When a single line is considered, its delay may vary significantly (up to a factor of four) relative to the nominal delay, due to systematic variations in repeaters that are placed far from one another, and due to interconnect variations [39]. But when multiple lines are involved, such as in a parallel link, the effect of their relative skew ( $\delta_{MAX}$ - $\delta_{MIN}$ ) on  $T_{CLK}$  should also be analyzed. The effect of repeater transistor variation on that skew is low, thanks to the correlation between  $\delta_{MAX}$  and  $\delta_{MIN}$ , as follows. Repeaters that belong to the same stage (see Figure 2) are highly correlated in terms of systematic variations since they are placed close together. In addition, since repeaters are typically large [43], their random variations are averaged out. The length of the link also affects the skew: many repeaters result in smaller relative skew, because systematic inter-stage variations are averaged along the link. To conclude, even though the delay uncertainty of a single wire can be very high, the relative skew among the lines of a parallel link due to process variations is small and can be neglected.

The worst case delay of a single-wire with repeaters can be expressed as follows:

$$D^{Worst} = v_{SI} \cdot K \cdot d_{RPTR} + v_{INT} \cdot K \cdot d_{INT}(L) \tag{2}$$

where  $v_{SI}$  is the transistor variation coefficient (up to 30%), *K* is the number of repeaters,  $d_{RPTR}$  is the nominal delay of a single repeater,  $v_{INT}$  is the interconnect variation coefficient (up to ×3 [39]), *L* is the wire length,  $d_{INT}\approx 0.5 \cdot R_{INT} \cdot C_{INT} \cdot (L/K)^2$  [44][45], and  $R_{INT}$  and  $C_{INT}$  are the wire resistance and capacitance per unit length computed according to [43] and [46]. Note that this approximation of  $d_{INT}$  leads to an optimistic estimate of the total delay of the parallel link, and consequently to an optimistic estimate of the performance of the parallel link. When cross-coupling is also taken into account the wire delay is multiplied by  $\eta$  as follows:

$$D_{+cross-talk}^{Worst} = v_{SI} \cdot K \cdot d_{RPTR} + \eta \cdot v_{INT} \cdot K \cdot d_{INT}(L) \tag{3}$$

The worst case skew  $\Phi$  between two lines in the parallel link happens when one line is victimized by the worst possible aggression, while the other one experiences no aggression at all. Since the relative skew due to process variation can be neglected,  $\Phi \approx \delta_{MAX} - \delta_{MIN}$ .

$$\delta_{MAX} - \delta_{MIN} \approx \Phi(L) = D_{+cross-talk}^{Worst} - D_{+cross-talk}^{Best} = (\eta^{Worst} - \eta^{Best}) \cdot (v_{INT} \cdot K \cdot d_{INT}(L)) \tag{4}$$

Values of the  $\eta$  differences, based on PTM interconnect models [46] and on [30], are listed in Table 1. Actual shielding as in Figure 3 yields factors somewhat larger than zero, but we use zero for fully shielded links for the sake of simplicity.

Table 1: Coupling Factors Residual,  $\eta^{Worst} - \eta^{Best}$

| Shielding      | $\eta^{Worst} - \eta^{Best}$ |
|----------------|------------------------------|
| Not-Shielded   | 1.9                          |
| Fully-Shielded | 0                            |
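Eq. (2) to (4) together with Table 1 can be evaluated with a few lines of Python. This is only a sketch: the wire parameters in the usage lines (200 Ω/mm, 0.2 pF/mm, 5 mm, 10 repeaters) are hypothetical values chosen for round numbers, and treating $v_{SI}$ as a plain multiplier is an assumption about how the coefficient enters Eq. (2).

```python
# Coupling-factor residuals from Table 1
ETA_RESIDUAL = {"not_shielded": 1.9, "fully_shielded": 0.0}


def segment_wire_delay(r_int: float, c_int: float, length: float, k: int) -> float:
    """d_INT ~ 0.5 * R_INT * C_INT * (L/K)^2 for one of K wire segments (SI units)."""
    return 0.5 * r_int * c_int * (length / k) ** 2


def worst_case_delay(v_si, v_int, k, d_rptr, d_int, eta=1.0):
    """Eq. (2) when eta == 1, Eq. (3) when a coupling factor eta is supplied."""
    return v_si * k * d_rptr + eta * v_int * k * d_int


def skew(eta_residual, v_int, k, d_int):
    """Eq. (4): worst-case skew between two lines of the parallel link."""
    return eta_residual * v_int * k * d_int


# Hypothetical wire: 2e5 ohm/m, 2e-10 F/m, 5 mm long, 10 repeaters, v_INT = 3
d_int = segment_wire_delay(2e5, 2e-10, 5e-3, 10)            # 5 ps per segment
phi_unshielded = skew(ETA_RESIDUAL["not_shielded"], 3.0, 10, d_int)
phi_shielded = skew(ETA_RESIDUAL["fully_shielded"], 3.0, 10, d_int)
```

With these made-up inputs the unshielded skew is 285 ps while the fully-shielded skew is zero, which is the contrast Table 1 encodes.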

Combining Eq. (1) and (4) we get:

$$T_{CLK} > 2 \cdot \Phi(L) + 4 \cdot \Delta_{CLK} + T_{SU} + T_H \tag{5}$$

The parallel link clock frequency is shown for 65nm technology in Figure 6 as a function of length. This is based on the following assumptions:  $\Delta_{CLK}$  is 10% of the clock cycle,  $T_{SU}+T_H=50$  ps (about  $3 \cdot d_4$), and  $v_{INT}^{WC} = 3$  [39]. The minimal clock cycle is  $8 \cdot d_4$, as discussed above. Note that for the fully-shielded link and for very short distances of the unshielded link, the rate is bounded by the clock cycle rather than by delay uncertainty. Since in a typical SoC the clock cycle is substantially longer than  $8 \cdot d_4$, the maximal link rate is smaller. This is expressed as follows:

$$T_{CLK}^{PAR} = \max\left\{2 \cdot \Phi(L) + 4 \cdot \Delta_{CLK} + T_{SU} + T_{H}, \quad T_{SYSTEM-CLOCK}\right\} \tag{6}$$

In the register-pipelined parallel link the clock rate is equal to the system clock rate, and delay uncertainty affects the distance between successive pipeline stages.
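The bound of Eq. (5) and (6) can be sketched numerically as follows. One reading is assumed here: since $\Delta_{CLK}$ is stated to be 10% of the clock cycle, $\Delta_{CLK} = \alpha \cdot T_{CLK}$ is substituted into Eq. (5) and the inequality is solved for $T_{CLK}$, giving $T_{CLK} > (2\Phi + T_{SU} + T_H)/(1 - 4\alpha)$. The function and parameter names are illustrative.

```python
D4_S = 15e-12  # FO4 delay for 65nm quoted in the text


def parallel_clock_cycle(phi_s: float,
                         t_su_plus_h_s: float = 50e-12,
                         skew_fraction: float = 0.10,
                         t_system_s: float = 8 * D4_S) -> float:
    """Eq. (6): the larger of the uncertainty-limited cycle and the system clock cycle.

    phi_s is the worst-case skew Phi(L) of Eq. (4); skew_fraction is Delta_CLK / T_CLK.
    """
    if not 0.0 <= skew_fraction < 0.25:
        raise ValueError("skew_fraction must be in [0, 0.25) for Eq. (5) to be solvable")
    uncertainty_limited = (2.0 * phi_s + t_su_plus_h_s) / (1.0 - 4.0 * skew_fraction)
    return max(uncertainty_limited, t_system_s)


t_shielded = parallel_clock_cycle(0.0)        # Phi = 0: limited by the 8*d4 clock
t_unshielded = parallel_clock_cycle(285e-12)  # hypothetical 285 ps skew
```

With zero skew the uncertainty term alone would allow about 83 ps, so the $8 \cdot d_4 = 120$ ps clock is the binding limit, in line with the remark about fully-shielded links above.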

#### 2) Serial Link Bit-Rate

Serial links differ from parallel links in two ways: the serial link consists of only two wires (Figure 4), and the coupling factor over the serial link is always known [4]. The skew due to in-die variations over a serial link is much smaller than for even the narrowest of parallel links (eight-bit parallel) and therefore the skew is neglected. In addition, thanks to the fact that in the serial link only one of the two lines changes for every new bit, the skew is not affected by cross-coupling, and as a result the link delay is the same for all symbols. The minimal data cycle of the link is  $d_4$  (a new bit is sent every gate delay), resulting in maximal bit-rate:

$$B_{SER} = 1/d_4 \tag{7}$$

E.g., for 65nm,  $B_{SER} \approx 67$  Gbps.



Figure 6: Wave-pipelined parallel link maximal frequency

#### 3) Interconnect Characteristics

Note that the parallel and serial lines operate at very different rates. While the parallel link operates in the "RC" region, the serial link operates in the "RLC" region [47]. Repeater insertion is treated differently for these two domains [48]. Moreover, the cost and latency of the serial link can be further improved when interconnects without repeaters are considered [8]. These considerations are applied in the following.

## 3.2. Area

#### 1) Wave-Pipelined Parallel Link Area

In general, the total area requirement consists of silicon (drivers and repeaters) and interconnect (wires, shields and spacing):

$$A_{PAR} = (A_{DRIVERS}^{PAR} + A_{REPEATERS}^{PAR}) + (A_{WIRES} + A_{SHIELDS}) \tag{8}$$

In Section 4 we consider the active silicon and the interconnect separately, since they differ significantly in area, scaling and leakage issues. The detailed computation of the areas appears in [49], while Eq. (9) and (10) show the final results.

$$A_{ACTIVE}^{PAR-WP} = (N+1) \cdot \left[ \delta + k_{RPTR}^{PAR} \cdot L \cdot h_{RPTR}^{PAR} \right] \cdot A_{INV} \tag{9}$$

$$A_{INT}^{PAR} = (N+1) \cdot (k_{RPTR}^{PAR} \cdot L + 1) \cdot (s+1) \cdot A_{WIRE}^{PAR} \tag{10}$$

where:

- $N+1$: $N$ data bits and one clock line,
- $A_{INV}$: minimal inverter size,
- $\delta$: driver sizing factor relative to $A_{INV}$; the drivers assume cascaded buffers as in Figure 1,
- $k_{RPTR}^{PAR}$: number of repeaters per unit length of a wire,
- $h_{RPTR}^{PAR}$: repeater sizing factor relative to $A_{INV}$, optimized for minimal delay [43][50][51],
- $k_{RPTR}^{PAR} \cdot L + 1$: number of wire segments,
- $s$: shielding coefficient ($s=0$ for no shielding and $s=1$ for full shielding),
- $A_{WIRE}^{PAR}$: wire/shield segment area, including spacing.
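Eq. (9) and (10) translate directly into code. The sketch below uses illustrative function names and returns areas in whatever unit $A_{INV}$ and $A_{WIRE}^{PAR}$ are given in; the numbers in the usage lines are hypothetical, not values from the paper.

```python
def wp_active_area(n: int, length: float, k_rptr: float, h_rptr: float,
                   delta: float, a_inv: float) -> float:
    """Eq. (9): driver plus repeater area of an (N+1)-wire wave-pipelined link."""
    return (n + 1) * (delta + k_rptr * length * h_rptr) * a_inv


def parallel_interconnect_area(n: int, length: float, k_rptr: float,
                               s: int, a_wire: float) -> float:
    """Eq. (10): wire and shield area; s = 0 (unshielded) or 1 (fully shielded)."""
    return (n + 1) * (k_rptr * length + 1) * (s + 1) * a_wire


# Hypothetical 8-bit, 10 mm link with 2 repeaters/mm, h = 100, delta = 50
active = wp_active_area(8, 10, 2, 100, 50, a_inv=1)             # 18450 inverter areas
wires_shielded = parallel_interconnect_area(8, 10, 2, 1, 1)     # 378 segment areas
wires_unshielded = parallel_interconnect_area(8, 10, 2, 0, 1)   # 189 segment areas
```

Full shielding doubles the interconnect area for the same word width, as the $(s+1)$ factor indicates.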

#### 2) Register-pipelined Parallel Link Area

The register-pipelined parallel link contains pipeline stages as well as repeaters in between them. The number of stages  $M_{FF}$  depends on the clock cycle, on the link length and on the delay uncertainty of the link. Note that the pipeline stages act as repeaters, in addition to re-synchronizing the signal.  $M_{FF}$  is computed as follows:

$$M_{FF} = \frac{L}{T_{CLK}^{PAR} \cdot V_P} \tag{11}$$

where  $V_P$  is the propagation velocity of the voltage front, *L* is the link length and  $T_{CLK}^{PAR}$  is computed according to Eq. (6). We assume that the flip-flop is larger than the repeater by a factor  $\theta$  and obtain the following equation for the register-pipelined parallel link active area [49]:

$$A_{ACTIVE}^{PAR-RP} = (N+1) \cdot \left[ (k_{RPTR}^{PAR} \cdot L - M_{FF}) \cdot h_{RPTR}^{PAR} + \delta + M_{FF} \cdot \theta \cdot h_{RPTR}^{PAR} \right] \cdot A_{INV} \tag{12}$$

The interconnect area is similar to that of the wave-pipelined parallel link (Eq. (10)).
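A small sketch of Eq. (11) and (12) follows. Names and numeric inputs are hypothetical; in particular the propagation velocity in the usage line is a made-up value, and $M_{FF}$ is kept as a real number exactly as Eq. (11) gives it (a physical design would presumably round it up).

```python
def pipeline_stages(length: float, t_clk: float, v_p: float) -> float:
    """Eq. (11): number of pipeline stages M_FF = L / (T_CLK * V_P)."""
    return length / (t_clk * v_p)


def rp_active_area(n: int, length: float, k_rptr: float, h_rptr: float,
                   delta: float, theta: float, m_ff: float, a_inv: float) -> float:
    """Eq. (12): M_FF repeaters per wire are replaced by flip-flops theta times larger."""
    per_wire = (k_rptr * length - m_ff) * h_rptr + delta + m_ff * theta * h_rptr
    return (n + 1) * per_wire * a_inv


m_ff = pipeline_stages(10e-3, 1e-9, 5e6)   # hypothetical: 10 mm, 1 ns cycle, 5e6 m/s
area = rp_active_area(8, 10, 2, 100, 50, theta=3.0, m_ff=4, a_inv=1)
```

Setting `theta=1.0` or `m_ff=0` reduces Eq. (12) to the wave-pipelined area of Eq. (9), which is a convenient sanity check.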

#### 3) Serial Link Area

In addition to the components of Eq. (8), the area of the serial link contains also the SERDES (shields are neglected for serial links):

$$A_{SER} = (A_{SERDES} + A_{DRIVERS}^{SER} + A_{REPEATERS}^{SER}) + A_{WIRES} \tag{13}$$

The number and size of the repeaters on the serial link are smaller than for parallel links by factors  $\lambda$ ,  $\gamma$  thanks to RLC characteristics of the serial link [48]:

$$h_{RPTR}^{SER} = \sqrt{\frac{R_0 \cdot C_{INT}}{R_{INT} \cdot C_0}} \cdot \lambda = h_{RPTR}^{PAR} \cdot \lambda \tag{14}$$

$$k_{RPTR}^{SER} = \sqrt{\frac{0.4 \cdot R_{INT} \cdot C_{INT}}{0.7 \cdot R_0 \cdot C_0}} \cdot \gamma = k_{RPTR}^{PAR} \cdot \gamma \tag{15}$$

The trade-off between latency and power can be optimized further [52], but this is not considered here.

Given transistor count and ratios for XL stages (transition latches) of the *N*-bit shift-register [5], the SERDES area is  $\kappa \cdot N \cdot A_{INV}$, where according to [5],  $\kappa$=240. The serial link also contains a LEDR encoder of negligible area [4]. Noting that the link consists of 2 wires, we get ( $\chi$ is the driver sizing factor relative to  $A_{INV}$ ) [49]:

$$A_{ACTIVE}^{SER} = \left[ \kappa \cdot N + 2 \cdot \left( \chi + k_{RPTR}^{PAR} \cdot L \cdot h_{RPTR}^{PAR} \cdot \lambda \cdot \gamma \right) \right] \cdot A_{INV} \tag{16}$$

$$A_{INT}^{SER} = 2 \cdot \left( k_{RPTR}^{PAR} \cdot L \cdot \gamma + 1 \right) \cdot A_{WIRE}^{SER} \tag{17}$$
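The serial-link areas of Eq. (16) and (17) can be sketched the same way. Only $\kappa = 240$ comes from the text; the values used for $\lambda$, $\gamma$, $\chi$ and the repeater parameters are hypothetical placeholders.

```python
KAPPA = 240  # SERDES area per bit in units of A_INV, according to [5]


def serial_active_area(n: int, length: float, k_par: float, h_par: float,
                       lam: float, gamma: float, chi: float, a_inv: float,
                       kappa: float = KAPPA) -> float:
    """Eq. (16): SERDES area plus drivers and repeaters on the two wires."""
    return (kappa * n + 2 * (chi + k_par * length * h_par * lam * gamma)) * a_inv


def serial_interconnect_area(length: float, k_par: float, gamma: float,
                             a_wire: float) -> float:
    """Eq. (17): wire area of the two-wire serial channel."""
    return 2 * (k_par * length * gamma + 1) * a_wire


# Hypothetical 8-bit word, 10 mm link, lambda = gamma = 0.5, chi = 50
active = serial_active_area(8, 10, 2, 100, 0.5, 0.5, 50, a_inv=1)   # 3020 inverter areas
wires = serial_interconnect_area(10, 2, 0.5, a_wire=1)              # 22 segment areas
```

Note that the SERDES term $\kappa \cdot N$ does not depend on $L$, so it dominates on short links and is amortized on long ones.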

#### 4) Area Ratio Expressions

Given the area expressions for each case, we can derive the parallel-to-serial area ratios. We look for the link length  $L_{AREA}$  above which the serial link takes less area than the parallel link. For example, the area ratio of the active portions of the serial and wave-pipelined links (Eq. (9) and (16)), which is also indicative of leakage power, is (driver's components  $\delta$  and  $\chi$  can be neglected relative to the repeaters) [49]:

$$\frac{(N+1) \cdot k_{RPTR}^{PAR} \cdot L \cdot h_{RPTR}^{PAR}}{\kappa \cdot N + 2 \cdot k_{RPTR}^{PAR} \cdot L \cdot h_{RPTR}^{PAR} \cdot \lambda \cdot \gamma} \ge 1 \tag{18}$$

Eq. (18) has three main parameters:  $k_{RPTR}^{PAR}$,  $h_{RPTR}^{PAR}$  and N ( $\lambda \cdot \gamma$  is negligible). Both  $k_{RPTR}^{PAR}$  and  $h_{RPTR}^{PAR}$  depend on the technology node. According to data from [1][46], the number of repeaters per unit length  $k_{RPTR}^{PAR}$  grows roughly linearly as feature size shrinks (the extrapolation in Figure 7a beyond 65nm is speculative), while the repeater relative size  $h_{RPTR}^{PAR}$  stays almost the same. Therefore, as technology advances, the serial approach becomes preferable for shorter ranges. For example, Figure 7b shows  $L_{AREA}$  for the fully-shielded parallel link (N=8, equal bit rate for the parallel and serial links; the extrapolation beyond 65nm is speculative).
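Setting the ratio of Eq. (18) to exactly one and solving for $L$ gives $L_{AREA} = \kappa \cdot N / \left(k_{RPTR}^{PAR} \cdot h_{RPTR}^{PAR} \cdot (N + 1 - 2\lambda\gamma)\right)$. The sketch below implements that rearrangement; it is derived here from Eq. (18) rather than quoted from the paper, and the repeater values in the usage lines are hypothetical.

```python
def active_area_ratio(n, length, k_par, h_par, lam_gamma=0.0, kappa=240):
    """Left-hand side of Eq. (18): parallel over serial active area."""
    kLh = k_par * length * h_par
    return (n + 1) * kLh / (kappa * n + 2 * kLh * lam_gamma)


def l_area(n, k_par, h_par, lam_gamma=0.0, kappa=240):
    """Link length at which Eq. (18) holds with equality."""
    denom = k_par * h_par * ((n + 1) - 2 * lam_gamma)
    if denom <= 0:
        raise ValueError("no positive crossover length for these parameters")
    return kappa * n / denom


# Hypothetical: N = 8, 2 repeaters per mm, h = 100, lambda*gamma neglected
threshold_mm = l_area(8, 2, 100)   # about 1.07 mm with these made-up values
```

Because $k_{RPTR}^{PAR}$ appears in the denominator, a denser repeater population shortens the crossover length, which matches the scaling argument above.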

The ratio of interconnect areas (practically the total link area) is expressed as follows (Eq. (10) and (17)):

$$\rho_{INT\_AREA} = A_{INT}^{PAR} / A_{INT}^{SER} = (N+1) \cdot (s+1) / 2 \tag{19}$$

Again, in the fully shielded case where  $N \ge 8$  it is clear from Eq. (19) that the serial link always consumes less interconnect area.
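For a sense of scale, Eq. (19) can be evaluated over the range of widths studied here. The helper name is illustrative, and the ratio is taken exactly as written in Eq. (19).

```python
def interconnect_area_ratio(n: int, s: int) -> float:
    """Eq. (19): parallel over serial interconnect area; s = 0 or 1."""
    if s not in (0, 1):
        raise ValueError("s must be 0 (unshielded) or 1 (fully shielded)")
    return (n + 1) * (s + 1) / 2


for n in (8, 32, 128):
    print(n, interconnect_area_ratio(n, 0), interconnect_area_ratio(n, 1))
```

Even the narrowest link considered (N=8) uses 4.5 times the serial interconnect area when unshielded and 9 times when fully shielded.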





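Figure 7: (a) Number of repeaters per unit length vs. technology node; (b)  $L_{AREA}$, the link length above which the serial link takes less area than the parallel link; (c)  $L_{POWER}$, the link length above which less power is dissipated by the serial link than the parallel link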

## 3.3. Power

#### 1) Wave-Pipelined Parallel Link Power

The total link power comprises dynamic and standby power. The link utilization can be very low, typically less than 30%. Assuming that average data patterns would incur N/2 transitions per word, plus two clock transitions we obtain [49]:

$$P_{DYN}^{PAR} = \left(N/2+2\right) \cdot \left[ k_{RPTR}^{PAR} \cdot L \cdot h_{RPTR}^{PAR} + \delta + \left(k_{RPTR}^{PAR} \cdot L + 1\right) \cdot \beta \cdot h_{RPTR}^{PAR} \right] \cdot C_{INV} \cdot V_{DD}^2 \cdot F_{PAR} \tag{20}$$

where  $\beta$  is the ratio between the capacitances of a wire segment and a repeater ( $\beta \approx 1$ ),  $C_{INV}$  is the minimal inverter capacitance and  $F_{PAR}$  is the link clock frequency.

When the clock is gated (neglecting the power of clock gating circuits) the standby power is the leakage power:

$$P_{STANDBY} = P_{LEAK} = A_{ACTIVE}^{PAR-WP} \cdot V_{DD} \cdot I_{OFF} \tag{21}$$

where  $I_{OFF}$  is the off-current per device area [1][53] and  $A_{ACTIVE}^{PAR-WP}$  is calculated according to Eq. (9).
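Eq. (20) and (21) can be evaluated as in the sketch below. All numeric inputs are hypothetical (a 1 fF inverter, 1 V supply, 1 GHz link clock and a made-up off-current), and the function names are illustrative.

```python
def wp_dynamic_power(n: int, length: float, k_rptr: float, h_rptr: float,
                     delta: float, beta: float, c_inv: float,
                     v_dd: float, f_par: float) -> float:
    """Eq. (20): N/2 data transitions plus two clock transitions per word."""
    switched = (k_rptr * length * h_rptr + delta
                + (k_rptr * length + 1) * beta * h_rptr)  # in units of C_INV
    return (n / 2 + 2) * switched * c_inv * v_dd ** 2 * f_par


def leakage_power(a_active: float, v_dd: float, i_off: float) -> float:
    """Eq. (21): standby power with the clock gated; i_off is per unit device area."""
    return a_active * v_dd * i_off


# Hypothetical 8-bit, 10 mm link: 2 repeaters/mm, h = 100, delta = 50, beta = 1
p_dyn = wp_dynamic_power(8, 10, 2, 100, 50, 1.0, 1e-15, 1.0, 1e9)   # about 24.9 mW
p_leak = leakage_power(18450, 1.0, 1e-9)
```

Dynamic power is only paid while the link is active, so at the low utilizations mentioned above the leakage term carries proportionally more weight.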

#### 2) Register-Pipelined Parallel Link Power

The dynamic power expression for the pipelined wire case is as follows (clock input capacitance of flip-flops is ignored):

$$P_{DYN}^{PAR} = \left[ (N/2+2) \cdot k_{RPTR}^{PAR} \cdot L \cdot h_{RPTR}^{PAR} \cdot (1 + \beta) + (N/2+2) \cdot \left(\delta + h_{RPTR}^{PAR} \cdot (\beta - M_{FF})\right) + (N/2) \cdot M_{FF} \cdot \theta \cdot h_{RPTR}^{PAR} \right] \cdot C_{INV} \cdot V_{DD}^{2} \cdot F_{PAR} \tag{22}$$

In the register-pipelined case there is an additional leakage power component due to pipeline stage logic. Assuming again gated-clock and calculating  $A_{ACTIVE}^{PAR-RP}$  according to Eq. (12):

$$P_{STANDBY} = P_{LEAK} = A_{ACTIVE}^{PAR-RP} \cdot V_{DD} \cdot I_{OFF} \tag{23}$$

#### 3) Serial Link Power

The serial link always incurs one transition per bit [4]. The dynamic power of the serial link is [49]:

$$P_{DYN}^{SER} = P_{DYN}^{SERDES} + P_{DYN}^{CHANNEL} = 2 \cdot a_{SR}(N) \cdot C_{SR}(N) \cdot V_{DD}^2 \cdot F_{PAR} + \left( k_{RPTR}^{PAR} \cdot h_{RPTR}^{PAR} \cdot L \cdot \gamma \cdot \lambda + \chi + \left( k_{RPTR}^{PAR} \cdot L + 1 \right) \cdot \beta \cdot h_{RPTR}^{PAR} \right) \cdot C_{INV} \cdot V_{DD}^2 \cdot B_{SER} \tag{24}$$

The factor of two relates to the two shift-registers in the transmitter and in the receiver.  $C_{SR}$  is a single shift-register capacitance and  $a_{SR}$  is the activity factor, accounting also for toggling inside the shift-register (see details in [49]).

Leakage power is dissipated by the SERDES, line driver and repeaters (Eq. (16)):

$$P_{LEAK}^{SER} = A_{ACTIVE}^{SER} \cdot V_{DD} \cdot I_{OFF} \tag{25}$$

#### 4) Power Ratio Expressions

We look for the link length  $L_{POWER}$  above which the serial link dissipates less power than the parallel link. To compare same bit rates,  $N=B_{SER}/F_{PAR}$ . For example, the dynamic power ratio of the wave-pipelined and the serial links (Eq. (20) and (24)) is (neglecting  $\delta$  and  $\chi$  as above) [49]:

$$\frac{(N/2+2)\cdot(k_{RPTR}^{PAR}\cdot L\cdot h_{RPTR}^{PAR}\cdot (1+\beta)+\beta\cdot h_{RPTR}^{PAR})}{\left[(42\cdot N+5)+(k_{RPTR}^{PAR}\cdot h_{RPTR}^{PAR}\cdot L\cdot(\gamma\cdot\lambda+\beta)+\beta\cdot h_{RPTR}^{PAR})\right]\cdot N} \ge 1 \tag{26}$$

Eq. (26) depends on  $k_{RPTR}^{PAR}$,  $h_{RPTR}^{PAR}$  and N, similarly to Eq. (18), and likewise the length threshold becomes smaller with the shrinking feature size (Figure 7c, extrapolated speculatively).
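Eq. (26) is linear in $L$ on both sides, so the crossover length can be obtained in closed form. The rearrangement below is worked out here from Eq. (26) as written and is not quoted from the paper; the parameter values in the usage line are hypothetical.

```python
def dynamic_power_ratio(n, length, k_par, h_par, beta=1.0, gamma_lambda=0.0):
    """Left-hand side of Eq. (26): wave-pipelined over serial dynamic power."""
    a = k_par * h_par
    num = (n / 2 + 2) * (a * length * (1 + beta) + beta * h_par)
    den = ((42 * n + 5) + a * length * (gamma_lambda + beta) + beta * h_par) * n
    return num / den


def l_power(n, k_par, h_par, beta=1.0, gamma_lambda=0.0):
    """Length at which Eq. (26) holds with equality, or None if the ratio never reaches 1."""
    a = k_par * h_par
    slope = a * ((n / 2 + 2) * (1 + beta) - n * (gamma_lambda + beta))
    offset = n * (42 * n + 5) + beta * h_par * (n / 2 - 2)
    if slope <= 0:
        return None  # the parallel numerator never overtakes the serial denominator
    return offset / slope


threshold_mm = l_power(8, 2, 100)   # hypothetical: 2 repeaters per mm, h = 100
```

With $\beta = 1$ the slope term is proportional to $4 - N \cdot \gamma\lambda$, so under this formula a finite threshold exists only when $\gamma \cdot \lambda$ is small relative to $4/N$.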

## 3.4. Relative Latency Overhead

As above, the inherent wire delay is  $d_{INT} \approx 0.5 R_{INT} \cdot C_{INT} \cdot (L)^2$ . We consider the additional latency incurred by the various links.

#### 1) Wave-Pipelined Parallel Link

The relative latency overhead of wave-pipelined parallel link is

$$\Lambda_{WP}^{PAR} = (\eta^{Worst} - 1) \cdot v_{INT} \cdot d_{INT} \tag{27}$$

#### 2) Register-pipelined Parallel Link

For the register-pipelined parallel link, the overhead also includes the additional pipeline stage delays:

$$\Lambda_{RP}^{PAR} = (\eta^{Worst} - 1) \cdot v_{INT} \cdot d_{INT} + M_{FF} \cdot (d_{FF} - d_{RPTR})$$
(28)

#### 3) Serial Link

The serial link latency consists of the time required for serialization and the flight time over the channel  $(d_{INT})$ . The overhead is therefore the serialization time,  $N \cdot d_4$ .

Delay uncertainty also affects the serial link, resulting in skew between the D and S lines. However, this uncertainty was found to be much smaller, since in-die variations for two closely placed wires are much smaller than for a wider link. In addition, the number of repeaters is smaller thanks to operation in the RLC region. Hence we neglect the delay uncertainty due to in-die variations.

The coupling noise is also small in the serial structure. Since in the serial link there are no concurrent transitions, the same pattern is sent for each bit [30], resulting always in the same delay over the channel. Special layout mitigates the crosstalk further enabling differential encoding over the channel [4]. In addition, since density restrictions are less strict for serial channels, wider spacing can be employed for further cross-talk mitigation. Hence, we assume that for a serial line the cross-coupling noise can also be neglected.
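The three latency overheads of this section (Eq. (27), Eq. (28) and the serialization time $N \cdot d_4$) can be collected in a short Python sketch for side-by-side evaluation. The function and argument names are illustrative assumptions. All delays must be supplied in the same time unit, and no parameter values are implied.

```python
def wire_delay(r_int, c_int, length):
    """Inherent wire delay d_INT ~ 0.5 * R_INT * C_INT * L^2."""
    return 0.5 * r_int * c_int * length ** 2


def overhead_wave_pipelined(eta_worst, v_int, d_int):
    """Eq. (27): latency overhead of the wave-pipelined parallel link."""
    return (eta_worst - 1.0) * v_int * d_int


def overhead_register_pipelined(eta_worst, v_int, d_int, m_ff, d_ff, d_rptr):
    """Eq. (28): Eq. (27) plus the extra delay of M_FF pipeline stages."""
    return (overhead_wave_pipelined(eta_worst, v_int, d_int)
            + m_ff * (d_ff - d_rptr))


def overhead_serial(n_bits, d4):
    """Serial link: the overhead is the serialization time N * d_4."""
    return n_bits * d4
```

Note that only the serial overhead is independent of the wire delay $d_{INT}$, which is why it weighs most heavily on short links.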

## 4. COMPARATIVE ANALYSIS FOR 65nm

In this section we compare the area, power and latency of serial and parallel links that deliver bandwidth  $B_{SER}$ , the bit rate of the serial link (Eq. (7)). Figure 8 and Figure 9 show the parallel link widths that are required to achieve that bit rate in wave-pipelined and register-pipelined parallel links respectively.

Note that for ranges above 6mm, the unshielded wave-pipelined parallel link requires hundreds to thousands of lines in order to provide the required bit-rate. The same is true for register-pipelined links operating at low rates (clock cycle >  $130 \cdot d_4$ ). Links wider than 128 lines are impractical and are marked by dotted lines in the analysis. Note that fully shielded links, which double the number of wires, may be limited to 64 bit lines. Figure 10 and Figure 11 compare the active area of the links. As expected, there is a clear improvement in interconnect and total area requirements (Figure 14, Figure 15, Figure 18 and Figure 19).
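The width bookkeeping used here can be sketched in Python as follows. It applies the relation $N=B_{SER}/F_{PAR}$ from Section 3 and the 128-line and 64-line practicality limits quoted above. Rounding up to a whole number of lines, the function names and the constant names are assumptions made for illustration. The sketch does not model the length dependence of the wave-pipelined clock rate.

```python
import math

MAX_LINES_UNSHIELDED = 128      # links wider than this are treated as impractical
MAX_LINES_FULLY_SHIELDED = 64   # full shielding doubles the number of wires


def required_width(b_ser, f_par):
    """Number of parallel lines needed to match the serial bit rate.

    N = B_SER / F_PAR, rounded up to a whole line (assumption). The inner
    round() guards against floating-point noise just above an integer.
    """
    return math.ceil(round(b_ser / f_par, 9))


def is_practical(width, fully_shielded=False):
    """Check a link width against the practicality limits quoted in the text."""
    limit = MAX_LINES_FULLY_SHIELDED if fully_shielded else MAX_LINES_UNSHIELDED
    return width <= limit
```

For instance, a parallel clock that is 130 times slower than the serial bit rate gives `required_width(1.0, 1.0 / 130)`, which exceeds the unshielded limit.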

The leakage power ratios are the same as the active area ratios (Figure 10 and Figure 11), as expected from Eq. (21), (23) and (25). The serial link dissipates less dynamic power than the fully-shielded wave-pipelined parallel link (Figure 12) at ranges above 2mm, and the unshielded wave-pipelined link dissipates less power at shorter lengths. Similarly, the serial link dissipates less power than the register-pipelined parallel link (Figure 13), except for the fully shielded, slow and very wide parallel link (which requires significant area).

Note that the dynamic power of the serial link consists of both channel power and the power dissipated by the SERDES registers. In all links, dynamic power is significantly higher than leakage, as is evident in Figure 16 and Figure 17 (one exception is the dotted segment in Figure 16, where leakage is proportional to the impractically large area). This observation is independent of utilization levels.

Latency overhead is presented in Figure 20 and Figure 21. The serial link incurs higher latency than the unshielded wave-pipelined parallel link due to the long SERDES shift-registers. In Figure 21, the register-pipelined links incur higher latencies at longer wires.



Figure 8: WP link width required to deliver  $B_{SER}$  bit rate



Figure 9: RP link width required to deliver  $B_{SER}$  bit rate (same for un- and fully-shielded, bounded by system clock).



Figure 13: Dynamic power ratio (RP)



Figure 17: Total power ratio, 20% utilization (RP)



Figure 18: Total area ratio (WP)



Figure 19: Total area ratio (RP)



Table 2 summarizes the decision thresholds and costs for serial link employment. In the table, we specify minimal ranges for which a serial link is preferred over parallel. In some cases, the serial link is never better and for the specified minimal length incurs some penalty, which is also specified in the table. When there is no penalty, but only an improvement, only the minimal length is specified. The table provides length thresholds above which we should prefer the serial link, depending on whether area, power or latency, or their combinations, are minimized.
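The decision rule described by Table 2 can be expressed as a small lookup. The Python sketch below encodes the 65nm thresholds from the table. The key layout, the function name and the use of `0.0` for "Always" and `math.inf` for "Never" are illustrative choices, not part of the original analysis.

```python
import math

ALWAYS, NEVER = 0.0, math.inf

# (link type, shielding, parallel clock cycle in units of d4) ->
# minimal length in mm above which the serial link is preferred.
# Note: the unshielded WP column only covers parallel links up to 6 mm.
TABLE2_MM = {
    ("WP", "shielded", 8):     {"area": ALWAYS, "power": 2.0, "latency": 2.0},
    ("WP", "unshielded", 8):   {"area": ALWAYS, "power": 4.0, "latency": NEVER},
    ("RP", "shielded", 10):    {"area": ALWAYS, "power": 3.0, "latency": 4.0},
    ("RP", "shielded", 130):   {"area": ALWAYS, "power": 3.0, "latency": 12.0},
    ("RP", "unshielded", 10):  {"area": ALWAYS, "power": 1.0, "latency": 2.0},
    ("RP", "unshielded", 130): {"area": ALWAYS, "power": 3.0, "latency": 9.0},
}


def prefer_serial(config, length_mm, criteria):
    """True if the serial link is preferred for every requested criterion."""
    thresholds = TABLE2_MM[config]
    return all(length_mm > thresholds[c] for c in criteria)
```

For example, `prefer_serial(("WP", "unshielded", 8), 5.0, ["area", "power"])` selects the serial link, while adding `"latency"` to the criteria does not.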

## 5. CONCLUSIONS

Novel serial links outperform standard parallel links when long range communication is considered. This advantage scales with technology, making the serial links more attractive for shorter links in future technologies. Future large SoCs should employ serial links to mitigate the cost of communication in terms of area, congestion, power and latency. We have provided a detailed analysis of the serial link with an example for 65nm technology. In the example we compared the serial link versus two typical parallel links.

|                              | Wave-pipelined vs. Serial                   |                | Register-pipelined vs. Serial |                   |                  |                   |
|------------------------------|---------------------------------------------|----------------|-------------------------------|-------------------|------------------|-------------------|
| Shielding                    | Fully shielded                              | Unshielded     | Fully shielded                | Fully shielded    | Unshielded       | Unshielded        |
| Length of parallel link      | unlimited                                   | up to 6 mm     | unlimited                     | unlimited         | unlimited        | unlimited         |
| Clock cycle of parallel link | 8d<sub>4</sub>                              | 8d<sub>4</sub> | 10d<sub>4</sub>               | 130d<sub>4</sub>  | 10d<sub>4</sub>  | 130d<sub>4</sub>  |
| To minimize the following:   | choose a serial link for links longer than: |                |                               |                   |                  |                   |
| area                         | Always                                      | Always         | Always                        | Always            | Always           | Always            |
| power                        | 2 mm                                        | 4 mm           | 3 mm                          | 3 mm              | 1 mm             | 3 mm              |
| latency                      | 2 mm                                        | Never*         | 4 mm                          | 12 mm             | 2 mm             | 9 mm              |

Table 2: 65nm example – minimal length above which the serial link is preferred (various criteria)

\* The serial link incurs  $2-10 \times$  latency overhead penalty for 0-6 mm link.

## 6. **REFERENCES**

- [1] International Technology Roadmap for Semiconductors, 2005.
- [2] A. Morgenshtein, et al., "Low-Leakage Repeaters for NoC Interconnects", ISCAS, 600-603, 2005.
- [3] R. Weerasekera, et al., "Minimal-Power, Delay-Balanced Smart Repeaters for Interconnects in the Nanometer Regime," SLIP, 113-120, 2006.
- [4] R. Dobkin, et al., "High Rate Wave-Pipelined Asynchronous On-Chip Bit-Serial Data Link," ASYNC, 3-14, 2007.
- [5] R. Dobkin, et al., "Fast Asynchronous Shift Register for Bit-Serial Communication," ASYNC, 117-126, 2006.
- [6] R. Dobkin, et al., "High-Speed Serial Interconnect for NoC," NoC Workshop, DATE, 2006.
- [7] R. Dobkin, et al., "Fast Asynchronous Bit-Serial Interconnects for Network-on-Chip," CCIT TR529, EE Dept., Technion, 2005.
- [8] A.P. Jose, et al., "Pulsed Current-Mode Signaling for Nearly Speed-of-Light Intrachip Communication," JSSC, 41(4), 772–780, 2006.
- [9] I. Saastamoinen, et al., "Interconnect IP for gigascale system-on-chip," ECCTD, 116-120, 2001.
- [10] T. Suutari, et al., "High-speed Serial Communication with Error Correction In 0.25µm CMOS Technology," ISCAS, 618-621, 2001.
- [11] S.J. Lee, et al., "Adaptive Network-on-Chip with Wave-front Train Serialization Scheme," Proc. VLSI Circuits, 104-107, 2005.
- [12] A.J. Joshi, et al., "Wave-pipelined multiplexed (WPM) routing for gigascale integration (GSI)," TVLSI 13(8): 889-910, 2005.
- [13] A.J. Joshi, et al., "Design and Optimization of On-Chip Interconnects Using Wave-Pipelined Multiplexed Routing," TVLSI, 990-1002, 2007.
- [14] J. Teifel, et al., "A High-Speed Clockless Serial Link Transceiver," ASYNC, 151-161, 2003.
- [15] C.K.K. Yang, "Design of High-Speed Serial Links in CMOS", PhD, Stanford U., 1998.
- [16] S. Sidiropoulos, "High Performance Inter-Chip Signaling," Tech. Rep. CSL-TR-98-760, Stanford U., 1998.
- [17] W.F. Ellersick, "Data Converters for High Speed CMOS Links," PhD Thesis, Stanford Univ., 2001.
- [18] H.O. Johansson, et al., "A 4 Gsamples/s Line-Receiver in 0.8 um CMOS," Proc. Int. Symp. VLSI Circuits, pp. 116-117, 1996.
- [19] C. Svensson et al., "High Speed CMOS Chip to Chip Communication Circuit," ISCAS, 2228-2231, 1991.
- [20] M.J.E. Lee, "An Efficient I/O and Clock Recovery for TERABIT Integrated Circuits Design," PhD Thesis, Stanford Univ., 2001.
- [21] W. Bainbridge, S. Furber, "Delay Insensitive System-on-Chip Interconnect using 1-of-4 encoding", ASYNC, 118-126, 2001.
- [22] R. Ho, et al., "Long Wires and Asynchronous Control," ASYNC, 240-249, 2004.

- [23] G. Lakshminarayanan, et al., "Optimization Techniques for FPGA-Based Wave-Pipelined DSP Blocks," TVLSI, 13(7), 2005.
- [24] C. Svensson, et al., "A 3-Level Asynchronous protocol for a Differential Two-Wire Communication Link," JSSC, 29(9), 1994.
- [25] A. Morgenshtein, et al., "Comparative Analysis of Serial vs. Parallel Links in Networks on Chip," SoC, 185-188, 2004.
- [26] J. Xu, et al., "Wave Pipelining for Application-Specific Networks-on-Chips," CASES, 198-201, 2002.
- [27] J. Xu, et al., "Wave-Pipelined On-chip Interconnect Structure for Networks-on-Chips," HOTI, 10-14, 2003.
- [28] W. P. Burleson, et al., "Wave-Pipelining: A Tutorial and Research Survey," TVLSI, 6(3):464-474, 1998.
- [29] B.D. Winters, et al., "A Negative-Overhead, Self-Timed Pipeline," ASYNC, 37-46, 2002.
- [30] L. Li, et al., "A Crosstalk Aware Interconnect with Variable Cycle Transmission," DATE: 102-107, 2004.
- [31] R. Dobkin, et al., "High Rate Data Synchronization in GALS SoCs," TVLSI, 14(10):1063-1074, 2006.
- [32] M.T. Dean, et al., "Efficient Self-Timing with Level-Encoded 2-Phase Dual-Rail (LEDR)," ARVLSI, 55-70, 1991.
- [33] D.H. Linder, et al., "Phased Logic: Supporting the Synchronous Design Paradigm with Delay-Insensitive Circuitry," IEEE Trans. Computers 45(9):1031-1044, 1996.
- [34] DS-DE, IEEE 1355-1995, http://grouper.ieee.org/groups/1355/index.html.
- [35] D.Pham, et al., "The Design and Implementation of a First-Generation CELL Processor," ISSCC, 184-592, 2005.
- [36] International Technology Roadmap for Semiconductors, 2003.
- [37] R. Ginosar, "Fourteen Ways to Fool Your Synchronizer," ASYNC, 89-96, 2003.
- [38] R. Dobkin, R. Ginosar, "Zero phase latency synchronizers using four and two phase protocols," TR, EE Dept, Technion, 2007, www.ee.technion.ac.il/~ran/papers/zerolatency.pdf.
- [39] L. Scheffer, "An Overview of On-chip Interconnect Variation," SLIP, 27-28, 2006.
- [40] R. O. Topaloglu, et al., "Generation of Design Guarantees for Interconnect Matching," SLIP, 29-34, 2006.
- [41] H. Chang, et al., "The Certainty of Uncertainty: Randomness in Nanometer Design," PATMOS, 36-47, 2004.
- [42] K.A. Bowman, et al., "Impact of Die-to-Die and Within-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration," JSSC, 37(2):183-189, 2002.
- [43] H.B. Bakoglu, "Circuits, Interconnections and Packaging for VLSI", Addison-Wesley, 194-219, 1990.
- [44] W. C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers," J. Applied Physics, 19(1), 1948.
- [45] M. Moreinis, et al., "Logic gates as Repeaters (LGR) for Timing Optimization," TVLSI, 14(11):1276-1281, 2006.
- [46] Predictive Technology Model (PTM), Interconnect models, 2005, http://www.eas.asu.edu/~ptm.
- [47] C. Svensson, "Electrical Interconnects Revitalized," TVLSI, 10(6):777-788, 2002.
- [48] Y.I. Ismail, et al., "Repeater Insertion in RLC Lines for Minimum Propagation Delay," ISCAS, 404-407, 1999.
- [49] R. Dobkin, et al., "Parallel vs. Serial On-Chip Communication," CCIT Report, EE Dept, Technion, December 2007 (www.ee.technion.ac.il/~ran/papers/parallelserial 2007.pdf).
- [50] E.G. Friedman, et al., "Low-Power Repeaters Driving RC and RLC Interconnects With Delay and Bandwidth Constraints," TVLSI, 14(2):161-172, 2006.
- [51] Y. Cao, et al., "Effects of global interconnect optimizations on performance estimation of deep submicron designs," CAD, 56-61, 2000.
- [52] Y. Cao, et al., "Effects of Global Interconnect Optimizations on Performance Estimation of Deep Submicron Designs," IEEE/ACM CAD, 56-61, 2000.
- [53] R. Venkatesan, et al., "Minimum Power and Area N-Tier Multilevel Interconnect Architectures Using Optimal Repeater Insertion," ISLPED, 167-172, 2000.