The ASCI Option Red Supercomputer

 

Timothy G. Mattson and Greg Henry

 

 

Intel Corporation

Beaverton, OR, 97006

 

tgmattso@ichips.intel.com

henry@co.intel.com

Abstract

 

 

 

Computer simulations needed by the U.S. Department of Energy (DOE) greatly exceed the capacity of the world’s most powerful supercomputers. To satisfy this need, the DOE created the Accelerated Strategic Computing Initiative (ASCI). This program accelerates the development of new scalable supercomputers and will lead to a 100 TFLOPS supercomputer by early next century.

Intel built the first computer in this program: the ASCI Option Red Supercomputer. This system will have over 4500 nodes, 594 Gbytes of RAM, and two independent 1 Tbyte disk systems. In December 1996, we used a 3632 node portion of this machine to set the MP LINPACK world record of 1.06 TFLOPS.

In this paper, we describe both the hardware and the system software for the ASCI Option Red supercomputer. We also describe early benchmark results, including the record-setting MP LINPACK computation.

1 Introduction

100 TFLOPS (Trillion FLoating point OPerations per Second) computers are needed to support the simulation needs of the U.S. Department of Energy (DOE). Given enough time, the computer industry will create 100 TFLOPS supercomputers. The DOE, however, cannot wait for industry to get around to building these machines; they are needed by early next century. In response to this need, the DOE launched a program to accelerate the development of extreme scale, massively parallel supercomputers, with the goal of a 100 TFLOPS computer early next century. This program is called the Accelerated Strategic Computing Initiative, or ASCI.

The first ASCI supercomputer - the ASCI Option Red supercomputer - is being built by Intel and will be installed at Sandia National Laboratories in Albuquerque, NM. The ASCI Option Red supercomputer must satisfy a set of specific performance milestones.

In December of 1996, we met the first milestone by running an MP LINPACK calculation at 1.06 TFLOPS. We are on schedule for meeting the second milestone.

The design of the ASCI Option Red supercomputer is loosely based on the Intel Paragon supercomputer. The Paragon supercomputer used a 2D mesh interconnection facility (ICF) that could move messages at a peak uni-directional bandwidth of 200 Mbytes per second. Each Paragon node held two (the GP node) or three (the MP node) Intel i860 XP processors. The system was organized into three scalable partitions: compute, service, and I/O.

The OS on the Paragon supercomputer presented the user with a single system image. This OS was a full OSF/1 distributed Unix (the Paragon OS) and it ran on each node. The Paragon architecture was scalable to large numbers of nodes. In December 1994, for example, Intel worked with scientists from Sandia and Oak Ridge National Laboratories to set an MP LINPACK record of 281 GigaFLOPS [3] on a Paragon XP/S supercomputer containing over 2,000 nodes.

Figure 1 is a diagram of the ASCI Option Red supercomputer as it will sit at Sandia National Laboratories in Albuquerque, NM. The machine is organized into a large pool of compute nodes in the center, two distinct blocks of nodes at either end, and two one-Tbyte disk systems. The end-blocks and their disk systems can be isolated from the rest of the machine by disconnecting the X-mesh cables in the disconnect cabinets (marked with an X in Figure 1). This design satisfies DOE security requirements for a physically isolated classified disk system while assuring that both disk systems are always available to users. Because the mesh must remain rectangular, the number of cabinets set up for isolation must be the same in each row at each end. The most likely configuration would put the disconnect cabinets four cabinets in from each end, but this can be varied to meet the customer's needs. Depending on which types of node boards are used in which slots, this would yield a 400 GFLOPS stand-alone system.


Figure 1: Schematic diagram of the ASCI Option Red supercomputer as it will be installed at Sandia National Laboratories in Albuquerque, NM. The cabinets near each end labeled with an X are the disconnect cabinets used to isolate one end or the other. Each end of the computer has its own I/O subsystem (the group of 5 cabinets at the bottom and the left) and its own SPS station (next to the I/O cabinets). The lines show the SCSI cables connecting the I/O nodes to the I/O cabinets. The curved line at the top of the page shows the windowed wall of the room where the machine operators will sit. The black square in the middle of the room is a support post.

Parameters for the full system are given in Table 1. The numbers are impressive. It will occupy 1,600 sq. ft. of floor space, including the access area required on all sides of the system. The system's 9,216 Pentium Pro processors with 594 Gbytes of RAM are connected by a dual-plane 38 x 32 x 2 mesh. The system has a peak computation rate of 1.8 TFLOPS and a cross-section bandwidth (measured across the two 32 x 38 planes) of over 51 GB/sec.


Compute Nodes                                      4,536
Service Nodes                                      32
Disk I/O Nodes                                     32
System Nodes (Boot)                                2
Network Nodes (Ethernet, ATM)                      10
System Footprint                                   1,600 square feet
Number of Cabinets                                 85
System RAM                                         594 Gbytes
Topology                                           38 x 32 x 2
Node-to-Node Bandwidth (bi-directional)            800 Mbytes/sec
Cross-Section Bandwidth (bi-directional)           51.6 Gbytes/sec
Total Number of Pentium Pro Processors             9,216
Processor-to-Memory Bandwidth                      533 Mbytes/sec
Compute Node Peak Performance                      400 MFLOPS
System Peak Performance                            1.8 TFLOPS
RAID I/O Bandwidth (per subsystem)                 1.0 Gbytes/sec
RAID Storage (per subsystem)                       1 Tbyte

Table 1: System parameters for the ASCI Option Red Supercomputer.

Different parts of the ASCI Option Red supercomputer use different operating systems. The nodes involved with computation (compute nodes) run an efficient, small operating system called Cougar. The nodes that support interactive user services (service nodes) and booting services (system nodes) run a distributed UNIX that provides a single system image.

When scaling to so many nodes, even low probability points of failure can become a major problem. To build a robust system with so many nodes, the hardware and software must be explicitly designed for RAS: Reliability, Availability and Serviceability. All major components are hot-swappable and repairable while the system remains under power and, in many cases, while applications continue to run. All of the 4,536 compute nodes and 16 on-line hot spares, for example, can be replaced without having to cycle the power of any other module. Similarly, system operation can continue if any of the 308 patch support boards (which support the RAS functionality), 640 disks, 1540 power supplies, or 616 ICF backplanes should fail; each such failure reduces system capacity by less than 1%.

This paper provides an overview of the ASCI Option Red system. We begin with a description of the microprocessor used throughout the system: the Intel Pentium Pro processor. We then describe the interconnection facility (ICF) and the various node boards that will be used in the system. Next, we describe the system software. We close the paper with some early results from programs running on the system, including MP LINPACK and the sPPM [11] benchmark from Lawrence Livermore National Laboratory.

2 The Pentium Pro Processor

The Intel Pentium Pro processor [7] is used on all the nodes in the ASCI Option Red supercomputer. The Intel Pentium Pro processor is both a CISC and a RISC chip. Its instruction set is essentially the same as that of the Pentium processor. At runtime, however, the CISC instructions are broken down into simpler RISC instructions called micro-operations (or uops). The uops execute in a RISC core inside the Pentium Pro processor with the order of execution dictated by the availability of data. This lets the CPU continue with productive work while other uops wait for data or functional units. This out-of-order execution is combined with sophisticated branch prediction and register renaming to provide what Intel calls dynamic execution.

The Pentium Pro processor's RISC core can execute a burst rate of up to five uops per cycle on its five functional units.

Up to 3 uops can be retired per cycle, of which only one can be a floating point operation. The floating point unit requires two cycles per multiply and one cycle per add. The adds can be interleaved with the multiplies, so the Pentium Pro processor can have a result ready to retire every cycle. Hence, the peak multiply-add rate is 200 MFLOPS at 200 MHz.
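To make this rate concrete, consider a simple dot product: each iteration issues one multiply and one add, so a steady one-result-per-cycle stream at 200 MHz corresponds to the 200 MFLOPS peak quoted above. The C fragment below only illustrates that instruction mix; it makes no claim about how the compiler actually schedules the loop.

    /* Illustration only: each iteration of this dot product issues one
     * multiply and one add, the mix that lets the processor retire one
     * floating point result per cycle when operands come from cache.   */
    double ddot(int n, const double *x, const double *y)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];      /* one multiply + one add per element */
        return sum;
    }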

The Pentium Pro processor has separate on-chip data and instruction L1 caches (each of which is 8 KBytes). It also has an L2 cache (256 KBytes) packaged with the CPU in a single dual-cavity PGA package. The L1 data cache is dual-ported and non-blocking, supporting one load and one store per cycle (a peak bandwidth of 3.2 GB/sec at 200 MHz). The L2 cache interface runs at the full CPU clock speed and can transfer 64 bits per cycle (1.6 GB/sec on a 200 MHz Pentium Pro processor). The external bus is also 64 bits wide and supports a data transfer every bus cycle.

The Pentium Pro processor bus offers full support for memory and cache coherency for up to 4 Pentium Pro processors. It has 36 bits of address and 64 bits of data. Several features enhance the efficiency of the bus.

The bus can support up to 8 pending transactions, while the Pentium Pro processor and the memory controller can each have up to 4 pending transactions. Memory controllers can be paired up to match the bus's support for 8 pending transactions.

The bus can sustain data on every clock cycle, so at 66 MHz the peak data rate is 533 MB/sec. Unlike most Commodity Commercial Off The Shelf (CCOTS) processor buses, which can only detect data errors by using parity coverage, the Pentium Pro processor data bus is protected by ECC. Address signals are protected by parity.

3 The ASCI Option Red Supercomputer

The ASCI Option Red Supercomputer is a Massively Parallel Processor (MPP) with a distributed memory Multiple-Instruction, Multiple Data (MIMD) architecture. All aspects of this system are scalable including the aggregate communication bandwidth, the number of compute nodes, the amount of main memory, disk storage capacity and I/O bandwidth.

In Figure 2, we show a block diagram for the ASCI Option Red computer. There are four logical partitions in the system: the System, Service, I/O, and Compute partitions.

In normal operation, one of the sets of disconnect cabinets will cut the system in two. In this case, each side will see the logical partition model outlined in Figure 2.


Figure 2: Logical System Block Diagram for the ASCI Option Red Supercomputer. This system uses a split-plane mesh topology and has 4 partitions: System, Service, I/O and Compute. Two different kinds of node boards are used and described in the text: the Eagle node and the Kestrel node. The operator's console (the SPS station) is connected to an independent Ethernet network that ties together the patch support boards on each card cage.

In the next few subsections, we will discuss the major hardware components used to implement the ASCI Option Red supercomputer. We will begin with the Interconnection Facility (ICF) used to connect the nodes together. We will follow this with a discussion of the two kinds of node boards used in the system: the Eagle and Kestrel boards.

3.1 Interconnection Facility

The interconnection facility (ICF) is shown in Figure 3. It utilizes a dual-plane mesh to provide better aggregate bandwidth and to support routing around mesh failures. It uses two custom components: the Network Interface Chip (NIC) and the Mesh Router Chip (MRC). The MRC sits on the system backplane and routes messages across the machine. It supports bi-directional bandwidths of up to 800 Mbytes/sec over each of 6 ports (i.e., two directions for each of the X, Y, and Z dimensions). Each port is composed of four virtual lanes that equally share the total bandwidth. This means that as many as four message streams can pass through an MRC on any given port at any given time. This reduces the impact of communication contention and leads to a more effective use of total system bandwidth.


Figure 3: ASCI Option Red Supercomputer 2 Plane Interconnection Facility (ICF). The red squares on each node board are the Network Interface Chips (NIC) while the black squares on the dual backplanes are the Mesh Router Chips (MRC). Bandwidth figures are given for NIC-MRC and MRC-MRC communication. Bi-directional bandwidths are given on the left side of the figure while uni-directional bandwidths are given on the right side. In both cases, sustainable (as opposed to peak) numbers are given in parentheses.

The Network Interface Chip (NIC) resides on each node and provides an interface between the node's memory bus and the MRC. A NIC can also be connected to another NIC to support dense packaging on node boards. For example, the NIC on one node (the outer node) can be connected to the NIC on the other node (the inner node), which then connects to the MRC. Contention is minimized in this configuration since the virtual lane capability used on the MRCs is also included in the NICs.

3.2 The Eagle Board

The node boards used in the I/O and system partitions are the Eagle boards. In Figure 4, we show a block diagram for an Eagle board. Each node includes two 200 MHz Pentium Pro processors. These two processors support two on-board PCI interfaces that each provide 133 MB/sec of I/O bandwidth. One of the two buses can support two PCI cards through the use of a 2-level riser card. Thus, a single Eagle board can be configured with up to 3 long-form PCI adapter cards. CCOTS PCI adapter boards can be inserted into these interfaces to provide Ultra-SCSI, ATM, FDDI, and numerous other custom and industry-standard I/O capabilities. In addition to add-in card capabilities, there are base I/O features built into every board that are accessible through the front panel. These include RS232, 10 Mbit Ethernet, and differential FW-SCSI.

Each Eagle board provides ample processing capacity and throughput to support a wide variety of high-performance I/O devices. The throughput of each PCI bus is dictated by the type of interface supported by the PCI adapter in use, the driver software, and the higher-level protocols used by the application and the "other end" of the interface. The data rates associated with common I/O devices fit well within the throughput supported by the PCI bus. Ultra-SCSI, for example, provides a hardware rate of 40 MB/sec. This rate can easily be supported by CCOTS PCI adapters.


Figure 4: The ASCI Option Red Supercomputer I/O Node (Eagle Board). The NIC connects to the MRC on the backplane through the ICF Link.

The memory subsystem is implemented using Intel's CCOTS Pentium Pro processor support chip-set (82453). It is structured as four rows of four independently-controlled, sequentially-interleaved banks of DRAM to produce up to 533 MB/sec of data throughput. Each bank of memory is 72 bits wide, allowing for 64 data bits plus 8 bits of ECC, which provides single bit error correction and multiple bit error detection. The banks are implemented as two 36-bit SIMMs, so industry standard SIMM modules can be used to provide 64 Mbytes to 1 Gbyte of memory.

3.3 The Kestrel Board

Kestrel boards (see Figure 5) are used in the compute and service partitions. Each Kestrel board holds two compute nodes. The nodes are connected through their NICs, with one of the NICs connecting to an MRC on the backplane. Each node on the Kestrel board includes its own boot support (FLASH ROM and simple I/O devices) through a PCI bridge on its local bus. A connector is provided to allow testing of each node through this PCI bridge. The FLASH ROM contains the Node Confidence Tests, the BIOS, plus additional features needed to diagnose board failures and to load a variety of operating systems. Local I/O support includes a serial port, called the "Node Maintenance Port". This port interfaces to the system's internal Ethernet through the patch support board (PSB - see section 5) on each card cage.


Figure 5: The ASCI Option Red supercomputer Kestrel Board. This board includes two compute nodes daisy-chained together through their NICs. One of the NICs connects to the MRC on the backplane through the ICF Link.

The memory subsystem on an individual compute node is implemented using Intel's CCOTS Pentium Pro processor support chip-set (82453). It is structured as two rows of four independently-controlled, sequentially-interleaved banks of DRAM to produce up to 533 MB/sec of data throughput. Each bank of memory is 72 bits wide, allowing for 64 data bits plus 8 bits of ECC, which provides single bit error correction and multiple bit error detection. The banks are implemented as two 36-bit SIMMs, so industry standard SIMM modules may be used. Using commonly available 4 and 8 MByte SIMMs (based on 1Mx4 DRAM chips) and 16 and 32 MByte SIMMs (based on 4Mx4 DRAM chips), 32 MB to 256 MB of memory per node is supported. The system will be delivered with 128 Mbytes/node.

4 The ASCI Option Red Supercomputer Software

The Paragon supercomputer taught us a great deal about building operating systems and communication software for computers with thousands of nodes. To capitalize on this experience, the software for the ASCI Option Red system is an evolution of the scalable environment for the Paragon system.

In the next few subsections, we will describe the key features of the system software on the ASCI Option Red computer. We will start with the two operating systems used on the system. We will then describe in some detail the portals mechanism used for high performance communication. We close this section with a description of the programming environment used on the system.

4.1 The Operating Systems

Each partition in the system places different demands on the operating system. We could have developed one operating system that met the needs of all of the partitions. This would guarantee, however, that the operating system did all jobs adequately but none of them particularly well. We took a different approach and used multiple operating systems tuned to the specific needs of each partition.

The Service, I/O and System partitions are directly visible to interactive users. These partitions need a familiar UNIX operating system. We used Intel's distributed version of Unix (POSIX 1003.1 and XPG3, AT&T System V.3 and 4.3 BSD Reno VFS) developed for the Paragon XP/S supercomputer. The port of the Paragon OS to the ASCI Option Red system resulted in an OS we call the TFLOPS OS. The TFLOPS OS presents a single system image to the user. This means that users see the system as a single UNIX machine despite the fact that the operating system is running on a distributed collection of nodes.

The compute partition has different needs. Users will only run parallel applications on these nodes - not general interactive services. Furthermore, these nodes are the most numerous so the aggregate costs of wasted resources (such as memory consumed by an OS) grows rapidly. Therefore, for the compute partition, we wanted an operating system that was small in size, very fast, and provided just those features needed for computation.

On our Paragon XP/S supercomputers, we had great success with SUNMOS - a lightweight operating system from Sandia National Laboratories and the University of New Mexico. For the ASCI Option Red computer, we decided to work with their next lightweight OS (Puma [15]). We call our version of Puma, Cougar.

Cougar easily meets our criteria for a compute partition operating system. It is small (less than half a megabyte), of low complexity, and scalable. To describe Cougar, we view the system in terms of four entities: the host OS, the Quintessential Kernel (Q-Kernel), the process control thread (PCT), and the user's application.

Since it is a minimal operating system, Cougar depends on a host OS to provide system services and to support interactive users. For the ASCI Option Red computer, the host OS is the TFLOPS OS running in the service partition. The Q-Kernel is the lowest level component of Cougar. All access to hardware resources goes through the Q-Kernel. Above the Q-Kernel sits the process control thread (PCT). This component runs in user space and manages processes. At the highest level sits the user's application.

Cougar takes a simple view of a user's application. An application is viewed as a collection of processes grouped together and identified by a common group identifier. Within each group, a process is assigned a group rank which ranges from 0 to (n-1) where n is the number of processes. While the PCT supports priority multi-tasking, it is anticipated that most users will run only one application process per node.

Memory integrity in Cougar is assured by a hierarchy of trust ranging from the Q-Kernel to the PCT to the application. At each level in the hierarchy, lower levels are trusted but higher levels are not. Hence, an application cannot corrupt the PCT or the Q-Kernel, while a flawed Q-Kernel can corrupt everything above it.

4.1.1 The Quintessential Kernel or Q-Kernel

There is one Q-Kernel running on each of the nodes in the compute partition. It provides access to the physical resources of a node. For example, only the Q-Kernel can directly access communication hardware or handle interrupts. In addition, any low level functions that need to be executed in supervisor mode are handled by the Q-Kernel. This includes memory management, context switching and message dispatch or reception.

The Q-Kernel is accessed through a series of system traps. These are usually invoked by an application or a PCT, but they can also arise from an exception (e.g. a floating point exception) or an interrupt (e.g. timer or communication interrupts).

The Q-Kernel does not set the policy for how a node's resources will be used. Rather, the Q-Kernel performs its low level tasks on behalf of the PCT or a user’s application. This design keeps the Q-kernel small and easy to maintain.

4.1.2 The Process Control Thread

The Process Control Thread (PCT) sits on top of the Q-Kernel and manages process creation, process scheduling, and all other operating system resources. While part of the operating system, the PCT is a user level process meaning that it has read/write access to the user address space. There will typically be only one PCT per node.

Most of the time, the PCT is not active. It only becomes active when the Q-Kernel receives an application exception, a process blocks itself by a call to the quit_quantum() routine, or in response to certain timer interrupts.

When a PCT becomes active, it first checks to see if the most recently suspended process has any outstanding requests for the PCT. Once these requests are resolved, it handles requests from any other PCTs. Finally, it tries to execute the highest priority, runnable application.
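To make the interaction between an application and the PCT concrete, the sketch below shows a compute process voluntarily giving up the processor while it waits for an event, using the quit_quantum() call mentioned above. The signature of quit_quantum() and the event_arrived() poll are assumptions made for illustration; this is not the actual Cougar interface.

    /* Hedged sketch of how a compute process might yield the processor while
     * waiting for an event.  quit_quantum() is named in the text above, but
     * its signature and the event_arrived() poll are assumptions made for
     * illustration; this is not the actual Cougar interface.               */
    extern void quit_quantum(void);   /* assumed: give the quantum back to the PCT */
    extern int  event_arrived(void);  /* hypothetical application-level test       */

    void wait_for_event(void)
    {
        while (!event_arrived())
            quit_quantum();           /* let the PCT run other work until the
                                         awaited event shows up              */
    }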

4.2 ASCI Option Red Message Communication Software: Portals

Low level communication on the ASCI Option Red computer uses Cougar Portals [15]. A Portal is a window into a process's address space. This address space is user accessible, meaning that copies between kernel space and user space are avoided. Since extra copies increase communication latencies, Portals support low latency communication. In addition to low latencies, Portals provide high-performance asynchronous transfers with buffering in the application's data space.

Portals are managed through a Portal table residing on each node. Entries in the portal table are either portal descriptors pointing to different blocks of memory or a matching list. We'll say more about matching lists later. For now, consider the different types of memory blocks that the portal descriptors point to:

Dynamic memory is managed by the Q-Kernel as a heap. When a message arrives, a block of memory is malloc'ed from the heap and the message is placed in that memory. These memory blocks within the heap can also be freed using user level commands.

Single Block memory is implemented as a single contiguous block of memory. Multiple messages arrive on this portal and the message bodies are placed in the next available segment of the single memory block. This portal supports multiple nodes reading or writing a single memory block and is used to implement I/O and other system servers.

Independent block memory uses an array of independent memory buffers. Each message is placed in one and only one memory buffer. The array can be managed as a linear array or a circular array. This type of memory block is used to set up repetitive data transfers. For example, one would use independent block portals for updating the data in ghost cells.

Finally, the Combined Block memory shares traits with the independent and single block memories. It looks like an independent block memory since it is implemented as an array of independent memory buffers. The messages, however, are placed into the buffers as if they were going into a logical single memory block. In other words, message bodies are broken up as needed so an entire buffer is filled before going to the next one.
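The declarations below are a hedged sketch of how the four memory block types might be described in C. The type names and fields are ours, chosen to mirror the descriptions above; they are not the actual Puma/Cougar portal declarations.

    #include <stddef.h>

    /* Hedged sketch only: these declarations mirror the four memory block
     * types described above.  The names and fields are ours; they are not
     * the actual Puma/Cougar portal declarations.                          */
    typedef enum {
        PORTAL_DYNAMIC,        /* heap managed by the Q-Kernel                  */
        PORTAL_SINGLE_BLOCK,   /* one contiguous buffer shared by many messages */
        PORTAL_INDEPENDENT,    /* array of buffers, one message per buffer      */
        PORTAL_COMBINED        /* array of buffers filled as one logical block  */
    } portal_block_kind;

    typedef struct {
        portal_block_kind kind;
        void   *base;          /* start of the user-space memory region           */
        size_t  length;        /* total size of the region                        */
        size_t  num_buffers;   /* used only by the independent and combined kinds */
    } portal_descriptor;       /* hypothetical analogue of a portal table entry   */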

A range of options can be associated with the different types of memory blocks: sender managed offsets, receiver managed offsets, read only, write only, acknowledge write, save header, save body, and save header and body.

Not all options are available with each type of memory block. Table 2 indicates which options are available with which type of memory block.


 

                              Memory Block Type
Option                        Dynamic    Single    Combined    Independent
Sender Managed Offsets        no         yes       yes         no
Receiver Managed Offsets      no         yes       yes         no
Read Only                     no         yes       yes         yes
Write Only                    yes        yes       yes         yes
Acknowledge Write             yes        yes       yes         yes
Save Header                   yes        no        no          yes
Save Body                     no         yes       yes         no
Save Header and Body          yes        no        no          yes

Specialty:
  Dynamic     -- general message passing
  Single      -- multiple servers to support I/O and other services
  Combined    -- scatter or gather operations
  Independent -- repetitive data transfers

Table 2: Available options for Portal Memory blocks and example specialties for each memory block type.

Since the memory associated with the portals is in user addressable space, copies between kernel and user space are minimized, resulting in low latency communication. In addition, multiple processes can interact with a portal for reading and writing, making portals well suited for implementing parallel servers (e.g. I/O servers). Table 2 also summarizes the major anticipated uses of the various types of memory blocks.

A portal table entry can optionally point to a matching list rather than a specific memory block. The matching list lets messages bind to specific portals based on matching criteria: the source group id, the source group rank, and a 64-bit matching key. If a match occurs, the indicated portal operation takes place. If no match occurs, the user has control over the system behavior: the message can be discarded or placed into an overflow buffer for later matching.
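A hedged sketch of this matching test is shown below. The structure layout and the comparison function are illustrative only; the real matching list implementation (including any wildcarding) is not reproduced here.

    #include <stdint.h>

    /* Hedged sketch of the matching test described above: an entry matches a
     * message on the sender's group id, group rank, and a 64-bit key.  The
     * layout and the function are illustrative, not the real Puma code.     */
    typedef struct {
        int      src_group;    /* sender's group id              */
        int      src_rank;     /* sender's rank within the group */
        uint64_t match_key;    /* 64-bit matching key            */
    } match_criteria;

    static int portal_match(const match_criteria *entry, const match_criteria *msg)
    {
        return entry->src_group == msg->src_group &&
               entry->src_rank  == msg->src_rank  &&
               entry->match_key == msg->match_key;
    }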

Application programs will have access to Portals, but we expect most applications will use higher level message passing libraries. A variety of message passing protocols will be available on top of portals. The preferred library will be MPI. This will be a full implementation of the MPI 1.1 specification. We are also planning to implement the one sided communication routines from MPI-2.

To support applications currently running on Paragon supercomputers, we implemented a subset of the NX message passing library on top of Portals. Since both MPI and NX run on top of Portals, latency and bandwidth numbers will be comparable for applications utilizing either library.

4.3 Programming Environment

The programming environment will be familiar to anyone experienced with Unix-based supercomputers. The programs running on each node use standard sequential languages. To move data between nodes, the programs explicitly call routines from a message passing library. When developing software, building programs or performing other interactive operations, the computer has the look and feel of a Unix workstation.
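For example, a minimal message passing program in C looks like the following. It uses only standard MPI 1.1 calls and is shown here simply to illustrate the explicit message passing style; it is not taken from any application described in this paper.

    #include <mpi.h>
    #include <stdio.h>

    /* A minimal example of the explicit message passing style: node 0 sends
     * a small buffer of doubles to node 1 using standard MPI 1.1 calls.    */
    int main(int argc, char **argv)
    {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
            printf("node 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }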

Fortran77, Fortran90, C and C++ compilers from Portland Group Incorporated (PGI) will be fully supported. Interactive debuggers and performance analysis tools will work with and understand the source code for each of these languages. For data-parallel programming, the HPF compiler from PGI will also be provided - though the software development tools will not be able to resolve operations back to the HPF source code.

While message passing (either explicitly or implicitly through HPF) is used between nodes, shared memory mechanisms are used to exploit parallelism on a node. The user has three multiprocessor options. First, the second processor can be completely ignored. Alternatively, it can be used as a communication co-processor. Finally, a simple threads model can be used to utilize both processors on a single application. The threads model is accessed through compiler-driven parallelism (using the -Mconcur switch to the compiler) or through an explicit dispatch mechanism referred to as the COP interface.

The COP interface lets a programmer execute a routine within a thread running on the other processor. Global variables are visible to both the COP thread and the parent thread. To use COP, the programmer passes COP the address of the routine, the address of a structure holding the routine's input parameters, and an output variable to set when the routine is completed. The COP'ed routine cannot make system calls (including calls to the message passing library).
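The fragment below sketches the flavor of this dispatch. The name cop_start() and its exact argument order are hypothetical stand-ins for the real COP interface; the example only illustrates the three pieces described above (the routine, its argument block, and the completion flag) and the restriction that the dispatched routine makes no system calls.

    /* Hypothetical sketch of the COP dispatch style.  The name cop_start()
     * and its argument order are assumptions, not the real COP interface.  */
    typedef struct { const double *x; double *y; int n; } scale_args;

    static void scale_routine(void *p)     /* runs on the second processor; */
    {                                      /* it may not make system calls  */
        scale_args *a = (scale_args *)p;
        for (int i = 0; i < a->n; i++)
            a->y[i] = 2.0 * a->x[i];
    }

    /* assumed: run fn(args) on the other processor; *flag is set when done */
    extern void cop_start(void (*fn)(void *), void *args, volatile int *flag);

    void scale_on_second_cpu(const double *x, double *y, int n)
    {
        scale_args a = { x, y, n };
        volatile int done = 0;
        cop_start(scale_routine, &a, &done);
        /* the parent thread is free to do other work here */
        while (!done)
            ;                              /* wait for the COP'ed routine   */
    }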

The -Mconcur option is the same approach that was used on the MP Paragon supercomputer. The PGI compilers attempt to automatically detect and utilize loop level parallelism when they are invoked with the -Mconcur switch. In most cases, the compiler doesn't find much that it can safely parallelize. The programmer, however, can help the compiler with its parallelization by analyzing the code and adding directives. Of the two application co-processor modes (-Mconcur and COP), only -Mconcur is fully supported at this time.

The debugger on the ASCI Option Red computer will be a major reimplementation of the ipd debugger developed for the Paragon XP/S supercomputer. The debugger has been designed to scale up to the full size of the ASCI Option Red computer. It includes both graphical and command line interfaces. The debugger's command line interface has been designed to mimic the dbx interface wherever possible.

The performance analysis tools will use the counters included on the Pentium Pro processor and on the Network Interface Chip (NIC). The counters on the Pentium Pro processor let users track a range of operations including floating point operation counts and cache line hits/misses. Each Pentium Pro processor has two counters, so only two independent events can be counted at one time. The NIC has 10 independent counters.

We anticipate that the applications on this system will run for many hours. Hence, even a system mean time between failures in excess of our target (>50 hours) will not be sufficient. Therefore, a check-point/restart capability will be provided. Automatic check-pointing is exceedingly difficult to implement on systems as large as this one. Hence, applications will need to assist the check-pointing by putting themselves into a clean state prior to explicitly invoking a check-point. A clean state is one where the communication network does not hold any message passing state for the application being check-pointed. The I/O bandwidth will be sufficient to check-point the entire system memory in approximately five minutes (594 Gbytes of RAM written at a combined 2 Gbytes/sec to the two RAID subsystems takes roughly 300 seconds).

5 System RAS Capabilities

We have set aggressive reliability targets for this system. The key target is that a single application will see a mean time between failure of greater than 50 hours. In the aggregate, however, we expect far more from our system. We are committed to having the system in continuous operation for greater than 4 weeks with no less than 97% of the system resources being available. In order to meet these targets, the system includes sophisticated Reliability, Availability, and Serviceability (RAS) capabilities. These capabilities will let the system continue to operate in the face of failures in all major system components.

Three techniques are used to meet our system RAS targets. First, the system includes redundant components so failures can be managed without swapping hardware. For example, the dual plane ICF uses Z-X-Y-Z routing so a bad mesh router chip (MRC) can be by-passed without causing system failure. In addition, the system will include on-line spare compute-nodes that can be mapped into the system without interruption.

The second RAS technique is to build the system so all major components are hot-swappable and can be repaired while the system continues to run. The compute nodes and the on-line spares, for example, can all be replaced without having to cycle the power of any other module. Similarly, system operation can continue if any of the 640 disks, 1540 power supplies, or 616 ICF backplanes should fail.

Finally, to manage these RAS features and to manage the configuration of such a large system, an active Monitoring and Recovery Subsystem (MRS) is included. At the heart of this subsystem is the Patch Support Board (PSB) -- one per 8-board card cage. The PSB monitors the boards in its card cage and updates a central MRS database over a separate RAS Ethernet network. This database will drive an intelligent diagnostic system and will help manage the replacement of system units. The console for the RAS subsystem is the Scalable Platform Support Station (SPS station). One SPS station is connected to the RAS Ethernet network on each side of the machine (see Figure 1).

6 System Performance

At the time this paper is being written (first quarter, 1997), the full ASCI Option Red computer has not yet been built. Approximately three quarters of the system was available - more than enough to make this the fastest supercomputer in the world. In the next few subsections we describe the results of performance benchmarks and results from a simple benchmark application.

6.1 Performance Tracking

Most of the early applications work on the ASCI Option Red supercomputer was designed to validate the soundness of the system design and its ability to scale to thousands of nodes. This work was quite successful with a number of applications - including some full production applications - running on up to 3600 nodes.

While it is important that the ASCI Option Red supercomputer functions correctly, it is equally important that the system delivers the expected performance. To track system performance, we created a performance benchmark suite. The goal of this suite is to produce a handful of numbers to assess system performance. The performance tracking suite includes the following codes: the Livermore Loops, a message passing test (comtest), the STREAM memory bandwidth test [9], and a matrix multiply test.

Some results from the performance tracking benchmark suite are included in Table 3. In every case, the machines were Kestrel-based systems located at Intel's factory.

The performance levels in Table 3 are impressive. The Livermore Loop and Stream test numbers are in the same ballpark as those from other high end workstations. The communication numbers are among the best ever reported for an MPP system. Finally, the matrix multiply numbers show that compiled Fortran77 dgemm() is able to execute at approximately 50% of the speed of the assembly coded routine in the libkmath library.
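For reference, the four Stream Test rows in Table 3 correspond to the standard STREAM kernels [9]. The loops below show exactly what each row measures; the arrays must be sized to overflow the caches so that the loops exercise main memory.

    /* The four standard STREAM kernels [9]; each "Stream Test" row in Table 3
     * reports the sustained memory bandwidth of one of these loops.  Array
     * sizes must be large enough that the data does not fit in the caches.  */
    #define STREAM_N (2 * 1000 * 1000)
    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    void stream_kernels(double q)
    {
        for (int i = 0; i < STREAM_N; i++) c[i] = a[i];            /* Copy  */
        for (int i = 0; i < STREAM_N; i++) b[i] = q * c[i];        /* Scale */
        for (int i = 0; i < STREAM_N; i++) c[i] = a[i] + b[i];     /* Add   */
        for (int i = 0; i < STREAM_N; i++) a[i] = b[i] + q * c[i]; /* Triad */
    }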

The two columns of data reflect changes in the system over an important phase of the project. The first column refers to data collected in September of 1996. At that time, only limited optimization had been done on the system software. The compiler backend was a generic x86 backend without any special knowledge of the Pentium Pro processor. Also, many of the vector and loop level optimizations within the compiler were only partially working.

As the data in Table 3 shows, the system improved significantly over this period. The communication numbers improved significantly as did the maximum Livermore loops MFLOPS. The other Livermore Loop numbers and the stream test numbers, however, showed only modest improvements. The matrix multiply numbers degraded slightly -- though it is unlikely that the observed change is large enough to be significant.

Additional improvements are expected. The communication numbers should improve dramatically; Intel's engineers have already seen bandwidths in excess of 350 MBytes/sec. Furthermore, once the second processor can be used as a communication co-processor, the latencies should drop into the single digits. Finally, an updated Pentium Pro processor-aware compiler backend is expected early in 1997. This will improve the performance numbers even further.

These tests provide a good relative measure of the system performance. These tests aren’t very good, however, at detecting systematic errors in the system’s performance. To resolve this issue, we need a benchmark for which we have an analytic performance target. If we match this target, then we know our system is performing as it should.

An application well suited to this type of analysis is QUEST [12] - an ab initio quantum chemistry program developed at Sandia National Laboratories (SNL). In an earlier study [6], we analyzed the nboxcd() kernel from QUEST. This kernel resembles a modified dense matrix multiply operation. Our analysis showed that this kernel should run somewhere between 110 MFLOPS and 130 MFLOPS (depending on the state of the L2 cache prior to the kernel's operation).


System                                    si238        si58
Software Release                          WW34a_1      WW45
Date Tests Were Run                       9/17/96      12/31/96
Clock Speed (MHz)                         200          200

Livermore Loops
  AM MFLOPS                               33.9         42.6
  GM MFLOPS                               29.6         33.9
  HM MFLOPS                               24.3         25.9
  Minimum MFLOPS                          5.9          5.7
  Maximum MFLOPS                          61.4         111.8
  Standard Deviation (MFLOPS)             16           28.4

Comtest
  Bandwidth (Mbytes/sec)                  272.4        302
  CSEND Latency (usec)                    12           10

Stream Test
  Copy (Mbytes/sec)                       85.9         109.1
  Scale (Mbytes/sec)                      107.2        108.6
  Add (Mbytes/sec)                        128.7        129
  Triad (Mbytes/sec)                      128.5        129.3

Matrix Multiply
  F77, per-node MFLOPS                    62.4         54.6
  libkmath, per-node MFLOPS               119.4        111.1

Table 3: Results from the performance tracking benchmark suite. The tests are not strongly dependent on the number of nodes. These particular tests used four nodes. None of these tests used the second processor for communication or for computation.

We created a stand alone benchmark program based on this kernel. Table 4 compares results for these tests built with the PGI compiler and the Intel reference compiler (i.e. the Proton compiler). These tests used two forms for the benchmark: one with the original code and the other with the loops unrolled.

The Proton compiler hits the target performance. This compiler is highly optimized for the Pentium Pro processor, so its high performance is not surprising. The PGI compilers are well short of the target performance. PGI is still working on the compiler, however, and the next releases will hopefully close the gap. Between September and December they improved by 150%; they only need a further 50% performance gain to reach the desired performance.


 

Nboxcd() MFLOPS for three different compilers

Code                             PGI 9/96    PGI 12/96    Proton
Original Kernel                  26          56           83
Kernel with unrolled loops       30          75           120

Table 4: Performance in MFLOPS for the nboxcd() kernel from QUEST with various compilers. Two different releases of the PGI compilers are included -- 9/96 (Rel 1.1) and 12/96 (Rel 1.2-5). The Proton compiler is the Pentium Pro processor reference compiler developed by Intel. These are single node computations carried out on a 200 MHz Kestrel based node. The expected optimum performance ranges from 110 to 130 MFLOPS.

 

6.2 MP LINPACK Performance

MP LINPACK is a well known benchmark for high performance computing [4]. The benchmark measures the time to solve a real, double precision (64-bit) linear system of equations with a single right hand side. On December 4, 1996, we set a new world record for MP LINPACK by running the benchmark in excess of one TFLOPS. At that time, the ASCI Option Red computer was only 80% complete, but that was more than enough to break the MP LINPACK TFLOPS barrier. In fact, the previous record was 368 GFLOPS, so we didn't just break the record, we shattered it! The specifics of this breakthrough calculation are given in Table 5.

While the rules for the LINPACK benchmark require use of the standard benchmark code, MP LINPACK lets you rewrite the program as long as certain ground rules are followed [4]. Our MP LINPACK code used a two dimensional block scattered data decomposition with a block size of 64 [5]. The algorithm is a variant of the right looking LU factorization with row pivoting and is done in accordance with LAPACK [1]. The parallel implementation [8] used a two dimensional processor mesh and a block wrapped mapping of the matrix. Columns of processors cooperated synchronously to compute a block of pivots, which were then passed asynchronously across the rows. Look-ahead pivoting was used to keep pivoting out of the critical latency path. We report timings for real floating point operations and not "macho" FLOPS obtained by using Strassen [13] (or Winograd [16]) multiplication. The code explicitly computed all the relevant norms and did several rigorous residual checks to guarantee accuracy. The matrix generation was identical to that in ScaLAPACK version 1.00 Beta, a standard MPP package for linear algebra [2].
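For readers unfamiliar with the right-looking formulation, the sketch below is a plain serial, unblocked LU factorization with partial pivoting. It is meant only to show the shape of the computation (pivot selection, row swap, and the rank-one update of the trailing matrix); the record code is blocked, distributed over a two dimensional processor mesh, and bears little resemblance to this toy version.

    #include <math.h>

    /* Serial, unblocked right-looking LU with partial pivoting, column-major
     * storage with leading dimension lda.  A toy sketch of the shape of the
     * computation only; the record code is blocked, pipelined, and distributed
     * over a 2-D processor mesh.                                             */
    void lu_factor(int n, double *A, int lda, int *piv)
    {
        for (int k = 0; k < n; k++) {
            int p = k;                              /* choose the pivot row   */
            for (int i = k + 1; i < n; i++)
                if (fabs(A[i + k * lda]) > fabs(A[p + k * lda]))
                    p = i;
            piv[k] = p;
            if (p != k)                             /* swap rows k and p      */
                for (int j = 0; j < n; j++) {
                    double t = A[k + j * lda];
                    A[k + j * lda] = A[p + j * lda];
                    A[p + j * lda] = t;
                }
            for (int i = k + 1; i < n; i++)         /* scale the pivot column */
                A[i + k * lda] /= A[k + k * lda];
            for (int j = k + 1; j < n; j++)         /* right-looking rank-one */
                for (int i = k + 1; i < n; i++)     /* update of the trailing */
                    A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
        }
    }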


Number of compute nodes                           3,632
Peak performance                                  1.45 TFLOPS
Number of 200 MHz Pentium Pro processors          7,264
Total system memory                               465 Gbytes
Total memory used by the benchmark program        360 Gbytes
Matrix order                                      215,000
N1/2                                              53,400
Rate (Rmax)                                       1.068 TFLOPS

Table 5: Key numbers describing the calculation that broke the one TFLOPS MP LINPACK barrier [4]. N1/2 is the minimum problem size (to the nearest 100) such that half the Rmax performance was achieved.

The code for the 1.06 TFLOPS MP LINPACK record was derived from programs used to set earlier MP LINPACK records on Intel's Paragon supercomputers. The initial implementation was based on work by Robert van de Geijn to capture the world record in 1991 on the Intel Touchstone Delta [14]. The Delta code was modified to run on the Intel MP Paragon and used hand-tuned i860 assembly code kernels written by Bob Norin (Intel Corp.) and Brent Leback (Axian Corp.), with additional optimizations by Satya Gupta and Greg Henry (both from Intel Corp.). For the TFLOPS benchmark, these kernels were rewritten in x86 assembly code by Satya Gupta, Stuart Hawkinson, and Greg Henry (all from Intel Corp.). For a detailed description of the techniques and algorithms used in this code, see the paper by Bolen et al. [3].

Our past work with MP LINPACK has shown that for very large problems, at least 93% of the runtime is consumed by the BLAS-3 matrix multiplication code, DGEMM (which computes C=C-A*B). A Pentium Pro processor can start one floating point instruction every clock, with the caveat that two multiplies cannot be started on successive clocks. The DOT product kernel of DGEMM contains (almost) the same number of adds and multiplies. Therefore, it should be possible to approach a rate of one flop per clock (i.e. 200 MFLOPS at 200 MHz). In practice, it is necessary to arrange the inner loop to hide all of the memory references and loop control. Data movement costs limit DGEMM performance to about 90% of peak, or 180 MFLOPS on a single processor. We wrote an assembly coded version of DGEMM optimized for the DGEMM operations within MP LINPACK. This dual processor code for large DGEMM problems ran at 345 MFLOPS. Combined with the 93% MP LINPACK efficiency (pivoting, communication, etc.), we estimate an MP LINPACK performance of approximately 320 MFLOPS/node, or 1.4 TFLOPS, on the final system installed at Sandia National Laboratories.

Our TFLOPS run did not achieve this level of performance; we achieved around 294 MFLOPS/node. We have conducted experiments with larger local matrices on each node and have confirmed that for larger local matrices, our MP LINPACK program runs at around 310 MFLOPS/node. This is close to the expected per node performance.
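As a point of reference, the DGEMM operation discussed above is the ordinary C = C - A*B update; a straightforward (and far slower) C version is given below. It performs 2*m*n*k floating point operations, which is the operation count behind the MFLOPS figures above; the 345 MFLOPS rate comes from the hand-tuned dual-processor assembly kernel, not from code like this.

    /* Reference (unoptimized) version of the C = C - A*B update discussed
     * above: column-major storage, leading dimensions lda, ldb, and ldc.
     * The flop count is 2*m*n*k.  The production kernel that reached 345
     * MFLOPS per node is hand-coded x86 assembly, not C like this.        */
    void dgemm_update(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
    {
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++) {
                double bpj = B[p + j * ldb];
                for (int i = 0; i < m; i++)
                    C[i + j * ldc] -= A[i + p * lda] * bpj;
            }
    }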

6.3 sPPM

While benchmarks are useful for evaluating supercomputers, the purpose of such a large supercomputer is to run applications of interest to the DOE. At the time the TFLOPS MP LINPACK benchmark was run, we had over 20 application-oriented kernels and full programs running on the system. In this section, we will talk about one of these.

PPM [11] (Piecewise Parabolic Method) is a 3-D hydrodynamics code used heavily at DOE national laboratories. The program models a wide range of shock physics problems. It performs PPM hydrodynamics in Lagrangian style using a Riemann solver. A simple gamma-law equation of state is used, and an initially uniform grid with either periodic or continuation boundary conditions is assumed. As part of the ASCI Option Blue program (a program to build a 3 TFLOPS machine by the end of 1998), a benchmark version of this code, sPPM, was created and released to the supercomputing benchmarking community. The code is very similar to PPM, but it avoids sensitive algorithms and was designed to be easy to run for benchmarking.

The sPPM program was written in Fortran77, with all system dependent calls taking place through C. It uses a small number of MPI routines for communication between nodes. It was straightforward to port the code to our system. In Table 6, we provide results with the code for calculations on 8 to 3600 nodes. These results have been compared to those generated with PPM on other systems. The ASCI Option Red system performed well when compared to other high end RISC based systems.


 

Nodes      Nodes (x)    Nodes (y)    Nodes (z)    Time in Seconds
8          2            2            2            358
64         4            4            4            370
512        8            8            8            377
3375       15           15           15           394
3600       15           15           16           395

Table 6: sPPM results for a scaled data set with a 64x64x64 point tile on each node. The computations were run on 8 to 3600 nodes. We could have used additional nodes, but we only had 3600 nodes at the time these tests were run.

In this benchmark, the problem size increases as more nodes are added. Hence, an ideal parallel computer would run this benchmark in a roughly constant time as a function of the number of nodes. As can be seen in Table 6, the Intel ASCI Option Red computer comes within 90% of this ideal.
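The three dimensional node decompositions in Table 6 (2x2x2 up to 15x15x16) are the kind of layout that MPI's Cartesian topology routines express directly. The fragment below is a hedged illustration of how such a decomposition might be set up; it is not taken from the sPPM source.

    #include <mpi.h>

    /* Illustration only: build a 3-D Cartesian communicator of the kind shown
     * in Table 6 (for example, 15 x 15 x 16 on 3600 nodes).  This is not code
     * taken from sPPM itself.                                                */
    void make_3d_decomposition(MPI_Comm *cart)
    {
        int nprocs, rank;
        int dims[3] = {0, 0, 0};       /* let MPI pick a balanced factorization */
        int periods[3] = {1, 1, 1};    /* periodic boundaries in all directions */
        int coords[3];

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 3, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, cart);
        MPI_Comm_rank(*cart, &rank);
        MPI_Cart_coords(*cart, rank, 3, coords);   /* this node's (x, y, z) tile */
    }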

7 Conclusion

In late 1995, Intel accepted the DOE's challenge to build a machine before the end of 1996 that could run the MP LINPACK benchmark in excess of one TFLOPS. We successfully met this challenge using 80% of a machine (3632 nodes) that has come to be known as the ASCI Option Red supercomputer.

With this benchmark behind us, the next phase of the project is to install the full system (4536 nodes) at Sandia National Laboratories in Albuquerque New Mexico. This is being carried out methodically one block at a time to be sure the system will work as expected at final assembly. This process should be complete in the first half of 1997. At that time, the next big test will be to confirm that an application can run on the full machine. Our early work with over 20 application programs - many of which ran on thousands of nodes - leaves us confident that this milestone will be met.

Acknowledgments

The figures used in this paper are based on figures provided by Peter Wolochow and Paul Prince. All trademarks are the property of their respective owners.

 

References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK User's Guide, SIAM Publications, Philadelphia, PA, 1992.

[2] S. Blackford, J. Choi, A. Cleary, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R.C. Whaley, "ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance," Proceedings of Supercomputing '96, Pittsburgh, Pennsylvania, http://www.supercomp.org/sc96/proceedings, 1996.

[3] J. Bolen, A. Davis, B. Dazey, S. Gupta, G. Henry, D. Robboy, G. Schiffler, D. Scott, M. Stallcup, A. Taraghi, S. Wheat, L. Fisk, G. Istrail, C. Jong, R. Riesen, L. Shuler, "Massively Parallel Distributed Computing: World's First 281 Gigaflop Supercomputer," Proceedings of the Intel Supercomputer Users Group 1995, http://www.cs.sandia.gov/ISUG/program.html.

[4] J.J. Dongarra, "Performance of various computers using standard linear equations software in a Fortran environment", Computer Science Technical Report CS-89-85, University of Tennessee, 1989, http://www.netlib.org/benchmark/performance.ps

[5] G. Golub, C. Van Loan, Matrix Computations 2nd Ed., The John Hopkins University Press, 1989.

[6] S. Gupta and T.G. Mattson, "Optimization of QUEST for the ASCI Option Red System", Intel TFLOPS Project Research Report, 1996.

[7] L. Gwennap, "Intel’s P6 Uses Decoupled Superscalar Design", Microprocessor Report, Feb. 16, 1995.

[8] B.A. Hendrickson and D.E. Womble, "The torus-wrap mapping for dense matrix calculations on massively parallel computers", SIAM J. Sci. Stat. Comput., http://www.cs.sandia.gov/~bahendr/torus.ps.Z, 1994.

[9] J.D. McCalpin, "STREAM: Measuring sustainable Bandwidth in High Performance Computers", http://www.cs.virginia.edu/stream, 1996.

[10] L. McVoy, "Tools for Performance Analysis: LMBENCH", http://reality.sgi.com/employees/lm/lmbench/lmbench.html, 1996.

[11] J. Owens, "The ASCI sPPM Benchmark code", http://www.llnl.gov/asci_benchmarks/asci/limited/ppm, 1997.

[12] M Sears, "QUEST User Guide", Documentation distributed with the QUEST program.

[13] V. Strassen, "Gaussian Elimination is not Optimal", Numer. Math. Vol. 13, pp. 354—356, 1969.

[14] R.A. van de Geijn, "Massively Parallel LINPACK Benchmark on the Intel Touchstone DELTA and iPSC(R)/860 Systems",1991 Annual Users’ Conference Proceedings, Intel Supercomputer Users' Group, Dallas, TX, 10/91

[15] S.R. Wheat, R. Riesen, A.B. Maccabe, D.W. van Dresser, and T.M. Stallcup, "Puma: An Operating System for Massively Parallel Systems", Proceedings of the 27th Hawaii International Conference on Systems Sciences, Vol. II, p. 56, 1994.

[16] S. Winograd, "A new algorithm for inner product", IEEE Trans. Comp., Vol. C-37, pp. 693—694, 1968.