Welcome to UnixReview.com

[Prognosis UNIX]


Main Menu
  • Home
  • Archives
  • Reviews
  • Books
  • Editor's Choice
  • Geek Links
  • Contact Us

  • Sections
  • Open Source
  • Certification
  • Shell Corner
  • Unix Shows
  • Updates

  • [Advertisement]
    Newsletter
    Get the Newsletter
    Get the Newsletter

      


    The rule that the bigger the cache, the better the performance is not always true. Understanding your applications is the best way to maximize cache benefits.
    David Wilson

    The term cache is used in a variety of contexts in computer jargon. There are processor caches, memory caches, disk caches, buffer caches, and peripheral caches. The general use of the term describes any small memory that logically sits in the flow of data between two larger data sources or sinks.

    For computer designers, a cache is a hardware crutch. Caches are needed because, at each level in the processor, memory, and peripheral design of a computer, components run at inherently different speeds. Caches make these speed differences less apparent to the supported devices and allow each subsystem to run asynchronously at or near its rated speed.

    This article is about processor caches--the generally small, fast memory that is logically located between a CPU chip and the system RAM, and about how the processor cache affects system performance. The processor cache allows the CPU to execute as close to its maximum instruction rate as possible despite the system RAM access time being too slow to service the CPU directly.

    Several terms are used to describe a cache. Practically speaking, the key parameters of the cache are speed (in nanoseconds) and size (in kilobytes). The speed should be fast enough to allow the processor to run at full throttle, if possible, and the size is the independent factor that trades higher performance for higher cost.

    Other terminology, such as cache line width, is often reported but is of little interest. The cache line width is the number of bytes moved from system memory to cache when a cache miss occurs. Typically, a cache has a relatively wide (in bytes) but slow connection to system RAM and a narrower but faster connection to the processor. When the processor needs information, it is fetched quickly if it is in the cache. If the requested information is not in the cache, a cache miss occurs. Several data words (equal to the line width) are moved from system RAM to cache, and at the same time, the requested data word is provided to the CPU. If the application being run next requests instruction or data from the next (or nearby) location, it will already have been placed in cache by the previous cache-miss operation. Thus, the "locality of reference" or "locality of data" (how often the processor needs "close" instructions or data) becomes the determining factor in how well an application interacts with processor cache.

    While the effect of line width on performance certainly is interesting, a particular system is designed with a fixed, specific cache-replacement policy. Regardless of how this policy is established and what its effects are, its role in system performance is secondary to factors such as cache speed and size.

    The terms L1 and L2 will be used frequently in this article. L1 (for level 1) is the cache that is physically on the processor chip, and L2 is the cache that is external to the processor chip. Systems can be designed with any combination of L1 and L2 cache. However, the design of the processor chip itself determines the presence of L1 cache, whereas the system designer can add L2 cache to nearly any CPU.

    The Middle Ages

    The first appearance of processor cache in typical personal computers was in the mid- to late-1980s, with the advent of the 80386. While there is no reason that an 80286 system cannot have L2 cache (and perhaps some actually do) most 8088 and 80286 systems were designed without L2 cache.

    Similarly, cache was adopted into RISC workstations in the mid- to late-1980s. Cache appeared in servers and the more expensive workstations first, becoming pervasive by the end of the decade. Often, the presence of cache was the primary difference between workstation models, with the cache causing a significant bulge in the list price of a system.

    Figure 1 shows the performance of two systems from 1985 to 1990, run with and without caches. These NCR machines, using 16.67MHz and 25MHz 68020 processors, came with plug-in cache modules. Naturally, we tested the systems with and without their processor caches because they were easy to remove.

    First, the cache size increased from 8KB to 32KB as the processor clock increased. This clock rate increase was due to a couple of years of technology, and both cache sizes are typical of the era. For the 16.67MHz 32/400, the 8KB cache was an expensive option, whereas the 32KB cache was standard on the 25MHz 32/450.

    The Dhrystone value, the old benchmark standard, shows no difference in performance between the two machines when run without cache. This was perfectly reasonable for these systems. The CPU card was a plug-in board, and the memory board was identical between the two machines. So, for systems without cache, the processor/memory interface ran at the memory speed, and Dhrystone results reflect that. This is why the 25MHz machine came standard with a cache--if it did not, the performance of the faster and slower systems would have been essentially identical for most tasks, and customers would likely have taken exception.

    From the Figure 1 data, it seems that the 8KB cache gave full processor performance. The larger 32KB cache on the faster machine gave a Dhrystone value a little less than a clock-rate linear estimate would have predicted. Alternatively, the smaller cache is doing a little better than expected, but in any event, its small size clearly was not affecting Dhrystone performance.

    While Dhrystone is not considered a reliable benchmark for most purposes, running the same binary file on similar systems is a valid test and does show relative performance, at least for the physically small Dhrystone program.

    From 1985 to 1990, processor caches were external to the CPU chip. The decision to incorporate the cache onto the processor chip was primarily controlled by chip size (technically known as die size) and transistor count. Moderate-size caches consume an enormous number of transistors, hence potentially using a large portion of the available die. The 80386, the Motorola 68020, and the first generation of RISC processors were, for their time, relatively large and complex even without an on-chip cache.

    As the size of the transistors on the chip decreased, however, more space became available for caches. The increase in processor clock speeds has also increased the importance of a cache, so more recent CPUs generally have at least a small L1 cache.

    The Modern Era

    Today, a typical processor is a Pentium, characterized by a complex integrated CPU with a 16KB L1 cache. Most Pentium system designs also allow a substantial L2 cache. Similarly, current RISC processors generally incorporate a small L1 cache and allow substantial L2 caches.

    The rest of this article discusses the performance examples that were run on an Intel AltServer. It is a good machine for benchmark purposes. Our machine has two 166MHz Pentium processors, a 512KB L2 cache (used by both CPUs), a Fast-Wide SCSI interface, and both PCI and EISA/ISA card slots. The machine can run at any speed from 75MHz to 166MHz; one processor is easily removable, and the L2 cache can be 0KB, 256KB, or 512KB. Thus, this machine can easily be run in a variety of configurations, perfect for benchmarking cache sizes. The results show trends and principles of cache sizes that are similar to RISC implementations.

    Figure 2 shows the Dhrystone performance of the Pentium test system with three different size caches. Comparing Figure 1 and Figure 2 shows performance improvement by a factor of 25 on Dhrystone performance since the 25MHz 68020. But Figure 2 also shows that the L1 cache is sufficient for maximum Dhrystone performance. Increasing the size of the L2 cache does not improve the system's performance due to the small workload represented by the Dhrystone benchmark.

    Figure 3 shows the results of a benchmark that uses increasingly larger amounts of data. In this case, the test does double-precision addition (8 bytes per data element), making use of three square matrices. The test does an element-by-element addition of two arrays, generating the third array.

    For the 5x5 array size, requiring 600 bytes of data, the entire array fits into the L1 cache. Performance, measured in millions of operations per second, is essentially constant at 7.3 to 7.4 million operations per second, regardless of the size of the L2 cache.

    Increasing the array size to 25x25 increases the data requirement to 15KB, nearly the same as the L1 cache. Performance of the configuration with no external cache drops substantially, while either external cache size is able to sustain the original performance rates. This holds true for 100x100 matrices as well, although a 256KB L2 cache is just barely big enough for the 240,000 bytes of data and performance drops off a little with it.

    For 500x500 or 1,000x1,000 matrices, the total data size, 6MB or 24MB respectively, is substantially larger than the size of the L2 cache. For our Pentium system, when the size of the data being manipulated is substantially larger than the cache, the cache benefit is minimal--the performance of the three cache versions is essentially the same, at 2.9 to 3.0 million operations per second.

    Figure 4 shows data similar to Figure 3, except in Figure 4 the basic operation is single-precision, rather than the double-precision results in Figure 3. The single-precision test uses 4 bytes per data element. Thus, the total data space used is one-half the amount used in the case of double precision.

    One key difference between the results of Figures 3 and 4 is the 25x25 case with no L2 cache. For the double-precision test, the performance dropped dramatically from the 5x5 version. But in Figure 4 (single precision), the 25x25 matrix fits easily into the L1 cache, and the performance of the 0KB L2 cache test is essentially the same as the performance of the tests when run with 256KB or 512KB L2 cache.

    Figure 4 shows the same net result as Figure 3 when 500x500 or 1,000x1,000 matrices are used. For either case, when the amount of data being processed is substantially larger than the cache size, the performance is unaffected by the cache size. And in single- or double-precision tests, the performance on large matrices is substantially reduced from the peak (array in-cache) performance levels.

    Effects On Benchmarks

    This illustrates one of the issues with the SPEC benchmarks over time. As SPEC benchmarks have been developed (SPEC89, SPEC92, SPEC95), those programs have grown "larger"--in the sense that the working size of data for the average program is substantially larger in the later benchmarks. The purpose of increasing the size of the benchmarks was to negate the effect of any vendor with a slightly larger cache.

    As the data in Figures 3 and 4 shows, performance can change drastically when the size of the data set increases past the size of the cache. So some benchmarks, including those in early SPEC versions, might run far more quickly on a system with a large cache than on a system with a small cache. But SPEC attempts to make general statements about the performance of measured systems, and the cache-sensitive nature of the benchmarks should be viewed as narrowly defined by the size of the benchmark. Hence, a new generation of benchmarks is required when processor L2 caches make substantial increases in size.

    While we have been complaining about the performance loss suffered when increasing the order of the matrices in use, it could be worse. In system memory, matrices are packed in linear order. When, for the larger array sizes of Figures 3 and 4, a cache miss occurs, many successive words are moved from memory to cache. Then, a few successive array operations will work from this cached data until the next cache miss occurs and the cycle repeats.

    Suppose in the benchmark we simply reverse which array index varies most rapidly. Then, each cache miss brings in a set of data that will not be used in the near future, and every array operation will cause cache misses every time. For small (fully cached) arrays, this makes no difference, but for the 1,000x1,000 matrix, we encounter the same situation shown for the no-cache machine in Figure 1. Each operation requires a fetch from system memory. Whether the system has L2 cache or not, all tests get the same performance level, that of continuous main memory accesses. This costs a factor of 10 in performance, dropping the results from 3.0 to 0.3 million operations per second-all caused by changing just two lines of code in the benchmark.

    Even worse, take the original benchmark and double the order of the matrix once again. Now 96MB of data space is required for the double-precision test, and suddenly, in a 64MB RAM system, virtual-memory paging starts. So instead of losing a mere factor of 10, as in the preceding test, we lose a factor of 100 or more. This proves the old principle that nothing helps improve performance in a loaded virtual-memory system like more physical memory. It also shows that processor caches can help in some situations, but if the application is perverse or if the system is paging, the performance improvement gained from processor cache is marginal at best.

    The matrix tests shown in Figures 3 and 4 use the same data elements repeatedly, hence showing the maximum use of a cache. When an element is brought into the cache from the system RAM, these array tests reference it multiple times (up to 100 or more depending on the array size being tested). So, in some sense, this is best-case performance for a cache.

    On The Server Front

    Figure 5 presents a different benchmark scenario. Here, the Intel Pentium system is acting as a file server to a gaggle of personal computers. The server shares its disk drives, and each personal computer accesses one of the shared drives on the server. The test involves the clients (running Windows 95) doing reads and writes to the server's disk drives, using the NetBIOS/NetBEUI protocols. The result of the test is the throughput, in MB/sec, achieved in total by all client systems.

    We ran this test with three different L2 cache sizes: 0KB, 256KB, and 512KB. Increasing the size of the L2 cache substantially improved maximum total throughput (from 4.3MB/sec to 7.4MB/sec). Associated with this performance increase, the number of clients at the maximum performance level also increased, from 4 to 8.

    This benchmark was run through 18 clients. There are two key points to this massive amount of data:

    • At every level of client load, the larger the cache, the higher the total throughput level.

    • After the peak performance level was reached, the total system throughput began dropping off slowly. But even at 18 clients, the 512KB L2 cache performance level was still more than 2MB/sec greater than the 0KB L2 configuration. The 256KB L2 results were right in between.

    Because we could not increase the size of the cache in this test system beyond 512KB, we tried several other hardware changes to increase performance. We added physical memory, disk drives, disk controllers, and a second CPU; changed the number and types of Ethernet NICs; increased the CPU clocks to 166MHz; and tried various operating systems. Only two of these modifications had a significant effect on results, and in both cases, only the absolute performance level was changed--the relative performance at each cache size remained constant.

    Changing the NIC made a measurable difference in performance, although adding a second or third card and splitting the clients across separate networks did not change results. A different Ethernet card affected performance by approximately 10%.

    Changing the operating system affected results by up to 20%, again changing only the absolute numbers without changing any relative numbers and without changing the conclusions.

    This benchmark involves no application running on the server. The clients read and write data, and servicing this data access is the only thing the server is doing. This process is inherently multithreaded and so is potentially adaptable to a multiprocessor. Adding a second processor to the system increases performance when the test is run with only a few client systems (two or three), but as the client load increases to four or more, the system is limited in performance by the processor/cache/memory bottleneck. Thus, the additional processor is able to contribute little to overall performance because cache has become the scarce system resource.

    After this experimentation and monitoring performance with all the tools available, it was clear the processor/cache/memory was the bottleneck in limiting performance. A larger cache, if available, would have given us at least one more boost in system throughput. Plenty of bandwidth seemed to be available elsewhere in the system.

    These results were a little surprising, at least initially. The server environment, where data is almost never reused, still showed the positive effect of increasing the cache size. Clearly the data from Figures 3 and 4, representing nearly a best-case cache environment, shows that larger caches are often better, although in this earlier testing the size of the working data set could easily swamp the good effects of the L2 cache.

    Back To The Future

    Processor caches were developed because of a growing disparity between the improvements in CPU clock rates and the access time of system memory. The cache attempts to make up this speed difference.

    As a result, the faster the processor, the larger the required cache--simply because a faster processor is able to process more data and instructions per second and because the faster the processor, the larger the penalty (measured in instructions not run by the CPU) when a cache miss occurs.

    In the past, some systems were designed with minimal or no cache. This was often done to reduce cost. Early IBM RT PC systems and the Sun 4/110 SPARC desktop were noticeably slower because of the lack of cache. More recently, some IBM RS/6000 models and some DEC Alpha systems suffered performance problems due to small caches. In either case, more cache would have cost a small amount but would have markedly increased performance.

    The result is that cache sizes are increasing over time. It is not unusual to see L2 caches of 1MB to 4MB per processor on server systems today, and this size is increasing. L1 cache will also increase in size as semiconductor design rules allow a higher density of transistors on the CPU. It is unlikely a machine will be introduced without some cache unless cost is the dominant design criterion.

    The ultimate system of the future might have all its memory running at cache speeds, obviating the need for a processor cache. Including many megabytes of high-speed memory is prohibitively expensive but certainly would make for an impressive machine. But another factor is at work here.

    Electricity can travel approximately one foot in a nanosecond under ideal circumstances. As CPU clock rates pass 200MHz and head for 1GHz, the required access speed for processor cache drops toward one nanosecond. If all main memory ran at cache speeds, then even the physical distance of the RAM from the CPU would cause an appreciable delay in access, and the concept of making all RAM run at cache speed becomes difficult.

    Here again is the perfect role for L1 cache, which is physically associated with the processor. In the Intel Pentium Pro, for example, even the 256KB or 512KB L2 cache is on-chip, thereby minimizing the cache to CPU physical distance.

    The Intel Pentium MMX has increased the size of the L1 cache from 16KB (on older Pentiums) to 32KB. The effects of this cache increase are clearly measurable, although again it depends on the application. Dhrystone, for example, will run no faster on MMX because it fits into the small Pentium cache. But across the board, the MMX's larger cache will increase performance on typical jobs.

    Conclusions

    Cache is good. The bigger the better. Cache interacts in fairly complex ways with the applications being run. While we never have seen a situation where a larger cache is a negative, often a larger cache can provide a solid gain in performance.

    Cache size is increasing as a function of semiconductor technology. Internal cache size is determined by the chip designer but increases over time as the density of transistors on the processor chip increases. External-cache cost is significant, but cache RAM is becoming faster and more dense. This allows bigger external caches to be built to keep up with newer processors.

    The increasing speed of processors will create a growing gap between L1 and L2 cache performance. Just as processor cache initially was created to buffer speed differences between the processor and system RAM, faster processors will lead to the L2 cache merely supporting, at a slower rate, the processor L1 cache.

    We recommend purchasing the largest cache possible. If the system will run only a small set of applications, it is possible to benchmark a machine with two different cache sizes and see whether the larger cache helps. But on workstations and servers, if the machines run in both heavily and lightly loaded environments and with a variety of applications, the larger cache will certainly help sometimes. As long as the cache cost is a small portion of the total system cost, it will be worthwhile.

    Finally, despite the virtues of cache, its effect on system performance becomes apparent only when the system and applications are well-behaved. Processor cache serves only to allow the processor to run at full speed. When cache fails, such as the array index order example, or when some other system aspect is the bottleneck, such as the paging situation, increasing cache size will not help.


    David Wilson is president of Workstation Laboratories. Written inquiries and suggestions are welcome. You can contact him at P.O. Box 368, Humboldt, AZ 86329.


    FIGURE 1
    Performance Of NCR 32

    Processor @ Speed

    Cache Size

    Dhrystone

    68020/16.67MHz

    8KB

    4,006

    68020/16.67MHz

    0KB

    2,069

    68020/25MHz

    32KB

    4,888

    68020/25MHz

    0KB

    2,181

    Note: The effect of cache (and cache removal) on a small CPU-intensive application. Without a cache, the performance increase of the faster processor was negated.


    FIGURE 2
    Namespace Requirements For File Sharing Between Different Clients And Servers

    Processor @ Speed

    Cache Size
    L1/ L2

    Dhrystone

    Pentium @ 100MHz

    16KB Internal/no External

    118,519

    Pentium @ 100MHz

    16KB Internal/256KB External

    120,755

    Pentium @ 100MHz

    16KB Internal/512KB External

    118,519

    Note: The effect of cache size on a small CPU-intensive application in 1997. Dhrystone fits into the internal L1 cache, so increasing the L2 cache size does not affect performance.

    FIGURE 3
    Double-Precision Addition

    Order Of Matrices

    Total Bytes Of Data

    Cache Size
    L1/L2 (in KB)

    Performance
    (Mil. Ops/sec)

    5

    600

    16/0

    7.34

    5

    600

    16/256

    7.39

    5

    600

    16/512

    7.41

    25

    15,000

    16/0

    4.33

    25

    15,000

    16/256

    7.27

    25

    15,000

    16/512

    7.71

    100

    240,000

    16/0

    3.07

    100

    240,000

    16/256

    6.42

    100

    240,000

    16/512

    7.54

    500

    6,000,000

    16/0

    3.03

    500

    6,000,000

    16/256

    3.11

    500

    6,000,000

    16/512

    3.11

    1,000

    24,000,000

    16/0

    2.86

    1,000

    24,000,000

    16/256

    2.97

    1,000

    24,000,000

    16/512

    3.00

    Note: How the amount of data in the working set affects performance for various cache sizes. The test did an arithmetic addition on all elements in two matrices to generate the third A[I][J]=B[I][J]+C[I][J], and these tests were done in double precision thus using 8 bytes per matrix element. The performance was measured during 100,000,000 operations.


    FIGURE 4
    Single-Precision Addition

    Order Of Matrices

    Total Bytes Of Data

    Cache Size
    L1/L2 (in KB)

    Performance
    (Mil. Ops/sec)

    5

    300

    16/0

    7.36

    5

    300

    16/256

    7.37

    5

    300

    16/512

    7.41

    25

    7,500

    16/0

    7.95

    25

    7,500

    16/256

    8.54

    25

    7,500

    16/512

    8.41

    100

    120,000

    16/0

    4.60

    100

    120,000

    16/256

    8.49

    100

    120,000

    16/512

    9.11

    500

    3,000,000

    16/0

    4.64

    500

    3,000,000

    16/256

    4.77

    500

    3,000,000

    16/512

    4.77

    1,000

    12,000,000

    16/0

    4.33

    1,000

    12,000,000

    16/256

    4.49

    1,000

    12,000,000

    16/512

    4.56

    Note: How the amount of data in the working set affects performance for various cache sizes. The test did an arithmetic addition on all elements in two matrices to generate the third A[I][J]=B[I][J]+C[I][J], and tests were done in double precision thus using 8 bytes per matrix element. The performance was measured during 100,000,000 operations.


    FIGURE 5
    Pentium Acting As PC File Server

    Cache Speed
    L1/ L2 (KB)

    Maximum
    Performance

    Number Of Clients At
    Maximum Perfomance

    16/0

    4.3MB/sec

    4

    16/256

    5.9MB/sec

    6

    16/512

    7.4MB/sec

    8

    Note: How cache size affects the system throughput of a file server. Throughput is measured as total bytes/sec transferred between the file server and Windows 95 clients via 100MHz Ethernet using the NetBIOS or NetBEUI protocols.


    This article first appeared in the May 1997 issue of UNIX Review.

       

    Home | Top | Editor's Choice


    [Sys Admin Rocks!]
    Copyright © 2000 UnixReview.com ,UnixReview.com's Privacy Policy,
    Comments: rreames@cmp.com
    SDMG Web Sites: C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, Sys Admin, SD Expo, SD Magazine, UnixReview.com, Windows Developer's Journal