
Barton: 512 KB Athlon XP Reviewed
By Johan De Gelas
Monday, February 10, 2003 12:09 AM EST

The upcoming Opteron and Athlon 64 are constantly in the limelight of the hardware community. No other AMD processor has created so much hype, high hopes, and discussion. In the shadow of its big brother is "Barton," the first AMD processor with 512 KB of L2-cache integrated in the die.

This exclusive 512 KB L2-cache works together with the 128 KB L1-cache (64 KB data, 64 KB instruction) to form one impressive 640 KB on-die cache. According to AMD, the extra 256 KB of cache boosts a 2170 MHz Athlon XP from a 2700+ level to a 3000+ one. The 54.3 million transistor 2.17 GHz Barton Athlon XP will thus take on the mighty 55 million transistor 3.06 GHz Pentium 4 with Hyperthreading. Will 256 KB of extra cache and a clock rate of 2.17 GHz be enough to compete with the fastest Intel CPU available today? We'll find out in a moment. But before we look at the benchmarks, I'd like to discuss the different L2-caches, as caches are extremely important for modern CPUs.

A 512 KB L2 for the Athlon

L2-cache has often been a drag on the performance of AMD's processors. The K6 was a sixth generation architecture, but it came with a fifth generation off-die L2-cache running at only 66 or 100 MHz. The L2-cache of the K6-III was pretty impressive, but the clock frequency of the K6-III did not scale past 450 MHz. The Athlon was a very impressive seventh generation architecture, but it was launched with a sixth generation L2-cache system.

In contrast, the L2-cache made the Intel processors really shine. The PII had a 512 KB half-speed, back-side bus cache, which gave Intel's CPU a considerable advantage over competitors like the Cyrix MII and AMD K6. The most important reason why the Coppermine Pentium III could somewhat keep up with the more advanced Athlon was its low latency, high bandwidth cache. That cache was extremely impressive for its time: the 256 KB cache was not only accessed via a 256-bit data path, but it could also respond in an amazing 4 clock cycles (total L2-cache latency was 7 cycles).

Back to today: as the Pentium 4 was built to reach very high clock speeds, a 4 cycle L2-cache latency was not possible. The L2-cache of the Pentium 4 is still pretty impressive, though, as you can see below. ScienceMark 2.0 tells us what Intel's engineers have been capable of. We tested with the 3.06 GHz Pentium 4. The most accurate numbers are the 32 byte to 256 byte step numbers (columns) in the rows between the 32768 byte and 131072 byte dataflows, as we are sure that these measurements happen in the L2-cache.

A latency of 8 cycles (10 including the latency of the L1-cache) to 17 (total latency of 19) cycles is still very impressive for a CPU that runs at 3 GHz. Eight cycles at 3.06 GHz take about 2.6 ns, faster than the Pentium III's L1-cache has ever been! Let us take a look at the bandwidth of the L2-cache.
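The cycle counts above can be turned into wall-clock time with a one-line conversion. The following is a minimal sketch, not part of the original test suite; the helper name `cycles_to_ns` is made up for illustration, and the clock is assumed to be the 3.06 GHz of the tested Pentium 4.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so N cycles last N/f ns.
    """
    return cycles / clock_ghz


# L2-only and total (L1 + L2) latencies quoted in the text,
# at the assumed 3.06 GHz clock of the tested Pentium 4.
for cycles in (8, 10, 17, 19):
    print(f"{cycles:2d} cycles @ 3.06 GHz = {cycles_to_ns(cycles, 3.06):.2f} ns")
```

The same helper works for any other clock, which makes it easy to compare latencies of CPUs running at very different frequencies in absolute time rather than in cycles.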

Although 19 GB/s is nowhere near the theoretical 96 GB/s (3 GHz x 32 bytes per cycle), the Pentium 4 has a very fast L2-cache. One of the reasons for this big gap between theory and practice is the fact that only SSE(-2) instructions can move more than 8 bytes per cycle. And it is very unlikely that the Pentium 4 can sustain those 128-bit instructions at a rate higher than 1 per cycle.
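The arithmetic behind these figures is simply clock rate times bytes moved per cycle. Here is a rough sketch of that calculation; the function name `peak_bandwidth_gb_s` is hypothetical, the 3 GHz clock is the rounded figure used in the text, and the 128-bit case assumes exactly one SSE load per cycle, which is the upper bound the paragraph suggests.

```python
def peak_bandwidth_gb_s(clock_ghz: float, bits_per_cycle: int) -> float:
    """Theoretical peak bandwidth in GB/s: clock (GHz) x bytes per cycle."""
    return clock_ghz * bits_per_cycle / 8


# 256-bit (32 byte) L2 data path at the rounded 3 GHz clock.
l2_peak = peak_bandwidth_gb_s(3.0, 256)

# Assumption: at most one 128-bit SSE load per cycle.
sse_ceiling = peak_bandwidth_gb_s(3.0, 128)

measured = 19.0  # GB/s, the measured figure quoted in the text

print(f"L2 theoretical peak : {l2_peak:.0f} GB/s")
print(f"One SSE load/cycle  : {sse_ceiling:.0f} GB/s")
print(f"Measured / peak     : {measured / l2_peak:.0%}")
```

Under that one-load-per-cycle assumption, the practical ceiling is already half the theoretical peak before any other pipeline limits come into play.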

Let us see how important cache is for performance. When the Pentium 4 was upgraded to a 512 KB L2 cache instead of a 256 KB one, performance was between 6% and 61% higher. The 61% higher performance in 3DSMax may surprise you, but it can be explained. The tiny 8 KB data cache can be accessed in 2 cycles by the integer units, but only in 6 cycles by the FPU/SSE-2 units of the Pentium 4. As the datacache is so small and relatively slow to access, the L2-cache is of the utmost importance to the Pentium 4 when crunching through FPU intensive apps. That is also the reason why integer intensive applications see a smaller boost.

Modern games, which also tend to be FPU intensive, showed an impressive 15 to 17% boost thanks to the larger L2-cache. Only the older games (like Unreal Tournament) did not perform much better, as their critical loops were satisfied with 256 KB.

Now let us see what the new AMD has in store. AMD has finally caught up to the Pentium 4, and has even more cache on board than the fastest CPU of Santa Clara. I'd like to point out again what a marvelously efficient architecture the Athlon is: even Barton with 640 KB cache onboard is only 101 mm², which is still a lot smaller than Intel's 130 mm² Northwood. Of course, the slightly larger die size is no problem for Intel, given its huge fab capacity and 300 mm wafers.

Back to Barton, though. How good is Barton's cache? Well, latency is identical to the cache of Thoroughbred, the other 130 nanometer Athlon XP. Take a look below...

The L2-cache seems to have a latency between 15 (+L1-cache latency = 18) and 21 (24) cycles. The 24 cycles are a bit odd, as AMD's technical documentation talks about a (total) latency between 11 and 20 cycles and other cache programs (cachemem) confirmed the maximum of 20 cycles. Nevertheless, the important point is that the total L2-cache latency of the Athlon is higher than the Pentium 4's. What about bandwidth?

The 64-bit 2.17 GHz L2-cache offers up to 5.5 GB/s to the CPU core, between 3 and 4 times less than the 3 GHz Pentium 4. However, you should not immediately conclude that the Athlon's L2-cache is very slow and hampers its performance. Contrary to the Pentium 4, the L1-cache will deliver a lot of the bandwidth needed. Just imagine an FPU intensive application that runs 85% in the L1-cache and 15% in the L2-cache (ignoring the memory subsystem for the moment). As the Pentium 4 only searches and uses its L2-cache, it will have a 19 GB/s pipe to its FPU pipeline. The Athlon will have a 17 GB/s pipe to the FPU unit (0.85 x 19 GB/s + 0.15 x 5.5 GB/s). In most applications, especially the FPU intensive ones, the Athlon needs its L2-cache much less than the Pentium 4.
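The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the simple weighted average used in the text, not a rigorous bandwidth model; the function name `blended_bandwidth` is made up, and the 19 GB/s L1 figure for the Athlon is the value taken from the formula in the paragraph, not a separate measurement.

```python
def blended_bandwidth(l1_fraction: float, l1_gb_s: float, l2_gb_s: float) -> float:
    """Weighted average of L1 and L2 bandwidth, as in the text's estimate.

    l1_fraction is the share of accesses served by the L1-cache;
    the remainder is assumed to be served by the L2-cache.
    """
    return l1_fraction * l1_gb_s + (1.0 - l1_fraction) * l2_gb_s


P4_L2_GB_S = 19.0      # measured Pentium 4 L2 bandwidth from the text
ATHLON_L2_GB_S = 5.5   # measured Athlon L2 bandwidth from the text
ATHLON_L1_GB_S = 19.0  # assumption: value used in the text's formula

athlon = blended_bandwidth(0.85, ATHLON_L1_GB_S, ATHLON_L2_GB_S)

print(f"Raw L2 gap (P4 / Athlon): {P4_L2_GB_S / ATHLON_L2_GB_S:.2f}x")
print(f"Athlon blended estimate : {athlon:.1f} GB/s")
```

Changing `l1_fraction` shows how quickly the estimate drops once a workload spills out of the 64 KB L1 data cache, which is exactly the case where a larger L2-cache starts to matter.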

Therefore, we can already say that the performance increase from Thoroughbred (384 KB cache) to Barton (640 KB) will be much less than what we have witnessed with the transition from Willamette (256 KB) to Northwood (512 KB) for the following reasons:

  • The latency of the Athlon's L2-cache is higher
  • The Pentium 4's L1 data cache is very small for integer applications, and effectively non-existent for FP applications
  • The Athlon, in contrast, relies heavily on its huge 64 KB L1 data cache in all applications
  • Barton has 67% more cache than Thoroughbred, Northwood had 100% more cache than Willamette
So we cannot expect too much of Barton's L2-cache increase...
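The relative cache growth in the last point is easy to verify. A small sketch follows, with a hypothetical helper name `relative_increase_pct`; the sizes are the ones quoted above, where the Athlon totals count L1 plus exclusive L2 and the Pentium 4 figures count L2 only.

```python
def relative_increase_pct(old_kb: int, new_kb: int) -> float:
    """Percentage growth from old_kb to new_kb."""
    return (new_kb - old_kb) / old_kb * 100.0


# Cache sizes in KB as quoted in the text.
transitions = {
    "Thoroughbred -> Barton": (384, 640),
    "Willamette -> Northwood": (256, 512),
}

for name, (old_kb, new_kb) in transitions.items():
    print(f"{name}: +{relative_increase_pct(old_kb, new_kb):.0f}%")
```

Both transitions add the same 256 KB, but because the Athlon starts from a larger total, the relative gain is smaller.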

Overview

Before we begin, let's take a quick look at what's covered in this review:


Now, let's first take a look at the various Athlon core revisions, including the newest one, "Barton."

The Athlon Family of CPUs

All Content is Copyright (C) 1998-2003 Ace's Hardware. All Rights Reserved.