
The AMD K8 Architecture



Introduction

The Athlon XP is getting old. It still provides a very good level of performance, but it can hardly face the latest Pentium 4. The K7 CPU family will soon reach its end, and the K8 will have the hard job of succeeding it, mainly through the Athlon 64. Fortunately, the Athlon 64 is becoming affordable, and the introduction of the reasonably priced 3000+ version opens a real debate over the choice of a high-end CPU.
In this review we'll take a deep look at the AMD K8 architecture through a study of the K8 core, and point out the differences from the K7 family; we'll also explain the K8 cache architecture and compare it to Intel's.
Finally, we'll look into AMD64, the new 64-bit mode introduced with the K8. What does it consist of, what can we expect from this technology, and is it useful now? Rather than the usual benchmarks, you'll find in this review a comparison between the K8 and itself, using two different platforms.

Study of the K8

Successor of the Athlon XP

Intel's P6 core is known as an example of longevity, as it is still alive today in the Pentium M CPU. The K7 core of the Athlon also has a long history, from the first 500MHz Slot A Athlon to the latest Barton.
However, the K7 core did not evolve much over its history: the only changes concern the L2 cache and, from the Athlon XP onwards, support for the SSE instruction set. The explanation is that the Athlon core was equipped from the start with very powerful features that made it very efficient as soon as it was released. We can mention:

  • 3 x86 decoding units
  • 3 integer units (ALU)
  • 3 floating point units (FPU)
  • A 128KB L1 cache

Unlike Intel, AMD designed the K7 architecture for efficiency, measured in IPC (Instructions Per Cycle). The K7 units can handle up to 9 instructions per clock cycle, and one of the keys to this efficiency is the number of units the core includes, especially the three FPUs that make the K7 the most powerful CPU for floating point computation. Today, the increasing use of SIMD floating point instruction sets like SSE tends to favor the Pentium 4, whose architecture was optimized for SIMD computation.

The K8 core design was largely inspired by the K7. The differences between the two cores come down to a few changes and improvements. Unusually, the change of CPU generation is not due to a major evolution of the core, but to an evolution in the way the CPU operates, namely the move to 64 bits. The K8 promises more speed, but a large part of that gain has to come from software.
We'll come back to this point later in this review, but first let's take our magnifying glass and look at what's inside the K8 core.

The K8 core

A glance at this diagram ...



... confirms that the K8 core is very close to the K7, as we can see the same units. However, one of the differences between the two cores concerns the subdivision of the pipeline. The K7 pipeline has a very specific design: it is split in two parts, one for integer computation and one for floating point. The "Fetch/Decode" stage, which consists of the first 6 steps of the pipeline, is common to both instruction types. The "Execution" stages differ: 4 steps for the integer pipeline (10 steps in total) and 9 steps for the floating point one (15 steps in total).
This way, integer computations are not slowed down by an overly long pipeline, and floating point computation does not prevent the CPU from reaching high clock speeds.

The K8 also uses a split pipeline, but two steps are added to each branch. The integer pipeline is now 12 steps long and the floating point one 17, as shown in the following table:



The K7 and the K8 pipelines

The "Fetch/Decode" stage was redesigned on the K8, and it is now 7 steps long. It includes an additional step that packs some instructions together before sending them to the next stage. This packing optimizes the dispatch to the execution units.

Let's see in detail the path followed by instructions through the K8 pipeline:

  • The fetch stage is able to feed the 3 decoders with 16 instruction bytes each clock cycle. The fetching process uses the L1 code cache and the branch prediction logic.

  • The decoders convert the x86 instructions into fixed-length micro-operations (µOPs). Three µOPs can be generated each clock cycle.
    The "simple" instructions, which decode into one or two µOPs, are decoded in hardware. The generated µOPs are then packed together and dispatched to the execution units. This path is called the FastPath.
    The complex instructions, which decode into more than two µOPs, are decoded using the internal ROM, which takes more time. These instructions are said to be microcoded.

    Compared to the K7, more instructions use the FastPath in the K8 core, especially SSE instructions. AMD claims that the number of microcoded instructions decreased by 8% for integer and 28% for floating point instructions.

  • The µOPs are then dispatched to the units. The K8 core includes:

    • Three address generation units (AGU)
    • Three integer units (ALU). These units can complete most operations in one cycle, in both 32 and 64 bits: addition, rotation, shift, logical operations (and, or). Integer multiplication has a latency of 3 cycles in 32 bits and 5 cycles in 64 bits.
    • Three floating point units (FPU), which handle x87, MMX, 3DNow!, SSE and SSE2.

  • The last stage of the pipeline is the Load/Store stage. This stage uses the L1 data cache. The L1 is dual-ported, which means it can handle two 64-bit reads or writes each clock cycle. We'll see the effect of this feature in the bandwidth tests.

The improvements of the K8 core solve some problems of the K7 core, especially SSE performance. But the most noticeable features of the K8 lie outside the core, as we'll see now.

The K8 innovations

In addition to the core improvements, the K8 introduces two major innovations by integrating some chipset features into the CPU: the DDR memory controller and the HyperTransport bus interface.

An integrated memory controller

The inclusion of the memory controller in the CPU represents a major change in the relationship between the motherboard components, as memory control used to be handled by the north bridge of the chipset.
The following diagram shows the classic relationship between a CPU and the memory controller. This example could be a 200MHz FSB Pentium 4 working with a synchronous memory bus.

The clock generator provides a 200MHz clock to the north bridge: this is the FSB. The bus between the north bridge and the CPU is 64 bits wide and clocked at 200MHz, but four 64-bit packets are sent every clock cycle. It behaves as if the bus were 64 bits wide at 4x200MHz, which is why the bus speed is often reported as 800MHz. The memory bus (which links the memory and the controller) is also clocked at 200MHz and is 64 or 128 bits wide (single or dual channel). As it is DDR memory, two 64 or 128-bit packets are sent every clock cycle.
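
The peak figures implied by this description can be checked with a little arithmetic. The sketch below is only an illustration: the function name `peak_bandwidth_mb_s` is ours, and 1MB is taken as 10^6 bytes.

```python
def peak_bandwidth_mb_s(clock_mhz, width_bits, transfers_per_clock):
    """Peak bandwidth in MB/s (1MB = 10**6 bytes) of a parallel bus."""
    return clock_mhz * transfers_per_clock * width_bits // 8

# Quad-pumped FSB: 200MHz, 64 bits wide, 4 transfers per clock ("800MHz")
fsb = peak_bandwidth_mb_s(200, 64, 4)

# DDR memory at 200MHz: 2 transfers per clock, single or dual channel
ddr_single = peak_bandwidth_mb_s(200, 64, 2)
ddr_dual = peak_bandwidth_mb_s(200, 128, 2)

print(fsb, ddr_single, ddr_dual)
```

With these numbers, only the dual channel configuration gives the memory bus the same peak bandwidth as the FSB.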

The K8 way is quite different.

The clock generator still drives the north bridge, and provides the reference frequency for the HyperTransport link between the north bridge and the CPU. The HyperTransport frequency can therefore be considered as the FSB, because the CPU uses this frequency to generate its own internal clock through an internal multiplier.
As shown on the diagram, the memory controller runs at the same speed as the CPU. Memory requests are consequently sent at CPU speed, on a bus that is 64 or 128 bits wide depending on the number of memory channels. We can see there is no longer any link between the clock generator and the memory. The memory clock is instead obtained from the CPU clock, divided by a factor that depends on the memory specifications. The table below shows the dividers used according to the CPU frequency and the requested memory clock.

Of course, the integrated memory controller does not improve the memory bandwidth, but it drastically reduces the request time. The measured latency is very low, as we'll see further on. Moreover, unlike an external memory controller, the integrated controller of the K8 gets faster as the CPU speed increases; consequently, so does the speed of the requests.
The integrated controller is of particular interest for multi-CPU systems: in this case, the addressable memory size and the total bandwidth increase with the number of CPUs.
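
To make the divider mechanism described above more concrete, here is a minimal sketch. It rests on an assumption of ours that the text does not state: the divider is a whole number, rounded up so that the memory never runs faster than the requested clock. The function name `k8_memory_clock` is hypothetical.

```python
import math
from fractions import Fraction

def k8_memory_clock(cpu_mhz, requested_mem_mhz):
    """Return (divider, actual memory clock in MHz).

    Assumption: the divider is an integer, rounded up so that the memory
    clock never exceeds the requested one.
    """
    divider = math.ceil(Fraction(cpu_mhz) / Fraction(requested_mem_mhz))
    return divider, float(Fraction(cpu_mhz) / divider)

# DDR400 (200MHz) behind a 1,8GHz CPU: the division is exact
print(k8_memory_clock(1800, 200))

# DDR333 (500/3 MHz) behind a 2,2GHz CPU: the memory ends up below 166MHz
print(k8_memory_clock(2200, Fraction(500, 3)))
```

Under this assumption, some CPU clock and memory combinations leave the memory running slightly below its rated speed.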

The problem with integrating the memory controller is the lack of flexibility. The controller is dedicated to one memory technology, and every change of memory standard will require a change in the CPU design. Of course, this does not occur very often (at least not as often as a change of CPU family), but it could drastically increase the cost of the CPU, as its manufacturing process needs to be changed.

The HyperTransport technology

HyperTransport is a link protocol between the CPU and peripherals, and between the CPUs themselves in a multiprocessor system. It allows low-latency exchanges, which makes it very well suited to communication between CPUs. It uses a 16-bit wide bus at 800MHz with double data rate signalling, which allows it to reach a peak bandwidth of 3,2GB/s.
HyperTransport is planned to be used with the new PCI-X devices.

The Cool'n'Quiet feature

Cool'n'Quiet is not really a technical improvement, just something that improves the comfort of a K8 system. CnQ is nothing other than the PowerNow! technology of the mobile Athlon, applied to desktop CPUs. It lowers or raises the CPU clock and core voltage according to the CPU load. This reduces the power consumption of the PC, the fan speed (and noise) and the temperatures, and improves the CPU lifespan. The switch between states is so fast that the performance loss is insignificant. A very useful feature, which deserved to be mentioned!

Not all K8 CPUs support Cool'n'Quiet: the Opteron does not, the Athlon FX and Athlon 64 do. In order to work properly, this feature must be enabled in the BIOS (which allows the FID/VID changes), and a driver must be installed so that the changes follow the load under Windows.

The evolution of the K8

Most criticism of the Athlon XP concerns its difficulty in reaching high clock speeds, and the pipeline of the K8 may not change this situation very much. Of course, the efficiency of the K7 and K8 cores lies in the shallow depth of their pipelines, but the K8 may hit the MHz barrier very soon, and raising the clock speed is the usual way to make a CPU family progress.
AMD found a solution, however: the evolution of the K8 won't rely on clock speed only, but also on other features like the L2 cache size and the memory bus width. All these parameters are combined to produce a performance index that appears in the CPU name.

This method has a drawback, as it introduces some confusion in the CPU designation. For example, the "Athlon 64 3400+" refers to three different CPUs:

  • 2,2GHz, 1MB L2 cache, 64 bits wide memory bus
  • 2,4GHz, 512KB L2 cache, 64 bits wide memory bus
  • 2,2GHz, 512KB L2 cache, 128 bits wide memory bus

This kind of designation is far from clear, even if AMD claims that these three models provide the same performance.



    The K8 caches

    The L1 cache

    CPU               K8                Athlon XP         Pentium 4 Northwood   Pentium 4 Prescott
    Size              code : 64KB       code : 64KB       TC : 12Kµops          TC : 12Kµops
                      data : 64KB       data : 64KB       data : 8KB            data : 16KB
    Associativity     code : 2 way      code : 2 way      TC : 8 way            TC : 8 way
                      data : 2 way      data : 2 way      data : 4 way          data : 8 way
    Cache line size   code : 64 bytes   code : 64 bytes   TC : n.a              TC : n.a
                      data : 64 bytes   data : 64 bytes   data : 64 bytes       data : 64 bytes
    Write policy      Write Back        Write Back        Write Through         Write Through
    Latency           3 cycles          3 cycles          2 cycles              4 cycles
    (given by the manufacturer)

    The L1 code and L1 data caches of the K8 are very similar to those of the K7. This seems logical given the similarities between the cores of these two CPUs. This large cache is very efficient, as the K7 showed in the past. It uses 2-way set associativity, which results in an organization in two 32KB blocks. The size of these blocks allows them to hold a large range of data or code from the same memory area, but the low associativity tends to create conflicts when lines are cached.

    The L2 cache

    CPU               K8                  Athlon XP       Pentium 4 Northwood   Pentium 4 Prescott
    Size              512KB (NewCastle)   256 and 512KB   512KB                 1024KB
                      1024KB (Hammer)
    Associativity     16 way              16 way          8 way                 8 way
    Cache line size   64 bytes            64 bytes        64 bytes              64 bytes
    Latency           ?                   8 cycles        7 cycles              11 cycles
    (given by manufacturer)
    Bus width         128 bits            64 bits         256 bits              256 bits
    L1 relationship   exclusive           exclusive       inclusive             inclusive



    Once again, the L2 cache of the K8 shares a lot of features with that of the K7. Both use 16-way set associativity, which partially compensates for the low associativity of the L1.
    The width of the bus between the core and the L2 cache increases from 64 bits on the K7 to 128 bits on the K8. On the K7, this bus was sized according to the specifications of the first Athlon with its discrete cache, and this choice began to show its limits with the later on-chip full-speed caches. The increase to 128 bits should improve the L2 bandwidth; we'll check this in the bandwidth tests.
    The K8 also includes hardware prefetch logic, which fetches data from memory into the L2 cache while the memory bus is idle.

    K7 and K8 use an exclusive relationship between L1 and L2, in contrast to Intel, which uses an inclusive relationship. This choice has a lot of consequences for the overall cache architecture, which is why we'll now explain what these relationships consist of and what influence they have on performance.

    Inclusive and exclusive caches

    In order to understand the way a cache works, let's consider a CPU that has one cache level. When a read request occurs, the CPU asks its cache for the requested data. If the cache does not contain the data, the CPU gets it from memory and at the same time copies it into its cache. Why? Because the CPU assumes that if it needed this data once, it may need it again soon, and statistically this is quite likely. An x86 CPU contains a small number of registers, and a value that has just been loaded from memory into a register won't stay there for more than a few clock cycles, because the register will quickly be needed for another instruction. Storing the data in the cache is a way to keep it close at hand.

    With one cache level, a read request from the CPU has two possible outcomes:

    • If the data is in the cache, there is a cache success (a cache hit). This is obviously the most favorable case.
    • If the data is not in the cache, there is a cache miss. The next step then consists in getting the data from memory and copying it into the cache. This is the caching process, or cache fill. At this point, two cases may occur, depending on whether the cache is already full or not. If it is not full, a new cache line is filled.

      Figure 1 : cache fill

      The situation becomes more complicated if the cache is full. The cache fill then needs an existing line to be replaced. In order to know which line must be replaced, the CPU uses a replacement algorithm. The most common choice consists in replacing the line that was least recently used: this is the LRU algorithm.

      Figure 2 : Eviction of a cache line

    As the animation clearly shows, the evicted cache line is simply lost. The first aim of a second level cache is to catch this line instead of deleting it. In other words, it plays the role of a recycle bin.
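
The single-level behaviour described above can be sketched in a few lines. This is a simplified model of our own, not the logic of any real CPU: the cache is treated as fully associative, addresses stand for whole cache lines, and the class name `LRUCache` is hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """A fully associative cache of `capacity` lines with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # least recently used line comes first

    def read(self, address):
        """Return ('hit' or 'miss', evicted address or None)."""
        if address in self.lines:
            self.lines.move_to_end(address)  # now the most recently used
            return "hit", None
        evicted = None
        if len(self.lines) >= self.capacity:
            # The cache is full: the least recently used line is lost
            evicted, _ = self.lines.popitem(last=False)
        self.lines[address] = True  # cache fill
        return "miss", evicted

cache = LRUCache(capacity=2)
trace = [cache.read(a) for a in ("A", "B", "A", "C", "B")]
print(trace)
```

In this trace, reading C evicts B (A was used more recently), so the following read of B misses again and evicts A in turn.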

    The addition of a 2nd level cache creates new possible states when a read request occurs :

    • the data is in the L1 : L1 success
    • the data is not in the L1 but is in the L2 : L1 miss, L2 success
    • the data is not in the L1 and not in the L2 : L1 and L2 misses

    Let's now see how this works. As long as the L1 is not full, the caching phase is the same as in the single-level configuration:

    Figure 3 : L1 cache fill

    As soon as the L1 is full, the L2 plays an active role: when a line is evicted from the L1, it is copied into the L2, and a new line coming from memory is copied into the freed line:

    Figure 4 : L2 fill

    From this moment, the L2 contains data and is able to answer a read request. If the requested data is not in the L1 but is in the L2, the line is once again copied into the L1. Why not leave it in the L2 only? For the same reason as before: the CPU may need it again. So, a line must be freed in the L1 to receive the data from the L2. The LRU algorithm selects the candidate line, which is copied into the L2, and the requested line is copied back from the L2 into the L1.

    Figure 5 : L1 miss, L2 success

    This way, a cache line never exists in the L1 and in the L2 at the same time. The L1 and the L2 do not contain the same data, and each piece of data lives exclusively in one cache level. This is the exclusive relationship.
    The total cache size is consequently the sum of the sizes of the two cache levels. And this method works whatever the sizes of the L1 and the L2: the L2 can even be smaller than the L1.
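
The exclusive scheme described above can be modelled as follows. This is a simplified sketch under our own assumptions (fully associative levels, LRU replacement in both, no victim buffer); the class name `ExclusiveCache` is hypothetical.

```python
from collections import OrderedDict

class ExclusiveCache:
    """Two cache levels where a line lives in the L1 or in the L2, never both."""

    def __init__(self, l1_lines, l2_lines):
        self.l1_lines, self.l2_lines = l1_lines, l2_lines
        self.l1, self.l2 = OrderedDict(), OrderedDict()

    def _fill_l1(self, address):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # LRU line of the L1
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)          # this one is really lost
            self.l2[victim] = True                   # evicted line goes to the L2
        self.l1[address] = True

    def read(self, address):
        if address in self.l1:
            self.l1.move_to_end(address)
            return "L1 hit"
        if address in self.l2:
            del self.l2[address]     # the line leaves the L2 ...
            self._fill_l1(address)   # ... and moves back into the L1
            return "L2 hit"
        self._fill_l1(address)       # line from memory goes to the L1 only
        return "miss"

cache = ExclusiveCache(l1_lines=2, l2_lines=2)
results = [cache.read(a) for a in "ABCDA"]
print(results, list(cache.l1), list(cache.l2))
```

After the four misses, the two levels hold four distinct lines between them, so the last read of A is served by the L2.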

    The exclusive relationship allows a lot of flexibility, but has a drawback in terms of performance. When an L2 success occurs, a line from the L1 must be copied to the L2 before the data can be fetched from the L2. This additional step costs a lot of clock cycles, and increases the total time needed to get the data from the L2.
    In order to speed up the process, exclusive caches very often use a victim buffer (VB), a very small and fast memory between the L1 and the L2. The line evicted from the L1 is copied into the VB rather than into the L2. At the same time, the L2 read request is started, so that the L1 to VB write is hidden by the L2 latency. Then, if by chance the next requested data is in the VB, getting it back from there is much quicker than getting it from the L2.

    The VB is a good improvement of the exclusive relationship, but it is very limited by its small size (generally between 8 and 16 cache lines). Moreover, when the VB is full, it must be flushed into the L2, which is an additional step and costs some extra cycles.

    In fact, to avoid this additional write to the L2 when an L2 success occurs, the write should have been done beforehand. How can that be? Well, this line comes from the L1, so it was written to the L1 at some earlier point. If the line is copied into the L2 at that same moment, it will already be in the L2!

    In this configuration, data fetched from memory is copied into both the L1 and the L2. The caching step therefore needs two writes instead of one.

    Figure 6 : Caching

    Once the L1 is full and the requested data is in neither the L1 nor the L2, a new line is copied into both levels. This evicts a line from the L1, but there is no need to save it to the L2 because it is already there. So, the total number of writes is the same as in the previous configuration.
    From this point on, the L2 cache contains data that is not in the L1.

    Figure 7 : L1 and L2 miss

    If the requested data is not in the L1 but is found in the L2 (L2 success), the only operation needed is to copy a line from the L2 to the L1: one write instead of two in the previous configuration.

    Figure 8 : L1 miss, L2 success

    In this configuration, all the lines of the L1 are duplicated in the L2; in other words, an image of the L1 is included in the L2. This is the inclusive relationship.
    An inclusive cache avoids one write in case of L2 success, which makes it faster than an exclusive cache for this step. In practice, an inclusive L2 cache is faster than an exclusive one. On the other hand, the duplication of the L1 in the L2 reduces the "useful" size of the L2 cache by the size of the L1. That means:

    • the L2 size must be greater than the L1 size, and the efficiency of the L2 depends on this size difference.
    • the total "useful" cache size is : L1 size + L2 size - L1 size, that is to say : L2 size.
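
The two size rules can be written down directly. In this small sketch the function name `useful_cache_kb` is ours, and the sizes are the data L1 and L2 sizes quoted in the tables earlier in this review.

```python
def useful_cache_kb(l1_data_kb, l2_kb, relationship):
    """Total "useful" data cache size in KB for a two-level cache."""
    if relationship == "exclusive":
        return l1_data_kb + l2_kb   # no line is stored twice
    if relationship == "inclusive":
        return l2_kb                # the L1 content is duplicated in the L2
    raise ValueError(f"unknown relationship: {relationship}")

print(useful_cache_kb(64, 1024, "exclusive"))  # K8 with a 1024KB L2
print(useful_cache_kb(64, 256, "exclusive"))   # Athlon XP with a 256KB L2
print(useful_cache_kb(8, 512, "inclusive"))    # Pentium 4 Northwood
```

This gives 1088KB, 320KB and 512KB respectively.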

    Advantages and drawbacks of each method

    This table summarizes the pros and cons of an exclusive cache:

    +                                                   -
    • No constraint on the L2 size.                     • L2 performance.
    • Total cache size is sum of the sub-level sizes.

    Given this, we can guess what an exclusive cache architecture should look like:

    • A large L1 cache. This is possible because there is no constraint on the L2 cache size. Moreover, a large L1 reduces the number of accesses to the L2.
    • A victim buffer to improve performance.

    AMD chose an exclusive relationship for the first time on the Thunderbird. The CPU architecture fits this choice, with a large L1 cache and an 8-entry victim buffer.
    This choice allowed AMD to build CPUs with an L2 cache size from 64 to 512KB around the same core, and even the Duron, which has a 64KB L2 cache, provides very good performance. On the other hand, increasing the L2 size does not provide a big jump in performance.

    In comparison, an inclusive cache provides :

    +                       -
    • L2 performance.       • Constraint on the L1/L2 size ratio
                            • Total cache size.

    This table is exactly the opposite of the exclusive one. Indeed, the only advantage of an inclusive cache lies in performance, but the improvement requires some conditions to be met.
    It is far from easy to say what an inclusive cache should look like. The constraint on the L1/L2 size ratio requires the L1 to be small, but a small L1 has a lower success rate, and consequently lower performance. On the other hand, if the L1 is too big, the ratio will be too large for the L2 to perform well. In a word: headache.

    Intel chose an inclusive relationship with the Pentium Pro, which was the first CPU to include an L2 cache in the CPU package. This choice was kept on the whole CPU line that followed the PPro. That's why no Intel CPU has a very large L1 cache. The biggest size was reached on the Pentium M, which includes a 64KB L1 and a 1MB L2.
    The Pentium 4 was introduced with a very small 8KB L1 data cache. This choice was made for two reasons. First, the Pentium 4 was the first CPU designed with an integrated full-speed L2 cache (except for the PPro; the Pentium II and III began with a discrete L2 cache), so the architecture was designed knowing that the small L1 could be backed by a large and fast L2. Second, a very small L1 can be very fast, and the Pentium 4 L1 cache has the lowest latency ever seen, at 2 clock cycles.
    The constraints of an inclusive L2 are hard to reconcile with commercial considerations: it is very hard to build a whole CPU line under such constraints. Intel released the Celeron P4 as a budget CPU, but its 128KB L2 cache ruins its performance. The Celeron P4 does not fit the constraints of the inclusive relationship, and the result is catastrophic. On the other hand, an inclusive relationship can be very efficient, as the Pentium M shows.

    Conclusion

    The choice of a cache architecture is a very important step in the design of a CPU, as it determines not only the performance but also how the line can evolve towards the low end and the high end.
    The exclusive relationship is the most flexible, as it allows a lot of different configurations while keeping a good level of performance. The drawback is that performance does not increase very much with the L2 size. The inclusive relationship can only be chosen for the sake of performance, knowing for example that increasing the L2 will give a performance boost. However, the constraints of this mode are very strict, and not respecting them can have the opposite result and ruin performance.



    The K8 cache bandwidth

    The bandwidth tests consist in measuring the time required to read a buffer. The buffer size increases from 1KB to 32MB. Three reading methods are benchmarked: integer (32-bit reads), MMX (64-bit reads) and SSE (128-bit reads). The results provide some information about the data caches, the load/store units, and the memory performance.

    The tests were run on an Opteron 244 (1,8GHz) and on an Athlon XP 2200+, also clocked at 1,8GHz. Let's first have a look at the 32-bit results.


    • The general shape of the Opteron curve shows the different cache levels: the 64KB L1 data cache and the 1024KB L2. The L2 step extends up to 1088KB, which is the sum of the two caches (1024+64). This is due to the exclusive relationship between the two caches, which makes the total cache size the sum of all levels. The same result is obtained on the Athlon XP, whose L2 step reaches 320KB (64+256).

    • The bandwidth of the L1 cache reaches 10000MB/s on the Athlon XP, which represents about 5,6 bytes per cycle, and 12500MB/s on the Opteron, namely about 7 bytes per clock cycle.
      Considering that the 32-bit test reads 4 bytes per instruction, this means that the K8 is able to process almost two of these operations per clock cycle. This is possible thanks to the load/store unit (LSU), which is able to process two 64-bit load or store operations per cycle. In order to feed the LSU, the L1 cache is dual-ported, which means it is also able to handle two 64-bit read or write operations each clock cycle. At 1,8GHz, the LSU/L1 pair is consequently able to provide a maximum bandwidth of 2 ops x 8 bytes x 1800MHz = 28800MB/s in 64 bits, or 2 ops x 4 bytes x 1800MHz = 14400MB/s in 32 bits.

      The Athlon XP provides 70% of the maximum bandwidth. The Opteron is more efficient and reaches 87%, thanks to the improvements of its pipeline, especially the better use of the units.

    • The L2 cache of the Athlon XP provides 5000MB/s and that of the Opteron 6530MB/s. The L2 cache of the two CPUs is not dual-ported, so it can only handle one operation per clock cycle. That means 1 op x 4 bytes x 1800MHz = 7200MB/s. Once again, the L2 of the Opteron provides a value very close to this theoretical maximum.
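
The theoretical maxima used in these comparisons all come from the same product: operations per cycle x bytes per operation x clock. The sketch below redoes the arithmetic; the function names are ours, and since the measured values quoted above are rounded, the percentages come out within a point of those given in the text.

```python
CLOCK_MHZ = 1800  # both test CPUs run at 1,8GHz

def peak_mb_s(ops_per_cycle, bytes_per_op, clock_mhz=CLOCK_MHZ):
    """Theoretical peak bandwidth in MB/s."""
    return ops_per_cycle * bytes_per_op * clock_mhz

def efficiency_percent(measured_mb_s, peak):
    """Share of the theoretical peak actually reached, in percent."""
    return 100 * measured_mb_s / peak

l1_peak_32 = peak_mb_s(2, 4)  # dual-ported L1, 32-bit reads
l1_peak_64 = peak_mb_s(2, 8)  # dual-ported L1, 64-bit reads
l2_peak_32 = peak_mb_s(1, 4)  # single-ported L2, 32-bit reads

print(l1_peak_32, l1_peak_64, l2_peak_32)
print(f"Athlon XP L1: {efficiency_percent(10000, l1_peak_32):.1f}%")
print(f"Opteron L1:   {efficiency_percent(12500, l1_peak_32):.1f}%")
print(f"Opteron L2:   {efficiency_percent(6530, l2_peak_32):.1f}%")
```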

    The next test uses MMX 64 bits reads.


    • As mentioned before, the theoretical maximum bandwidth of the L1 is 28800MB/s when using 64-bit reads. The Athlon XP offers 65% of this bandwidth with a raw value of 18800MB/s, and the Opteron reaches 79% of this theoretical value with 22700MB/s. These results confirm what we obtained for 32-bit reads.

    • The L2 of the Opteron provides a bandwidth of 8260MB/s, and that of the Athlon XP 5700MB/s. In both cases, the results are far from the maximum theoretical value of 1 op x 8 bytes x 1800MHz = 14400MB/s. With 64-bit reads, the L2 is saturated and delivers the maximum bandwidth it can provide. Compared to 32 bits, the Opteron increased its L2 bandwidth by 27%, whereas the Athlon XP increased by only 14%. This difference is due to the wider L2 bus of the Opteron, which allows the L2 to increase its bandwidth.

    The last test uses 128-bit reads through the SSE instruction set.


    Both CPUs obtain the same 14200MB/s bandwidth, which is below the 64-bit results. We see here a limitation of the LSU. It is able to handle two 64-bit loads or stores per clock cycle, but it can only handle one 128-bit read at a time, and this operation takes no less than 2 clock cycles! The maximum theoretical bandwidth is then 16 bytes x 1800MHz / 2 cycles = 14400MB/s. Both CPUs deliver a bandwidth very close to this maximum.
    We are far from the results obtained on a Pentium 4, which is able to perform one 128-bit read per clock cycle. This difference partially explains why the Pentium 4 is sometimes faster than the K7 (and K8) for SSE computation.

    The L2 bandwidths are the same as in 64 bits, which shows that both were already saturated with 64-bit reads.
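
For the 128-bit case, the throughput has to be divided by the two cycles each read takes. A small sketch of that arithmetic follows; the function name is ours, and the second call is a purely hypothetical comparison at the same 1,8GHz clock, not a measurement of any Pentium 4.

```python
def peak_mb_s(bytes_per_read, cycles_per_read, clock_mhz=1800):
    """Peak bandwidth in MB/s when one read completes every `cycles_per_read`."""
    return bytes_per_read * clock_mhz / cycles_per_read

k8_sse_peak = peak_mb_s(16, 2)         # one 128-bit read every 2 cycles
one_read_per_cycle = peak_mb_s(16, 1)  # hypothetical: one 128-bit read per cycle

print(k8_sse_peak, one_read_per_cycle)
print(f"measured / peak = {100 * 14200 / k8_sse_peak:.1f}%")
```

At equal clock speed, a unit completing one 128-bit read per cycle would have twice the peak L1 bandwidth.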



    Latencies

    Latency is the time between a read request and the moment the data is actually available. Applied to an on-chip cache, it is expressed in CPU clock cycles. This value measures the response speed of a cache level and, applied to memory, of the memory subsystem.
    All memories need some time to respond to a request, and in practice this time is hidden by the flow of instructions in the pipeline. The algorithm we use represents the worst case: it uses the data immediately after requesting it.
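
A common way to obtain such a worst case is a chain of dependent reads, where each value read is the address of the next read. The sketch below only illustrates that access pattern, with hypothetical function names; it is not the code of the test used here, and a real measurement would be written in C or assembly, since the overhead of the Python interpreter would hide the cache latency completely.

```python
def build_chain(size, stride):
    """Return a list where chain[i] is the index of the next element to read,
    `stride` elements further on, wrapping around at the end of the buffer."""
    chain = [0] * size
    for i in range(0, size, stride):
        chain[i] = (i + stride) % size
    return chain

def walk(chain, start, steps):
    """Follow the chain: each read needs the result of the previous one."""
    index = start
    for _ in range(steps):
        index = chain[index]  # the next address is the data just read
    return index

chain = build_chain(size=1024, stride=64)
end = walk(chain, start=0, steps=1024 // 64)
print(end)  # back at the start after one full pass over the buffer
```

Because no read can start before the previous one has returned its data, the pipeline cannot hide the latency, which is what the size and stride tables below are meant to expose.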

    Here are the results obtained on an Athlon XP 2200+ (1,8GHz).

    
     stride		4	8	16	32	64	128	256	512	
     size (Kb)
     1		3	3	3	3	3	3	3	3	
     2		3	3	3	3	3	3	3	3	
     4		3	3	3	3	3	3	3	3	
     8		3	3	3	3	3	3	3	3	
     16		3	3	3	3	3	3	3	3	
     32		3	3	3	3	3	3	3	3	
     64		3	3	3	3	3	3	3	3	
     128		4	6	9	18	20	20	20	20	
     256		4	6	9	18	20	20	20	20	
     512		27	48	96	195	360	359	358	359	
     1024		28	48	97	193	359	359	359	359	
     2048		27	49	98	194	360	361	361	362	
     4096		28	48	97	193	359	360	361	362	
     8192		28	49	98	193	360	361	360	363	
     16384		27	49	98	192	360	360	362	363	
     32768		27	48	97	193	360	362	359	364	
    
     2 cache levels detected
     Level 1	size = 64Kb	latency = 3 cycles
     Level 2	size = 256Kb	latency = 20 cycles (17 cycles for this only level)
    

    The L1 latency is the one announced by AMD: 3 cycles. The L2 latency, on the other hand, is 17 cycles, whereas AMD announces between 8 and 11 cycles. The difference is due to the victim buffer penalty: the latency algorithm makes the VB overflow, so it has to be flushed into the L2 cache. This is, in other words, the worst case that can happen on an exclusive cache architecture with a VB. Consequently, the 17 cycles include the 8 cycles needed for the VB flush, which gives a real L2 latency of 9 cycles.

    Let's see the results on an Opteron 244 (1,8GHz) :

    
     stride		4	8	16	32	64	128	256	512	
     size (Kb)
     1		3	3	3	3	3	3	3	3	
     2		3	3	3	3	3	3	3	3	
     4		3	3	3	3	3	3	3	3	
     8		3	3	3	3	3	3	3	3	
     16		3	3	3	3	3	3	3	3	
     32		3	3	3	3	3	3	3	3	
     64		3	3	3	3	3	3	3	3	
     128		4	6	8	16	16	16	16	16	
     256		4	6	8	16	16	16	16	16	
     512		4	6	8	16	16	16	16	16	
     1024		4	6	8	16	16	16	16	16	
     2048		10	17	34	61	112	113	116	121	
     4096		10	17	34	61	112	112	115	120	
     8192		10	17	34	61	112	113	115	119	
     16384		10	17	34	61	112	113	115	119	
     32768		10	17	34	61	112	113	115	119	
    
     2 cache levels detected
     Level 1	size = 64Kb	latency = 3 cycles
     Level 2	size = 1024Kb	latency = 16 cycles (13 cycles for this only level)
    

    The L1 latency remains the same. The L2 latency is 4 cycles lower, at 13 cycles. If we assume that the VB penalty is the same, this gives a real latency of 5 cycles for the L2 cache alone. A good improvement!

    This test also lets us compare memory latency, once the test exceeds the L2 size. The Opteron memory subsystem shows a latency of 120 cycles, whereas the Athlon XP shows 360 cycles, that is to say three times slower, since the clock speeds are the same. We can see here the effect of the integrated memory controller, which minimizes the request time to memory and thus drastically reduces the read latency.
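
The latency figures discussed above can be recomputed from the raw results. In this sketch the function names are ours, the 8-cycle victim buffer penalty is the value assumed in the text, and the conversion to nanoseconds is our own addition for a 1,8GHz clock.

```python
VB_FLUSH_CYCLES = 8  # victim buffer penalty assumed in the text

def l2_latencies(l1_cycles, measured_l2_cycles):
    """Return (L2-only latency, L2 latency without the VB flush), in cycles."""
    l2_only = measured_l2_cycles - l1_cycles
    return l2_only, l2_only - VB_FLUSH_CYCLES

def cycles_to_ns(cycles, clock_mhz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000 / clock_mhz

print(l2_latencies(3, 20))  # Athlon XP 2200+
print(l2_latencies(3, 16))  # Opteron 244
print(cycles_to_ns(360, 1800), cycles_to_ns(120, 1800))  # memory latency in ns
```

At 1,8GHz, 360 cycles correspond to 200ns and 120 cycles to about 67ns.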

    To conclude on the K8 cache architecture: while the L1 cache hardly changed from K7 to K8, the L2 shows a lot of improvements, with better latency and better bandwidth, which make up for many of the weaknesses of the Athlon XP.



    The AMD64 technology

    The real challenge of the K8 lies in the introduction of 64-bit technology, named x86-64 or AMD64. That's why the longest chapter of this review is dedicated to it.

    AMD64 sets out to repeat the challenge of the 386: changing the x86 architecture, nothing less. It is worth remembering how long it took for 32-bit technology to reach every system: 10 years after the 386 was introduced. Nowadays, systems are 50 to 100 times faster than a 386, but this is only due to the evolution of CPU architecture, because the same 386 code is still in use.
    The K8 promises a speed-up that will justify the change of CPU generation from the K7, but it gets there mostly through software. It is an alternative to the technological race that makes CPU architectures more and more complex, but a risky one. Indeed, the success of AMD64 depends on software, and first of all on operating systems. For once, a step in PC architecture won't be measured in number of transistors.

    AMD64 raises a lot of questions, the main ones being: what can 64-bit technology provide now? Do we really need it already? We'll try to answer these questions.

    The GPR

    Whether a CPU belongs to a 16, 32 or 64-bit generation depends on the size of its General Purpose Registers, or GPR. The GPR are the working registers of the CPU: they hold the operands of x86 instructions, and moreover they are the only registers that can be used to address memory. This means that the amount of memory a CPU can address directly depends on the GPR size.

    The first x86 CPU, the 8086, used eight 16-bit GPR: ax, bx, cx, dx, si, di, bp, sp. A 16-bit GPR can hold memory addresses between 0 and 65535, which represents 64KB of addressable memory. Of course this is not enough, even for the 8086. The trick consists in extending the addressable range by managing memory in blocks (segments), each of them 64KB in size. This method does not provide much flexibility. Readers who programmed for MS-DOS in real mode may remember that it was impossible to define arrays larger than 64KB.

    The introduction of the 386 removed this limitation. The 386 extended the GPR size to 32 bits, which made it the first IA32 CPU (Intel Architecture 32 bits) and allowed it to address up to 4GB of memory. To stay compatible with 8 and 16-bit programs, the GPR of the 386 can also be used as 8 or 16-bit registers. For a program to use the GPR as 32-bit registers and access 4GB of memory, the 386 has to be switched into a specific mode, named protected mode.
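
    To illustrate the 64KB block mechanism described above, here is a minimal sketch in C of how a real-mode address is formed from a 16-bit segment and a 16-bit offset. The function name real_mode_linear is ours, and the 16-byte segment granularity is the standard 8086 rule rather than something stated in this review.

```c
#include <stdint.h>

/* Hypothetical helper (the name is ours): linear address computed by an
   8086 in real mode. A segment starts on a 16-byte boundary and the
   16-bit offset can only reach 64KB past that start. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

    For instance, segment 0x1234 with offset 0x0010 designates the linear address 0x12350. Going beyond 64KB means changing the segment value, which is why a single array could not exceed that size.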

    The limitations of IA32

    Current x86 CPUs still follow the IA32 specifications, and this technology is beginning to show some limitations.

    The first limitation is a horizontal one, as it concerns the GPR size, and especially the memory addressable with 32-bit GPR. 4GB is still a large amount for personal computers today, but servers have already reached it. In practice, the 4GB barrier can be pushed back thanks to an extension mechanism, similar to the one used on the 8086. On Xeon CPUs for example, memory addresses are extended to 36 bits, which allows the use of 64GB, split into 4GB blocks. This method remains inconvenient, and performance suffers as soon as blocks are switched.

    The second limitation concerns the number of GPR; it is a vertical limitation. The number of GPR has not changed since the x86 was introduced, whereas non-x86 CPUs have more and more of them (the Itanium has 128 integer GPR). Of course, the x86 has been improved, especially with extended instruction sets that brought new registers: eight 80-bit registers for x87, MMX and 3DNow!, and eight 128-bit registers for SSE 1, 2 & 3. But these are not GPR, only calculation registers. In particular, they cannot be used to address memory.

    The small number of GPR has a direct effect on x86 code efficiency. In fact, the limitation is felt not in the CPU but in the compilers. Indeed, modern x86 CPUs use more than eight physical registers internally, which lets them keep several versions of the same GPR. This is called register renaming. But however efficient it may be, a CPU only runs the code provided by the compiler, and the compiler can only use eight GPR.
    CPUs have evolved, but so have programs. They keep getting bigger and more complex, and have little in common with the programs written for the 8086. Compilers have more work to do, but the same tools, namely only eight GPR. Consequently, compilers need to juggle registers, and many instructions are generated for the sole purpose of freeing one or more registers by saving them to the stack or to memory. These instructions slow down the code.

    What is the workaround? The most common approach is to compensate with architectural improvements in the CPU. For example, the Pentium M brings very specific optimizations to the instructions that manage the stack, which are very often used to save GPR (refer to this review for more details about the Pentium M).
    This remains a trick, and the real solution is more GPR. But more GPR means a break with IA32, and consequently the definition of a new architecture. Intel and AMD are both heading this way. Intel's solution is still quite vague, but AMD's has been available for several months and is called AMD64.
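
    The sizes quoted above follow directly from the address width. The short C sketch below only restates that arithmetic; the function names are ours and purely illustrative.

```c
#include <stdint.h>

/* Number of bytes reachable with an address of `bits` bits (bits < 64). */
uint64_t addressable_bytes(unsigned bits)
{
    return (uint64_t)1 << bits;
}

/* Number of 4GB blocks needed to cover a `bits`-bit address space
   (meaningful for bits >= 32). */
uint64_t blocks_of_4gb(unsigned bits)
{
    return addressable_bytes(bits) >> 32;
}
```

    With 16 bits we get the 64KB of the 8086, with 32 bits the 4GB of IA32, and with 36 bits 64GB, that is to say sixteen blocks of 4GB.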

    More registers please !

    The AMD64 technology consists of a new operating mode that brings two extensions:

    • A horizontal extension, which consists of 64-bit GPR. The GPR are still usable as 8, 16 and 32-bit registers (which keeps the K8 compatible with existing programs), and once the CPU has been switched to long mode, software can use the whole 64 bits. This is the same kind of extension the 386 brought in its time, and it gives the name "64 bits" to the AMD64 technology.
      The direct effects of this extension are:

      • The native use of 64 bits integers.
      • The increase of the addressable memory.
      These features are interesting, but mostly in the long term, as they do not answer an urgent need.

    • The second extension provided by AMD64 consists in increasing the number of GPR from eight to sixteen in long mode. This is a vertical extension, and the first one in x86 history. The larger number of GPR will make the compiler's work easier, and in particular will reduce the register-saving instructions generated with eight GPR. In the end, the code will be shorter, more efficient, and faster.
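
    As an illustration of the horizontal extension, the C sketch below models a 64-bit GPR and its narrower views. The function names are ours. The write rules shown (a 16-bit write keeps the upper bits, a 32-bit write in long mode clears the upper 32 bits) come from the AMD64 specification, not from this review.

```c
#include <stdint.h>

/* Narrow views of a 64-bit GPR value, e.g. al, ax and eax for rax. */
uint8_t  low8(uint64_t r)  { return (uint8_t)r; }
uint16_t low16(uint64_t r) { return (uint16_t)r; }
uint32_t low32(uint64_t r) { return (uint32_t)r; }

/* Writing the 16-bit view leaves the upper 48 bits untouched. */
uint64_t write16(uint64_t r, uint16_t v)
{
    return (r & ~(uint64_t)0xFFFF) | v;
}

/* Writing the 32-bit view in long mode zero-extends the result to
   64 bits: the previous upper half is discarded. */
uint64_t write32(uint64_t r, uint32_t v)
{
    (void)r; /* the old value does not contribute to the result */
    return (uint64_t)v;
}
```

    This zero-extension is one reason why 32-bit operations remain cheap to use in 64-bit code.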


    Notice that the vertical extension has no direct link with the "64 bits" designation, even though it matters much more than the 64-bit extension itself, at least in the short term. This leads us to warn the reader against shortcuts about AMD's 64-bit technology. AMD64 is of course a 64-bit extension of the GPR, but it is not only that.

    In addition to the GPR extension, AMD increased the number of 128-bit SSE/SSE2 registers from eight to sixteen, again in long mode. The number of x87 registers, however, stays the same, as shown on the diagram below. We'll explain the reasons for this choice in the next part.

    From 32 to 64 bits

    One of the challenges of the K8 is to introduce a new 64-bit architecture while at the same time providing very good performance with current 32-bit applications. In a word: a smooth transition.
    We'll see in this part that, to achieve this, AMD did not create AMD64 from scratch but built it on IA32 foundations. This makes AMD64 very different from Intel's IA64, because the IA32 legacy brings many restrictions. On the other hand, this approach makes the transition from 32 to 64 bits very gradual, which AMD needs in order to get users to adopt the K8, and consequently AMD64.

    The K8 operating modes

    The K8 basically has two operating modes: the legacy mode, which makes the K8 behave like an IA32 CPU, and the long mode, which is the 64-bit mode. This table summarizes the different operating modes of the K8.


    • The legacy mode is the usual 32-bit mode, used by current operating systems. When running in this mode, the K8 behaves like a K7 boosted with the improvements we talked about in the previous chapter; it has eight 32-bit GPR and can address 4GB of memory. Thanks to the Physical Address Extension, physical addresses can be extended up to 52 bits, which allows up to 4096TB, or 4PB (petabytes), of physical memory to be addressed.

    • The long mode is the extended mode specific to the K8. To give an idea, it is to protected mode what protected mode is to real mode. The CPU has to be switched into long mode by the OS, which implies a dedicated OS. For the moment, Linux and two versions of Windows are able to use the K8 in long mode.

      The compatibility mode is a submode of the long mode, and allows a 64-bit OS to run a 32-bit program. Of course, these applications remain 32-bit applications, which means they cannot access the extra registers or the extra memory. On Windows, this mode is used by the Wow64 (Windows on Windows 64) layer, which creates a 32-bit virtual memory space that the application sees as a 32-bit OS.
      Wow64 works with no performance drop. This is possible thanks to the IA32 legacy, which lets IA32 code be executed without any software emulation. Indeed, we ran 32-bit benchmarks on both Windows XP (32 bits) and Windows Server 2003 for AMD64, and obtained exactly the same results.

      Wow64 obviously cannot encapsulate code that runs in kernel mode, especially system drivers. That means that drivers must be made of 64-bit code in order to run with the 64-bit kernel. The consequence is unfortunate: a 64-bit OS needs dedicated drivers. It will take several months for all drivers to be recompiled for AMD64 operating systems, and meanwhile a lot of hardware won't be usable on them.

    From 8 to 16 GPR

    Sixteen GPR are a great improvement over IA32, but we may wonder why AMD did not go further and add even more. The reason for this choice lies in the IA32 legacy, and we'll explain in this part how AMD extended IA32 to create AMD64.

    IA32 instruction encoding relies on a special byte that encodes the source and destination registers of the instruction. This is the ModRM byte, which stands for Mode / Register / Memory. Three bits of this byte encode the source register, and three others the destination. Three bits can encode 8 different values, namely the 8 GPR of IA32.
    Changing the ModRM byte is out of the question, as that would break IA32 compatibility. To add 8 new GPR, 4 bits are needed to encode a register, that is to say one more bit. This additional bit is carried by a prefix byte named REX, which only exists in long mode and also specifies whether the instruction uses 64-bit operands.


    This clearly shows how AMD64 inherits from IA32, and explains why Wow64 works without any performance loss.
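
    The register encoding described above can be sketched in a few lines of C. The function names are ours; the bit layout of the REX prefix (0100WRXB) is taken from the AMD64 documentation and is not detailed in this review. Which of the two ModRM fields is the source and which is the destination depends on the opcode, so the sketch simply calls them "reg" and "rm".

```c
#include <stdint.h>

/* A REX prefix has the form 0100WRXB, i.e. a value from 0x40 to 0x4F. */
int is_rex(uint8_t byte)
{
    return (byte & 0xF0) == 0x40;
}

/* Register number held in the ModRM "reg" field (bits 5..3), extended
   to 4 bits by REX.R (bit 2 of the prefix). Pass rex = 0 when the
   instruction has no REX prefix. */
unsigned modrm_reg(uint8_t rex, uint8_t modrm)
{
    unsigned r = is_rex(rex) ? ((rex >> 2) & 1u) : 0u;
    return (r << 3) | ((modrm >> 3) & 7u);
}

/* Register number held in the ModRM "rm" field (bits 2..0), extended
   to 4 bits by REX.B (bit 0 of the prefix). */
unsigned modrm_rm(uint8_t rex, uint8_t modrm)
{
    unsigned b = is_rex(rex) ? (rex & 1u) : 0u;
    return (b << 3) | (modrm & 7u);
}

/* REX.W (bit 3 of the prefix) selects a 64-bit operand size. */
int rex_w(uint8_t rex)
{
    return is_rex(rex) && ((rex >> 3) & 1);
}
```

    For example, one possible encoding of "mov r8d, edi" is 44 8B C7: the prefix 0x44 sets REX.R, so the reg field of the ModRM byte 0xC7 designates register 8 (r8d), while the rm field designates register 7 (edi). Without the prefix, the same ModRM byte can only reach registers 0 to 7.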

    Floating point calculations: x87, SSE and SSE2

    Why more SSE registers, but no additional x87 registers?
    Following Intel, AMD simply wants to move away from x87. It is an old instruction set, and not a very convenient one because its registers are managed as a stack instead of being directly accessible. So, bye bye x87 in long mode. The original 64-bit MMX instructions and the 3DNow! instruction set, which operate on the same physical registers, are abandoned as well.

    However, this intensive use of SSE/SSE2 instead of x87 may raise some questions, as the K7/K8 cores are not known for being particularly efficient in SSE/SSE2 computation, especially compared to the Pentium 4, which was designed to give its best on these instruction sets.

    The weakness of the K7 in SSE code is due to the architecture of its floating point units. These units include logic circuits that are specifically designed for the operation they are in charge of. For example, the Wallace tree scheme is commonly used for adders and multipliers. The Pentium 4 was designed to provide the best performance in 128-bit SSE calculation, so the Wallace trees of its units are 128 bits wide, so that an operation can be processed in one step by the logic circuit. On the other hand, the Wallace trees of the K7 FP units are 80 bits wide, as these units were designed for the x87 instruction set. Consequently, an operation on a 128-bit register has to be processed in two steps by the tree. That explains why in many cases the Pentium 4 outperforms the K7 as soon as SSE code is used.

    As we mentioned in the core study chapter, more SSE instructions use the Fastpath in the K8, which should provide more performance in SSE. However, the K8 FP units still use 80-bit Wallace trees. Does that mean that the use of SSE instead of x87 will result in worse floating point performance for the K8? In most cases, the answer is no.
    To understand why, we have to study how compilers generate SSE/SSE2 code from floating point C/C++ code. A floating point operation is translated into scalar SSE instructions, which only use the low part of the 128-bit register, as opposed to the vectorized instructions, which operate on all the 32 or 64-bit parts of the registers.
    Of course, scalar instructions provide less performance than vectorized ones, because only one operation is performed instead of 2 or 4. But as they only process a 32 or 64-bit part of the register, they can be handled in one step by an 80-bit Wallace tree. This is exactly what the K8 needs!

    Compilers are still not able to vectorize C/C++ code in a very efficient way, so in practice they generate scalar instructions. That means that the SSE code generated by compilers should provide good performance on the K8, compared to the use of x87 instructions. That's what we'll check in the next chapter.

    The 64-bit code

    The 64-bit code will thus consist in the use of sixteen GPR for integer instructions and sixteen 128-bit registers for floating point instructions. One question concerns the size of the code, which will tend to increase in comparison to 32-bit binaries. Indeed, 64-bit instructions use 64-bit operands (memory addresses, immediate values), twice the size of their 32-bit counterparts. Moreover, in the case of AMD64, 64-bit instructions use the REX prefix, which adds one more byte to the instruction size. Finally, SSE/SSE2 instructions also use a prefix.
    Bigger code means more decoding time and more cache pollution than with 32-bit code.

    AMD however added some tricks in order to reduce the 64-bit code size. For example, let's suppose we want to write 1 in a register, which is written in pseudo-code as:

    mov register, 1

    In the case of a 32-bit register, the immediate value 1 is encoded on 32 bits:

    mov eax, 00000001h

    In the case of a 64-bit register:

    mov rax, 0000000000000001h

    As you can see, the "1" needs 4 bytes in 32 bits and 8 bytes in 64 bits. Including the REX prefix, this makes the 64-bit instruction at least 5 bytes bigger than the 32-bit one.
    This is a real waste of space, especially because the actual need for 64-bit integers may be rare in practice. The trick used by AMD to reduce the size appears in this table:


    The default size of the operand is not 64 but 32 bits, and the use of a 64-bit operand requires a bit to be set in the REX prefix. This way, our instruction becomes:

    mov rax, 00000001h

    The 64-bit instruction is now only one byte bigger than the 32-bit one, due to the REX prefix.

    Concerning memory operands, the opposite choice was made: the default address size is 64 bits, and the use of a 32-bit address needs a prefix. That means that C/C++ pointers are 64 bits long by default.
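
    To make these byte counts concrete, here is a small C sketch that emits the machine code of the three forms discussed above. The function names are ours. The opcodes (B8 for a move of a full-size immediate, C7 /0 for a move of a 32-bit immediate, 48 for a REX prefix with only the W bit set) come from the x86 instruction reference, not from this review.

```c
#include <stddef.h>
#include <stdint.h>

/* Store the `n` low bytes of `value` in little-endian order; returns n. */
static size_t put_le(uint8_t *out, uint64_t value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(value >> (8 * i));
    return n;
}

/* mov eax, imm32: opcode B8 followed by 4 immediate bytes (5 bytes). */
size_t enc_mov_eax_imm32(uint8_t *out, uint32_t imm)
{
    out[0] = 0xB8;
    return 1 + put_le(out + 1, imm, 4);
}

/* mov rax, imm64: REX.W, opcode B8, 8 immediate bytes (10 bytes). */
size_t enc_mov_rax_imm64(uint8_t *out, uint64_t imm)
{
    out[0] = 0x48;
    out[1] = 0xB8;
    return 2 + put_le(out + 2, imm, 8);
}

/* mov rax, imm32 (sign-extended to 64 bits): REX.W, opcode C7,
   ModRM C0, 4 immediate bytes (7 bytes). */
size_t enc_mov_rax_simm32(uint8_t *out, int32_t imm)
{
    out[0] = 0x48;
    out[1] = 0xC7;
    out[2] = 0xC0;
    return 3 + put_le(out + 3, (uint32_t)imm, 4);
}
```

    The full 64-bit immediate costs 10 bytes against 5 for the 32-bit instruction, whereas the form with a 32-bit immediate takes 7 bytes. The same C7 form without the REX prefix takes 6 bytes, so the prefix accounts for exactly one byte.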


    We can estimate that 64-bit code will be 20-25% bigger than the equivalent IA32 code. However, the use of sixteen GPR will tend to reduce the number of instructions, which could in some cases make the 64-bit code shorter than the 32-bit code. This will depend on the complexity of the function, and of course on the compiler.

    Besides, the K8 is able to handle the code size increase, thanks to its 3 decoding units and its big L1 code cache. The use of big 32KB blocks in the L1 organization now seems very useful!

    The 64-bit performance of the K8

    In this chapter we'll look at the performance of the K8 used with a 64-bit OS, namely Windows Server 2003 for AMD64.
    Unfortunately, few applications use AMD64 yet, which is why we decided to compile some C code ourselves. The chosen functions are pure calculation functions of the kind found in current applications like games. There is no call to graphics APIs like DirectX or OpenGL, as the performance of these APIs greatly depends on the quality of the video drivers, and the AMD64 drivers are far from providing the best performance.

    The rarity of AMD64 applications is mainly due to the lack of a retail AMD64 compiler. We used a pre-release of the Microsoft AMD64 compiler, included in the Microsoft DDK for Windows Server 2003 suite, which foreshadows the release of Visual C++ 8.0. The DDK also includes a 32-bit version of the compiler, which generates ordinary IA32 code.

    It is important to notice that compiling code for AMD64 does not require much effort from the programmer, unlike the use of a specific instruction set, which must be done manually in assembly language. We just had to change some very low-level functions because the AMD64 compiler does not support inline assembly. After that, our work simply consisted in recompiling.

    Test platform

    CPU AMD Athlon 64 3400+ (2.2GHz)
    Motherboard Asus K8V
    Chipset VIA K8T800
    Southbridge VIA VT8237
    Memory DDR PC3200 Samsung, 2x512MB
    Video card ATI Radeon 9600
    Operating system Windows XP Service Pack 1a
    Windows Server 2003 for AMD64

    64-bit compilation

    The hardest work consisted in finding representative functions for our tests, so that the comparison would be as relevant as possible.

    The pre-release of the AMD64 compiler caused some problems. The generated code is completely safe, but the optimizer is not finished, and some optimizations are simply missing. For example, the use of trigonometric functions (sine, cosine) drastically slowed down the code. Indeed, the compiler calls the sin and cos functions of the standard C library, whereas the IA32 compiler uses the dedicated floating point instructions. The code of these C library functions is anything but fast.
    In the same way, the use of C++ produced very slow code, which is why all our code was written in C only.

    This made the tests very difficult, and we had to look at the generated code before being able to draw conclusions. We hope these problems will be solved in the release version of the compiler.

    We used the following functions:

    • Filter: an averaging function that works on several arrays.
    • Rotozoom: a zoom and rotation function that operates on a 512 x 512 x 2 bytes buffer.
    • Arithmetic: basic integer arithmetic operations (additions, multiplications, divisions, shifts, branches).
    • Whetstone: two similar functions that mix floating point operations (additions, multiplications, divisions). One function uses single precision floating point numbers (32 bits), the other one double precision (64 bits).

    All these functions were compiled to provide two binaries: a 32-bit version and an AMD64 version. First observation: the 32-bit executable is 33280 bytes and the AMD64 one is 41472 bytes, namely about 25% bigger. No doubt, 64-bit code is bigger!
    We gave an arbitrary performance index of 100 to the 32-bit results, in order to clearly see the AMD64 results in comparison. Here are the results for the integer functions:


    With a 35% performance increase, the rotozoom function clearly benefits from 64-bit recompilation. Let's have a look at the generated code for this particular function.

    32-bit code (first listing below) and 64-bit code (second listing below):
    
    $L53371:
    	mov	 eax, DWORD PTR _v0$[esp+16]
    $L53311:
    	mov	 ecx, DWORD PTR _du$[esp+16]
    	mov	 esi, ebx
    	mov	 edi, eax
    	sub	 ebx, ebp
    	add	 eax, ecx
    	mov	 DWORD PTR 12+[esp+16], 512
    	mov	 DWORD PTR _v0$[esp+16], eax
    $L53314:
    	mov	 eax, edi
    	mov	 ecx, esi
    	sar	 eax, 16
    	mov	 ebp, DWORD PTR _du$[esp+16]
    	and	 eax, 511
    	sar	 ecx, 16
    	and	 ecx, 511
    	add	 esi, ebp
    	mov	 ebp, DWORD PTR _dv$[esp+20]
    	add	 edx, 2
    	shl	 eax, 9
    	add	 eax, ecx
    	mov	 ecx, DWORD PTR __src$[esp+16]
    	add	 edi, ebp
    	mov	 ax, WORD PTR [ecx+eax*2]
    	mov	 WORD PTR [edx-2], ax
    	mov	 eax, DWORD PTR 12+[esp+16]
    	dec	 eax
    	mov	 DWORD PTR 12+[esp+16], eax
    	jne	 SHORT $L53314
    	mov	 eax, DWORD PTR 16+[esp+16]
    	dec	 eax
    	mov	 DWORD PTR 16+[esp+16], eax
    	jne	 SHORT $L53371
    			
    
    $L23099:
    	mov	 r8d, edi
    	mov	 r9d, esi
    	sub	 edi, r11d
    	add	 esi, ebx
    	mov	 r10, r15
    
    $L23102:
    	mov	 ecx, r8d
    	sar	 ecx, 16
    	and	 ecx, r14d
    	mov	 eax, r9d
    	sar	 eax, 16
    	and	 eax, r14d
    	add	 r8d, ebx
    	add	 r9d, r11d
    	shl	 eax, 9
    	add	 eax, ecx
    	movsxd	 rax, eax
    	mov	 ax, WORD PTR [r13+rax*2]
    	mov	 WORD PTR [rdx], ax
    	add	 rdx, 2
    	sub	 r10, 1
    	jne	 SHORT $L23102
    
    	sub	 r12, 1
    	jne	 SHORT $L23099
    			
    31 instructions (32-bit code), 23 instructions (64-bit code)

    The performance increase of the AMD64 code is essentially due to the lower number of instructions compared to the 32-bit code. This function uses many variables, and the 32-bit compiler manages them using the stack (the instructions concerned include a reference to [esp]). The AMD64 version is able to keep each variable in a register, using the eight additional GPR (r8 to r15 in the code).

    As we can see, reducing the number of instructions provides a great performance improvement. But there is another way to make the code more efficient, and it appears in the arithmetic function. Here is an extract of the generated code.
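
    For readers who prefer C to assembly, here is a sketch of a rotozoom loop that is consistent with the listings above. It is our own reconstruction from the generated code, not the source actually used for the test: the names, the 16.16 fixed point interpretation and the use of unsigned arithmetic (to keep wrap-around well defined in C) are assumptions.

```c
#include <stdint.h>

#define TEX_BITS 9                    /* 512 x 512 texture, as in the test */
#define TEX_SIZE (1u << TEX_BITS)
#define TEX_MASK (TEX_SIZE - 1u)

/* Hypothetical reconstruction: u and v are 16.16 fixed point texture
   coordinates, du and dv the per-pixel steps that encode both the zoom
   factor and the rotation angle. */
void rotozoom(uint16_t *dst, const uint16_t *src,
              uint32_t u0, uint32_t v0, uint32_t du, uint32_t dv)
{
    for (unsigned y = 0; y < TEX_SIZE; y++) {
        uint32_t u = u0, v = v0;
        for (unsigned x = 0; x < TEX_SIZE; x++) {
            uint32_t col = (u >> 16) & TEX_MASK;   /* sar 16 / and 511 */
            uint32_t row = (v >> 16) & TEX_MASK;
            *dst++ = src[(row << TEX_BITS) + col]; /* shl 9 / add / mov */
            u += du;
            v += dv;
        }
        /* Move the start of the next line perpendicularly to the scan. */
        u0 -= dv;
        v0 += du;
    }
}
```

    The inner loop keeps about ten values alive at once (u, v, du, dv, the two pointers, the two counters and the temporaries), which is more than eight GPR can hold. This is why the 32-bit listing keeps reloading values from [esp].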

    32-bit code (first listing below) and 64-bit code (second listing below):
    
    $L53327:
    	add	 eax, ecx
    	add	 eax, ebx
    	add	 eax, esi
    	lea	 ebp, DWORD PTR [eax+edi+1]
    	lea	 eax, DWORD PTR [edi+edi]
    	sub	 eax, esi
    	lea	 edx, DWORD PTR [eax+eax*2]
    	lea	 edi, DWORD PTR [edx+ebp*2]
    	cmp	 edi, 20
    	jle	 SHORT $L53330
    	mov	 edi, 20
    
    $L53330:
    	test	 ebp, ebp
    	je	 SHORT $L53331
    	lea	 eax, DWORD PTR [ebx+esi]
    	add	 eax, edi
    	cdq
    	idiv	 ebp
    	mov	 esi, eax
    	jmp	 SHORT $L53332
    
    $L53331:
    	lea	 eax, DWORD PTR [ebx+edi]
    	mov	 ebp, 2
    	add	 esi, eax
    
    $L53332:
    	test	 edi, edi
    	...
    			
    
    $L23115:
    	lea	 eax, DWORD PTR [rdi+rbx]
    	add	 eax, r10d
    	add	 eax, r9d
    	lea	 r11d, DWORD PTR [rax+r8+1]
    	lea	 eax, DWORD PTR [r8+r8]
    	sub	 eax, r9d
    	lea	 eax, DWORD PTR [rax+rax*2]
    	lea	 r8d, DWORD PTR [rax+r11*2]
    	cmp	 r8d, 20
    	cmovg	 r8d, r15d
    	test	 r11d, r11d
    	je	 SHORT $L23119
    	lea	 eax, DWORD PTR [r10+r9]
    	add	 eax, r8d
    	cdq
    	idiv	 r11d
    	mov	 r9d, eax
    	jmp	 SHORT $L23120
    
    $L23119:
    	lea	 eax, DWORD PTR [r10+r8]
    	add	 r9d, eax
    	mov	 r11d, r14d
    
    $L23120:
    	test	 r8d, r8d
    	...
    			

    The 64-bit code is still shorter than the 32-bit code, but what is noticeable in this part is that the AMD64 compiler used the cmovg instruction, at line 10.
    This instruction was introduced with the Pentium Pro; it is a conditional move (if greater, in this case) and makes it possible to avoid a branch.

    The 32-bit compiler did not generate this instruction but a branch (jle SHORT $L53330). A cmovg was not generated because this instruction is not available on all CPUs, only from the Pentium Pro onwards. Of course, the compiler settings can be changed in order to generate this kind of instruction, but this is not recommended, as some CPUs do not support it and would raise a fault. So, the default settings we used only involve the basic 386 instruction set.
    The AMD64 compiler can generate this instruction because AMD64 starts with the Athlon 64, so it can use all the instructions this particular CPU supports. This is also why the compiler can use SSE/SSE2 with default settings, unlike the 32-bit compiler.
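
    The step concerned is the clamping of a value to 20 in the listings above. In C, it can be written as follows. The function names are ours, and whether a given compiler turns the first form into a cmov or into a branch depends on the compiler and on its target settings; the second form only shows, in plain C, what "selecting without branching" means.

```c
/* Clamp x to an upper limit. A compiler targeting AMD64 is free to
   translate this into cmp + cmovg; one restricted to the 386
   instruction set has to use cmp + a conditional jump. */
int clamp_max(int x, int limit)
{
    return (x > limit) ? limit : x;
}

/* Same result computed without any conditional jump in the source:
   `over` is all ones when x > limit and zero otherwise (this assumes
   the usual two's complement representation of int). */
int clamp_max_branchless(int x, int limit)
{
    int over = -(x > limit);
    return (limit & over) | (x & ~over);
}
```

    A conditional move avoids the cost of a mispredicted branch, which matters when the condition is hard to predict.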

    Here are the results obtained with floating point functions.


    The performance increase is lower than for the integer functions, and reaches 8% for the double precision function. Let's have a look at the generated code.

    32-bit code (first listing below) and 64-bit code (second listing below):
    
    	push	 10
    	fld	 QWORD PTR __real@3f...
    	fld1
    	pop	 eax
    	fld	 ST(2)
    	fld	 ST(3)
    	fld	 ST(4)
    
    $L22485:
    	dec	 eax
    
    	fld	 ST(1)
    	fadd	 ST(0), ST(3)
    	fadd	 ST(0), ST(4)
    	fsub	 ST(0), ST(1)
    	fmul	 ST(0), ST(5)
    	fstp	 ST(4)
    
    	fld	 ST(2)
    	fadd	 ST(0), ST(4)
    	fsub	 ST(0), ST(2)
    	fadd	 ST(0), ST(1)
    	fmul	 ST(0), ST(5)
    	fstp	 ST(3)
    
    	fld	 ST(3)
    	fsub	 ST(0), ST(3)
    	fadd	 ST(0), ST(1)
    	fadd	 ST(0), ST(2)
    	fmul	 ST(0), ST(5)
    	fstp	 ST(2)
    
    	fld	 ST(0)
    	fsub	 ST(0), ST(4)
    	fadd	 ST(0), ST(2)
    	fadd	 ST(0), ST(3)
    	fmul	 ST(0), ST(5)
    	fstp	 ST(1)
    	jne	 SHORT $L22485
    			
    
    	movsdx	 xmm3, xmm9
    	movsdx	 xmm1, xmm7
    	movsdx	 xmm4, xmm7
    	movsdx	 xmm2, xmm7
    	mov	 rax, rbx
    
    $L23259:
    	movsdx	 xmm0, xmm4
    	addsd	 xmm0, xmm1
    	addsd	 xmm0, xmm3
    	movsdx	 xmm3, xmm0
    	subsd	 xmm3, xmm2
    	mulsd	 xmm3, xmm6
    
    	addsd	 xmm1, xmm3
    	subsd	 xmm1, xmm4
    	addsd	 xmm1, xmm2
    	mulsd	 xmm1, xmm6
    
    	movsdx	 xmm0, xmm3
    	subsd	 xmm0, xmm1
    	addsd	 xmm0, xmm2
    	addsd	 xmm0, xmm4
    	movsdx	 xmm4, xmm0
    	mulsd	 xmm4, xmm6
    
    	subsd	 xmm2, xmm3
    	addsd	 xmm2, xmm4
    	addsd	 xmm2, xmm1
    	mulsd	 xmm2, xmm6
    
    	sub	 rax, 1
    	jne	 SHORT $L23259
    			

    The 32-bit x87 code is very well optimized and efficient; it would be hard to do better, even manually. The AMD64 version uses SSE2 scalar instructions (the "sd" suffix means "scalar double"). As we can see, the use of scalar instructions provides code that is as fast as the x87 code, and even a bit faster, which can be explained by the slightly lower number of instructions. The stack management of the x87 requires some extra instructions compared to the SSE2 code, which accesses the registers directly.

    However, not all the functions we used for the test provided such a good level of performance, because of the compiler limitations we mentioned above. As soon as a trigonometric function was used, the performance dropped drastically. We just hope that the final release of the Microsoft compiler will solve all these problems.
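
    Both listings correspond to the same four dependent updates, which can be read back from the SSE2 code. The C sketch below is our reconstruction of that loop, not the source used for the test; the function name is ours. In the classic Whetstone benchmark this module starts with x1 = 1.0, x2 = x3 = x4 = -1.0 and t = 0.499975, values we assume here but which the listings do not show in full.

```c
/* Reconstruction of the double precision loop shown above: x[0..3]
   play the role of x1..x4, t is the constant multiplier and n the
   number of iterations. */
void whetstone_module1(double x[4], double t, long n)
{
    double x1 = x[0], x2 = x[1], x3 = x[2], x4 = x[3];

    for (long i = 0; i < n; i++) {
        x1 = ( x1 + x2 + x3 - x4) * t;
        x2 = ( x1 + x2 - x3 + x4) * t;
        x3 = ( x1 - x2 + x3 + x4) * t;
        x4 = (-x1 + x2 + x3 + x4) * t;
    }

    x[0] = x1; x[1] = x2; x[2] = x3; x[3] = x4;
}
```

    Each line maps to three additions or subtractions and one multiplication, that is to say the addsd, subsd and mulsd groups of the 64-bit listing. Since every update depends on the previous one, there is nothing to vectorize here, and scalar instructions are all the compiler can use.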

    AMD64 : our conclusion

    The results obtained by the AMD64 code are very promising, considering that it is a new technology. The versions of the compilers we used are still far from generating the best possible code, and we can expect improvements with each new version, as was the case for 32-bit compilers (and still is, judging by the improvements in version 8.0 of the Microsoft compiler).
    Results are good, and can only get better.

    The release of a good AMD64 compiler is key to the success of the technology. It will mean drivers optimized for AMD64, then AMD64 versions of the graphics APIs. This is necessary for developers to cross the 64-bit line, especially video game publishers.

    Conclusion

    The 64-bit technology is a bet, maybe the most interesting one of recent years. But AMD is taking a big risk with AMD64, because the technology can only take off if developers use it. That's why the manufacturer has made its technology as accessible as possible, reaching out to developers and releasing the complete K8/AMD64 documentation very early: everything is done to promote AMD64 among programmers.
    AMD64 is very attractive for developers, because using it does not require a lot of work, unlike SIMD instruction sets. This is a very important point for its adoption, as it won't cost much and will provide a performance increase with a simple recompilation.

    As for the K8 itself, this CPU does not represent a major architectural evolution in comparison to the K7, but maybe there was no need for one. The K8 corrects the weaknesses of the K7, that's it: better SSE/SSE2 performance, high memory performance, and a good position on the business market thanks to the HyperTransport technology. This will be useful for the K8, which will face a double challenge in the coming months:

    • Introducing AMD64 technology on the market. Just as the Athlon 64 is becoming affordable, Microsoft has released a 64-bit version of Windows XP. Everything in the K8 was designed to make the transition to 64 bits a smooth one. 2004 will tell us whether AMD succeeds in this innovative challenge.

    • On the 32-bit front, the K8 is the direct competitor of the latest Pentium 4, especially the Prescott. For the moment, the Athlon 64 outperforms Intel's latest CPU, but this situation won't last long. The Prescott is not off to a very good start, but it will become more and more attractive as its frequency increases. The Athlon 64 will have to keep up in the performance race, but is it really able to do that?




    Article Updates

    Date Author Modifications
    02/15/2004 Franck Delattre First version
    02/18/2004 Franck Delattre The integer 32 bits code was not fully optimized. The new displayed code was obtained with Visual C++ 6.0. Sorry for this mistake.



    All website content subjected to copyright - ©2001-2006 www.CPUID.com
    Legal Information