DailyTech - SMT, Multi-level Cache Confirmed for Nehalem


Blog: Hardware SMT, Multi-level Cache Confirmed for Nehalem
Kristopher Kubicki (Blog) - July 18, 2007 2:46 AM

28 comment(s) - last by Master Kenobi.. on Jul 20 at 4:12 PM

If at first you don't succeed -- try, try again.

Earlier this year we ended up with a little egg on our face when we claimed Intel would bring HyperThreading back with the Penryn architecture. Intel insiders, roadmaps and guidance all said the same thing: we'd see simultaneous multi-threading in 2007.

Of course, we all know how that turned out -- after a full day Intel revised its roadmap and sent new guidance out without HyperThreading. All engineers immediately clammed up and apologized.

Six months later, simultaneous multi-threading is back on Intel's roadmap, but this time nobody is doing a double take. Intel's guidance for Nehalem, the next-generation 45nm successor to Penryn, claims the following:

Leverages 4-issue Intel Core microarchitecture technology
Simultaneous multi-threading (SMT)
Multi-level shared cache architecture
Performance-enhanced dynamic power management
Fully unlocks Intel 45nm Hi-K silicon process benefits

Intel engineers stress that while we'll see SMT on Nehalem, it has very little to do with the original HyperThreading found on the Pentium 4 architecture. "We've had some sort of SMT support for all processors since the Pentium 4 ... HyperThreading is just one implementation of SMT," one engineer said, adding, "What you're going to see on Nehalem is much, much different."

In addition to these tidbits, Intel's roadmap confirmed the presence of Intel's new uniform bus, previously dubbed Common System Interface. This bus has been renamed to Intel QuickPath Interconnect, and will appear on Itanium and Xeon platforms.

Expect to see the first Nehalem offerings from Intel in the second half of 2008 for one and two socket servers. The company did not disclose a date for desktops or mobile offerings.

Comments


QuickPath Interconnect?

By Sahrin on 7/18/2007 2:57:25 PM , Rating: 2

Fire that marketing guy.

And I thought we were all in agreement that SMT sucked? Invest your time in better compilers and software support so that developers know how to best utilize the resources at their disposal, not a hack that 50% of the time costs you CPU cycles. SMT does nothing for your system that CMT doesn't do better.

RE: QuickPath Interconnect?

By ChronoReverse on 7/18/2007 4:32:43 PM , Rating: 2

SMT is a great thing. It was the implementation in the Pentium 4 that wasn't that great.


RE: QuickPath Interconnect?

By TomZ on 7/18/2007 5:13:11 PM , Rating: 3

I disagree. HyperThreading was a good feature to add as a stopgap measure until dual-core processors were practical. It solved the problem where one app would consume 100% CPU and effectively lock you out from being able to do anything else with your computer.


RE: QuickPath Interconnect?

By smitty3268 on 7/18/2007 5:46:02 PM , Rating: 2

It was a good feature to add. That doesn't mean it wasn't still implemented piss-poorly. It should have been twice as good as it ended up being.


RE: QuickPath Interconnect?

By TomZ on 7/18/2007 9:25:55 PM , Rating: 2

I think the same could be said about any non-trivial engineering development that does not have unlimited resources.


RE: QuickPath Interconnect?

By smitty3268 on 7/18/2007 10:26:39 PM , Rating: 2

It can't be said about the other SMT implementations that existed before HyperThreading. I'm never against getting an extra feature for free, and perhaps cost kept Intel from putting out a better product, but compared to competing implementations it was very low quality. Still, better something than nothing.


RE: QuickPath Interconnect?

By ChronoReverse on 7/18/2007 6:14:49 PM , Rating: 2

SMT originated from studies and papers that showed the limits on how many instructions a CPU can execute at once. It was shown that about 3-4 instructions at once was the most in almost all cases (instruction-level parallelism). So what do you do with the leftover execution units when you're not in the border case where all the units are used?

Enter SMT. Only requiring a very modest increase in the pipeline length, thread level parallelism is enabled to extract more performance. Depending on the design of the processor (that is, whether SMT is tacked on or designed in from the beginning) the level of TLP can actually be very significant.

From the high level information about the Core 2, the processor appears to be very wide and thus a prime candidate for SMT. However, there are other concerns that probably didn't make it reasonable for Intel to enable SMT in the Core 2. One thing that comes to mind immediately is the power density if the CPU is being used more completely.

Anyway, your post implied that SMT is great for latency issues, which it did in fact help with on the P4. However, SMT's true purpose is to increase throughput, and if you take a look at other modern processors besides the P4 that are built with SMT, that is certainly the case. In the end, the rather inadequate implementation of SMT in the P4 has quite tarnished its image.
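
The throughput argument above can be illustrated with a toy model. This is a minimal sketch, not a description of any real Intel design: it assumes a 4-wide core (the width quoted for Core 2 elsewhere in this thread) and made-up per-cycle instruction counts for two hypothetical threads, thread_a and thread_b.

```python
# Toy model of issue-slot usage on a 4-wide core.
# All per-cycle numbers below are illustrative assumptions, not measurements.

ISSUE_WIDTH = 4  # assumed number of instructions the core can issue per cycle


def slots_used(per_thread_ilp):
    """Count issue slots filled when the given threads share one core.

    per_thread_ilp: one list per thread; each entry is how many independent
    instructions that thread could issue in that cycle (0 = stalled).
    """
    used = 0
    for cycle in zip(*per_thread_ilp):
        # The core cannot issue more than ISSUE_WIDTH instructions per cycle,
        # no matter how many the threads could supply together.
        used += min(ISSUE_WIDTH, sum(cycle))
    return used


thread_a = [2, 1, 3, 0, 2, 1, 4, 0]  # hypothetical thread, stalls twice
thread_b = [1, 2, 0, 3, 1, 2, 1, 2]  # hypothetical second thread

total_slots = ISSUE_WIDTH * len(thread_a)
single = slots_used([thread_a])            # one thread alone on the core
smt = slots_used([thread_a, thread_b])     # both threads sharing the core

print(f"single thread: {single}/{total_slots} slots used")
print(f"two threads:   {smt}/{total_slots} slots used")
```

With these made-up numbers a single thread fills 13 of 32 slots and two threads together fill 24. The one cycle where the threads could jointly supply 5 instructions is capped at 4, which is the contention case that makes a poor SMT implementation look bad.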


RE: QuickPath Interconnect?

By Sahrin on 7/19/2007 12:50:30 PM , Rating: 2

SMT works on the principle that there is wasted execution hardware in the CPU, either through pipeline stalls or over-designed hardware. The response to this should either be:

1) Design a more efficient scheduler/cache system which is able to execute available threads in an OOO manner which yields better results

or

2) Don't waste so much die space on extra execution hardware.

In response, you could say "you design to the load" and that a wider core is better for "heavy load" situations, to which I respond:

1) Software developers are the ones who determine the load and how it is allocated - you are either putting control of SMT in their hands (with HTT instructions) or you are making them responsible for parallelism through multi-core

and

3) It is far more efficient (and useful) to have a large number of simple cores (providing more execution hardware than the system could ever possibly need) than it is to have a massive, wasteful monolithic core which has enough hardware for *peak* loads but too much for an "average" load.

To use the common car analogy, SMT is like destroying two lanes to get an HOV lane - it only works if 'drivers' you were already moving double up and use the HOV lane INSTEAD of driving alone.

SMP (I put CMT before and I don't know what the hell I was thinking, I apologize for confusion) is like mass-transit. None of them is a "sports car" (P4) or a "semi truck" (Core 2) - it's a bunch of efficient "packets" being sent across the system.

SMT is a hack to address an inefficient situation - instead of helping improve the situation with better compilers and dev support, Intel throws the "HTT" bone directly to users - and thus "locks us in" to the inefficient system.


RE: QuickPath Interconnect?

By ChronoReverse on 7/19/2007 1:42:19 PM , Rating: 5

quote:
To use the common car analogy, SMT is like destroying two lanes to get an HOV lane - it only works if 'drivers' you were already moving double up and use the HOV lane INSTEAD of driving alone.

Eh.

First, to continue the car analogy, SMT is like taking one really wide lane and dividing it into two narrow lanes. The issue is what happens when you try fitting a truck? It'll clog up both lanes until it gets through. But otherwise, you're getting way more cars through.

Actually, using trains would be a better analogy.

Second, the schedulers and cache systems in modern CPUs are already fearsomely efficient. Because the cost of a miss is so high, we're still getting boosts from improvements to these subsystems, but there's only so far you can go when your hit rate is already at 90% in the previous generation (and even higher in the current). Looking elsewhere IN CONJUNCTION for other performance boosts is a good idea.
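
The point about hit rates can be put in numbers with the standard average memory access time formula: AMAT = hit time + miss rate x miss penalty. A rough sketch follows, assuming a hypothetical 3-cycle hit and a 200-cycle miss penalty. Both values are illustrative only and are not figures for any particular CPU.

```python
# Average memory access time (AMAT) for a single cache level.
# HIT_TIME_CYCLES and MISS_PENALTY_CYCLES are assumed, illustrative values.

HIT_TIME_CYCLES = 3
MISS_PENALTY_CYCLES = 200


def amat_cycles(miss_rate_pct):
    """Return average cycles per access for a miss rate given in percent."""
    return HIT_TIME_CYCLES + miss_rate_pct * MISS_PENALTY_CYCLES / 100


for hit_rate_pct in (90, 95, 99):
    miss_rate_pct = 100 - hit_rate_pct
    print(f"{hit_rate_pct}% hits -> {amat_cycles(miss_rate_pct)} cycles on average")
```

Under these assumptions, going from 90% to 95% hits saves 10 cycles per access and going from 95% to 99% saves 8 more, but only 2 cycles remain to be won after that. That shrinking headroom is why other sources of performance are worth pursuing alongside cache and scheduler work.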

Third, it's easy to say you shouldn't waste that much space on execution units. We already know that 3-4 is the max you'll see in non-specialized cases. Nonetheless, 4 wide is the C2D design and that means you'll often have unused units. Might as well make use of them.

quote:
3) It is far more efficient (and useful) to have a large number of simple cores (providing more execution hardware than the system could ever possibly need) than it is to have a massive, wasteful monolithic core which has enough hardware for *peak* loads but too much for an "average" load.

This is one solution. SMT is also a solution that works quite elegantly too when implemented well. Fact of the matter is, there are still some jobs that will be "fat" in terms of ILP and a wider chip will simply get it done far faster.

Remember, ILP is easy, TLP is hard. That's why wide cores are still useful for general-purpose CPUs.

The ideal solution of course would be a combination of the two. The hardware is being used efficiently because of SMT, but you still have a full core's resources when you need it (ultimately not everything can be threaded that well).

If I had to wager a guess, 4 cores each equipped with SMT would be a sweet spot for desktop/gaming purposes


RE: QuickPath Interconnect?

By defter on 7/20/2007 3:28:53 PM , Rating: 2

quote:
3) It is far more efficient (and useful) to have a large number of simple cores (providing more execution hardware than the system could ever possibly need)

LOL. You cannot have "more execution hardware than the system could ever possibly need" with a reasonable die size.

Why do you think that simple cores make SMT useless? Niagara has 8 simple cores, and each core supports 4-way SMT...

SMT is not a "hack", it's a way to extract more efficiency from a core. Even if the core is simple, it isn't always 100% utilized by a single thread. So why not run another thread on it while the first thread is idling? Why waste die space by adding a huge number of cores?

If SMT is so useless, why do you think that server-oriented CPUs (Power, Niagara, Itanium) all support SMT? Do you think that Sun and IBM are stupid?


RE: QuickPath Interconnect?

By defter on 7/20/2007 3:32:07 PM , Rating: 2

SMT has nothing to do with number of cores. In many tasks (servers, rendering, etc..) ability to run extra threads usually improves performance. Thus quad core with SMT will be faster than quad core without SMT and so on. SMT and multi-core are not mutually exclusive.


RE: QuickPath Interconnect?

By TomZ on 7/18/2007 5:09:44 PM , Rating: 2

I don't understand what you're talking about. What sucks about SMT/HT? What is CMT?


RE: QuickPath Interconnect?

By Sahrin on 7/19/2007 12:51:14 PM , Rating: 2

TomZ, see my reply above - I meant SMP - not CMT, I apologize for the confusion.


Comments

By Anh Huynh on 7/18/2007 1:25:14 PM , Rating: 2

Please post all comments in reply to this one, or other ones below it. We're trying to resolve the issues with our comment system. Thanks.

RE: Comments

By EarthsDM on 7/18/2007 1:28:38 PM , Rating: 2

I wonder what the exact difference is between the old and new HyperThreading model. IIRC, the old HyperThreading was about 30% efficient, meaning that between the first and the 'extra' processor, you got about 130% of the original performance (optimally.) I wonder if they're using the IBM/Sun model and just having one core that happens to execute two threads simultaneously.
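
The 130% figure above can be compared against a true second core with some back-of-the-envelope arithmetic. This is only a sketch: it takes the roughly 30% best-case gain recalled in the comment at face value and assumes perfectly linear scaling across cores, which real workloads rarely achieve. The function name relative_throughput_pct is made up for illustration.

```python
# Idealized throughput relative to one core without SMT (= 100).
# Assumes linear scaling across cores and a fixed SMT gain per core;
# both are simplifying assumptions, not measured behavior.


def relative_throughput_pct(cores, smt_gain_pct=0):
    """Return throughput as a percentage of a single non-SMT core."""
    return cores * (100 + smt_gain_pct)


one_core_ht = relative_throughput_pct(1, smt_gain_pct=30)   # P4-style HT, best case
two_cores = relative_throughput_pct(2)                      # dual core, no SMT
four_cores_smt = relative_throughput_pct(4, smt_gain_pct=30)

print(one_core_ht, two_cores, four_cores_smt)
```

Under these assumptions one core with HyperThreading reaches 130, a plain dual core reaches 200, and a quad core with the same SMT gain reaches 520.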


RE: Comments

By Master Kenobi (blog) on 7/18/2007 1:32:19 PM , Rating: 2

That wouldn't be too far-fetched, as Conroe currently executes 2 jobs at the same time if I'm not mistaken. It wouldn't be too hard to enlarge the lane to handle 4 and then split it between threads, or to keep all 4 tied together and intelligently split the difference depending on which thread is more demanding. The possibilities are there.


RE: Comments

By ChronoReverse on 7/18/2007 1:53:23 PM , Rating: 2

Part of how well SMT works is how many resources are available. The number of units that are underutilized depends on the processor design but they certainly will exist.

When I found out how wide the C2D's were, I was rather surprised that Intel didn't enable SMT to utilize the resources more effectively. I also thought that they would put it back in for Penryn.

I wonder if the current motherboards will support this with a bios update though. I hope my P35 DS3R will at least support it...


RE: Comments

By smitty3268 on 7/18/2007 5:43:47 PM , Rating: 2

The efficiency of HyperThreading varied a lot depending on the workload. Branchy, inefficient code let it work quite well, while highly optimized code caused HT to basically become useless. Due to the Netburst design, there was actually quite a bit of the former but the apps that really wanted a full second core tended to be the latter.


RE: Comments

By iwod on 7/18/2007 1:34:41 PM , Rating: 2

OH NO!...
Intel QuickPath sounds AWFUL!
CSI, Intel's finest (like the advert on Channel 4 in the UK). Common System Interface is a much better name.


RE: Comments

By Master Kenobi (blog) on 7/18/2007 2:01:35 PM , Rating: 2

Yeah, I agree, CSI (Common System Interface) was a better-sounding name.


RE: Comments

By Operandi on 7/18/2007 2:12:42 PM , Rating: 3

When I hear CSI I can't help but think of horrible TV.


RE: Comments

By omnicronx on 7/18/2007 2:44:26 PM , Rating: 2

WHO ARE YOU!!! WHO? WHO?


RE: Comments

By Ringold on 7/19/2007 8:55:18 PM , Rating: 3

I don't know, but surely you can stick "Operandi" into a magic app, it'll get his IP from his post, and within seconds you'll have his criminal records from everywhere in the country (including a full range of biometrics) and a current satellite video feed showing him eating dinner through his kitchen window.


RE: Comments

By James Holden on 7/19/2007 11:08:43 PM , Rating: 2

I'm always amazed when people can turn 6 pixels into a license plate! :)


RE: Comments

By MonkeyPaw on 7/20/2007 7:38:19 AM , Rating: 2

Maybe CBS threatened to sue Intel if they used the term CSI?

<insert cheesy David Caruso line here>


AMD's Answer to Nehalem

By EndPCNoise on 7/18/2007 3:28:40 PM , Rating: 2

Anybody know what product line/series AMD is planning to have compete with Nehalem around 2009-ish? Would it be Phenom, or something else?

RE: AMD's Answer to Nehalem

By Vanilla Thunder on 7/18/2007 5:34:03 PM , Rating: 2

2009 should be bringing us into the age of Fusion and Torrenza with AMD. Very interesting projects that in my opinion could change the way we look at/purchase processors and GPUs.

Vanilla


RE: AMD's Answer to Nehalem

By Master Kenobi (blog) on 7/20/2007 4:12:55 PM , Rating: 1

Not really. Short of allowing you to have a user-replaceable integrated graphics chip, it's not that great. And it will never replace or even compete with discrete graphics cards.

