A CPU Researcher Explains Why It Took 22 Years to Discover Fundamental Chip Flaw

Danish CPU researcher Anders Fogh was one of many researchers who closed in on the Spectre and Meltdown vulnerabilities found in nearly every modern CPU, with Intel being hit especially hard. After reading his blog post about how so many people poked at the same problem for years, I called him one afternoon to learn more.

(Before we start, a brief refresher. Meltdown is a massive vulnerability affecting nearly every Intel chip made since 1995, but it is largely being fixed with software patches. Spectre is more difficult to exploit, but will likely be with us for years, if not decades, because the only true way to fix it is by replacing your CPU. Both exploits allow a malicious user to access data, whether that’s your password, credit-card number, or just your personal photos stored on a cloud server.)
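(To make that a little more concrete: the bounds-check-bypass pattern described in the original Spectre paper, known as variant 1, hides in ordinary-looking code like the sketch below. The array names follow the paper’s published example; this fragment is illustrative, not a working exploit.)

```c
/* A sketch of the bounds-check-bypass pattern behind Spectre variant 1,
 * following the example in the original Spectre paper. The array names
 * are illustrative; this fragment is not a working exploit. */
#include <stdint.h>

uint8_t  array1[16];
uint8_t  array2[256 * 4096];
unsigned array1_size = 16;

void victim(unsigned x) {
    if (x < array1_size) {
        /* If array1_size isn't cached, the CPU may speculatively run the
         * next line with an out-of-bounds x before the check resolves.
         * The architectural result is discarded, but the secret byte
         * array1[x] leaves a footprint: exactly one line of array2 is now
         * in the cache, and an attacker can find which one by timing
         * reads of array2. */
        uint8_t tmp = array2[array1[x] * 4096];
        (void)tmp;
    }
}
```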

Before we get started, could you just give me a broad idea of what CPU research is?
CPU research is a lot of documentation reading. We dig into the documentation and into what other people have found out, and then we make a theory about how something may work.

Then we start making hypotheses: What kind of effects would this have on a piece of software? And then we write the software and check if we’re right. If we’re right, we move on.

If we’re wrong, we try to figure out why this isn’t happening the way it’s supposed to. It’s a very tedious process: reading a lot of material, getting an idea of how something could actually work inside a CPU, figuring out what effects that would have on software, writing software that proves or disproves the theory, and then repeating forever.
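(For a flavor of what “writing software that proves or disproves the theory” can look like, here is a minimal sketch of one such hypothesis test: a memory read served from the cache should be measurably faster than one that is not. Everything below is illustrative, assumes an x86-64 machine with GCC or Clang, and is not from the interview.)

```c
/* A minimal hypothesis test: "a cache hit is measurably faster than a
 * cache miss." Assumes x86-64 and GCC/Clang; all names are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_clflush, __rdtscp */

static uint8_t probe[4096];

/* Time a single read of *addr in timestamp-counter cycles. */
static uint64_t time_read(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

int main(void) {
    volatile uint8_t *p = probe;

    (void)*p;                        /* touch the line so it is cached  */
    uint64_t hot = time_read(p);

    _mm_clflush((void *)p);          /* evict the line from the caches  */
    uint64_t cold = time_read(p);

    /* If the theory is right, hot should be far smaller than cold.
     * If it isn't, back to the documentation. */
    printf("cached: %llu cycles, flushed: %llu cycles\n",
           (unsigned long long)hot, (unsigned long long)cold);
    return 0;
}
```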

To me, a layman, it’s odd that CPUs require so much research, since the architecture is designed by humans. Why do they require so much outside research to sort of understand what they’re doing?
Because CPUs are remarkably complex. To build a CPU, you take a handful of sand, a bit of epoxy, a tiny bit of metal, and a bit of pixie dust, and you stir it all together, and you get this machine that basically runs our world today. You can imagine that that process has to be very, very complex. Down at the lowest level you have to deal with quantum phenomena; at the next level, heat dissipation; at the next level you have to connect everything; and so on, level after level, all the way up until you actually have a piece of silicon that takes instructions. That just turns out to be incredibly complex. For scale, a modern CPU, not even the newest and the biggest, has about 5 billion transistors in it. The Saturn V rocket that took man to the moon had about 3 million parts. So this is a really ridiculously complex machine, and they have been in development for longer than I have been alive.

One of the responses a lot of people had with Meltdown and Spectre is that every chip — with a few exceptions — made since 1995 was fundamentally broken, and yet it took 22 years for these vulnerabilities to become public. Why’d it take so long? Is it just that these are such complex machines that understanding every single part of what’s going on is just too much for one person?
I don’t believe anybody understands what’s going on everywhere inside of a modern CPU. There’s just way too much information. In this specific case, I think a lot of it has to do with the fact that research into CPUs is not a new phenomenon, but it has only been picking up in recent years. People like to search for things where they know there’s going to be a weakness, and where it’s easy to find. CPUs are black boxes, but with software we can actually see what’s going on: you can take a disassembler and look at what’s inside. So that scared a lot of researchers away from taking a look at processors.

So it’s sort of the idea that, for a while, finding software exploits was relatively easy while CPU-level exploits were incredibly hard?
Yeah, in the beginning, back when I started playing with these things, it was a running joke that if you could find a certain kind of string operation, and they were abundant in programs, you had an exploit. It was just a matter of figuring out how it worked. It’s not like that anymore, but that’s how it was around the year 2000. So why would you bother doing all this black-box research into a CPU if you could just throw a program into a disassembler and have your exploit? But software has become progressively more difficult to exploit, on the one hand, and the entire security sector has also grown immensely over the past couple of years.

So was it that as software security improved, the black box of the CPU, which looked so difficult 20 years ago, suddenly became worth exploiting if you’re a bad actor in the space?
Surely. There’s a connection between software problems getting more and more difficult to exploit and people looking elsewhere.

When it comes to CPU research, I understand part of the reason is that it’s tremendously difficult, but it also seems difficult to make money doing it?
Monetizing this research is fairly difficult. There are official ways to monetize this kind of research, like bug bounties, but there’s a lot of risk, and most bug bounties are about $1,000. You can’t make a living off of $1,000, especially if you’re not sure that you’ll find anything. So this is something that’s limited to people who have clients who pay for penetration testing and the like, and then universities and hobbyists.

Taking a step back, I’ve seen people writing that Spectre and Meltdown are a result of chip manufacturers favoring speed over security.
I’m not sure, because I’m not really a CPU designer and the topic is ridiculously complex, but my guess is that they could have done something about it relatively easily. I’m not trying to say that Intel can do something about it now in a short time frame, because that’s an entirely different question, but I would guess they were taken by surprise by this.

Looking out to the future, chips are only going to get more complex. Can CPU research keep up?
Yeah, chips will get more complex, but our understanding of what security means in chips is growing as well. I think chip security is going to be something that stays with us for a very long time, if not forever, and I think some of the problems may also be intrinsic to how chips work.

So even if Spectre-affected chips are redesigned so that Spectre isn’t an issue, there are basic side-channel vulnerabilities that may always be with us.
The first side channel that I know of is from 1973 and it still works. It’s a ridiculously simple side channel, and it’s not much of a security problem, but it’s still there.
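(Fogh doesn’t name the 1973 channel, but a classic illustration of a “ridiculously simple” side channel is an early-exit comparison, sketched hypothetically below: how long the check runs reveals how many leading characters of a guess are correct.)

```c
#include <stddef.h>

/* Hypothetical example of a simple timing side channel: this comparison
 * bails out at the first mismatch, so its running time leaks how many
 * leading characters of the guess are correct. */
int check_secret(const char *guess, const char *secret, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (guess[i] != secret[i])
            return 0;    /* early exit: the timing is the side channel */
    }
    return 1;
}
```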

In the near future, it seems like Meltdown is a problem that will slowly get patched up, but do you have any read on how long it will take for a problem like Spectre to go away, or to stop being a potential vulnerability?
I think it’s going to be a very long time before Spectre is entirely dead. It’s already very difficult to use, and that’s going to get more difficult with time, and the things you can do with it are gradually going to become fewer and fewer. But total eradication of the problem is very far off, a long time away, I think.

As you said, there’s a side channel from 1973 that’s still in existence, so it doesn’t fill you with a ton of optimism.
I mean, the point is that we’ve known about these software exploits for a long time now. We’ve not been able to fix them, but the world’s still running, and some of those are really bad exploits. I think the point here is that there is an amount of risk that is acceptable, or at least workable. I have no problem with the fact that we still have the side channel in our computers. And I think that is the direction Spectre is going to go. It’s just going to get less and less relevant as the big issues get fixed, but the smaller and less important issues are probably going to remain with us.

So it’ll be like the common cold. We’ll never get rid of it, but very few people die from it.
I think it’s probably more like tuberculosis. There was a time when it was really, really bad, but now we have penicillin. But tuberculosis is still not something you want to catch.
