The Shape of Disaster on the Net

[Figure: Network Outages Caused By Taiwan Earthquakes]

The quakes that damaged seven undersea cables last month got me thinking about disasters in general and the way they look to the network routing around them. Much has already been written about the quakes and the damage they did to telecommunications infrastructure in Asia. But two perspectives have been missing. The first is an understanding of the event from a network (Internet) perspective. Renesys data and tools are obviously good at providing that.

The second is the comparison of this event to other events of comparable scale. What did this event look like compared to large-scale power outages? Compared to Hurricane Katrina? Compared to global routing events (mass route leaks, high-rate network scanning, etc.)? Put another way, is there a consistent "shape" that disaster takes on the Internet, and were the Taiwan quakes disaster-shaped?

[Figure: Network Outages Caused By Power Outages in the Northeast US, 2003]

Outages caused by different events have a characteristic shape. Disasters caused by natural events typically have an extremely sharp onset, coincident with the onset of the event itself. Occasionally, there is a delay in onset, but this is usually explainable within the context of the event itself. For example, in the case of the widespread power outages in the Northeastern US in 2003, the beginning of the outages was sharp and tightly correlated with the power outage itself. But outages scaled even higher as networks that had some kind of battery-backed power slowly lost power over the course of the next hour. That's why the graph shows a sharp rise and then a ragged peak as outages fluctuate at the top.
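To make that shape concrete, here is a minimal sketch of the mechanism described above: an instant jump when the power fails, followed by a slower climb as battery-backed networks run dry. Everything in it is an assumption for illustration. The function name `outages_at`, the population of 100 networks, and the battery runtimes are invented, not drawn from the 2003 data, and the toy model reproduces only the climb, not the fluctuation at the peak.

```python
def outages_at(minute, event_minute, battery_minutes):
    """Count networks that are down at `minute`.

    battery_minutes has one entry per network: how many minutes it
    survives after the power fails (0 means no backup power at all).
    """
    if minute < event_minute:
        return 0
    elapsed = minute - event_minute
    return sum(1 for runtime in battery_minutes if runtime <= elapsed)


# Synthetic population (invented numbers): 70 networks with no backup,
# 30 with batteries lasting between 10 and 60 minutes.
batteries = [0] * 70 + [10 + (i * 50) // 29 for i in range(30)]

EVENT_MINUTE = 15
curve = [outages_at(t, EVENT_MINUTE, batteries) for t in range(120)]

# Sharp onset at the event, then a slow climb over the following hour.
print(curve[14], curve[15], curve[45], curve[75])
```

The point of the toy model is only that a delayed secondary rise needs no separate cause; it falls out of the event itself once you account for backup power.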

[Figure: Network Outages Caused By Hurricane Katrina Hitting the US Gulf Coast]

Outages caused by natural disasters (like the massive network disruption caused by Hurricane Katrina making landfall in August 2005) follow a similar pattern. In that case, as you can see in the accompanying graph, you can actually track Katrina's landfall by the timing of the outages. The Louisiana outages take place almost half an hour before the Mississippi outages start, which corresponds to how much of Louisiana sticks out into the Gulf of Mexico in the face of the advancing storm. In each case, the onset of network outages is sharp and corresponds directly to the event. Any delay in onset is easily explained by the interaction between the event and the networking.

[Figure: Network Outages By Country Following the 2006 Taiwan Quakes]

And then we have the quakes, with an event onset that is ragged and messy. And a peak that is delayed almost exactly 60 minutes from the last quake of any magnitude in the region. This caused us (and still causes me) some consternation. Digging into the data helps somewhat. In the graph at the top of this article, you can see the outages broken down by country. In the breakdown, it's possible to see that the ragged onset is actually a series of stacked sharp onsets, one per country. So not all outages across all countries in the region happened at once, but the outages that did happen were fairly sharp. That helps explain the ragged start.
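The stacking effect is easy to reproduce with made-up numbers. The sketch below sums four perfectly sharp per-country step functions with staggered onsets and compares the largest one-minute jump, as a fraction of the peak, for each country and for the total. The country labels, onset minutes, and counts are entirely synthetic, and `largest_jump` is just a crude sharpness measure chosen for illustration, not anything from our tooling.

```python
def step_series(onset_minute, outage_count, length):
    """Outages per minute: zero before the onset, constant afterwards."""
    return [outage_count if t >= onset_minute else 0 for t in range(length)]


def largest_jump(series):
    """Largest minute-to-minute increase, a crude measure of sharpness."""
    return max(b - a for a, b in zip(series, series[1:]))


# Entirely synthetic onsets and counts, for illustration only.
synthetic = {
    "country_a": step_series(5, 100, 60),
    "country_b": step_series(20, 100, 60),
    "country_c": step_series(35, 100, 60),
    "country_d": step_series(50, 100, 60),
}

# Stack the per-country series into one regional total.
aggregate = [sum(values) for values in zip(*synthetic.values())]

per_country = {name: largest_jump(s) / max(s) for name, s in synthetic.items()}
overall = largest_jump(aggregate) / max(aggregate)

print(per_country)  # every country reaches its peak in a single step
print(overall)      # the stacked total only ever rises by a fraction of its peak
```

Each synthetic country goes from zero to its peak in one minute, while the total never rises by more than a quarter of its peak at once. That is the same ragged staircase you get from stacking sharp onsets in the real graph.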

But the delayed onset of the main peak outage (a zoom of which is shown here on the left) is harder to explain. Theories around the office run from a time zone screwup (the 60 minutes was suspicious, but we verified several times that we got it right) to something about underwater landslides, cables stretching, etc. I'm still not totally comfortable with any of those, but without more information, there's not much else we can say.

So, disaster has a shape. And this one didn't quite match that. There's obviously something interesting about the size and compound nature of this event. It wasn't a single event. It was seven cables breaking one at a time. It was thousands of networks making independent routing decisions about how to respond to the sudden loss in connectivity in one direction and massive congestion in every other direction. As of this writing I don't believe a single cable is yet repaired. We will be presenting information about this event at NANOG 39 in Toronto next week and also at APRICOT in Bali at the end of February. We hope to gather some war stories from providers who were (and are) in the midst of rerouting around the outage.

In a future post, if people are interested, I'll go over some of the interesting results from the quake: who "won" and who "lost" transit customers and networks, and some of the recovery and coping strategies.

Comments

An interesting mystery you have there, Todd. The timing seems too rigorous to be natural or the result of a human decision. But it does sound like what you might see from a timeout on some kind of automated system ("If I can't do X for Y seconds do Z"). No idea what that might be, of course.