
The Speed of Light Sucks

05-17-11 by Jeff Atwood. 54 comments

Our current datacenter is in New York City. Yep, where they make all that great salsa. So whenever you make a request to any Stack Exchange site, the internet tubes must connect from your location to our datacenter in NYC. We are not (yet) immune to the laws of physics, so depending on the distance between you and NYC this … can take a while.

As John Carmack once so eloquently said:

The speed of light sucks.
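To put some rough numbers on that: even over a perfectly straight fiber path, the speed of light puts a hard floor under round-trip time. Here's a back-of-the-envelope sketch (the distances are approximate great-circle figures, and light in fiber travels at roughly two-thirds of its vacuum speed):

    # Rough lower bound on round-trip latency imposed by the speed of light.
    # Distances are approximate great-circle figures; real routes are longer.
    SPEED_OF_LIGHT_KM_S = 299792   # speed of light in a vacuum
    FIBER_FRACTION = 2 / 3         # light in fiber travels at roughly 2/3 c

    def min_rtt_ms(distance_km: float) -> float:
        """Theoretical minimum round-trip time, in milliseconds."""
        one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FRACTION)
        return one_way_s * 2 * 1000

    for city, km in [("London", 5570), ("San Francisco", 4130), ("Sydney", 16000)]:
        print(f"NYC <-> {city:13} ~{km:6} km  =>  at least {min_rtt_ms(km):5.1f} ms RTT")

Even in the best case, a reader in Sydney can't complete a round trip to NYC in much under 160 ms, no matter how good the network is.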

But there is a workaround of sorts. Since summer 2009, we have served all our static content (that is, stuff that does not change on every request), such as JavaScript, images, and CSS, from sstatic.net. Since these files don't change very often, there's no reason they have to be served directly by us; we can offload them to a content delivery network.

A good CDN has a network of fast nodes all over the world.

With a CDN, when you make a request for, say, favicon.ico — that particular file doesn’t have to be delivered from our NYC datacenter. It can come from a server in the CDN closer to you. Yes, these files are usually cached, but you do have to retrieve them at least once and sometimes a few times a day. The resulting performance improvement can be quite dramatic, particularly for that first click!

We’re currently evaluating our CDN options and we want to measure the real-world improvements of a few different CDNs.

Make a few requests to each of these links, using Ctrl-F5 / Command-Shift-R to force a redownload instead of using a cached version, and record the typical duration of a download.

In Chrome, you can see detailed download times via the “Network” tab of the Developer Tools, which can be invoked via Ctrl-Shift-I.

In Firefox with Firebug, download timing is on the “Network” tab, too.

The result in the Chrome screenshot is 576ms; in the Firefox screenshot it’s 490ms.
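If you'd rather script the measurement than click through each browser, a small sketch like the following does roughly the same thing. The URLs below are placeholders rather than the actual test links from this post, and a Cache-Control: no-cache request header stands in for the forced refresh:

    # Time a fresh download of the same static file from several candidate hosts.
    # The URLs here are placeholders; substitute the actual test links.
    import time
    import urllib.request

    TEST_URLS = [
        "http://cdn-candidate-1.example.com/favicon.ico",
        "http://cdn-candidate-2.example.com/favicon.ico",
    ]

    for url in TEST_URLS:
        req = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
        start = time.perf_counter()
        with urllib.request.urlopen(req) as resp:
            resp.read()
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{url}: {elapsed_ms:.0f} ms")

Run it a few times per URL and record the typical value; the very first request also pays for DNS lookup and connection setup.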

Please use this Google form to enter your results.

With your data in hand, we hope to choose a killer CDN that makes Stack Exchange faster all over the world!

update: now with results! The percentages here mean percent improvement over sstatic.net, which is served directly from our NYC datacenter without a CDN.

PEER 1 Hosting – Making your data center more awesome!

03-08-11 by Alison Sperling. 6 comments

In 2003, Fog Creek Software (aka Joel’s other baby) moved offices, and decided to ditch its internal T1 and look for a colocation provider. Joel was impressed with PEER 1 Hosting’s customer service, the shiny new data center in NYC, and PEER 1 Hosting even volunteered to host Joel on Software – for free!

When we decided to move our Stack Exchange Network to the East Coast to better serve our global customers, PEER 1 Hosting was the logical choice because of the success that Fog Creek had. We began to migrate part of the data center in May of 2010, and finalized the move of all live sites from Oregon in October of 2010. After all the sites were set up at PEER 1 Hosting, we noticed some awesome results and thus we started a discussion with PEER 1 Hosting about how to extend the same benefits to the community.

We think it’s a win-win!

  • As an advantage of being part of the community, you get an awesome data center at a discounted price – Win!
  • The more business PEER 1 Hosting does with people in the community, the more support they can provide to power Stack Exchange – Win!

Here’s a look at our servers hosted at PEER 1 Hosting:

Stack Exchange Peer 1 Servers

Database Upgrade

10-30-10 by Jeff Atwood. 23 comments

As part of our datacenter migration, the database server received a substantial upgrade:

Oregon:  48 GB RAM, 2 Xeon X5470 CPUs, 8 total cores @ 3.33 GHz
NYC:     64 GB RAM, 2 Xeon X5680 CPUs, 12 total cores @ 3.33 GHz

However, a few things didn’t go quite to plan in the migration. Much to our chagrin, the database server ended up being barely faster — and maybe even a bit slower than our old database! This was deeply troubling.

The new Nehalem CPUs (what you may know as Core i7) are sort of meh on the desktop, but they are monsters on the server. It’s not unusual to see 200% performance increases going from Core 2 class server CPUs, like the ones we have in Oregon, to these newer Core i7 class server CPUs. Just ask AnandTech’s Johan De Gelas:

The Nehalem architecture only caused a small ripple in the desktop world, mostly due to high pricing and performance that only shines in high-end applications. However, it has created a giant tsunami in the server world. The Xeon 5570 doubles the performance of its predecessor in applications that matter to more than 80% of the server market. Pulling this off without any process technology or clock speed advantage, without any significant increase in power consumption, is nothing but a historic achievement for the ambitious and talented team of Ronak Singhal.

So … yeah. We should be seeing performance improvements, and big ones, not the break-even parity (at best!) we were actually seeing.

We began looking into it and troubleshooting. That’s why there was some downtime around 5 pm Pacific the last few days. We were messing around with our primary and backup database servers in NYC. Here’s what we tried:

  1. We thought maybe the combination of SQL Server 2008 R2 and Intel’s next-gen HyperThreading was not mixing well. We’re still not sure, but we opted to be on the safe side and disable HyperThreading for now; 12 real, physical cores seems like plenty for our workload without adding fake logical CPUs to the mix.
  2. We realized we had mixed up CPUs a bit and we didn’t have the correct CPU in the server. Close, but not quite right. This was easy enough to fix with a CPU swap, but it alone was not enough to explain the performance issues.
  3. After trying a few other minor things, and with a nudge from Brent “database ninja” Ozar, we narrowed it down to the clock speed of the CPUs themselves. Despite having set high performance mode in Windows Server 2008 R2’s power management control panel, the CPUs weren’t clocking up at all under load — we were seeing about half the clock speed we should have (a quick way to check for this is sketched below the list).
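For the curious, a quick way to spot this class of problem is to compare the CPU’s current clock against its rated maximum while the box is busy. Here’s an illustrative sketch using the psutil library (not what we actually ran; a tool like CPU-Z or the OS performance counters will show the same thing):

    # Compare the current CPU clock against its rated maximum.
    # If "current" sits far below "max" while the machine is busy,
    # power management is probably holding the clock down.
    import psutil

    freq = psutil.cpu_freq()
    if freq is None:
        print("CPU frequency information is not available on this platform")
    else:
        print(f"current: {freq.current:.0f} MHz, max: {freq.max:.0f} MHz")
        if freq.max and freq.current < 0.8 * freq.max:
            print("warning: CPU is running well below its rated clock speed")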

Kyle asked why our CPUs weren’t clocking up on Server Fault. In the process of asking the question and researching it ourselves, we discovered the answer. These Dell servers were inexplicably shipped with BIOS settings that …

  • did not allow the host operating system to control the CPU power settings
  • did not set the machine to high performance mode
  • did not scale CPU speed under load properly

… kind of the worst of all worlds. But Kyle quickly flipped a few BIOS settings so that the machine was set to “hyperspeed mode”, and performance suddenly got a lot better. How much better?

My benchmarks, let me show you them! This is an average of 10 SQL query runs on a copy of the Stack Overflow database, under no (or very little) real world load.

  query                                   OR DB2 (2.5 GHz)    OR DB1 (3.33 GHz)    NYC DB2 (3.33 GHz)
  gnarly query for Sportsmanship badge    3177 ms             2919 ms              1285 ms
  simple full text query                  555 ms              423 ms               335 ms

Notice that database performance scales nearly linearly with CPU speed. This has always been the case in our benchmarking, but our dataset fits in memory. I don’t think that’s unusual these days. Building a 64 GB server like this one is not terribly expensive any more — and solid state drives are bridging the gap between disk and memory performance at 256 GB and beyond. Anyway, the received wisdom that “database servers need fast disks above all else” is kind of a lie in my experience. Paying extortionate rates for a crazy fast I/O subsystem is a waste; instead, spend that money on really fast CPUs and as much memory as you can afford.

Best of all, the gnarly query shows the crushing 2x Nehalem Xeon performance increase we would expect (2919 ms down to 1285 ms at the same clock speed)! The full text query is “only” 25% faster, but we’ll take that too!
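For reference, a benchmark like the one above is easy to reproduce: run the query several times and average the wall-clock time. A minimal sketch with pyodbc follows; the connection string and query are hypothetical stand-ins, not our actual workload:

    # Run a query repeatedly against SQL Server and report the average wall time.
    # The connection string and query are hypothetical placeholders.
    import time
    import pyodbc

    CONN_STR = "DRIVER={SQL Server};SERVER=nyc-db2;DATABASE=StackOverflow;Trusted_Connection=yes"
    QUERY = "SELECT COUNT(*) FROM Posts WHERE Score > 10"   # stand-in for the real query
    RUNS = 10

    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor()

    timings = []
    for _ in range(RUNS):
        start = time.perf_counter()
        cursor.execute(QUERY)
        cursor.fetchall()
        timings.append((time.perf_counter() - start) * 1000)

    print(f"average over {RUNS} runs: {sum(timings) / len(timings):.0f} ms")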

So, our apologies for the downtime. We tried to share everything we learned in the process here and on Server Fault so the community can benefit. We hope this upgrade brings a faster and more responsive set of Stack Exchange sites to you!

(and if you’d like oodles more datacenter details, do check out the Server Fault blog.)

Datacenter Migration Oct. 23

10-21-10 by Jeff Atwood. 41 comments

This Saturday, October 23rd, starting at about 2 PM Pacific, we will be migrating all of our primary sites from the Corvallis, OR datacenter to the New York, NY datacenter.

Please be advised that this is a major move, and while we will do everything we can to prevent major service interruptions (largely with a read-only site mode we’re introducing), there may be a few hours of unavoidable downtime.

This move is good news, though:

  • NYC is approximately 80 milliseconds closer to Europe, which is where a significant portion of our audience comes from. It is also dramatically closer to the rest of the east coast of the USA.
  • The Peer 1 internet network infrastructure should be faster.
  • The servers all have twice as much memory (16 GB web tier, 64 GB database tier) and their CPUs are one generation ahead of what we have in Corvallis (Core 2 vs Core i7 class).
  • There’s a lot more of … uh, everything.

At worst, this NYC configuration will be the same speed overall — but much more robust. At best, you should notice a 100 to 150 millisecond improvement in response time on every single page.

As always, you can read real time updates and details about the move on blog.serverfault.com.

update: this migration is now complete. We have a few very minor things left to clean up, but for the most part everything should be working as before.

Stack Overflow Outage

10-09-10 by Kyle Brandt. 16 comments

As you may have noticed, many of Stack Overflow’s websites suffered some downtime today, starting at 6am EDT and lasting about an hour, and stackoverflow.com is still offline for maintenance. Our colocation provider in Oregon experienced an unexpected UPS failure that caused us to lose power. Once they were able to restore power, Geoff, who was already on site, brought our servers back up. Stack Overflow itself is still offline because the database was, well, “suspect” according to SQL Server. We have recovered the database and are working to bring it live again.

We apologize for this outage, but we are working hard to make sure you can always get your answers. We will keep everyone updated.

Update:
We have managed to restore the database and stackoverflow.com is now live again as of 10:45 AM Eastern.

Update:
We have managed to restore the missing 4-hour window of Stack Overflow data as of 1:30 PM Eastern.

Going forward, we have set up new servers in a new facility in New York. We have already moved some of our sites; you may have noticed that meta.stackoverflow.com stayed up during this outage. The new data center includes the following improvements, so our sites will have higher availability and you can always get answers to your questions:

  • Two power feeds from independent UPSes.
  • Redundant Internet feeds as well as redundant routers and switches.
  • Every site is run from multiple servers.

Thank you everyone for your patience and support during this outage.