Microsoft.com Operations

We are the operations team that runs the Microsoft.com sites.

Scaling Your Windows...and other TCP/IP Enhancements in Windows Vista/Longhorn

Our Ops team has been testing and sampling the goods that are Longhorn Server for a while now and one of the areas we're very interested in is networking.  Specifically, we're jazzed about the changes happening in the TCP/IP stack for both Vista and Longhorn.  We know the impact will be huge for backend operations such as moving data between data centers, but we also think there will be significant improvements on the front-end including downloads with Vista clients.  That means snappier downloads for you at home and at work...at least where your network has the bandwidth to allow you to take advantage of this.   

 

Our first taste of the new stack came when the Windows Networking team asked us to help them test the new stack in the data center to get some real-world data.  We set up one server in Bothell, WA and one in Santa Clara, CA (~22ms round-trip latency) and let the Devs have at testing with TTCP.  The results were stellar:  >890 Mbps throughput.
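For a sense of why the TCP window matters so much on a link like this, here is a rough back-of-the-envelope sketch. It assumes a 1 Gbps line rate and the ~22 ms round trip quoted above, and compares that against the classic 64 KiB TCP window (the ceiling when window scaling is not in play). The 64 KiB figure is an illustrative assumption, not a setting measured in this test, and the function names are made up for the example.

```python
def bandwidth_delay_product_bytes(link_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full for one round trip."""
    return link_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


LINK_BPS = 1_000_000_000   # GigE, assumed line rate
RTT_S = 0.022              # ~22 ms Bothell <-> Santa Clara, from the post
UNSCALED_WINDOW = 65_535   # assumption: largest TCP window without window scaling

bdp = bandwidth_delay_product_bytes(LINK_BPS, RTT_S)
ceiling = window_limited_throughput_bps(UNSCALED_WINDOW, RTT_S)

print(f"Window needed to fill the link: {bdp / 1_000_000:.2f} MB")
print(f"Ceiling with a 64 KiB window:   {ceiling / 1_000_000:.1f} Mbps")
```

Under these assumptions, a single connection needs a few megabytes in flight to fill the pipe, which is far more than an unscaled window allows.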

 

Now, TTCP pushes the limits of the stack, CPU, bus, network, etc., but that doesn't reflect the normal file transfers that happen as part of doing real work.  Since those file transfers create some of the more challenging scenarios for us, we put two new servers in WA and two in CA, all with GigE NICs.  Each data center has one W2K3 server and one Longhorn server.

 

From there we set up two Robocopy jobs to pull twenty 1 GB files from the servers in CA and drop them onto the servers in WA.  One job ran with W2K3 at each end and the other ran with Longhorn at each end.  All servers are the same HP DL385 dual-core machines with 16 GB RAM and GigE network uplinks.  Results:

 

Pull with W2K3 at both ends (CA and WA):  ~12 Mb/s (includes SMB and TCP/IP tweaks)

Pull with Longhorn at both ends:  >400 Mb/s (default config...no tweaks)

Pull of same 1 GB files between two Longhorn boxes on same VLAN:  502 Mb/s

 

So, I know, you're thinking, "But I don't move a bunch of 1 GB test files back and forth all day. I pull web logs from remote servers back to a central location for processing, and that takes a significant amount of time." We thought the same thing, so for a real-world sample of something we do regularly, we pulled a single hourly web log file (199 MB) from a www.microsoft.com server in CA back to a couple of servers in the WA data center.  The WWW server in CA is a W2K3 box with GigE, and we pulled the file across the wire to both a W2K3 server and a Longhorn server in WA.  For a good view into the future, we also put the file on a Longhorn server in CA and pulled it with the same Longhorn server in WA.  Results (in minutes:seconds, because when you get up to make a sandwich between file copies, this is how long you have):

 

Pull from W2K3 in CA to W2K3 in WA:  ~2:12

Pull from W2K3 in CA to Longhorn in WA:  ~0:12

Pull from Longhorn in CA to Longhorn in WA:  ~0:04 (not much sandwich time)
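
To put those times in the same units as the Robocopy numbers, the conversion can be sketched as below. It assumes the 199 MB log is 199 × 10^6 bytes and takes the approximate times at face value; the helper name is just for illustration.

```python
def effective_mbps(size_mb: float, minutes: int, seconds: int) -> float:
    """Average throughput in megabits per second for one file copy."""
    elapsed_s = minutes * 60 + seconds
    return size_mb * 8 / elapsed_s


LOG_SIZE_MB = 199  # hourly web log from the post, assuming 1 MB = 10**6 bytes

# (minutes, seconds) for each pull, as listed above
copies = {
    "W2K3 -> W2K3": (2, 12),
    "W2K3 -> Longhorn": (0, 12),
    "Longhorn -> Longhorn": (0, 4),
}

for label, (m, s) in copies.items():
    print(f"{label:22s} ~{effective_mbps(LOG_SIZE_MB, m, s):.0f} Mb/s")
```

The first and last figures land close to the ~12 Mb/s and ~400 Mb/s seen in the Robocopy runs, so the single-log test and the bulk test tell the same story.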

 

Currently we have 40 of the boxes that serve www.microsoft.com in the CA data center, which means half of our ~250 GB of log files per day is 20+ ms away.  Today, moving that 125+ GB at the ~12 Mb/s we get between W2K3 servers can take 83,333 seconds, which is close to a day.  This means we must be creative and make multiple pulls at the same time to move the data more quickly...or get really full eating a lot of sandwiches.  As we move to pulling this data with Longhorn, we can reduce that time to ~45 mins without being creative at all.
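
The arithmetic behind those two numbers can be sketched as follows, assuming 125 GB means 125 × 10^9 bytes and using the sustained rates from the Robocopy test (~12 Mb/s for W2K3, and 400 Mb/s as a floor for Longhorn). The function name is invented for this example.

```python
def transfer_seconds(size_gb: float, rate_mbps: float) -> float:
    """Seconds to move size_gb gigabytes at a sustained rate in megabits per second."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> Mb, decimal units assumed
    return megabits / rate_mbps


DAILY_REMOTE_LOGS_GB = 125  # half of ~250 GB per day, from the post

w2k3_s = transfer_seconds(DAILY_REMOTE_LOGS_GB, 12)
longhorn_s = transfer_seconds(DAILY_REMOTE_LOGS_GB, 400)

print(f"W2K3 at ~12 Mb/s:      {w2k3_s:,.0f} s ({w2k3_s / 3600:.1f} hours)")
print(f"Longhorn at 400 Mb/s:  {longhorn_s:,.0f} s ({longhorn_s / 60:.1f} minutes)")
```

This treats the whole day's logs as one serial pull at a steady rate, which is a simplification; parallel pulls or varying rates would change the totals.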

 

If you have a copy or two of Vista Beta 2 you can test out these changes with a medium to large file download from a server that is over 10ms away in terms of latency.  You should see a nice improvement.

 

What's next for our team:  With these gains in network utilization, there is a paradigm shift in how much network utilization amounts to network congestion.  Previously, with each client/server connection taking a relatively small portion of the available bandwidth over high-latency links, it was much easier to determine when network link utilization was becoming an issue.  Now, two servers can fill a 1 Gbps WAN link all by themselves without either of them experiencing congestion that would be of concern; however, that's not so easy to determine when looking at link utilization from the network side of things.  This means we need to partner closely with the Networking folks on how we measure and communicate congestion issues in the future.

 

For further information on the TCP/IP changes in Vista and Longhorn:  http://www.microsoft.com/technet/itsolutions/network/evaluate/new_network.mspx

 
