Feature
Farming, Linux-Style
Massive PC clusters allow genomic research
Summary
Incyte has cashed in on the virtues that Linux believers have been talking about forever, using the OS's stability and cost effectiveness to create massive, blindingly fast PC clusters. LinuxWorld spoke with the engineers who made it happen to find out how they did it, and why. (1,300 words)
By Katrina Glerum
Gone are the days when any pioneer with a bit of hardware, hard code, and hard work could run a small Linux farm and compete with the best plantations. The smart folks at biotech firm Incyte Genomics of Palo Alto, Calif., have just invented agribusiness. You remember everything you ever tried to tell your boss or colleagues about Linux's stability, price performance, and reliability? Well, Incyte has put those ideas to the test and come up grinning like a bandit.
To map the human genome, Incyte runs the world's largest commercial Linux farm, with more than 2,000 Linux processors chomping away on tens of millions of jobs per day. In its datacenter, laid out like a temple in the middle of Incyte's corporate headquarters, space costs a king's ransom -- but the company has come up with clever ways to address that problem, as we shall see.
Power hungry
Incyte is a genomic information company, which means it is essentially a technology company that sells pharmaceutical companies the capacity to work on gene sequencing. It sells access to what it claims is the world's largest genomic database, along with the technology to process that information. That takes a lot of computing power, and Incyte is prepared to spare no expense.
Phil Kwan, the director of network operations, claims that Incyte's multigigabyte backbone is the fastest private network in Silicon Valley. It links a beautiful row of SGI Origins and a galaxy of Suns, including a massive Sun E10K. Incyte even has two 6,970-processor Paracels, which cost about $360,000 a pop, even though the processors are very stupid, really only good for comparing one itty-bitty string to another itty-bitty string. Not surprisingly, they're used to search Incyte's databases for genetic sequences.
However, despite being willing to put out cash for technology, Incyte found that it had a serious CPU crunch. To do a lot of tiny repetitive actions on tiny repetitive nucleotides (DNA bits), Incyte needed a lot of processors. The company calculated that some of its projects would require five and a half years to complete on a single processor. Incyte would do anything to cut that down, given that the machines it was previously running these projects on, 4-processor DEC Alpha 4100s with five 18-GB disks and a GB of RAM, cost $140,000 each. Even with parallel processing, the sheer cost of hardware was putting some projects out of reach. Besides, the problem wasn't access speed, RAM, or disk space as much as sheer throughput.
PCs are pretty cheap, though. A dual-processor Pentium with an 8-GB IDE disk, half a GB of RAM, and a few custom space-saving features costs only $5,000 or so. And even better, it can run Linux.
Cultivating Linux
About two years ago, Incyte's director of bioinformatics, Stuart Jackson, and an engineer, Steve Barry, started playing with the problem. They set up a 20-machine development cluster of single-processor Pentium II 450s, and benchmarked it against their DEC Alpha 4100s. "This was a real skunkworks project," Jackson laughs today. "They were all in my office!"
The results were inspired, though. It turned out they needed only five Pentium IIs to match those four DEC Alpha processors. Less than a tenth the cost for the same power? That's a sweet upgrade, but could they scale it? It turned out they could do that too. Incyte is naturally a bit close-mouthed about how their farms work, but LinuxWorld was able to uncover the rough outlines.
Today, Incyte throws up a stack of PCs: dual-processor Pentium III 550s running Red Hat Linux 6.0, connected via a Foundry Fibre Channel switch. The biggest farm they have is 200 processors, but since they run it on a class C subnet, they could conceivably scale it to 254.
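Where does 254 come from? A class C network has 256 addresses, and two of them (the network address and the broadcast address) can't be assigned to a machine. A quick sketch with Python's standard `ipaddress` module shows the arithmetic; the network address below is a made-up private range, since Incyte's real addressing isn't public.

```python
import ipaddress

# Hypothetical private class C-sized (/24) network; the article does not
# give the farm's actual address range.
farm_net = ipaddress.ip_network("192.168.10.0/24")

# hosts() leaves out the network and broadcast addresses.
usable_hosts = sum(1 for _ in farm_net.hosts())

print(farm_net.num_addresses)  # 256 addresses in total
print(usable_hosts)            # 254 that can be assigned to machines
```

Note that the ceiling is on addressable machines on that one subnet.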
Here's how it works: A 4-processor DEC Alpha 4100 is assigned a project. It grabs information from the Oracle database, thinks about it, and starts running jobs. Called the "feeder," it feeds jobs to one member of the farm that is the designated "broker." The broker farms jobs out to all the other boxes in the cluster, which are essentially slaves. The trick is to make the DEC Alpha think that it's talking to its own processor, when in fact it's talking to the broker.
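Incyte isn't sharing its code, but the feeder-broker-worker pattern described above is easy to sketch. The toy version below runs on one machine, with threads standing in for farm boxes and string comparison standing in for the real sequence work. Every name in it (`worker`, `broker`, `feeder_jobs`) is invented for illustration, and none of it reflects how Incyte actually makes the Alpha believe it is talking to its own processor.

```python
import queue
import threading


def worker(jobs, results):
    """A farm box: pull a job, do it, report the result, repeat."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel from the broker: no more work
            break
        seq_a, seq_b = job
        # Stand-in for the real work: count positions where two strings agree.
        score = sum(a == b for a, b in zip(seq_a, seq_b))
        results.put((job, score))


def broker(jobs_from_feeder, n_workers=4):
    """Farm the feeder's jobs out to the workers and collect the answers."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for job in jobs_from_feeder:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one stop signal per worker
    for t in threads:
        t.join()

    collected = {}
    while not results.empty():
        job, score = results.get()
        collected[job] = score
    return collected


# The "feeder" side: a list of small, independent comparison jobs.
feeder_jobs = [("GATTACA", "GATTTCA"), ("ACGT", "ACGA"), ("TTTT", "AAAA")]
scores = broker(feeder_jobs)
for job in feeder_jobs:
    print(job, scores[job])
```

The point of the pattern is that the feeder only ever talks to one party. Because the jobs are tiny and independent, adding workers behind the broker raises throughput without the feeder needing to know how many boxes are in the farm.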
Meanwhile, each farm has another machine dedicated to running system administration tasks. In Incyte's world, all these machines start with G and end with O (Groucho, Gaspacho, Gringo, etc). They all have a master sync machine (named Harpo), which makes sure they're all alive and keeps them updated. A change at Harpo can propagate out to every machine in every farm in about two hours.
Incyte chose Red Hat mostly because of the latter's market position two years ago when this project began, and because of a neat little program called KickStart, which came with Red Hat. (See Resources for a previous LinuxWorld article on KickStart.) The modified KickStart that Incyte uses allows the company to rack in a new PC, insert a boot floppy, turn it on, and sit back while the sys admin machine (Groucho) finds the new PC, syncs it up to the farm, and reports its name back to the administrator. It takes only one human system administrator to run all of those machines. Try that with NT.
Launching an agribusiness
Incyte started its first production farm a year ago, a 100-box, single-processor cluster of beige minitower cases. Today it looks wimpy next to the gorgeous black stacks of newly racked PCs, but it's still running Incyte's flagship product, the LifeSeq database. To conserve space, though, the folks at Incyte are now installing custom PCs that come with two dual-processor motherboards each. They can install 126 processors in about 10 square feet (roughly 1 square meter). It's no wonder they started calling their Linux farms supercomputers.
In all this, Linux has been performing like a dream. Sure, PC hardware isn't as reliable as a Sun, but the OS is rock solid for the mission-critical tasks that Incyte is running 24-7. The more interesting problems have come up on the programming side. John O'Neill, Incyte's ace bioinformatics programmer, has had to learn how to efficiently transport the right-size pieces of the database around, for instance. But he can port an application written for Solaris over to Linux in about two days.
The upshot is that Incyte can do jobs that would have been absolutely financially unthinkable before. The company now has about 20 farms with up to 200 processors each. Each farm behaves like a supercomputer, at about one-hundredth of the price -- or less. These farms can do those five-and-a-half-year gene sequencing projects in six weeks, which is faster than anyone else in the world. And if you've been following the news, you know it's pretty much a flat-out race right now to sequence and patent genes. Incyte's biggest competitor, Celera, recently crowed that at 298 processors, it had the biggest clustered network outside the Department of Defense. Stu Jackson just shakes his head. Incyte has 2,000 machines in its clustered network this week, and will have perhaps another 500 by the end of March. Slapping in a new farm is as easy as, say, getting a new client.
About the author
Katrina Glerum is a freelance journalist, strategic consultant, and ecommerce entrepreneur in San Francisco.