Today, Apple introduced Xgrid. What is Xgrid and why should one care? This article describes my findings on Xgrid. Everything is available in the documentation or somewhere on the web, but this article presents a quick overview.
Xgrid is targeted at computations that take a very long time (several hours). Typical applications that benefit are Monte Carlo calculations, 3D rendering, and other calculations that can be broken into several sub-tasks that don't affect each other. Apple provides a few examples. The most obvious one is Mandelbrot: the calculation of the Mandelbrot map at a given point does not depend on the result at another point. Hence, one can split the whole map into sub-maps and ask the agent computers to each perform their part of the calculation.
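To make the "split the map into sub-maps" idea concrete, here is a minimal sketch in plain Python. It has nothing to do with Apple's actual Mandelbrot plug-in: the function names (`escape_time`, `split_rows`, `compute_band`, `compute_map`), the region of the complex plane, and the iteration limit are all my own assumptions for illustration. The point is only that each band of rows can be computed without knowing anything about the others.

```python
def escape_time(c, max_iter=50):
    """Number of iterations before z = z*z + c leaves the disk of radius 2."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iter


def split_rows(height, n_tasks):
    """Split rows 0..height-1 into n_tasks contiguous (start, stop) bands."""
    step, extra = divmod(height, n_tasks)
    bands, start = [], 0
    for i in range(n_tasks):
        stop = start + step + (1 if i < extra else 0)
        bands.append((start, stop))
        start = stop
    return bands


def compute_band(band, width, height, max_iter=50):
    """Compute one sub-map. Needs no result from any other band."""
    start, stop = band
    rows = []
    for y in range(start, stop):
        im = -1.0 + 2.0 * y / (height - 1)      # imaginary axis: [-1, 1] (assumed)
        row = []
        for x in range(width):
            re = -2.0 + 3.0 * x / (width - 1)   # real axis: [-2, 1] (assumed)
            row.append(escape_time(complex(re, im), max_iter))
        rows.append(row)
    return rows


def compute_map(width, height, n_tasks):
    """Compute the whole map band by band and stitch the bands together."""
    bands = split_rows(height, n_tasks)
    # Each call below is independent, so each could run on a different agent.
    parts = [compute_band(band, width, height) for band in bands]
    return [row for part in parts for row in part]
```

Here the bands are computed one after the other on a single machine. In a grid setting, each `compute_band` call would be the unit of work handed to an agent, and the stitching in `compute_map` is what the machine that started the job would do with the results.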
Xgrid does not perform the calculation. Actually, Xgrid does not know squat about math or science. Even worse (or better?), Xgrid does not even know you are trying to "compute" something. Xgrid provides the basic infrastructure so that one computer can talk to several others, run a command and get the result. That's it. It is based on BEEP (Blocks Extensible Exchange Protocol, RFC 3080), a fairly recent general-purpose framework for building network protocols. You can get very good information on it here and there, but I will come back to it later. BEEP is the plumbing to do the talking.
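The "run a command and get the result" part can be pictured with a few lines of Python. This is a sketch of the idea only, run entirely on the local machine: `run_on_agent` is a hypothetical name of mine, there is no network or BEEP involved, and this is not how Xgrid itself is implemented.

```python
import subprocess
import sys


def run_on_agent(argv, timeout=60):
    """Stand-in for an agent: run a command and hand back what it produced.

    The agent does not interpret the output. It only reports the exit code
    and whatever the command wrote to stdout and stderr.
    """
    done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "exit_code": done.returncode,
        "stdout": done.stdout,
        "stderr": done.stderr,
    }


# The "job" is just a command line; here a one-line Python program.
result = run_on_agent([sys.executable, "-c", "print(6 * 7)"])
```

Note that nothing in `run_on_agent` knows that the command computes anything. That is the sense in which Xgrid does not know about math: the meaning of the output is entirely up to whoever submitted the job.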
The Shell program deserves particular attention because its source code is also provided. It runs any command that is available on the agent's machine. The real question I have is this: for the Shell program or the Mandelbrot program, does the agent run its local copy (which it finds in /Library/Xgrid/Plug-ins/Mandelbrot.xgplug/ for instance), or does it receive a copy over the connection from the controller? I suspect it is the former, which would make everything less useful than it appears: you would need a local copy of your program installed on every machine you want to run it on. Hence, if you have written some scientific program, you would need to find the agents, copy the files onto them, and always make sure they have the proper version. That in itself would defeat the purpose of Rendezvous: you might not even know where the agents are, and you most likely don't have access to them anyway, let alone administrator access. [Note, Jan 10th: However, the custom plug-in allows one to set an arbitrary program name and a working directory which may even contain files. Upon completion, the directory is copied back to the "Destination directory". More on that in another article.]
The source code provided by Apple (the Shell program) does not give enough information to get to the guts of Xgrid: one must derive a class from XgJobViewController and override a few functions, and we don't have the code for that class. Hence, the details of the Xgrid protocol are kind of hidden, which makes me scratch my head more than I should. And this brings me to the last section.
14 What about other software clustering technologies (MPI)?
• Xgrid is not a replacement for MPI. MPI is an API that enables programmers to write portable parallel applications, whereas Xgrid is a suite of applications and daemons which enables scientists to run distributed computations using a simple Mac OS X application.
• An Xgrid plug-in could be written and used as a replacement for programs such as mpirun, which coordinate the start and stop of MPI applications on a cluster of computers. However, no such plug-in is included with this release of Xgrid.
10 Can I use Xgrid with other UNIX-based computers?
• The short answer is no.
• The long answer is that Xgrid uses an XML property list protocol built on top of BEEP for all of its inter-computer communication and coordination, and because these protocols are open, it is possible a client, agent, or controller could be written to run on other UNIX-based computers and interoperate with Xgrid. However, no such programs have been written.
(Bold passages by me). MPI (Message Passing Interface) is the standard for parallel computation, at least in academia. It allows you to easily split a computation into sub-tasks and execute them on other computers that you specify manually in some configuration file or on the command line. How MPI talks to the other nodes is irrelevant: it just does, and one should not care. However, MPI provides facilities to collect all the results of a calculation and "sum" them, which is something that Xgrid does not provide. Xgrid provides the piping and finds the agents to perform a task, but that's it. What I don't understand is how one can take current MPI programs (with all the convenient functions for "summing" results) and use them in Xgrid. Apple alludes to the fact that they at least thought of it (I suspect they even have some kind of solution), but I just don't understand, since MPI has its own communication scheme. What do we need here? Some kind of xGridMPI? I am not sure.
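To show what I mean by "summing" results, here is a minimal sketch in plain Python of a Monte Carlo estimate of pi split into independent sub-tasks. It does not use MPI or Xgrid: the names `count_hits` and `estimate_pi` are my own, and the final `sum` only mimics the kind of reduction that MPI offers with `MPI_Reduce` and `MPI_SUM`. With Xgrid alone, that last step is what you would have to write yourself once the agents have returned their partial counts.

```python
import random


def count_hits(seed, n_samples):
    """One sub-task: count random points that fall inside the unit quarter circle."""
    rng = random.Random(seed)  # private generator, so sub-tasks don't affect each other
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks, n_samples):
    """Run the sub-tasks, then combine the partial counts into one estimate."""
    # Each count_hits call is independent and could run on a different machine.
    partial = [count_hits(seed, n_samples) for seed in range(n_tasks)]
    # The reduction step: collect every partial result and sum them.
    total = sum(partial)
    return 4.0 * total / (n_tasks * n_samples)
```

Calling `estimate_pi(8, 20000)` runs eight sub-tasks of 20000 samples each. The sub-tasks never talk to each other, which is the kind of problem Xgrid is suited for. Only the final sum needs all the results in one place.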
But really, what I do know for sure is this: although some of us are lucky enough to have an OS X machine on the desk at work, most people around us don't. Moreover, the really powerful machines for calculations in universities are Unix-based, and they aren't running OS X. Hence, it is critical that the protocol that Xgrid implements (what the controller asks the agents to do, and how) be made public so that Xgrid agents can be programmed for Linux, SunOS, IRIX, etc. Since BEEP has been implemented on tons of architectures (see http://www.beepcore.org/), the base plumbing is there for a brave soul to implement the Xgrid client, agent and controller (and Rendezvous) on their machine of choice. Mac OS X will be the best machine from which to initiate the calculation, but as long as Xgrid does not interact with other architectures, its adoption in academia will be quite limited. We don't all have 1100 G5s in our labs.
Xgrid looks good and removes a lot of complexity in managing parallel computations, but how one tailors it to suit one's needs is not clear to me. If one has to recreate the functionality of MPI, then I don't see the gain in using Xgrid (so far), considering the time investment. Moreover, how Xgrid differs from Pooch is also unclear. [Added Jan 8th: Actually Dauger has a FAQ about the difference between Xgrid and Pooch. This is it: Pooch does MPI, Xgrid does not. The discussion above is correct.]
The second part of this article is available here.