Rsync on Steroids

Posted 26 Apr 2008 at 16:33 UTC (updated 26 Apr 2008 at 22:06 UTC) by lkcl

Rsync is an incredibly powerful tool that synchronises anything from a single file to an entire hierarchical filesystem, over a network. Unlike many other synchronisation methods, rsync will use the outdated copy of a file to save on network traffic (resulting in anything up to 99% optimisation).
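To make the "use the outdated copy" idea concrete, here is a deliberately simplified sketch in Python. It is not rsync's actual implementation: real rsync pairs a cheap rolling checksum with a strong checksum and negotiates the block size, whereas this toy hashes every window directly with SHA-256 and uses a 4-byte block so that the result is easy to read. All function names are made up for illustration.

```python
import hashlib

BLOCK = 4  # toy block size; assumption for readability only


def signatures(old: bytes) -> dict:
    """Map block digest -> block index for each full block of the old copy."""
    sigs = {}
    for index in range(len(old) // BLOCK):
        block = old[index * BLOCK:(index + 1) * BLOCK]
        sigs.setdefault(hashlib.sha256(block).digest(), index)
    return sigs


def make_delta(sigs: dict, new: bytes) -> list:
    """Describe `new` as copies of old blocks plus literal bytes."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + BLOCK]
        index = None
        if len(window) == BLOCK:
            index = sigs.get(hashlib.sha256(window).digest())
        if index is None:
            # No old block matches here: send one literal byte and slide on.
            literal.append(new[pos])
            pos += 1
        else:
            if literal:
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", index))
            pos += BLOCK
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new file from the old copy and the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value * BLOCK:(value + 1) * BLOCK]
        else:
            out += value
    return bytes(out)
```

Only the "data" entries carry file content over the wire; the "copy" entries are just block numbers, which is where the saving comes from when the two copies are mostly alike.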

The rsync implementation, however, is restricted to POSIX systems (such as Linux, Cygwin and *BSD) and, worse, can only perform operations on POSIX-based filesystems. This seems somewhat puzzling, and, as part of the continued Tech Fusion series, this article will outline some of the amazingly powerful things that could be done with rsync... if it had a VFS layer.

Rsync (the application) performs directory-by-directory and file-by-file synchronisation of a filesystem hierarchy - a POSIX-compliant filesystem hierarchy. Recent modifications to rsync already show some of the limitations of the current approach: storage of userid information in extended attributes when rsync is running as a daemon has just been added! The reason is that rsync as a daemon cannot be run as root, and so, when attempting to synchronise file permissions and userid attributes (and thus maintain filesystem integrity when performing backups), the previous version of rsync simply threw that information away. As a hack, the information is now stored in "extended attributes" - if ext3 or another filesystem that supports them is used - for later retrieval on a restore / recovery.

How much better would it be if rsync had a VFS plugin layer, such that the storage of userid information and other attributes could be put into an alternative database, of which "storage in extended attributes" was just one example? Wouldn't it be nice to be able to store that information in a format that was compatible with backuppc?

Or - how about storing an entire filesystem into a Tar ball? TAR (Tape Archives) have supported userid attributes, last modified dates and permissions for decades. Heck, while we're at it - what else is a "hierarchical storage" mechanism in the I.T. world? NTFS and HPFS; XML files and HTML files; Structured Storage and Streams; GVFS and KDE's KIO VFS plugins; FUSE and other user-space file systems; heck, even wget could be back-ended into an Rsync plugin at one end: in combination with a TAR plugin at the other end you could make regular compressed backups of web sites (ok - smart readers will have noticed that the last is stretching things a bit, but wait - there is rproxy! oh darn. hmm... even smarter readers will have noted that U.S. patents are only valid in the U.S. but frequently any patent usually results in a piece of software development being stopped, dead. we neeeed to do something about this, even if it means putting a notice on rproxy that it must not be distributed in binary form to the United States, until Software Patents are neutralised. but anyway - sorry for the interruption!).
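As a small illustration of the point that tar archives already carry ownership, timestamps and permissions, here is a minimal sketch using Python's standard tarfile module. The path, owner ids and timestamp are made-up values, and nothing here is part of rsync; it only shows the kind of metadata a tar back-end would have available.

```python
import io
import tarfile

MEMBER = "home/alice/notes.txt"  # hypothetical path, for illustration only


def make_archive() -> bytes:
    """Build an in-memory tar archive holding one file plus its metadata."""
    payload = b"hello from a backup\n"
    info = tarfile.TarInfo(name=MEMBER)
    info.size = len(payload)
    info.uid, info.gid = 1000, 1000            # made-up owner ids
    info.uname, info.gname = "alice", "users"  # made-up owner names
    info.mode = 0o640                          # rw-r-----
    info.mtime = 1209227580                    # an arbitrary timestamp
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_metadata(data: bytes) -> dict:
    """Read the stored attributes back without touching any real filesystem."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as archive:
        member = archive.getmember(MEMBER)
        return {
            "uid": member.uid,
            "gid": member.gid,
            "owner": member.uname,
            "mode": member.mode,
            "mtime": member.mtime,
        }
```

Note that the writer never needs to be root: the uid and gid are simply recorded in the archive, which is the same job the extended-attributes hack does.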

What else is "hierarchical"? IMAP (and to some extent POP3) mailstores. How about going actually into the mail messages themselves, unpacking attachments, then looking across the entire mailbox for similar attachments, and performing a pseudo-sync of the "old" version of the attachment and the "new" one? How about doing the same thing across filesystems themselves?

How about the idea of optimising rsync on the server, by storing the (expensive-to-calculate) MD4 block checksums in a database? One of the reasons why rsync is not that widely deployed (Debian mirror sites often do not run rsync) is the amount of checksumming that's carried out each time a file is synchronised. However, suppose you can guarantee filesystem integrity because the entire filesystem is stored not in a POSIX-compliant filesystem but actually in a SQL database, along with the MD4 checksums, with the files split up into "blocks" rather than each stored as one contiguous binary blob. Then you've immediately got not only a method for optimising file storage space (if blocks occur more than once across many files or even the same file) but you've also saved yourself a great deal of CPU time by not having to recalculate the MD4 checksums.
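A minimal sketch of what such a block store could look like, using Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the table layout, the function names, the tiny 4-byte block size, and the use of SHA-256 in place of MD4 (MD4 is not guaranteed to be available in Python's hashlib). Real rsync also chooses its block size per transfer, which a stored-checksum scheme would have to account for.

```python
import hashlib
import sqlite3

BLOCK_SIZE = 4  # tiny so the example is easy to follow; assumption only


def open_store() -> sqlite3.Connection:
    """Create an in-memory store: unique blocks plus a per-file block list."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE blocks (digest TEXT PRIMARY KEY, data BLOB NOT NULL)")
    db.execute(
        "CREATE TABLE file_blocks ("
        " path TEXT NOT NULL, seq INTEGER NOT NULL, digest TEXT NOT NULL,"
        " PRIMARY KEY (path, seq))"
    )
    return db


def put_file(db: sqlite3.Connection, path: str, content: bytes) -> None:
    """Split content into blocks; each distinct block is stored only once."""
    db.execute("DELETE FROM file_blocks WHERE path = ?", (path,))
    for seq, start in enumerate(range(0, len(content), BLOCK_SIZE)):
        block = content[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()  # stand-in for MD4
        db.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?)", (digest, block))
        db.execute("INSERT INTO file_blocks VALUES (?, ?, ?)", (path, seq, digest))
    db.commit()


def get_file(db: sqlite3.Connection, path: str) -> bytes:
    """Reassemble a file from its blocks."""
    rows = db.execute(
        "SELECT b.data FROM file_blocks f JOIN blocks b ON b.digest = f.digest"
        " WHERE f.path = ? ORDER BY f.seq",
        (path,),
    )
    return b"".join(row[0] for row in rows)


def stored_checksums(db: sqlite3.Connection, path: str) -> list:
    """Return the block checksums for a file without re-reading its data."""
    rows = db.execute(
        "SELECT digest FROM file_blocks WHERE path = ? ORDER BY seq", (path,)
    )
    return [row[0] for row in rows]
```

The checksum lookup in stored_checksums is a plain index read, and a block shared between files (or repeated within one) occupies a single row in the blocks table.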

How about storing a hierarchical file system in GIT? (yes - i noticed that GIT itself can use rsync for synchronisation - but I'm talking about rsync using GIT for file storage!).

The list of possibilities is just incredible.

My favourite has to be an IMAP plugin, though, because then finally you can keep as many "offline" copies as you want of your mailbox synchronised with the "online" copy. This is one of the things that Exchange has which has no equivalent offering from Free Software projects (that i know of). In Exchange, synchronisation is a dog, causing immense aggravation to users. An rsync IMAP plugin would allow users to install an imap daemon on their own local system - a desktop or even a PDA - which then automatically synchronised email in a highly efficient manner.

Likewise, even sending of email - rsync with an SMTP plugin - could perform "synchronisation" over to a server before sending it out over the Internet. Close integration between the IMAP plugin and the SMTP plugin could result in massive savings of network traffic, which would be very handy on GSM/GPRS connections, based on analysis performed by the plugins, looking at file attachments that had already been transferred, or modified only by tiny amounts, and transferring only the differences rather than the whole email message.

(whilst we're at it - this of course hints at the possibility of doing away with SMTP altogether, especially with a peer-to-peer distributed IMAP server. Think about this: when you "send" an email, where is a copy first stored? in your IMAP "sent mail" folder. So why send it via SMTP at all? why not drop a DHT-based "notification" message into the peer-to-peer infrastructure for your recipient to pick up (with the hash of their email address as the 'key' of course), providing sufficient information and privileges such that they can "authenticate" against your IMAP server or its online version, and access your "sent mail" folder directly? using rsync-IMAP of course :) wouldn't have it any other way. The advantage of this approach is that the problem of SPAM almost entirely disappears, as you are using an authenticated "pull" mechanism, not the "push" mechanism that is SMTP. Further enhancements are to have a hash of both the sender's and the recipient's email addresses concatenated; for the recipient to perform regular "polling" of all known senders; for a new sender to "request authorisation" to send email, just as has been done in every single popular IM system ever invented. Actually, an even better enhancement would be to negotiate a random hash for use by each sender-recipient combination, with the hashes generated at "communication acceptance" time, aka "buddy authorisation" (yukk, hate that phrase).)
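The keying scheme in that aside can be sketched in a few lines of Python. This is only a toy under stated assumptions: a plain dictionary stands in for a real DHT, SHA-256 is an arbitrary choice of hash, lower-casing the addresses is a simplification, and every name and address below is hypothetical. No authentication is modelled.

```python
import hashlib


def mailbox_key(recipient: str) -> str:
    """DHT key derived from the recipient's address alone."""
    return hashlib.sha256(recipient.strip().lower().encode("utf-8")).hexdigest()


def pair_key(sender: str, recipient: str) -> str:
    """DHT key derived from sender and recipient addresses concatenated."""
    joined = sender.strip().lower() + "\n" + recipient.strip().lower()
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class ToyDHT:
    """A plain dict standing in for a real distributed hash table."""

    def __init__(self):
        self._slots = {}

    def put(self, key: str, value: dict) -> None:
        self._slots.setdefault(key, []).append(value)

    def get(self, key: str) -> list:
        return list(self._slots.get(key, []))


def notify(dht: ToyDHT, sender: str, recipient: str, message_id: str) -> None:
    """Leave a 'come and fetch it' note instead of pushing the mail itself."""
    note = {"from": sender, "folder": "Sent", "message_id": message_id}
    dht.put(pair_key(sender, recipient), note)


def poll(dht: ToyDHT, recipient: str, known_senders: list) -> list:
    """Recipient checks the slot of every sender it has already authorised."""
    found = []
    for sender in known_senders:
        found.extend(dht.get(pair_key(sender, recipient)))
    return found
```

Because the recipient only polls slots belonging to senders it already knows, a note dropped by an unknown sender is never looked at, which is the anti-spam property the paragraph describes.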

My second favourite idea is the one where XML documents are treated as "filesystems", which doesn't sound like such a big deal until you recall that ODF is an XML standard. Thus, the possibility exists to use rsync with a double-VFS-plugin (on input as well as output) to perform real-time peer-to-peer document editing (just like "Google Docs"). Whilst I realise that it is a non-trivial task to make any editor (whether it be Inkscape or Koffice or any other) report and recognise XML fragments as "modified" and "synchronised", at least a convenient and efficient method would exist to perform the document synchronisation, alleviating the need for the developers of each of the editor projects to reinvent that wheel.
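To show what "XML as a filesystem" might mean in practice, here is a rough Python sketch that gives every element a path, much like a file in a directory tree, and compares two versions of a document by digest. The path notation, the function names and the sample XML are all invented for this example; it uses toy XML rather than real ODF and says nothing about how an rsync plugin would actually be structured.

```python
import hashlib
import xml.etree.ElementTree as ET


def walk(element, prefix=""):
    """Yield (path, digest) for every descendant, like files in a directory tree."""
    counts = {}
    for child in element:
        index = counts.get(child.tag, 0)
        counts[child.tag] = index + 1
        path = f"{prefix}/{child.tag}[{index}]"
        # The digest covers the whole subtree, so a changed leaf also marks
        # each of its ancestors as changed, as with a directory listing.
        yield path, hashlib.sha256(ET.tostring(child)).hexdigest()
        yield from walk(child, path)


def changed_paths(old_xml: str, new_xml: str) -> list:
    """Paths that are new or whose content differs between two versions."""
    old = dict(walk(ET.fromstring(old_xml)))
    new = dict(walk(ET.fromstring(new_xml)))
    return sorted(path for path in new if old.get(path) != new[path])
```

Only the fragments listed by changed_paths would need to be transferred; the editor would still have to know how to merge them, which is the non-trivial part mentioned above.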

I just know that there are more things that could be done, such as making the file-selection method part of the plugin architecture (options such as --exclude, --include, --cvs-exclude and --one-file-system), where instead of having these options you would have a much more suitable set per-plugin. There must be far more uses for rsync, with a VFS plugin layer, than I've been able to describe and hint at, here.

I hope that you enjoy looking for more such creative possibilities.

blackberry blues, posted 26 Apr 2008 at 16:43 UTC by lkcl » (Master)

ok - own up. which of you sad software donuts has a blackberry? :) i know there's at least one of you out there! wouldn't you far rather have a FreeRunner that ran auto-sync'd email server software?

p.p.s. on patents, posted 26 Apr 2008 at 16:46 UTC by lkcl » (Master)

remember: all patent law states that an "inventor" has the right to create a single instance of a patent for "personal use" such that they can experiment and create "new inventions". that right is enshrined into patent law, world-wide.

it just so happens that "downloading and compiling software" dove-tails nicely with this :)

so, if there's a problem with a single component being patented, heck - make it software-only distribution and provide a compile-up option on the user's device ha ha.

bye byee patent lawsuit... :)

distributed p2p rsync, posted 26 Apr 2008 at 18:07 UTC by lkcl » (Master)

MD4 checksums are used as keys into a DHT p2p store. backups would be not only distributed but also "merged", only one copy of each block need be stored (indexed by MD4 checksum). idea contributed by phil - thanks!

IM2000, posted 26 Apr 2008 at 18:10 UTC by lkcl » (Master)

IM2000 apparently has some of the features described.

qmail-src (patent-buster), posted 26 Apr 2008 at 18:15 UTC by lkcl » (Master)

qmail-src is a wrapper debian package which takes a tarball and compiles it, producing a package, which can then be installed....

Luke, songs to cool you down a bit, posted 27 Apr 2008 at 16:35 UTC by badvogato » (Master)

Feelings, nothing more than feelings

Tears - Donde Voy

Ne me quitte pas
Ne me quitte pas
Ne me quitte pas...

Who knows where the time goes?

Across the morning sky
All the birds are leaving.
How can they know
that it's time to go?
Before the winter fire
I'll still be dreaming.
I do not count the time
Who knows where the time goes...
who knows where the time goes?

Sad deserted shore Your fickle friends are leaving Ah, then you know... that it's time for them to go, But I will still be here. I have no thought of leaving. For I do not count the time. Who knows where the time goes... who knows where the time goes?

But I am not alone... As long as my love is near me. And I know it will be so... 'Til it's time to go. All through the winter... Until the birds begin to return in spring I do not fear time Who knows where the time goes... who knows where the time goes ?

Offline IMAP, posted 27 Apr 2008 at 16:40 UTC by lkcl » (Master)

Offline IMAP

OfflineIMAP is a tool to simplify your e-mail reading. With OfflineIMAP, you can read the same mailbox from multiple computers. You get a current copy of your messages on each computer, and changes you make one place will be visible on all other systems. For instance, you can delete a message on your home computer, and it will appear deleted on your work computer as well. OfflineIMAP is also useful if you want to use a mail reader that does not have IMAP support, has poor IMAP support, or does not provide disconnected operation.

another idea!, posted 29 Apr 2008 at 02:04 UTC by lkcl » (Master)

good grief, these don't stop :)

how about an rsync plugin that does an "in-memory" synchronisation of a hierarchical data structure? :)

for example, in Koffice, OpenOffice or InkScape or other editor, you would do a memory-to-memory synchronisation of the actual in-memory data structure that the word processor or editor is using (!)

it is of course essential to have "locking" of memory areas as part of the rsync VFS layer - though to be honest i am not sure whether rsync _has_ "file locking" which would map to "data structure locking"; i've not checked.

Re: another idea!, posted 8 Jul 2008 at 10:54 UTC by sehe » (Apprentice)

lkcl, i think it is quite easy to prove that the performance will *always* be around a factor 2 slower than just directly copying it because the algorithm requires full read of both copies to determine the delta in the first place. So that pretty much blows away any benefits over just copying.

Now, if

1) you have memory that has very asymmetric read/write timings in favour of reading, that might still be useful (I don't know of any memory technology that has that property)

2) if you actually do not wish to reduce memory manipulation but memory usage by storing deltas only (not full copies) and construct the 'revisions' on the fly, this may be worth some

At 2) I believe it is common in the kind of application you mentioned that there is already a Command pattern in place to cater for undo. This Command history (tree?!) can be used as a de-facto delta-tree to reduce the memory consumption. I suppose this is exactly what happens (I suppose the Command-delta-tree could be viewed as a transaction journal against the 'committed' copy of your in-memory data structure).
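A small Python sketch of that reading of the Command pattern: one committed copy plus a journal of commands, with any revision rebuilt on the fly by replaying the journal. The class names are hypothetical and the history is a simple list rather than a tree; no particular editor is claimed to work this way.

```python
from dataclasses import dataclass, field


@dataclass
class InsertText:
    position: int
    text: str

    def apply(self, doc: str) -> str:
        return doc[:self.position] + self.text + doc[self.position:]


@dataclass
class DeleteText:
    position: int
    length: int

    def apply(self, doc: str) -> str:
        return doc[:self.position] + doc[self.position + self.length:]


@dataclass
class Document:
    committed: str                               # the only full copy kept
    journal: list = field(default_factory=list)  # commands since the commit

    def execute(self, command) -> None:
        """Record a command; the committed copy itself is not modified."""
        self.journal.append(command)

    def current(self, upto=None) -> str:
        """Rebuild a revision by replaying the first `upto` commands (all if None)."""
        doc = self.committed
        for command in self.journal[:upto]:
            doc = command.apply(doc)
        return doc

    def undo(self) -> None:
        """Undo is just dropping the newest journal entry."""
        if self.journal:
            self.journal.pop()
```

The journal is also exactly what a peer would need in order to catch up, so in this model "synchronising" two in-memory documents means shipping commands rather than comparing memory.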

A guide for the perplexed, posted 20 Oct 2008 at 07:53 UTC by chalst » (Master)

It occurs to me that Google will send some people to this article who will find lkcl's ideas fantastic, but actually were looking for advice on how to get rsync to do what they want at all.

Andrej Bauer has written just such a guide, Remote Backup with Secure Shell and Rsync. It does not explain rsync's huge array of shell options; instead it introduces a sh script written by John Langford, famous for his work on applying dimensional analysis to data mining, and describes good practice for using it. Once you've got your own rsync setup working, you'll be able to appreciate the discussion here of VFS layers for rsync and rsync-alikes at so much deeper a level...
