    Saturday, October 24, 2009

    The loss of ZFS

    Well, in case you haven't read any of the myriad stories about it, it appears that Apple has decided not to use ZFS on Mac OS X. Gruber has sources that say it was primarily licensing concerns, which is consistent with what people have implied to me, both recently and around WWDC (although at that time I think there was probably still hope of resolving the issues).

    Now, some people may jump in and comment that it couldn't be licensing issues, since ZFS is open source (under the CDDL), and that Apple already uses CDDL software (DTrace). That may be true, but in deals involving large companies there is often more to it than that. Apple may have wanted guarantees of indemnification in the NetApp lawsuit. Maybe it wanted guarantees that certain modifications it wanted to make would be accepted upstream, or even to get Sun to make certain changes. It also might have wanted additional distribution rights that were not granted under the CDDL. It is typical for companies to negotiate custom agreements in such cases (and for some money to change hands), so the idea that licensing issues are why it fell through is entirely reasonable, even though it is an open source product. Obviously Sun's steady decline in the marketplace and the uncertainty caused by the Oracle acquisition may have greatly complicated any such negotiations.

    Why not do a new filesystem?

    Apple has a lot of talented filesystem engineers. They are certainly capable of building something comparable to ZFS, at least for their target market. The problem with developing a new modern filesystem is that it generally takes longer than a single OS release cycle, and most companies are really bad at keeping large teams focused on projects that will not ship in the next version of the product they are working on.

    This is a particularly acute problem at Apple, which traditionally has done things with very few engineers. I don't want to get into exact numbers, but I recall the head of a university FS team pitching the FS he was working on to a group of Apple engineers. It was some interesting work, but there were some unsolved problems. When he was asked about them he commented that his group didn't have enough people to deal with them, but that he had some ideas and it shouldn't be an issue for a company with a real FS team. It turned out his research team had about the same number of people working on their FS as Apple had working on HFS, HFS+, UFS, NFS, WebDAV, FAT, and NTFS combined. I think people don't appreciate how productive Apple is on a per-engineer basis. The downside is that sometimes it is hard to find the resources to do something large and time consuming, particularly when it is not something most users will notice in a direct sense. That is especially true if senior management is not excited about the idea.

    Because of that, I was fairly convinced ZFS was a credible future primary FS for Apple. Not because it was an optimal design for them (it isn't), but because it was a lot less work than doing a new design from scratch. The fact that its fundamental architecture is 20 years newer than HFS meant it would still be better than HFS+ in almost all respects, even if it was not designed for Apple's exact needs. Clearly I was wrong, since Apple has stopped the ZFS project.

    What changed?

    Well, a couple of things have happened. The first is that Mac OS X has gotten more mature. Apple no longer needs to port all of those FSes; they already have them working, and in most cases they work fairly well. That frees up some engineers. Apple has also greatly expanded the number of people working on their kernel, since that work is amortized over many different products (Mac OS X, iPhone, AppleTV, etc.).

    Suddenly the notion of doing a new filesystem seems doable, so long as it is a real priority and the FS team doesn't get pulled off to keep adding features or doing major work to legacy FSes. It is still a lot of work, though, especially considering that Apple already had ZFS approaching production quality on OS X.

    Apple can do better than ZFS

    Sun calls ZFS "The Last Word in Filesystems", but that is hyperbole. ZFS is one of the first widely deployed copy on write FSes. That certainly makes it a tremendous improvement over existing FSes, but pioneers are the ones with arrows in their backs. By looking at ZFS's development it is certainly possible to identify mistakes that were made, and ways to do things better if one were to start from scratch. From where I sit, there are three obvious ways a new FS would be better for Apple than ZFS:

    1. There has been new fundamental research since ZFS was designed that simplifies many of the issues involved, in particular the paper "B-trees, Shadowing, and Clones" (PDF). That paper is the basis for the design of BtrFS, which has a feature set very similar to ZFS but is entirely different internally. LWN has an article about BtrFS that explains the significance in some detail (it is written by Valerie Aurora, who worked on ZFS at Sun). A minimal sketch of the paper's central idea follows this list.

    2. ZFS was designed for the storage interfaces available a decade ago. Spinning disks are going to be with us for a long time, especially for bulk storage in data centers and on backup devices, but the future is all about solid state. Flash SSDs have significantly different performance characteristics than spinning media, and there are FS design decisions one could make to take advantage of that. Any FS Apple designs will have to work acceptably on traditional drives, but if they are designing for the future then flash is what to target.

      ZFS has had some optimization work for flash, but it is all in terms of using flash as part of a storage hierarchy. That makes complete sense, since ZFS's primary deployment targets are high-end systems and data center storage. Those systems have multiple drives, so the idea of separate flash drives for a ZIL and an L2ARC is completely reasonable. Most consumers have one drive in their system, and maybe an external drive for bulk data, data exchange, and backup.

    3. That brings up the last point. ZFS is designed for big systems. It works on small systems, but most of the tradeoffs favor very large computers with lots of drives. This shows up in a number of ways. The first is that ZFS is not currently capable of adding single drives to an existing vdev or migrating vdevs between types (mirror, raidz, raidz2). That is a major feature for smaller users who might want to add a single drive, but a non-issue for data center users, who tend to add large numbers of drives all at once as whole vdevs. Another issue is that ZFS assumes you have a lot of RAM. NEC has been doing a port of OpenSolaris to ARM, and they determined they could not get ZFS to use less than 8 megabytes of RAM without making incompatible format changes (Compacted ZFS). With those changes they could squeeze it into a more reasonable 2 megabytes. On a desktop that doesn't seem like a big deal, but on an iPhone 3G or a Time Capsule 8MB of wired memory is an enormous issue.
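    Returning to point 1, here is a minimal sketch of the path shadowing idea at the heart of "B-trees, Shadowing, and Clones". To be clear, this is my own illustrative code, not anything from ZFS or BtrFS, and it simplifies aggressively (fixed fanout, no node splits, no reference counting, in-memory only). The thing to notice is that an update never modifies the old tree in place; it copies the root-to-leaf path and shares everything else:

        /* Path shadowing in a copy-on-write B-tree (illustrative sketch).
           keys[i] holds the lowest key stored under child[i]. */
        #include <stdlib.h>
        #include <string.h>

        #define FANOUT 4

        typedef struct node {
            int          nkeys;
            long         keys[FANOUT];
            struct node *child[FANOUT];   /* all NULL in a leaf  */
            long         vals[FANOUT];    /* used only in leaves */
        } node;

        /* "Shadow" a node: take a private copy before changing it. */
        static node *shadow(const node *n) {
            node *c = malloc(sizeof *c);
            memcpy(c, n, sizeof *c);
            return c;
        }

        /* Returns a NEW root. Only the root-to-leaf path is copied;
           every subtree off that path is shared with the old tree,
           and the old root remains a consistent snapshot. */
        static node *cow_update(const node *root, long key, long val) {
            node *c = shadow(root);
            if (c->child[0] == NULL) {                /* leaf */
                for (int i = 0; i < c->nkeys; i++)
                    if (c->keys[i] == key)
                        c->vals[i] = val;
                return c;
            }
            int i = 0;                                /* pick the subtree */
            while (i + 1 < c->nkeys && key >= c->keys[i + 1])
                i++;
            c->child[i] = cow_update(c->child[i], key, val);
            return c;
        }

    A snapshot is then just a saved root pointer. As I understand it, the paper's real contribution is making this work with lazy reference counting instead of ZFS-style birth-time tracking on every block pointer, which is what buys BtrFS its cheap writable clones.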

    The only major downside is that if Apple is just starting on a next generation FS now, it could be a long time before we get our hands on it.

    But now we are going to have another incompatible next generation filesystem

    Wolf brought this point up during some of the ZFS talk on Twitter yesterday. My general opinion is that it doesn't matter. People use drives for two largely unrelated tasks. One is running their computers; this is fixed storage. The other is data exchange. In the old days people used floppies as their sneakernet media, which made the situation much simpler to understand. In recent years market realities have caused people to move to SD cards, thumbdrives, and hard drives as the exchange media of sneakernet.

    The important point to understand is that while the physical devices may be the same, the use model is different, just as using a floppy disk was different from using an internal hard drive. Nobody would balk at the notion that floppies should use different FSes than internal drives. Likewise, most people shouldn't care that their external drives are formatted differently than their internal drives.

    There are complicated features you want for your boot drives and system disks. Ideally you could have them on your interchange disks too, but other features matter more there, particularly interoperability and simplicity. ZFS brought neither of those. There might have been a few people who were psyched to use ZFS to share disks between a Mac and a Solaris or FreeBSD box, but honestly those people are few and far between. Whether Apple used ZFS or something else, the result is just as interoperable with Linux and Windows (which is to say, not at all). So the fact that Apple looks to be doing a new FS does not impact interoperability in any real sense.

    The other feature you really want from an interchange FS is simplicity. There are a lot of devices out there that use an FS to communicate with a computer. The simplest example is a digital camera via its media cards, but there are many others. Something like ZFS is way too complex for those devices, and honestly most of the features of ZFS, like multiple drive support and snapshots, are useless there, since the devices don't have the physical interconnects or user interfaces to expose them. There is certainly an argument to be made that we could use something a bit better than FAT32 or exFAT as that format, but ZFS was not the right solution for that.

    In other words, for that disk you want to use as an external drive to drag between computers you don't want something like ZFS; you want something simple enough that a firmware engineer can write a read-only implementation from the specs in less than a week. For the disk embedded in your computer (operationally or literally) you want something like ZFS, and it doesn't matter whether it is interoperable with anything else, because you won't be moving it between systems.
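    To give a sense of the level of simplicity I mean: per Microsoft's published FAT specification, an entire directory entry is 32 fixed little-endian bytes. The struct below is just my own illustration of that layout, not code from any particular implementation:

        /* On-disk FAT directory entry, following the published spec. */
        #include <stdint.h>

        #pragma pack(push, 1)              /* match the on-disk layout exactly */
        typedef struct {
            uint8_t  name[11];             /* 8.3 name, space padded */
            uint8_t  attr;                 /* 0x10 = directory, 0x08 = volume label */
            uint8_t  nt_reserved;
            uint8_t  create_time_tenth;
            uint16_t create_time;
            uint16_t create_date;
            uint16_t last_access_date;
            uint16_t first_cluster_hi;     /* high word; zero on FAT12/FAT16 */
            uint16_t write_time;
            uint16_t write_date;
            uint16_t first_cluster_lo;     /* start of the file's cluster chain */
            uint32_t file_size;            /* in bytes */
        } fat_dirent;                      /* exactly 32 bytes on disk */
        #pragma pack(pop)

    Listing a directory is little more than reading a cluster chain and scanning an array of these. Compare that with ZFS, where you have to understand uberblocks, object sets, and checksummed, compressed block pointers before you can read anything at all.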

    This is basically how Windows works. Microsoft generally uses NTFS for internal drives, but FAT for external drives. Ultimately somebody should design a filesystem explicitly for use as an interchange format and license it for free; then everyone can deal with their internal FSes and do what makes the most sense for their OSes and markets.

    Reader Comments (31)

    But you have to concede Sun does have the last letter in filesystems.

    October 24, 2009 | Unregistered Commenter id.html

    OTOH, the door is wide open for [FS or {fs…

    October 24, 2009 | Unregistered Commenter John Siracusa

    ~FS, the last non-control ASCII character in filesystems?

    October 24, 2009 | Unregistered Commenter Louis Gerbarg

    Tell me something about Apple technology? But I can tell you something: BSD 4.x (NeXTSTEP), FreeBSD/NetBSD (Mac OS X), Mach, CUPS, Xerox technology (copy and paste, the mouse, the desktop per se), Adobe's Display PostScript (Quartz is a derivative), MAC (Mandatory Access Control - a security framework from FreeBSD), and so on. Apple can sell it, but I doubt they can do it better.

    October 25, 2009 | Unregistered Commenter fixmbr

    "Apple can sell it, but I doubt they can do it better "

    Bsd 4.X is not from NeXT.

    the whole Cocoa/openstep stuff is from next/apple engineer

    Quartz is from Apple/Next engineer

    Mach was an university project, only much NexT/Apple engineer made something useful with it.

    the desktop is an old metaphore, Apple did all the work to make it useful and possible on small computer (quickdraw)

    Quicktime

    firewire

    Mandatory access is not invented by freebsd

    Apple can sell it and they also can do better.

    NOONE can do ALL better. You have to admit the whole computer industry is made of many companies and organization contributing to the whole stuff.

    It's good , and there are also apple in the mix.

    Accept that.

    October 25, 2009 | Unregistered Commenter Primrose

    "Ultimately somebody should design a filesystem explicitly for use as an interchange format and license it for free, then everyone can deal with their internal FSes and do what makes the most sense for their OSes and markets."

    Brilliant.

    October 25, 2009 | Unregistered Commenter Jason Wagner

    It might be my eyes, but this is one of the hardest blogs I've ever tried to read. Small white text on black is not good.

    October 25, 2009 | Unregistered Commenter Jay

    I'd suggest that interoperability does matter, especially when you are running more than one system on a computer (for example Boot Camp for Windows).

    Plus, to try and predict the storage model you are going to be stuck with for the next 10+ years (whether large arrays, remote cloud storage or perhaps a mix of the two in some sort of seamless format) is difficult at best.

    Finally, testing a filesystem to destruction in all its myriad ways such that it is production capable is something I don't think Apple can do by themselves. btrfs will get massive support from the Linux community and people running bleeding edge servers. Desktop users aren't going to have spare computers to run and test with in the massive numbers that Apple might need. And why reinvent the wheel every time? Realistically they should negotiate with Oracle for btrfs and save those filesystem engineering resources for current issues (adding TRIM support now being one example).

    As for the interchange file format, I'd love to see one, but I can't see it happening. People would go for the overly ambitious universal file format, thereby rendering it just another filesystem. I'd think a filesystem that gracefully "minimizes" would be a better solution, one whose tricks are dependent on the space and resources available.

    October 25, 2009 | Unregistered Commenter TN

    This is some well informed analysis. It will be interesting to see how this all pans out. As you suggest, the big players in the industry would be wise to work together to develop an interchangeable, unrestricted, licensable file system format.

    October 25, 2009 | Unregistered Commenter Khurt

    I appreciate the effort and thought that went into the article and enjoyed reading it, but...

    I must agree with the comment about readability. Small, dense white text on a black background.

    October 25, 2009 | Unregistered Commenter Tim Holmes

    fixmbr:

    In the article I didn't say better, I said "something comparable to ZFS, at least for their target market." I actually do think they can do straight up better, because they have years of experience and several public copy on write designs to analyze. Whether or not they do has a lot of other factors involved, such as time to market and management buy-in. Regardless of whether or not the filesystem is technically better than ZFS, it will almost certainly be better for Apple's target markets, because they will make technical tradeoffs that favor those markets, as opposed to Sun, who made tradeoffs that favored people who bought 48 drive Sun Fire X4500 servers.

    Everyone else:

    Okay, between this article and the last one I have gotten the hint; I will change templates before I post another article. Every time I ditz with the templates I tend to break comments and comment linking for a few hours while I do it, so I am going to hold off a few days until traffic subsides.

    October 25, 2009 | Unregistered Commenter Louis Gerbarg

    TN:

    From my understanding, btrfs is basically a non-option. There are two reasons:

    1) It heavily leverages a lot of the Linux kernel infrastructure. Rather than implementing something like vdevs by telescoping the existing stack, they are making the Linux MD driver (which handles RAIDs) smarter so it supports dynamic stripes. It is really an FS designed very much for Linux, and the amount of scaffolding and compatibility code one would need to write to support it on another OS makes it a more difficult port than something like ZFS.

    2) I do not believe Oracle has been requiring any copyright assignments, which means they are not in a position to offer it to anyone under special license terms. It is GPLv2 or bust, and for a primary filesystem directly linked into the kernel that is almost certainly a no-go.

    As to your other point:

    TRIM is not a filesystem issue; if you read through the HFS+ source code, the support is there. It is an ATA driver issue (AHCI, really). That is done by a completely different group of people, and if I had to guess I would assume that at this moment the issue is that none of the drives Apple ships support TRIM, but that when they spec out a drive that does for some machine, it will be implemented as part of the support for that product.

    As for testing... no, I am pretty sure Apple can test a filesystem just fine. User testing for those sorts of things is actually surprisingly bad, since errors tend to show up much later and be very hard to predict. It is certainly a useful tool, but I think you overestimate how significant it is compared to good test tools.

    In fact, several years ago Apple open sourced an internal testing tool (fsx) that they used for FS testing. It found lots of errors in shipping Linux filesystems that had been in use for years. It is actually one of the tools used for testing btrfs. You can check out the various versions of it at http://www.codemonkey.org.uk/projects/fsx/

    October 25, 2009 | Unregistered Commenter Louis Gerbarg

    Re: the 'licence issue', I'm sceptical of how much Sun/Oracle not providing suitable indemnity against potential NetApp lawsuits really played into this decision.

    *IF* NetApp's patents on copy-on-write and snapshots hold up (and that doesn't look like it's happening) those patents would be valid not just against ZFS, but *ANY* FS implementing the concepts, presumably including BTRFS and the mooted 'next gen AppleFS'.

    Given the number of OSX licences is ridiculously larger than the number of Solaris licences, and Apple obviously has the money to pay, they would be next on the lawsuit list whether or not they implement copy-on-write and snapshots via ZFS or their own filesystem (assuming software-patent madness does not disappear off the face of the earth[USA] in the mean time).

    Aside from that issue, ZFS certainly isn't ideal for Apple's end users. 128-bit filesystems are irrelevant for 99.999% of its users: a variant "ZFS64" implementation would seem to make more sense for OSX, and Mobile OSX doesn't even need that.

    The main decider could simply be that Apple doesn't see the need for a next-gen file system RIGHT NOW, which is the main advantage offered by ZFS. If that advantage doesn't matter, I could see them seriously considering BtrFS, which given its Linux-friendly licence will certainly be getting a lot broader development than ZFS. In that light, what's the best use of a relatively small Apple FS team: developing a smaller-volume-scaling FS variant that only Apple will use (if a ZFS variant), or the same type of work on a broader project whose smaller-target variant would actually see broad community involvement (BtrFS)?

    October 25, 2009 | Unregistered Commenter remember

    remember:

    The licensing thing has basically been confirmed by Jeff Bonwick: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html

    That is about as much confirmation as anyone not directly involved is likely to see, and if it is not enough to convince you I doubt anything will.

    With respect to BtrFS, I really don't think it is an option, for reasons I outlined in the comment right above yours. Mind you, I am not dissing BtrFS; I think it is a great looking FS, and in a few years I am confident it will be widely deployed in the Linux community, but short of a ground-up format-compatible rewrite I just don't see how anyone can use it in a kernel that is not under a GPLv2 compatible license.

    October 25, 2009 | Unregistered Commenter Louis Gerbarg

    BtrFS's structural accommodation to the Linux kernel could very well make it not worth it for Apple to adopt, and Apple may well be planning to go their own way here...

    But I don't see how any alternative that implements similar concepts would provide any better protection than they would have from ZFS (indemnified or not). Issues relating to Apple's own modifications (i.e. upstream adoption, etc) could easily have played a part IMHO (and that's in line w/ Bonwick's comment).

    October 25, 2009 | Unregistered Commenter remember

    "Makes" is duped here: "That makes completely makes sense…"

    Also, "Something like ZFS is way to complex" should use "too" and not "to."

    October 25, 2009 | Unregistered Commenter superlance

    RE: the readability of this blog:

    For the young and cool, style's the rule.
    For the old and wise, Apple + a few times.

    Thanks for having style even though it's clearly wasted on (at least part of) this demographic.

    October 25, 2009 | Unregistered Commenter macwise

    @Louis - Thanks for the answer. It still seems a shame that we are doomed to separate filesystems going forward.

    I suppose they will have to go it alone; I don't know of any other orphaned modern FSes out there right now that could be picked up.

    October 25, 2009 | Unregistered Commenter TN

    Jay: It's not your eyes. The color and font choices on this page are horrendous. I couldn't get through the whole article without using Safari's web inspector to change the font and background colors to non-painful values.

    October 25, 2009 | Unregistered Commenter Stormchild

    I would hope that if Apple is working on a new file system, they build support for on-line defragmentation right into it. This would be very useful, and contrary to popular belief HFS+ does need to be defragmented occasionally (and right now it is a very slow, hazardous process relegated to third parties that often do not do a good job).

    October 26, 2009 | Unregistered Commenter Mario Grgic

    Mario:

    Personally, that would not be a huge concern to me, for two reasons:

    I want it to be tuned for SSDs. While SSD performance does degrade some for very small reads, if you are reading in the device's advertised stripe sizes then there is basically no performance difference between sequential and scattered stripes. In other words, there is no seek penalty, so as long as you are writing your files in at least ~128k chunks it doesn't matter if those chunks are scrawled all over the disk.
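    (To put rough numbers on that, and these are round figures of my own, not measurements: a 10MB file scattered into eighty 128k fragments costs about 80 x 8ms = 640ms of seeking on a disk with 8ms seeks, on top of roughly 100ms of actual transfer at 100MB/s. On an SSD with ~0.1ms access times, the same scatter adds about 8ms total. That is the entire defragmentation argument evaporating.)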

    Having said that, assuming they support online device removal from multiple-device FSes, they will have to have all the tech necessary to arbitrarily reposition data while the FS is in use. BtrFS has support for this already. It is one of the ways BtrFS is distinctly better than ZFS, and it is a direct result of that research paper from 2007 I linked in the article.

    October 26, 2009 | Unregistered Commenter Louis Gerbarg

    Thank you for a clear and well written piece, which is a breath of fresh air in the quicksand of FUD we seem to be constantly stuck in.

    In other news:
    Apple buys BFS from Access JP...
    An oldie but goodie re-surfaces, and will be re-jigged to fulfil the fantastical capabilities of 'copy on write' FS's...

    Importantly it will also be a symbolic gesture by Steve towards his old pal Jean Louis Gassée...
    The message:
    "No hard feelings right?"
    ah ah ah
    I am rich Biach!!!!

    ok sorry probably not appropriate but come on, Jean Louis must still be pissed, non?

    October 27, 2009 | Unregistered Commenter Sebastien

    I've discussed your point about FSs for exchangeable media at my advogato diary:
    http://www.advogato.org/person/chalst/diary/243.html
    where I say that the old BSD FSs are too old to be patent encumbered, and so are excellent candidates.

    Hey, post another Advogato diary entry! Folks there will be happy to see that you remember us! Your account was lgerbarg, in case you've forgotten :->

    October 27, 2009 | Unregistered Commenter Charles Stewart

    I think this was a very good article on the topic of ZFS and Apple MacOSX!

    I figured some may be interested in a story about waiting for ZFS on some Macs.

    We had a mirrored set of 500 Gig drives, a mirrored set of 400 Gig drives, and assorted external 250GB, 160GB, and 80GB drives.

    We lost 1x400GB and 1x500GB drives - within a week! Fortunately, both losses were from mirrored sets.

    A pair of mirrored 1.5TB drives was purchased. FireWire reliability problems plagued the Seagate Extreme 1.5TB drives on the dual-core Intel Mac, the dual-processor PowerPC, and even a single processor G4. Finding eSATA cards for the PCI-X bus on a dual 2.5GHz G5 was not trivial. We were stuck with USB.

    The 1.5TB set took a while to come up. Several ungraceful shutdowns forced complete rebuilds, which took days each! Lengthy rebuilds are a completely unacceptable risk for mirrored file systems and consumed too much system capacity. Apple REALLY needs a next generation file system RIGHT NOW.

    We waited for the next MacOSX release for ZFS. Once the decision was made to drop ZFS from MacOSX, we deployed an older dual processor SPARC Solaris 10 box with ZFS and moved the mirrored 1.5TB drives.

    A single external 1.5TB USB drive was taken from the Mac to create a ZFS pool. The data was copied over gigabit ethernet from the Mac to the new system. The last external USB 1.5TB drive was pulled off of the Mac and "attach"ed to the ZFS pool to make a mirror. This blog made it sound like one could not add a single drive or convert a single drive to a mirror. Half a day was required to do a complete rebuild. A simulated ungraceful shutdown took seconds to recover from.

    A really neat feature in ZFS is the ability to upgrade your existing mirror, one drive at a time, to a larger drive size, without destroying it.

    Our mirrored 1.5TB set is over half full. We will need another mirrored set added. With ZFS, additional external USB drives can be added, one pair at a time. In preparation, a new 4 port USB card ($11.00 US at MicroCenter) was added to the PCI based SPARC.

    ZFS can be used with external portable USB drives today. There are "export" and "import" commands to move pools from one system to another - even between systems of different endianness. The blog made it sound like portability was not that easy. I absolutely agree that ZFS is far less simple (than FAT), as Louis pointed out.

    Flash, as a next generation storage device, is not as simple as suggested. Fast-write flash is expensive while fast-read flash is cheap. Louis briefly discussed how the new version of ZFS allows for read flash acceleration (i.e. the L2ARC cache) as well as write flash acceleration (i.e. the ZIL) - but "systems have multiple drives, so the idea of separate flash drives for a ZIL and L2ARC" is incorrect reasoning. Separate flash devices are supported because not all flash is equal. A small amount of high-cost high-reliability ZIL flash gives a huge boost, and a large amount of cheap low-reliability L2ARC flash gives a huge boost. ZFS allows the most economical approach to get performance, today. If a single flash device is created with both high read and write performance as well as high reliability (years away) - then a single flash device can be partitioned and used for both.

    ZFS is not perfect, but it is growing:
    - USB port positions are important, just like SCSI drive IDs
    - ZFS is adding drive removal from a pool to reduce capacity
    - ZFS is adding deduplication. Imagine a school full of Macs being able to be backed up onto a little 500 Gig Time Capsule.

    More Macs, PCs, and SPARCs will be added to the network. Since Adobe did not port FrameMaker to MacOSX and Apple dropped OS9 PowerPC emulation, deploying SPARC Solaris is the easy solution to keep FrameMaker on everyone's MacOSX desktop (via an X client). ZFS is just another business reason to add SPARC Solaris.

    I'll post to my blog soon about my experimentation with CHEAP read and write flash under ZFS, serving my Mac clients over Gigabit Ethernet.

    http://netmgt.blogspot.com/

    October 27, 2009 | Unregistered Commenter David

    David:

    Those are some good points, but I just want to make two quick comments:

    1) Mirror upgrades

    Basically any RAID solution supports this. When you pull one drive it degrades the mirror, then you stick in a larger replacement drive and reconstruct. You then remove the remaining small drive, stick in a larger one, and reconstruct again. You now have a filesystem that is not using the full size of the drives, so you use the OS's LVM (Logical Volume Management) interface to expand the FS. You can do that with most filesystems on Linux using an MD raid, and you can do it with HFS+ on Mac OS X. Chances are the reconstruct will happen faster on ZFS, because it tends to cluster sequential writes, which lowers the amount of write traffic contending with the read traffic generated by the reconstruct, but that is about it.

    2) I am aware of the difference between write-biased and read-biased flash and how that affects the L2ARC and ZIL. I might not have been clear: my point wasn't that putting the L2ARC and ZIL on separate devices is wrong. The point I was making is that putting either of them on different media than the main storage is completely wrong for Apple's main market. Sun feels (correctly) that telling data centers to replace all of their disks with flash is not cost effective when you can place flash at the right points in the hierarchy.

    If you look at the devices Apple sells that use a filesystem, the most common are iPhones and iPods. In computers, the bulk of their sales are portables. All of those have a single drive, and it is going to be either flash or disk; there is generally not going to be both. So Apple's next generation FS should be optimized for use as main storage, not as a layer of a storage hierarchy (though being able to support that mode of operation would be a good thing).

    October 28, 2009 | Unregistered Commenter Louis Gerbarg