Interview: Matthew Dillon

Submitted by Jeremy
on August 6, 2007 - 1:56pm

Matthew Dillon created DragonFly BSD in June of 2003 as a fork of the FreeBSD 4.8 codebase. KernelTrap first spoke with Matthew back in January of 2002 while he was still a FreeBSD developer and a year before his current project was started. He explains that the DragonFly project's primary goal is to design a "fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly."

In this interview, Matthew discusses his incentive for starting a new BSD project and briefly compares DragonFly to FreeBSD and the other BSD projects. He goes on to discuss the new features in today's DragonFly 1.10 release. He also offers an in-depth explanation of the project's cluster goals, including a thorough description of his ambitious new clustering filesystem. Finally, he reflects back on some of his earlier experiences with FreeBSD and Linux, and explains the importance of the BSD license.

DragonFlyBSD Versus Other BSD Projects
Jeremy Andrews: It's been five and a half years since I last interviewed you on KernelTrap. At that time you were a FreeBSD committer, and DragonFlyBSD had not yet been born. Can you summarize what happened to cause you to leave the FreeBSD project and start the DragonFly project?

Matthew Dillon: Generally speaking, the FreeBSD project was moving away from the sorts of concepts I wanted to implement for BSD, and in particular the DragonFly project has a number of clustering and design goals that require a great deal of rewriting within the core kernel, so much that it was highly unlikely that I would have been able to do it as part of the FreeBSD project. My relationship with certain members in the FreeBSD project had always been somewhat fractious and it was finally time to break the connection.

JA: How does the DragonFly development process differ from the FreeBSD development process?

Matthew Dillon: The development process works about the same way but we make it clear that there is no concept of long-term 'ownership' for any of the code that goes into CVS. Code that goes in is considered to be community property. Individual developers only 'own' a piece of code while they are actively working on it. Once something has gone in and is no longer being worked on actively anyone else can come in with fixes or cleanups.

We are more welcoming of cleanups than some other projects. We consider cleanups to be good business because they result in a more readable and more understandable code base down the line.

The DragonFly release process is similar but somewhat more evolved in order to not waste developers' time as much as the FreeBSD release process did. The code freeze is usually no longer than a week, and once the release is branched in CVS people are free to commit any work in HEAD that was held off. We also went with a full Live-CD image almost immediately, something I am happy to see other projects doing now as well.

There is a much greater emphasis on code and algorithmic quality over raw performance, because there isn't much point releasing a system that performs well but crashes occasionally.

JA: Have you done any performance comparisons between DragonFly and any other BSD projects?

Matthew Dillon: I haven't, particularly. We know where we stand simply by virtue of where we are in the MP work, which is what most performance comparisons measure these days... somewhere between Open/Net and FreeBSD. Removing the big giant lock has not been a big priority (and won't be for 2.0 either) and I keep hoping that another developer will pick up that ball and run with it.

JA: What is the "MP work"?

Matthew Dillon: Multi-processor, aka SMP. In the context of the OS code it means getting rid of the global locks that effectively prevent code from running in a kernel context on more than one CPU at a time.

JA: Do you continue to share code with the FreeBSD project?

Matthew Dillon: We pull in driver updates when appropriate (NATA being a prime example), but we do a considerable amount of driver work ourselves now, at least with regards to network drivers. We probably port more stuff from NetBSD and OpenBSD than we do from FreeBSD at this point.

JA: How many developers are actively contributing to DragonFly?

Matthew Dillon: About a dozen people are heavily involved, with numerous other people contributing around the edges.

JA: Has the number of active developers been growing since the project's inception?

Matthew Dillon: We have added a new committer on average about once every few months.

JA: Does code have to be approved by you or any other developers before it is merged?

Matthew Dillon: We rely on the people who have commit bits to exercise common sense, and by and large that has worked.

JA: How stable is each release of DragonFly?

Matthew Dillon: DragonFly releases tend to be very stable. I can't really say how it stacks up to recent FreeBSD releases but DragonFly's stability is roughly similar to the stability that the FreeBSD-4 series was known for.

JA: What are the main goals and design differences of DragonFly that separate it from the other BSD projects?

Matthew Dillon: My primary goal is to eventually have a fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly. Doing this properly requires direct, integrated support in the kernel. We are probably two years away from accomplishing this goal.

JA: What pieces of this goal are already implemented, even partially?

Matthew Dillon: Range locking mechanics for I/O atomicity have been partially implemented. We do range locking but we have not yet removed the vnode lock for I/O operations. Syslink has been mostly implemented (but not the over-the-wire version for remote machine access yet). A userland vfs API using syslink is partially implemented and will be completed for 2.0. A system identifier abstraction for major data structures called sysid (kern_sysid.c) is partially implemented.... this is how machines will reference shared resources in a clustered environment. The cluster filesystem design is well under way, as previously described.

JA: What pieces are not yet implemented?

Matthew Dillon: Cache Coherency mechanics have not been implemented. Resource sharing and execution context migration mechanics have not been implemented. Cache coherency is the big ticket item here, resource sharing naturally follows once we have proper cache coherency mechanics.

JA: What is your inspiration and need for all the clustering features in Dragonfly?

Matthew Dillon: I don't want to have to shut down a machine that has been running for months and months just to fix a piece of hardware. I want to be able to shove all of its functions onto another box while it is still live and then be able to physically power down the original box without it affecting operations.

More exhaustively speaking I see a huge potential in internet-based clustered computing where resources such as cpu and storage are not localized.

DragonFly 1.10:
JA: What architectures are currently supported by the DragonFly code base?

Matthew Dillon: Just i386 for now. New architectures haven't been a big priority and tend to interfere with necessary machine dependent infrastructure work. There is more interest now that most of the potentially interfering infrastructure work has been completed, and because we now have a 'virtual kernel' architecture that serves as a great template for new architectural ports. We certainly want to get 64 bit support in sooner rather than later but it is not the number one priority for the project.

JA: What is the 'virtual kernel' architecture?

Matthew Dillon: It refers primarily to the way the kernel source code is organized in the source tree. While working on virtual kernel support a great deal of separation and clarification of the machine and architectural specific source files occurred.

JA: DragonFly 1.10 was released today, nearly 6 months after DragonFly 1.8. Does the project have a regular release cycle?

Matthew Dillon: 1.10 may be a bit late due to a slew of last-minute work.

DragonFly does a release every 6 months or so, one in summer, one in winter. This somewhat slower release cycle allows our relatively few developers to focus on development rather than on release engineering.

I think FreeBSD has contemplated going to a slower release schedule as well, though I don't know what the status of their discussion is.

JA: What issues are delaying the 1.10 release?

Matthew Dillon: Primarily little niggling bugs that were thought to be less important than they turned out to be. We broke the kernel's ability to boot with a root vinum, there was a kernel memory leak related to exec*(), a particular sound driver wasn't happy, a bug was found in msdosfs, and a few other things like that. Vinum (the software raid driver) also had some serious interaction issues with the new ATA driver (NATA) because it was ignoring DMA limits specified by the new driver. That opened up a Pandora's box that required adding an abstraction layer in the vnode subsystem to break up large I/O strategy requests.

JA: What are some of the main new features found in DragonFly 1.10 compared to version 1.8?

Matthew Dillon: I haven't put together the feature list yet but off the top of my head our virtual kernel support (running a virtual DragonFly kernel as a user process under a real DragonFly kernel) is much better now than it was in 1.8. We have introduced a new disk management infrastructure which supports GPT and 64 bit native disklabels (though not the boot code yet to allow one to boot from the above). A great deal of wireless networking work has gone in and been stabilized. We also ported and stabilized FreeBSD's new ATA driver (we call it NATA) and it will become the default in 1.10. Package Source support is also considerably better now. Syslink is in and working (syslink isn't really user visible yet, though).

JA: What is GPT, and what are the advantages of 64 bit native disklabels?

Matthew Dillon: GPT is a partitioning scheme put forth by Intel as part of the EFI standard for BIOS interactions during low level boot. Among other things it expands the 32 bit block number limit that the DOS partition table had (which is responsible for the 2TB-per-device limitations we see on PCs today) to 64 bits.
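The 2TB-per-device figure follows directly from the arithmetic: a 32 bit block number can count 2^32 sectors, and at the traditional 512 bytes per sector that works out to exactly 2 TiB. A quick sanity check (the constant names here are illustrative, not from any real partitioning code):

```python
# The classic DOS partition table stores block counts in 32 bit fields.
# With the traditional 512-byte sector, a device therefore tops out at
# 2^32 * 512 bytes = 2 TiB.
SECTOR_SIZE = 512          # bytes; the traditional sector size
MAX_SECTORS = 2 ** 32      # largest count a 32 bit field can hold

max_bytes = MAX_SECTORS * SECTOR_SIZE
print(max_bytes)                      # 2199023255552
print(max_bytes == 2 * 2 ** 40)       # True: exactly 2 TiB

# A 64 bit field, as in GPT, raises the ceiling to 2^64 sectors --
# about 8 ZiB at 512 bytes per sector, effectively unlimited.
print(2 ** 64 * SECTOR_SIZE)          # 9444732965739290427392
```

The same arithmetic explains why moving both GPT entries and DragonFly's new disklabels to 64 bit quantities removes the limit for the foreseeable future.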

64 bit disklabel support is a DragonFly-specific feature that replaces the old crufty 32 bit BSD disklabels with 64 bit disklabels. That is, BSD disklabels also had the problem of specifying block ranges with 32 bit quantities, as well as having to store them as absolute sector offsets on-disk and translate them back and forth on the fly when reading or writing a label. Standard BSD disklabels were also in-band in that the first partition was allowed to overlap the label area itself, which causes all sorts of problems. The 64 bit disklabel support is slice-relative, sector-size-agnostic (uses 64 bit byte offsets rather than sector numbers), and does not allow partitions to overlap the label area.

Both GPT and 64 bit disklabel support are experimental as of this release. We do not yet support booting from such partitions because we haven't yet written boot code to support them.

JA: In addition to stability, what other improvements have been made to wireless networking?

Matthew Dillon: It isn't my area of expertise but my understanding from browsing all the commits is that a great deal of device support has been added in addition to the stability work.

JA: You mentioned your new ATA driver, NATA, having been ported from FreeBSD. What was the driver called in FreeBSD?

Matthew Dillon: 'ata'. In FreeBSD their new ata driver replaced the old one entirely. In DragonFly, kernels can be built with either the old or the new driver, but they are mutually exclusive of each other.

This is typical of how we operate. We generally integrate major upgrades in parallel with existing ones and allow both to be used for quite a period of time before switching the default. NATA has been in the tree in some form or another for almost a year but was only considered good enough to make the default as of this release. Similarly we have multiple versions of GCC in the tree (3.4.6 and 4.1.2 at the moment), and numerous people are running with a 4.1.2 default, but the official default is still 3.4.6. In the case of GCC the compiler can be selected on an individual-use basis with a simple environment variable.

JA: What are the advantages of the NATA driver?

Matthew Dillon: The biggest advantage of the new driver is that it supports something called AHCI, which is the native command queuing protocol used by SATA controllers. SATA controllers can operate in one of two modes: emulated mode or native mode. In emulated mode a SATA controller looks like your standard run-of-the-mill IDE/ATA/ATAPI controller. IDE protocols are over 20 years old and horrible beyond measure. In native (AHCI) mode SATA controllers look much more like modern SCSI controllers, are far easier to manage, and do not have any of the severe limitations the IDE protocols had.

This is all the work of FreeBSD's Soren Schmidt.

Highly Available Clustering Filesystem:
JA: In February you posted some updates on the DragonFly mailing lists about a highly available clustering filesystem that you are designing. What is the current status of this filesystem? What are the long term goals?

Matthew Dillon: A really good cluster filesystem is a prerequisite to having a really good cluster OS. I had hoped to have it finished for this release but a ton of things came up that had to be addressed and I ran out of time. The cluster filesystem is my personal priority for 2.0.

The filesystem has several major goals.

  • Infinite snapshots. You can think of this kinda like journaling but more in a transactional sense where there is no explicit snapshotting event and instead you simply mount the filesystem 'as of' a certain date to get a snapshot as-of that date.
  • Today's storage media is far larger than most people need. Until told otherwise the filesystem will not destroy any historical data. A continuous cleaning system will allow you to manage the granularity of the snapshots... for example, every 30 seconds for the last hour, every minute for the last day, every hour after a day, every day after a week, once a week after a month, and so forth. Cleaning involves 'collapsing' modifications made within the selected time quantum, which has the side effect of freeing space.
  • Integrated backup mechanism. Because we have infinite snapshots we do not have the races associated with dump/restore or tar or other standard backup mechanisms. Backing up the filesystem can be thought of as a continuous stream of changes, but a stream which does not require any 'queuing' of the backup data per se, which means that you can back up to multiple targets including very slow targets and mostly off-line targets without worrying about the backlog built up for any given target creating problems on the live system.

    The filesystem is being designed to make streaming backups very efficient, meaning, for example, that one will not have to do a full filesystem scan to make an incremental backup. Backup targets themselves will also be live filesystems, not archives, and can independently manage their snapshot granularity.

  • The backup mechanism will also be used for replication in a cluster, or even replication without a cluster, for redundancy. This won't exist in the first release of the filesystem but the infrastructure will be designed to support it.
  • Storage media migration. The filesystem will be able to cross multiple storage media boundaries natively and not require a volume manager. Additionally it will be possible to migrate large chunks (probably in the 2-4G range) between physical storage media while the filesystem is live, which is a major requirement for any filesystem one wishes to maintain long-term.

Those are the basics of the design. Look for it in 2.0!
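The cleaning schedule in the second bullet can be expressed as a small policy function mapping a snapshot's age to the finest spacing retained at that age. This Python sketch mirrors the example thresholds from the text only; it is not the actual cleaner design:

```python
# Illustrative snapshot-retention policy: given a snapshot's age in
# seconds, return the finest interval (in seconds) worth keeping at
# that age. A real cleaner would 'collapse' modifications between the
# retained points, freeing space as a side effect.
MINUTE, HOUR, DAY, WEEK = 60, 3600, 86400, 604800

def retention_interval(age_seconds):
    """Finest snapshot spacing to keep for a snapshot of this age."""
    if age_seconds < HOUR:        # last hour: every 30 seconds
        return 30
    if age_seconds < DAY:         # last day: every minute
        return MINUTE
    if age_seconds < WEEK:        # after a day: every hour
        return HOUR
    if age_seconds < 30 * DAY:    # after a week: every day
        return DAY
    return WEEK                   # after a month: once a week
```

A cleaner could walk the snapshot list oldest-first and drop any snapshot closer to its retained neighbor than `retention_interval` allows, collapsing the intervening modifications.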

JA: When you talk about preserving historical data, does this essentially mean that the file system offers a built in version control system?

Matthew Dillon: Yes, you can think of it that way.

JA: How will one control and access this historical data?

Matthew Dillon: Either directly specify a date when mounting, allowing one to mount arbitrary snapshots (as many as you like) in parallel with the live filesystem mount, or through an extension of the file or directory name to specify the as-of date, which I have not yet come up with.

JA: Does your new filesystem have a name?

Matthew Dillon: Not yet. 'DFS' is taken unfortunately.

JA: Why do you need filesystems implemented in userland?

Matthew Dillon: Being able to implement a filesystem in userland greatly reduces development time. From an operational standpoint you want any high performance filesystem to run in the kernel. On the other hand, there is no reason why a low performance filesystem, such as one for a CD/DVD, msdosfs, or a cross-OS emulated filesystem, ever needs to run in the kernel. Running those things in userland insulates the kernel from filesystem bugs which might otherwise crash the kernel.

JA: Are there any existing free or proprietary filesystems that you're aware of that meet all or most of these goals?

Matthew Dillon: Some of them but not all of them together. ZFS, EXT3, and Reiser all have individual features that are desirable.

JA: How does your filesystem compare to ZFS?

Matthew Dillon: ZFS solves a different problem, but I hope to achieve similar storage redundancy in our filesystem by virtue of making live mirrors practical. ZFS takes an integrated filesystem+storage-layer approach to redundancy but I think that is a mistake. One really needs to take a whole-filesystem approach to redundancy, meaning that redundancy at the filesystem layer needs to operate multiple independent filesystems which happen to implement protocols allowing them to remain coherent with each other, versus operating multiple independent storage systems as a single filesystem.

Think of it like this: When you make a backup of a filesystem to tape, or you make an archive of a filesystem, and then at some point down the line something blows up and you need to restore it, suddenly you are faced with the situation of having to spend an entire day or even longer rebuilding your live filesystem from your archived backup. That is unacceptable today. Most people I know have switched to filesystem replication as their backup scheme. For example, I back up all my DragonFly systems by doing a daily snapshot to independent storage on another machine in my LAN, and do a weekly snapshot of that system to an off-site backup machine. Both the LAN backup box and the offsite backup box replicate the entire directory and file structure and use hardlinks for those files which have not changed:

    backup# df -i /backup
    Filesystem  1K-blocks      Used     Avail Capacity    iused   ifree %iused  Mounted on
    /dev/ad6s1d 726621736 299012871 369479127    45% 40997899 4790899   90%   /backup

The LAN backup box keeps daily snapshots for the last two months and the off-site backup box keeps weekly snapshots for the last 6 months.
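The hardlink trick described above, a full directory tree per snapshot with unchanged files sharing inodes, is the same idea that rsync's --link-dest option implements. A minimal Python sketch (the `snapshot` helper is purely illustrative, not an actual DragonFly tool):

```python
# Hardlink-based snapshot: copy `source` into `new_snap`, but for any
# file unchanged since the previous snapshot, hardlink it to the copy
# in `prev_snap` instead of duplicating its data. Each snapshot looks
# like a complete tree, yet unchanged files consume no extra space.
import os, shutil, filecmp

def snapshot(source, prev_snap, new_snap):
    for root, dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.join(new_snap, rel), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            prev = os.path.join(prev_snap, rel, name)
            dst = os.path.join(new_snap, rel, name)
            if os.path.isfile(prev) and filecmp.cmp(src, prev, shallow=False):
                os.link(prev, dst)      # unchanged: share the inode
            else:
                shutil.copy2(src, dst)  # new or modified: real copy
```

This is also why the df -i output above shows inode usage (90%) far outpacing block usage (45%): many snapshots of mostly-unchanged trees cost directory entries and inodes, not data blocks.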

One of the major ideas behind the new filesystem is to integrate the concept of making backups and maintaining live mirrors and offsite mirrors directly into the filesystem.

JA: How possible will it be to port this filesystem to other operating systems? That is, how intimately is it tied to the design of DragonFly?

Matthew Dillon: I am developing it in userland using syslink so theoretically it would not be too hard to port, but it will use DragonFly's VFS API, which is substantially different from the API found in other BSDs.

Syslink Protocol:
JA: You've also discussed your Syslink Protocol on the DragonFly mailing lists. Can you explain a little about what the Syslink protocol is, what changes it has made to the kernel, and how it is used?

Matthew Dillon: Yes, 1.10 will have the first real cut of the syslink protocol ready to go and syslink will be used for our userland VFS implementation. You can think of syslink as a glorified communications pipe with a twist.

  • Instead of sending a byte stream you are sending messages and getting replies.
  • Message formatting is formalized, endian-translatable, and verified by the transport mechanism (i.e. the kernel in this case).
  • The kernel keeps track of messages which have not been replied to, and if the communications pipe is broken any unreplied messages will be replied to by the kernel with an error code.
  • Plus there is an out-of-band 'DMA' data mechanism. Messages have a limited size but may include data payloads up to 128KB. The data payloads are out of band... they are not transferred as a byte stream over the syslink 'pipe' but instead are implemented separately. So, for example, the OS syslink API can opt to memory-map the data buffers between sender and receiver if it wishes, and a truly remote syslink can choose to transport the data in-band or via a separate connection or something of that sort.

    This allows the basic message stream to be byte oriented (aka copied rather than mapped) without imposing copying overhead on the stuff that truly matters, that being the related data.

    Syslink kinda sounds like mach messaging but it isn't, really. It is designed with an eventual use as the primary form of communication between hosts in a cluster. Initially a userland VFS interface will be implemented using it with the intent of producing an extremely robust result. We need to be able to implement filesystems in userland and yet still guarantee that nothing bad happens if the related userland process is killed.
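The bullet points above can be illustrated with a toy model. The Python sketch below mimics the described semantics only (formalized headers, tracked message IDs, synthesized error replies when the pipe breaks); the header layout, field names, and error code are invented for illustration and are emphatically not the real syslink wire format:

```python
# Toy message transport with syslink-like semantics: every command is
# framed with a fixed, endian-explicit header, the transport tracks
# outstanding messages, and a broken pipe causes the transport itself
# to synthesize an error reply for each unanswered message.
import struct

HDR = struct.Struct("<IIi")   # (msgid, payload_len, error), little-endian
EPIPE_LOST = 32               # invented error code for synthesized replies

class Transport:
    def __init__(self):
        self.pending = {}     # msgid -> command bytes, awaiting a reply
        self.next_id = 1

    def send(self, command):
        """Frame a command message; track it until a reply arrives."""
        msgid, self.next_id = self.next_id, self.next_id + 1
        self.pending[msgid] = command
        return HDR.pack(msgid, len(command), 0) + command

    def reply(self, msgid, error=0):
        """Answer an outstanding message, clearing it from the table."""
        self.pending.pop(msgid)
        return HDR.pack(msgid, 0, error)

    def pipe_broken(self):
        """Synthesize an error reply for every unanswered message."""
        return [self.reply(m, EPIPE_LOST) for m in list(self.pending)]
```

The guarantee the last paragraph describes falls out of `pipe_broken`: a userland filesystem process dying cannot strand the kernel waiting forever, because every outstanding message gets an error reply.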

DragonFly 2.0:
JA: What major plans do you have for the next version of DragonFly?

Matthew Dillon: The cluster filesystem is the major goal for 2.0.

JA: Are you aware of any other major projects that will be focused on by other DragonFly developers for the 2.0 release?

Matthew Dillon: I am hoping we'll get some progress on 64 bit cpu support.

JA: Referring back to all of the technologies that you've described, where did you learn to design and implement this?

Matthew Dillon: That's hard to say. I tend to soak up everything around me. Many concepts formulated for the new filesystem are based on work I did designing and building a database at a startup a few years ago. This was known as the Backplane Database and it was a really nicely engineered piece of work with multi-master quorum operation, historical queries, and non-queued streaming backups.

BSD License:
JA: How important to you is it that your code is released under the BSD license?

Matthew Dillon: Unbelievably important. I have never subscribed to the almost religious fervor surrounding the GPL, in particular I do not like the idea of trying to impose the concept of freedom on people by attaching strings. The GPL has created a misguided sense of self importance in the open source world.

Simply getting openly specified software and algorithms into the mainstream has a far larger effect than any license. BSD conforms more to the concept of pure invention. More importantly, in large collaborative projects the BSD license allows the individual authors to use both the project as a whole and bits and pieces of collaborative work they have contributed to no matter where their life takes them, including into commercial settings and even proprietary commercial settings.

BSD is a way of saying that we are not so greedy that we have to hog-tie anyone else who wants to use and profit from our work. Or, in another sense, BSD is a way of confirming that actually making money from an open-source project is a very rare event and some of us aren't really interested in that aspect of the work.

Frankly it is not so easy to 'steal' open source projects as people seem to think. The BSD license acknowledges this fact while also acknowledging and even supporting both commercial use and the occasional commercial proprietization of project code. In a sense, it doesn't really matter whether code is proprietized or not because short of rewriting it completely any commercial success (take Apple's use of BSD and Mach for example) will inherently force that commercial entity into the use of a great deal of openly specified protocols. Just because they can add little proprietary bits and pieces here and there does not change the fact that 95% of their work base will not be proprietary, so the goal of forcing the world into using more open standards, something I *DO* want, is achieved just as well with BSD as it is with GPL.

It is really unfortunate that the fanatics don't realize this. They hold up few and far-between examples of so-called 'stealing' and the so-called protection that the GPL affords against such 'stealing' without any real understanding of what is actually accomplished. There is very little difference between the concept of 'integration' and the concept of 'stealing' in the open-source world. They are more like shades of grey.

If I were to write a large proprietary commercial application that happens to run on Linux (and many such examples exist), the integrated result is for all intents and purposes a black box, GPL or not. And yet, even in that black box a staggering number of open standards are going to be put into play by virtue of the use of open-source, and the use of such standards has a snowballing effect that in almost all cases prevents any significant proprietization over the long term, regardless of the license. And even when proprietization does occur (take Microsoft's stupid extensions to Kerberos for example) it is questionable whether such proprietization actually helps the commercial entity doing it versus the black box nature that their product already is, and it certainly has no significant effect on the open-source world.

From my point of view, this means that the GPL basically just devolves down into, in effect, giving a project protection from competition if the project wishes to go commercial. MySQL is a good example. As people have realized, just because the base code is free doesn't mean that anyone can continue to maintain and develop it. Using the BSD license is basically saying that one has no serious monetary interest in any of the work derived from that project, and that one has no interest in imposing strings on people who might want to use the work.

Personal Life:
JA: Briefly digressing from kernel development and referring to comments you made in our earlier interview, are you still snow boarding?

Matthew Dillon: Yes! Not taking big jumps any more, though.

JA: Did you finish the row boat you were working on?

Matthew Dillon: Yup, we sure did. That was way back in 2002.

JA: Are you still living and working in Berkeley?

Matthew Dillon: Yup.

JA: I was digging through some old newsgroup mailing lists, and found a post from 1994 in which you compare Linux to BSD, claiming that you prefer Linux and found BSD "well, stuffy". The thread is dated April 1st; was it a serious post?

Matthew Dillon: 1994! Looks serious to me. And look! The very next posting was Jordan Hubbard responding. I worked on and used the Linux kernel while still living up in the Tahoe area, before starting BEST Internet (as people may remember, we used BSDi at BEST, then switched to SGI, then switched to FreeBSD). It was a long time ago, before I really got back into BSD. I was still heavily involved with the Amiga at that time too. Picture me sitting in a cubby hole in a large room (more like the entire floor of a house) stacked to the brim with electronics, testing equipment, technical reference books, with an Amiga 3000 sitting on the desk and a Linux box with 80 individual wires (my version of a SCSI cable) hanging out from it going to an external 80 megabyte full height Seagate hard drive.

Of course, all my points about programs making assumptions about BSDisms are now true of Linux. Programs make tons of assumptions about Linuxisms these days.

I am rather amused that I made the observation of the stuffiness of the BSD projects way back then. It turned out to be one of my chief complaints about those projects over the years.

JA: Thank you for all your time!

Dragonfly SMP performance

Anonymous (not verified)
August 6, 2007 - 3:02pm

Matt's answer to the question about SMP performance is a bit disingenuous, since he does know the answer to the question even though he has not done any benchmarking himself. Performance benchmarks of DragonFly on SMP hardware show that it has zero SMP scaling, i.e. it is unable to make use of multiple CPUs. This is because effectively the entire kernel remains under a global lock.


Anonymous (not verified)
August 6, 2007 - 9:39pm

Although I'm not one to question the project, this seems a bit strange to me.
AFAIK, one of the big reasons for starting DFBSD was that he disagreed with
the direction of the SMPng work. He then went on to implement his own weird and
wonderful SMP synchronization paradigm that was widely touted (among the DFBSD
camp) as being a simpler solution to the problem and would eventually rival or
beat Linux's scalability.

That's fine. Very ambitious, but utmost respect for trying something new.

However, I would have thought that after all that work has been done, SMP
scalability *should* be one of the top priorities by now, if nothing else
than to begin to validate the synchronization approach used. Zero SMP scaling
on MySQL seems incredible, even if it isn't the most scalable database around.

It's not as strange as it

Anonymous (not verified)
August 7, 2007 - 1:38am

It's not as strange as it might seem: even though some of the subsystems have been made MP-safe and don't require the Big Giant Lock any more, some key subsystems still do. Until they are made MP-safe you will not see any big performance gains on SMP systems. The reason they are not MP-safe is simply that the team has not prioritised making it so.

How is this not strange? If

Anonymous (not verified)
August 7, 2007 - 2:07am

How is this not strange?

If GP is right, they forked because of their totally superior ways of doing SMP, but then don't put any priority on getting SMP working. Huh??

Apparently you're just parroting without thinking, because not prioritising SMP *is* the strange thing.

It is not strange.

Joerg Sonnenberger (not verified)
August 7, 2007 - 7:48am

Do you have any idea how an operating system kernel works? Besides some very trivial system calls, you can't just take a subsystem out of the Giant Lock and get measurable speed improvements.

Think about it a different way: if the VFS layer, for example, is MP-safe, but a specific filesystem is not, you might end up having to take the Giant Lock anyway. Pushing the lock down may or may not help, but you can do the conversion work without anyone actually seeing a change.

For the network stack, Jeffrey Hsu did tests for MP scalability using some pretty aggressive patches. That is not something you want in a production environment though.

What do you get from prioritising SMP? I dare say "not much", because there are many areas where huge improvements can be found without as much work. The work still happens, but in the background and over time.

It is

Anonymous (not verified)
August 7, 2007 - 6:59pm

Linux most definitely did get incremental improvements all the way from
2.0 where everything was under a single kernel lock, to 2.6, where some
things (eg. some filesystems, parts of the VFS, some ioctls, tty layer,
etc) are still under the "big kernel lock".

What do you get from prioritising SMP, you ask? To start with, you get
to utilize modern hardware efficiently; secondly, you validate what is
one of the primary and most revolutionary aspects of the system.

Distributed systems aren't anything new. Distributed filesystems, SSI over
the network (in various forms ranging from quasi-SSI in userspace all the way
to full cache-coherent global shared memory in software) are nothing new.
They were done over 15 years ago. What I find interesting is the DFBSD
approach of doing multiprocessor synchronisation that is supposedly so
much better than anything else around.

No offence, but I really

Anonymous (not verified)
August 7, 2007 - 11:18pm

No offence, but I really think you are a functional illiterate.

Again: First they forked BECAUSE of SMP and afterwards they DON'T CARE about SMP. That IS strange. All your ramblings about how hard and useless SMP is are completely beside the point.

BTW: a poster further down explained the situation very well. Next time you might try to understand the question at hand and not just ramble on random stuff trying to look smart.

Good, scalable and simple

Petr Janda (not verified)
August 7, 2007 - 8:24am

Good, scalable and simple SMP is indeed one of the goals, and one of the reasons FreeBSD 4.8 was forked. However, it's not really an issue that the kernel runs under the Big Giant Lock at this stage, because the project's absolute main goal is clustering; SMP is just one part of it. DF devs have been working from the ground up, not from the top down, and there's so much to do in every subsystem that it just takes a (sometimes long) while. I for one would love to see, e.g., the network stack running without the BGL for 2.0 as well, but everything has its time, I'd say.


Gergo Szakal (not verified)
August 7, 2007 - 10:03am

What most people fail to realize is that this is rather a research project. It indeed misses some features because they just aren't top priority, but if one finds that uncomfortable, there are plenty of other choices out there.
I still like it because pkgsrc is the most logical package manager I have ever come across, the system is very comfortable to use and usually rock-solid. I run DF on my production *NIX boxes and it can always fully accomplish the required tasks, be it an HTTP server, a Samba fileserver or a firewall. The developers are very helpful and you can usually get an answer on the mailing lists quickly if you have a problem.
Regarding SMP and other 'features': I find it rather hilarious that one cannot use the 'world's most secure OS' as a truly secure WiFi client -- especially if I compare the ego factors and the marketing.

Re: features - "World's most secure OS" + WiFi

Roo (not verified)
October 17, 2007 - 12:53am

If you are referring to OpenBSD & WPA... Agreed, it seems a bit odd that they didn't have WPA support, though last time I checked (6+ months ago) they were working on it. That said, WPA has been shown to have some vulnerabilities too... For the record, it is possible to have OpenBSD as a secure WiFi client, but by definition it won't be talking WPA. ;)

No MP-safe subsystems

Anonymous (not verified)
August 7, 2007 - 11:19am

There are a few trivial syscalls (like getpid) that do not require the big giant lock, but no subsystems. Other commenters are correct that there was a big about-face in the direction of the dragonfly project: the main initial goal was to develop a better SMP system because Matt didn't like the direction FreeBSD 5 was pursuing.

He gave up on this when it became clear what a difficult task this was (FreeBSD has taken 7 years and probably a couple of man-centuries of work to get to where it is today, and they've only just started getting the really good payoffs from that work). At this point "single system image clustering" became the new focus of the dragonfly project and SMP dropped from the radar. There has been no concrete progress on SMP for a couple of years now.

In one sense this is fine: Matt has found a largely unexplored niche in the free OS community and it makes sense to work in a direction orthogonal to other projects. However, it would serve him and his project much better if he was honest about things like SMP, namely that they have stopped working on their original project goals and decided to work in other areas instead.


renoX-ag (not verified)
August 7, 2007 - 1:53pm

I totally agree with your point about communication. In my mind DragonFly was the FreeBSD fork which wanted to use a 'better' way for SMP.
Then I read about the focus on SSI, which I found strange as it is not really related. And now I discover that SMP is not really in a good state for many subsystems.

It's a bit disappointing.

All the more since, for me, SSI == Plan9, so I can't help but wonder if it's not a case of NIH to invent a new SSI OS with a FreeBSD 'base' instead of porting/adapting Plan9 onto the FreeBSD base: have any comparisons been made between DragonFly and Plan9?

Part of the issue is that

James Frazer (not verified)
August 7, 2007 - 2:01pm

Part of the issue is that there is only so much time, and the small group of core devs can't do everything. It's my understanding that the key infrastructure for SMP is in place, it's just that someone has to go through the code and lock stuff up.

I don't think he 'gave up' because it was difficult -- the hardest stuff is supposedly done -- he just moved on to something he finds more interesting.

The goals haven't changed -- the clustering was always THE major goal... SMP is still an important goal, it's just waiting for someone to do it.

"Locking stuff up" is the *hard* part

Anonymous (not verified)
August 7, 2007 - 2:22pm

"Locking stuff up" is the *hard* part. Coming up with some low-level primitives is the easy part, and that's most of what has been done. Figuring out a sensible way to use those primitives in high-level kernel code, developing and applying debugging tools to identify and fix the inevitable bugs, dealing with all the difficult corner cases - that is what takes serious time and effort.

There was some work by Jeffrey Hsu on parallelizing some parts of the network stack, but even that remains incomplete and he has not committed actively to that code for a year or more. That is the only high-level SMP work on dragonfly of which I am aware.

There has been no work done on virtual memory, file systems, device drivers, nontrivial system calls, etc. Those are the important services provided by a kernel, and those are what are required before dragonfly can expect to see any performance benefits from a second CPU on workloads that involve the kernel.

By any realistic estimate, even if Matt started work on this today and focused on it exclusively, it will be several years before dragonfly could expect to catch up to where other OSes are today.

Also, your claim that "clustering was always THE major goal" is revisionist and historically false. You probably haven't been around since the beginning of the project, but Matt only got the "clustering" idea later on, some time after launching the project.

I believe Jeff got a new

James Frazer (not verified)
August 7, 2007 - 7:37pm

I believe Jeff got a new job/moved/etc and had to allocate his time elsewhere, but I can't remember exactly.

The goal of the new locking primitives was to make locking easier so as to lower the bar, so that more devs could help out with it -- essentially bypassing some of the complex problems other systems were experiencing at the time.

Admittedly clustering wasn't in the initial announcement (looking back on it now), but the plan did pop up early enough that I can't see what difference it makes. FYI I've been around from the beginning, as archives of the mailing lists should support.

Things haven't ended up where they were initially intended to -- this is life -- I'm not rich and retired on a tropical island yet, but that doesn't mean I don't want to be.

Great theory, but...

Anonymous (not verified)
August 7, 2007 - 10:25pm

...if your approach is so much simpler and less prone to development problems, why hasn't anyone demonstrated this by actually locking a kernel subsystem and proving the benefits? You could even start with an easy one, just do *something* to put some weight behind the claim.

Until that happens, these repeated claims that "the dragonfly approach is soooo much better, trust us!" are just empty boastful words. I think most people look upon such claims with skepticism given the lack of proof, and quite frankly it has earned dragonfly a somewhat comical reputation outside the project.

P.S. The reason it is important to be factual and not revisionist is because you damage the case you are trying to make when your incorrect assertions are exposed. "The goals haven't changed -- the clustering was always THE major goal". When it is pointed out that this is false, it makes you look like a fanboy who will defend whatever is the current line at the expense of telling the truth.

You and your fellow dragonfly supporters will do a lot more good for the project if you honestly acknowledge your weaknesses and how your project's goals have changed over time. The kind of developers you need to attract to start making progress are usually a pretty savvy lot, and they will smell bullshit when they're told made-up things like "SMP support is coming along great, in fact most of the hard work is finished!".

"He who fights with monsters

Anonymous (not verified)
August 8, 2007 - 12:27am

"He who fights with monsters might take care lest he thereby become a monster."

The hard part...

Anonymous (not verified)
August 16, 2007 - 4:10am

""Locking stuff up" is the *hard* part. Coming up with some low-level primitives is the easy part, and that's most of what has been done."

That isn't entirely true; the DragonFly devs have done more than merely come up with primitives. Much of the code in the kernel has been rearranged in such a way as to minimize the amount of locking needed to make the system MP-safe, by serializing tasks into per-CPU threads and by replicating subsystems (like Jeffrey Hsu's network stack) across multiple CPUs.

This work represents the core features of DragonFly's multiprocessing architecture, and *is* actually largely done for good chunks of the kernel. However, all subsystems in the kernel still make calls to functions that are *not* MP safe, requiring one to grab the MP lock. Work to lock up those bits scattered around the kernel to allow GIANT to go away is ongoing, but obviously slower than many people would like.
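The per-CPU serialization idea mentioned above can be illustrated with a toy sketch (in Python, purely to show the shape of the technique; the class and names are my own invention, not DragonFly code): each "CPU" thread exclusively owns one shard of the state, and work is dispatched to the owning thread by message, so no locks are needed inside a shard.

```python
import queue
import threading

NCPUS = 2


class PerCPUSubsystem:
    """Each 'cpu' thread owns a shard of the state. Because only the
    owner thread ever touches its shard, no locking is needed there;
    other threads communicate with it via a message queue instead."""

    def __init__(self):
        self.queues = [queue.Queue() for _ in range(NCPUS)]
        self.state = [dict() for _ in range(NCPUS)]  # per-cpu, unshared
        for cpu in range(NCPUS):
            threading.Thread(target=self._loop, args=(cpu,),
                             daemon=True).start()

    def _loop(self, cpu):
        while True:
            key, val, done = self.queues[cpu].get()
            self.state[cpu][key] = val  # lock-free: single owner thread
            done.set()

    def put(self, key, val):
        cpu = hash(key) % NCPUS  # e.g. hash of a connection's 4-tuple
        done = threading.Event()
        self.queues[cpu].put((key, val, done))
        done.wait()  # block until the owner thread has processed it
```

The dispatch-by-hash step is how a replicated subsystem (such as a network stack) can keep each connection pinned to one CPU for its whole lifetime.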

Consider also that the work is being done by far fewer developers than FreeBSD or Linux has, and decent progress is still being made in rearchitecting a kernel that was never intended to run on multiprocessor systems. Consider also that DragonFly's network stack was closer to being MP-safe before FreeBSD 5's was (tho FreeBSD is farther along now).

"There has been no work done on virtual memory, file systems, device drivers, nontrivial system calls, etc. Those are the important services provided by a kernel, and those are what are required before dragonfly can expect to see any performance benefits from a second CPU on workloads that involve the kernel."

There certainly has been work on those other subsystems; slow but continual progress has been made in network drivers and the VFS, tho you are correct to point out that the VFS and filesystems are not nearly as developed as in FreeBSD 6 or modern Linux. IIRC, the VM subsystem does still need considerable work.

"Also, your claim that "clustering was always THE major goal" is revisionist and historically false. You probably haven't been around since the beginning of the project, but Matt only got the "clustering" idea later on, some time after launching the project."

I have been following the progress of DragonFly since it was first announced on the FreeBSD mailing lists, and while I don't recall clustering being initially mentioned, it became the Project's goal *very* early on (well before the first release) while much of the restructuring of the kernel was being done.

At any rate, things are progressing, tho some parts more slowly than many folks would like. I have high hopes for DF's SMP capabilities in the future, but when dealing with a small project like DF, you have to learn to be patient.

That's a very, very small test

Anonymous (not verified)
August 7, 2007 - 11:46am

MySQL represents one example of a threaded workload with lots of shared resources (and a ton of kernel contention; MySQL *loves* its syscalls); it's not really a good test of a system where the default threading library is still a single process userspace implementation and proper threading is just getting started.

Certainly people used FreeBSD 4 in SMP environments and got a reasonable amount of use out of multiple CPUs, but those were invariably cases using process-based concurrency, since that's all there was.

One skewed benchmark does not make an accurate assessment of the performance of a system.

OK, so where are your numbers?

Anonymous (not verified)
August 7, 2007 - 2:10pm

If you think this benchmark is unfair, where are your alternative benchmarks showing areas where dragonfly performs well on some - or indeed any - benchmark?

I follow the dragonfly mailing lists and there have been none posted. In fact if you read the followup to the link in the original comment, Matt admits that there is no kernel workload that will make use of multiple processors because all kernel subsystems remain giant locked.


Anonymous (not verified)
August 7, 2007 - 2:26pm

Also, if you read the original link it *does* test the alternative kernel-based threading library and finds almost no performance benefit. This is for the same reason: everything is excluded by the big giant lock so even true independent threads cannot do work in the kernel at the same time.

Yes, I can read thank you

Anonymous (not verified)
August 8, 2007 - 11:09am

I'm not claiming anything about DfBSD's performance; I'm just saying that the linked-to benchmark is extremely limited and that you can't just point to it and say "all workloads will scale this badly". That might be the case, but it's not sufficient evidence. Indeed, given that FreeBSD 4 didn't scale *that* badly with many multiprocess workloads, it's likely entirely false; plenty of things do not involve large quantities of kernel contention.

I *know* both threading libraries were tested, that's why I said "default threading library"; the kernel threading isn't default yet, hence kernel supported threading is "just getting started".

I don't have numbers, I have no real interest in making them (especially not in an OS I don't use), but I know from experience that Giant doesn't completely remove any hope of making use of multiple processors in all possible workloads, unless DfBSD's actually got *worse* since it was forked.

There is no such thing as a

Anonymous (not verified)
August 6, 2007 - 10:16pm

There is no such thing as a "core committer" or a "core developer". There is a core team (the people "in charge" of the project), and there are committers (developers with commit privileges). There are no core committers however and Matt was never part of the core team.

I'm not sure, but I believe

Anonymous (not verified)
August 7, 2007 - 1:40am

I'm not sure, but I believe that core in this case refers to the core of the OS (the VM subsystem in this case).

FreeBSD core team and VM

John S. Dyson (not verified)
August 30, 2007 - 1:30am

Actually, most of the FreeBSD VM work at the time of the split between DFLY and FreeBSD had still been done by myself and David Greenman -- even though the split occurred well after my resignation from FreeBSD. Since I left, there has been significant cleanup of the VM code by various individuals (esp. FBSD Alan Cox), but the basic VM algorithms and philosophy for page selection, paging I/O, and fixing/upgrading the Mach VM for good traditional VM performance were done by David Greenman, me, and several other contributors (later on, Matt certainly helped). The algorithms that gave FreeBSD good paging performance and good page-out page selection were developed primarily by yours truly. (Refer to the commit history and file copyrights in the FBSD tree.)

It just so happens that as part of the team of two (plus several others) who developed/fixed the FreeBSD VM code, wrote the VM/Buffer cache code, wrote the new pipe code, and improved the pmap performance for the X86 platform, both DG and myself were also core team members of FreeBSD. Even today, FreeBSD still tends to avoid the periodic VM&Buffer cache caused sluggishness-under-load that still persists on some other OSes.

Admittedly, FreeBSD is (and always was) far from perfect, but the various developers (including the FreeBSD Alan Cox) have continued to improve the FreeBSD kernel -- often significantly improving or rewriting code that I had written or significantly modified.

Part of my frustration that helped motivate my 'walking away' from FreeBSD was related to the overwhelming SMP design issues and the fact that I couldn't see any efficient or expedient way to solve the numerous SMP issues in the FreeBSD kernel.

John Dyson

To clarify further

Anonymous (not verified)
August 7, 2007 - 10:52am

Most top FreeBSD developers have never been part of core. Core is more an administrative thing.

The Highly Available Clustering Filesystem (HACFS)

Ricard (not verified)
August 7, 2007 - 3:03am

simply mount the filesystem 'as of' a certain date to get a snapshot as-of that date

Will this mount be with write permissions? If so, then it could be seen as creating a branch of the FS, right? Will it then be possible to merge branches?

This would give really interesting possibilities. I can think of using two snapshots of the same FS, one at home and one at work, and merging them every now and then. Kind of like with rsync.

Although I guess conflicts would be haaard to solve.

Re: Snapshots

Matt Dillon (not verified)
August 7, 2007 - 1:44pm

No, snapshots are not separate branches in the current design, and probably never will be.

The basic design is that the filesystem is a synthesis of records associated with filesystem objects. Each record has a creation and deletion transaction id (which doubles as a timestamp). A snapshot simply means that we ignore all records with a creation stamp greater than the snapshot stamp or a deletion stamp less than the snapshot stamp.
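That visibility rule can be written down almost verbatim; here is a minimal sketch in Python (the field names are my own invention, not the actual on-disk layout):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: str
    data: str
    create_ts: int                    # creation transaction id / timestamp
    delete_ts: Optional[int] = None   # None means the record is still live


def visible(rec: Record, snap_ts: int) -> bool:
    """A record belongs to an 'as-of' snapshot unless it was created
    after the snapshot stamp or deleted before it."""
    if rec.create_ts > snap_ts:
        return False
    if rec.delete_ts is not None and rec.delete_ts < snap_ts:
        return False
    return True


def snapshot(records, snap_ts):
    return [r for r in records if visible(r, snap_ts)]
```

Mounting the filesystem "as of" a date is then just running every lookup through this filter with the chosen transaction id, with no separate snapshot data structure required.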

The general design is abstracted as a series of records appended to the object. Hence the concept of doing a backup, or more particularly doing a live update stream to the mirrors (with an out-of-band coherency scheme to prevent conflicts), is nothing more than iterating through a linear sequence of records. Thus no out-of-band queueing mechanism is required and slow offsite links do not interfere with operation.
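Because the stream is strictly append-only, the mirror update described here reduces to replaying everything past the mirror's high-water mark; a hedged sketch (my own simplification, ignoring the coherency scheme):

```python
def mirror_sync(master_log, mirror_log):
    """Bring a mirror up to date by iterating the linear record stream
    from the mirror's current end. No out-of-band queueing is needed;
    a slow link just means the mirror lags further behind."""
    start = len(mirror_log)          # mirror's high-water mark
    new_records = master_log[start:]  # everything appended since
    mirror_log.extend(new_records)
    return len(new_records)           # records shipped this pass
```

Each sync pass is stateless apart from the mirror's own length, which is why the master never has to queue anything for a slow or offline mirror.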

Optimization is dealt with in two ways. First, the records are indexed. The indices are throw-away (meaning they can be recreated by iterating the record list), and thus they can be updated asynchronously. Second, often-used data in older records that are far away from newer ones in a complex file object can be reoptimized simply by generating a new record with the information. fsync() is nothing more than waiting for asynchronous I/Os to complete and then updating the linear record append index in a large block header. Since additional appends beyond the scope of the fsync can occur simultaneously, the filesystem doesn't stall.

Data is not necessarily stored in-line with the record so for example a single 2GB write() could be represented by a single record. Similarly small bits of data can be stored in-line with the record, giving us the ability to store small files very compactly. Meta-data associated with a file object is stored as primary data with negative object offsets. The inode itself is just a bit of meta-data and thus subject to the same historical snapshot capability as the rest of the file object.

It is also theoretically possible for unrelated files to reference the same physical data if the data happens to be the same between them, which in turn means that you can 'cp a b' in a split second regardless of how large 'a' is. There are numerous interesting side effects like that which I think will make the filesystem operate very nicely.
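The "'cp a b' in a split second" property follows directly from files being lists of record references rather than the data itself; an illustrative sketch (invented names, nothing like the real on-disk format):

```python
class Store:
    """Toy model: files are lists of block references; the physical
    data lives once in a shared block table."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes (physical data)
        self.files = {}    # file name -> list of block ids
        self.next_id = 0

    def write(self, name, chunks):
        ids = []
        for chunk in chunks:
            self.blocks[self.next_id] = chunk
            ids.append(self.next_id)
            self.next_id += 1
        self.files[name] = ids

    def cp(self, src, dst):
        # 'cp a b' in a split second: copy only the references;
        # both files now point at the same physical data.
        self.files[dst] = list(self.files[src])

    def read(self, name):
        return b"".join(self.blocks[i] for i in self.files[name])
```

The copy's cost depends only on the number of references, not on how large the underlying data is, which is the side effect Matt is pointing at.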


