The modular design of microkernels makes for easier design and debugging, and in some designs gives you the freedom to run in user space services that could only live in privileged space under a monolithic design. But does one want to pay the overhead of all that message passing? Now that we are getting parallel processing at the consumer level with multicore and hyperthreaded chips, maybe the answer is yes.
QNX uses shared memory to pass messages. Its message passing is very lightweight, and the resulting performance is far better than Linux's.
In this day and age, there is no reason to use a macrokernel unless your hardware lacks the features needed for a microkernel. QNX has proved this quite nicely.
So does Mach, and it's slow. I've never seen real-world measurements to suggest that QNX is fast. All we know is that the performance of the OS itself is good, and that's a VERY DIFFERENT measure.
The slow performance is due to a number of problems:
1) Not all MMUs are really suited to this task. Many are slower to set up than just copying the memory around. Sun found the break-even point to be around 5k; below that, it was faster to just copy the memory physically.
2) MMUs/VM are based on pages, typically 2k or 4k. Thus passing in a single 32-bit int parameter drags a whole page along with it. You can tune this out, but it's still annoying.
3) Each copy takes TWO context switches - one to switch into the kernel to copy the memory across ports, another to switch back out to the called program. This means that even the simplest "system calls" are twice as slow as under a monolithic kernel, AT BEST.
4) Additionally, the data has to be examined to see if it contains ports being passed around; if so, they have to be translated, because port identifiers are private to a program (and thus different in the other one).
5) Using mapped memory ignores all the hardware-specific solutions to these problems, many of which are built into modern processors.
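The trade-off in point 1 can be sketched numerically. This is a toy cost model, not real measurements; the page size and the idea of a ~5k break-even come from the points above, but the cost constants and function names are invented for illustration:

```python
# Toy cost model for IPC transfer strategies (illustrative numbers only).
# The fixed MMU setup cost dominates for small messages, so below some
# break-even size a plain memcpy wins -- roughly the ~5k figure
# attributed to Sun above.

PAGE_SIZE = 4096          # a typical VM page (point 2)
MMU_SETUP_COST = 5000     # fixed cost to remap pages (arbitrary units)
COPY_COST_PER_BYTE = 1    # marginal cost of a physical copy

def pages_needed(nbytes):
    """Even a 4-byte int drags in a whole page (point 2)."""
    return max(1, -(-nbytes // PAGE_SIZE))  # ceiling division

def cheapest_strategy(nbytes):
    """Pick copy vs. remap by comparing modeled costs (point 1)."""
    copy_cost = nbytes * COPY_COST_PER_BYTE
    remap_cost = MMU_SETUP_COST  # roughly independent of message size
    return "copy" if copy_cost <= remap_cost else "remap"
```

Under this model a 4-byte argument is cheapest to copy, while a 64k buffer is cheapest to remap, which is the shape of the trade-off described above.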
It's exactly the sort of one-size-fits-all solution that you'd expect from a research project. One that doesn't work in the real world. One that should have been replaced, and was, in L4, Spring, etc.
For instance, Spring included three different IPC systems, each tuned to certain types of data, each used in different ways on different CPUs. The "fast-path" used a half-switch into the kernel by mapping off registers, allowing IPC to degenerate into register passing largely identical to a procedure call. As long as the message fit within the limitations -- 8 registers, no port identifiers, etc. -- it was faster than a traditional Unix trap. These limitations seem serious, but were in fact met by 80% of calls and 60% of returns (you often say "getDiskSector(integer value)", which fits into the fast-path, and get back 2k of data, which wouldn't).

Maury
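The fast-path eligibility test described above can be sketched as follows. The 8-register limit and the no-ports rule are as stated in the comment; the `Message` type and function names are invented for illustration:

```python
# Sketch of Spring-style IPC path selection. Only the two rules quoted
# above are modeled: the message must fit in 8 registers and must carry
# no port identifiers (which would need translation, per point 4).

from dataclasses import dataclass, field

FAST_PATH_REGS = 8  # register limit quoted above

@dataclass
class Message:
    words: list = field(default_factory=list)  # register-sized arguments
    ports: list = field(default_factory=list)  # port identifiers, if any

def pick_ipc_path(msg):
    """Fast path degenerates into register passing; anything larger or
    carrying ports falls back to the full kernel copy/remap path."""
    if len(msg.words) <= FAST_PATH_REGS and not msg.ports:
        return "fast-path"
    return "memory-path"

# A call like getDiskSector(sector_number) has one word and no ports,
# so it qualifies; a 2k reply (512 words) does not.
```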
Your point about simple system calls is pretty wrong. Sure, the simplest system call, one with no real body, will execute twice as fast without the extra context switching, but if the system call is non-trivial, then the context-switch overhead becomes a small percentage of the time. So at worst it's twice as slow, and at best it's just as slow. ;) Such is the folly of microbenchmarking--it doesn't measure system performance, just how fast you can do one thing. The real problem with microkernels is that you
My point was limited to the time for the switching itself. Perhaps I should have been more clear on this.
The "at best" assumes that the foregoing issues don't cause things like cache faults. Passing parameters in registers won't. Thus the performance really can be MUCH worse than twice as slow, even in the case of minimal calls.
But even in that case the real-world performance of Mach is, in fact, much worse than twice as slow. I believe it was Chan that did all the big measurement runs, and - going on memory he
The benchmarks for Mach 4.0 showed it within 20% of the speed of a monolithic kernel of the same era. Check the site for more details, although I seem to recall that the project is no longer in active development.
There is one very easy way to kill a microkernel's performance - force it to use a synchronous system call API (e.g. POSIX). With a synchronous system call API, a context switch is required for every system call. With an asynchronous API, the process simply writes messages into a buffer (or a set of buffers for different kernel services) until it either needs to wait for a response or its quantum expires. At that point, you switch to the next context (perhaps a kernel server) and process the incoming messages. This reduces the total number of context switches (and, more importantly, the number of mode switches). If you want to see good performance from QNX, then use the native system call API, not the POSIX wrapper.
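The batching idea above can be sketched like this. The class and method names are invented for illustration; this models the bookkeeping, not a real kernel interface:

```python
# Sketch of the asynchronous-API idea: queue requests into a per-service
# buffer and pay one (simulated) context switch per batch, instead of
# one per call as a synchronous API would.

class AsyncChannel:
    def __init__(self):
        self.outbox = []    # pending requests, written without switching
        self.switches = 0   # count of simulated context switches

    def call_async(self, request):
        """Record a request; no switch happens yet."""
        self.outbox.append(request)

    def flush(self):
        """One switch delivers the whole batch to the server side.
        Called when the process must wait or its quantum expires."""
        if not self.outbox:
            return []
        self.switches += 1
        delivered, self.outbox = self.outbox, []
        return delivered

# Synchronous API: N calls cost N switches. Asynchronous: N calls, then
# one flush, cost 1 switch.
```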
3) Each copy takes TWO context switches - one to switch into the kernel to copy the memory across ports, another to switch back out to the called program. This means that even the simplest "system calls" are twice as slow as under a monolithic kernel, AT BEST.
Ahem, the point the poster was making was about modern desktop computers going multicore, in which case you cannot discount the possibility that the message will be going from an OS process active on one core to an OS process active on another core. This can p
As someone who actually evaluated and used both Linux and QNX for embedded projects, I can say - not by a long shot. QNX is a very nice real-time operating system, with predictable interrupt timing and all that. But performance-wise it sucks. Too many context switches.
Real-time operating systems have different design criteria than "normal" desktop/server OSes like Linux. In a general-purpose OS you care about performance, which is average-case behavi
Its message passing is very lightweight, and the resulting performance is far better than Linux.
Faster than passing arguments on the stack? Oh wait, we have an option to pass arguments in registers in Linux 2.6. Is QNX message passing faster than passing arguments in registers? Somehow, I doubt it...
QNX is a real-time operating system - its message passing only performs well when there aren't too many different types of messages to pass. The desktop versions of QNX work if you are only doing a couple of things, like browsing and reading email. But if you try to do the things that a Linux distro could easily do, like burning a CD while writing to a USB device and compiling a new kernel and running a dozen windows, it'll choke up: it's NOT suitable for a general-purpose desktop.
You're crazy; an RTOS is designed to handle that kind of load. On a submarine or on your PC. That's what real-time means. Next time XMMS starts stuttering during a compile, remember that you're using a non-RTOS.
Well, there's theory, and then there's practice. An RTOS is generally designed to run one application in real time, not an arbitrary set of apps launched at random.
The modular design of microkernels makes for easier design & debugging, and with some designs the freedom to make user space services that can only be in privileged space in monolithic designs, but does one want to pay the overhead for all that message passing?
No, which is why Apple's XNU runs in one address space for the most part (I don't even know whether there are parts which don't), and most message passing has been reduced to plain function calls. They still have the design advantages of s
Yes, Apple should totally follow the direction of HURD; they are true pioneers in not releasing usable software, and Apple could learn a lot from them.
While many devices are not supported, and the performance is not good, HURD/Mach is feature complete (and most of Debian runs on it, IIRC).
Because the performance was bad, the new HURD effort focuses on reimplementing on L4. Perhaps with a faster microkernel, Apple could have avoided the kludge of an in-kernel BSD peer.
If I am reading correctly, Mach is responsible for IPC in the Apple kernel. It would be interesting to see benchmarks of SYSV system calls to semaphores, queues, and shared memory (and pe
Actually, monolithic kernels will always be faster... in fact, why not make all software monolithic? What I am talking about is running all programs in the kernel address space, with simple function calls to kernel services. That would make the computer much faster, and it can be done.
If the entire operating system were written in a safe language such as Java or C# ("managed" code only) then the performance impact from syscalls, virtual memory (TLB flush/lookup), complicated task switching, and extra copie
You can do something like this with Busybox or BSD's crunchgen. I do embedded development, and these tools essentially statically link everything together into one big program. It is handy, but I have not measured performance with it. The biggest issue is building the binary, since it has to be built for the user's application selections. I'm sure someone could come up with a dynamic loader where each program is a library.
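The Busybox/crunchgen trick described above is dispatch on the name the binary was invoked by (argv[0]), with one symlink per tool. A minimal sketch of the dispatch logic; the applet bodies and names here are stand-ins, not real tool implementations:

```python
# Sketch of a multi-call binary: many "applets" linked into one program,
# selected by the basename the program was invoked as (via symlinks).

import os
import sys

def applet_true(args):
    """Stand-in applet: always succeeds."""
    return 0

def applet_false(args):
    """Stand-in applet: always fails."""
    return 1

APPLETS = {"true": applet_true, "false": applet_false}

def dispatch(argv):
    """Pick the applet from argv[0], e.g. a symlink named 'true'."""
    name = os.path.basename(argv[0])
    applet = APPLETS.get(name)
    if applet is None:
        sys.stderr.write("unknown applet: %s\n" % name)
        return 127  # conventional "command not found" status
    return applet(argv[1:])
```

In the real tools the table is generated at build time from the user's application selections, which is exactly the build-complexity issue mentioned above.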
The language does not really matter. They all evaluate to ones and zeroes anyway.
Well, I'd rather see a pointer-safe (and nil-value-safe) procedural language with structures as the most complicated data representation; building and tearing down objects for all the data structures and messages an OS has to pass around would probably constipate things.
So lots of people will mod this down, since they assume that low-level details like cache lines are more important than, oh, say, free memory management. But I've got some news: with a few minor tweaks you can do all that same low-level stuff in Java or managed C# and get all the benefits of a safe kernel.
You've got my vote. Can you get this finished by the end of the month?
No, but if you have VMware or the right hardware you might want to check out JXOS [jxos.org], an all-Java operating system. It was done by, I think, 2 people in their spare time, and it runs at about 50% of Linux's speed on an actual benchmark (an NFS filesystem). That's pretty good for two guys writing everything from scratch.
Check out this recent video discussing Microsoft's "Singularity" research project, which works in this way. <URL:http://channel9.msdn.com/ShowPost.aspx?PostID=68302>
And this is surprising to people because...? C'mon, anyone who's been around the block a few times KNOWS that good design does not _necessarily_ equal good performance.
With most software, most folks design to achieve functional goals, then think of performance AFTER the fact. You see it again and again. Here's a blast from the past. Most folks won't remember that between Win NT 3.1 -> Win NT 3.51 -> Win NT 4.0 -> Win NT 5 (otherwise known as Win2K), Microsoft progressively came to realize that having to transition from GDI user-mo
The modular design of microkernels makes for easier design & debugging, and with some designs the freedom to make user space services that can only be in privileged space in monolithic designs, [...]
That's the theory, but in the actual practice of microkernel designs, what you end up having to do is move most of those "user space services" into kernel space for decent performance, as NT and OS X do.
I assume that they at least keep the stuff that would ideally run in user space logically separated fro
HURD abandoned Mach because of performance issues and is being reimplemented on L4 [l4ka.org].
If Apple had chosen L4, would it have been necessary from a performance perspective to include BSD at a peer level with the microkernel?
Is it now far too late for Apple to dump Mach?