PANIKRAUM mit PUMPGUN

A while ago I started development of a special branch of PulseAudio which is called glitch-free. In a few days I will merge it back to PulseAudio trunk, and eventually release it as 0.9.11. I think it's time to explain a little what all this "glitch-freeness" is about, what made it so tricky to implement, and why this is totally awesome technology. So, here we go:

Traditional Playback Model

Traditionally on most operating systems audio is scheduled via sound card interrupts (IRQs). When an application opens a sound card for playback it configures it for a fixed size playback buffer. Then it fills this buffer with digital PCM sample data. And after that it tells the hardware to start playback. Then, the hardware reads the samples from the buffer, one at a time, and passes them on to the DAC so that eventually they reach the speakers.

After a certain number of samples played the sound hardware generates an interrupt. This interrupt is forwarded to the application. On Linux/Unix this is done via poll()/select(), which the application uses to sleep on the sound card file descriptor. When the application is notified via this interrupt it overwrites the samples that were just played by the hardware with new data and goes to sleep again. When the next interrupt arrives the next block of samples is overwritten, and so on and so on. When the hardware reaches the end of the hardware buffer it starts from its beginning again, in a true ring buffer fashion. This goes on and on and on.

The number of samples after which an interrupt is generated is usually called a fragment (ALSA likes to call the same thing a period for some reason). The number of fragments the entire playback buffer is split into is usually integral and usually a power of two, 2 and 4 being the most frequently used values.

Image 1: Schematic overview of the playback buffer in the traditional playback model, in the best way the author can visualize this with his limited drawing abilities.

If the application is not quick enough to fill up the hardware buffer again after an interrupt we get a buffer underrun ("drop-out"). An underrun is clearly audible to the user as a discontinuity in audio, which is something we clearly don't want. We thus have to carefully make sure that the buffer and fragment sizes are chosen in a way that the software has enough time to calculate the data that needs to be played, and the OS has enough time to forward the interrupt from the hardware to the userspace software and the write request back to the hardware.

Depending on the requirements of the application the size of the playback buffer is chosen. It can be as small as 4ms for low-latency applications (such as music synthesizers), or as long as 2s for applications where latency doesn't matter (such as music players). The hardware buffer size directly translates to the latency that the playback adds to the system. The smaller the fragment sizes the application configures, the more time the application has to fill up the playback buffer again.

Let's formalize this a bit: Let BUF_SIZE be the size of the hardware playback buffer in samples, FRAG_SIZE the size of one fragment in samples, and NFRAGS the number of fragments the buffer is split into (equivalent to BUF_SIZE divided by FRAG_SIZE), RATE the sampling rate in samples per second. Then, the overall latency is identical to BUF_SIZE/RATE. An interrupt is generated every FRAG_SIZE/RATE. Every time one of those interrupts is generated the application should fill up one fragment again, if it missed one interrupt this might become more than one. If it doesn't miss any interrupt it has (NFRAGS-1)*FRAG_SIZE/RATE time to fulfill the request. If it needs more time than this we'll get an underrun. The fill level of the playback buffer should thus usually oscillate between BUF_SIZE and (NFRAGS-1)*FRAG_SIZE. In case of missed interrupts it might however fall considerably lower, in the worst case to 0 which is, again, an underrun.
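To make the formulas above a bit more tangible, here is a small sketch that simply evaluates them. The class and property names are made up for this illustration, and the numbers (a 4800 sample buffer split into 4 fragments at 48000 samples/s) are just an example, not something any particular hardware dictates.

```python
from dataclasses import dataclass


@dataclass
class BufferMetrics:
    """Buffer metrics of the traditional, interrupt-driven playback model."""
    buf_size: int  # BUF_SIZE: size of the hardware playback buffer, in samples
    nfrags: int    # NFRAGS: number of fragments the buffer is split into
    rate: int      # RATE: sampling rate, in samples per second

    @property
    def frag_size(self) -> int:
        # FRAG_SIZE = BUF_SIZE / NFRAGS (assumed to divide evenly)
        return self.buf_size // self.nfrags

    @property
    def latency(self) -> float:
        # Overall latency: BUF_SIZE / RATE, in seconds
        return self.buf_size / self.rate

    @property
    def interrupt_interval(self) -> float:
        # One interrupt every FRAG_SIZE / RATE seconds
        return self.frag_size / self.rate

    @property
    def refill_deadline(self) -> float:
        # Time available to refill after an interrupt:
        # (NFRAGS - 1) * FRAG_SIZE / RATE, in seconds
        return (self.nfrags - 1) * self.frag_size / self.rate

    @property
    def min_fill(self) -> int:
        # Lower bound of the normal fill level: (NFRAGS - 1) * FRAG_SIZE
        return (self.nfrags - 1) * self.frag_size


m = BufferMetrics(buf_size=4800, nfrags=4, rate=48000)
print(f"latency:            {m.latency * 1000:.1f} ms")             # 100 ms
print(f"interrupt interval: {m.interrupt_interval * 1000:.1f} ms")  # 25 ms
print(f"refill deadline:    {m.refill_deadline * 1000:.1f} ms")     # 75 ms
print(f"fill level between  {m.min_fill} and {m.buf_size} samples")
```

Playing with nfrags here shows the conflict discussed next: more fragments give a longer refill deadline, but also more interrupts per second.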

It is difficult to choose the buffer and fragment sizes in an optimal way for an application:

The buffer size should be as large as possible to minimize the risk of drop-outs.
The buffer size should be as small as possible to guarantee minimal latencies.
The fragment size should be as large as possible to minimize the number of interrupts, and thus the required CPU time used, to maximize the time the CPU can sleep for between interrupts and thus the battery lifetime (i.e. the fewer interrupts are generated the lower your audio app will show up in powertop, and that's what it is all about, right?)
The fragment size should be as small as possible to give the application as much time as possible to fill up the playback buffer, to minimize drop-outs.

As you can easily see it is impossible to choose buffering metrics in a way that they are optimal on all four requirements.

This traditional model has major drawbacks:

The buffering metrics are highly dependent on what the sound hardware can provide. Portable software needs to be able to deal with hardware that can only provide a very limited set of buffer and fragment sizes.
The buffer metrics are configured only once, when the device is opened; they usually cannot be reconfigured during playback without major discontinuities in audio. This is problematic if more than one application wants to output audio at the same time via a sound server (or dmix) and they have different requirements on latency. For these sound servers/dmix the fragment metrics are configured statically in a configuration file, and are the same during the whole lifetime. If a client connects that needs lower latencies, it has basically lost. If a client connects that doesn't need as low latencies, we will continuously burn more CPU/battery than necessary.
It is practically impossible to choose the buffer metrics optimally for your application -- there are too many variables in the equation: you can't know anything about the IRQ/scheduling latencies of the OS/machine your software will be running on; you cannot know how much time it will actually take to produce the audio data that shall be pushed to the audio device (unless you start counting cycles, which is a good way to make your code unportable); the scheduling latencies are hugely dependent on the system load on most current OSes (unless you have an RT system, which we generally do not have). As said, for sound servers/dmix it is impossible to know in advance what the requirements on latency are that the applications that might eventually connect will have.
Since the number of fragments is integral and at least 2 on almost all existing hardware we will generate at least two interrupts on each buffer iteration. If we fix the buffer size to 2s then we will generate an interrupt at least every 1s. We'd then have 1s to fill up the buffer again -- on all modern systems this is far more than we'd ever need. It would be much better if we could fix the fragment size to 1.9s, which still gives us 100ms to fill up the playback buffer again, still more than necessary on most systems.

Due to the limitations of this model most current (Linux/Unix) software uses buffer metrics that turned out to "work most of the time", very often they are chosen without much thinking, by copying other people's code, or totally at random.

PulseAudio <= 0.9.10 uses a fragment size of 25ms by default, with four fragments. That means that right now, unless you reconfigure your PulseAudio manually clients will not get latencies lower than 100ms whatever you try, and as long as music is playing you will get 40 interrupts/s. (The relevant configuration options for PulseAudio are default-fragments= and default-fragment-size-msec= in daemon.conf)

dmix uses 16 fragments by default with a size of 21 ms each (on my system at least -- this varies, depending on your hardware). You can't get less than 47 interrupts/s. (You can change the parameters in .asoundrc)

So much for the traditional model and its limitations. Now, we'll have a peek at how the new glitch-free branch of PulseAudio does its things. The technology is not really new. It's inspired by what Vista does these days and what Apple CoreAudio has already been doing for quite a while. However, on Linux this technology is new; we have been lagging behind quite a bit. Also I claim that what PA does now goes beyond what Vista/MacOS does in many ways, though of course, they provide much more than we provide in many other ways. The name glitch-free is inspired by the term Microsoft uses for this model, however I must admit that I am not sure that my definition of this term and theirs actually is the same.

Glitch-Free Playback Model

The first basic idea of the glitch-free playback model (a better, less marketingy name is probably timer-based audio scheduling which is the term I internally use in the PA codebase) is to no longer depend on sound card interrupts to schedule audio but use system timers instead. System timers are far more flexible than the fragment-based sound card timers. They can be reconfigured at any time, and have a granularity that is independent from any buffer metrics of the sound card. The second basic idea is to use playback buffers that are as large as possible, up to a limit of 2s or 5s. The third basic idea is to allow rewriting of the hardware buffer at any time. This allows instant reaction on user-input (e.g. pause/seek requests in your music player, or instant event sounds) although the huge latency imposed by the hardware playback buffer would suggest otherwise.

PA configures the audio hardware to the largest playback buffer size possible, up to 2s. The sound card interrupts are disabled as far as possible (most of the time this means to simply lower NFRAGS to the minimal value supported by the hardware. It would be great if ALSA would allow us to disable sound card interrupts entirely). Then, PA constantly determines what the minimal latency requirement of all connected clients is. If no client specified any requirements we fill up the whole buffer all the time, i.e. have an actual latency of 2s. However, if some applications specified requirements, we take the lowest one and only use as much of the configured hardware buffer as this value allows us. In practice, this means we only partially fill the buffer each time we wake up. Then, we configure a system timer to wake us up 10ms before the buffer would run empty and fill it up again then. If the overall latency is configured to less than 10ms we wake up after half the latency requested.
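The decision described in the paragraph above can be sketched in a few lines. This is not the actual PA code, only an illustration: the function names and constants are invented here, and treating a latency of exactly 10ms like the "less than 10ms" case is an assumption made so that the sleep time never becomes zero.

```python
from typing import Iterable

HW_BUFFER_MS = 2000.0  # largest hardware buffer we configure (2s, as in the text)
WATERMARK_MS = 10.0    # wake up this long before the buffer would run empty


def target_fill_ms(client_latencies_ms: Iterable[float],
                   hw_buffer_ms: float = HW_BUFFER_MS) -> float:
    """How much of the hardware buffer to keep filled, in milliseconds."""
    requested = list(client_latencies_ms)
    if not requested:
        # No client stated a requirement: use the whole buffer.
        return hw_buffer_ms
    # Otherwise the lowest requirement wins, capped by what the hardware offers.
    return min(min(requested), hw_buffer_ms)


def sleep_time_ms(fill_ms: float, watermark_ms: float = WATERMARK_MS) -> float:
    """How long the system timer may sleep before the next refill."""
    if fill_ms <= watermark_ms:
        # Very low latency requested: wake up after half of it.
        # (Assumption: the boundary case fill_ms == watermark_ms lands here too.)
        return fill_ms / 2.0
    return fill_ms - watermark_ms


for clients in ([], [200.0, 50.0], [8.0]):
    fill = target_fill_ms(clients)
    print(f"clients={clients}: fill {fill} ms, sleep {sleep_time_ms(fill)} ms")
```

With no latency requirements the timer fires roughly once every 1990ms; a single client asking for 50ms brings that down to 40ms, without touching the hardware configuration.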

If the sleep time turns out to be too long (i.e. it took more than 10ms to fill up the hardware buffer) we will get an underrun. If this happens we can double the time we wake up before the buffer would run empty, to 20ms, and so on. If we notice that we only used much less than the time we estimated, we can halve this value again. This adaptive scheme makes sure that in the unlikely event of a buffer underrun it will happen most likely only once and never again.
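A minimal sketch of this adaptive scheme follows. The class name, the upper limit and especially the threshold for "much less than the time we estimated" (a quarter of the current value here) are assumptions chosen for illustration; the text does not specify them.

```python
class AdaptiveWatermark:
    """Tracks how long before buffer-empty we want to be woken up."""

    def __init__(self, initial_ms: float = 10.0,
                 minimum_ms: float = 10.0,
                 maximum_ms: float = 2000.0) -> None:
        self.value_ms = initial_ms
        self.minimum_ms = minimum_ms
        self.maximum_ms = maximum_ms  # hypothetical cap: the 2s buffer itself

    def on_underrun(self) -> None:
        # We woke up too late: double the safety margin.
        self.value_ms = min(self.value_ms * 2.0, self.maximum_ms)

    def on_refill(self, time_used_ms: float) -> None:
        # Refilling took far less time than we reserved: halve the margin again.
        # "Far less" is taken to mean less than a quarter (an assumption), so
        # that the halved margin still leaves at least twice the time needed.
        if time_used_ms < self.value_ms / 4.0:
            self.value_ms = max(self.value_ms / 2.0, self.minimum_ms)


w = AdaptiveWatermark()
w.on_underrun()      # 10 ms -> 20 ms
w.on_underrun()      # 20 ms -> 40 ms
w.on_refill(5.0)     # 5 ms < 10 ms, so 40 ms -> 20 ms
w.on_refill(5.0)     # 5 ms is not < 5 ms, stays at 20 ms
print(w.value_ms)
```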

When a new client connects or an existing client disconnects, or when a client wants to rewrite what it already wrote, or the user wants to change the volume of one of the streams, then PA will resample its data passed by the client, convert it to the proper hardware sample type, and remix it with the data of the other clients. This of course makes it necessary to keep a "history" of data of all clients around so that if one client requests a rewrite we have the necessary data around to remix what already was mixed before.
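Why the per-client history is needed can be shown with a toy mixer. The sketch below assumes 16 bit samples and saturated integer mixing, and it leaves out resampling and format conversion entirely; the stream names and sample values are invented for the example.

```python
from typing import Dict, List

INT16_MIN, INT16_MAX = -32768, 32767


def clamp16(value: int) -> int:
    """Saturate a mixed sample to the signed 16 bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def mix(history: Dict[str, List[int]], volumes: Dict[str, float]) -> List[int]:
    """Mix all streams in 'history' sample by sample, applying per-stream volume."""
    length = max((len(samples) for samples in history.values()), default=0)
    mixed = []
    for i in range(length):
        total = 0
        for name, samples in history.items():
            if i < len(samples):
                total += int(samples[i] * volumes.get(name, 1.0))
        mixed.append(clamp16(total))
    return mixed


# Per-client history of what has been written but not yet played.
history = {"music": [30000, -20000, 1000], "event": [10000, -20000, 500]}
volumes = {"music": 1.0, "event": 1.0}

hw_buffer = mix(history, volumes)  # the first two samples saturate

# The "event" client goes away. Subtracting its samples from the mixed
# result does not give the music back where saturation happened ...
subtracted = [m - e for m, e in zip(hw_buffer, history["event"])]

# ... but remixing from the history does.
del history["event"]
remixed = mix(history, volumes)

print(hw_buffer, subtracted, remixed)
```

Wherever the sum was clipped the original contribution of each stream is lost, so the only way to rewrite the hardware buffer correctly is to mix again from the unmixed data.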

The benefits of this model are manifold:

We minimize the overall number of interrupts, down to what the latency requirements of the connected clients allow us. i.e. we save power, don't show up in powertop anymore for normal music playback.
We maximize drop-out safety, because we buffer up to 2s in the usual cases. Only with operating systems which have scheduling latencies > 2s we can still get drop-outs. Thankfully no operating system is that bad.
In the event of an underrun we don't get stuck in it, but instead are able to recover quickly and can make sure it doesn't happen again.
We provide "zero-latency". Each client can rewrite its playback buffer at any time, and this is forwarded to the hardware, even if this means that the sample currently being played needs to be rewritten. This means much quicker reaction to user input, a more responsive user experience.
We become much less dependent on what the sound hardware provides us with. We can configure wakeup times that are independent from the fragment settings that the hardware actually supports.
We can provide almost any latency a client might request, dynamically without reconfiguration, without discontinuities in audio.

Of course, this scheme also comes with major complications:

System timers and sound card timers deviate. On many sound cards by quite a bit. Also, not all sound cards allow the user to query the playback frame index at any time, but only shortly after each IRQ. To compensate for this deviation PA contains a non-trivial algorithm which tries to estimate and follow the deviation over time. If this doesn't work properly it might happen that an underrun happens much earlier than we expected.
System timers on Unix are not very high precision. On traditional Linux with HZ=100 sleep times for timers are rounded up to multiples of 10ms. Only very recent Linux kernels with hrtimers can provide something better, but only on x86 and x86-64 until now. This makes the whole scheme unusable for low latency setups unless you run the very latest Linux. Also, hrtimers are not (yet) exposed in poll()/select(). It requires major jumping through hoops to work around this limitation.
We need to keep a history of sample data for each stream around, thus increasing the memory footprint and potentially the cache pressure. PA tries to work against the increased memory footprint and cache pressure this might cause by doing zero-copy memory management.
We're still dependent on the maximum playback buffer size the sound hardware supports. Many sound cards don't even support 2s, but only 300ms or suchlike.
The rewriting of the client buffers causing rewriting of the hardware buffer complicates the resampling/converting step immensely. In general the code to implement this model is more complex than for the traditional model. Also, ALSA has not really been designed with this model in mind, which makes some things very hard to get right and suboptimal.
Generally, this works reliably only on newest ALSA, newest kernel, newest everything. It has pretty steep requirements on software and sometimes even on hardware. To stay compatible with systems that don't fulfill these requirements we need to carry around code for the traditional playback model as well, increasing the code base by far.
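To illustrate the first point of this list, the deviation between system timers and the sound card clock: the estimation algorithm PA uses is not spelled out here, so the following is only a sketch of the general idea, using a plain least-squares line fit as a stand-in. The function names and the 44144 samples/s "true" rate are invented for the example.

```python
from typing import List, Tuple


def estimate_rate(observations: List[Tuple[float, float]]) -> float:
    """Fit frames_played = rate * system_time + offset and return the rate.

    Each observation is (system time in seconds, frames played so far).
    """
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = sum(t for t, _ in observations) / n
    mean_f = sum(f for _, f in observations) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in observations)
    if var_t == 0:
        raise ValueError("observations must span some time")
    cov = sum((t - mean_t) * (f - mean_f) for t, f in observations)
    return cov / var_t


def seconds_until_empty(frames_queued: float, rate: float) -> float:
    """Time until the queued frames are consumed at the given rate."""
    return frames_queued / rate


NOMINAL_RATE = 44100.0  # what we asked the card for
TRUE_RATE = 44144.0     # what the (hypothetical) card really does

# Pretend we queried the playback position every half second.
observations = [(t, TRUE_RATE * t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]
estimated = estimate_rate(observations)

queued = 2 * NOMINAL_RATE  # "2s" of audio according to the nominal rate
naive = seconds_until_empty(queued, NOMINAL_RATE)
corrected = seconds_until_empty(queued, estimated)
print(f"naive: {naive:.4f}s, corrected: {corrected:.4f}s")
```

With these made-up numbers the buffer runs dry roughly 2ms earlier than the nominal rate suggests, which is already a noticeable share of a 10ms wakeup margin.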

The advantages of the scheme clearly outweigh the complexities it causes. Especially the power-saving features of glitch-free PA should be enough reason for the embedded Linux people to adopt it quickly. Make PA disappear from powertop even if you play music!

The code in the glitch-free branch is still rough and sometimes incomplete. I will merge it shortly into trunk and then upload a snapshot to Rawhide.

I hope this text also explains to the few remaining PA haters a little better why PA is a good thing, and why everyone should have it on his Linux desktop. Of course these changes are not visible on the surface; my hope with this blog story is to explain a bit better why infrastructure matters, and to counter misconceptions about what PA actually is and what it gives you on top of ALSA.

posted at: 19:54 | path: /projects | 58 comments

Posted by Paul Davis at Tue Apr 8 22:36:03 2008
Congratulations on redesigning the internals of CoreAudio :) Apple has at least a couple of papers on the subject. It's an excellent design, much better for USB and ieee1394 devices than the one inspired by PCI-style devices.

Although putting this in PA doesn't strike me as a bad idea at all, to be honest I think your effort would be much better applied to getting ALSA to switch over to this model (which is where it can have the most benefit).

2 technical points. (a) this design adds latency to the output path (because of the need for a gap between the h/w and s/w ptrs; apple calls it the "safety buffer"), which is not always a desirable thing, and therefore can be a step backwards for some use cases (b) you've made an assumption that every h/w device can accurately and reliably tell you where its h/w pointer is. this isn't true, and it's one reason why interrupts are helpful.

Posted by Philip Withnall at Tue Apr 8 22:46:17 2008
Absolutely brilliant work!

Posted by Lennart at Tue Apr 8 23:01:23 2008
Paul: Care to post the links to those Apple papers? I think I read most of them, but I am always interested in more fodder.

I doubt it makes sense to move ALSA over to this model, because it would make ALSA a complex userspace daemon. And quite honestly I don't see why we would need to do this, since PA already is just that.

Regarding your points:

a) Sure it adds latency. However, according to my measurements my code that estimates the deviation of the sound card timer is accurate to a subsample precision (on my HDA PCI hardware that is). Since you should stay away a few samples from the hw read ptr anyway this should be more than good enough. I only tested this on one piece of hw however, and I am not sure what exactly caused Apple to add that extra safety, though.

Also, while I try to offer very low latencies with PA I am not aiming for the low-latency crown. I had to make a few compromises (like not being able to synchronously wait for each client to provide me with audio data on each iteration, which adds a bit of buffering latency) which will certainly increase the latency, for robustness, security and network-transparency reasons.

b) Yep, the interrupt issues are true. The idea is to use the timestamp data from the ALSA timing structure to correlate the sound card clock with the system clock. Unfortunately this doesn't work properly yet, since ALSA lacks support for CLOCK_MONOTONIC. The scheme would then be this: for soundcards which do support sample-granularity for querying the playback time, use it and disable interrupts. For all others set NFRAGS=1 and use the ALSA timestamp. I talked to a few ALSA people about this, but this needs more love.

I am not even sure if this glitch-free model makes any sense for pro-audio stuff at all. In pro-audio you need the most exact timing info you can get, the lowest latency you can get, you know your latency requirements are well known in advance, the user is very technically skilled, you don't care about power consumption and so on. So most of the reasons to use this model don't matter. Right now I am tempted to say that software like JACK should better stay away from this model.

Posted by Stoffe at Wed Apr 9 00:32:58 2008
Fantastic work. What you do for free sound is purely amazing, and there are many more of us cheering you on than wanting you to get off the lawn. :)

Keep it up!

Posted by Paul Davis at Wed Apr 9 02:10:28 2008
I'll try to dig up the Apple refs I have, but I suspect you've read them.

I'm a bit tired to think much right now, but I will note regarding your final point that JACK runs on top of this model on CoreAudio and works extremely well there. I am not sure why you think this has to be done in user-space - CoreAudio does it all in the kernel. I wasn't advocating a user-space daemon in ALSA - I was suggesting (and have been mumbling about it for a year or three) that ALSA should abandon its current interrupt-driven model for the "timer-based" model (in which interrupts are used to drive a DLL that lets you sync with the h/w sample clock). What happens in user-space on top of that is irrelevant (though of course, interesting :)

Great work by the way.

Posted by Paul Davis at Wed Apr 9 02:16:06 2008
Lennart, actually let me just outline my view of how this model works. It's not meant to be a replacement for what you've explained, but it might (just might) shed a different light on the whole system in a way that could be useful. Or not :)

There are two clocks in the system. One of them is a system clock (no matter what its actual h/w origins). The other is the clock that drives audio to/from the interface. If you stop using interrupts as the driving force, the problem of audio i/o reduces to one of knowing the relative position of the s/w pointer and the h/w pointer. If you can establish the correct relationship between the system clock and the audio clock, this problem is essentially solved. Doing that can be done (relatively easily too, and with amazing accuracy) using a DLL that is driven by the interrupts. At each interrupt, you determine the h/w pointer position and the system time, and input the deltas since last time into the DLL. You end up with massively sub-sample accurate prediction of the h/w pointer based on the system time, and if the s/w pointer is driven by system timer events, the issues are solved.

You can do this in the kernel or in user-space.

Posted by Xav at Wed Apr 9 12:16:20 2008
"we configure a system timer to wake us up 10ms before the buffer would run empty and fill it up again then. If the overall latency is configured to less than 10ms we wakeup after half the latency requested."

Curious, I would have set the wakeup to half the requested latency when it reaches 20ms (not 10ms), just to have a continuity in the wakeup size:
- if you request 22ms, you have 10ms
- if you request 20ms, you have 10ms
- if you request 18ms, you have 9ms

Posted by Thomas at Wed Apr 9 13:20:40 2008
When remixing, it should be sufficient to have the previous mix result (which is in the buffer?) and not all previous inputs. Mixing is scaling and adding, and the order of adding should not matter.

Mixing in the new sound could mean a reduction of the other sound levels. If the original and new sound levels are known, this can at least be approximated, unless only a single original signal was supposed to be muted.

Good work!

Posted by Lennart at Wed Apr 9 13:50:59 2008
Thomas, no, this won't work since we do saturated integer mixing. Mixing is thus generally not invertible.

Posted by Lennart at Wed Apr 9 13:55:03 2008
Paul, what you suggest is basically what glitch-free PA does now. pcm_status->tstamp and pcm_status->delay (in conjunction with a user maintained sample counter) contain the necessary information to establish the correct relationship between the system clock and the audio clock.

Posted by Paul Davis at Wed Apr 9 15:41:08 2008
Lennart, yes, they sort of do. What is sad, and what I would love to see changed, is that ALSA could establish the clock-to-clock relationship itself, thus allowing any app to ask "where is the h/w pointer now?" and get an at-least sample accurate answer at any time. This would remove quite a bit of complexity from user space (let's be honest, as great as PA is, we're going to continue to see at least 2 or 4 audio APIs living on for some time), and would also solve a fundamental problem with USB and ieee1394 devices where the interrupt interval is not slaved to the audio sample clock.

Posted by Joe Henley at Wed Apr 9 20:33:50 2008
Will it be implemented as poorly as PA was in FC8?

Posted by Lennart at Wed Apr 9 20:40:33 2008
Troll Henley: No, of course, even poorer. What did you expect? Find some other place to troll! Oh, and of course you filed BZ reports for all issues you found, right?

Posted by William Lovaton at Wed Apr 9 22:17:05 2008
Lennart, you are truly a hero! Congratulations.

I was wondering, isn't Rawhide very close to a final release? are you sure this is Fedora 9 material? If yes, then great! but I'd like to know your point of view about this.

Another question, right now I'm using totem to listen to my music on my up to date Fedora 8 system and powertop shows interrupts from:
- HDA Intel (43 wps)
- totem : schedule_timeout (process_timeout) (21)
- totem : do_nanosleep (hrtimer_wakeup) (20)
- totem : futex_wait (hrtimer_wakeup) (15)

PA is well down with 0.5 -> wps pulseaudio : schedule_timeout (process_timeout).

Reading your post I understand that most of the totem's interrupts will be reduced by a great extent. Am I wrong? Do you have numbers?

Cheers and keep up the good work!

Posted by seringen at Thu Apr 10 07:14:07 2008
hi, i was wondering how portable pulseaudio is for non linux use, or even for oss4 use on linux? i'd be interested on your take about that.

Posted by ethana2 at Thu Apr 10 09:40:36 2008
Keep up the awesome work, guys!
~that is all.

Posted by Lennart at Thu Apr 10 14:28:05 2008
seringen: Older versions of PA have been ported to non-Linux systems (the BSDs, Solaris, Windows). The current code has not, although a lot of glue code is in place and the code should generally be friendly to porters. Patches welcome.

OSS is supported as backend. However, OSS4 tends to be less compatible with the established OSS API ("3") than ALSA. (Yes, this is not a joke) So running PA on OSS should generally work, but YMMV.

Also, I consider OSS to be more like a zombie. Dead but still coming back to haunt people. It would be great if it finally died a silent death. But I guess due to intensified support from the Solaris camp it won't be so easy. I do think that ALSA is by far the more capable system, and while it has issues it still is not as fucked up as OSS, not by a wide margin. (And I say that as someone who knows both APIs very, very well on a technical level, and is not a lame fanboy with no clue)

Anyway, I am a Linux developer, paid to bring Linux forward. I only care about ALSA. Basic OSS support is there. It's not as fancy as the ALSA code, i.e. can't do glitch-free and stuff. If anyone wants to see support for this in PA, then he is welcome to contribute. But for myself I don't see any reason why I should invest more time on this than the most basic housekeeping.

Posted by David at Thu Apr 10 18:28:13 2008
Lennart - Thanks so much for all the fantastic work! I'm thankful not only for the heaps of software, but also for the great posts explaining the internals. Thanks again!

Posted by Aster at Fri Apr 11 01:02:48 2008
Lennart, maybe you should use a temporary buffer with no saturation (32 bit/sample to avoid overflows)? This way you can do O(1) remixing when a client rewrites its data. Of course you have to convert the buffer afterwards, but you can use a more clever dynamic range compression than simple saturation.

Posted by Donald Wallace Rouse II at Sat Apr 12 04:03:45 2008
I agree with Aster; for each channel, use a 32-bit/sample simple additive mixing buffer, plus an unsigned-8-bit/sample number-of-samples buffer.
Scale each 32-bit sample by number-of-samples and the channel volume (if any).
Mixing in a new sound will be much easier.

Posted by Karl Zollner at Sat Apr 12 13:34:25 2008
Part 1:
Lennart, this is the third time I have read disparaging comments from you about OSS. I know you work at Redhat and are working specifically for Linux-and ALSA has won in the Linux camp. Yet I wonder what your solution is to the problems of cross-platform free software audio.

I can understand your resentment for OSS3-that API and its use by proprietary apps for Linux has caused lots and lots of problems in the past. But the situation seems to have changed. I can also understand a desire to have a single low-level audio driver system-this simplifies things for you. Yet OSS4 is being adopted by a number of different platforms, in spite of your lambasting it, and there are probably some thousands of Linux users using it right now.

ALSA has its own problems-it is seriously under-manned, it is an absolute usability nightmare-god forbid a mere mortal must ever edit asoundrc, the labeling of mixer elements and the non-deterministic nature of many of its controls makes it extremely painful to even use the mixer, its documentation belongs to the worst of the worst of all free software documentation -where there is documentation it is outdated, incorrect, incomplete or simply misleading and ALSA is simply not getting the kind of love and attention it would need to begin to fix these problems. ALSA cannot and will not ever be ported to any other platforms. Moreover the lack of love for ALSA plays no small part in ALSA failing to provide things that would help Pulseaudio, in fact if ALSA had proper community support a lot of what you are doing in Pulseaudio could/should be done in ALSA proper-ie. in kernelspace.

Most of the free software written today can and will be used on platforms other than Linux. While I do not believe that any of the other free software platforms will ever be as popular as Linux this does not mean that those who write applications which manipulate audio can simply ignore these other platforms. OSS4 is poised to be the solution for some of these platforms and as long as OSS provides better support for some hardware and any support for ALSA unsupported hardware there will be Linux users using OSS.

Now, admittedly, it is not your responsibility to provide good OSS support for Pulseaudio-those who develop OSS should be doing this work. Yet the interesting work you are doing on Pulseaudio does not appear to be portable, which is one of the things which leads me to believe that your probing of the bounds of that which can be done in userspace could probably be done better in kernelspace. I have seen your attempts to sway the GNOME community to embrace Pulseaudio. After having not received support for Pulseaudio by the GNOME and KDE communities Pulseaudio is now being embraced by the distributions(fedora, mandriva, ubuntu etc.). Congratulations for this victory! Perhaps libsydney will end up being the portable portion of Pulseaudio that does get adoption by the GNOME and KDE communities.

I do not want to see Linux being held back by concerns about cross-platform availability, this applies not only to Pulseaudio but also projects like FUSE and HAL. But as promising as Pulseaudio currently is the outlook still seems awfully bleak. Every time I start to think I see light at the end of this tunnel, this tunnel seems to start stretching out into infinity.

What is it going to take to get a rallying call around freedesktop(*cross desktop/cross platform) audio ? What is it going to take to get ALSA the needed love and attention to make it truly viable ? Right now the freedesktop audio community has virtually 0 support from the audio manufacturers. Right now 3rd party application writers(proprietary) do not have a platform-neutral API against which they can program(with perhaps the exception of gstreamer). What is it going to take to get freedesktop audio to a point that it can demand support from the manufacturers ? that it can dictate to 3rd parties which API's to use(ie. if you want Flash to properly integrate you must use this API) ?

continued...

Posted by Karl Zollner at Sat Apr 12 13:35:18 2008
Part2:
If we look over our shoulder at Xorg what we see is truly amazing. Xorg has a vibrant community which has undergone a true revolution in the last 5 years. The manufacturers of graphic cards are working with us now. Opengl is being adapted to our needs. This is empowerment.

If we look over our shoulders at ALSA what we see is just shy of outright depressing. The glaciers are moving at a far faster rate. There is no community communication between ALSA and anyone else. Not one blog. No forums. No communication. ALSA is at its best when no one even notices it exists-for mere mortals are confronted with their own mortality by merely using and configuring it.

If we look over our shoulders at OSS we see new free software, which is cross-platform, easily configurable, and not user hostile. We see a new API which addresses the egregious mistakes of previous OSS API's and which is being adopted by other platforms. We see in some cases better support for the same hardware that is available under ALSA and we see cases where some hardware is supported under OSS and not under ALSA.

My concerns are easy to understand. I have been fighting with Linux audio and been left defeated more times than I can count for more than 10 years now. Because some of the apps I use were written for OSS3 I was forced to install a second audio card, and with 2 cards I still cannot do what I can easily do with 1 card under Windows. I have such an interest in PulseAudio because it promises to solve some of the nightmares I have been fighting with, but it remains, for me, a promise.

We need the manufacturers to open up documentation for audio hardware, like what has happened in the graphics community. We, as a community, need to be able to bring something to the bargaining table to get the manufacturers to work with us. We need Adobe, Skype and others to use the APIs that are created by our communities. My fear is that we are facing another wave of balkanization, where Linux has ALSA and the rest have OSS, and we are left with no bargaining position: hardware manufacturers would be forced to support both ALSA and OSS, and in all likelihood would simply choose OSS because it is already cross-platform; and 3rd-party app writers would have to support both ALSA and OSS, and would probably just support OSS because it is already cross-platform.

Given this scenario it is imperative that there is a community-supported cross-platform alternative (is this GStreamer? should it be PulseAudio? perhaps libsydney?). Perhaps PulseAudio should just abandon any attempts at being cross-platform, allowing libsydney to fill this role, and actually try to tie itself even more closely to ALSA, and in doing so infuse ALSA with enough creativity, talent and life to raise it from its current zombie-like status. Could ALSA incorporate some of your work directly? Could PulseAudio become the userland counterpart to ALSA and free us from the god-awful ALSA configuration and tools? Is there absolutely NOTHING that can be learned and adopted in either ALSA or PulseAudio from OSS4? Is there any hope of any freedesktop activity to actually sort these issues out?

enough of my rambling, sorry for writing so much.

Posted by Wout Mertens at Sat Apr 12 16:09:26 2008
I too am curious why you don't just use a big pre-mixed sample buffer as Aster and Donald suggest.

Even at 96 kHz that's less than 1 MB/channel, and that's assuming you use 24-bit precision per stream with 256 max voices. The buffer would already be resampled from the stream sample rate to the audio card sample rate, which is the most expensive part, no?
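
As a rough sanity check of that estimate, here is a back-of-the-envelope sketch. The buffer length is not stated in this comment; the 2 s used below is an assumption, as is the idea that the accumulator gets just enough extra bits to sum all voices without overflow. The function name `premix_buffer_bytes` is made up for illustration.

```python
import math

def premix_buffer_bytes(rate_hz, sample_bits, max_voices, seconds):
    """Size of one channel of a pre-mixed buffer that can hold the exact
    (unsaturated) sum of `max_voices` streams of `sample_bits`-bit samples."""
    # Summing N full-scale samples needs ceil(log2(N)) extra bits of headroom.
    headroom_bits = math.ceil(math.log2(max_voices))
    bytes_per_sample = math.ceil((sample_bits + headroom_bits) / 8)
    return int(rate_hz * seconds) * bytes_per_sample

# 24 bits + 8 bits of headroom for 256 voices = 32 bits = 4 bytes per sample.
size = premix_buffer_bytes(96_000, 24, 256, 2.0)
print(size, size < 1024 * 1024)  # prints: 768000 True
```

Under these assumptions one channel needs 768000 bytes, which is indeed below 1 MB; a longer buffer or a wider accumulator would change the result proportionally.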

I'm armchair programming here; I'm just curious why such a method would not work...

In any case, great work! Thanks for taking the time to explain the technicalities.

Posted by Lennart at Sat Apr 12 16:44:38 2008
Karl: There are a few misconceptions in what you are saying:

- You claim that the situation generally changed between the mess that is OSS3 and OSS4. I would claim the contrary. There are some things fundamentally flawed in OSS, among them the fact that it tries to be a portable kernel interface. This makes it inherently difficult to virtualize. The approach is just wrong, wrong, wrong.

- I am not sure why you came to the conclusion that ALSA was undermanned while OSS was not. AFAIK there's just Hannu on OSS, while there are at least Takashi and Jaroslav looking after ALSA full-time on the payroll of Novell resp. Red Hat, in addition to James and Clemens, and relatively strong community support.

- The ALSA API has been ported to other platforms. There are even plugins for libasound that use OSS as a backend; they are shipped as part of alsa-plugins.

- ALSA still provides what I need much better than OSS does. The ALSA people are aware of the problems I have to deal with and have already made good inroads toward fixing those issues. I have listed my issues on http://pulseaudio.org/wiki/WhyIHateALSA and it's getting better.

- The things PA does should not happen in kernel space. The kernel people made that clear before, and everyone who has a clue acknowledges that. The fact that OSS4 does mixing in kernel space is another one of these inherently wrong approaches, again caused by the wrong approach that it is to define a portable kernel interface. If your API is ioctl() based you are forced to do mixing in kernel space. And that's just evil. No mixing in kernel space, please. (And you know, the thought of mixing FP in the kernel is just frightening, since kernel space is territory where FP is forbidden.) Believe me, doing this mixing in kernel space is wrong, really really wrong.

- The "interesting" part of PA (as you call it) is not inherently unportable. Given a powerful driver interface that implements the right number of basic operations this is implementable on non-ALSA backends too.

- I think you didn't really get the story right behind PA and GNOME.

- libsydney is intended to be the cross-platform API you are asking for. And nothing stops people from porting PA to non-Linux systems again.

Posted by Lennart at Sat Apr 12 16:45:15 2008
Karl, here's my second part:

- Most (consumer) audio HW these days is based on HDA which is very well documented. The driver situation is much much better than with 3D, where running the newest hardware is far more troublesome than with ALSA.

- Communication between ALSA and me is not as bad as you might claim. There's also a mailing list where Jaroslav and Takashi and the others post regularly. Sure they don't maintain "forums" or blogs, but I am not sure if that's really what makes a software project a good software project. Handling user support requests takes up a lot of time, and is quite frankly not what Takashi, Jaroslav or I are being paid for. The fact that OSS4 apparently manages to keep a forum around is probably more due to the fact that its userbase is much much smaller than ALSA's. Also note that there is an #alsa IRC channel where people answer user questions.

- Again, the biggest problems of OSS3 are not addressed at all in OSS4.

- Documentation for audio HW is pretty much available. Certainly much better than for video HW.

- Adobe, Skype and others are using the ALSA APIs these days.

- Quite frankly I don't care too much about the other Unixes. And I don't think anyone should really care. I will not make my code inherently unportable, and I will happily take portability patches. But investing time in portability for OSes that only the tiniest fraction of people actually uses is something I won't do.

- There's no need to configure ALSA in any way these days. The default config should be fine for almost everyone. If you however want some fancy setups with channel routing and so on then ALSA lets you do this. And for that it offers you a complex config language. But you shouldn't be complaining about that. OSS4 doesn't allow you to do things like that at all.

- Again, libsydney is intended to be a cross-platform API. The relevant platforms for that API are Linux, Windows and MacOSX. It will run on top of PA, of ALSA, on CoreAudio and DirectSound.

- ALSA is no zombie. It's a very lively project working closely with the kernel community.

I think you have a bit of an unfounded hatred against ALSA. I don't know why? ALSA has problems, sure, every software does. But it ain't any worse than anything else. And certainly not worse than OSS4. Au contraire, mon ami!

Posted by Lennart at Sat Apr 12 17:10:21 2008
Donald, Aster, Wouter:

Inverting the mixing for one specific stream would not have any positive impact on memory consumption: I'd still have to keep a copy of the recent past of every single stream around, so that I can subtract it from the mix buffer, before adding the new data.

As said, mixing is generally not invertible, due to saturation (yes, and doing dynamic range compression when mixing is on the TODO list and won't help here either). And of course for FP samples it will add numeric noise.

To work around the fact that we generally cannot invert the mixed buffer we'd have to keep around a copy of the mixed buffer in a higher resolution. And this copy of course takes up more memory due to the increased sample width. Cache pressure will also be much bigger, since we have to store away the high-precision sample we are summing into instead of just dropping it after we did the saturation.
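
To make the non-invertibility concrete, here is a small illustrative sketch in Python (not PA code; the 16-bit sample format and the helper names `saturate` and `mix_saturated` are assumptions chosen for the example). Once a sum has been clipped, subtracting one input no longer yields the other; keeping the unclipped sum in a wider type does.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturate(x):
    """Clip a wide integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))

def mix_saturated(a, b):
    """Mix two sample lists the usual way: add, then saturate."""
    return [saturate(x + y) for x, y in zip(a, b)]

a = [30000, 100]
b = [10000, 200]

mixed = mix_saturated(a, b)                      # first sample clips at 32767
recovered = [m - y for m, y in zip(mixed, b)]    # try to "un-mix" b

wide = [x + y for x, y in zip(a, b)]             # unclipped, needs > 16 bits
recovered_wide = [w - y for w, y in zip(wide, b)]

print(mixed, recovered)       # prints: [32767, 300] [22767, 100]
print(wide, recovered_wide)   # prints: [40000, 300] [30000, 100]
```

The first sample of `recovered` is 22767 instead of 30000: the information lost in saturation cannot be brought back, which is why the wider copy would have to be stored.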

So, what you propose has a negative effect on memory consumption and cache pressure in the general case when we do not have to remix.

It does -- however -- have a good effect on CPU time if we have at least four streams to remix. If M is the mixed buffer; A, B, C the data of the first three streams; D1 the old data of the fourth stream, D2 the new data of the fourth stream: then first we'd have to calculate M as A+B+C+D1. If D is then rewritten, we'd have to calculate for my algorithm: A+B+C+D2. In your algorithm it would be M-D1+D2. The latter calculation has a smaller cache pressure and one operation less. However, this is only the case if we have four streams. If we have fewer than that your alg is worse. Now, think of how many streams you usually play back at the same time on your desktop. You'll notice that usually you play back only one, maybe 2 when there is an event sound, and in the worst case 3. Optimizing for four streams and more is thus not really a good idea I would say. Also, rewinding is not the common case.

Also, let's not forget that redoing the same calculation in case of rewinding requires much simpler code than implementing a different scheme when rewinding.
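
The two remixing strategies compared above can be sketched as follows. This is an illustration only, working on unsaturated wide integers so that the subtraction is exact; the function names are hypothetical and do not exist in PA.

```python
def remix_from_scratch(streams):
    """Sum all streams sample by sample.
    Costs len(streams) - 1 additions per sample."""
    return [sum(samples) for samples in zip(*streams)]

def remix_incremental(mixed, old, new):
    """Replace one stream's contribution in an already mixed buffer.
    Always costs 2 operations per sample, but needs `old` and a wide `mixed`."""
    return [m - o + n for m, o, n in zip(mixed, old, new)]

def ops_per_sample(n_streams):
    """Arithmetic operations per output sample for each strategy."""
    return {"from_scratch": n_streams - 1, "incremental": 2}

A, B, C = [1, 2], [10, 20], [100, 200]
D1, D2 = [1000, 2000], [5000, 6000]

M = remix_from_scratch([A, B, C, D1])
# Both strategies give the same result after D is rewritten.
print(remix_incremental(M, D1, D2) == remix_from_scratch([A, B, C, D2]))  # prints: True

for n in (2, 3, 4):
    print(n, ops_per_sample(n))
```

With two streams the incremental variant does more work, with three it is a tie, and only from four streams on does it save an operation per sample, while still paying for the stored old data and the wider mix buffer.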

Posted by Wout Mertens at Sun Apr 13 10:10:26 2008
Hi Lennart,

I completely forgot about the old data you need to subtract. I now see the error of my ways :-) Thanks for explaining.

Posted by Aster at Sat Apr 19 19:51:48 2008
Lennart, you're right that many simultaneous streams (>=4) is not a common case. And yes, rewinding also is not a common case. But I don't care if it is common or not - PulseAudio should not suck in such "rare" cases (I hope you agree). However, after giving it more thought, I realised that you would want to rewind only a small number of samples (<1s I guess). So my optimization may not be necessary.

One idea to think about is to dynamically choose the algorithm based on the number of streams. If you have fewer than, let's say, 8, you use your algorithm (remixing from scratch). If you have more than 8 streams, you store the 32bit/sample mix, so that you can remix it later in a bounded time (independent of the number of streams, which could go really high for some pro applications or games).

One more thing: In my opinion saturation is NOT ACCEPTABLE. PulseAudio should have some high quality dynamic range compression when it gets too loud (to protect your ears or speakers, for example).

Posted by Aster at Sat Apr 19 19:58:06 2008
About memory consumption: applications could specify that they will not rewind. Using my algorithm, you wouldn't have to store their audio for remixing at all. You would never have to subtract the old sound, so why store it?

What about an API (I'm not familiar with PulseAudio, maybe it already has this) which would allow an application to specify an upper bound for rewinding? You would have to store a history of only this many samples.

Posted by gordboy at Sat May 17 15:53:37 2008
Having a cross-platform audio system is something people have been crying out for, for years. However, Pulseaudio is one of the most amateur, crass audio projects I have ever seen.

To even HAVE a "glitch-free" branch speaks volumes about the toytown nature of the project. Call this trollery if you like, but I'm a serious sound programmer, not some fly-by-night amateur muppet.

How you ever got any exposure in the Linux World completely baffles me.

Posted by bodorgy at Sun May 18 06:38:36 2008
gordboy, "serious sound programmer" extraordinaire: Knowing someone like you is a "serious sound programmer" for something that is the polar opposite of PA makes me want to use PA even more.

Go back to your "serious sound" programming and stop wasting time by drooling on your bib in public.

Posted by Lennart at Sun May 18 20:51:28 2008
gordboy: I love you too, you serious sound programmer, you.

I'd love to have a peek at your serious sound code. Care to share a link with an amateur muppet like me?

Posted by Andy at Mon May 19 16:45:54 2008
I suppose if I were a serious sound programmer and knew that people had been crying out for a cross-platform audio system for years, and that even an amateur muppet (not even a professional muppet? harsh...) could get undeserved publicity by working on one, then my "serious" implementation would be pretty far along by now... and I would be ready to sit back with all the coder-groupies I had acquired, sort through my lucrative job offers from various technology corporations, and "kick it," as they say, rather than hanging out on a website comment board trying to start the coder equivalent of a "mine is bigger" argument.

But on a more serious note...mess with the people who make me free software and I'll fight you. :P

I suppose it is coincidence "gordboy" and "bodorgy" consist of all the same letters?

Lennart: I'm a techie and musician, but not someone familiar with audio programming; there is plenty here I don't understand, but quite a bit that I do. I always appreciate when the coder gurus on various projects take the time to explain and dialog with the community, and I hope you don't let any negative comments or trolling behavior discourage you.

The gist of what I get from this is, we are going to have a sound server that is up there with the modern ones on the "big boys": more efficient, less error prone. Sounds good to me. :)

Anyway, now that we've all fed the troll enough for hibernation through the winter...back to the important stuff...making us free software :D

Posted by Paul Wayper at Tue May 20 06:36:42 2008
[Andy] I'd say bodorgy was a back-flame at gordboy.

[Lennart] This sounds great - memory is cheap these days and keeping the wake-ups to a minimum is something that everyone likes. I'd suggest multiplying the wake-up anticipation time (10ms normally) by a factor of ten if you get an underrun, and then turning it back down in a linear fashion until it was a set time away (I'd go for 7msec but it should be configurable) from when the buffer is empty. This exponential-lower-on-fail linear-rise-on-success is what TCP and other well-tested algorithms already do. You could probably even calculate how much time you had spare and back down intelligently. The idea here is to make sure there's a vanishingly small chance of a second underrun, but also to get to a state where the wake-up time no longer has to be calculated or checked - thus reducing the amount of extra calculation you do and thus the amount of time that the processor is awake.
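
A minimal sketch of the back-off scheme suggested above, for illustration only. The 10 ms start, the factor of ten and the 7 ms floor come from the comment; the class name `WakeupMargin`, the 1 ms step per successful wake-up and the 2000 ms ceiling are assumptions added here, and none of this is PA's actual logic.

```python
class WakeupMargin:
    """Safety margin (ms) by which the wake-up precedes the buffer running empty."""

    def __init__(self, initial_ms=10.0, floor_ms=7.0, step_ms=1.0,
                 factor=10.0, ceiling_ms=2000.0):
        self.margin_ms = initial_ms
        self.floor_ms = floor_ms      # never wake up later than this before empty
        self.step_ms = step_ms        # assumed linear decrement per success
        self.factor = factor          # multiplicative back-off on underrun
        self.ceiling_ms = ceiling_ms  # assumed cap, e.g. the total buffer length

    def on_underrun(self):
        # Back off hard so that a second underrun becomes very unlikely.
        self.margin_ms = min(self.margin_ms * self.factor, self.ceiling_ms)

    def on_success(self):
        # Creep back towards the floor one step at a time.
        self.margin_ms = max(self.margin_ms - self.step_ms, self.floor_ms)


margin = WakeupMargin()
margin.on_underrun()
print(margin.margin_ms)   # prints: 100.0
margin.on_success()
print(margin.margin_ms)   # prints: 99.0
```

After enough successful wake-ups the margin settles at the floor and stays there, which matches the stated goal of reaching a state where nothing needs recalculating.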

Keep up the good work!

Posted by gianni at Tue Jun 3 02:50:32 2008
How would this glitch-free model affect or be used for capturing?

Currently in PA (the version distributed in Fedora 8, at least) there are major problems with sound capturing (I have not been able to capture any sound from any source while PA is running, on my laptop or desktop). How will glitch-free PA integrate with capture (especially with applications such as Skype or Mumble which use ALSA as their backend)?

Posted by Alex Lukin at Tue Jun 3 17:48:45 2008
I'd like to point you to a use case where PA is not good yet and only causes problems.

I use Rosegarden+jackd+fluidsynth+qsynth for composing and recording some music tracks. It works fine with rt kernels and with "timerless" kernels I compile myself.
It does not work in F8 and F9 with PA on. Only scratching sounds and a tempo accelerated 3 times.

How will PA work with the JACK infrastructure?

I think that PA must grow into a reliable and professional sound API like Steinberg's ASIO to allow low-latency, high-quality sound on *nix.

I understand that a lot of things depend on the rt capability of the kernel and scheduler, but even more depends on the sound server.

Anyway, thanks for the good work and good luck!

Posted by Downhill Games at Sat Jun 14 18:56:57 2008
Lennart, it would appear your payment is compensation for putting up with idiots, while your programming is just doing what you love :) I'm glad there is some counter-point against(for?) the morons, otherwise we might not have this wonderful and MUCH NEEDED daemon. I really do hope the other daemons (maybe except arts?) die off a quiet, unnoticed death. Of course, users will notice when their [sound-enabled] applications start working without a hitch -- I know I have :)

Thanks a lot for your work, and I hope they pay you well enough for putting up with idiots like Karl Spamzalot^

Take care ^_^

Posted by Downhill Games at Sat Jun 14 18:59:21 2008
Oh, oh! And I can't wait to try .11 :D

Posted by Pizuz at Sun Jul 20 12:43:41 2008
I'm certainly curious.

Are those latencies applications can "request" the real-world latencies you will actually hear? Or will there be an additional delay? I mean, a latency of 20ms or even 10ms would be pretty awesome already for games or emulators.

Posted by Simos Xenitellis at Thu Jul 24 19:49:18 2008
Congratulations Lennart!
Amazing work and great write-up.

Posted by apanloco at Sat Aug 2 19:58:43 2008
It seems this is now released with PA 0.9.11 :)

Posted by Lorenz at Tue Aug 12 18:36:56 2008
Thanks for your great work, Pulse Audio lets me do things I never thought were possible!

Posted by Animesh Saxena at Sat Sep 20 13:11:27 2008
This is truly awesome :).

Initially I also couldn't help moving back to ALSA at the slightest error, but I think once anyone sees the numerous advantages that PulseAudio has, nobody would switch back ;).

Thanks for explaining how it all works. Got me interested in the code as well!

Posted by JoeS at Sun Oct 5 16:00:41 2008
It might be great when it finally works, but so far it's as bad as KDE4. Blame the distros, blame the apps, but audio worked for the majority of people before pulse was included in distros. The number of bug reports related to audio has quadrupled since pulse. Jack did a better job and still does. All we needed was configuration for ALSA and JACK. Why add another layer? I might be impressed in a few years, but I'll still be wondering why we went through the hassle when we had a better engineered audio all along (jackd).

Anyone that wants to record audio needs to remove all traces of pulse to get decent results, and from what I can read in all the pulse code and comments, pulse will never be useful for an audio workstation. Nor has it solved any of the issues with audio that were there before pulse. Multi-speaker setup, multi-channel setup, asound.rc configuration, low-latency recording and playback, etc...

Please quit blaming others and start working with distros, ALSA, JACK, OSS and existing audio app developers to solve these problems, or quit pushing pulse down the distros' throats. And yes, Pulse was marketed to distros, not requested.

Posted by Towner at Sun Oct 12 02:08:46 2008
Thanks for the article, a thoroughly interesting read. The API neck beard comment wars are a fairly hilarious side show as well.

Posted by Muppet at Sat Nov 22 23:24:11 2008
For 10 years sound in Linux was a nightmare. Now it seems to get better and better, many thanks!

But the user interface is not really muppet-proof yet. E.g. the pulse audio manager mentioned in the perfect setup is such an example.

Also configuration in GNOME is far from muppet-proof. One has preferences-sound, every application also has a chooser of the backend and individual tests, and the names are then different in Skype, in Sound Recorder, in Pidgin, in Wine, etc.

Posted by anon at Wed Nov 26 05:52:11 2008
Whatever, will the new release resolve the dlopen issue? http://www.google.com/search?hl=en&q=ubuntu+%2B%22pulse+audio%22+dlopen&btnG=Search

Posted by Jerome at Fri Nov 28 11:19:27 2008
Saturated integer mixing? I was under the impression that floating point mixing was just as fast (if not faster) on modern hardware, and of course much easier to implement without fear of clipping etc. Correct me if I'm way off-base with this.

Posted by Paul Wayper at Tue Dec 2 00:29:21 2008
[Lennart] Will these changes support ultra-large playback buffers? One thing that I saw on Intel's power saving pages (http://www.lesswatts.org/projects/applications-power-management/large-buffers.php) was the idea that applications decode large amounts of audio (and video) to memory and then let the decoding thread or process go idle while the media is played back. This is especially useful when reading from DVDs or CDs, where keeping the device spun up uses more power than reading a large chunk and letting it spin down. Of course, applications would get more out of buffering the compressed data rather than uncompressed, but if an application can be simpler by pushing the last twenty seconds of decoded audio out to PulseAudio before then buffering the next two minutes of compressed data, then this is an overall win.

Hope that makes sense :-)

Posted by Lennart at Wed Dec 10 22:56:51 2008
Paul: yes, ultra-large buffers are exactly what g-f is about. But not 2 min. Just 2 s -- which is very long already.

Posted by zero latency at Fri Feb 27 06:33:10 2009
I think the PulseAudio server is too aggressive in rewinding the application pointer when starting a new audio stream while another stream is playing.

Referring to image 1, the server can only safely rewind the application pointer up to the start of fragment 3 to mix the new stream and the running stream.

If glitch-free mode uses NFRAGS = 1, the server can only rewind to the start of the buffer (i.e. the start of fragment 1).

Posted by DMA transfer at Thu Mar 5 03:22:24 2009
"the hardware reads the samples from the buffer, one at a time, and passes it on to the DAC so that eventually it reaches the speakers"

http://thread.gmane.org/gmane.linux.alsa.devel/60428/focus=60478

At the end of the interrupt, the driver just sets up the starting address and the number of samples for the DMA transfer, and the pointer callback returns the value of a hardware register holding the number of processed samples.

I.e. snd_pcm_hw_rewindable() is not equal to snd_pcm_mmap_hw_avail() for drivers using DMA transfers, especially those drivers which use snd_pcm_indirect_playback_transfer().

E.g. cs46xx copies a period of samples from system memory to the DSP's memory and performs the mixing on the DSP before sending it to the AC97 DAC.

Posted by Grr at Sun May 17 03:27:52 2009
"The code in the glitch-free is still rough and sometimes incomplete ... I hope this text also explains to the few remaining PA haters a little better why PA is a good thing, and why everyone should have it on his Linux desktop."

No, it shouldn't be on our desktops. It's incomplete, "rough", buggy, and has a horribly complex user interface.

It should be finished first, and given an intuitive interface, and tested thoroughly, and THEN it should be on our desktops.

Posted by MrQuincle at Sat Dec 12 14:14:19 2009
At the moment I am in a team that creates a physically realistic robot simulator. Although I am responsible for AI (and not game AI, but real AI about which you don't wanna know), we are also running into audio problems.

OpenAL Soft seems to allow multiple robots, each with their own microphone. I have to check that though.

From what I understood, the remarks by Donald, Aster, Wouter would be valid if there are more than 4 streams involved in mixing. So, would that mean that their suggestion would be better as soon as we have a robot simulator with more than 4 robots listening to the sounds of each other with their own (virtual) microphones?

Sorry for my layman question, but I was intrigued by the high quality of the post, and even the comments. Even so, I think this question about the scalability of the sound daemon is not for the faint of heart. :-)

Kind regards!

Posted by Igor Katalnikov at Thu Feb 3 16:24:40 2011
Sorry, but IMHO:

Latency is the distance between the hardware read and software write points. This difference should not grow bigger than the buffer size. Such an event is IMHO called a buffer under-run.

So contrary to your statement, IMHO the rule should be formulated as follows: Fast Hardware - Small Buffer, Slow Hardware - Large Buffer.

Decreasing the buffer size on slow hardware leads to chaining a rabbit to a cannonball, thus decreasing latency. You can keep the resulting noise below the hearing limit though by making the buffer around 1 / 10 Hz = 0.1 s. Still, only adequate hardware or decreasing the encoding quality can solve the problem completely.

Posted by D at Sat Apr 16 11:21:30 2011
Thanks for the work, Lennart. But I'm not really getting this working all the way. I've wanted to try the IRQ-less mode with 2.6.38 and hda-intel, but I can't get below 4 irqs/s and 8ms C6 residency, which is what I had prior to 2.6.38, too. Have you tried this? Any hints?
I got a fresh pulseaudio (1.0-dev-180-gf7ac) and I'm using that. Does pulse need special configure flags or similar at compile time for this to succeed? What's the best audio application for testing? Maybe you could post something about this now that there's initial support, to get the IRQ-less pulse more attention? Thanks!

Posted by Arnout at Sun Oct 16 13:53:27 2011
Very nice explanation! Thanks!

It should be obvious but in case it isn't: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.

Please note that I take the liberty to delete any comments posted here that I deem inappropriate, off-topic, or insulting. And I exercise this liberty quite aggressively. So yes, if you comment here, I might censor you. If you don't want to be censored you are welcome to comment on your own blog instead.

Tue, 08 Apr 2008

What's Cooking in PulseAudio's glitch-free Branch