Live audio playback
My latency woe research continues, this time on the playback side. I've managed to write a new basic audio renderer and hook it into the DirectShow graph, but the next issue is how to handle playback timing. When playing stored audio/video data, you can read the source at a variable rate as necessary, so playback timing is simple: minimize the difference between the audio and video timing. In other words, the user can't really tell that you're buffering two full seconds of audio if the video is also delayed by the same amount. Where this does matter is in seek latency, because then the decoder needs to re-fill that delay before playback can restart, but even then a soft start or just a fast decoder can do the trick.
However, live input -- or specifically in this case, interactive live input -- is a tougher problem. Here the clock starts ticking the moment the user hits a button, so you're already late by the time you receive any audio. That means minimizing latency through the entire pipeline is absolutely critical. Unfortunately, as I mentioned before, the recording device I'm using sends 40ms packets, so by the time I receive the first byte I'm already that much behind, and I really don't need another 100ms of latency on the output. In the playback code, then, the problems are:
- Compensate somehow for the discrepancy between the input and output rates. If the incoming rate is faster than the output rate, the buffer gradually fills and latency creeps up; if it's slower, the buffer gradually empties and eventually underflows.
- Maintain as small an output buffer as possible without ever running dry (underflowing). If that does happen, the audio breaks up and crackles, which sounds nasty.
The strategy I traditionally took for the first one, and the one the DirectSound Renderer uses, is to slightly resample the input -- play it slightly faster to reduce the buffered data, and slightly slower to increase it. The problem is that while this works fine for compensating for a difference in rate, it doesn't work so well for adjusting the current latency. For instance, say that through a glitch we suddenly have half a second of audio buffered, and thus a lot of extra latency. To drain that much audio over 30 seconds -- which is quite a long time -- we'd need to raise a 48 kHz sampling rate to 48.8 kHz. The problem is that in pitch terms that's about a quarter of a semitone (log2(48.8/48)*12 ≈ 0.29), which is noticeable. As a result I'm now leaning back toward time-domain methods, i.e. chopping or duplicating audio segments. This is a crude form of time stretching, and provided the adjustments are rare, it works better than I had expected.
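The adjustment itself is simple; here's a minimal sketch of the idea, assuming 16-bit mono samples sitting in a plain FIFO (the chunk and crossfade sizes are arbitrary, and a real renderer would pick its splice points more carefully):

    // Crude time-domain latency adjustment: drop or repeat a small chunk of
    // audio, with a short linear crossfade to hide the splice. Intended to be
    // invoked rarely, only when the buffered amount drifts out of range.
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <vector>

    static const size_t kChunk = 256;   // ~5.3 ms at 48 kHz
    static const size_t kFade  = 64;    // crossfade length in samples

    // Remove kChunk samples from the front of the queue (reduces latency).
    void dropSegment(std::deque<int16_t>& q) {
        if (q.size() < kChunk + kFade)
            return;
        // Fade the first kFade samples out into the samples that follow the
        // dropped chunk, then discard the chunk itself.
        for (size_t i = 0; i < kFade; ++i) {
            float t = float(i) / kFade;
            q[kChunk + i] = int16_t(q[i] * (1.0f - t) + q[kChunk + i] * t);
        }
        q.erase(q.begin(), q.begin() + kChunk);
    }

    // Repeat kChunk samples at the front of the queue (increases latency).
    void duplicateSegment(std::deque<int16_t>& q) {
        if (q.size() < kChunk + kFade)
            return;
        // Head = the first kChunk samples, followed by a crossfade from their
        // continuation back toward the start of the queue.
        std::vector<int16_t> head(q.begin(), q.begin() + kChunk);
        for (size_t i = 0; i < kFade; ++i) {
            float t = float(i) / kFade;
            head.push_back(int16_t(q[kChunk + i] * (1.0f - t) + q[i] * t));
        }
        // The crossfade lands back at sample kFade, so drop the first kFade
        // samples before prepending; net growth is exactly kChunk samples.
        q.erase(q.begin(), q.begin() + kFade);
        q.insert(q.begin(), head.begin(), head.end());
    }

Stereo or float data works the same way per frame; what matters is that the splice is short, crossfaded, and rare.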
That leaves the problem of determining what the minimum latency should be. There always needs to be a minimum amount of data in the sound buffer to cover delays in the output path, such as the CPU time to process the system calls and copy data into the hardware buffer, and jitter in thread scheduling. I hate just exposing this as an option and making the user tune the audio buffer, especially when changes in app configuration and system load change the required latency. Ideally, the application should be able to monitor the audio buffer status and adaptively adjust the buffer level. I tried doing this with a waveOut-based routine, and while it worked pretty well in XP, it gave crackling in Vista. Dumping out a log of buffering stats revealed the problem (timestamps in milliseconds):
    Finished 19 at 11948090
    Finished 20 at 11948090
    Checking at 11948090
    Checking at 11948106
    Checking at 11948106
    Checking at 11948123
    Checking at 11948140
    Finished 21 at 11948140
    Finished 22 at 11948140
    Finished 23 at 11948140
    Checking at 11948140
    Checking at 11948156
    Checking at 11948156
    Checking at 11948173
    Checking at 11948173
    Checking at 11948190
    Finished 24 at 11948190
    Finished 25 at 11948190
In this test I'm actually delivering 16.7ms buffers (60 frames/second), but the buffers are being marked as done by the OS in batches. More suspiciously, the notifications are occurring every 50ms. That amount of latency isn't a dealbreaker by itself; what is a dealbreaker is that Vista appears not to mark buffers as completed until after they have already been copied to the hardware buffer and played. If you think about it, this behavior is necessary so that an application can wait for a sound to finish playing without cutting it off. However, it also has the annoying side effect of making it impossible to tell from buffer status alone whether an underrun has occurred, and the Vista user-space mixer seems more likely to cause problems in this regard than the XP kernel mixer. The waveOut API has no other way to report underruns or the amount of internal delay, so as far as I can tell, the only way to deal with this is to fudge up the buffering amount. Suck.
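For reference, the bookkeeping behind that kind of polling looks roughly like this -- only a sketch, with the names and the fixed eight-block ring being my own choices here, and error handling omitted:

    // waveOut streaming with buffer-status polling. pendingBytes() is the
    // figure the adaptive scheme depends on; under the Vista mixer the
    // WHDR_DONE flags come back in ~50ms batches, so the estimate is coarse
    // and can't distinguish a late completion from a real underrun.
    #include <windows.h>
    #include <mmsystem.h>
    #include <cstring>
    #pragma comment(lib, "winmm.lib")

    struct WaveOutQueue {
        HWAVEOUT hwo;
        WAVEHDR  hdrs[8];   // fixed ring of headers
        int      next;      // next slot to reuse

        bool open(const WAVEFORMATEX& fmt, DWORD blockBytes) {
            next = 0;
            if (waveOutOpen(&hwo, WAVE_MAPPER, &fmt, 0, 0, CALLBACK_NULL))
                return false;
            for (WAVEHDR& h : hdrs) {
                ZeroMemory(&h, sizeof h);
                h.lpData = new char[blockBytes];
                h.dwBufferLength = blockBytes;
                waveOutPrepareHeader(hwo, &h, sizeof h);
                h.dwFlags |= WHDR_DONE;     // mark the slot as free initially
            }
            return true;
        }

        // Bytes written but not yet reported complete by the driver.
        DWORD pendingBytes() const {
            DWORD total = 0;
            for (const WAVEHDR& h : hdrs)
                if (!(h.dwFlags & WHDR_DONE))
                    total += h.dwBufferLength;
            return total;
        }

        // Queue one block; returns false if the ring is full (retry later).
        bool submit(const void* data, DWORD bytes) {
            WAVEHDR& h = hdrs[next];
            if (!(h.dwFlags & WHDR_DONE))
                return false;
            memcpy(h.lpData, data, bytes);
            h.dwBufferLength = bytes;
            waveOutWrite(hwo, &h, sizeof h);   // clears WHDR_DONE, sets WHDR_INQUEUE
            next = (next + 1) % 8;
            return true;
        }
    };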
You might be wondering why I'm still using waveOut, even though it's quite old. Well, up through Windows XP, the only alternative is DirectSound. Unlike waveOut, which uses a straightforward streaming API and simply stops audio playback if the application buffer underruns, DirectSound uses a hardware DMA model and has the application write into a looping buffer, even when software mixing is actually used. This has the undesirable behavior that if the application blocks in some operation for too long, the mixer wraps around and keeps playing the same stuttering sound over and over like a broken record. Furthermore, it doesn't report that this has happened, so unless you have extra logic to prevent or detect it, it also screws up your buffering calculations and suddenly your output routine thinks it has a nearly full buffer. (There is a notification API that could help with this, but as usual, no one uses it because it is broken with some drivers.) The DirectSound Renderer handles these issues by queuing buffers to a separate thread that sits on the output buffer and clears or pauses it once an underflow is detected. Expecting an application to deal with all of this just to stream some audio is unreasonable, and since most of my audio playback isn't latency-sensitive, I've stuck with waveOut.
It looks like I'll have to change my mind for this case, because DirectSound appears to work much better in Vista. I can get playback positions with much better accuracy and precision than with waveOut, and buffer underruns are detected more reliably. In addition to the play cursor, DirectSound reports a write cursor to the application, which indicates how far ahead of the current playback position the application needs to write to avoid underflowing. The main issue left is the wraparound problem, which I haven't solved yet. I think I can detect it, and avoid corrupting the buffering calculations, by using a big buffer and checking whether the system clock has advanced far enough for a wraparound to have occurred, and I can pre-clear sections of the buffer to reduce the artifacts if the buffer does underflow a little bit. The remaining question is whether I want to use a separate thread to manage the buffer so I can stop playback on a delay. I'd like to avoid the broken record, but I don't know if I can spare the additional latency from queuing the audio to another thread.
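A sketch of what I have in mind -- the member names, the timer-based wrap check, and the assumption that the secondary buffer is already created and playing are placeholders rather than finished code:

    // Streaming into a looping DirectSound secondary buffer. queuedBytes()
    // measures the play-cursor-to-write-offset distance, and bails out if
    // enough wall-clock time has passed that the play cursor must have lapped
    // us (the broken-record case), since the distance is then meaningless.
    #include <windows.h>
    #include <mmsystem.h>
    #include <dsound.h>
    #include <cstring>
    #pragma comment(lib, "dsound.lib")
    #pragma comment(lib, "winmm.lib")

    struct DSoundStream {
        IDirectSoundBuffer* buf;     // looping secondary buffer, already created
        DWORD bufferBytes;           // total size of the circular buffer
        DWORD bytesPerMs;            // e.g. 48000*2*2/1000 for 16-bit stereo
        DWORD writeOffset;           // where the next block will be written
        DWORD lastWriteTime;         // timeGetTime() at the last write

        // Bytes between the play cursor and our write offset.
        DWORD queuedBytes() {
            DWORD play, write;
            if (FAILED(buf->GetCurrentPosition(&play, &write)))
                return 0;
            // If more time has passed than the whole buffer covers, the play
            // cursor has wrapped past us at least once; report empty.
            DWORD elapsedMs = timeGetTime() - lastWriteTime;
            if (elapsedMs > bufferBytes / bytesPerMs)
                return 0;
            return (writeOffset + bufferBytes - play) % bufferBytes;
        }

        // Append a block at the write offset; Lock() returns two regions when
        // the request wraps around the end of the buffer.
        bool writeBlock(const void* data, DWORD bytes) {
            void *p1, *p2;
            DWORD n1, n2;
            if (FAILED(buf->Lock(writeOffset, bytes, &p1, &n1, &p2, &n2, 0)))
                return false;
            memcpy(p1, data, n1);
            if (p2)
                memcpy(p2, (const char*)data + n1, n2);
            buf->Unlock(p1, n1, p2, n2);
            writeOffset = (writeOffset + bytes) % bufferBytes;
            lastWriteTime = timeGetTime();
            return true;
        }
    };

Pre-clearing would slot into writeBlock() as well: write silence into the region just past the new write offset so that a small underflow plays silence instead of stale data.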
Somehow, things seemed simpler when I just had to enable auto-init DMA and handle some interrupts.
Finally, since I'm still mostly XP-based, I haven't looked much at the new Vista API, WASAPI, which is what waveOut and DirectSound now map to. Besides the XP issue, I had originally avoided it because I didn't need the new functionality. WASAPI does appear to be easier to use than DirectSound, though, so I might have to look at writing a WASAPI-specific output path.
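For what it's worth, the part of WASAPI that looks attractive boils down to something like this -- just an outline of the shared-mode path with all error handling and cleanup stripped, not a working output module:

    // Minimal shared-mode WASAPI render setup. The useful part for this
    // problem is GetCurrentPadding(), which reports exactly how many frames
    // are still queued -- the number waveOut can't provide reliably.
    #include <windows.h>
    #include <mmdeviceapi.h>
    #include <audioclient.h>
    #pragma comment(lib, "ole32.lib")

    IAudioClient*       client = NULL;
    IAudioRenderClient* render = NULL;
    UINT32              bufferFrames = 0;

    void initWasapi() {
        CoInitialize(NULL);
        IMMDeviceEnumerator* devEnum = NULL;
        IMMDevice* dev = NULL;
        WAVEFORMATEX* fmt = NULL;
        CoCreateInstance(__uuidof(MMDeviceEnumerator), NULL, CLSCTX_ALL,
                         __uuidof(IMMDeviceEnumerator), (void**)&devEnum);
        devEnum->GetDefaultAudioEndpoint(eRender, eConsole, &dev);
        dev->Activate(__uuidof(IAudioClient), CLSCTX_ALL, NULL, (void**)&client);
        client->GetMixFormat(&fmt);
        // 500000 * 100ns = 50ms of shared-mode buffer; the engine wakes up on
        // its own fixed period and the app just keeps the buffer topped up.
        client->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, 500000, 0, fmt, NULL);
        client->GetBufferSize(&bufferFrames);
        client->GetService(__uuidof(IAudioRenderClient), (void**)&render);
        client->Start();
    }

    // Frames still queued for the engine: the figure that drives how much to
    // write next and how close we are to underflowing.
    UINT32 queuedFrames() {
        UINT32 padding = 0;
        client->GetCurrentPadding(&padding);
        return padding;
    }

    // Top the buffer up with `frames` frames of silence (real data would be
    // copied into `data` instead of passing the SILENT flag).
    void topUp(UINT32 frames) {
        UINT32 avail = bufferFrames - queuedFrames();
        if (frames > avail)
            frames = avail;
        if (frames == 0)
            return;
        BYTE* data = NULL;
        if (SUCCEEDED(render->GetBuffer(frames, &data)))
            render->ReleaseBuffer(frames, AUDCLNT_BUFFERFLAGS_SILENT);
    }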