Learning DirectSound and audio with more than 2 channels
About a week ago, I acquired a Creative Labs Audigy 2 ZS Notebook card, partly out of the need to do some testing with 24-bit audio support, and partly out of disdain for the wonderfully fancy AC97 codec in my laptop. As far as I can tell, it has essentially the same feature set as a regular Audigy 2 ZS, including the EMU10K2-based effects engine and the separate 24-bit/96KHz support chip. It has a single headphone port on the side that doubles as both a regular wired connection and an optical output. An unexpected feature, however, is that if you don't plug anything into the Audigy 2 ZS Notebook, its drivers can take the output of the effects engine and push it through your regular sound card. The latency is a bit high, but this means that I can get EAX environmental effects through the onboard speakers. Also, the built-in sound card's mixer and the Audigy 2's mixer are both active, so I can set the built-in one such that the volume control most programs see has a normal range instead of going from loud to extremely loud to ear-shatteringly loud. Neat.
Now, I had another agenda in getting this card, which was to try a surround-sound hack. The idea was to use the "center cut" algorithm to split the center audio from the sides, and to run two speakers with the side audio and one or two more in front with the center audio. (With major help from Moitah on the forums, the algorithm has been improved -- the new version will appear in 1.7.0.) In order to do this, however, I needed to output sound with more than two channels, which I had never done before. I figured that while I was mucking around with this I might as well learn DirectSound as well.
The surround-sound hack sounded terrible, but DirectSound turned out not to be as bad, and it was interesting learning about the current state of the Windows sound system.
"Direct"Sound?
First, DirectSound isn't really direct anymore, joining DirectInput in the list of DirectX APIs that aren't. In Windows XP, both waveOut and DirectSound run through the kernel mixer API, and in Vista, they're both being layered on top of a new user-space API called WASAPI. The bad part is that calls are going through extra translations and overhead, but on the other hand, the Windows audio team has been doing a good job of maintaining all of the APIs -- in fact, waveOut code I wrote years ago still runs on Windows XP, but with more formats and with lower latency than when I developed it on Windows 95. This is in stark contrast to other teams at Microsoft. *glares at GDI and Direct3D teams*
DirectSound, or at least DirectSound 8, is actually fairly simple to set up for playback. For the most part, it is a lot like programming an old SoundBlaster 16 -- select a wave format, initialize an audio buffer, start playback, and periodically poll the position so you can lock and fill data ahead of the DMA point. The Lock() call simply gives you two pointers to write into (two are required to handle buffer wrap). As with programming on the bare metal, failure to stay ahead of hardware causes the playback pointer to loop around. You can also register for notifications for when the playback pointer crosses certain thresholds, allowing for non-polled loads, but apparently these aren't reliable on some drivers. Bummer. You can, however, retrieve an approximate read pointer, so you can at least periodically check the buffer status. In some ways, DirectSound is actually easier to use than waveOut, where you have to create a bunch of buffer headers, allocate memory for them, "prepare" the headers, and manage separate pools of active and pending buffers. Well, I guess there are looping buffers in waveOut too, but I never checked if they could be backfilled like hardware or DirectSound streaming buffers.
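To make that concrete, here's a minimal sketch of the kind of poll-and-fill loop I mean, assuming the device and a looping secondary buffer have already been created and pre-filled once. FillAudio() is just a placeholder for whatever generates the audio, and error handling is mostly elided.

    #include <windows.h>
    #include <dsound.h>

    // Hypothetical callback that renders 'bytes' bytes of audio into 'dst'.
    void FillAudio(void *dst, DWORD bytes);

    void StreamLoop(IDirectSoundBuffer8 *pBuf, DWORD bufferBytes) {
        DWORD writePos = 0;                 // our write offset into the ring;
                                            // assume the buffer was pre-filled once
        pBuf->Play(0, 0, DSBPLAY_LOOPING);  // start looping playback

        for(;;) {
            DWORD playCursor, writeCursor;
            if (FAILED(pBuf->GetCurrentPosition(&playCursor, &writeCursor)))
                break;

            // Everything from our write offset up to the play cursor has already
            // been played and is safe to refill. (writeCursor marks the start of
            // the region the hardware may already have queued; we never touch it.)
            DWORD avail = (playCursor + bufferBytes - writePos) % bufferBytes;
            if (avail) {
                void *p1, *p2;
                DWORD b1, b2;

                // Lock() returns two regions when the locked range wraps past
                // the end of the buffer.
                if (SUCCEEDED(pBuf->Lock(writePos, avail, &p1, &b1, &p2, &b2, 0))) {
                    FillAudio(p1, b1);
                    if (p2)
                        FillAudio(p2, b2);
                    pBuf->Unlock(p1, b1, p2, b2);
                    writePos = (writePos + avail) % bufferBytes;
                }
            }

            Sleep(10);  // poll instead of relying on position notifications
        }
    }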
The DirectSound API does strike one of my pet peeves, which is the ever-annoying SetCooperativeLevel() and its non-NULL window handle requirement. This means that your DirectSound objects have thread affinity, which I hate. I much prefer thread-agnostic APIs, which are easier to deal with because you can use your own synchronization instead of having a library dictate your threading model. One saving grace is that if you are creating buffers that only have global focus, apparently you can use GetDesktopWindow() as the handle. Before you say that this is naughty, note that a software developer on the DirectSound team said we could do it. Therefore, they can't complain later. :)
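For what it's worth, here's roughly how the GetDesktopWindow() setup looks; CreateGlobalBuffer() is just my name for the helper, and you'd link against dsound.lib (plus dxguid.lib for the interface GUID).

    #include <windows.h>
    #include <dsound.h>

    // Create the device and a global-focus secondary buffer without a real window.
    bool CreateGlobalBuffer(WAVEFORMATEX *wfx, DWORD bufferBytes,
                            IDirectSound8 **ppDS, IDirectSoundBuffer8 **ppBuf) {
        *ppDS = NULL;
        *ppBuf = NULL;

        if (FAILED(DirectSoundCreate8(NULL, ppDS, NULL)))
            return false;

        // A global-focus buffer doesn't need a window of our own; the desktop
        // window handle is enough to satisfy SetCooperativeLevel().
        (*ppDS)->SetCooperativeLevel(GetDesktopWindow(), DSSCL_PRIORITY);

        DSBUFFERDESC dsbd = { sizeof(DSBUFFERDESC) };
        dsbd.dwFlags       = DSBCAPS_GLOBALFOCUS | DSBCAPS_GETCURRENTPOSITION2;
        dsbd.dwBufferBytes = bufferBytes;
        dsbd.lpwfxFormat   = wfx;

        IDirectSoundBuffer *pBuf = NULL;
        if (FAILED((*ppDS)->CreateSoundBuffer(&dsbd, &pBuf, NULL))) {
            (*ppDS)->Release();
            *ppDS = NULL;
            return false;
        }

        pBuf->QueryInterface(IID_IDirectSoundBuffer8, (void **)ppBuf);
        pBuf->Release();
        return *ppBuf != NULL;
    }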
Multi-channel and high-precision audio formats
The next question is how you blast out more than two channels of audio. There are two ways of requesting this. One is to simply extend PCMWAVEFORMAT to use more than two channels. The other is to use the newer WAVEFORMATEXTENSIBLE format, which is preferred since it contains a channel mask for remapping channels. Which one to use? The Audigy 2 ZS Notebook doesn't seem to care one way or the other, as it accepted either form for everything I tested; the Sigmatel C-Major audio device, however, only accepted WAVEFORMATEXTENSIBLE for 24-bit and 32-bit formats. For recording, though, none of the devices supported WAVEFORMATEXTENSIBLE, even for stereo formats that worked for playback.
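For illustration, here's what filling out a WAVEFORMATEXTENSIBLE might look like for 5.1-channel, 24-bit, 48KHz PCM; the layout and bit depth are arbitrary choices of mine, and the includes are what I'd expect to need for KSDATAFORMAT_SUBTYPE_PCM and the SPEAKER_* masks.

    #include <windows.h>
    #include <mmreg.h>
    #include <ks.h>
    #include <ksmedia.h>    // KSDATAFORMAT_SUBTYPE_PCM and the SPEAKER_* masks

    // Build a WAVEFORMATEXTENSIBLE describing 5.1-channel, 24-bit, 48KHz PCM.
    WAVEFORMATEXTENSIBLE Make51Format24Bit() {
        WAVEFORMATEXTENSIBLE wf = {};
        wf.Format.wFormatTag      = WAVE_FORMAT_EXTENSIBLE;
        wf.Format.nChannels       = 6;
        wf.Format.nSamplesPerSec  = 48000;
        wf.Format.wBitsPerSample  = 24;
        wf.Format.nBlockAlign     = (WORD)(wf.Format.nChannels * wf.Format.wBitsPerSample / 8);
        wf.Format.nAvgBytesPerSec = wf.Format.nSamplesPerSec * wf.Format.nBlockAlign;
        wf.Format.cbSize          = sizeof(WAVEFORMATEXTENSIBLE) - sizeof(WAVEFORMATEX);  // 22
        wf.Samples.wValidBitsPerSample = 24;
        wf.dwChannelMask = SPEAKER_FRONT_LEFT | SPEAKER_FRONT_RIGHT |
                           SPEAKER_FRONT_CENTER | SPEAKER_LOW_FREQUENCY |
                           SPEAKER_BACK_LEFT | SPEAKER_BACK_RIGHT;
        wf.SubFormat = KSDATAFORMAT_SUBTYPE_PCM;
        return wf;
    }

The older route is just a WAVEFORMATEX with wFormatTag = WAVE_FORMAT_PCM and nChannels greater than two, which leaves the speaker assignment up to the driver.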
That leaves the matter of which bit depths and frequencies are supported. Thanks to the kX Project and the sources to the ALSA drivers, quite a bit is known about the Audigy series hardware. The primary audio path is the EMU10K2 chip, which runs on a fixed regimen of 48KHz, 16-bit audio, to which all voices are resampled. The Audigy 2 series adds an additional chip ("p16v") with 24-bit/96KHz support, which can either feed into the regular 48KHz effects path or output directly. The EMU10K2 is much more flexible with regard to sampling rates than the p16v, which only handles specific rates; I believe 44KHz, 48KHz, 96KHz, and 192KHz are available. Yet trying various audio formats with DirectSound produces some interesting results: I can believe that the EMU10K2 would be very flexible with regard to voice input formats, but I have some skepticism that either sound card can really render 32-bit, 145KHz, 68-channel audio.
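The "trying" here is nothing fancier than asking for a secondary buffer in the format and seeing whether the call fails, something like the sketch below; ProbePlaybackFormat() is my own name for it, and as the next paragraph explains, acceptance doesn't prove much.

    #include <windows.h>
    #include <dsound.h>

    // "Probe" a playback format by attempting to create a secondary buffer with it.
    // Success only means the format was accepted somewhere in the chain, not that
    // the hardware renders it natively.
    bool ProbePlaybackFormat(IDirectSound8 *pDS, WAVEFORMATEX *wfx) {
        DSBUFFERDESC dsbd = { sizeof(DSBUFFERDESC) };
        dsbd.dwFlags       = DSBCAPS_GLOBALFOCUS;
        dsbd.dwBufferBytes = wfx->nAvgBytesPerSec;      // about one second
        dsbd.lpwfxFormat   = wfx;

        IDirectSoundBuffer *pBuf = NULL;
        if (FAILED(pDS->CreateSoundBuffer(&dsbd, &pBuf, NULL)))
            return false;
        pBuf->Release();
        return true;
    }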
My theory is that the kernel mixer in Windows is very flexible at resampling from unsupported to supported formats, and that many of these formats aren't native, even in the sound card driver software. The sampling rate conversion is filtered, but I'm guessing that it simply chops off anything beyond 16 bits and two channels. Trying various formats for recording is more interesting, because requests for 24-bit and 32-bit formats only succeed when hardware support is available -- 96KHz/24-bit and 192KHz/24-bit work on the Audigy 2, but 145KHz/24-bit doesn't -- whereas just about any sampling rate up to 192KHz passes at 8-bit or 16-bit depth on any sound card. Unfortunately, there aren't caps or query functions in DirectSound, so there isn't a way to tell whether a format is actually supported in hardware. (This is probably one reason that the DirectSound capture filter in DirectShow doesn't expose 24-bit formats on its output pin, and thus why they don't show up in VirtualDub's "raw audio format" dialog box.) I have heard that this is going to become worse in Vista, with all audio capture being resampled from a single default recording format that is user-specified in Control Panel, the actual format being discernible only through the new WASAPI. This is good from the standpoint of things Just Working(tm), but bad from the standpoint of being honest with the user about what is actually supported.
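For the capture side, the best test I know of is the same trial-and-error approach, sketched below with my own ProbeCaptureFormat() helper; per the above, a failure here seems to track actual hardware support more closely than it does for playback.

    #include <windows.h>
    #include <dsound.h>

    // Probe a capture format the only way I know how: try to create a capture
    // buffer with it and see whether the call fails.
    bool ProbeCaptureFormat(WAVEFORMATEX *wfx) {
        IDirectSoundCapture8 *pDSC = NULL;
        if (FAILED(DirectSoundCaptureCreate8(NULL, &pDSC, NULL)))
            return false;

        DSCBUFFERDESC dscbd = { sizeof(DSCBUFFERDESC) };
        dscbd.dwBufferBytes = wfx->nAvgBytesPerSec;     // about one second
        dscbd.lpwfxFormat   = wfx;

        IDirectSoundCaptureBuffer *pBuf = NULL;
        bool ok = SUCCEEDED(pDSC->CreateCaptureBuffer(&dscbd, &pBuf, NULL));
        if (ok)
            pBuf->Release();
        pDSC->Release();
        return ok;
    }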