3D Audio for Shared VR

Audio is a critical, though often overlooked, part of the VR experience. Quality audio is essential for easy communication between real people, and it also has a meaningful impact on a user's sense of immersion and presence in the world. This post covers some recent, and we think exciting, additions to our audio offering.

There is a big difference between audio in a single-player environment and audio in a shared or social one. Processing audio to create a true 3D effect, typically by applying a "Head-Related Transfer Function" or HRTF, is a unique challenge when the audio is live and comes from multiple different sources. Spatial audio must provide realistic, low-latency localization for multiple, potentially moving, audio sources, culminating in a quality stereo audio stream for each user.

To make progress on these requirements, we had to do three new things:  

  • develop a server-side approach for mixing 3D audio,
  • create a new audio codec designed for VR, and
  • build an environmental reverb engine.

Server-side 3D audio: In a shared live environment, multiple people talk to each other while also listening to other, potentially real-time, audio sources. Here, 3D processing of audio cannot be done entirely on the client, as it is for typical gaming or single-player VR. Sending every live source stream to each receiver for 3D processing would overwhelm the bandwidth of the receiver's connection.

Instead, multiple live sources need to be combined at a server to create a single left/right stream for each receiver. This is a considerable problem, for several reasons:

  • End-to-end latency to and from the server must stay below the roughly 150 milliseconds required for face-to-face interaction to feel correct. Cell phones, for example, fall short of this requirement.
  • Some or most of the 3D processing of sound must be done at the mixer, so that the perceived position of each source relative to the listener (which the HRTF encodes) is preserved. There are different ways of approaching this, each with different results.
  • Most significantly, everything streamed must be encoded, sent to the server, decoded for mixing, re-encoded as combined audio, and finally decoded for the local listener. Encoding and decoding twice places additional demands on the codec.
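To make the bandwidth argument for server-side mixing concrete, here is a rough back-of-the-envelope sketch. The 24 kHz sample rate is inferred from the 240-sample, 10-millisecond segments described in this post, and the 4:1 compression ratio is the codec figure quoted in this post. The 16-bit source format and the function names are our assumptions for illustration, and packet overhead is ignored.

```python
# Rough downstream bandwidth per listener: client-side vs. server-side mixing.
# Assumptions (not measured values): 16-bit samples, no packet overhead.
SAMPLE_RATE_HZ = 24_000   # inferred from 240 samples per 10 ms segment
BITS_PER_SAMPLE = 16      # assumed source format
COMPRESSION_RATIO = 4     # 4:1 codec figure quoted in this post


def stream_kbps(channels: int) -> float:
    """Compressed bitrate of one stream with the given channel count."""
    raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * channels
    return raw_bps / COMPRESSION_RATIO / 1000


def client_side_kbps(num_sources: int) -> float:
    """Every mono source is forwarded to the listener for local 3D processing."""
    return num_sources * stream_kbps(channels=1)


def server_side_kbps(num_sources: int) -> float:
    """The server sends one pre-mixed stereo stream, whatever the source count."""
    return stream_kbps(channels=2)


if __name__ == "__main__":
    for n in (2, 10, 50):
        print(n, client_side_kbps(n), server_side_kbps(n))
```

Under these assumptions the client-side cost grows linearly with the number of sources, while the server-mixed stream stays fixed at the cost of one stereo stream.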

Our approach was first to break the audio into small segments (typically 10 milliseconds in length, or 240 audio samples) for transmission to and from the server. This keeps the base end-to-end latency low, typically about 40 milliseconds coast-to-coast in the US, so that after adding the delay from 3D processing and jitter buffering the entire process stays below a total of about 100 milliseconds.
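The numbers in this post can be put into a simple latency budget. The sketch below is only illustrative: it assumes the delays simply add, that the codec's 4 milliseconds is paid once per encode/decode pass (so twice in total), and that the HRTF costs its stated upper bound of 5 milliseconds. None of these accounting choices is confirmed here; they are our assumptions.

```python
# Illustrative latency budget built from the figures quoted in this post.
# The way the pieces are added together is an assumption, not a measurement.
FRAME_SAMPLES = 240
FRAME_MS = 10

BUDGET_MS = 100      # target for the whole process
BASE_MS = 40         # typical coast-to-coast latency quoted above
CODEC_MS = 4         # codec latency, assumed to apply per encode/decode pass
CODEC_PASSES = 2     # source -> server, then server -> listener
HRTF_MS = 5          # upper bound quoted for the HRTF


def sample_rate_hz() -> float:
    """Sample rate implied by 240 samples per 10 ms segment."""
    return FRAME_SAMPLES * 1000 / FRAME_MS


def remaining_for_jitter_ms() -> int:
    """Time left over for jitter buffering once the fixed costs are paid."""
    return BUDGET_MS - BASE_MS - CODEC_MS * CODEC_PASSES - HRTF_MS


def whole_frames(ms: int) -> int:
    """How many complete 10 ms segments fit in the given time."""
    return ms // FRAME_MS


if __name__ == "__main__":
    left = remaining_for_jitter_ms()
    print(sample_rate_hz(), left, whole_frames(left))
```

With these assumptions, roughly four 10-millisecond segments of jitter buffering fit inside the 100-millisecond target, which is why the small segment size matters.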

Next, and most importantly, we designed a new HRTF with a low enough latency (<5 milliseconds) and high enough speed (2500 source/listener pairs on a typical Intel CPU) to create the unique left/right mix for each listener. When mixing tens of avatars talking at once, the output can have a large dynamic range; we addressed this with a floating-point pipeline and high-quality peak limiter. Finally, we created a codec to compress the audio by a factor of four.
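To illustrate why a floating-point pipeline and a limiter are needed, here is a minimal sketch of mixing many sources and bringing the result back into range. It is not our production mixer: the per-frame gain scaling below is a deliberately crude stand-in for a real peak limiter, which would use lookahead and smooth its gain changes to avoid audible steps between frames. All function names are hypothetical.

```python
from typing import List, Sequence

INT16_MAX = 32767


def mix_float(sources: Sequence[Sequence[float]]) -> List[float]:
    """Sum equal-length source frames sample by sample in floating point."""
    return [sum(samples) for samples in zip(*sources)]


def limit_peak(frame: Sequence[float], ceiling: float = 1.0) -> List[float]:
    """Crude limiter: scale the whole frame down if its peak exceeds ceiling."""
    peak = max((abs(s) for s in frame), default=0.0)
    if peak <= ceiling:
        return list(frame)
    return [s * ceiling / peak for s in frame]


def to_int16(frame: Sequence[float]) -> List[int]:
    """Convert [-1.0, 1.0] floats to 16-bit integers, clamping as a safety net."""
    return [max(-INT16_MAX, min(INT16_MAX, round(s * INT16_MAX))) for s in frame]


if __name__ == "__main__":
    # 50 sources that are loud at the same instant: the sum is far above 1.0.
    sources = [[0.5, -0.5, 0.25] for _ in range(50)]
    mixed = mix_float(sources)
    print(mixed)
    print(to_int16(limit_peak(mixed)))
```

Summing in floating point keeps the intermediate values intact even when they are many times larger than full scale, so the limiter can decide how to reduce them instead of the mix clipping at the first addition.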

In the end, these innovations mean that a single High Fidelity server can easily process a group of 50 people and numerous additional ambient sound sources. That is a meaningful benchmark, given that the summed output of 50 avatars and ambient sounds will typically greatly exceed the range of 16-bit audio.

Audio codec designed for VR: Compression of audio for VR is a unique problem. Existing high-quality audio codecs add too much latency and are too slow to encode and decode at the scale required in shared VR. End-to-end latency for interactive communication should be kept below 100 milliseconds, and the latency imposed by a traditional transform-based audio codec, like AAC or MP3, would use up too much of that budget.
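A quick calculation shows why transform codecs are a poor fit. The sketch below uses the standard frame lengths of AAC-LC (1024 samples) and MPEG-1 Layer III MP3 (1152 samples) and treats the frame duration as a floor on codec delay, since an encoder has to collect a full frame before it can encode it. Real algorithmic delay is higher, so this is a conservative estimate; the 44.1 kHz rate and the two-pass accounting are our assumptions.

```python
# Frame duration as a lower bound on the delay of a frame-based codec.
# Frame lengths are the standard ones; the sample rate is an assumption.
FRAME_SAMPLES = {
    "AAC-LC": 1024,
    "MP3 (MPEG-1 Layer III)": 1152,
}
BUDGET_MS = 100       # interactive latency target from this post
CODEC_PASSES = 2      # audio is encoded and decoded twice via the server


def frame_ms(samples_per_frame: int, sample_rate_hz: int) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000 * samples_per_frame / sample_rate_hz


def budget_share(samples_per_frame: int, sample_rate_hz: int) -> float:
    """Fraction of the latency budget spent on frame buffering alone."""
    return CODEC_PASSES * frame_ms(samples_per_frame, sample_rate_hz) / BUDGET_MS


if __name__ == "__main__":
    for name, samples in FRAME_SAMPLES.items():
        ms = frame_ms(samples, 44_100)
        share = budget_share(samples, 44_100)
        print(f"{name}: {ms:.1f} ms per frame, {share:.0%} of budget over two passes")
```

Even by this optimistic measure, two passes through such a codec consume close to half of the 100-millisecond target before any network or processing delay is counted.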

To address this challenge, we developed a subband audio codec to meet the dual goals of low latency and high performance. The High Fidelity Audio Codec reduces the bandwidth required for streaming voices, sound effects, and music by a factor of four with perceptually transparent quality, while adding a latency of only 4 milliseconds. The encoder and decoder are highly optimized and can process 3000 source/receiver pairs on a single server.

Parametric reverb for environmental audio: Room acoustics are important for creating the correct sense of presence in VR. Much of the sound of a person's voice, even when nearby, comes from the many different reflections off the walls. Accurately reproducing all those reflections is computationally expensive, and it has to be done for each source/listener pair. Stopping short by modeling only some of the reflections, even a considerable number, can result in objectionable artifacts.

Our solution was to design a high-quality reverb engine with programmable settings that can transition smoothly and rapidly between different simulated environmental conditions. These settings automatically map to the listener's current surroundings, creating a more immersive experience. You can listen to the effect in the demo below, best experienced with headphones.

This is achieved by sampling the virtual space near the listener using raycasting. The reverb engine looks for collisions and, in real time, applies reverb settings that adequately reproduce the sound of the listener's current virtual location. In the demo, you can hear this effect when the walls are suddenly removed from the room toward the end of the clip.
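The general idea of mapping raycast results to reverb settings can be sketched as follows. This is a simplified 2D illustration, not our engine: the rectangular test room, the mapping from hit distance to decay time, the parameter names, and the smoothing rate are all hypothetical choices made for the example.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Vec2 = Tuple[float, float]
# A ray caster returns the distance to the first collision, or None for a miss.
RayCaster = Callable[[Vec2, Vec2], Optional[float]]


def box_room_caster(half_width: float, half_depth: float) -> RayCaster:
    """Hypothetical scene: a rectangular room centered on the origin."""
    def cast(origin: Vec2, direction: Vec2) -> Optional[float]:
        best = None
        for o, d, half in ((origin[0], direction[0], half_width),
                           (origin[1], direction[1], half_depth)):
            if d == 0:
                continue
            # Distance to the wall this ray is heading toward on this axis.
            t = ((half if d > 0 else -half) - o) / d
            if best is None or t < best:
                best = t
        return best
    return cast


def open_space_caster(origin: Vec2, direction: Vec2) -> Optional[float]:
    """Hypothetical scene with no walls: every ray misses."""
    return None


def sample_surroundings(cast: RayCaster, listener: Vec2,
                        num_rays: int = 16, max_range: float = 50.0):
    """Cast rays in a circle; return (fraction that hit, mean hit distance)."""
    hits = []
    for i in range(num_rays):
        angle = 2 * math.pi * i / num_rays
        dist = cast(listener, (math.cos(angle), math.sin(angle)))
        if dist is not None and dist <= max_range:
            hits.append(dist)
    enclosure = len(hits) / num_rays
    mean_distance = sum(hits) / len(hits) if hits else max_range
    return enclosure, mean_distance


def target_reverb(enclosure: float, mean_distance: float) -> Dict[str, float]:
    """Made-up mapping: bigger rooms decay longer, open space is drier."""
    return {"decay_s": min(3.0, 0.2 + 0.1 * mean_distance),
            "wet": 0.5 * enclosure}


def smooth(current: Dict[str, float], target: Dict[str, float],
           rate: float = 0.2) -> Dict[str, float]:
    """Move each setting a fraction of the way toward its target per update."""
    return {k: current[k] + rate * (target[k] - current[k]) for k in target}


if __name__ == "__main__":
    settings = target_reverb(*sample_surroundings(box_room_caster(2, 2), (0, 0)))
    outdoors = target_reverb(*sample_surroundings(open_space_caster, (0, 0)))
    for _ in range(5):  # walls removed: settings glide toward the open-space target
        settings = smooth(settings, outdoors)
        print(settings)
```

Smoothing toward a target, rather than switching settings instantly, is one simple way to get the rapid but click-free transitions described above.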

Summary: As you can hear in the demo clip, the overall effect is an immersive 3D audio experience, despite numerous live sources arriving over the internet with variable amounts of delay, and with a fixed bandwidth used for each receiver. The beta version of the High Fidelity sandbox has all the tech described here, and is being used today by our alpha and new beta users to create many different VR spaces.