Creating Crowds in VR
90 people gathered in a theater at our second stress test, doing the wave.  

We reached 90 concurrent people in one space last Friday, and we plan to keep trying to assemble as many people as possible in VR every week or two.

Many types of social VR experiences will require lots of people together in one place: lecture halls full of students, live music events, company meetings, or public town halls. This is a hard scaling problem, because people talking and moving in headsets with hand controllers generate a lot of bandwidth and demand a lot of server and client CPU/GPU. We are iterating and testing with the goal of ultimately reaching a capacity of thousands in one space.

For VR to reach its full potential, crowd events need to deliver 'presence' in the same way that a single-user 3D space delivers presence to the wearer of an HMD when the visual latency is low enough. In the case of a crowd, presence requires really low latency, high quality 3D audio for everyone, and being able to see people moving. You can try to cut corners by doing things like sharding the space or muting the audience, but this very rapidly degrades the experience to a point where it is not compelling for performers, speakers, or audience members. So we are focusing on fully enabling everyone in one space, and here is where we are so far:

In the image above there are 90 people together, most of them with HMDs and hand controllers (either Vive or Rift + Touch). This is a lot of data moving around: each person generates 100Kbps of compressed audio from their microphone, which is sent to the audio server, and about 200Kbps of joint motion data, which is sent to the avatar server. That is about 30Mbps received by the servers. The output from the servers was about 270Mbps, or about 3Mbps per receiver. So one way of looking at the overall problem is that we need to reduce or compress the incoming data by roughly a factor of 10 per receiver without making the experience any less real.
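The arithmetic above can be sketched as a quick back-of-the-envelope calculation (the per-person rates are the figures measured at this event; the helper name is ours):

```python
# Back-of-the-envelope server bandwidth for an N-person event,
# using the per-person rates measured at this stress test.

AUDIO_KBPS = 100    # compressed microphone audio per person
MOTION_KBPS = 200   # joint motion data per person

def server_bandwidth_mbps(n_people, per_receiver_mbps=3.0):
    """Return (inbound, outbound) server bandwidth in Mbps."""
    inbound = n_people * (AUDIO_KBPS + MOTION_KBPS) / 1000.0
    outbound = n_people * per_receiver_mbps
    return inbound, outbound

inbound, outbound = server_bandwidth_mbps(90)
print(inbound, outbound)   # 27.0 270.0 -- roughly a 10x expansion
```

Sending each receiver the full 27Mbps of raw input would be the naive alternative, which is why the ~10x reduction per receiver is the core of the problem.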

Latency has to stay low for everyone to feel connected (with each other and with the performers). At this event, most people experienced a latency of between 100 and 150 milliseconds - the delay between motion or sound at their actual bodies and its being seen or heard by others. Studies have shown that if you let this latency get much over 200 milliseconds, you stop feeling connected to other people and have a difficult time communicating (like on a bad cell phone call). How this translates to a crowd experience will be different, but when I asked people to do things like clap or cheer, the low latency felt great. To get there, we packetize the audio and motion data in small blocks, use UDP for the underlying transport, and remove all unneeded buffering or delay at the servers.
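One reason small blocks matter: a packet cannot be sent until its audio frame has been captured, so the frame length itself is a floor on latency. A tiny sketch of that trade-off, assuming the ~100Kbps compressed audio rate from this event (the helper is ours, not our wire format):

```python
# Packetization delay: a frame_ms-long audio block adds frame_ms of
# latency before the packet can even leave the machine. Assumes the
# ~100 Kbps compressed audio rate measured at this event.

def packet_bytes(bitrate_kbps, frame_ms):
    """Payload size of one audio frame at the given bitrate."""
    return int(bitrate_kbps * 1000 / 8 * frame_ms / 1000)

# A 10 ms frame at 100 Kbps is only 125 bytes -- small enough to keep
# packetization a minor part of a 100-150 ms end-to-end budget.
print(packet_bytes(100, 10))    # 125
print(packet_bytes(100, 100))   # 1250 -- a 100 ms frame eats most of the budget
```

This is also why UDP fits better than TCP here: a lost 10ms block is cheaper to skip than to retransmit and wait for.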

3D audio is also challenging, because you need to apply an HRTF ('Head-Related Transfer Function') to each audio source for it to sound correct to the receiver in an HMD. The typical way this is done in many existing VR apps is to send the source audio to each receiver and let the client apply the HRTF, but for 90 people this would require that everyone have 9Mbps of available downstream bandwidth just to receive the audio. What we do instead is use a server to apply the HRTF for each source and then mix them all together to create a single stream (per ear) for each receiver. But this requires a very fast HRTF, since if everyone yells at the same time in the same room, that is in the worst case 90 x 90 = 8,100 HRTFs being computed at once! Of course you can do better by not processing the really quiet streams, but at peak our audio server was processing several thousand HRTF pairs at once - we had everyone shouting and clapping, for example, which sounded really great. So making the audio server very fast has been a big part of our work.
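The server-side mix loop can be sketched roughly as follows. This is a pure-Python illustration, not our actual mixer: the 2-tap FIR filters are toy stand-ins for real per-position HRTF filter pairs, and the function names and silence threshold are ours.

```python
# Sketch of server-side 3D audio mixing: one HRTF per (source, receiver)
# pair, quiet sources skipped, all sources summed into one stream per ear.

def apply_fir(signal, taps):
    """Minimal FIR convolution (toy stand-in for HRTF filtering)."""
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        for j, t in enumerate(taps):
            if i - j >= 0:
                out[i] += t * signal[i - j]
    return out

def mix_for_receiver(sources, hrtf_for, silence_threshold=0.01):
    """Mix all audible sources into a single (left, right) stream."""
    left = right = None
    for src_id, samples in sources.items():
        if max(abs(s) for s in samples) < silence_threshold:
            continue  # skip really quiet streams to save HRTF work
        l_taps, r_taps = hrtf_for(src_id)  # filters depend on relative position
        l, r = apply_fir(samples, l_taps), apply_fir(samples, r_taps)
        if left is None:
            left, right = l, r
        else:
            left = [a + b for a, b in zip(left, l)]
            right = [a + b for a, b in zip(right, r)]
    return left or [], right or []

sources = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "quiet": [0.0, 0.0, 0.0]}
hrtf = lambda src_id: ([0.5, 0.25], [0.25, 0.5])  # toy left/right filter pair
left, right = mix_for_receiver(sources, hrtf)
print(left)   # [0.5, 0.75, 0.25]
```

The key property is that the per-receiver output cost is constant (one stereo stream) no matter how many sources are audible; all the scaling pressure lands on the server's HRTF throughput.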

The servers can also adjust LOD for receivers that don't have enough bandwidth. The audio stream is fixed at 200Kbps, but the data about other people's body motions is quantized and prioritized, and can be reduced substantially without hurting quality too much, allowing a good experience to be delivered at different levels of bandwidth and/or rendering capacity. This is one of our big areas of development focus, and what we are hoping to study and improve in these stress tests. Making a crowd feel like a crowd depends on the LOD choices you make: you will not be able to send all the data from everyone to everyone else, but you still want to preserve the 'feeling' of a lot of people moving around in your view.
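One plausible shape for that quantize-and-prioritize step is sketched below. To be clear, the bit widths, distance cutoffs, and budget numbers here are illustrative assumptions of ours, not High Fidelity's actual wire format:

```python
# Sketch of avatar-data LOD: send nearby avatars first, and spend fewer
# bits per joint on avatars that are far away. All constants are
# illustrative, not an actual protocol.

def quantize(value, bits):
    """Quantize a value in [-1, 1] to the given number of bits."""
    levels = (1 << bits) - 1
    return round((value + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

def build_update(avatars, budget_bytes):
    """Pick avatars nearest-first, dropping precision to fit the budget.

    avatars: list of (distance, avatar_id, joint_values) tuples.
    """
    update, used = [], 0
    for dist, avatar_id, joint_values in sorted(avatars):
        bits = 16 if dist < 5.0 else 8 if dist < 20.0 else 4
        size = len(joint_values) * bits // 8
        if used + size > budget_bytes:
            break  # far avatars are dropped first when bandwidth is tight
        update.append((avatar_id, [quantize(v, bits) for v in joint_values]))
        used += size
    return update

update = build_update([(2.0, "near", [0.5] * 8), (50.0, "far", [0.5] * 8)],
                      budget_bytes=20)
print([avatar_id for avatar_id, _ in update])   # ['near', 'far']
```

Shrinking the budget drops the distant avatar entirely while the nearby one keeps full precision, which is the basic trade-off behind making a crowd still feel like a crowd.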

Try it yourself: If you want to host a 100-person event with High Fidelity, you will need a server with at least 50Mbps upstream (to your server) and at least 500Mbps downstream. For this test we used an Amazon c4.xlarge instance for the avatar, entity, and message servers, and a c4.4xlarge instance for the audio mixer (which is the most heavily loaded). As a side note, bandwidth and CPU are getting really cheap: the Amazon hosting cost of a two-hour 100-person event run this way was only about $15!
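As a quick sanity check, the 50/500Mbps recommendation leaves comfortable headroom over the per-person rates measured above (the helper name is ours):

```python
# Check the 50 Mbps up / 500 Mbps down recommendation against the
# per-person rates measured at this stress test.

def required_mbps(n_people, up_kbps_per_person=300, down_mbps_per_person=3):
    """(inbound-to-server, outbound-from-server) in Mbps for n people."""
    return n_people * up_kbps_per_person / 1000, n_people * down_mbps_per_person

inbound, outbound = required_mbps(100)
print(inbound, outbound)   # 30.0 300 -- comfortably under 50 / 500
```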

If you get a chance, join us at our next stress test this Friday, February 10, at 2pm PST. You can sign up to participate right here.