Binaural (3D) audio is the future of virtual conferencing


In a recent article for Aeon journal I argued that videoconferencing could and should be a technology that grows key capabilities for human development, overcomes inequalities, and enables cooperative action to address the major challenges we face. It is, however, a poorly conceived and developed medium. Constrained within tiny square boxes on the screen, with low-quality sound that compresses all voices into one non-directional source, and with clunky interactivity, the medium is unlikely to realise its possibilities.

Dave White’s thoughts on “spatial collaboration”, combined with a term spent teaching Design Thinking through Teams, led me to try some variations in my own home-studio set-up (I like to think of it as a studio, as it is where I make things). Dave is rightly critical of primitive attempts at skeuomorphic representations of groups of people on screen – for example, the gimmicky Together Mode in Teams – which only ever render an experience that is sort of real but not quite, and therefore just plain weird. Dave argues that we are better off adopting non-skeuomorphic spatial models. We use such an approach in Design Thinking when we collaborate on a Miro whiteboard. The board is usually set up with a spatial organisation created using frames – each frame has a different role in the collaboration. As we talk to each other online, we can move our cursors over to different parts of the board, zoom in, and work together on them. We can also see each other’s cursors, so we can see where the action is happening. When we zoom out from the board while 30 people are working on it at the same time (often in a set of breakout rooms), we can see them swarming around, congregating, adding and editing.

We have found that the spatial collaboration approach is good for a certain kind of collaboration. It’s especially effective when working in a loosely coupled collaboration over a length of time. Sometimes I open up a board outside of a workshop, see people active in it and, if I want, start up a conversation with them (through Teams, although the full pro version of Miro has built-in videoconferencing).

But there are other cases when, I believe, a more effective skeuomorphic spatialisation would be appropriate. I’ve been experimenting with this. I have my studio space set up with two cameras. I use the webcam on my MacBook Pro (raised about 8 cm on a stand) for the usual “at the keyboard” shots. But I also have a camera on a tripod, set about 2.5 metres to the left of my desk (I plan to get a longer HDMI cable so I can go further back). I can use this for a shot in which I’m sitting next to my laptop, but not at it. So that I can keep the other participants in view, I’ve got an iPad on a long angle-poise stand. I use Apple Sidecar to display the Teams meeting (from my Mac) on the iPad, and then move it to the left, between me and the side camera.

I’ve gone further still in expanding the sense of space in the view I present of myself over Teams. The last day of workshops in our Design Thinking modules was led by the physical theatre company Highly Sprung. We usually have a session with them on campus, in a theatre rehearsal room (big, open, sprung floor). This time, over Teams, they had us moving about in sync, jumping around and doing all kinds of moves. I moved the side camera so that it could show me standing and moving around.

I then tried a different camera angle, with me sitting on a rocking chair about 4 metres back from my desk (with my feet up on a rest and the fire on). I put Teams up on the big screen, and used a Blue Yeti mic to capture sound at a distance. I could have gone a step further, repositioning the iPad and using a Bluetooth keyboard and mic. This all works really well, except that the one-dimensional nature of the audio became even more obviously strange. So I thought: wouldn’t it be great if the voices of the participants could be spread out around the room, using the same brilliant binaural 3D effect provided by the speakers on the Oculus Go VR headset? But what if we left it at that? Do we really need the poor-quality images of people’s faces? I often listen to audio-only discussion radio programmes and podcasts, such as In Our Time (BBC history). I find that format both effective, in terms of maintaining attention for a long period, and relaxing. This article, on Improving Intelligibility with Spatial Audio, from the sound specialists Dolby, explains more about why this might be the case.

So that’s the conjecture: forget video, high-quality binaural sound is the future of virtual conferencing.

Once the technology enables it, we might also add a subtle and unobtrusive bit of AR (which could be toggled on or off) – showing the identity of the speaker in the location of their voice, perhaps by simply overlaying their name on the normal view of the room.

Most of the elements of this design are already available. The question is: will the tech industry drop its obsession with video? I’m not the only one to have come to this realisation. For example, see this article by Mark Sparrow of Forbes. It might just happen soon. The technology is quite straightforward. See for example (some “live” demos, but not well-chosen examples, too noisy and complicated).
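To give a feel for the basic idea of placing a voice in space, here is a minimal Python sketch. It is my own illustration, not how Teams, Dolby or the Oculus Go do it: real binaural rendering uses measured head-related transfer functions, whereas this only approximates two of the cues involved, the slight delay and the slight drop in level at the ear further from the speaker. The function name `place_voice`, the head radius and the sample rate are all assumptions chosen for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, in air at room temperature
HEAD_RADIUS = 0.0875    # metres; an assumed average, not a measured value


def place_voice(mono, azimuth_deg, sample_rate=48000):
    """Place a mono voice at an angle, returning (left, right) sample lists.

    azimuth_deg: 0 is straight ahead, +90 fully right, -90 fully left.
    This is a crude approximation of binaural cues, not an HRTF renderer.
    """
    # Clamp to the frontal half-plane and convert to radians.
    az = math.radians(max(-90.0, min(90.0, azimuth_deg)))

    # Time difference between the ears (Woodworth's spherical-head formula).
    itd_seconds = HEAD_RADIUS / SPEED_OF_SOUND * (abs(az) + math.sin(abs(az)))
    delay = round(itd_seconds * sample_rate)  # whole samples

    # Level difference, using a constant-power pan between the two ears.
    pan = (az + math.pi / 2) / 2  # 0 (left) .. pi/2 (right)
    gain_left, gain_right = math.cos(pan), math.sin(pan)

    # The far ear hears the voice late; the near ear is padded at the end
    # so that both channels have the same length.
    delayed = [0.0] * delay + list(mono)
    prompt = list(mono) + [0.0] * delay
    if az >= 0:  # voice on the right, so the left ear is the far ear
        left_src, right_src = delayed, prompt
    else:
        left_src, right_src = prompt, delayed

    left = [gain_left * s for s in left_src]
    right = [gain_right * s for s in right_src]
    return left, right
```

In a meeting, each participant’s voice would be given its own angle and the resulting left and right channels summed, so that over headphones each person appears to speak from a different position in the room.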

Dr Robert O'Toole NTF

Senior Teaching Fellow, Arts Faculty, University of Warwick. Fellow of the Higher Education Academy, National Teaching Fellow, Warwick Award for Teaching Excellence.
