What is spatial audio?

It’s audio that is designed to feel like things are happening around you. Normal audio sounds like it’s just happening in your head. Spatial audio sounds like it’s happening all around you.
It’s only noticeable with a decent audio system, or some reasonable headphones though.

It is just a fancy name for simulated surround sound.

You take a surround sound source (most movies/TV and some music) and manipulate it so that when only 2 speakers are playing it, it sounds more enveloping than just playing a stereo source.

Some also go further with head tracking. Surround sound movies/TV have almost all dialogue in the center channel. Head tracking keeps the center channel physically where the screen is, so if you turn your head to the right, the dialogue comes out of your left ear.

Here is (approximately) how it’s done:

If you are out in the open, and there’s a sound to your right, you don’t only hear it in your right ear – you hear it in both ears. So in headphones you can’t just put some audio on the right hand side and expect that to sound right. In real life you’ll hear it also with your left ear – there will be a slight delay and it will be less loud because it’s partially blocked by your head.

So spatial audio uses sort of a physical model of a head and ears to try to reproduce how a sound that is a specific distance and direction from you would sound if you were to actually hear it in the real world.
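
To put a rough number on that delay between the ears, here is a minimal sketch using Woodworth's spherical-head approximation. The head radius and speed of sound are assumed typical values, not measurements, and the function name is made up for illustration. It only covers the timing part; the loudness difference depends heavily on frequency and is left out.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius, in meters
SPEED_OF_SOUND = 343.0   # m/s in air at roughly room temperature

def interaural_time_difference(azimuth_deg):
    """Estimate how much later a sound reaches the far ear, in seconds.

    azimuth_deg: 0 means straight ahead, 90 means directly to one side.
    Uses Woodworth's formula, which treats the head as a rigid sphere.
    """
    if not 0 <= azimuth_deg <= 90:
        raise ValueError("this sketch only covers 0 to 90 degrees")
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND * (theta + math.sin(theta))

# A sound directly to one side arrives at the far ear well under a
# millisecond later, yet the brain uses that gap to judge direction.
delay_ms = interaural_time_difference(90) * 1000
print(f"{delay_ms:.2f} ms")
```

With these assumed values the gap for a sound directly to one side comes out at roughly two thirds of a millisecond.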

It can often be difficult to tell whether a sound is in front of or behind you in spatial audio, but that’s also fairly true in real life.

Spatial audio is a more advanced technology than stereo that tries to replicate how listeners perceive the position of recorded sounds.

True SA comprises two parts: audio recording and playback.

Old-fashioned stereo records in two channels, left and right, but the sound-collecting microphones don’t have the exact same propagation properties as human ears, so stereo recordings have pretty vague positioning. It’s pretty difficult to tell anything other than generic left or right.

Surround sound tried to improve on that by recording with at least 4 channels, making it much more directional, at least horizontally. Modern home cinema tracks are commonly recorded with 5 or 7 channels, not counting the bass channel.

Stereo and surround sound both have the problem that the recording and playback channels aren’t always spatially identical. There are standards and recommendations for playback equipment setup and calibration, and recordings are made assuming those playback requirements are met. But we all know it’s not practical for home theaters to be 100% up to spec. Compromises have to be made on playback systems, and sometimes recordings have to take that into consideration as well, reducing the accuracy of the spatial information.

True spatial audio tries to solve that problem by calculating how sounds propagate from the source all the way to your ears, using more accurate mathematical models of sound propagation from various directions. The model is called a “head-related transfer function” (HRTF). With that model, the way sounds from various directions are collected and reflected by your pinna (outer ear), causing various frequency and phase changes, is calculated for each sound object. The results are mixed together to represent what you would actually hear, then played back via traditional output devices.
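
As a rough sketch of that per-object calculation: in the time domain, applying an HRTF amounts to convolving each sound object with a left-ear and a right-ear impulse response (often called an HRIR) for its direction, then summing the results per ear. The function names and the tiny impulse responses below are made up for illustration. Real HRIRs are measured, run to hundreds of samples, and also encode the pinna's frequency shaping, which this toy ignores.

```python
def convolve(signal, impulse_response):
    """Plain direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def mix(tracks):
    """Sum tracks sample by sample, treating shorter ones as silent at the end."""
    length = max(len(t) for t in tracks)
    return [sum(t[i] for t in tracks if i < len(t)) for i in range(length)]

def render_scene(objects):
    """objects: list of (mono_samples, hrir_left, hrir_right) tuples.

    Returns (left_channel, right_channel) for headphone playback.
    """
    left = mix([convolve(mono, hl) for mono, hl, _ in objects])
    right = mix([convolve(mono, hr) for mono, _, hr in objects])
    return left, right

# Made-up HRIRs: the near ear hears the sound at once and at full level,
# the far ear hears it 3 samples later at half level.
NEAR_EAR = [1.0]
FAR_EAR = [0.0, 0.0, 0.0, 0.5]

voice_on_right = ([1.0, 0.5], FAR_EAR, NEAR_EAR)   # left ear is the far ear
click_on_left = ([1.0], NEAR_EAR, FAR_EAR)         # right ear is the far ear

left, right = render_scene([voice_on_right, click_on_left])
```

Each ear ends up with its own mix of every object, which is the "mixing together" step described above.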

This kind of sound rendering technology is pretty common in video games: sound objects are actually placed in space and the HRTF is calculated according to the player’s point of view. It is also pretty common for recording studios to produce surround sound movie tracks or stereo TV/music tracks this way. Component sound objects are recorded with separate microphones and then spatially recombined. Mastering software can even change sound object placement virtually, fine-tuning the final output.

Better yet, using an array of microphones with known positions relative to each other, advanced software can analyze the individually recorded tracks and use beamforming calculations to pick out individual sound objects and work out their spatial placement in relation to the microphones. That gives you spatial information to work with, which is important later on.
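
To give a feel for the beamforming idea, here is a minimal delay-and-sum sketch for just two microphones. It tries a range of steering delays, keeps the one where the two tracks add up most strongly, and converts that delay to an arrival angle. All names are hypothetical, and it assumes a single distant source, whole-sample delays, and no noise or echoes, which real software cannot assume.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def steered_power(mic_a, mic_b, delay):
    """Power of (mic_a delayed by `delay` samples) + mic_b.

    A negative delay advances mic_a instead. Samples shifted in from
    outside the recording count as silence.
    """
    total = 0.0
    for n in range(len(mic_b)):
        k = n - delay
        a = mic_a[k] if 0 <= k < len(mic_a) else 0.0
        total += (a + mic_b[n]) ** 2
    return total

def best_delay(mic_a, mic_b, max_delay):
    """Steering delay (in samples) at which the two tracks line up best."""
    candidates = range(-max_delay, max_delay + 1)
    return max(candidates, key=lambda d: steered_power(mic_a, mic_b, d))

def arrival_angle_deg(delay_samples, sample_rate, mic_spacing_m):
    """Far-field angle off the array's broadside for a given delay."""
    path_difference = SPEED_OF_SOUND * delay_samples / sample_rate
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing_m))
    return math.degrees(math.asin(ratio))

# The same short pulse reaches microphone B two samples after microphone A.
mic_a = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]

delay = best_delay(mic_a, mic_b, max_delay=3)
angle = arrival_angle_deg(delay, sample_rate=48000, mic_spacing_m=0.2)
```

With more microphones the same trick narrows the direction down much further, and steering toward one direction also suppresses sounds arriving from elsewhere.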

Then there’s the push to bring this technology to consumer home theater and music.

SA recordings can be traditional surround sound, or newer formats like Dolby Atmos, which can carry dozens of “sound object” tracks, each recording how the object sounds along with information about its spatial placement.

When a device plays back an SA recording, it is basically doing sound rendering just like a 3D video game. It can use a fixed point of view and a predefined HRTF and output via traditional surround channels. Better yet, it can use a pair of headphones with well-known frequency response characteristics, so the HRTF calculation can factor them in. That creates a very accurate rendering of spatial audio, faithfully recreating the soundstage as if the listener were on site instead of sitting at home.

What’s even more magical is that if the headphones can measure the listener’s head movements, it’s literally like moving your point of view in an FPS game. The playback device can adjust to that and render the sound objects as moving relative to you, so you feel like the sound is really there, much more accurately than with regular stereo or surround sound.
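
The head-tracking adjustment itself can be sketched in a few lines: subtract the head's rotation from each sound object's direction before rendering it. This assumes angles in degrees with positive meaning "to the right" and only handles left/right turning (yaw). The function name is made up for illustration.

```python
def relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Direction of a sound relative to the listener's nose.

    Both angles are measured in the room, positive to the right.
    The result is wrapped into the range [-180, 180).
    """
    return (source_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# Dialogue is anchored at the screen, straight ahead in the room (0 degrees).
# Turn your head 90 degrees to the right and it should now be rendered
# 90 degrees to your left.
print(relative_azimuth(0, 90))   # -90.0
```

The renderer would then pick the HRTF for this relative direction on every update, so the sound stays put in the room while your head moves.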

Sometimes older songs, even those recorded decades ago, can be remastered into true SA. That’s because the original master files were either recorded separately for each sound component, so you know where each one is placed, or recorded with multi-microphone arrays placed at the concert, making reasonably accurate beamforming deduction possible.

Without these original master files, Apple devices can still “spatialize” regular stereo songs. They basically create a virtual sound stage with only two spatially placed “speaker” objects, creating the illusion of a spatially recorded track. But that “spatialize” feature sounds hollow because there is very little information in a stereo track to work with, and that’s as much as battery-powered handheld devices can compute.