The speakers on your TV aren’t expensive and dynamic enough to perfectly replicate the complexities of a human voice – the sound wave is “flattened” and compressed with a lot of the peaks and valleys and low bass removed or simplified.
If you had really good recording equipment and really good speakers you could absolutely fool the ear, but the little one-inch speakers and shitty mixing software on a TV can’t do it.
There are lots of hints that your ears and brain can pick up on to determine whether a voice is live or recorded.
First of all, recorded voices are often highly processed, particularly for things like TV and radio transmission. One type of processing that is often used is called compression. It’s not data compression (like a .zip file), but dynamic range compression. Essentially, it changes the volume of some parts of the signal, making it so that the loudest parts of the signal and the softest parts of the signal are much closer in volume than they were originally (thereby “compressing” the dynamic range of the signal). This has the effect of making speech easier to understand because it’s all coming out of your speaker at nearly the same volume, regardless of whether the person was yelling or mumbling. However, this kind of processing can produce an “unnatural” sound, and your brain can pick up on that. In the real world, the human voice can have a large dynamic range. When a voice sounds too “perfect”, it’s a clue to your brain that it’s a recording.
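To put rough numbers on the compression idea above, here is a minimal Python sketch of a static compressor. The threshold (-20 dB) and ratio (4:1) are assumed values picked for illustration, and `compress` and `dynamic_range_db` are hypothetical helper names. Real broadcast compressors also smooth the gain over time (attack/release) and usually add make-up gain, which this sketch leaves out.

```python
import math

def compress(samples, threshold_db=-20.0, ratio=4.0):
    """Simplified static compressor: anything louder than the threshold
    keeps only 1/ratio of its overshoot (in dB). Quieter samples pass through."""
    out = []
    for s in samples:
        if s == 0.0:
            out.append(0.0)  # silence has no level in dB, pass it through
            continue
        level_db = 20.0 * math.log10(abs(s))
        if level_db > threshold_db:
            target_db = threshold_db + (level_db - threshold_db) / ratio
        else:
            target_db = level_db
        gain = 10.0 ** ((target_db - level_db) / 20.0)
        out.append(s * gain)
    return out

def dynamic_range_db(loud, quiet):
    """Level difference between two amplitudes, in dB."""
    return 20.0 * math.log10(abs(loud) / abs(quiet))

# A mumble and a shout, as made-up peak amplitudes (1.0 = full scale)
quiet, loud = 0.05, 1.0
c_quiet, c_loud = compress([quiet, loud])

before = dynamic_range_db(loud, quiet)
after = dynamic_range_db(c_loud, c_quiet)
print(f"dynamic range before: {before:.1f} dB, after: {after:.1f} dB")
```

With these assumed settings the mumble is left alone and the shout is pulled down, so the gap between them shrinks from roughly 26 dB to roughly 11 dB. That narrowed gap is the "everything at nearly the same volume" effect described above.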
You may not notice it consciously, but voices on a TV very often have music or other sound effects in the background. This is an obvious clue that it’s recorded audio. Even if there’s no background music but there are some sound effects added (like footsteps, running water, horse hoofs, whatever), the sound effects often play at unnaturally loud volumes and your brain notices this as artificial.
Additionally, typical TV speakers are low cost and low quality, and they don’t accurately reproduce all frequencies of the human voice. This can lead to more unnatural sounds that your brain can recognize as artificial. But this effect is probably a lot more subtle in this case. If you were really listening to a voice in another room, you’d already be losing a ton of frequencies (due to air absorption, and the sound having to bounce off walls or go through walls to get to your ears), so low quality speakers would only have a very small effect after all of those other losses.
But in general, if you had relatively high quality speakers, and a relatively high-quality recording of a voice with minimal processing, it’s not difficult to make a convincing reproduction of a human voice. In this case, it would be absolutely impossible to tell if a voice coming from a room was live or recorded.
You don’t. Your eyes give you the extra information that the person talking is not in the room. If you were to do a blind test with a theoretically perfect recording (no hiss/hum from the mic) and a theoretically perfect speaker that could reproduce every frequency audible to us accurately, it’s likely you could fool someone, but other sound cues such as breathing or ambient noise from just someone’s clothes could be enough to tip someone off.
I accidentally confused my roommate this way – they were in a room next to the living room, they heard a voice while I was watching TV and thought it was me talking to them. Without the extra visual context behind the sound, they couldn’t confirm immediately if it was me or the TV. Granted this anecdote is not perfect because the sound gets slightly distorted traveling through a space, but still it goes to show just how much lifting our other senses do when it comes to perceiving this world.
One – commercial audio is heavily compressed, meaning soft and loud sounds end up at similar volumes. Why? So the final product (TV audio, music also) can be heard clearly on shitty speakers, in noisy environments, or at low volumes (think a parent watching TV while their baby/family is asleep).
Think ASMR, Billie Eilish whispering her songs into the mic, your TV character speaking softly to themselves. It’s all soft, but the volume is blown up. I’d argue this compression sounds natural to us as we’ve been listening to this processing all our lives.
Addendum: there is a lot of other processing going on, and it’s all to make everything sound consistent and smooth and an artificial “natural”.
Two – reverb, or as the layman calls it, echoes. The exaggerated version is obvious – when you are in a cave, you hear your voice reverberating/echoing throughout the cave. Even in a small room, your ear is hearing a mixed signal of someone speaking with their voice reflecting off the ceiling, floor, and four walls surrounding.
TV audio doesn’t reproduce audio with the same intimacy of someone facing you. All that processing we spoke about earlier isn’t tuned for that.
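The reverb point can be sketched in a few lines of Python: what reaches your ear in a room is the direct sound plus delayed, quieter copies of it from each reflecting surface. The delays and gains below are made-up numbers, not measurements of any real room, and `add_reflections` is a hypothetical helper name.

```python
def add_reflections(dry, reflections):
    """Mix a dry signal with delayed, attenuated copies of itself.

    reflections is a list of (delay_in_samples, gain) pairs, one per
    reflecting surface. The values used below are illustrative only.
    """
    length = len(dry) + max(delay for delay, _ in reflections)
    wet = [0.0] * length
    for i, s in enumerate(dry):
        wet[i] += s                      # direct sound, straight to the ear
        for delay, gain in reflections:
            wet[i + delay] += s * gain   # same sound, arriving later and quieter
    return wet

click = [1.0, 0.0, 0.0, 0.0]        # a single short click
room = [(2, 0.5), (5, 0.25)]        # two assumed reflections
heard = add_reflections(click, room)
print(heard)
```

A recording made with a close mic mostly captures the direct sound, so when it plays back through a TV it picks up your room's reflections from the TV's position rather than from where a person would be standing.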
Your brain is hardwired to detect human voices more than other sounds. However, your TV speakers don’t output all frequencies, and the editing and broadcasting compress the audio and shape the frequencies in an unnatural or low-quality way for file size, style, etc.
If you got a high-quality recording of someone and played it back on very high-quality speakers in an environment where you aren’t aware of a speaker in the room, you’d be fooled pretty easily.