I see a lot of commenters have already covered the equipment side of things, but I’d like to bring in a bioacoustics perspective.
The key aspect of sound that we use to recognize voices is the frequency spectrum (the range and relative amplitudes of the frequencies that compose the sound).
Whenever sound travels or interacts with any objects it is **filtered** (the frequency spectrum changes). We are tuned to these changes in frequency — e.g. we learn what voice-behind-a-door sounds like, and can recognize it without needing visual cues. (I hope that’s a good example, let me know if that doesn’t make sense)
So, when you record and play back a voice, you are passing the sound through multiple filters – the air, the microphone, the audio processing that digitizes the sound, the speaker, and then more air – before it reaches your ears. The frequency spectrum changes, your brain detects those changes, and you recognize it as a recorded voice.
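If you want to see what "each stage is a filter" means in practice, here's a quick toy sketch in Python. The "microphone" and "speaker" are just made-up band-pass filters, not real device specs; the point is only how the frequency spectrum changes along the chain.

```python
# Toy sketch: every stage between a mouth and your ear acts like a filter.
# The "microphone" and "speaker" here are invented band-pass filters, not
# real specs; the point is how the spectrum changes along the chain.
import numpy as np
from scipy import signal

fs = 44100                       # sample rate, Hz
t = np.arange(0, 1.0, 1 / fs)

# A fake "voice": 200 Hz fundamental plus a few overtones, quieter as they go up
parts = [200, 400, 800, 3000, 8000]
voice = sum(np.sin(2 * np.pi * f * t) / i for i, f in enumerate(parts, start=1))

def stage(x, low_hz, high_hz):
    """One link in the chain (mic, speaker, ...), modeled as a band-pass filter."""
    sos = signal.butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return signal.sosfilt(sos, x)

heard_live = voice
heard_played_back = stage(stage(voice, 80, 12000), 150, 5000)  # "mic" then "speaker"

freqs = np.fft.rfftfreq(len(t), 1 / fs)
for name, x in [("live", heard_live), ("played back", heard_played_back)]:
    spectrum = np.abs(np.fft.rfft(x))
    levels = {f: round(float(spectrum[np.argmin(np.abs(freqs - f))])) for f in parts}
    print(name, levels)
```

The played-back version still has its low components, but the 8 kHz overtone is way down, and that kind of spectral change is exactly what your brain flags as "recorded."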
There are some other considerations too. People are mostly talking about televisions, but telephones/announcements/hi-fi systems/cinemas are part of our lives too, all with distinctive qualities. That’s part of the answer as well – think about the difference between hearing a voicemail from your sister on speakerphone vs. playing a video of your sister speaking that you took on your phone. Same equipment and even the same speaker, right? But over a lifetime of living in the world, you build up an understanding of what different audio means, based on lots of training about frequencies, ambient sound, echo/reverb, etc.
Another thing that hasn’t come up yet is echo and directionality. Humans aren’t bats, but we still have directional hearing. The sound wave of speech is easiest to pick up on the direct route from mouth to ear, but it’s still shaped by our environment, and we subconsciously pick up details about how the sound fills three-dimensional space. Even the best speaker in the world can only blast sound from one direction, and while that sound bounces around the room too, we notice what information is present or missing as it does. Wait a minute, those people were recorded in a studio with the reverb stripped out, and then that engineered, packaged, detectably artificial sound bounced around my familiar living room – that just doesn’t sound the same as a voice with its natural reverb preserved bouncing around the same room. Also, all of the sound is coming from one spot and then echoing from that point, whereas if I were really outside a coffee shop full of people chatting, with a protagonist’s speech addressed to me, some sounds would be happening around and behind me, the open air would change how everything sounds, and none of it would be so obviously bouncing off my own walls.
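For a rough sense of the directionality part: your brain partly locates sounds by the tiny arrival-time difference between your two ears. Here's a crude back-of-the-envelope sketch using the simple d·sin(θ)/c model with a made-up ear spacing, so treat the numbers as illustrative only.

```python
# Back-of-the-envelope interaural time difference (ITD): how much earlier a
# sound reaches the nearer ear. Simplified d*sin(theta)/c model with a rough
# ~0.18 m ear spacing; real heads diffract sound, so these are ballpark numbers.
import math

EAR_SPACING_M = 0.18      # rough distance between the ears (assumed)
SPEED_OF_SOUND = 343.0    # m/s in air

def itd_microseconds(angle_deg):
    """ITD for a source at angle_deg from straight ahead (0 = directly in front)."""
    return EAR_SPACING_M * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND * 1e6

# A real cafe: voices scattered around you, each arriving with its own ITD.
for angle in [0, 30, 90, 150]:
    print(f"source at {angle:>3} degrees -> ITD ~ {itd_microseconds(angle):6.0f} microseconds")

# A single TV speaker: every voice in the mix comes from the same spot,
# so they all share one ITD (plus whatever the room adds on top).
print("TV at 20 degrees -> every voice shares ITD ~",
      round(itd_microseconds(20)), "microseconds")
```

Real heads, ears, and rooms complicate this enormously, but the point stands: a single loudspeaker collapses all those different directional cues into one.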
We pick up a lot more than we know. But we can still be fooled, of course.
A lot of people are discussing how we can tell because of compressed audio or low-quality equipment, but there’s also a distinct difference even when you’re listening to a high-quality podcast taped in a recording studio. Here’s an explanation of how we can tell, even with high-quality recording and playback equipment where background noise and other factors are eliminated:
You ever had someone whisper into your ear? You know how when people use certain sounds, like a T sound or a P sound, those feel much louder than the rest? These sounds are called “plosives,” and they happen when you block off your airway for a second and release that stored-up energy all at once, rather than steadily like you would with an A sound or an E sound. They’re usually called something like hard consonants because of this.
In normal speech, these are generally close enough to the same volume that it doesn’t really register as different. When they’re spoken directly into your ear, though, you can think of it as the extra bit of air that comes out making the sound feel a bit louder and more intense. In this sense, a microphone is basically an ear; even with a pop filter (which is designed to stop that extra air from entering the microphone), some of that plosive energy is still picked up. This effect, on top of the extra sharpness and clarity you get from speaking so close to your “listener” (the microphone), means that the sound is picked up in a completely different way than if it were spoken in person. When the sound is reproduced, all of this extra information gets reproduced as well, so even if you’re listening through speakers from a distance, you’re still hearing as though your ear were where the microphone was when the words were spoken. That’s the information that was fed to the microphone, so that’s what the speakers are told to reproduce.
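If it helps to see that idea with numbers, here's a toy Python sketch: the "plosive" is just a short low-frequency burst stacked on a steady tone, and the pop filter is approximated as an 80 Hz low-cut. Every value here is invented for illustration, not measured speech data.

```python
# Toy plosive: a short burst of very low-frequency "air" on top of a steady vowel.
# A low-cut (high-pass) filter stands in for a pop filter; all numbers are
# illustrative, not measurements of real speech.
import numpy as np
from scipy import signal

fs = 44100
t = np.arange(0, 0.5, 1 / fs)

vowel = 0.3 * np.sin(2 * np.pi * 220 * t)            # steady "aaa" sound
pop = np.zeros_like(t)
burst = slice(0, int(0.02 * fs))                     # 20 ms puff of air
pop[burst] = np.sin(2 * np.pi * 40 * t[burst])       # energy mostly below 100 Hz

close_miked = vowel + pop

sos = signal.butter(2, 80, btype="highpass", fs=fs, output="sos")  # 80 Hz low-cut
with_low_cut = signal.sosfilt(sos, close_miked)

print("peak level, raw close mic:", round(float(np.max(np.abs(close_miked))), 2))
print("peak level, with low-cut :", round(float(np.max(np.abs(with_low_cut))), 2))
```

Even after the low-cut, some of the burst survives, which is part of why close-miked speech never quite sounds like speech heard across a room.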
This process is called “mediation”: the information we use to perceive the world goes through an extra process before it reaches our primary senses, and then our brain interprets it. There are also a bunch of other ways microphones add extra mediation compared to the human ear alone, beyond the ones I already described, but I could sit here all day talking about the nuances of different conditions, and that’d veer further out of ELI5 territory than I already have.
Source: communications degree incl. experience in sound studies
TL;DR: We can tell when something is a digital reproduction because the way we are listening (our position relative to the “speaker,” the volume we’re hearing the words at compared to the intensity they were spoken at, etc.) doesn’t line up with the way the sound was originally captured.
Missing overtones. Every note you play (or sing) has a number of higher notes that ring along. If you listen carefully when plucking a guitar string you can even hear some! It's the same with other sounds (just not as musical and clean). When a recording is made you will always cut off some overtones for technical reasons, making the voice flatter. This is also why we're kinda bad at telling people apart on the phone; those overtones carry a lot of the uniqueness of our voices. But even with perfect recording equipment, the environment of the person speaking is different from your own: if you're wearing headphones, the voice doesn't interact with your own environment the way it's supposed to, speakers aren't at the same height as a mouth, and other things like that.
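To put rough numbers on the phone example: the 300–3400 Hz limits below are the classic narrow-band telephone figures, and the 120 Hz pitch is just a typical-ish speaking fundamental used for illustration.

```python
# Which overtones survive a classic narrow-band telephone channel (~300-3400 Hz)?
# Toy numbers: a 120 Hz fundamental and its harmonic series; the band limits
# are the historical telephony figures, used here purely as an illustration.
FUNDAMENTAL_HZ = 120
PHONE_BAND = (300, 3400)

harmonics = [FUNDAMENTAL_HZ * n for n in range(1, 41)]     # up to 4800 Hz
kept = [f for f in harmonics if PHONE_BAND[0] <= f <= PHONE_BAND[1]]
lost = [f for f in harmonics if f not in kept]

print(f"harmonics kept by the phone band: {kept[:5]} ... up to {kept[-1]} Hz")
print(f"lost below the band: {lost[:2]} Hz (including the fundamental itself)")
print(f"lost above the band: {len([f for f in lost if f > PHONE_BAND[1]])} overtones")
```

Quite a lot of the harmonic series gets thrown away at both ends, which is why phone voices sound thin and harder to tell apart.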
Dunno why nobody has mentioned the reverb yet. Your brain is also surprisingly good at picking up when the reverb of the space the speaker is talking from doesn't match your own environment. Obviously this only works with significant differences, but if you're in a small room listening to someone broadcast through a speaker in that room, and they've been recorded outside, it's not going to sound like they're in the room with you at all.
Large hall vs small room, definitely noticeable.
Perfectly clear sound recording booth with no noticeable reverb at all vs small room, also very noticeable.
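Here's a toy way to see that small-room vs. large-hall difference: convolve the same dry sound with two synthetic impulse responses (exponentially decaying noise). The decay times are invented ballpark figures, not measurements of real rooms.

```python
# Toy reverb: the same dry sound convolved with two made-up room impulse
# responses (exponentially decaying noise). Decay times are invented figures.
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
dry = np.zeros(fs)                        # one second of silence...
dry[:200] = rng.standard_normal(200)      # ...with a short dry burst at the start

def room_ir(decay_seconds):
    """Synthetic impulse response: noise that dies away exponentially."""
    n = int(fs * decay_seconds * 5)
    t = np.arange(n) / fs
    return rng.standard_normal(n) * np.exp(-t / decay_seconds)

for label, decay in [("small room", 0.05), ("large hall", 0.8)]:
    wet = np.convolve(dry, room_ir(decay))
    envelope = np.abs(wet)
    last_audible = np.flatnonzero(envelope > 0.01 * envelope.max()).max() / fs
    print(f"{label:>10}: audible tail lasts roughly {last_audible:.2f} s")
```

Your brain does a far subtler version of that comparison automatically, which is why studio-dry dialogue played back in your living room never quite sounds like someone actually standing there.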
Humans have evolved to pay a LOT of attention to what other humans are saying and HOW they are saying it.
If someone is sick? You want to hear that and move away from them.
If someone is lying about wanting to stab you? You want to hear that and not be stabbed.
If someone is getting mad at you? You want to hear that now so they don’t stab you later…