Eli5: How is it that our voices are distinct enough to be captured by a recording instrument and replayed accurately


For example, say 100 people each read / sang an identical passage into a recorder. It seems as though the frequency, amplitude etc that is captured wouldn’t be anywhere near specific/precise enough to be accurately represented when played back. I.e. what variables are at play that allow us to easily discern the 100 distinct voices when replayed? Thanks!

Depends on the quality of the recording device and its noise cancellation, and the distance from the mic/recording device to the sound source. You also have to understand that when you speak, you hear a deeper version of your own voice because the sound reaches your ears partly through the bone of your skull.

Good recording equipment is highly precise. A middle C, in the middle of the voice band, is about 260 cycles per second. Good recording gear can capture 44,100 samples per second at 24 bits per sample (16,777,216 possible amplitude levels), which works out to roughly 170 samples for every cycle of that note. Each voice makes slightly different waves, and that sampling resolution is quite sufficient to tell them apart.
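The arithmetic in the comment above is easy to check yourself. This is just a back-of-envelope sketch (using the standard tuning of middle C, ~261.63 Hz, rather than the rounded 260):

```python
# How finely does 44.1 kHz / 24-bit audio resolve one cycle of middle C?

SAMPLE_RATE = 44_100          # samples per second (CD-quality rate)
BIT_DEPTH = 24                # bits per sample
MIDDLE_C = 261.63             # Hz (the comment above rounds this to 260)

samples_per_cycle = SAMPLE_RATE / MIDDLE_C
amplitude_levels = 2 ** BIT_DEPTH

print(f"samples per cycle of middle C: {samples_per_cycle:.1f}")   # ~168.6
print(f"distinct amplitude levels:     {amplitude_levels:,}")      # 16,777,216
```

So every single wiggle of a middle-C wave gets measured about 170 times, each time against nearly 17 million possible levels.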

If we held our voices steady to sing a single note they might sound more similar to one another than when we speak. Even more similar if we are a similar height and weight and have similar length vocal cords.

But when we speak we use a certain intonation and have a specific way of saying each individual phoneme that makes up the words we’re saying. That’s why accents sound different. There are a lot of small variances that our ears are very sensitive to – some come from our physical characteristics and some from the way we pronounce and enunciate sounds.
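One way to see the "same note, different voice" idea numerically: two hypothetical voices singing the same fundamental pitch but with different strengths in their overtones (harmonics) produce genuinely different waveforms, which is exactly what a recorder captures. A minimal sketch, with made-up harmonic amplitudes:

```python
import math

SAMPLE_RATE = 8_000    # Hz, enough for this toy example
FUNDAMENTAL = 220.0    # Hz, the note both "voices" sing

def voice(harmonic_amplitudes, n_samples=80):
    """Sum of harmonics of the fundamental, one amplitude per harmonic."""
    out = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        s = sum(a * math.sin(2 * math.pi * (k + 1) * FUNDAMENTAL * t)
                for k, a in enumerate(harmonic_amplitudes))
        out.append(s)
    return out

voice_a = voice([1.0, 0.5, 0.2])   # strong low harmonics
voice_b = voice([1.0, 0.1, 0.6])   # same pitch, different "color"

# Same note, yet the sampled waveforms differ sample by sample:
diff = max(abs(a - b) for a, b in zip(voice_a, voice_b))
print(f"max sample difference: {diff:.3f}")
```

The pitch is identical; the harmonic mix (what musicians call timbre) is what makes the two waveforms, and the two voices, distinguishable.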

This might not be exactly what you’re asking about though…

Do you understand Fourier Transforms? They are a big part of audio compression. Our voices are distinct to us because our hearing has finely tuned itself to the range of sounds we are capable of making for communication. The people who could hear and produce sounds in this range communicated better and so were better able to procreate.

As far as the actual recording, you’d have to reference a specific codec, because each is slightly different. mp3 is a “lossy” algorithm that throws away sound the ear is unlikely to notice, including frequencies outside the range of “normal” hearing. Lossless formats such as FLAC use more computing but can shrink the data to 50-60% of its original size without discarding anything, using something very close to Fourier Transforms with a correction factor.
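The Fourier idea is simpler than it sounds: a codec describes a chunk of sound as "how much of each frequency is present" instead of storing every raw sample. Here is a toy discrete Fourier transform (pure Python, purely illustrative; real codecs use fast transforms plus psychoacoustic models) that samples a 1 kHz tone and finds its frequency again:

```python
import cmath
import math

N = 64                 # samples in the analysis window
SAMPLE_RATE = 8_000    # Hz -> bin k corresponds to k * 8000/64 = k*125 Hz
TONE = 1_000.0         # Hz, which lands exactly on bin 8

signal = [math.sin(2 * math.pi * TONE * n / SAMPLE_RATE) for n in range(N)]

def dft(x):
    """Textbook O(n^2) discrete Fourier transform."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

spectrum = dft(signal)
peak_bin = max(range(N // 2), key=lambda k: abs(spectrum[k]))
print(f"peak at bin {peak_bin} = {peak_bin * SAMPLE_RATE / N} Hz")  # 1000.0 Hz
```

A voice is just a more complicated version of this: many frequency bins active at once, in a pattern unique to that speaker.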

If we’re talking about a mechanical device like a tape recorder, then reproduction fidelity is mainly determined by one factor: tape-to-head speed, or how fast the tape moves across the recording/playback head. Professional machines used in recording studios move the tape at 15 inches per second (ips). Home reel-to-reel units typically roll at half that, or 7-1/2 ips. At the low end, cassette tapes (remember those?) turn at 1.875 ips. The reasoning is that the faster the tape moves, the more tape you have to record every nuance of each second of sound.

Then we get to the newfangled stuff: digital recordings. And here we have a similar situation, even though nothing is actually moving. In digital it is called the sampling rate. It’s a bit hard to describe without using charts and diagrams and 8×10 glossy photographs with circles and arrows and a paragraph on the back of each one describing what each one is in case you want to use each one against me in a court of law. BUT… the more often you take a sample of something, the closer you get to having the whole thing.

Human hearing frequency response ranges from sometimes as low as 20 hertz (cycles per second) to almost 20,000 hertz. The most-used digital sampling rates are 8,000 Hz (telephone quality), 44,100 Hz (CD quality), and 48,000 Hz for professional equipment. If the human ear can hear up to 20,000 Hz and we are taking a sample of the sound at a bit more than TWICE that rate, we’re getting the whole picture – that twice-the-highest-frequency requirement is the Nyquist criterion.
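You can actually watch the "twice the highest frequency" rule fail when it’s violated. A sketch: a 30 kHz tone sampled at 44.1 kHz (so above the 22,050 Hz Nyquist limit) produces exactly the same samples as an inverted 14.1 kHz tone – the recorder literally cannot tell them apart, which is why recorders filter out everything above half the sample rate first:

```python
import math

SAMPLE_RATE = 44_100
HIGH_TONE = 30_000                 # Hz, above the Nyquist limit (22,050 Hz)
ALIAS = SAMPLE_RATE - HIGH_TONE    # 14,100 Hz, its "alias"

high = [math.sin(2 * math.pi * HIGH_TONE * n / SAMPLE_RATE)
        for n in range(50)]
alias = [-math.sin(2 * math.pi * ALIAS * n / SAMPLE_RATE)
         for n in range(50)]

# The two sampled tones are identical to within floating-point error:
max_err = max(abs(a - b) for a, b in zip(high, alias))
print(f"max difference between the two sampled tones: {max_err:.2e}")
```

Below the Nyquist limit this ambiguity disappears, which is why sampling at 44,100 Hz really does capture everything a 20,000 Hz-limited ear can hear.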