For example, say 100 people each read or sang an identical passage into a recorder. It seems as though the frequency, amplitude, etc. that is captured wouldn’t be anywhere near specific/precise enough to be accurately represented when played back. I.e., what variables are at play that allow us to easily discern the 100 distinct voices when replayed? Thanks!
If we held our voices steady to sing a single note, they might sound more similar to one another than when we speak. Even more similar if we are of similar height and weight and have similar-length vocal cords.
But when we speak we use a certain intonation and have a specific way of saying each individual phoneme that makes up the words we’re saying. That’s why accents sound different. There are a lot of small variances that our ears are very sensitive to; some of that comes from our physical characteristics and some from the way that we pronounce and enunciate sounds.
This might not be exactly what you’re asking about though…
Good recording equipment is highly precise. A middle C, in the middle of the voice band, is about 260 cycles per second. Good recording gear can capture 44,100 samples per second at 24 bits per sample (roughly 16.8 million possible values), which works out to about 170 samples for each cycle of that note. Each voice makes slightly different waves, and that sampling resolution is quite sufficient to tell them apart.
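For a rough sanity check on those numbers, here’s a minimal Python sketch, assuming the 44.1 kHz / 24-bit figures above and taking middle C as roughly 261.6 Hz:

```python
# Back-of-the-envelope check of the numbers above; middle C is taken
# as roughly 261.6 Hz (the answer rounds it to 260).
sample_rate = 44_100                       # samples per second
bit_depth = 24                             # bits per sample

levels = 2 ** bit_depth                    # distinct amplitude values
samples_per_cycle = sample_rate / 261.6

print(f"{levels:,} amplitude levels per sample")                   # 16,777,216
print(f"~{samples_per_cycle:.0f} samples per cycle of middle C")   # ~169
```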
If we’re talking about a mechanical device like a tape recorder, then reproduction fidelity is mainly determined by one factor: tape-to-head speed, or how fast the tape moves across the recording/playback head. Professional machines used in recording studios move the tape at 15 inches per second (ips). Home reel-to-reel units typically roll at half that, or 7-1/2 ips. At the low end, cassette tapes (remember those?) turn at 1.875 ips. The reasoning is simple: the faster the tape moves, the more tape you have available to record every nuance of each second of sound.
Then we get to the newfangled stuff: digital recordings. And here we have a similar situation, even though nothing is actually moving. In digital it is called the sampling rate. It’s a bit hard to describe without using charts and diagrams and 8×10 glossy photographs with circles and arrows and a paragraph on the back of each one describing what each one is, in case you want to use each one against me in a court of law. BUT… the more often you take a sample of something, the closer you get to having the whole thing.
Human hearing frequency response ranges from sometimes as low as 20 hertz (cycles per second) to almost 20,000 hertz. The most-used digital sampling rates are 8,000 Hz for telephone-grade audio, and 44,100 Hz or 48,000 Hz for professional-quality equipment. That’s the Nyquist theorem in action: to capture a frequency faithfully, you have to sample at more than twice that frequency. If the human ear can hear up to 20,000 Hz and we are taking samples of the sound at more than twice that rate, I think we’re getting the whole picture.
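Here’s a small Python sketch of that “twice the rate” rule. The tone frequencies are made up for illustration; the point is that anything above half the sampling rate gets recorded as something it isn’t:

```python
import numpy as np

# Sketch of the Nyquist idea: a tone must be sampled at more than
# twice its frequency, or it "aliases" into a different, lower tone.
fs = 44_100                                    # sampling rate in Hz
t = np.arange(441) / fs                        # 10 ms worth of sample times

tone = np.sin(2 * np.pi * 30_000 * t)          # 30 kHz: above fs/2 = 22,050 Hz
alias = np.sin(2 * np.pi * (fs - 30_000) * t)  # a 14,100 Hz tone

# Sampled at 44.1 kHz, the 30 kHz tone produces exactly the same
# samples as an inverted 14.1 kHz tone, so the recording can't tell
# them apart; that's why converters filter out everything above fs/2.
print(np.allclose(tone, -alias))               # True
```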
Do you understand Fourier transforms? They are a big part of audio compression. Our voices are distinct to us because our hearing has finely tuned itself to the range of sounds we are capable of making for communication. The people who could hear and produce sounds in this range communicated, and were then able to procreate more successfully.
As far as the actual recording, you’d have to reference a specific codec, because each is slightly different. mp3 is a “lossy” algorithm that cuts sound outside the range of “normal” hearing. MPEG-4 also supports “lossless” audio; it uses more computing, but can code the data down by 50-60% using something very close to Fourier transforms with a correction factor.
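As a toy illustration of the lossy idea, here’s a Python sketch that transforms a signal with an FFT, throws away the weak frequency components, and transforms back. Real codecs use related transforms (mp3 uses an MDCT) plus psychoacoustic models; the signal and the 10% threshold here are invented for the demo:

```python
import numpy as np

fs = 8_000
t = np.arange(fs) / fs                        # one second of sample times
rng = np.random.default_rng(0)
signal = (np.sin(2 * np.pi * 440 * t)         # one strong tone
          + 0.05 * rng.standard_normal(fs))   # plus faint noise

spectrum = np.fft.rfft(signal)
keep = np.abs(spectrum) >= 0.1 * np.abs(spectrum).max()
spectrum[~keep] = 0                           # discard the weak bins

restored = np.fft.irfft(spectrum, n=fs)
print(f"kept {keep.sum()} of {keep.size} frequency bins")
print(f"worst error after the round trip: {np.abs(signal - restored).max():.3f}")
```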
The variables in play are frequency and amplitude. You can plot amplitude against frequency in 2D, and the math that turns a signal into that plot is called a “Fourier transform.”
The Fourier transform of a signal is what is displayed on a mixing panel’s spectrum display, where you see the constantly shifting spikes.
Long story short, the human voice contains many, many different frequencies at many different amplitudes.
If you’ve ever heard a pure sine wave, it’s a very simple sound. It’s a tone, like you’d get from a button press on an electronic device.
A constant sine wave at constant volume would be a single dot on that Fourier transform. But if you’ve ever seen the spectrum of a human voice or the sound of a motor or running water or anything else, it’s not a dot; it’s a whole jagged mountain range of different frequencies superimposed on one another.
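Here’s a quick Python sketch of that “dot vs. mountain range” picture, using a synthetic harmonic stack to stand in for a voice (the 120 Hz fundamental and the random harmonic levels are made up):

```python
import numpy as np

fs = 8_000
t = np.arange(fs) / fs                 # one second of sample times
rng = np.random.default_rng(1)

pure = np.sin(2 * np.pi * 440 * t)     # a lone sine wave
# crude stand-in for a voice: a 120 Hz fundamental plus 18 weaker
# harmonics at random amplitudes
voice = sum(rng.uniform(0.1, 1.0) * np.sin(2 * np.pi * 120 * k * t)
            for k in range(1, 20))

for name, sig in (("pure sine", pure), ("voice-like", voice)):
    mag = np.abs(np.fft.rfft(sig))
    spikes = int((mag > 0.1 * mag.max()).sum())
    print(f"{name}: {spikes} significant spike(s) in the spectrum")
```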
That’s where the variation comes into play. As your vocal cords vibrate, you’ve got air resonating in your chest and throat and sinuses. Those vibrations are being picked up and transformed by your bones and your fat and muscle. Basically, in the sense that a guitar string is a simple instrument but a guitar is a complex one, the human body is a very complex instrument, and each one has a unique sound because each one has a unique physical shape and pattern of stiffness.
I’m about to cut an assload of corners with this explanation.
Recording devices work by translating sound waves into an electronic signal that can be saved and replayed through a speaker system.
Higher-quality recording devices can take in that sound information more precisely and translate it into a higher-fidelity electronic signal.
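As a minimal sketch of what “more precisely” means in practice, here’s a toy quantizer in Python: the device measures the wave over and over and rounds each measurement to the nearest step its bit depth allows, and more bits mean smaller steps (the test tone and bit depths are illustrative):

```python
import numpy as np

def quantize(signal, bits):
    """Round each sample to the nearest of 2**bits evenly spaced levels."""
    step = 2.0 / 2 ** bits             # signal assumed to span [-1, 1]
    return np.round(signal / step) * step

t = np.arange(100) / 44_100            # ~2.3 ms of sample times
wave = np.sin(2 * np.pi * 1_000 * t)   # a 1 kHz test tone

for bits in (8, 16, 24):
    err = np.abs(wave - quantize(wave, bits)).max()
    print(f"{bits}-bit capture: worst rounding error ~ {err:.1e}")
```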
Recording and playing back sounds accurately isn’t the amazing part of this by a long shot. The human ear and the brain’s ability to process sounds are pretty impressive. I believe I read somewhere that an average human outperforms the best computers and algorithms at distinguishing and separating sounds mixed together; examples being hearing a single speaker in a crowded environment, or individual instruments in an orchestra. It’s not fully explained, and it seems to exceed what theory predicts should be possible.