Why are voices so unique and distinguishable? Even a single word can let you recognize someone.




When you hear someone speak, vowel sounds in particular, the timbre (resonant character) is determined by the physical dimensions of the speaker’s body, especially the throat, mouth and nose. Your brain learns to decode this to learn about the speaker, just as reverberations give you clues about the room you’re in.

Humans evolved as social animals, so we are typically very good at distinguishing human features, including faces and voices. Physically, our faces and voices are as similar to each other as the faces and voices of sheep, but we can’t distinguish sheep as easily as we can distinguish humans. It is the power of our brains.

Same with faces, and lots of other things. Your brain is amazingly capable of distinguishing the smallest differences in nearly everything, but it only bothers to do so for things it considers important. Recognising voices is clearly quite important, so we do. Recognising the intonation and tone of a word is very important if you’re a native Mandarin speaker, so they can do it, while a native English speaker would find it difficult, because they don’t need to. And can you tell the difference between the moos of two different cows? Well, you’d better believe that Barry the cow can tell the difference between Maisy’s and Suzy’s moos!

Our ears are very fancy. They are not like a single microphone, but rather a set of thousands of microphones called [“hair cells”](https://iiif.wellcomecollection.org/image/B0000114.jpg/full/880%2C/0/default.jpg), which allow us to distinguish not just a single sound but the whole “bouquet” of sound frequencies that reach our ear.

Someone talking to us might have one dominant sound in their voice, but it will always be a combination of many different sounds and reflections. That’s also how, even with our eyes closed, we can tell whether a person is standing in front of us or behind us.

You can tell the difference between different instruments. Just by playing one note, you can tell whether that instrument is a piano or a guitar. This is because the sounds aren’t just made of the fundamental pitches of those notes, but also other pitches and tones in different ratios that are unique to each instrument. Human voice boxes work the same way; every voice box is different, and so if you’re familiar with the sound of a voice you can identify it when you hear it.
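
To make the "different ratios" idea concrete, here is a rough Python sketch. It builds two made-up tones on the same note and measures how strong each multiple of the fundamental is. The amplitude lists and the names `tone`, `strength_at` and `overtone_profile` are invented for illustration, and real instruments and voices are far messier than a handful of sine waves.

```python
import math

SAMPLE_RATE = 44100   # samples per second
N_SAMPLES = 4410      # 0.1 s, which holds a whole number of 440 Hz cycles

def tone(fundamental_hz, harmonic_amplitudes):
    """Sum of sine waves: index 0 is the fundamental, index 1 the first overtone, etc."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * fundamental_hz * i / SAMPLE_RATE)
            for k, a in enumerate(harmonic_amplitudes))
        for i in range(N_SAMPLES)
    ]

def strength_at(signal, freq_hz):
    """Amplitude of the component at freq_hz (a single-frequency Fourier projection)."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / len(signal)

def overtone_profile(signal, fundamental_hz, count=4):
    """Strength of the fundamental and the next few multiples of it."""
    return [round(strength_at(signal, fundamental_hz * (k + 1)), 3)
            for k in range(count)]

# Same note (A, 440 Hz), two hypothetical "instruments" with different overtone mixes.
mellow = tone(440, [1.0, 0.2, 0.05, 0.0])
bright = tone(440, [1.0, 0.7, 0.5, 0.3])

print(overtone_profile(mellow, 440))
print(overtone_profile(bright, 440))
```

Both tones have the same pitch, but the two profiles come out different, and that difference in profile is roughly what your ear reads as "piano" versus "guitar".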

There are a lot of characteristics to a voice.

* The tone/timbre of the voice
* The rhythm/cadence at which someone speaks
* Their prosody, i.e. how their inflexion changes throughout a sentence (or even a word)
* Their accent, particularly how they pronounce certain vowel sounds
* Their dialect, i.e. what kinds of words they use

That’s a lot of different features for identifying a voice. It’s not just a sound.

Example: Look how well Tom Hiddleston impersonates Graham Norton: [https://www.youtube.com/watch?v=zzqWPDnYEik](https://www.youtube.com/watch?v=zzqWPDnYEik) He’s paying attention to each one of the points I mentioned above (except dialect, he used fairly standard language).

Has anyone seen any “guess which voice goes with which face” research? With factors such as gender, age, country of origin, race/ethnicity, fitness/obesity, maybe height.

I’ve always felt like I could identify obesity by someone’s voice.

For me, they are not. I get voices mixed up all the time. I don’t know who’s singing a song, many voice actors sound the same to me, and I can’t always tell what family member I’m talking to on the phone. So explain that to me…why can’t I tell the difference?

**TL;DR**: we all have different bodies and backgrounds, and that affects the timbre of our voices. It has also been evolutionarily beneficial for us to be able to distinguish between voices and voice characteristics.

**Full Response**

All of the responses so far discuss how the ears receive sounds and how the brain decodes the sound, but no one has given a deeper explanation as to why voices are so unique to begin with. This is an extensive topic so I won’t be able to cover everything, but I’ll provide enough of an explanation to get you going.

All sounds that we hear are really just acoustic waves. When you breathe out and make any “sound” from your mouth, you are pushing a certain amount of the air in front of you with a certain amount of energy which causes the air to move at a certain speed. Your body causes this energy to change rapidly over time, going up and down as you speak, and that causes the acoustic wave moving through the air to change rapidly as well. If I put a little spring in front of your mouth and an electromagnet behind the spring to record signals into a computer about how the spring moves when it is struck by the acoustic wave, I can plot this movement on a graph that shows the position of the spring at any given time. This generally looks like some sort of wave that goes up and down.

A “pure” wave can look like a smooth sine wave, flowing evenly up and down over time at a particular frequency (i.e., the number of ups and downs over a period of time). But in reality, most acoustic waves aren’t so smooth. Some get tall very fast — this is often called a fast “attack”. Some stay loud for a long time and slowly get soft — this is called a slow “decay”. Some might have a fast or slow decay, but continue making a sound for a long period of time — this is called “sustain”. And when the sound suddenly falls off, it can happen abruptly or slowly — this is called “release”.
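
If it helps, here is a minimal Python sketch of those four stages as a loudness curve over time. It uses the textbook synthesizer-style shape (straight-line ramps), which is a simplification of real sounds, and every timing value and the name `adsr_gain` are made up for illustration.

```python
def adsr_gain(t, attack=0.05, decay=0.10, sustain_level=0.6,
              sustain_time=0.50, release=0.20):
    """Return a loudness multiplier between 0.0 and 1.0 at time t (seconds)."""
    if t < 0:
        return 0.0
    if t < attack:                      # attack: ramp up from silence to full volume
        return t / attack
    t -= attack
    if t < decay:                       # decay: fall from full volume to the sustain level
        return 1.0 - (1.0 - sustain_level) * (t / decay)
    t -= decay
    if t < sustain_time:                # sustain: hold steady while the sound continues
        return sustain_level
    t -= sustain_time
    if t < release:                     # release: fade from the sustain level to silence
        return sustain_level * (1.0 - t / release)
    return 0.0                          # the sound is over

# Print the curve every 50 ms to see the shape.
for step in range(0, 19):
    t = step * 0.05
    print(f"{t:.2f}s  {adsr_gain(t):.2f}")
```

A plucked string would have a tiny `attack` and no real sustain, while a sung vowel would have a slower attack and a long sustain. Multiplying a wave by this curve gives it that shape.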

In addition to these aspects, there are other subtle nuances that can change the acoustic wave, called “overtones”. These are very important to your question. If I sing an A note, which is 440 Hertz (a measure of frequency), the fundamental frequency of my sound is 440Hz. But if I make the sound more nasal, certain overtones will be produced. These overtones occur at higher frequencies than the fundamental frequency. On a graph of this signal, the overtones can look like little bumps along the fundamental frequency, but what is really happening is that the overtones are being added to the fundamental frequency tone to produce a complex signal.
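
Here is a small Python sketch of that "adding" step, just to show there is nothing mysterious about it. It assumes the overtones sit at whole-number multiples of the fundamental (harmonics), which is only approximately true for a real voice, and the amplitudes for the "nasal" version are invented rather than measured.

```python
import math

SAMPLE_RATE = 44100  # samples per second

def complex_tone(fundamental_hz, overtone_amplitudes, n_samples=441):
    """Fundamental sine wave plus overtones at 2x, 3x, 4x... the fundamental."""
    samples = []
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        # Start with the fundamental at full strength.
        value = math.sin(2 * math.pi * fundamental_hz * t)
        # Add each overtone on top; these create the "little bumps" on the graph.
        for multiple, amplitude in enumerate(overtone_amplitudes, start=2):
            value += amplitude * math.sin(2 * math.pi * multiple * fundamental_hz * t)
        samples.append(value)
    return samples

pure = complex_tone(440, [])                 # smooth sine wave, no overtones
nasal = complex_tone(440, [0.6, 0.4, 0.3])   # hypothetical overtone mix

# Same pitch, different wave shape: compare a few samples side by side.
for i in range(0, 50, 10):
    print(f"{pure[i]:+.3f}  {nasal[i]:+.3f}")
```

Both lists describe a 440Hz A note, but the second column wobbles around the first, and that wobble is the overtones.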

With that in mind, it’s a bit easier to understand this excerpt from the Wikipedia article on overtone:

> Most oscillators, from a plucked guitar string to a flute that is blown, will naturally vibrate at a series of distinct frequencies known as normal modes. The lowest normal mode frequency is known as the fundamental frequency, while the higher frequencies are called overtones. Often, when an oscillator is excited — for example, by plucking a guitar string — it will oscillate at several of its modal frequencies at the same time. So when a note is played, this gives the sensation of hearing other frequencies (overtones) above the lowest frequency (the fundamental).

> Timbre is the quality that gives the listener the ability to distinguish between the sound of different instruments. The timbre of an instrument is determined by which overtones it emphasizes. That is to say, the relative volumes of these overtones to each other determines the specific “flavor”, “color” or “tone” of sound of that family of instruments. The intensity of each of these overtones is rarely constant for the duration of a note. Over time, different overtones may decay at different rates, causing the relative intensity of each overtone to rise or fall independent of the overall volume of the sound. A carefully trained ear can hear these changes even in a single note. This is why the timbre of a note may be perceived differently when played staccato or legato.

In addition to the above, complex waves have something called an “envelope”. If I were to draw a smoothed out line from peak to peak over a complex sound wave, the envelope is essentially that smoothed out line. Our brain often picks up that envelope as well. The crazy thing is, our brains can actually hear and decode the envelope of an ultrasonic sound. In other words, we can hear the envelope of a signal that has a fundamental frequency that is too high for the human ear to hear. One fascinating corollary and consequence of this is that ultrasonic sounds have a smaller cone of dispersal — they can be aimed more precisely at, say, a single person in a crowd. But if I modulate the ultrasonic sound to have an envelope in a frequency that your ear can hear, then if I aim that ultrasonic signal at you while you’re standing in a crowd, only you will hear that sound. Creepy stuff, right? I actually wrote a patent on this technology and casinos are working to implement it into their games (which can be used both positively and with devious effect).
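
For the curious, here is a toy Python sketch of the envelope idea: a 40 kHz carrier (too high to hear) whose loudness is wobbled at 500 Hz (easy to hear), followed by a crude "rectify and smooth" step that pulls the envelope back out. All of the numbers and function names are assumptions chosen for illustration. This is not how any real directional speaker product is implemented, and the smoothing step is only a stand-in for whatever actually does the decoding.

```python
import math

SAMPLE_RATE = 400_000   # samples per second, high enough to represent 40 kHz
CARRIER_HZ = 40_000     # ultrasonic carrier, above the roughly 20 kHz hearing limit
MESSAGE_HZ = 500        # audible frequency hidden in the envelope

def modulated_signal(n_samples):
    """Ultrasonic sine wave whose amplitude rises and falls at MESSAGE_HZ."""
    out = []
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        envelope = 1.0 + 0.5 * math.sin(2 * math.pi * MESSAGE_HZ * t)
        out.append(envelope * math.sin(2 * math.pi * CARRIER_HZ * t))
    return out

def detect_envelope(signal, window):
    """Flip the negative halves positive, then average over a sliding window."""
    rectified = [abs(x) for x in signal]
    return [sum(rectified[i:i + window]) / window
            for i in range(len(rectified) - window + 1)]

signal = modulated_signal(1600)                  # two full cycles of the 500 Hz envelope
recovered = detect_envelope(signal, window=10)   # 10 samples = one carrier cycle

print(f"envelope high point: {max(recovered):.2f}")
print(f"envelope low point:  {min(recovered):.2f}")
```

The recovered curve swings between a high and a low value about 500 times per second, even though the raw signal itself contains nothing anywhere near 500 Hz.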

But to finally answer your question, voices are so unique and distinguishable because each person’s body is unique. Some people might have a larger soft palate than others, some may have a uniquely shaped nasal cavity, some may have a stronger diaphragm, some may have lungs with a larger air capacity. And those are just physical aspects. Social aspects can result in tons of unique pronunciations and habits in how a person uses their unique physical system. Various other things can come into play as well. For example, cortisol levels, particular diseases such as the common cold and COVID-19, and tons of other biological factors that result from external sources can affect the voice. If you’re interested in that last bit, look at the research of Rita Singh.

In conclusion, there are tons of different aspects that can change from person to person, and these aspects all affect the timbre, overtones, and many other factors that can be picked up in an acoustic wave. That is at least a good starting point for why voices are so unique and distinguishable.

As for why our brains are able to perceive and decode those different sounds so precisely, that’s an equally long discussion that deals with evolutionary psychology, auditory processing pathways in the brain, social psychology, neuroscience, and much, much more. The high-level answer is because being able to distinguish between voices and voice characteristics has been beneficial to the survival of our species. But the detailed answer as to why and how we do that is far more than any one response can cover.

This is partly confirmation bias. You know a select set of voices that you associate with particular people. It’s very possible you would mix them up with others, but you are never exposed to a voice “match”. You are also an ‘expert’ on those voices, so unless the match is very close you are unlikely to make a mistake. It’s sort of like never confusing a twin with their sibling because you only ever saw one of them and didn’t know the other existed.

They aren’t always. My brother and I sound so similar on the phone that our parents sometimes need a moment to figure out which of us they’re talking to, and we aren’t twins or anything.

Voices are a combination of the sounds your body makes just because of its physical properties, and also how you learn to use that physical instrument. Your accent, cadence, slurring and other features of your voice are all properties that you learned growing up. This is why even identical twins can have pretty different sounding voices.

Voices tend to seem unique and distinguishable because we tend to hear voices from only a small number of people over the course of our lives. For example, for most of us Arnold Schwarzenegger has a super identifiable accent and voice, yet his voice is a result of his learning English as a second language and in fact a lot of people from Austria where he came from have similar accents in English. If you were around these people all the time, you might not think Arnold has such a unique voice.

It’s similar to why we can read facial expressions so well, and why it’s hard to make them look perfect even in modern video games. Our brains are really, really good at distinguishing vocal sounds and facial expressions.

They just aren’t. Sound is one of those things we think humans are good at, but we’re not. Impressionists can fool us, and we’re regularly confused by parents and children on the phone.

It’s purely in our heads that we are great at hearing voices. It’s our eyes that do much of the heavy lifting. We can be easily fooled by audio illusions and are tricked by binaural audio. Far and bar are interchangeable based on mouth movements. Go look at audio illusions.