How does voice over wire or over internet work? How can we hear and understand another person in real time over such a distance via a digital transmission?


Honest question I cannot wrap my mind around. How the hell does this work?

In: Physics

From end to end: speaking into microphone, modulation into digital, electronic signals (1s and 0s), transmitter sending 1s and 0s, picked up by receiver, receiver “translates” 1s and 0s back into identifiable audio, audio played by speaker.

Voice is sound. Sound is a moving wave of pressure that varies over time. That means one can approximately represent a sound digitally by measuring its intensity at regular time intervals.

So now you have sound data, but represented in numbers (these numbers are represented in ones and zeroes too). There are a lot of numbers: 48,000 every second, assuming 48 kHz audio. But no worries, a typical internet connection has far more bandwidth than this amount of data needs.
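To put a rough number on "a lot of numbers", here is a back-of-the-envelope sketch. The 48 kHz rate comes from the answer above; the 16 bits per sample and the single (mono) channel are my own assumptions, picked because they are common for uncompressed audio.

```python
# Rough size of uncompressed voice audio.
SAMPLE_RATE_HZ = 48_000   # samples per second (from the answer above)
BITS_PER_SAMPLE = 16      # assumption: 16-bit samples
CHANNELS = 1              # assumption: mono voice


def raw_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second needed to send every sample uncompressed."""
    return sample_rate_hz * bits_per_sample * channels


bitrate = raw_bitrate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
print(bitrate / 1000, "kbit/s")  # 768.0 kbit/s
```

Under those assumptions the raw stream is under one megabit per second, which is small next to a typical home connection.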

On the other end, simply convert these numbers back to voltages. High numbers mean high voltage, and low numbers mean low voltage. These voltages get pumped through a speaker cone, which moves proportionally to the voltage, which moves air proportionally as well, which then generates the sound you can hear.

Ooh voltage=sound helped my brain a bit. I’ve asked this question before. Have a friend in sonar tech I honestly asked to break it down for me. And still didn’t understand it. Voltage is a nice word, I like it and it’s helping. I would like to understand literally though. How the hell is the voice transferred over wire. I do not hear data, I hear the actual voice, and in real time, how is it so fast? Smarter brains than me for sure.

Sound is basically oscillating air. A microphone turns oscillating air into an oscillating electric field (same basic thing as the power in your home). This makes electrons move back and forth in a wire. The electrons don’t move very quickly, but the electric field travels down the wire at around the speed of light, reaching the other end in a fraction of a second. There, a speaker does the opposite job of the microphone, turning the field back to oscillating air, hence sound.
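If it helps to see why "a fraction of a second" is no exaggeration, here is a quick estimate. The 5,000 km distance and the velocity factor of 0.66 (signal speed as a fraction of the speed of light) are illustrative assumptions; real cables vary.

```python
# Estimate how long an electrical signal takes to cross a long wire.
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def propagation_delay_s(distance_m: float, velocity_factor: float) -> float:
    """Time for the signal to travel distance_m at a fraction of light speed."""
    return distance_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_S)


# Assumptions: 5,000 km of cable, signal moving at about 2/3 the speed of light.
delay = propagation_delay_s(5_000_000, 0.66)
print(f"{delay * 1000:.1f} ms")
```

That works out to a few hundredths of a second, short enough that a conversation still feels live.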

That’s the most basic telephone. Air shakes on one end, then electrons shake in a wire, then air shakes on the other end. Real phone systems involve switching (so you can dial a number) and multiplexing (so that more than one signal can share the same wire).

In digital communications, the electric field is converted into numbers describing it (by an ADC, an analog-to-digital converter), then the data is transmitted, and it's later turned back to a field (by a DAC, a digital-to-analog converter).

I don’t think you’re asking how data is transmitted over the Internet, and even if you are, that’s a bit much for an answer in this subreddit.

Sound is a compression wave, like pushing and pulling air molecules. Those pushes and pulls of air create high and low pressure against a tiny diaphragm in the microphone (I'm sure you've seen a speaker 🔈; a microphone is much the same thing run in reverse). When that diaphragm moves a coil against a magnet, it creates a voltage.

Now imagine that voltage wave, which goes up and down like an ocean wave, is sampled – thousands of times per second you measure the height of the wave and record it as a digital number.
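Here is a minimal sketch of that "measure the height of the wave" step. A pure 440 Hz tone stands in for the voice, and the 8,000 samples-per-second rate and 16-bit integer range are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE_HZ = 8_000   # assumption: a telephone-style sample rate
TONE_HZ = 440.0          # assumption: a pure tone standing in for a voice


def sample_tone(freq_hz: float, sample_rate_hz: int, n_samples: int) -> list[float]:
    """Measure the wave height (between -1.0 and 1.0) at regular time steps."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


def quantize_16bit(height: float) -> int:
    """Record a height as a whole number between -32767 and 32767."""
    clipped = max(-1.0, min(1.0, height))
    return round(clipped * 32767)


samples = [quantize_16bit(h) for h in sample_tone(TONE_HZ, SAMPLE_RATE_HZ, 8)]
print(samples)  # eight integers tracing the start of the wave
```

Each integer is one "height measurement"; a real system produces thousands of them per second.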

We’ve found creative ways to make that height measurement data smaller so there’s less to move, and therefore faster to send. But then that data is sent a lot like Morse code – push the button down, 1. Let up, 0. Millions (not a typo) of times per second.
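One example of a "creative way" to shrink the measurements is mu-law companding, the idea behind classic telephone audio: quiet sounds get finer steps than loud ones, so 8 bits per sample go a long way. The sketch below uses the continuous textbook formula with mu = 255; it is not bit-exact with any real telephone codec, and the function names are my own.

```python
import math

MU = 255  # the usual mu-law constant


def mu_law_compress(x: float) -> float:
    """Squash a height in [-1, 1] so quiet values get more resolution."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Undo mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def encode_8bit(x: float) -> int:
    """Turn a height in [-1, 1] into a code from 0 to 255."""
    return round((mu_law_compress(x) + 1) / 2 * 255)


def decode_8bit(code: int) -> float:
    """Turn a code from 0 to 255 back into an approximate height."""
    return mu_law_expand(code / 255 * 2 - 1)


def to_bits(code: int) -> str:
    """The 1s and 0s that would actually go down the line."""
    return format(code, "08b")


code = encode_8bit(0.5)
print(to_bits(code), decode_8bit(code))
```

The decoded height is close to the original but not identical; that small error is the price of sending half as many bits as a 16-bit sample.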

The receiving end accepts all those wave height measurements, knows the time intervals at which they were taken, and reconstructs an analog voltage. That voltage is connected to a speaker coil and moves the coil back and forth to create pressure waves in the air.
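A sketch of what the receiver does with those measurements: pair each number with the moment it was taken and scale it to a voltage. The 16-bit range and the 1.0 volt reference are assumptions; a real DAC also smooths between the points, which is left out here.

```python
def to_voltage(sample: int, v_ref: float = 1.0) -> float:
    """Scale a 16-bit sample to a voltage between -v_ref and +v_ref."""
    return sample / 32767 * v_ref


def reconstruct(samples: list[int], sample_rate_hz: int, v_ref: float = 1.0):
    """Return (time in seconds, voltage) pairs, one per received sample."""
    dt = 1 / sample_rate_hz  # the known interval between measurements
    return [(n * dt, to_voltage(s, v_ref)) for n, s in enumerate(samples)]


points = reconstruct([0, 32767, -32767], sample_rate_hz=8_000)
for t, v in points:
    print(f"t = {t * 1000:.3f} ms -> {v:+.2f} V")
```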

Start by just humming at a low pitch. Then increase the pitch. What do you notice? The higher the pitch, the faster the vibration. Now, as a simplified picture, turn that vibration into 1s and 0s. The 0s represent the space between the vibrations. The fewer 0s between each 1, the higher the pitch. That is what frequency is in a nutshell.
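To make "faster vibration = higher pitch" concrete, here is a small sketch that estimates pitch by counting how often a sampled wave swings from negative to positive in one second. The pure tones and the 8,000 samples-per-second rate are assumptions, and real pitch detection is more involved than this.

```python
import math


def sampled_tone(freq_hz: float, sample_rate_hz: int, seconds: float) -> list[float]:
    """A pure tone measured at regular intervals."""
    n_samples = int(sample_rate_hz * seconds)
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


def upward_crossings(samples: list[float]) -> int:
    """Count how many times the wave goes from below zero to zero or above."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev < 0 <= cur)


low = upward_crossings(sampled_tone(220.0, 8_000, 1.0))
high = upward_crossings(sampled_tone(440.0, 8_000, 1.0))
print(low, high)  # the higher note crosses zero roughly twice as often
```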

There are two other aspects of sound that turn the hum of your vocal cords into words: amplitude and modulation.

The amplitude is how “tall” each 1 is. Or in other words, the amount of voltage it has. I can change the amplitude of my voice without changing the pitch, and likewise, a signal can have more or less voltage per pulse without changing the frequency of the pulses.

Modulation is what your mouth does to the vibration as it passes through it. If you grin while humming it will make more of an EEE sound, or if you purse your lips it will make more of an OOO sound, and so on. Modulating a signal is a bit trickier to ELI5 but I'll try. It is another wave that travels along with the "carrier wave." The carrier wave has the information about the frequency and amplitude, and the signal modulation basically travels on top of it and "shapes" the carrier wave into little packets of data that are smoother or sharper depending on the sound that is being transmitted.

The big difference is how the sound is represented and digital transmission just adds another step in-between. In conventional analog transmission, the sound waves are air vibrations that cause a microphone to vibrate. The vibrating microphone generates a change in voltage and this signal is conveyed via a wire to a receiver. On the receiver’s end, the change in voltage vibrates a speaker which vibrates the air creating sound waves. Both the speaker and microphone are basically magnets attached to diaphragms and work according to electromagnetic induction.

Digital transmission adds a step in-between as the changes in voltage are converted, via an electronic device, into a different kind of analog signal that *describes* the original signal. The original analog signal will just be a series of high and low voltages that correspond precisely to the highs and lows of the original sound waves. The new signal will be a series of high and low voltages that represent numbers describing the highs and lows of the original signal. In its most basic form, an analog signal like this is represented as a series of numbers indicating the measured voltage at a particular time interval. This series of numbers is encoded into a signal using an agreed-upon convention and that signal is transmitted. On the receiving end, another electronic device takes this descriptive series of numbers and produces the proper voltages at the proper times corresponding to the highs and lows of the original analog signal.
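The "agreed-upon convention" can be as simple as both ends agreeing that every number is a signed 16-bit integer, low byte first. That particular convention is my assumption for illustration; the sketch below uses Python's standard `struct` module to do the round trip.

```python
import struct


def encode_samples(samples: list[int]) -> bytes:
    """Pack each sample as a signed 16-bit little-endian integer."""
    return struct.pack(f"<{len(samples)}h", *samples)


def decode_samples(payload: bytes) -> list[int]:
    """Unpack bytes back into samples using the same convention."""
    count = len(payload) // 2  # two bytes per sample
    return list(struct.unpack(f"<{count}h", payload[: count * 2]))


original = [0, 1000, -1000, 32767, -32768]
wire_bytes = encode_samples(original)
print(len(wire_bytes), decode_samples(wire_bytes) == original)  # 10 True
```

Because both sides use the same format, the receiver gets back exactly the numbers that were sent.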

The actual transmission is a bit more complicated, as sending and receiving a digital signal requires a standardized representation of the numbers as well as a method to coordinate the sending of the information (so that you know when one number ends and another starts). The information itself is typically compressed, because the raw signal contains unnecessary information (e.g. sounds that cannot be heard) and takes up a large amount of bandwidth. There must also be methods to break this information up into easily transmitted pieces with information on the source and destination. However, these methods and the infrastructure to support them are the same ones used for transmitting any other kind of data over a network, which allows a single infrastructure to handle any kind of data to and from any device connected to the network.
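A toy sketch of breaking the data into pieces that carry source and destination information. The `Packet` fields, the addresses, and the 160-byte chunk size are all hypothetical, made up for illustration; real network protocols have their own header formats.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    source: str        # hypothetical sender label
    destination: str   # hypothetical receiver label
    sequence: int      # position of this piece in the stream
    payload: bytes     # a slice of the audio data


def packetize(data: bytes, source: str, destination: str, chunk_size: int = 160):
    """Split data into numbered packets of at most chunk_size bytes."""
    return [
        Packet(source, destination, seq, data[start : start + chunk_size])
        for seq, start in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(packets) -> bytes:
    """Put packets back in order by sequence number and join the payloads."""
    ordered = sorted(packets, key=lambda p: p.sequence)
    return b"".join(p.payload for p in ordered)


audio = bytes(range(256)) * 2                      # 512 bytes of stand-in data
packets = packetize(audio, "alice", "bob")
shuffled = list(reversed(packets))                 # pretend they arrived out of order
print(len(packets), reassemble(shuffled) == audio)  # 4 True
```

The sequence numbers are what let the receiver rebuild the stream even when pieces arrive out of order.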