There is a thing called the “Nyquist frequency”.
Basically, with a 44.1 kHz sample rate, you can sample/store sounds up to 22.05 kHz accurately.
Imagine a sine wave… you need multiple points (samples) on that sine wave to be able to store and reproduce it accurately.
The closer the tone gets to half the sample rate, the fewer samples you get per cycle, and the worse the audio quality becomes.
So, 44.1 kHz is basically just good enough to get the frequency spectrum of human hearing covered.
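To put numbers on that, here is a minimal sketch (my own illustration, not part of the answer above): the highest frequency a given sample rate can represent is simply half that rate, the Nyquist frequency.

```python
# Highest representable frequency is half the sample rate.
for sample_rate_hz in (44_100, 48_000, 96_000):
    nyquist_hz = sample_rate_hz / 2
    print(f"{sample_rate_hz} Hz sampling -> content up to {nyquist_hz:.0f} Hz")
# 44100 Hz sampling -> content up to 22050 Hz
# 48000 Hz sampling -> content up to 24000 Hz
# 96000 Hz sampling -> content up to 48000 Hz
```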
This is called the “Nyquist frequency.” Basically, in order to accurately record (and later replay) sounds at a given frequency (in this case around 20 kHz, like you mentioned), you need to “sample” it twice as often as that frequency. Sampling means taking a snapshot of the value at a given point in time.
This is because sound is a wave, and you need to capture both the peak and the valley to accurately recreate it. A frequency of 20 kHz means there are 20k peaks and 20k valleys every second. So that’s 20 + 20 = 40.
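To see the “one sample on the peak, one in the valley” picture concretely, here is a small sketch (my own illustration, assuming numpy): a 20 kHz cosine sampled 40,000 times per second lands exactly on +1, −1, +1, −1, …

```python
import numpy as np

tone_hz = 20_000
sample_rate_hz = 40_000                      # exactly twice the tone frequency
t = np.arange(8) / sample_rate_hz            # first eight sample instants, in seconds
samples = np.cos(2 * np.pi * tone_hz * t)    # cosine: samples land on peaks and valleys
print(np.round(samples))                     # [ 1. -1.  1. -1.  1. -1.  1. -1.]
```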
As for why we go up to 44 instead of staying at 40 (a lot of people can’t even hear above 17 kHz, so even lower might work), I’m not entirely sure.
Hopefully that makes sense, this stuff is a lot easier to explain/understand with pictures.
If the sampling frequency is too close to the acoustic frequency, higher frequency sound waves won’t map neatly into the rigid digital containers that record their values. This causes distortion.
By sampling sound at a frequency that is at least twice the highest expected acoustic frequency, we minimize (but do not eliminate) these acoustic artifacts.
Ideally, sampling would be at least 10 times (an order of magnitude) higher than the highest acoustic frequency. Of course, higher sample rates produce correspondingly more data. So acoustic recording becomes a balance between quality and data size.
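As a rough illustration of that quality-versus-size balance (my own sketch, assuming CD-style 16-bit stereo audio throughout), here is how the raw data rate grows with the sample rate:

```python
BYTES_PER_SAMPLE = 2      # 16-bit audio
CHANNELS = 2              # stereo

for sample_rate_hz in (44_100, 96_000, 192_000):
    bytes_per_second = sample_rate_hz * BYTES_PER_SAMPLE * CHANNELS
    mb_per_minute = bytes_per_second * 60 / 1_000_000
    print(f"{sample_rate_hz:>7} Hz: {bytes_per_second:>7} B/s, ~{mb_per_minute:.1f} MB per minute")
#   44100 Hz:  176400 B/s, ~10.6 MB per minute
#   96000 Hz:  384000 B/s, ~23.0 MB per minute
#  192000 Hz:  768000 B/s, ~46.1 MB per minute
```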
Though audio CDs sound quite good, experienced listeners can discern recording inaccuracies, especially in music with cymbals or other very high frequency sounds.
CDs sample at that frequency, 44,100 times per second.
If they sampled at 20,000, they would take one sample per sound wave, and every sample would land on the same part of the wave. This system wouldn’t detect any sound at all at 20 kHz, because it sees the same part of the wave every time (and that wouldn’t look like a wave).
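A tiny sketch of that point (my own illustration, assuming numpy): sample a 20 kHz wave exactly 20,000 times per second and every sample catches the same spot on the wave, so the result is just a constant.

```python
import numpy as np

tone_hz = 20_000
sample_rate_hz = 20_000                        # sampling once per cycle
t = np.arange(8) / sample_rate_hz
phase = 0.4                                    # arbitrary starting phase
samples = np.sin(2 * np.pi * tone_hz * t + phase)
print(np.allclose(samples, samples[0]))        # True: every sample is identical
```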
We use 44.1 kHz because it was a reasonably low rate that was still comfortably higher than the minimum needed for 20 kHz audio.
There are two effects we needed to consider. The first sets the big picture (double the highest frequency), and the second refines it a bit (a few kilohertz more).
First, to be able to reproduce any waveform that contains frequencies of up to 20 kHz, we need to sample at a minimum of 40 kHz. (This is called the Nyquist theorem. You can perfectly reproduce any frequency of less than half the sample rate.)
If we just sampled our original analogue audio at 40 kHz, there would be a huge issue: aliasing. Say there is an ultrasound component at 29 kHz. Due to sampling at 40 kHz, this sound would appear indistinguishable from an 11 kHz sound. An inaudible sound just became a nasty audible artefact due to our sampling. The same would also happen to any ultrasound at 51 or 69 or … kHz.
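Here is a quick way to see that aliasing with actual numbers (my own sketch, assuming numpy): sampled at 40 kHz, cosines at 29, 51 and 69 kHz produce exactly the same sample values as an 11 kHz cosine.

```python
import numpy as np

sample_rate_hz = 40_000
t = np.arange(32) / sample_rate_hz             # 32 sample instants

alias_target = np.cos(2 * np.pi * 11_000 * t)  # what an 11 kHz tone looks like when sampled
for ultrasound_hz in (29_000, 51_000, 69_000):
    samples = np.cos(2 * np.pi * ultrasound_hz * t)
    print(ultrasound_hz, np.allclose(samples, alias_target))   # all True
```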
Fortunately, there is a conceptually simple solution to aliasing: Before we sample the sound at 40 kHz, we filter out any sound above 20 kHz. No more aliasing, and we get a perfect reproduction of all audible frequencies.
Unfortunately, this exact filter cannot be realised without large compromises in the sound recording equipment. It is very hard to pass, say, 19.9 kHz whilst blocking, say, 20.1 kHz.
Fortunately, there was a simple fix suitable for the technology back in the 1970s: rather than sample at 40 kHz, sample at a bit more than that. Now, we still need to filter out anything above 22.05 kHz, but as 20 kHz is much less than that, it is comparably easy to construct such a filter even with analogue electronics.
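To get a feel for why the width of that transition band matters so much, here is a rough sketch of my own (not from the answer above). The real 1970s problem was with analogue filters, but the same trade-off shows up in digital FIR design; the 60 dB attenuation target and the 176.4 kHz working rate below are assumptions purely for illustration.

```python
from scipy.signal import kaiserord

fs_hz = 176_400            # hypothetical high working rate, standing in for "analogue"
nyquist_hz = fs_hz / 2

def taps_needed(pass_hz, stop_hz, attenuation_db=60):
    # Transition width as a fraction of the Nyquist frequency, per kaiserord's convention.
    width = (stop_hz - pass_hz) / nyquist_hz
    numtaps, _beta = kaiserord(attenuation_db, width)
    return numtaps

print(taps_needed(19_900, 20_100))    # near brick-wall: thousands of taps
print(taps_needed(20_000, 22_050))    # CD-style 2 kHz margin: roughly ten times shorter
```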
Edit: This was supposed to be a reply to another comment asking specifically why 44.1 rather than 40kHz, but I accidentally hit the button for a top-level reply.
44.1 kHz largely comes from how digital music was stored and exchanged *before* it ended up on the CD.
Compact discs came out in the early 80s, when computers were still rather primitive. Usable digital audio workstations (DAWs) wouldn’t really be a thing for another decade or so, and even then they were expensive dedicated hardware devices rather than programs that ran on normal computers. CD burners wouldn’t show up until closer to the end of the 80s, and it would be another few years before they were really practical. The only way to actually make a CD was the way they were mass-produced—making physical masters and stamping out copies.
So for the first decade and a half of the CD’s existence, you weren’t exactly recording your album on your MacBook and uploading it to Bandcamp to get your CD made. Nor could you burn it to a CD-R yourself. How, then, did studios send their digital recordings to the manufacturer to get turned into CDs? If you were to use floppy disks (used by the computers of the day), you’d have to mail in over a thousand disks for just one CD.
It turns out that there was a fairly common technology around at the time that was pretty good at storing large amounts of data on relatively cheap media: the VCR. To store digital audio on a video tape, a device called a *PCM adapter* was used. This converted the bits of data representing the digital audio to black-and-white dots in a video signal, which could then be plugged into a normal VCR and recorded onto a video tape that you could mail to the manufacturer, who could then read this special video signal off of the tape and turn it into a CD.
In this process, it’s useful to structure the black and white dots in the video signal so that there are nice round numbers of samples in each frame of video and so that no samples need to be split across lines. To make things more complicated, VCRs in North America and Japan used NTSC, which is 30 frames per second, while most of the rest of the world uses PAL at 25 frames per second. It would be annoying if CDs played at different speeds depending on which part of the world they were made in, so the chosen sample rate needed to fit evenly into both NTSC and PAL video frames in some reasonable way.
And that’s how we ended up with 44.1 kHz. It’s above the 40 kHz Nyquist rate needed to cover the whole range of human hearing, but divides evenly across video frames and lines in both PAL and NTSC. In fact, 44,100 is the product of the squares of the first four prime numbers (it’s 2x2x3x3x5x5x7x7), meaning that you can divide it by just about any small number without getting a fraction, which was also useful for converting to lower sample rates before we had fancy digital signal processing (DSP) chips.
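A little arithmetic check of those claims (my own sketch; the 245/294 usable-lines-per-field and 3-samples-per-line figures are the ones commonly cited for the PCM adaptors, not something stated above):

```python
# 44100 is the product of the squares of the first four primes.
assert 44_100 == (2 * 3 * 5 * 7) ** 2

# NTSC: 60 fields/s, 245 usable lines per field, 3 samples per line.
assert 60 * 245 * 3 == 44_100
# PAL: 50 fields/s, 294 usable lines per field, 3 samples per line.
assert 50 * 294 * 3 == 44_100

# Divides evenly by lots of small numbers, handy for simple rate conversion.
print([d for d in range(2, 16) if 44_100 % d == 0])   # [2, 3, 4, 5, 6, 7, 9, 10, 12, 14, 15]
```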
A lot of this is simplified, for further reading see Wikipedia: [PCM Adaptor](https://en.wikipedia.org/wiki/PCM_adaptor) and [44.1 kHz](https://en.wikipedia.org/wiki/44,100_Hz)
To give a bit more background than some of the other answers have:
Sound, when we hear it, is the vibration of your eardrum being moved back and forth by a slight change in the air pressure next to it in your ear. Or, if your head happens to be underwater right now, the water pressure. But there’s something moving next to your eardrum, causing your eardrum to get moved back and forth a bunch. Your ear picks that up and transforms it into nerve signals to your brain, and you perceive “sound”. You already knew that.
If the air in your ear is going from “the most pressure right now” to “the least” and then back to “the most” again 1000 times per second, we say that is a frequency of 1000Hz (or 1kHz, 1 thousand cycles per second). It’s often said that humans can “hear from 20Hz to 20kHz” – but as we get older, that 20kHz top end drops off (our ears get a bit less good at the high-frequency stuff as we age). Most of the important information in human speech is generally between 300Hz and 3kHz.
A microphone is kinda like your eardrum, but it turns the air vibrations into an electrical signal that we then process with other electronic stuff. Before we got into CDs and “digital sound”, this was all analogue – meaning the signal level can be basically anything from zero to “totally overloaded” and anything in between. When we were recording to analogue things like magnetic tape (or earlier, wax cylinders or shellac discs cut with a needle), there were limits on how high an audio frequency could be recorded and played back. With magnetic tape, in the era of reel-to-reel tape machines, you could choose what speed the tape would move through the machine, depending on whether you cared more about higher audio quality (faster speed) or recording on the same length of tape for longer (slower speed).
When we move over to digital sound, the recording equipment detects and stores “what was the value of the electrical sound signal right now? (as a number)” a bunch of times every second. A single capture of “the signal was at X level” is known as a “sample”, and the number of times every second that these measurements are taken is called the “sample rate”. Because each individual snapshot of signal level is a number in a fixed range (for CDs that’s 65,536 different possible level values – being 16 bits), they are no longer infinitely variable in the way that the original electrical signal was – they’re digitized. Turned into digits. Ignoring fancy compression technologies (because CDs don’t use them), this means the higher the sample rate, the more computery-data is generated per second of recording time. CDs are stereo, so both Left and Right channels are recorded. Two bytes per channel for each sample, multiplied by the sample rate, is 176,400 bytes per second. So you’d fit maybe 8 seconds of that onto a 1.44 MB floppy disk (y’know, those “save icon” things).
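Checking that arithmetic quickly (my own sketch; the 1,474,560-byte figure for a “1.44 MB” floppy is the usual formatted capacity, an assumption on my part):

```python
sample_rate_hz = 44_100
bytes_per_sample = 2          # 16 bits
channels = 2                  # stereo

bytes_per_second = sample_rate_hz * bytes_per_sample * channels
print(bytes_per_second)                       # 176400

floppy_bytes = 1_474_560                      # a "1.44 MB" floppy (1440 x 1024 bytes)
print(floppy_bytes / bytes_per_second)        # ~8.36 seconds of CD audio
```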
The “44.1kHz” number for a CD is the sample rate – that is, the number of times every second that the analogue electrical signal from a microphone is measured. The absolute highest sound frequency that can possibly be represented by a stream of 44,100 samples per second is half that number – or 22.05 kHz. And that’s the best case – if sample #1 lands on the positive peak (maximum electrical signal level), sample #2 lands on the negative peak (minimum electrical signal level), and sample #3 catches the next positive peak, and so on. If the air is vibrating faster than that, meaning the electrical signal is changing faster than that, those changes cannot be recorded. That is the “Nyquist-Shannon sampling theorem” stuff that the other answers are jumping straight to – they’re entirely correct, I just preferred to take the scenic route.
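That “best case” really does depend on where the samples fall. A small sketch (my own, assuming numpy): at exactly twice the tone frequency, a cosine-aligned wave gives alternating peak/valley samples, while the same tone shifted so the samples land on the zero crossings gives nothing but zeros – which is part of why real systems sample a bit faster than twice the highest frequency.

```python
import numpy as np

tone_hz = 22_050
sample_rate_hz = 44_100                            # exactly twice the tone frequency
t = np.arange(8) / sample_rate_hz

peaks = np.cos(2 * np.pi * tone_hz * t)            # samples land on the peaks and valleys
zeros = np.sin(2 * np.pi * tone_hz * t)            # same tone, shifted: samples land on zero crossings
print(np.round(peaks))                             # [ 1. -1.  1. -1.  1. -1.  1. -1.]
print(np.allclose(zeros, 0))                       # True: this tone is invisible at exactly 2x
```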
**But** because humans wouldn’t be able to hear any audio frequencies above 22kHz anyway, it doesn’t matter that CDs can’t record them (or play them back).