Songs are made of sounds. Sounds (more generally, any kind of wave) can be mumbled, jumbled, and mixed, but they have a nice property: even after two notes (frequencies) are mixed together, they can be mathematically separated again into something called a spectrogram, which is basically a list of all the notes being played at any single moment. This is really nice, because even if a sound is jumbled and mumbled, you can still break it apart and get a nice fingerprint of the song. Each instrument, voice, and hence song has its own peculiar spectrogram, which is what our brain uses to discern different sounds. Notes are like the colors of sound.
What Shazam does is calculate this fingerprint, and since different songs have different sounds, it can be used to identify a song. And like colors, it's really difficult to distort a sound so much that it cannot be recognized, because frequencies tend to stay the same even through noise or obstacles. Amplitude (volume) can also be used to recognize songs, but only if the recording is really, really accurate, because noise and obstacles have a much greater impact on amplitude than on frequency.
Other comments are missing an explanation of what a fingerprint actually is.
A spectrogram is the result of applying a Fourier transform to the input signal; it produces a matrix shaped `number of frequencies × time instants`. Basically, the content of every frequency at every point in time is now known.
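As a rough sketch of how that matrix is produced: split the signal into short frames and take the Fourier transform of each one. The frame size of 1024 samples and the lack of window overlap are simplifying assumptions; real implementations use windowing functions and overlapping frames.

```python
import numpy as np

def spectrogram(signal, frame_size=1024):
    # Split the signal into non-overlapping frames of frame_size samples.
    n_frames = len(signal) // frame_size
    frames = signal[:n_frames * frame_size].reshape(n_frames, frame_size)
    # rfft gives one complex value per frequency bin; take the magnitude.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return spec.T  # shape: (number of frequencies, time instants)

# One second of a 440 Hz tone sampled at 44100 Hz: the energy shows up
# concentrated in the frequency bin closest to 440 Hz.
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone)
print(spec.shape)  # (513, 43): 513 frequency bins, 43 time frames
```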
Then a set of points (local maxima) is selected so that they spread across the whole spectrogram. Since these points are local maxima, it's likely they'll survive even if the recording comes from a noisy environment.
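One common way to pick those local maxima is a maximum filter over the spectrogram: a point is kept if it equals the maximum of its neighborhood. The neighborhood size and threshold below are made-up illustrative values, not anything Shazam publishes.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(spec, neighborhood=20, threshold=1.0):
    # A point is a local maximum if it equals the max of its neighborhood.
    local_max = maximum_filter(spec, size=neighborhood) == spec
    # Drop low-energy "maxima" (e.g. flat silent regions) with a threshold.
    peaks = np.argwhere(local_max & (spec > threshold))
    return [(int(f), int(t)) for f, t in peaks]  # (frequency bin, time frame)

# Tiny synthetic spectrogram with two isolated peaks.
demo = np.zeros((100, 50))
demo[30, 10] = 5.0
demo[70, 40] = 8.0
print(find_peaks(demo))  # [(30, 10), (70, 40)]
```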
Each of those maxima is paired with another maximum that is close in terms of frequency and time; the pairs with lower energy content are discarded (energy being the value of a point in the spectrogram).
A fingerprint is the result of applying a certain hashing function to a pair of points; it takes the frequency and time instant of each point into account.
N pairs = N fingerprints
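A sketch of the pairing-and-hashing step: hash the two frequencies together with the time gap between the points, so the same pair of notes produces the same hash no matter where it occurs in the song. The exact fields and hash function Shazam uses are not public; this just shows the idea.

```python
import hashlib

def fingerprint(peak_a, peak_b):
    # A peak is a (frequency bin, time frame) pair.
    f1, t1 = peak_a
    f2, t2 = peak_b
    # Hash the two frequencies and the time DELTA (not absolute times),
    # so the fingerprint is the same wherever the pair occurs in a song.
    key = f"{f1}|{f2}|{t2 - t1}".encode()
    return hashlib.sha1(key).hexdigest()[:10], t1  # (hash, anchor time)

# Pair each peak with nearby later peaks: N pairs -> N fingerprints.
peaks = [(30, 10), (45, 12), (70, 14)]
pairs = [(a, b) for i, a in enumerate(peaks)
         for b in peaks[i + 1:] if b[1] - a[1] <= 5]
prints = [fingerprint(a, b) for a, b in pairs]
print(len(prints))  # 3 pairs -> 3 fingerprints
```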
For any song a LOT of fingerprints are produced and stored in a database.
When you send a recording to Shazam, it goes through this process of fingerprint extraction. The extracted fingerprints are then used to query their database and if you’re lucky there will be some (many) matches.
Those matches are then filtered to exclude false positives. For example:
* song A: 100 fingerprints matched
* song B: 20 matched
* song C: 10 matched
It’s likely the recording you sent is taken from song A.
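The matching step above can be sketched as a simple count: store each fingerprint hash in a lookup table, then tally which song the query's fingerprints hit most often. The dict-as-database and the hash strings here are purely illustrative.

```python
from collections import Counter

db = {}  # fingerprint hash -> list of songs that contain it

def index_song(song, fingerprints):
    for fp in fingerprints:
        db.setdefault(fp, []).append(song)

def best_matches(recording_fps):
    # Count, per song, how many of the recording's fingerprints hit it.
    counts = Counter(song for fp in recording_fps for song in db.get(fp, []))
    return counts.most_common()

index_song("song A", [f"hash{i}" for i in range(200)])
index_song("song B", [f"hash{i}" for i in range(20)])  # shares 20 hashes with A
recording = [f"hash{i}" for i in range(100)]
print(best_matches(recording))  # [('song A', 100), ('song B', 20)]
```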
SOURCE: I’ve implemented a similar audio fingerprinting algorithm.
Chunks of every song get turned into numbers called vectors, and those vectors get stored in a database.
When you record a bit of a song in the Shazam app, your recorded data gets turned into the same kind of vector numbers that they put in the database.
They compare the vector numbers from your recording to the vectors stored in the database. The closest set of numbers is probably a chunk of the song you’re looking for.
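The "closest set of numbers" comparison described in the linked Elastic post is typically a nearest-neighbor search over the vectors, e.g. by cosine similarity. The three-dimensional vectors below are made up for illustration; real audio embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_song(query, database):
    # Return the stored song whose vector is most similar to the query.
    return max(database, key=lambda song: cosine_similarity(query, database[song]))

database = {
    "song A": np.array([0.9, 0.1, 0.0]),
    "song B": np.array([0.1, 0.8, 0.3]),
}
query = np.array([0.85, 0.15, 0.05])  # a noisy recording of a chunk of song A
print(closest_song(query, database))  # song A
```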
https://www.elastic.co/blog/searching-by-music-leveraging-vector-search-audio-information-retrieval
If you want to find out the details, this course covers it: https://www.coursera.org/learn/audio-signal-processing (free)
It’s one of the last modules, so you will need to work your way through the FT, STFT, Harmonic model, etc. to get the technical knowledge to really understand audio feature extraction.
I’ve done this course myself and it’s very good, if you like mathematics and audio signal processing.