So there are two broad forms of compression: lossy and lossless.
Lossy compression is strictly for media files (audio, video, and images). Essentially it makes small alterations to the data so that it becomes easier to compress and make smaller. Each alteration degrades quality slightly, and it turns out you can keep trading quality for size until the file gets quite small. Virtually all the video and audio formats you’re familiar with use this form of compression. It’s not suitable for anything other than media, because it modifies the data in an irreversible way, and that would make a document, database, or program useless.
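Here’s a toy sketch (not any real codec) of the idea: a lossy step throws away small details, and the altered data then compresses far better. The `quantize` function and the 0.1 step size are made up for illustration.

```python
def quantize(samples, step):
    """Lossy step: snap each sample to the nearest multiple of `step`.

    The tiny errors introduced here are irreversible, but the result
    has far fewer distinct values, so it compresses much better.
    """
    return [round(s / step) * step for s in samples]

samples = [0.12, 0.14, 0.13, 0.51, 0.49, 0.50]  # e.g. raw audio samples
coarse = quantize(samples, 0.1)
# `coarse` is now long runs of identical values -- easy to compress,
# but the original samples can never be recovered exactly.
```

Real audio and image codecs do something far more sophisticated (quantizing frequency coefficients rather than raw samples), but the principle is the same.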
Lossless compression typically revolves around predictable patterns in data that can be exploited. One of the first versions of this was run-length encoding (RLE). In many text files and databases you frequently see long strings of repeated values, such as runs of spaces in a text file. If you have 12 G’s in a row, you can replace them with the count and the letter — “12” and “G” — reducing 12 bytes to two.
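A minimal RLE encoder might look like this (the function name and count-then-letter format are just for illustration; real RLE schemes vary in how they store the counts):

```python
def rle_encode(data: str) -> str:
    """Run-length encode a string: each run of a repeated character
    becomes its count followed by the character."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1  # extend the run of identical characters
        out.append(f"{j - i}{data[i]}")
        i = j
    return "".join(out)

print(rle_encode("GGGGGGGGGGGG"))  # 12 G's -> "12G"
print(rle_encode("AAABBC"))       # -> "3A2B1C"
```

Note that RLE only helps when runs are common — on data with no repeats it can actually make the output bigger, which is why it’s usually one step inside a larger scheme rather than used alone.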
One of the most efficient compression algorithms, and a big step forward when it was first developed, is Huffman coding. It’s a bit-wise format: you count the number of times each byte value appears in the file, then build a binary tree with the most common values near the top and the least common further down. The path from the root to a value becomes its code, so frequent values get very short codes. This is especially good for text, because certain characters like vowels appear very frequently — a common letter like “e” might shrink to just a couple of bits.
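A compact sketch of that tree-building step, using Python’s heap module to repeatedly merge the two least-common entries (the function name and tie-breaking details are my own choices, not part of any standard):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build Huffman codes: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tiebreaker, symbol or subtree)
    heap = [(n, i, ch) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least-frequent nodes into one subtree
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: descend both sides
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the bit path
            codes[node] = prefix or "0"  # edge case: one-symbol input
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("this sentence uses the letter e an awful lot indeed")
# The most frequent characters end up with the shortest bit strings.
```

Because no code is ever a prefix of another, the bit stream can be decoded unambiguously without any separators between symbols.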
There’s also a whole host of other schemes people have come up with, and most modern compression algorithms use some hybrid of different approaches to shrink things down.