The compression algorithm scans all of the data and looks for repeated sections.
It then writes a compressed file with a conversion table: each large repeated section is represented by a small number, and the table is used to translate the numbers back when the file is decompressed.
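A minimal sketch of that idea in Python (the phrase list, the marker format, and the example text here are all made up for illustration; real formats store the table in a compact binary layout):

```python
# Toy sketch of the "conversion table" idea: repeated phrases get a short
# numeric token, and the table is kept alongside the compressed text so the
# substitution can be undone.

def compress(text, phrases):
    table = {}
    for i, phrase in enumerate(phrases):
        token = f"\x00{i}\x00"        # small numeric marker standing in for the phrase
        table[token] = phrase
        text = text.replace(phrase, token)
    return text, table

def decompress(text, table):
    for token, phrase in table.items():
        text = text.replace(token, phrase)
    return text

original = "the cat sat on the mat because the mat was warm " * 3
small, table = compress(original, ["the mat", "the "])
assert decompress(small, table) == original
# compressed text plus the stored table is still smaller than the original
print(len(original), "->", len(small) + sum(len(p) for p in table.values()))
```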
Some data compresses much better than other data, particularly when it is very repetitive.
Other data hardly compresses at all, especially data that has already been through a compression algorithm: JPEGs and MP3s are already lossy compressed formats.
Look at Lempel-Ziv-Welch for one of the earliest and most common compression styles:
https://en.m.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
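The linked article describes LZW in detail; a rough Python sketch of the core idea (the dictionary starts with every single character and learns longer and longer phrases as it scans) might look like this:

```python
# Minimal LZW sketch (textbook version working on characters; a real
# implementation would also pack the output codes into bits).

def lzw_compress(data: str) -> list[int]:
    dictionary = {chr(i): i for i in range(256)}   # start with all single characters
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                    # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[candidate] = next_code      # learn the new phrase
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

def lzw_decompress(codes: list[int]) -> str:
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                      # code not yet in the table
            entry = previous + previous[0]
        result.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(result)

text = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
assert lzw_decompress(codes) == text
print(len(text), "characters ->", len(codes), "codes")
```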
First off: it's not possible to compress everything into something smaller. Whatever algorithm you choose, some possible inputs will actually end up larger after compression, not smaller. There are more possible files of a given length than there are shorter files, so no algorithm can shrink them all.
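You can see both halves of that with a quick experiment (Python's zlib is used here purely as an example; the exact sizes will vary from run to run):

```python
import os
import zlib

# Random bytes have no patterns to exploit, so the "compressed" result is
# typically a little larger once the format's own overhead is added.
random_data = os.urandom(100_000)
repetitive_data = b"abcabcabc" * 11112            # ~100 kB of obvious repetition

print(len(zlib.compress(random_data)))            # usually slightly more than 100000
print(len(zlib.compress(repetitive_data)))        # collapses to a few hundred bytes
```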
Now that being said, most data we care about has patterns and repetition. The simplest compression algorithms look for that repetition and, instead of storing the data directly, describe a way to rebuild the repetition or recreate the pattern.
It's much less data for me to say "2 to 1000 by 2s" than it is to write out "2 4 6 8 10 12…" and so on.
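Putting rough numbers on that (a quick Python check, not part of the original comparison):

```python
# The "rule" is a handful of characters; the expanded sequence is thousands.
rule = "2 to 1000 by 2s"
expanded = " ".join(str(n) for n in range(2, 1001, 2))

print(len(rule))        # 15 characters
print(len(expanded))    # 1947 characters
```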
If you take a book and replace every occurrence of “the” with “#” instead, you will make the book shorter without losing any information. You have compressed it.
Of course, if you want to *read* the book, you'll need to convert all of those "#"s back into "the"s. This is what happens when you expand a zip back into the files it contains.
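In code, that swap and its undo might look something like this (a tiny Python illustration; real compressors use binary tokens rather than a character that might already appear in the text):

```python
# Squeeze and restore a piece of text by swapping a common word for a short symbol.
text = "the quick brown fox jumps over the lazy dog near the riverbank"
compressed = text.replace("the", "#")
restored = compressed.replace("#", "the")

assert restored == text            # nothing was lost
print(len(text), "->", len(compressed))
```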
An exe is not a compressed file, it is an executable file.
Now for compression. Imagine I had a file that was just 10 gigabytes of
```
https://www.youtube.com/watch?v=dQw4w9WgXcQ
https://www.youtube.com/watch?v=y3ZhS8wg8z4
https://www.youtube.com/watch?v=XEkOXvkuy0Q
https://www.youtube.com/watch?v=Hrxf-mhwAAs
https://www.youtube.com/watch?v=zaD6WPIOoN8
…
```
It's pretty easy to see that
```
Every following line in this file starts with https://www.youtube.com/watch?v=
dQw4w9WgXcQ
y3ZhS8wg8z4
XEkOXvkuy0Q
Hrxf-mhwAAs
zaD6WPIOoN8
…
```
has the same information but is much, much smaller, more like 2 GB.
And that's how zipping works: it finds repeating patterns and replaces them with a smaller identifier plus a way to undo the replacement, just in a machine-readable way instead of a human-readable one.
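You can watch a real compressor do exactly this. Here is a small Python/zlib experiment standing in for the 10 GB file; the video IDs are randomly generated placeholders and the exact sizes will vary:

```python
import random
import string
import zlib

# Many lines that share the long "https://www.youtube.com/watch?v=" prefix.
prefix = "https://www.youtube.com/watch?v="
lines = [prefix + "".join(random.choices(string.ascii_letters + string.digits, k=11))
         for _ in range(10_000)]
data = "\n".join(lines).encode()

compressed = zlib.compress(data)
assert zlib.decompress(compressed) == data
print(len(data), "->", len(compressed))   # the repeated prefix mostly disappears
```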
Lots of clever mathy tricks, depending on which compression algorithm is used. One way: if a file has several repeating bits, say 20 zero bits in a row, replace them with a shorter pattern that records how many zero bits were replaced.
It might do the same with long repeating bit patterns, substituting them with shorter patterns plus a lookup table that shows what to put back when decompressing.
And lots of other ways to achieve the same thing: replacing lots of bits with fewer bits in a way that can somehow be restored when you want to uncompress the file.
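A bare-bones version of that zero-run trick might look like this (a Python sketch that works on a string of '0'/'1' characters rather than actual packed bits):

```python
from itertools import groupby

# Each run of identical bits becomes a (bit, length) pair.
def rle_encode(bits: str) -> list[tuple[str, int]]:
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_decode(runs: list[tuple[str, int]]) -> str:
    return "".join(bit * length for bit, length in runs)

bits = "00000000000000000000" + "1" + "0000000000" + "111"
runs = rle_encode(bits)
assert rle_decode(runs) == bits
print(runs)   # [('0', 20), ('1', 1), ('0', 10), ('1', 3)]
```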
So there are two forms of compression. Lossy and lossless.
Lossy compression is strictly for media files. Essentially it makes small alterations to the media file so that it's easier to compress and comes out smaller, and it turns out you can keep making changes until it gets quite small. All the video and audio formats you're familiar with use this form of compression. It's not suitable for anything that isn't video, audio, or an image, because you modify the data in an irreversible way, and that would make the data useless for everything else.
Lossless compression typically revolves around predictable patterns in the data that can be exploited. One of the first versions of this was something called RLE (run-length encoding) compression. In most text files and lots of databases you frequently see long strings of repeated values, such as runs of spaces in a text file. If you have 12 G's in a row, you can just replace them with the number 12 and the letter G, reducing 12 bytes to two.
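A minimal RLE sketch along those lines (byte-oriented Python; a count byte followed by the value, which is just one of several possible encodings):

```python
# Each run of a byte is stored as a count byte followed by the byte itself,
# so runs are capped at 255 in this toy version.

def rle_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_expand(packed: bytes) -> bytes:
    out = bytearray()
    for count, value in zip(packed[::2], packed[1::2]):
        out += bytes([value]) * count
    return bytes(out)

sample = b"GGGGGGGGGGGG          AAAAAB"   # 12 G's, 10 spaces, 5 A's, one B
packed = rle_compress(sample)
assert rle_expand(packed) == sample
print(len(sample), "->", len(packed))      # 28 -> 8
```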
One of the most efficient compression algorithms, and a big step forward when it was first developed, is Huffman encoding. It's a bit-wise format where you take all the bytes in a file, count the number of instances of each value, and then construct a binary tree with the most common values near the top and the least common values further down, so the most common values get the shortest codes. This is especially good for text because certain characters, like vowels, are used very frequently; a common letter like e might be encoded in just a few bits instead of a full byte.
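Here is a compact sketch of Huffman coding in Python: it only derives the codes, while a real encoder would also store the table and pack the bits into bytes.

```python
import heapq
from collections import Counter

# Count character frequencies, merge the two rarest nodes repeatedly to build
# the tree, then read each character's code off the path from the root.
def huffman_codes(text: str) -> dict[str, str]:
    heap = [[count, i, ch] for i, (ch, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        heapq.heappush(heap, [lo[0] + hi[0], next_id, (lo, hi)])
        next_id += 1
    codes = {}
    def walk(node, prefix):
        payload = node[2]
        if isinstance(payload, str):
            codes[payload] = prefix or "0"
        else:
            walk(payload[0], prefix + "0")
            walk(payload[1], prefix + "1")
    walk(heap[0], "")
    return codes

text = "this is an example of huffman encoding: common letters get shorter codes"
codes = huffman_codes(text)
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1]))[:5]:
    print(repr(ch), code)                  # the most frequent characters get the shortest codes
compressed_bits = sum(len(codes[ch]) for ch in text)
print(len(text) * 8, "bits ->", compressed_bits, "bits")
```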
There’s also a whole host of other schemes that people have come up with and most modern compression algorithms will use some hybrid of different approaches to try to shrink things down.