How does file compression work?

Title basically says it all. I understand how photos and videos can trade quality for file size, but files like games or large folders can be shrunk into a .rar or .zip file, transferred, and pulled back out with no loss of functionality. How does that work? If nothing’s being taken away, how is space being saved?

11 Answers

Anonymous 0 Comments

A compression algorithm tries to find patterns in the data in various ways, and then instead of outputting the data itself it outputs the pattern needed to recreate the data. This is how both lossy and lossless compression work. However, a lossy algorithm (used for audio, images and video) keeps only the patterns it thinks are important and ignores any deviation from them, while a lossless algorithm also records every deviation between the data and the patterns it found, since it cannot know which deviations are important and which are not. Different algorithms are able to find patterns in different types of data, so if you use the wrong algorithm, or the data has few recognizable patterns at all, it will not compress very well. This is also why you cannot compress already-compressed data: there are no more patterns left to find.
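
To see that last point in action, here is a minimal Python sketch using the standard zlib module (an illustration of the general idea, not of how any particular archiver works):

```python
import zlib

# Highly repetitive input: plenty of patterns for the algorithm to find.
data = b"the quick brown fox jumps over the lazy dog. " * 500

once = zlib.compress(data)    # patterns found, big savings
twice = zlib.compress(once)   # almost no patterns left to find

# Roughly: 23000 bytes -> on the order of a hundred bytes -> slightly
# MORE than len(once), because the second pass only adds overhead.
print(len(data), len(once), len(twice))
```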

Anonymous 0 Comments

Some files have patterns in them. The compression tool looks through the file and finds the patterns. Then it makes a new file with one copy of the patterns, followed by the rest of the file, including markers like “pattern 1 goes here” or “pattern 5 goes here”. Since the markers are smaller than the patterns, and some files have a very large number of certain patterns, the compressed version of the file can be much smaller. Some files don’t compress well, and compression doesn’t make them any smaller.

Above I’ve discussed lossless compression, where no information is destroyed. There are other approaches used to compress sound or images that also discard some less significant information; those operations are lossy, not lossless.
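
A toy sketch of this idea in Python, with whole words standing in for patterns and a made-up marker byte (real compressors work on raw bytes and find repeating chunks automatically):

```python
from collections import Counter

def toy_compress(text, n_patterns=5):
    words = text.split(" ")
    # Pick the most common words as our "patterns".
    patterns = [w for w, _ in Counter(words).most_common(n_patterns)]
    # Store one copy of each pattern, then the text with markers like
    # "\x01" + pattern number standing in for "pattern N goes here".
    body = " ".join(
        "\x01%d" % patterns.index(w) if w in patterns else w for w in words
    )
    return patterns, body

def toy_decompress(patterns, body):
    return " ".join(
        patterns[int(w[1:])] if w.startswith("\x01") else w
        for w in body.split(" ")
    )

text = "to be or not to be that is the question to be or not to be"
table, packed = toy_compress(text)
assert toy_decompress(table, packed) == text
```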

Anonymous 0 Comments

There are a lot of good comments here, but if you want a visual approach, here is a super interesting article that covers compression.

https://pudding.cool/2017/05/song-repetition/

Note: The article’s main point is to discuss repetition in music, but the compression part is just too cool not to mention.

Anonymous 0 Comments

I’ll use text as an example. Take this sentence:

I am very awesome

Given that every character (letters and spaces) requires 8 bits (1 byte), it takes 136 bits to store the text. 17 characters x 8 bits = 136.

To compress, we start by counting how often each character appears. Listing from most often to least we get:

* space appears 3 times
* E appears 3 times
* A appears 2 times
* M appears 2 times
* I appears 1 time
* O appears 1 time
* R appears 1 time
* S appears 1 time
* V appears 1 time
* W appears 1 time
* Y appears 1 time

Now, we convert each character to a string of bits like this:

* space = 01
* E = 001
* A = 0001
* M = 00001
* I = 000001
* O = 0000001
* R = 00000001
* S = 000000001
* V = 0000000001
* W = 00000000001
* Y = 000000000001

What this does is make the letters that appear most often use the fewest bits. You can see that normally space, E, A, M, I and O take one byte (8 bits) each, but when compressed they each take up less than 1 byte. R still takes 1 byte, and S, V, W, and Y each use more than one byte, but because they occur less often, you still come out ahead overall.

If you now write this using the raw bits, you would get:

* I = 000001
* space = 01
* am = 000100001
* space = 01
* very = 000000000100100000001000000000001
* space = 01
* awesome = 000100000000001001000000001000000100001001

Strung all together, it looks like this:

000001010001000010100000000010010000000100000000000101000100000000001001000000001000000100001001

Which is 96 bits. Since there’s 8 bits in a byte, you’ve now compressed the original 17 bytes down to 12 bytes. 12*8=96 bits.

You’ll notice that each letter is a string of 0s with a 1 at the end. This means that to decompress, you find each 1 and count the zeroes in front of it: 000001 is an I, 01 is a space, 0001 is an A, etc.

Now I have to mention that the compressed file has that conversion table added at the very beginning so that you know that 01=space, 001=E and so on. So that means compressing a short sentence like “I am very awesome” would actually make a larger file just because that conversion table would take up so much space. But if you were to compress a whole book, you can see how well that would work. Also, there are standard compression tables based on how often characters appear in a language overall so that you don’t have to include the conversion table. They’re not as efficient, but have the advantage of not needing to do the character counts and calculations, so they’re faster, and you don’t waste space in small files with the conversion table.
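
Here is a minimal Python sketch of exactly this scheme. One caveat: when two characters appear equally often, this code may rank them in a different order than the alphabetical listing above, so the exact bit string can differ, but the total is still 96 bits:

```python
from collections import Counter

def build_codes(text):
    # Most frequent character gets "01", the next "001", then "0001"...
    ranked = [ch for ch, _ in Counter(text).most_common()]
    return {ch: "0" * (i + 1) + "1" for i, ch in enumerate(ranked)}

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    reverse = {code: ch for ch, code in codes.items()}
    out, run = [], ""
    for bit in bits:
        run += bit
        if bit == "1":          # every code word ends at its first 1
            out.append(reverse[run])
            run = ""
    return "".join(out)

text = "I am very awesome"
codes = build_codes(text)
bits = encode(text, codes)
print(len(bits))                # 96, matching the worked example
assert decode(bits, codes) == text
```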

Anonymous 0 Comments

I think the easiest description is that you’re replacing something with a shorter description of it. So describing the Mona Lisa as a painting of a lady is an example of (extremely lossy) compression.

It turns out that for many documents we can find alternate ways to describe them without losing any information: lossless compression. For example, writing 1,000,000,000 as 1e9 is an example of lossless compression.

Anonymous 0 Comments

Lossless compression works by finding a more efficient, compact way to represent exactly the same information contained in the input, so it can be restored later, identical byte for byte.

There are several possible strategies, depending on the nature of the data.

For example, files can contain predictable, almost fixed sequences (headers, padding to align fields), or the data can be correlated and partially predictable, as in pictures (PNG compression is lossless) or audio/video files, so only the deviations from the prediction need to be stored.

Another classic example is text, where letters are encoded using one byte (or even more) each, which is quite inefficient because there are fewer letters than possible byte values.

In other cases an archive can contain multiple versions of a file, where the differences between versions are tiny, so each subsequent version can be stored in a negligible amount of space.
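
You can see this with Python’s standard difflib module (this just shows how small the diff is; real archive formats use their own delta encodings):

```python
import difflib

v1 = ["line %d\n" % i for i in range(1000)]   # version 1: 1000 lines
v2 = list(v1)
v2[500] = "line 500 (edited)\n"               # version 2: one line changed

# The diff mentions only the changed region plus a little context,
# so storing version 2 costs about a dozen lines, not another 1000.
delta = list(difflib.unified_diff(v1, v2))
print(len(v1), "lines vs", len(delta), "diff lines")
```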

In almost all kinds of data there exist some sequences that are more frequent than others and can be substituted with shorter sequences, and so on…

Strategies are almost infinite, but they all revolve around finding a more concise way to tell the same message.

Anonymous 0 Comments

Tom Scott did a very clear video on how text is compressed using “Huffman trees”. That works on single characters; tools like WinZip, however, don’t use single characters but bigger chunks.
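
For reference, here is the classic textbook construction of a Huffman code sketched in Python with a heap (an illustration of the general technique, not of what any particular zip tool does):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Each heap entry: [total weight, [symbol, code], [symbol, code], ...]
    heap = [[wt, [ch, ""]] for ch, wt in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)        # two least frequent subtrees...
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]     # ...one goes on the 0 branch
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]     # ...the other on the 1 branch
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

codes = huffman_codes("this is an example of a huffman tree")
# Frequent symbols (like the space) get short codes, rare ones long codes.
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(ch), code)
```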

Anonymous 0 Comments

I’d just like to recommend you check out some videos on the Computerphile channel on YouTube. They have some nice ones about various kinds of compression:

https://www.youtube.com/user/Computerphile/search?query=compression

Anonymous 0 Comments

If your letter to your friend lists out 1,000 copies of the letter “a”, you can shrink it by writing “1000 × a” instead. Rather than sending a 10×10 sheet of paper, you only need to send a 1×1 sheet.
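
That trick is called run-length encoding; a minimal sketch in Python:

```python
def rle_encode(text):
    runs = []
    for ch in text:
        if runs and runs[-1][1] == ch:
            runs[-1][0] += 1            # extend the current run
        else:
            runs.append([1, ch])        # start a new run
    return runs

def rle_decode(runs):
    return "".join(count * ch for count, ch in runs)

letter = "a" * 1000 + "b" * 3
runs = rle_encode(letter)
print(runs)                             # [[1000, 'a'], [3, 'b']]
assert rle_decode(runs) == letter
```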

Anonymous 0 Comments

There are many different ways and tricks!

All of them are very technical, but I’ll give you my favorite (and super basic) concept that blew my mind.

You know you could send an entire image from position A to position B, right? But that’s a lot of pixels, which is a lot of data to send!

So one trick you can do is “predict” specific pixels based on previous information. How exactly you do that doesn’t matter right now, but it’s a bit of math that looks at your last image and says “well the next image must kinda look like this…”.

How does it help you compress the image? Well you can place one of those predictors not only at position B but ALSO position A.

So now you predict the same image on both “ends”. How does that decrease the amount of information traveling?

Simple: at position A you predict the next image; let’s call it I_predict. You also KNOW what the next image really looks like: I_real.

You know that at position B you will predict the exact same image I_predict.

So what you can do now is simply send the information that corrects I_predict so that it results in I_real (which you get by comparing the two at position A) and say:

“Listen, I know you will think I_predict is the next image, and it almost is! Just correct these few pixels and you will be on spot!”.

Thereby you drastically reduce the amount of data transferred, since you didn’t have to send an entire image, only instructions to correct one.
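
A toy sketch of this idea in Python, with “frames” as flat lists of pixel values and a deliberately dumb predictor (real codecs predict far more cleverly):

```python
def predict(previous_frame):
    # Stand-in predictor: assume the next frame equals the last one.
    return list(previous_frame)

def corrections(predicted, real):
    # Send only the pixels where the prediction was wrong.
    return [(i, px) for i, (p, px) in enumerate(zip(predicted, real)) if p != px]

def apply_corrections(predicted, fixes):
    frame = list(predicted)
    for i, px in fixes:
        frame[i] = px
    return frame

# Both position A and position B know the last frame...
last_frame = [0] * 10000
next_frame = list(last_frame)
next_frame[42] = 255                    # ...and only one pixel changed.

# A sends just the corrections, not the whole frame.
fixes = corrections(predict(last_frame), next_frame)
print(len(fixes), "correction instead of", len(next_frame), "pixels")

# B makes the same prediction, applies the corrections, gets the real frame.
assert apply_corrections(predict(last_frame), fixes) == next_frame
```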