How does file compression/ZIP files work?

334 viewsEngineeringOther

I understand it’s to save file space, but like…how?

Follow-up: If you have to compress a photo to save on file size (when emailing, for example) couldn’t it just send essentially a text file on how to uncompress it back to its original size and quality? (ex. “Set the resolution to XxY and add X color at Y pixels”)

Thank you!

In: Engineering

9 Answers

Anonymous 0 Comments

Data compression at a basic level is actually a very simple concept, if you remove redundant data from a file you can reduce its size.

Imagine a particular book contains 1500 examples of the word ‘the’. If you were to replace all examples of ‘the’ in this book with a placeholder say the ‘@’ symbol and placed an index at the beginning that read ‘@’ = ‘the’ you could save a significant amount of hard drive space without losing any data.

1500 examples of a 3 character word replaced with a single character equals a savings of around 3000 bytes or just over 3 Kilobytes minus some overhead. Repeat this multiple times for multiple words and the savings adds up.

One of @ downsides though is that it makes @ book a tad harder to read as you have to look up what @ means in @ index all @ time. This extra overhead in a computer translates to a need for slightly more processing power.

In order to store more complex data like pictures and video they have to be compressed down significantly, but lossless compression described above isn’t going to be enough. We need lossfull compression. This is when you are willing to alter data to make it easier to compress.

In this example you could replace all instances of ‘th’ followed by a vowel to be ‘the’. So ‘tha’, ‘thi’, ‘tho’, ‘thu’, and sometimes ‘thy’. This allows all those examples to be treated as ‘the’ by the compression algorithm and therefore allows for more compression. Thes theugh makes the document even harder to read but importantly you still understand what it is trying to say.

This technique is used in formats like JPEGs and results in the trademark skewing and flattening of colors.

To go further than this you don’t necessarily have to store the exact data needed to recreate an object bit by bit.

Instead it’s sometimes easier to store the instructions on how to recreate that object instead. If you can break down an object into identical constituent blocks you can store how to make those blocks in memory along with instructions on where they go in a larger object like a gigantic lego set.

The point being that you only need to store the data on how to recreate any individual block once, even though that block may appear in an object thousands of times.

Games like Minecraft are a textbook example of this. Maps are created using a technique called ‘procedural generation’ where a pseudo-random map is created using a repeatable mathematical equation. These equations take in a ‘seed value’ as the random element. So long as that given seed value is the same, the output from the equation will always be the same. You therefore don’t need to store every block on a map, because you can run the math equation with the seed on demand to determine what block is in a particular location by default.

The randomness for the maps comes from the fact that there are millions of possible seed values each generating a unique map.

The save file then only needs to contain the seed value for your given map, and the blocks that you changed from the default.

Anonymous 0 Comments

digital files are just a bunch of 1s and 0s (at the very basic level), and those are repeating, in essence: they have patterns.

compression tool or algorithm look for these patterns, then make a new (now compressed) file with “pointers” like Pattern A Goes Here, Pattern B Goes There, etc.

these pointers are often smaller than the actual patterns, and some files have a lot of similar patterns, the resulting compressed file can be way smaller than it’s actual file.

of course there are files that can’t be compressed very well and you see basically no noticeable change in their size.

you can see this firsthand with a little experiment: make a text file in notepad, put in repeating words or letters like aaaaaaaaaaa or bbbbbbbbbbbbb or ababababababab, save it as a `.txt` file and then compress it with whatever compression program you use (WinRAR, 7-Zip, NanaZip, etc.)

you’ll see that the text file may have like ~200kb size but the compressed file is just around ~1kb (the program found the pattern of letters and make pointers on how to find them, like “Repeat AB 20x here”, “Repeat AAAA 30x there”).

texts are among the easiest files to compress.

> couldn’t it just send essentially a text file on how to uncompress it back to its original size and quality?

in a simple way, no. To uncompress a compressed file, the program have to understand how it was compressed in the first place, and that information is available in the file, you can’t simply send a text file containing the compression information, it’s kind of inefficient–and as stated in above example, a single text file can be 200kb, but the compressed version is 1kb, why send the bigger text file?

[more answers](https://www.reddit.com/r/explainlikeimfive/comments/g9266e/eli5_how_does_file_compression_work/)

Anonymous 0 Comments

elii5 answer:

1111111111111111111111111111111111111

versus

1010001

The 1st is 40 long. The 2nd binary 101000 = decimal (2*2*2*2*2) 0+(2*2*2)+0+0+0 = decimal 40 and now it takes 7 characters.

7/40 is 17.5% of the original file.

* The actual process is so efficient that text files are roughly 10% of the actual fie.
* The last 1 is not needed but used to keep it simple: basically you KNOW it is a group of 1s as the group before and after are always 0 (otherwise it would be 41 or more 1’s)

“couldn’t it just send essentially a text file on how to uncompress it back to its original size and quality? (ex. “Set the resolution to XxY and add X color at Y pixels”)”

Too complex and you also need to take in account that files can get damaged during transport. Compression also has a check digit to make sure it can be repaired when it arrives at the other end.

Anonymous 0 Comments

Instead of repeating something, it’s more efficient to have a way to say you repeated that something instead.

If you have a bunch of images making a video, it’s better to just say these images are repeats then repeating them. This can be abstracted to a lot of types of data.

Anonymous 0 Comments

Compression works by using short codes for common data, and long codes for rare data.

For example, the standard text code uses 8 bits for every letter, which is not efficient for English. Letter E is the most common letter in English – so we can give a short 6-bit code for it. Letter Z is the least common – so we can give it a longer 11 bit code. This substitution will make average English text shorter. If some letter doesn’t appear at all – it doesn’t need a code.

It can be compressed even further, if we consider the words – not every combination of letters is a word. We can make a dictionary of all English words and assign them codes according to their frequency – this will compress average text to 25% of its original size.

ZIP uses a combination of a dictionary and frequency coding. It starts from assumption that all letters are equal and there are no words, but as it reads the file – it keeps the tallies and adapts the coding to the text. That means, that the beginning of the text is always badly compressed – but it becomes better and better later.

Note, that no compression can compress every file – there are always files that actually get longer, even if just by 1 bit. A completely random stream of characters cannot be compressed by **any** method – its uncompressed form is already the shortest possible.

>couldn’t it just send essentially a text file on how to uncompress it back to its original size and quality? (ex. “Set the resolution to XxY and add X color at Y pixels”)

This will actually make most pictures much longer – you waste a lot of bits to say “Set the resolution” and “Add color”. All current picture formats just demand that info to be listed in some specific order – so the PNG reader doesn’t need to read “Set resolution” – it just knows that “5th number from the beginning is resolution”.

Anonymous 0 Comments

It finds repeating patterns in the data and avoids storing them more than once.

Which is why data lacking any patterns (such as random data or already compressed data) can not be compressed any further.

Anonymous 0 Comments

When you compress a ZIP file, you can choose to compress the file using the DEFLATE method. The compressed data for each file is added to the ZIP archive, and the location of each file in the archive is checked.

Once all the files are added, a directory called Central Directory is created at the end of the collection. This directory contains information about each compressed file, such as its name and location in the archive. Finally, the End of Central Directory record is written to the end of the file to help locate the Central Directory.

When you open a ZIP file to view or extract its contents, the first step is to locate the End of Central Directory at the end of the file. This gives you the information you need to find the Central Directory, with a record of every file in the directory. Each record tells you where to find the corresponding compressed file data in the archive.

Interestingly, all pointers in the ZIP file are relative. For example, instead of saying, “Go to byte 500,” it might say, “So go back to byte 100.” This option allows you to attach ZIP files to other files such as executable programs, and provides archive extraction. Using such a program can scan and extract the attached ZIP data.

Anonymous 0 Comments

Make dictionary with numbered entries, mail the dictionary and the text with all the words replaced by the number of their dictionary entry. Give common words short numbers.

That’s basically text compression.

Pictures: Make a paint by numbers of the picture, send the paint by numbers and the paint. You want a smaller package? Simplify the colours and add some white and black paint to make darker or lighter shades.

Video? treat every frame as a picture (intraframe compression) then see if you can just send “Keep everything the same except this bit that moved a fraction of a inch” (Interframe compression).

Anonymous 0 Comments

Instead of the full words **comp**ression, **comp**leted, **comp**ensation, **comp**utation, replace ‘comp’ with a symbol like $:

$ression, $leted, $ensation, $utation.

We made it smaller and we know how to change it back to normal.