How does file compression/ZIP files work?

I understand it’s to save file space, but like…how?

Follow-up: If you have to compress a photo to save on file size (when emailing, for example) couldn’t it just send essentially a text file on how to uncompress it back to its original size and quality? (ex. “Set the resolution to XxY and add X color at Y pixels”)

Thank you!

9 Answers

Anonymous 0 Comments

Data compression at a basic level is actually a very simple concept: if you remove redundant data from a file, you can reduce its size.

Imagine a particular book contains 1500 examples of the word ‘the’. If you were to replace all examples of ‘the’ in this book with a placeholder, say the ‘@’ symbol, and place an index at the beginning that reads ‘@’ = ‘the’, you could save a significant amount of hard drive space without losing any data.

1500 examples of a 3-character word replaced with a single character saves 2 characters each time, for a total of around 3000 bytes (roughly 3 kilobytes, at one byte per character), minus some overhead for the index. Repeat this for multiple words and the savings add up.
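Here is a minimal sketch of that placeholder trick in Python, just to make the arithmetic concrete. The function names `compress` and `decompress` and the sample "book" are made up for illustration, and it assumes the ‘@’ symbol never appears in the original text. Real ZIP tools use a more general scheme than a single hand-picked word, so treat this as a toy, not as how ZIP actually does it.

```python
import re

def compress(text, word="the", placeholder="@"):
    # Assumption: the placeholder never appears in the original text,
    # otherwise decompression could not tell the two apart.
    if placeholder in text:
        raise ValueError("placeholder already used in text")
    # Replace only the whole word, so "there" or "other" are left alone.
    body = re.sub(rf"\b{re.escape(word)}\b", placeholder, text)
    # The "index" at the front records what the placeholder stands for.
    header = f"{placeholder}={word}\n"
    return header + body

def decompress(data):
    # Split the index line off the front, then undo the substitution.
    header, body = data.split("\n", 1)
    placeholder, word = header.split("=", 1)
    return body.replace(placeholder, word)

# A stand-in "book" containing exactly 1500 copies of the word "the".
book = "the cat sat on the mat near the door " * 500
packed = compress(book)

print(len(book), len(packed))      # sizes before and after
print(decompress(packed) == book)  # True: nothing was lost
```

With this sample text the packed version is 2994 characters shorter: 3000 saved by the substitutions, minus 6 for the index line at the front.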

One of @ downsides though is that it makes @ book a tad harder to read as you have to look up what @ means in @ index all @ time. This extra overhead in a computer translates to a need for slightly more processing power.

To store more complex data like pictures and video, files have to be compressed much further, and the lossless compression described above isn’t going to be enough. We need lossy compression. This is when you are willing to alter the data to make it easier to compress.

In this example you could replace every instance of ‘th’ followed by a vowel with ‘the’: ‘tha’, ‘thi’, ‘tho’, ‘thu’, and sometimes ‘thy’. This allows all those examples to be treated as ‘the’ by the compression algorithm and therefore allows for more compression. Thes theugh makes the document even harder to read, but importantly you still understand what it is trying to say.
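The lossy step can be sketched the same way. The function name `make_lossy` is hypothetical, the "sometimes ‘thy’" case is left out, and only lowercase text is handled. The point it shows is that different originals collapse into the same output, so there is no way to get the exact original back.

```python
import re

def make_lossy(text):
    # Rewrite "th" followed by a, i, o or u as "the".
    # After this, a placeholder for "the" matches far more often.
    return re.sub(r"th[aiou]", "the", text)

original = "this and that and those and thus"
lossy = make_lossy(original)

print(lossy)  # thes and thet and these and thes

# "this" and "thus" now look identical, so the change cannot be undone.
print(make_lossy("this") == make_lossy("thus"))  # True
```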

A similar idea is used in formats like JPEG, which throw away fine detail that the eye is unlikely to miss. Pushed too far, it results in the trademark blockiness and flattening of colors.
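To give a rough feel for why throwing away detail helps, here is a simplified stand-in in Python. It is not the actual JPEG algorithm (JPEG works on frequency data for 8x8 pixel blocks, not directly on pixel values). It just rounds brightness values to coarser steps, which "flattens" them, and then counts runs of identical values. The names `quantize` and `run_length_encode` and the sample row are assumptions for illustration.

```python
from itertools import groupby

def quantize(pixels, step=32):
    # Snap each 0-255 value down to a multiple of `step`,
    # discarding the small differences between neighbours.
    return [(p // step) * step for p in pixels]

def run_length_encode(values):
    # Collapse each run of equal values into a (value, count) pair.
    return [(v, len(list(run))) for v, run in groupby(values)]

# One row of grey levels: a bright patch followed by a dark patch.
row = [200, 201, 203, 202, 205, 90, 92, 91, 95, 94]

print(len(run_length_encode(row)))       # 10 runs, so nothing is saved
print(run_length_encode(quantize(row)))  # [(192, 5), (64, 5)]
```

The flattened row needs only two pairs instead of ten values, at the cost of every pixel in each patch now having exactly the same shade.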

To go further than this you don’t necessarily have to store the exact data needed to recreate an object bit by bit.

Instead, it’s sometimes easier to store the instructions on how to recreate that object. If you can break down an object into identical constituent blocks, you can store how to make those blocks along with instructions on where they go in the larger object, like a gigantic Lego set.

The point being that you only need to store the data on how to recreate any individual block once, even though that block may appear in an object thousands of times.
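As a small sketch of the Lego idea: each block is defined once, and the layout only says which block goes where. The tile names, the tiny text "picture" and the `build` function are all made up for this example.

```python
# Each block's contents are stored exactly once...
blocks = {
    "G": ["####", "####"],  # hypothetical "grass" tile, 4 wide and 2 tall
    "W": ["~~~~", "~~~~"],  # hypothetical "water" tile
}

# ...and the layout only records which block sits in each position.
layout = ["GGWW", "GWWG"]

def build(layout, blocks):
    # Expand the layout back into the full picture, line by line.
    rows = []
    for layout_row in layout:
        tile_height = len(blocks[layout_row[0]])
        for line in range(tile_height):
            rows.append("".join(blocks[name][line] for name in layout_row))
    return rows

picture = build(layout, blocks)
print("\n".join(picture))
```

The rebuilt picture is 64 characters, but only 24 were stored: 16 for the two block definitions and 8 for the layout. The bigger the object and the more often blocks repeat, the bigger the gap gets.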

Games like Minecraft are a textbook example of this. Maps are created using a technique called ‘procedural generation’ where a pseudo-random map is created using a repeatable mathematical equation. These equations take in a ‘seed value’ as the random element. So long as that given seed value is the same, the output from the equation will always be the same. You therefore don’t need to store every block on a map, because you can run the math equation with the seed on demand to determine what block is in a particular location by default.

The randomness for the maps comes from the fact that there is a huge number of possible seed values, each generating a different map.

The save file then only needs to contain the seed value for your given map, and the blocks that you changed from the default.
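A toy version of the seed idea might look like this. It is not Minecraft's real generator, which is far more elaborate. The block names, the `World` class and the hashing trick are assumptions chosen only to show the principle: the same seed and coordinates always give the same block, so only the seed and the player's changes need saving.

```python
import hashlib

BLOCK_TYPES = ["grass", "stone", "water", "sand"]  # hypothetical block names

def default_block(seed, x, z):
    # Hash the seed and coordinates together. The same inputs always
    # produce the same digest, so the "random" map is fully repeatable.
    digest = hashlib.sha256(f"{seed}:{x}:{z}".encode()).digest()
    return BLOCK_TYPES[digest[0] % len(BLOCK_TYPES)]

class World:
    def __init__(self, seed):
        self.seed = seed
        self.changes = {}  # only the blocks the player has altered

    def block_at(self, x, z):
        # A player's change wins; otherwise recompute the default on demand.
        return self.changes.get((x, z), default_block(self.seed, x, z))

    def set_block(self, x, z, block):
        self.changes[(x, z)] = block

    def save(self):
        # The whole "save file": the seed plus the player's edits.
        return {"seed": self.seed, "changes": dict(self.changes)}

world = World(seed=12345)
world.set_block(3, 7, "diamond")
print(world.save())
```

However large the map gets, the save data here only grows with the number of blocks the player has changed.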
