What are compressed and uncompressed files, how does it all work and why compressed files take less storage?



Anonymous 0 Comments

For the kind of compression that’s typical in something like a zip file, it’s a combination of two different approaches. (Computer people, this is an ELI5, so please don’t jump in and say “well akshulally, it’s not letters, it’s bits”.)

Anyway, imagine you’re writing an email where you have to pay by the letter:

First, if there’s something you’ve already said and you have to repeat yourself, just refer back to what you already said. For example, instead of, “Yesterday I saw a red car and a blue car and today I saw a red car and a blue car,” you can say, “yesterday I saw a red car and a blue car and today, the same.”

Second, use a shorthand where the more common words get fewer letters and the less common words maybe get more, like saying “t” instead of “the”, but “t2” whenever you actually mean “t”, so the net number of letters is smaller. (“I wrote the letter T.” becomes “I wrote t letter T2.”, saving one letter.)

Voila, your email is shorter.
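The first trick (refer back to something you already said) can be sketched in a few lines of Python; the phrase and the `<same>` marker here are made up for illustration:

```python
def compress_repeat(text, phrase, marker="<same>"):
    # Keep the first occurrence of the phrase, replace later ones
    # with a short back-reference marker.
    first = text.find(phrase)
    if first == -1:
        return text
    head = text[:first + len(phrase)]
    tail = text[first + len(phrase):].replace(phrase, marker)
    return head + tail

sentence = ("yesterday I saw a red car and a blue car "
            "and today I saw a red car and a blue car")
short = compress_repeat(sentence, "I saw a red car and a blue car")
print(len(sentence), len(short))  # the marked-up version is shorter
```

Real compressors (zip included) do essentially this, just over raw bytes and with far smarter ways of finding the repeats.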

Anonymous 0 Comments

Just a fun example of compression.

I used the Zen of Python. For simplicity, I removed all punctuation and made it lower case.

I’m sure there are better ways to compress this as well.

Original: 807 bytes

>beautiful is better than ugly
>
>explicit is better than implicit
>
>simple is better than complex
>
>complex is better than complicated
>
>flat is better than nested
>
>sparse is better than dense
>
>readability counts
>
>special cases arent special enough to break the rules
>
>although practicality beats purity
>
>errors should never pass silently
>
>unless explicitly silenced
>
>in the face of ambiguity refuse the temptation to guess
>
>there should be one and preferably only one obvious way to do it
>
>although that way may not be obvious at first unless youre dutch
>
>now is better than never
>
>although never is often better than right now
>
>if the implementation is hard to explain its a bad idea
>
>if the implementation is easy to explain it may be a good idea
>
>namespaces are one honking great idea lets do more of those

Compressed, generated with some code: 709 bytes

>[~=is;!=better;@=than;#=the;$=although;%=never;^=idea;&=complex;*=special;(=should;)=unless;{=obvious;}=it;|=implementation;=explain]
>
>beautiful ~ ! @ ugly
>
>explic} ~ ! @ implic}
>
>simple ~ ! @ &
>
>& ~ ! @ complicated
>
>flat ~ ! @ nested
>
>sparse ~ ! @ dense
>
>readabil}y counts
>
>* cases arent * enough to break # rules
>
>$ practical}y beats pur}y
>
>errors ( % pass silently
>
>) explic}ly silenced
>
>in # face of ambigu}y refuse # temptation to guess
>
>#re ( be one and preferably only one { way to do }
>
>$ that way may not be { at first ) youre dutch
>
>now ~ ! @ %
>
>$ % ~ often ! @ right now
>
>if # | ~ hard to }s a bad ^
>
>if # | ~ easy to } may be a good ^
>
>namespaces are one honking great ^ lets do more of those

Compressed after manual modification: 673 bytes

>[~=is;!=better;@=than;#=the;$=although;%=never;^=idea;&=complex;*=special;(=should;)=unless;{=obvious;}=it;|=implementation;=explain;:= ~ ! @ ;’= # | ~ ;<= to }]
>
>beautiful:ugly
>
>explic}:implic}
>
>simple:&
>
>&:complicated
>
>flat:nested
>
>sparse:dense
>
>readabil}y counts
>
>* cases arent * enough to break # rules
>
>$ practical}y beats pur}y
>
>errors ( % pass silently
>
>) explic}ly silenced
>
>in # face of ambigu}y refuse # temptation to guess
>
>#re ( be one and preferably only one { way to do }
>
>$ that way may not be { at first ) youre dutch
>
>now:%
>
>$ % ~ often ! @ right now
>
>if’hard<s a bad ^
>
>if’easy< may be a good ^
>
>namespaces are one honking great ^ lets do more of those

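The code that generated the table above isn't shown; here's one hypothetical way to do the same thing in Python: score each word by how much space a one-character symbol would save, assign symbols to the best scorers, and prepend the table as a header. (It assumes the text contains neither the symbol characters nor “=”.)

```python
from collections import Counter

SYMBOLS = "~!@#$%^&*(){}|"

def dict_compress(text):
    # Rank words by the space a substitution would save (frequency * length).
    counts = Counter(text.split())
    ranked = sorted((w for w in counts if len(w) > 2),
                    key=lambda w: counts[w] * len(w), reverse=True)
    table = dict(zip(ranked, SYMBOLS))
    header = "[" + ";".join(f"{s}={w}" for w, s in table.items()) + "]"
    body = " ".join(table.get(w, w) for w in text.split())
    return header + "\n" + body

def dict_decompress(packed):
    header, body = packed.split("\n", 1)
    # Rebuild the symbol -> word table from the header.
    table = dict(e.split("=", 1) for e in header[1:-1].split(";") if e)
    return " ".join(table.get(w, w) for w in body.split())

zen = "simple is better than complex complex is better than complicated"
packed = dict_compress(zen)
print(packed)
```

On a text this short the header can actually outweigh the savings, which is why the dictionary trick only pays off once the input is long enough.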

Anonymous 0 Comments

Software programmer here. Like all binary data, files are stored as a series of 1’s and 0’s. Now imagine you had a file that was just a million 1’s. If you wanted to describe this file to someone, it would be a lot smaller to write “a million 1’s” instead of actually writing out “1” a million times. That’s compression.

More formally, compressing a file is actually writing a program that can write the uncompressed file for you. The compressed size of the file is then the size of that program. Decompressing the file is actually running the program to build your uncompressed file.

More structured data like a large rectangle of a single color compresses well because it is easy to write a program that describes that data. On the other hand, random data is not very compressible because it does not have any structure and so you can’t do much other than have your program actually write out the entire number, which is not going to be any smaller than the entire number itself.

This is also why compressing a compressed file does not save more space. You can think of compression like squeezing juice out of a lemon: the more structure that exists in the file, the more juice there is to squeeze, but once you have thoroughly squeezed it, there is no juice left. Compression turns highly structured data into low-structure data, so when you compress again, you are dealing with random-ish data that doesn’t have enough structure to take advantage of. You can also turn this around and estimate how random some data is by measuring how well it compresses.
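That last point is easy to demonstrate with Python's built-in zlib: a highly structured megabyte shrinks to almost nothing, while a random megabyte barely shrinks at all.

```python
import os
import zlib

structured = b"1" * 1_000_000       # a million 1s: maximum structure
random_ish = os.urandom(1_000_000)  # no structure to exploit

print(len(zlib.compress(structured)))  # on the order of 1 KB
print(len(zlib.compress(random_ish)))  # still roughly 1 MB
```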

There are two types of compression. The type I described above is lossless where the uncompressed file is exactly the same as the original file. Lossless algorithms are typically not that complicated and usually look for large areas of the file that share structure, like I mentioned above. Zip files are lossless.

The other type of compression is lossy, where the uncompressed file does not have the same data as the original file, but has some acceptable amount of data loss built into it. In return, lossy algorithms are far better at reducing the size. Lossy algorithms can be very complicated. JPEG and MPEG files are the main examples of lossy compression. From personal experience, if you save a BMP file as a JPEG, it will tend to be around a tenth the size of the BMP. However, the JPEG file will not have the same pixels as the BMP file. The compression algorithm for JPEG has been specifically tuned for photographs, so if you see a JPEG photograph you probably won’t be able to tell that some pixels have been altered. However, for something like digital art, especially pixel art, it is much more noticeable, so you should never save digital art as a JPEG.

Anonymous 0 Comments

There are many different approaches to compression, but it all basically boils down to the same thing: replacing parts of a file with something that takes up less space than it originally does.

Let’s say you have the utterly true sentence “Babylon 5 is the greatest television show ever”. Now imagine that for each word, you assign a single-digit number:

Babylon=1, is=2, the=3, greatest=4, television=5, show=6, ever=7 (and 5 will just be 5). With that, you can now save a ton of space by writing the file as:

15234567

Now, all you need is a program that knows the number for each word. So, when you decompress that file, it replaces each number with the appropriate word. That idea, generally, is a dictionary-based compression (your dictionary is what word is represented by what number).
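The sentence-to-digits scheme written out in Python (the dictionary is the one from the example; both the compressor and decompressor have to share it):

```python
sentence = "Babylon 5 is the greatest television show ever"

# Shared dictionary from the example; "5" simply stays "5".
codes = {"Babylon": "1", "is": "2", "the": "3", "greatest": "4",
         "television": "5", "show": "6", "ever": "7", "5": "5"}

compressed = "".join(codes[word] for word in sentence.split())
print(compressed)                      # 15234567
print(len(sentence), len(compressed))  # 46 characters down to 8
```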

Now, it should be obvious that this isn’t going to work on a larger scale because you need a number for each word, and that’s going to be a huge dictionary if you want to be able to compress any English-language file. And, you’ll soon have to have multi-digit numbers, and that will be a problem because in some cases you’re actually going to use MORE space when you compress a file: imagine you create numbers for 1,254 words, then you realize that you forgot to assign a number to the word “as”. Well, the next number available is 1255, which means that anywhere the word “as” appears, which takes up two characters normally, you’ll be “compressing” it with a value that takes up twice as many.

Oops!

Now, you might make that up through repetition in some files: everywhere the word “computer” appears maybe uses the number 4, which means you save 7 characters each time it appears. If it appears a lot, you can make up some of your loss from words like “as”. But, that’s gonna be hit-or-miss at best. And, either way, you’re still left with a massive dictionary, which isn’t going to do you any favors in terms of speed.

Also, while that could maybe/kinda/sorta work for English-language files, what about other languages, or images? Or music? Nope, you’ll need a dictionary for other languages, but for things like images and music, it’s not gonna work at all because there aren’t words like in a spoken language.

Another approach to compression is RLE, which stands for Run Length Encoding. Here, you look for repeating sequences of bytes in a file. For example, imagine you have a picture, a headshot of yourself maybe, and imagine you took this picture in front of a white wall. If you start looking at the actual bytes in the file, which ultimately represent the color of pixels, you’re going to find that white wall is represented by a lot of repeating values (most likely 255, 255, 255, though not necessarily, but let’s say that’s the case here), one triplet per pixel. So, if there’s 100 pixels that are white (that’s what 255, 255, 255 means in RGB values), then that means you have 300 bytes in a row (a run) that are all the same value (255). With RLE, you would actually write 300-255 (or something like that – there are many ways you might write it). Then, when the image is loaded, the decoder can look at that and say “oh, ok, there are 300 bytes with a value of 255 here” and reconstitute the image appropriately.
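A minimal run-length encoder over raw bytes, along those lines (the [count, value] pair format is just one of the many ways you might write it):

```python
def rle_encode(data):
    # Collapse each run of identical bytes into a [count, value] pair.
    runs = []
    for b in data:
        if runs and runs[-1][1] == b:
            runs[-1][0] += 1
        else:
            runs.append([1, b])
    return runs

def rle_decode(runs):
    return bytes(b for count, b in runs for _ in range(count))

# 100 white pixels at (255, 255, 255) each: 300 identical bytes in a row.
wall = bytes([255] * 300)
print(rle_encode(wall))  # [[300, 255]]
```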

The important thing here isn’t the actual implementation details, and like I said, there are many compression techniques. The important thing is what I said at the start: in both approaches, or in any other not described here, things (words or runs of repeating values or something else) are replaced with something that takes up less space (in some cases, it can actually be mathematical formulas that can recreate a large piece of data). In all cases, the effect is that you get a compressed file.

Anonymous 0 Comments

Imagine you wrote a cool code where every time there were double letters, you replaced them with a number. Now let’s say oo (two lower-case Os) = 1, and ll (two lower-case Ls) = 2. Using that code:

balloon becomes ba21n: _balloon_ is 7 characters but _ba21n_ is only 5!

Now imagine that the pattern lloo happens a lot, so you specify a special code for that. We’ll use 9 for that.

Now _balloon_ becomes _ba9n_ which is only four characters!

Of course it’s not that simple, but that’s compression in a nutshell. When you get longer strings of repetitive data (there are lots of zeros in files, for example) the compression gets even better.
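The balloon example, step by step:

```python
word = "balloon"

# First code: oo -> 1, ll -> 2
step1 = word.replace("oo", "1").replace("ll", "2")
print(step1)  # ba21n: 5 characters instead of 7

# Spotting that "lloo" is common and giving it its own code, 9:
step2 = word.replace("lloo", "9")
print(step2)  # ba9n: 4 characters
```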

Anonymous 0 Comments

Suppose you’re writing a grocery list. Your list initially says this:

I need to get 6 eggs
I need to get 2 liters of soy milk
I need to get 2 liters of almond milk
I need to get 1 pound of ground beef

There’s a lot of repetition in there, right? A smart compression algorithm would recognize that, and might render it like this:

I need to get
6 eggs
2 liters of soy milk
2 liters of almond milk
1 pound of ground beef

An even better compression algorithm might be able to further improve things:

I need to get
6 eggs
2 liters of
soy milk
almond milk
1 pound of ground beef

This is basically what compressing a file does. You take information that’s repeated multiple times and remove that repetition, replacing it with instructions on how to put it back when you need to reconstruct the original content.
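A rough sketch of that first step in Python, factoring out the prefix every line shares and storing it once (real compressors find repeats anywhere, not just at the start of lines):

```python
import os

lines = [
    "I need to get 6 eggs",
    "I need to get 2 liters of soy milk",
    "I need to get 2 liters of almond milk",
    "I need to get 1 pound of ground beef",
]

# Store the shared prefix once, then only what differs per line.
prefix = os.path.commonprefix(lines)
compressed = [prefix] + [line[len(prefix):] for line in lines]

print(repr(prefix))  # 'I need to get '
print(sum(map(len, lines)), sum(map(len, compressed)))  # 127 vs 85 characters
```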

Anonymous 0 Comments

I like the text examples

One for movies or animations is where they only save what changes between the frames. So if you have 100 frames that are all black, you store one black frame and set it to take up the same length of time as the 100 frames did. If you have a shot with blue sky, and it doesn’t change because all the action is going on in the lower half of the frame, you save the blue part once and stretch it out the same way as was done with the black; only once something moves do you have something new you need to keep. This can be done for 10,000 frames in a row, or it can be done even between just two frames where only 10% of the screen stays the same.
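A toy version of that frame-delta idea, with frames as flat lists of pixel values (real codecs work on blocks of pixels, but the principle is the same):

```python
def delta(prev, curr):
    # Store only the pixels that changed since the previous frame.
    # Assumes both frames have the same length.
    return {i: p for i, (q, p) in enumerate(zip(prev, curr)) if q != p}

def apply_delta(prev, changes):
    frame = list(prev)
    for i, p in changes.items():
        frame[i] = p
    return frame

frame1 = [0] * 8                    # all-black frame
frame2 = [0, 0, 0, 0, 9, 9, 0, 0]   # action in a small region
changes = delta(frame1, frame2)
print(changes)  # two entries instead of a whole frame
```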

Anonymous 0 Comments

Basically, compression makes some rules that you can use to re-create (uncompress) the file.

In the most basic case, imagine you have a text file that for some reason is just 1,000,000 ‘a’ characters. Instead of storing all 1 million, you can store something like ‘1000000a’, which saves a lot of space.

If you had 1000 ‘a’ characters followed by 1000 ‘b’ characters, you might compress the file by writing it as ‘1000a1000b’.

The steps you follow (in this case, counting the number of identical characters in a row) are called the compression algorithm. There are many different compression algorithms with different characteristics (for example, depending on whether you want to compress video, text, or audio).
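That counting algorithm fits in a couple of lines with Python's itertools.groupby:

```python
import re
from itertools import groupby

def compress(text):
    # "aaab" -> "3a1b": each run becomes its length plus its character.
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

def decompress(packed):
    # Assumes the original text contains no digits of its own.
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", packed))

print(compress("a" * 1000 + "b" * 1000))  # 1000a1000b
```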

Now in our example, we can recreate exactly the information we started with from our compressed file (it would be pretty useless if we couldn’t read the text after we uncompressed it right?). These are called lossless algorithms.

There are also lossy algorithms, which compress stuff, but you can’t get the exact original back. So for example, let’s say you have the data 123456689. We can describe that (approximately) with the rule “the nth digit is n”, and then when we uncompress, we get 123456789, which is almost the same as the original. An example of lossy compression is JPEG, where the compressed images are less clear than the original (maybe more pixelation, or the colours aren’t quite the same, etc.).
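The digits example as code: the lossy “compressed” form keeps only the rule and the length, so decompressing reproduces the rule’s output, not the original data.

```python
original = "123456689"  # note the repeated 6 where a 7 "should" be

# Lossy compression: keep only a rule ("the nth digit is n") and a length.
length = len(original)

decompressed = "".join(str(n) for n in range(1, length + 1))
print(decompressed)              # 123456789
print(decompressed == original)  # False: the exact original is gone
```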

There are many different compression algorithms, suited to different purposes and data types (images, text, audio, video etc), and they can be quite complicated mathematically.

Anonymous 0 Comments

Compressing and uncompressing a file is like translating a book into a different language, except you make up the language based on what’s in the book. To make files smaller, you have your new language use very short words for the most common words or phrases in the original language, and longer words for the uncommon ones. Then you have to make the dictionary that translates back to the original language, or figure out rules so that you can construct the dictionary, and then the compressed file is the translated file plus the dictionary.

In most cases the compression method (or translation) is chosen to be very good for “normal” files, but bad for “uncommon” files that you generally wouldn’t encounter. Mathematically you can’t have a one-to-one translation that converts every possible combination of letters into a shorter form, because then some combinations would have the same translation and you wouldn’t know which one was the original when you translate it back. If you don’t need *exactly* the original file because it’s something like a picture, you can have a translation that is always shorter, but in general if you try to compress an already compressed file it doesn’t get smaller.