Title basically says it all. I understand how photos and videos can trade quality for file size, but files like games or large folders can be shrunk into a .rar or .zip file, transferred, and pulled back out with no loss of functionality. How does that work? If nothing’s being taken away, how is space being saved?
I’ll use a piece of text as an example. Take this sentence:
I am very awesome
Assuming every character (letters and spaces) takes 8 bits (1 byte), storing the text takes 136 bits: 17 characters × 8 bits = 136 bits.
To compress, we start by counting how often each character appears, ignoring upper and lower case. Listing from most often to least, we get:
* space appears 3 times
* E appears 3 times
* A = 2
* M = 2
* I = 1
* O = 1
* R = 1
* S = 1
* V = 1
* W = 1
* Y = 1
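If you want to check the counts yourself, here is a minimal Python sketch. The variable names are my own, and breaking ties alphabetically is an assumption I made so the order matches the list above.

```python
from collections import Counter

text = "I am very awesome"

# Count each character, ignoring case so the "I" and the lowercase
# letters are all treated the same way as in the list above.
counts = Counter(text.upper())

# Sort by count (highest first); ties are broken alphabetically (assumption).
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for char, count in ranked:
    label = "space" if char == " " else char
    print(f"{label} = {count}")
```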
Now, we convert each character to a string of bits like this:
* space = 01
* E = 001
* A = 0001
* M = 00001
* I = 000001
* O = 0000001
* R = 00000001
* S = 000000001
* V = 0000000001
* W = 00000000001
* Y = 000000000001
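The table above follows a simple rule: the n-th most common character gets n zeroes followed by a single 1. Here is a rough sketch of that rule in Python. `build_table` is a hypothetical helper name, not something from a real compression library, and the alphabetical tie-breaking is my assumption.

```python
from collections import Counter

def build_table(text):
    """Give the n-th most common character a code of n zeroes followed by a 1."""
    counts = Counter(text.upper())
    # Most frequent first; ties broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: "0" * rank + "1" for rank, ch in enumerate(ranked, start=1)}

table = build_table("I am very awesome")
for ch, code in table.items():
    print("space" if ch == " " else ch, "=", code)
```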
What this does is give the characters that appear most often the shortest bit strings. Normally space, E, A, M, I, and O each take one byte (8 bits), but in this scheme each takes less than a byte. R still takes exactly 1 byte, and S, V, W, and Y each take more than one byte, but because they occur less often, you still come out ahead overall.
If you now write this using the raw bits, you would get:
* I = 000001
* space = 01
* am = 000100001
* space = 01
* very = 000000000100100000001000000000001
* space = 01
* awesome = 000100000000001001000000001000000100001001
Strung all together, it looks like this:
000001010001000010100000000010010000000100000000000101000100000000001001000000001000000100001001
That’s 96 bits. Since there are 8 bits in a byte, you’ve compressed the original 17 bytes down to 12 bytes (12 × 8 = 96 bits).
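To double-check the arithmetic, here is a small sketch that encodes the sentence and counts the bits. `build_table` and `encode` are hypothetical helper names, and the bits are kept as a string of "0" and "1" characters purely for readability.

```python
from collections import Counter

def build_table(text):
    """The n-th most common character gets n zeroes followed by a 1."""
    counts = Counter(text.upper())
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: "0" * rank + "1" for rank, ch in enumerate(ranked, start=1)}

def encode(text, table):
    """Replace every character with its bit string and join the results."""
    return "".join(table[ch] for ch in text.upper())

text = "I am very awesome"
bits = encode(text, build_table(text))
original_bits = len(text) * 8  # assumes 1 byte per character

print(len(bits), "bits compressed vs", original_bits, "bits original")
# prints: 96 bits compressed vs 136 bits original
```

A real compressor would pack those bits eight to a byte before writing them to disk; the string form here is just easier to look at.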
You’ll notice that each character’s code is a string of 0s with a single 1 at the end. This means that to decompress, you find each 1 and count the zeroes in front of it: 000001 is an I, 01 is a space, 0001 is an A, and so on.
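Decompression by counting zeroes can be sketched like this. `SYMBOLS` and `decode` are names I made up for illustration; the character order is taken straight from the table in this answer.

```python
# Characters ordered from most to least common, as in the table above.
SYMBOLS = " EAMIORSVWY"

def decode(bits):
    """Count the zeroes before each 1 and look up the matching character."""
    text = []
    zeros = 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            # n zeroes followed by a 1 means the n-th most common character.
            text.append(SYMBOLS[zeros - 1])
            zeros = 0
    return "".join(text)

print(decode("000001" "01" "0001" "00001"))  # prints: I AM
```

Note that the output comes back in upper case, because this toy table doesn't distinguish between upper and lower case letters.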
I should mention that the compressed file has to include that conversion table at the very beginning, so that the decompressor knows that 01 = space, 001 = E, and so on. That means compressing a short sentence like “I am very awesome” would actually produce a larger file, because the conversion table takes up more space than the compression saves. But if you were to compress a whole book, the table becomes a tiny fraction of the file and the savings add up. There are also standard conversion tables based on how often characters appear in a language overall, so the table doesn’t have to be included. They’re not as efficient for any particular file, but they skip the character counts and calculations, so they’re faster, and they don’t waste space on a table in small files.
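To put rough numbers on the table overhead, here is a sketch under a made-up assumption: each table entry costs 2 bytes (one for the character, one for its code length). Real formats store their tables differently, so treat `table_entry_bytes` and `compressed_size_bytes` as purely illustrative.

```python
import math
from collections import Counter

def compressed_size_bytes(text, table_entry_bytes=2):
    """Estimate payload plus table size; the table cost is a hypothetical figure."""
    counts = Counter(text.upper())
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    # The n-th most common character costs n zeroes plus a 1: n + 1 bits.
    payload_bits = sum((rank + 1) * counts[ch]
                       for rank, ch in enumerate(ranked, start=1))
    payload_bytes = math.ceil(payload_bits / 8)
    table_bytes = table_entry_bytes * len(ranked)
    return payload_bytes + table_bytes

short = "I am very awesome"
long_text = short * 1000  # stand-in for a much longer document

# len() equals the size in bytes here because the text is plain ASCII.
print(len(short), compressed_size_bytes(short))          # prints: 17 34
print(len(long_text), compressed_size_bytes(long_text))  # prints: 17000 12022
```

With these assumed numbers, the single sentence grows from 17 to 34 bytes, while the long text shrinks from 17,000 to about 12,000 bytes because the 22-byte table is paid for only once.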