What are compressed and uncompressed files, how does it all work and why compressed files take less storage?


27 Answers

Anonymous 0 Comments

Let's say I have a file that contains the following:

aaaaaaaaaaaaaaaaaaaa

I could compress that like this:

a20

Obviously it is now smaller.
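That trick is called run-length encoding. A minimal sketch in Python (illustrative only, not what any particular tool does) produces exactly that kind of output:

```python
from itertools import groupby

def rle_encode(s):
    # Collapse each run of identical characters into char + count.
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(s))

print(rle_encode("a" * 20))  # a20
```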

Real compression comes from redundancy and from the fact that most data is wasteful in the first place. A byte is 8 bits, and that's basically the smallest amount of data that can be moved or stored. However, if you type a message like this, it only contains 26 different letters plus some numbers and punctuation. With 5 bits you can encode 32 different characters, so we could already compress the data a lot. The next level is to count the letters and notice that some are way more common than others, so let's give shorter bit lengths to those characters. You can look into Huffman coding for more detailed info.
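That frequency idea is what Huffman coding formalizes. A rough Python sketch of the code-assignment step (not a production implementation; real ones differ in tie-breaking and in how the tree is stored):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # One heap node per distinct character: total frequency,
    # a unique tiebreaker, and the bit codes built so far.
    heap = [[freq, i, {ch: ""}] for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    # Repeatedly merge the two least-frequent nodes, prefixing
    # 0 to one side's codes and 1 to the other's.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, [f1 + f2, tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes)  # the most common letter, "a", gets the shortest code
```

Rare letters end up with longer codes, but since they occur rarely, the total bit count still drops well below 8 bits per character.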

Another form of compression is lossy compression, which is used for images, videos and sound. You can easily reduce the number of colors used in an image and it would still look the same to humans. You could also merge similar pixels into the same color and say "this 6×6 block is white".

Anonymous 0 Comments

File compression saves hard drive space by removing redundant data.

For example, take a 500-page book and scan through it to find the 3 most commonly used words.

Then replace those words with placeholders, so 'the' becomes $, and so on.

Put an index at the front of the book that translates those symbols to words.

Now the book contains exactly the same information as before, but it's a couple dozen pages shorter. This is the basics of how file compression works: you find duplicate data in a file and replace it with pointers.

The upside is reduced space usage, the downside is your processor has to work harder to *inflate* the file when it’s needed.
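The book scheme above can be sketched in a few lines of Python. This is a toy (the `$` tokens and `;`/`|` separators are made up for the example, and it breaks if the text itself contains them), but it shows the index-at-the-front idea:

```python
from collections import Counter

def compress(text, n_words=3):
    # Find the most common words and swap each for a short token,
    # then put the token table (the "index") at the front.
    words = text.split()
    common = [w for w, _ in Counter(words).most_common(n_words)]
    tokens = {w: f"${i}" for i, w in enumerate(common)}
    header = ";".join(f"{t}={w}" for w, t in tokens.items())
    body = " ".join(tokens.get(w, w) for w in words)
    return header + "|" + body

def inflate(packed):
    # Read the index back and substitute the words in again.
    header, body = packed.split("|", 1)
    table = dict(pair.split("=") for pair in header.split(";"))
    return " ".join(table.get(w, w) for w in body.split())

text = "the cat and the dog and the bird"
packed = compress(text)
print(packed)
assert inflate(packed) == text
```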

Anonymous 0 Comments

Compression works by finding patterns in the data and then storing those patterns instead of the data itself. There are lots of different ways to do this and a lot of theory involved, but that is the basic principle. Compression works better when the algorithm is built for the specific file type, so a generic compression algorithm made to work on any file does not do as well on, say, image files as a dedicated image compression algorithm. Some algorithms might even opt to lose some information that is not important and does not fit into an easy pattern. This is most common in images and video, where the exact value of each pixel is not that important. Compression algorithms also do not work if there are no patterns in the data, so random data, encrypted data or already-compressed data cannot be compressed much further.
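The last point is easy to see with Python's standard-library `zlib`: patterned data shrinks dramatically, while random bytes barely compress at all (the output can even come out slightly larger than the input):

```python
import os
import zlib

repetitive = b"abc" * 1000     # 3000 bytes with an obvious pattern
random_ish = os.urandom(3000)  # 3000 bytes with no pattern at all

print(len(zlib.compress(repetitive)))  # tiny: the pattern is found
print(len(zlib.compress(random_ish)))  # roughly 3000: nothing to find
```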

Anonymous 0 Comments

Also to add: there is lossless and lossy compression, where lossy compression looks to remove data that is considered to have low informational content.

“For emaxlpe, it deson’t mttaer in waht oredr the ltteers in a wrod aepapr, the olny iprmoatnt tihng is taht the frist and lsat ltteer are in the rghit pcale. The rset can be a toatl mses and you can sitll raed it wouthit pobelrm.”
The above sentence is copied from https://www.livescience.com/18392-reading-jumbled-words.html

In a similar way, lossy compression can remove or replace content with minimal change to the structure of the data.

Anonymous 0 Comments

If you have a document that consists of the letter "a" 100 times followed by the letter "b" 200 times, you could compress it to a file that looks like this: "100a200b". That would take less space.
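A round trip in that exact notation can be sketched in Python (count before character, as in "100a200b"; the regex assumes the characters themselves aren't digits):

```python
import re
from itertools import groupby

def compress(s):
    # One "count + character" entry per run of identical characters.
    return "".join(f"{len(list(g))}{ch}" for ch, g in groupby(s))

def decompress(s):
    # Read back each (count, character) pair and expand it.
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", s))

doc = "a" * 100 + "b" * 200
print(compress(doc))  # 100a200b
assert decompress(compress(doc)) == doc
```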

Anonymous 0 Comments

There are some good answers about how lossless compression works, and that's really useful. But the answers for lossy compression are lacking a bit.

With lossy compression, some of the data is literally discarded during compression; then, when you reopen the file, the computer basically makes educated guesses about what used to be there. As an example, you could remove all of the u's following q's, the s's from the end of plural words, the apostrophes from contractions, and all of the punctuation. It's pretty likely that you could look at that text and, given the rules the computer used when compressing the file, figure out what was supposed to go where based on the rules and the context. For example:

This is the original text, which I thought up rather quickly. It’s not the best example possible, but it should work well for our purposes.

Becomes:

This is the original text which I thought up rather qickly Its not the best example possible but it should work well for our purpose

Not really substantially shorter in this case, but we also didn’t have a very optimized algorithm for it. More rules make the file smaller and smaller.
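The mechanical rules above can be sketched in Python. This toy only applies the rules that are easy to guess back (the plural-s rule is skipped, since undoing it reliably would need a dictionary):

```python
import re

def lossy_compress(text):
    # Rule 1: drop the u that always follows q.
    text = re.sub(r"([Qq])u", r"\1", text)
    # Rule 2: drop apostrophes from contractions.
    text = text.replace("'", "").replace("\u2019", "")
    # Rule 3: drop common punctuation.
    return re.sub(r"[.,!?]", "", text)

print(lossy_compress("It's quick, isn't it?"))  # Its qick isnt it
```

Decompression would then be the "educated guess" step: reinsert a u after every q, and so on.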

It's not really ideal for text, but it works pretty well for a lot of artistic data where it just needs to be close enough. Common examples of lossy-compressed files are JPEG pictures and MP3 audio files. It doesn't matter if this specific pixel in our picture gets the exact right color, just so long as it's about right given the surrounding pixels.

Anonymous 0 Comments

Yesterday I had to tell a customer service representative my account number over the phone: 000000932.

I could have said “zero-zero-zero-zero-zero-zero-nine-three-two” but I said “six zeroes nine-three-two.” It was quicker that way.

Sometimes describing a number can be quicker than saying the whole thing. That's what file compression does, with more math; it finds ways to describe what is in a file that take less time and space than reading out every one and zero. In the same way we would say "the sky in this picture is blue," software can describe part of a picture as "this pixel is color 000255000 and the next 732 pixels are too."

Anonymous 0 Comments

Imagine you want to save a message:

AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
BAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA
AAAAAAAAAA

It takes 100 characters to save it.

You could save it as:

50*A,B,49*A

And have it saved in 11 characters. This is lossless compression, and the kind of thing (though obviously a very primitive version) that, say, 7-Zip or WinRAR do.

You could imagine a different algorithm that saves even more space:

100*A

And voilà, you saved your message in 5 characters. Well, not exactly your message – you lost the B – but it's very close to the message; maybe the reader wouldn't notice the B anyway. This is "lossy" compression, where you sacrifice some information the original had in order to save even more space. This is (a very primitive version of) what saving an image as JPG or music as MP3 does. Of course, these formats are popular because they are very good at only losing the information humans don't actually notice, but the idea is the same.
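Both variants can be sketched in Python. The notation differs slightly from the answer above (this version writes an explicit count for the lone B, and its lossy pass keeps 99 A's rather than rounding up to 100), but the principle is the same:

```python
from itertools import groupby

def encode(msg):
    # Lossless: one "count*char" entry per run of identical characters.
    return ",".join(f"{len(list(g))}*{ch}" for ch, g in groupby(msg))

def encode_lossy(msg):
    # Lossy: throw away characters whose run is only 1 long
    # (like the lone B), then encode what's left.
    runs = [(ch, len(list(g))) for ch, g in groupby(msg)]
    kept = "".join(ch * n for ch, n in runs if n > 1)
    return encode(kept)

msg = "A" * 50 + "B" + "A" * 49
print(encode(msg))        # 50*A,1*B,49*A
print(encode_lossy(msg))  # 99*A
```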

Anonymous 0 Comments

OK – let me invent a compression scheme. This isn't a real example, and it probably won't save much space – I'm making this up as I go along. I'm going to make the thread title take up less space, as an example.

>ELI5: What are compressed and uncompressed files, how does it all work and why compressed files take less storage?

Hm. “compressed” is long, and appears 3 times. That’s wasteful – I can use that. I’m going to put a token everywhere that string appears. I’ll call my token T, and make it stand out with a couple of slashes: /T/.

>ELI5: What are /T/ and un/T/ files, how does it all work and why /T/ files take less storage?

Shorter. Only – someone else wouldn’t know what the token stands for. So I’ll stick something on the beginning to sort that out.

>/T/=compressed::ELI5: What are /T/ and un/T/ files, how does it all work and why /T/ files take less storage?

And there we go. The token /T/ stands for the character string “compressed”; everywhere you see “T” with a slash each side, read “compressed” instead. “::” means “I’ve stopped telling you what my tokens stand for”. Save all that instead of the original title – it’s shorter.

Sure, it’s not MUCH shorter – I said it wasn’t likely to be – but it IS shorter, by a few bytes. It has been compressed. And anyone who knows the rules I used can recover the whole string exactly as it was. That’s called “lossless compression”. My end result isn’t very readable as it stands, but we can easily program a computer to unpick what I did and display the original text in full. And if we had a lot more text, I suspect I’d be able to find lots more things that repeated multiple times, replace them with tokens as well, and save quite a bit more space. Real-world compression algorithms, of course, will do it better, in more “computer friendly” ways, use more tricks, and beat me hands-down. But the basic idea is the same.
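The unpicking really is that mechanical. A Python sketch of this exact single-token scheme (toy code; it assumes the text doesn't already contain the token or the "::" separator):

```python
def compress_title(text, word, token="/T/"):
    # Header "token=word::" followed by the body with the word tokenized.
    return f"{token}={word}::" + text.replace(word, token)

def decompress_title(packed):
    # Read the token table off the front, then substitute back.
    header, body = packed.split("::", 1)
    token, word = header.split("=", 1)
    return body.replace(token, word)

title = ("ELI5: What are compressed and uncompressed files, "
         "how does it all work and why compressed files take less storage?")
packed = compress_title(title, "compressed")
print(packed)
assert decompress_title(packed) == title
assert len(packed) < len(title)
```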

If you did something similar with, say, a digital image with a lot of black in it, we could replace long stretches of black with a token meaning “black” and a number saying how many pixels of black, and save a LOT of space (one short token saying “2574 black pixels here”, say). And if we’re not TOO bothered about getting the EXACT picture back, simply something that looks very close to it, we could – purely as an example, say – treat pixels that are ALMOST black as if they were, and save even more. Sure, when the computer unpicks what we’ve done the picture won’t be precisely identical to what we started with – but likely the difference won’t be very obvious to the human eye, and for most purposes the difference won’t matter. And that’s called “lossy compression”. JPEG, for example, is a lossy compression format.

Anonymous 0 Comments

A nice example that I always use to explain compression is using images. Consider a completely WHITE image of size 3000×4000 (about your phone camera resolution).

In the simplest case (which is seldom the case in practice), each pixel of an uncompressed image is stored using 3 numbers to describe its color; for example, in 8-bit RGB (red, green, blue) color space we use the red, green and blue components of a color to describe it. A white pixel has all 3 components equal to 255, so a white pixel is represented by 3 numbers all equal to 255.

Without any compression, a 3000×4000 image is composed of 12M × 3 numbers… this means that we need 36 000 000 numbers to store the uncompressed file. This is also the number of bytes needed to store it (because you are using 8 bits, or 1 byte, for each number). So without compression an image taken by your phone would require about 36 MB of storage 🙂

Now suppose you want to compress a white image. The simplest way to store the image is to literally say that the image is composed of all equal WHITE pixels. In this extreme case, the only thing you need to store is the one color shared by ALL the pixels: white (255). In other words, instead of storing 36 000 000 bytes we need to store only 1 byte. Then the device we are using to display the image (a phone, in this example) needs to ‘recreate’ the original image by replicating that one pixel 12M times. So we compressed 36 MB into 1 B!
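The arithmetic, plus a toy "store the color once with a repeat count" scheme (not literally the 1-byte extreme above, since a real file would also need the count and dimensions, but the same idea):

```python
width, height = 4000, 3000
pixels = width * height                  # 12,000,000 pixels
uncompressed_bytes = pixels * 3          # one byte per RGB channel

# Toy compressed form for an all-white image: one color + a repeat count.
compressed = f"255,255,255x{pixels}"
print(uncompressed_bytes, len(compressed))  # 36000000 20
```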

In practice, there are many compression algorithms: for general data (zip), for sound (mp3), for images and videos (jpeg and mpeg), and for whatever physical phenomenon you can digitize. Compression algorithms can be more or less complex, but the idea behind them is still the same as in my example: use the recurrent information in the data being compressed. In our case the recurrent information is the fact that all pixels are white.