How does lossless compression work?

219 views

Files are just a bunch of bits, how is it possible to make the filesize smaller without removing any of those bits?

In: 3

5 Answers

Anonymous 0 Comments

Suppose I have a file with 1000 ‘0’s in a row. That takes 1000 charachters to store. Or I can just tell you “1000 zeros”, which is just 10 charachters but no information has been lost.

That’s how lossless compression works, only it’s more clever.

Anonymous 0 Comments

If the sequence of bits is completely arbitrary, it isn’t. You can’t make one possible file smaller without, in principle, making another possible one bigger.

But most applications, whether that’s images or audio or text, tend to have specific patterns more often than other specific patterns. In text, for example, the strings “at” or “ed” are common, while the string “xq” is very rare. So if you use a compression approach that makes “at” or “ed” smaller and “xq” bigger, you will, on average, tend to make your file size smaller.

More generally, the smallest size you can compress files to (given no knowledge except the file itself) is given by their [information entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)). The “more random” the strings of bits, the higher the entropy. When each string of bits is equally likely, the information entropy is 1 bit per bit, which means you can’t compress things losslessly. But real applications tend to have much less.

English text encoded as ASCII, for example, takes up 1 byte (8 bits) per character, while the information entropy of English text is usually around 1 *bit* per character (because some strings are much more common than others). So you can losslessly compress English text by a factor of about 8.

Anonymous 0 Comments

I give you the numbers 1111111333399999 and tell you I’ll need you to give those numbers back to me tomorrow.

You could write down 1111111333399999 (16 digits). But you notice a pattern: I gave you seven 1’s, followed by four 3’s, followed by five 9’s. If you write down 714359 (6 digits) you can capture the same information in a much shorter form.

This is only possible because the digits I gave you aren’t completely random. There are patterns (in this case, runs of repeated digits) and your coding scheme exploits them. (This particular coding scheme is so simple and common it has a name, *run-length encoding*.)

Commonly used lossless compression formats (like zip files) use more complicated techniques that can find more complicated patterns, but the principle is the same.

[1] In fact, there’s an entire branch of mathematics that studies coding and compression (information theory). One of the central insights of information theory is that a totally random sequence can’t be (losslessly) compressed. You can only (losslessly) compress a sequence if it’s somewhat non-random.

No matter how a lossless compression program works internally, its outputs will on average be shorter than its inputs only if, on average, the inputs contain some kind of pattern(s) that arise more often than they “should” by random chance.

Of course all kinds of common data contains patterns; this post (like most English text) contains mostly lowercase letters, and there are lots of vowels and spaces. Some letters and many punctuation marks appear very rarely, or perhaps not at all! And it’s not just letters. Multi-letter phonemes, words, and phrases are also repeated far more often than they would be by chance.

Anonymous 0 Comments

Some decent explanations here, but I’ll give you a much easier one.

It’s just language interpretation. When you read LMAO or LOL you already know what those are. You’ve compressed these words yourself with a type of language interpretation called an acronym.

When you parse LMAO and LOL through a specific interpretation you get the original full words back, meaning the compression was lossless. Lossless means you always get the full original content back reinterpreting it.

The only difference between a language acronym and what a general lossless compression method does is that these language interpretations are arbitrary, and general lossless compression like deflate (ZIP) tries to be generalized instead of only applying to specific words.

The way you do that is to find patterns in average things you’re looking to compress. Patterns can be compressed because they can be recognized and interpreted to then be something else.

Anonymous 0 Comments

Lossless compression is a type of data compression where the original data can be reconstructed exactly from the compressed data. This is in contrast to lossy compression, where some information is lost during the compression process. Lossless compression is often used for archiving or storing important data because it ensures that no information is lost during the compression process.

There are many different algorithms used for lossless compression, but they all work by reducing redundancy in the data. Redundancy means that there is extra information that isn’t needed to reconstruct the original data. For example, if you have a text file with lots of repeated words, you can remove the redundant words and still reconstruct the original text perfectly.

Lossless compressions are usually less effective than lossy compressions because they can’t make use of as much redundancy in the data. However, they are still useful when it’s important to preserve all of the original information, such as when compressing medical images or computer programs