Why does audio and video take so much storage?


So for example, video games are super heavy on disk space. And apparently most of that space is just the sounds, textures, models, etc., while the code takes very little space.

Why can something as complex as a physics system weigh less than a bunch of images?

Code takes very little space, media takes more. But an image is just code that tells the computer how to draw something (I think)

So how come some code gets to be so small in size and some code doesn’t?


Uncompressed audio, images, and video take up a surprising amount of space.

Let’s start with audio. The human ear hears frequencies roughly up to 20,000 Hz. To recreate a frequency, you need to sample at least double that rate. Let’s say you have 44,100 samples a second, a common sampling rate. CD quality gives you 16 bits, or 2 bytes, per sample. And there are 2 channels of audio for stereo, one for each ear.

Now for each second of uncompressed stereo audio, this is 2 bytes x 44,100 samples x 2 channels = 176.4 kilobytes per second. A song 3 minutes long is roughly 31 megabytes! Now let’s add up all hours of spoken dialogue, sound effects, and music in a game and it gets large, fast.
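The arithmetic above can be sketched in a few lines of Python:

```python
# Uncompressed CD-quality stereo audio, using the numbers above.
BYTES_PER_SAMPLE = 2      # 16-bit samples
SAMPLE_RATE = 44_100      # samples per second
CHANNELS = 2              # stereo

bytes_per_second = BYTES_PER_SAMPLE * SAMPLE_RATE * CHANNELS
print(bytes_per_second)                  # 176400 bytes, i.e. 176.4 KB/s

song_bytes = bytes_per_second * 3 * 60   # a 3-minute song
print(song_bytes / 1_000_000)            # 31.752, so roughly 32 MB
```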

Many games don’t compress audio to save the CPU from having to decompress it. This can lead to huge game install sizes.

Images get ridiculously large in uncompressed form too. Let’s say we use 4K resolution (3840 x 2160 = 8,294,400 pixels). Each pixel has 8 bits for each of its red, green, and blue values, so 3 bytes total. Each uncompressed 4K image is at least 3 bytes x 8,294,400 pixels = ~25 megabytes.

Now let’s make a video of 30 images per second. Each second of uncompressed video is 25 megabytes x 30 images/second = 750 megabytes/second… This is why video is almost always compressed, to avoid dealing with these massive uncompressed sizes.
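The same back-of-the-envelope math for uncompressed 4K video, in Python:

```python
# Uncompressed 4K video math from the paragraphs above.
WIDTH, HEIGHT = 3840, 2160
BYTES_PER_PIXEL = 3       # 8 bits each for red, green, blue
FPS = 30                  # frames per second

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
print(frame_bytes)                # 24883200 bytes, about 25 MB per frame

per_second = frame_bytes * FPS
print(per_second / 1_000_000)     # 746.496, about 750 MB per second
```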

I’ll start in reverse order:

Text (code) doesn’t take up a lot of space, compressed or uncompressed, as there isn’t a lot of detail that needs to be kept. Compressed text especially, as during compression you can change patterns into single letters, say “wherever I see axy, change it to p.” And during decompression you would know that p = axy. (Real compressors use special characters and cleverer schemes, but that’s the idea.)
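A toy version of that substitution idea, assuming the placeholder character never appears in the original text (real compressors like DEFLATE use far more sophisticated schemes):

```python
# Replace a repeated pattern with a single placeholder, and undo it later.
def compress(text, pattern, placeholder):
    return text.replace(pattern, placeholder)

def decompress(text, pattern, placeholder):
    return text.replace(placeholder, pattern)

original = "axy and axy and axy"
small = compress(original, "axy", "p")
print(small)                                      # p and p and p
assert decompress(small, "axy", "p") == original  # round-trips losslessly
```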

Before code can be executed it needs to be compiled down into instructions that the computer understands. Compilers do an amazing job of optimizing the number of instructions needed.

So why do audio and video take up a lot of space? Well that depends.

For clarification, videos store information about frames, where each frame has information on what the individual pixels must look like to recreate some scene.

The higher the resolution, the more pixels are used for the same scene. (Meaning more data needs to be stored.)

Audio stores information about what signal must be reproduced to create some sound. The higher the quality, the more samples are stored, OR the more levels each stored sample can take. (This is a completely separate topic that would make this too long if I went into details.)

The files used to store them are just a large set of instructions that essentially say “if you want to recreate me, these are the pixel/signal values you need to use.” The higher the quality, the more instructions each file contains. The program you use to open them is the one with the code to interpret those instructions.

Now that we’ve talked about what the instructions say, we can discuss why low- and high-quality files differ in size so much. And it’s as simple as: the amount of detail.

High-quality files store a tremendous amount of data about what was recorded. They tend to use lossless (or near-lossless) compression, meaning the decompressed file should differ from the original as little as possible, ideally not at all.

Whereas low-quality images tend to use lossy compression, as they can sacrifice a good amount of data and still get their point across (i.e., fine details don’t matter).

Uncompressed raw video takes up a ton of space because the amount of detail it initially records is staggering. Say you had a grey table, and to the naked eye the table seems to be entirely uniform in color.

When recording this table, the camera may produce a file showing that each individual pixel on the table is a slightly different shade of grey. While that information may be accurate, from a player-experience standpoint it’s entirely useless, since the difference is completely indiscernible.

So rather than storing those slightly different pixel colors, the file sent to players would just have all the pixels set to the same color. That is, we keep the detail that’s perceivable to users while minimizing the data spent on imperceptible detail.

Shakespeare’s complete works amount to about 900k words, an approximately 8MB text file (if I recall correctly). The first Harry Potter book is around 100k words, for comparison. A letter takes 1 to 4 bytes to encode.

An uncompressed 8192×8192 (“8K”) texture has 67,108,864 pixels. Each of them is encoded with one byte each for red, green, blue, and transparency, for 4 bytes per pixel: a total of 268,435,456 bytes, or 256 MB.
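Checking that texture math in Python:

```python
# Uncompressed 8192x8192 RGBA texture, as described above.
SIDE = 8192
BYTES_PER_PIXEL = 4           # red, green, blue, alpha: one byte each

pixels = SIDE * SIDE
print(pixels)                              # 67108864 pixels
print(pixels * BYTES_PER_PIXEL)            # 268435456 bytes
print(pixels * BYTES_PER_PIXEL // 2**20)   # 256 MB
```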

Think about it as the difference between an art gallery versus a booklet of instructions. The art gallery takes up an entire building, more and more space if you want to display more and more paintings. Meanwhile the instructions will hardly ever be bigger than a book and if they are truly massive they *might* fill up a few shelves in a room.

Images/video are the paintings: they take a set amount of space that grows with how high-definition they are and how many of them are desired. Meanwhile, the code for the game is the instruction booklet, and while it might be complex, words/instructions simply take up less space than pictures (and audio is almost as storage-hungry).

Imagine a color by numbers game.

You have one 20th of a page with the rules, which state: color each section according to the number it contains. 1 is red, 2 is yellow, 3 is green, 4 is blue, 5 is black.

And then you have pages and pages of shapes, which you’re supposed to turn into pictures by coloring them in.
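The analogy can be sketched as code: the rules fit in one line, while the pages of shapes (the grid here is just a made-up example) are where the bulk of the data lives.

```python
# The "rules": a tiny palette mapping numbers to colors.
palette = {1: "red", 2: "yellow", 3: "green", 4: "blue", 5: "black"}

# The "pages of shapes": a grid of numbers to be colored in.
grid = [
    [1, 1, 2],
    [3, 4, 5],
]

# Applying the rules to the data produces the picture.
picture = [[palette[n] for n in row] for row in grid]
print(picture[0])   # ['red', 'red', 'yellow']
```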

That’s how digital games work too. They have the code, which is the set of rules the program follows, and then you have all the actual material (textures, audio, videos, images…) it can work with.

These materials are storage intensive, since they often can’t be stored as simplified code or as an algorithm, but have to be stored pixel by pixel or sound sample by sound sample. (It’s a bit more complicated, since compression can store them pixel group by pixel group, etc., but you get the gist.)