How do large files have their integrity verified?

How does a server know whether your local files are corrupted, version-mismatched, or illegitimate without you uploading them to the server? If it has to check through every file for integrity, wouldn’t that effectively be the same as downloading/uploading the entire folder?

I am assuming here that the server has a copy of the files as well as the client. That allows the client and the server to each calculate a hash of the data. If there is anything wrong with the client's file, it will produce a different hash and the two will not match.

No. File integrity is primarily achieved through the use of hashes. A hash function takes data of any length and very efficiently produces a fixed-length string of numbers known as the hash. The math behind it is chosen such that, for all intents and purposes, the hash is unique for the given input* and any slight change in the input results in a completely different hash.
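To make the two properties concrete (fixed-length output, and a tiny change giving a completely different hash), here is a minimal sketch. It assumes SHA-256 from Python's standard `hashlib`; the thread does not say which algorithm any particular service uses, and `sha256_hex` is just a helper name made up for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of `data` as a hex string (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


short_input = sha256_hex(b"hello")
huge_input = sha256_hex(b"hello" * 1_000_000)  # about 5 MB of data
one_letter_off = sha256_hex(b"hellp")          # last letter changed

# The hash has the same length no matter how large the input is.
print(len(short_input), len(huge_input))

# Changing a single letter gives a hash that no longer matches.
print(short_input == one_letter_off)
```

Both lengths come out as 64 hex characters, and the comparison on the last line is `False`.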

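So the server just needs to know what the hash should be. The program on your machine runs the hashing algorithm against your local files and sends back only the resulting hashes, and the server checks whether they match what it has on file. Only the short hashes travel over the network, not the files themselves.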
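A rough sketch of that comparison, as it might look on the client side. Everything here is an assumption for illustration: the `manifest` dictionary (relative path to expected hash) stands in for whatever the server actually sends, and `file_digest` and `find_bad_files` are hypothetical helper names. SHA-256 is again an assumed choice.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash one local file (reads it fully; fine for small files)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_bad_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths whose local hash is missing or wrong.

    `manifest` maps relative path -> expected hex hash. It is a
    hypothetical stand-in for the list of hashes the server keeps.
    """
    bad = []
    for rel_path, expected in manifest.items():
        local = root / rel_path
        if not local.is_file() or file_digest(local) != expected:
            bad.append(rel_path)
    return bad
```

Only the files returned by `find_bad_files` would need to be downloaded again, which is why a verification pass is much cheaper than re-transferring the whole folder.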

^((* – given the variable-length input and fixed-length output, hashes are necessarily not unique, but the inputs producing identical outputs are unrelated, and for the purposes of this discussion the odds of two functioning programs having the same hash are negligible for a well-designed hashing algorithm))

They run a weird math formula called a hash over the whole file. It comes out as a single really long number. They run that formula on the file before and after downloading. If the two numbers match, the file is, for all practical purposes, identical.
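For a really large file, the formula does not need the whole file in memory at once. A minimal sketch, assuming SHA-256 and a 1 MiB chunk size (both arbitrary choices for this example; `hash_large_file` is a made-up name):

```python
import hashlib


def hash_large_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file piece by piece so memory use stays small."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # Read up to chunk_size bytes at a time until the file is exhausted.
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Feeding the data in chunks gives the same result as hashing it all in one go, so the sender and receiver can use different chunk sizes and still compare numbers.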

Algorithms can compute hash values that are, for practical purposes, unique to each file. Those are way smaller than the actual file and are used for comparing the file on the server with your local file.

Like when you’re describing your car, the color, brand and number plate are enough info for others to identify it without you having to list the number of seats and wheels, engine details, fabrication date, etc. Sorry, that’s the best example I could come up with; I’m sure there are better ones.

While hashing is definitely a thing, there are simpler options. For instance, you can check the size of the file. Your file could be wrong and still be the same size, but if the size is wrong then you know for sure that the file is wrong. The next level of complexity is called a [checksum](https://en.wikipedia.org/wiki/Checksum), and that’s a lot like measuring the length, but you can do it in more dimensions and, depending on how your checksum works, use it to find out where the problem is if there is one.
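As a rough illustration of those two cheaper checks, here is a sketch that compares the size first and only then computes a checksum. It assumes CRC-32 (via Python's standard `zlib.crc32`) as the checksum; other checksums exist, and `quick_check` and its parameters are names invented for this example.

```python
import os
import zlib


def quick_check(path: str, expected_size: int, expected_crc32: int) -> bool:
    """Cheap integrity check: file size first, then a CRC-32 checksum."""
    # A wrong size proves the file is wrong, with no need to read it.
    if os.path.getsize(path) != expected_size:
        return False
    # Same size: fall back to the checksum, computed chunk by chunk.
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(65536):
            crc = zlib.crc32(chunk, crc)
    return crc == expected_crc32
```

A checksum like CRC-32 is good at catching accidental corruption, but unlike a cryptographic hash it is not designed to resist someone deliberately crafting a different file with the same value.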