eli5: Lots of websites will have a file hash you can use to verify file integrity; computationally speaking how is this created? Do you really need to inspect every character or just make sure the first and last few are correct? For well known programs should you verify the hash with a third party?

6 Answers

Anonymous 0 Comments

Imagine I have a sequence of ten numbers (like a phone number or something) that you write down:
> 5559817038

Repeating it back character by character to make sure you copied it down properly would mean reading out all ten digits again… so what else can we do?

We can agree on some mathematical operation that each of us can perform on the data independently. For example, we could decide on an “add all the digits together” operation.

> 5+5+5+9+8+1+7+0+3+8=51

So, as long as we both know to use the same “add the digits together” operation, I can just say “I got 51,” and verification only requires comparing those 2 digits instead of checking all 10 digits independently. If you get 50 or 82 or something, then you know: *“Nope, that isn’t the result I get. I must have copied it down wrong.”*

You don’t know *which part* was wrong, but you at least find out very quickly that your copy is bad.
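Here’s a minimal sketch in Python of that first check (my own illustration, not part of the original answer): each side runs the same “add all the digits” operation independently on its own copy and compares the short result.

```python
# Toy "add all the digits together" checksum -- each side runs this
# independently on its own copy and compares the small result.
def digit_sum(digits: str) -> int:
    return sum(int(d) for d in digits)

original  = "5559817038"   # what I read out
your_copy = "5559817038"   # what you wrote down
bad_copy  = "5559817088"   # hypothetical copy with one digit wrong

print(digit_sum(original))   # 51
print(digit_sum(your_copy))  # 51 -> matches, so the copy is probably fine
print(digit_sum(bad_copy))   # 56 -> doesn't match 51, so something was copied wrong
```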

On the other hand, if you did get 51, it is *possible* that you still copied it down wrong and the operation produced a matching result by pure dumb luck… in which case we could run another, different operation and verify that result too. Instead of “add everything,” let’s try “alternately add and subtract the digits”:

> 5-5+5-9+8-1+7-0+3-8 = 5

Once again, I can send a short result (just 1 digit) for you to verify. If it doesn’t match this time, we know the first check only passed by dumb luck. If both results match, then we can be pretty confident, because the odds of dumb luck striking twice are low.
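Continuing the same sketch (again my own example, not the poster’s): here’s a copy error that the digit-sum check misses but the second, alternating operation catches.

```python
# Second, different operation: alternately add and subtract each digit.
def alternating_sum(digits: str) -> int:
    return sum(int(d) * (1 if i % 2 == 0 else -1) for i, d in enumerate(digits))

original = "5559817038"
swapped  = "5559817083"   # hypothetical copy with the last two digits transposed

# The "add everything" check is fooled: same digits, same total.
print(sum(int(d) for d in original), sum(int(d) for d in swapped))   # 51 51

# The alternating check is not fooled.
print(alternating_sum(original), alternating_sum(swapped))           # 5 15
```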

But we don’t have to stop there. We can do a whole bunch of different operations. We could do operations that take the results of previous operations. We could do operations of operations of operations.

And if we do enough of this and it passes every time, we can be super-duper confident, because the odds of it all being super-duper-extra-crazy dumb luck are basically zero.

(Technically, what I’ve described above are “checksums,” which are less sophisticated than proper “hashes,” but I think they make a good ELI5 example.)
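For contrast with these toy checksums, real file hashes come from functions like SHA-256. Here’s a quick illustration using Python’s standard hashlib module (my own addition, not from the original answer):

```python
import hashlib

# Hash two inputs that differ in only the last character.
print(hashlib.sha256(b"5559817038").hexdigest())
print(hashlib.sha256(b"5559817039").hexdigest())

# Both digests are 64 hex characters regardless of input size, and a
# one-character change in the input scrambles the digest completely --
# the hash function reads every byte of the input to produce it.
```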

---

But why? Isn’t that a lot of work when we could have just repeated all ten digits back? Yeah… for ten digits. The savings kick in when we’re talking about big files.

If I download a big 10GB file, then trying to “repeat it back” essentially means downloading a 10GB verification file.

On the other hand, running a well-known checksum/hashing procedure will require my computer to do a bunch of number crunching… but at the end of that process I’ll have a much shorter verification code to download and compare – one that’s a tiny fraction of the original file size.

This is better for everybody, because sending and receiving lots of data is usually far more expensive than everyone doing some number-crunching locally on their personal machine.
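As a hedged sketch of what that looks like in practice: hash the downloaded file in chunks (so even a 10GB file never has to sit in memory all at once) and compare the result with the short digest published on the website. The filename and expected digest below are placeholders, not real values.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Read the file in 1 MiB chunks and feed every byte into SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder values -- substitute the real filename and the digest the site publishes.
expected = "digest-published-on-the-download-page"
actual = sha256_of_file("big-download.iso")
print("OK" if actual == expected else "MISMATCH: the download is corrupted or tampered with")
```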
