AnswerCult

Question

115 views · December 31, 2023

Question 91.72K September 22, 2023 0 Comments

eli5: Lots of websites will have a file hash you can use to verify file integrity; computationally speaking how is this created? Do you really need to inspect every character or just make sure the first and last few are correct? For well known programs should you verify the hash with a third party?


6 Answers

Answer 1 · 2023-09-22T15:26:02+00:00

The hash is computed by consuming every byte of the file and feeding it into a hashing algorithm. Hashing algorithms are a special kind of function called a one-way function. They give you a certain output for a given input. If you feed the same data in, you’ll get the same result. But you can’t deduce the input from the output, and good hashing algorithms also generate vastly different outputs for even slight changes to the input.

With that information, it is considered a shortcut to only compare a few characters at the beginning and end. For better security, though, the full hash should be compared. Every major OS has tools that can generate the hash of a given file using various algorithms and then compare it to a given hash.
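As a rough sketch of what such a tool does, here is one way to do it in Python with the standard `hashlib` module. SHA-256 is an assumption on my part (use whichever algorithm the website names), and the function names are made up for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Every byte of the file is fed into the hash, one chunk at a time,
        # so even a multi-gigabyte file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare the full computed digest against the full published one."""
    # Normalise case and surrounding whitespace, then compare every character.
    return sha256_of_file(path) == published_hex.strip().lower()
```

You would call `matches_published_hash("some-download.iso", "<hash copied from the website>")`, where both arguments are placeholders for your own file and the published value.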

The problem with hashes is that they don’t include identity. They can only convey integrity, in the sense that the transmission didn’t cause any corruption and that the file is what the source of the hash says it is. But you can’t be sure the website or the transmission of the hash itself isn’t compromised.

That’s why some sources also provide PGP signatures. They’re based on a public key scheme, which also lets you verify that the signature was created by the entity holding the private key and that the file wasn’t altered. Yet, if the private key were to be compromised, an attacker could still forge a valid signature. It all depends on what you trust and how high a level of security is desired.

Answer 2 · 2023-09-22T15:27:32+00:00

Not ELI5, but ELI7.

A file is a list of small numbers. Each number is between 0 and 255. (Aside: we call this a byte.) All images, videos, zip files and so on are like this. A simple hash would be to add all the numbers in the file together, and then see if that sum is odd or even. If you had a file and you weren’t sure if it had been changed, I could tell you the file was meant to be odd, and if your file was even you would know for sure it had been changed.

File hashes are like this, but with a lot of fancy maths, and instead of a single odd/even-like choice, they have 160 or 256 such choices (bits). As the number of choices about the file goes up, it is less and less likely to be wrong. For example, if I told you that the last digit was odd, and the second to last digit was even, then there are three ways to notice a wrong file, and only one way to miss it being changed.
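Here is the odd/even idea as a small Python sketch. `toy_hash` is a made-up name, and keeping the last few binary digits of the byte sum is only a toy stand-in for the "fancy maths" of a real hash.

```python
def toy_hash(data: bytes, bits: int = 1) -> int:
    """Add up every byte and keep only the last `bits` binary digits.

    With bits=1 this is exactly the odd/even check (1 = odd, 0 = even).
    """
    return sum(data) % (2 ** bits)


original = b"hello"   # bytes add up to 532
changed = b"hellq"    # last byte is 2 larger, so the sum is 534
shuffled = b"ehllo"   # same bytes in a different order, same sum

print(toy_hash(original, 1), toy_hash(changed, 1))    # 0 0 -> change missed
print(toy_hash(original, 2), toy_hash(changed, 2))    # 0 2 -> change caught
print(toy_hash(original, 8) == toy_hash(shuffled, 8)) # True -> still fooled
```

More bits catch more changes, but a plain sum can never notice bytes being reordered. That is the gap the fancy maths in a real hash is there to close.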

Does this help?

Answer 3 · 2023-09-22T15:41:16+00:00

**How is a hash created?**

A hash is created using “one-way functions”: mathematical functions that are believed to be easy to perform in one direction, but extremely difficult to perform in the other direction. (People sometimes call these “trapdoor functions,” but strictly speaking a trapdoor function has a secret shortcut for going backwards, and a hash has no such shortcut.)

To get a feel for the idea, if I give you two random prime numbers, 7,949 and 8,161, you can easily multiply them together and find that their product is 64,871,789. However, if I just give you the number 64,871,789 and tell you to find the two prime numbers that multiply together to make it, you will have an extremely difficult time figuring it out by hand.

A hash function is basically just doing a lot of complex binary math on the 1s and 0s of the file in a way that is easy to do forwards and hard to undo. This means that for every file, you can produce a hash, but it’s extremely difficult to take a hash and produce a useful file associated with it.
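To make the "easy one way, hard the other way" point concrete, here is a small Python sketch using the two primes from the example above. Trial division is just the most naive way to factor a number, and `factor_by_trial_division` is a name I made up. Real hash functions don't actually multiply primes. This only illustrates the asymmetry.

```python
def factor_by_trial_division(n):
    """Find the smallest factor of n by trying 2, 3, 4, ... in turn.

    Returns (factor, cofactor, number_of_divisions_tried).
    """
    attempts = 0
    d = 2
    while d * d <= n:
        attempts += 1
        if n % d == 0:
            return d, n // d, attempts
        d += 1
    return n, 1, attempts  # no divisor found, so n itself is prime


# Forward direction: a single multiplication.
product = 7949 * 8161
print(product)  # 64871789

# Backward direction: thousands of trial divisions to undo it.
print(factor_by_trial_division(product))  # (7949, 8161, 7948)
```

A computer still gets through an 8-digit number like this instantly, but the work for the backward direction grows enormously as the numbers get longer, while the forward multiplication stays cheap.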

**Do you really need to inspect every character?**

You *must* use the entire file to do this because, by design, hash functions are extremely sensitive to small changes, and the entire file is used in the computation. That’s how you can take a file that’s three gigabytes large and end up with a hash that’s just 64 characters long. If you omit any part of the file or change any part of the file, then the hash changes to something completely different.

You *could* do what you said and only check part of the file, but this is dangerous. If you only check part of the file, then an attacker could change the part that you’re not checking and you would have no idea.

For example, if I give you a number like 111222333444, and your algorithm only checks the first and last two digits — “11” and “44” — then I could change the number to something like “110000000044” and your algorithm wouldn’t be able to tell the difference.
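The same example as a few lines of Python, with a real hash (SHA-256 from the standard `hashlib` module, my choice for illustration) next to it for comparison. `weak_check` is a made-up name for the "first and last two digits" algorithm.

```python
import hashlib


def weak_check(number: str) -> str:
    """Toy check that only looks at the first two and last two digits."""
    return number[:2] + number[-2:]


def full_hash(number: str) -> str:
    """A real hash computed over every character of the input."""
    return hashlib.sha256(number.encode("ascii")).hexdigest()


genuine = "111222333444"
tampered = "110000000044"

print(weak_check(genuine) == weak_check(tampered))  # True: tampering missed
print(full_hash(genuine) == full_hash(tampered))    # False: tampering caught
```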

**For well known programs should you verify the hash with a third party?**

The trouble with using the hash displayed on the website where you downloaded the file from is that, if a hacker can replace the file with something malicious, then they can also replace the hash to match their file. A hash is only useful if you can trust the source that gave it to you.

Luckily, most well-known apps are “signed.” I won’t go in-depth on how code-signing works because that’s a whole other discussion. Just know that when an app is signed, your computer can check that the app really came from the publisher named in the signature and hasn’t been modified since it was signed.

When you run an app on your computer, the signature is checked before the app runs. So, if you have a well-known app, then it’ll run without problems. It might even pop up a box asking if you want to run it. When that box says something like “Published by [whatever corporation],” then you know the signature is good.

If your computer doesn’t recognize the app — either because it’s not signed or because the signature is bad — it will pop up a warning of some kind that either refuses to run the app or tells you that it’s from an unknown publisher.

Thanks to code signing, when you’re running a well-known app, you can trust your computer to verify it for you, and there’s no need to verify the hash manually.

**Bonus question: what if it’s not a well-known app?**

If it’s some random file, then remember, the hash is only as good as the source that gave it to you. Just because the hash matches doesn’t mean that the file is safe. It only means that the person who gave you the hash has the same file that you have. You need to make sure that you trust the person who created the file *and* the person who gave you the hash.

Answer 4 · 2023-09-22T15:55:03+00:00

If your goal is to verify file integrity, then you **must** inspect every character. That’s the entire point. You have to check the whole file to make sure it hasn’t been altered; otherwise, someone could have altered (accidentally or otherwise) one of the bytes you didn’t check.

Hashes are specifically designed to do this.

A “good” hash function has something called the “avalanche property”: if you change a *single bit* of the input, then the output is *completely different*. This, along with other math properties, means that it’s very difficult to create two files with the same hash, and damn-near-impossible to create two *similar* files with the same hash. A small alteration, like a transmission error or a subtle bug, will cause a totally different hash.
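You can watch the avalanche property happen with a short Python sketch. It uses SHA-256 from the standard `hashlib` module as the example hash. The sample message and the helper name `differing_bits` are made up for illustration.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 output bits differ between two SHA-256 digests."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")


original = b"The quick brown fox"
# Flip the lowest bit of the first byte: a one-bit change to the input.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(flipped).hexdigest())
print(differing_bits(original, flipped), "of 256 output bits changed")
```

For a good hash, you should expect roughly half of the output bits to flip, so the two hex strings look unrelated even though the inputs differ by one bit.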

If you receive the hash through the same channel as the file, that protects against transmission errors but not against malicious attackers. To protect against attackers, you have to receive or verify the hash through a separate channel. Historically, back when storage and bandwidth were more expensive, people didn’t host their own file uploads. They would point a link to a download location hosted on a different server (often a university!). The hash let you know that other server hadn’t tampered with the file. Since the link and file lived on different servers, there was actually a point to this.

Answer 5 · 2023-09-22T16:25:24+00:00

Imagine I have a sequence of ten numbers (like a phone number or something) that you write down:
> 5559817038

Repeating it back character by character to make sure you copied it down properly would mean repeating all ten digits… so what else can we do?

We can create some sort of mathematical operation that can be done to the data by each of us independently. For example, we could decide to do a “add all the digits together” operation.

> 5+5+5+9+8+1+7+0+3+8=51

So, if we both know to use the same “add the digits together” operation, I can say “I got 51,” and the verification only requires that your two digits match mine, instead of checking all ten digits one by one. If you get 50 or 82 or something, then you know: *“Nope, that isn’t the result I get. I must have copied it down wrong.”*

You don’t know *which part* was wrong, but you at least know that “my copy is compromised” very quickly.

On the other hand, if you did get 51, it is *possible* that you still copied it down wrong but the operation got the correct result by pure dumb luck… in which case we could do another, different operation and verify that result too. Instead of “add everything,” let’s try “alternately add and subtract the digits.”

> 5-5+5-9+8-1+7-0+3-8 = 5

Once again, I can now send a short result (one digit) for you to verify. If it doesn’t match this time, we know that the first match passed by dumb luck. If both results pass the test, then we’re pretty confident that the odds of it being dumb luck both times are pretty low.
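Here are both operations as a small Python sketch. The function names are made up, and the miscopied number (last two digits swapped) is my own example of a mistake that slips past the first check but not the second.

```python
def digit_sum(number: str) -> int:
    """The "add all the digits together" operation."""
    return sum(int(ch) for ch in number)


def alternating_sum(number: str) -> int:
    """Alternately add and subtract the digits, starting with add."""
    total = 0
    for position, ch in enumerate(number):
        total += int(ch) if position % 2 == 0 else -int(ch)
    return total


mine = "5559817038"
yours = "5559817083"  # last two digits swapped while copying

print(digit_sum(mine), digit_sum(yours))              # 51 51 -> passes by luck
print(alternating_sum(mine), alternating_sum(yours))  # 5 15  -> mistake caught
```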

But we don’t have to stop there. We can do a whole bunch of different operations. We could do operations that take the results of previous operations. We could do operations of operations of operations.

And if we do enough of this, and it passes every time, we can be super-duper confident that the odds of it being super-duper-extra-crazy dumb luck are basically zero.

(Technically, what I have described above are called “checksums” and they are less sophisticated than proper “hashes”, but I think it provides a good ELI5 example.)

————————————–

But why? Isn’t that a lot of work when we could have just repeated all ten digits back? Yeah… for ten digits. The savings kick in when we’re talking about big files.

If I download a big 10GB file, then trying to “repeat it back” essentially means downloading a 10GB verification file.

On the other hand, running a well-known checksum/hashing procedure will require my computer to do a bunch of number crunching… but at the end of that process I’ll have a much shorter verification code to download and compare, one that is a very small percentage of the original file size.

This is better for everybody, because sending and receiving lots of data is usually far more expensive than everyone doing some number-crunching locally on their personal machine.

Answer 6 · 2023-09-22T18:10:52+00:00

A hash is a one-way function: the same input will always create the same output, but there is no practical way to get the input back even if you know the output and the function. It absolutely has to be checked in its entirety to verify a password.

Here’s a simple (and not very secure) example. Your ATM PIN is 1234. The hash function gives you the last two digits. So, you would put in your PIN and the hash would say “34”, which the ATM checks against the number stored. If that is also 34, you can get your money out. Note that knowing that your hash is 34 still doesn’t tell you what the original PIN is – it could be 3434 or 8134 or many other options that are wrong.
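The ATM example in a few lines of Python, as a toy only. `last_two_digits` is a made-up name for the example’s hash function, and no real system stores PINs this way.

```python
def last_two_digits(pin: str) -> str:
    """The toy hash from the example: keep only the last two digits."""
    return pin[-2:]


stored = last_two_digits("1234")
print(stored)  # 34

# Every four-digit PIN that produces the same stored value.
same_hash = [f"{n:04d}" for n in range(10000)
             if last_two_digits(f"{n:04d}") == stored]
print(len(same_hash))  # 100
```

Knowing “34” doesn’t tell you which of those 100 PINs was the original. It is also why this toy is not very secure, since any of them would be accepted by the ATM.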
