Eli5:what is Hash value?


Eli5:what is Hash value?

In: 3


In computer science a hash value would be the result of taking any source value and passing it through a algorithm or calculation that is very hard to reverse.

A hash value is like a fingerprint for a file or set of data. An algorithm generates the hash which is unique for the record. A good example is to encrypt passwords. On a one-way hash, if someone gets the hash they can’t get the password. But the system can identify if the password that’s input is correct.

A hash is a cryptographic function (algorithm) that you run a file through. The resulting hash value is a constant size (usually much smaller than the original file, but it doesn’t have to be). It is also semi-unique…there’s a very low probability that the hash of another file would result in the same value. It is also impossible to directly reverse, because most of the information from the original is lost.

In essence, it’s a way to create a signature of a file.

A hash has several uses. One is that you can store the hash of a password, and allow the system to authenticate someone providing that password even though the system doesn’t “know” what the original password is. The user inputs the password and it is run through the same hash function, and the result is compared to the stored hash value. If it matches, then the password is correct.

Imagine you want to store files for a whole city worth of people. You have, say, 26 boxes for storing files.

A hash function is a way to take an input (in this case, a person) and produce an output (in this case, a box), in such a way that the input can be long and complicated but the output always stays within some range (in this case, box #1 to box #26).

Since that’s a pretty general definition, hash functions are pretty general objects. For example, a valid hash function here might be “if the person’s name begins with A, their file goes in box 1. If it begins with B, put it in box 2. And so on”. Another valid – if not particularly useful – hash function would be something like “put everything in box 1”. However, since the hash function is a function **only of the value being input into it**, “put the file into the first empty box” is not a hash function. The outputs of the hash function – in this case, which box you’re putting a file into – are the *hash values* of the inputs.


In practical use cases, hash functions usually want to have some or all of the following properties, depending on exactly what you’re doing:

* Uniformity: a *uniform* hash function evenly divides its possible inputs over its possible outputs. Neither of the hash functions described above is very uniform. The first fails because there are many more names beginning with A than there are beginning with X. The second fails because, well, it’s as concentrated on a single output as it possibly can be. In fact, it’s a good exercise to try come up with a good uniform hash function here! How could you take someone’s name and produce a number that could be anything from 1 to 26 with roughly even probability? (I’ll give an example at the end of this post.)

* Ease of computation: if you’re hashing many things, you want it to be easy to compute each hash value. It’s not very useful to know where a file is if it takes you an hour to compute it.

* Lack of collisions: to the greatest extent possible, you want different inputs to map to different outputs. That way (to continue our analogy), you don’t have to dig through a box – you can just take the one file that is in there! Our hash functions aren’t very good about this: Alex and Andrew both end up in box #1 with our first hash function, for example. When two inputs map to the same output, we say there is a **collision** in our hash function.

* Hard to reverse: sometimes you don’t want it to be possible to go backwards. If, for example, the files contain private data, you may not want it to be easy to tell what person the files in each box came from, while still being able to *look up* the files *given* a name. Our hash functions aren’t very good at this: anything in box #1 with the first hash function came from someone with the name A, which narrows things down a bunch.


To give an example of a better hash function for this purpose, you could add up the letters in a name, wrapping back around to 0 if you hit 26 or higher. For example, the name ABE would go into box 1 (for A) + 2 (for B) + 5 (for E) = 8. The name MABEL would go into box 13 + 1 + 2 + 5 + 12 = 33, which wraps back around to 7. This hash function does a better job at uniformity and will therefore avoid collisions better in a lot of cases than just using the first letter, and it’s harder to reverse (for you, anyway, not for a computer). However, that comes at the cost of slightly harder computation.