Eli5:what is Hash value?

258 views

Eli5:what is Hash value?

In: 3

5 Answers

Anonymous 0 Comments

Imagine you want to store files for a whole city worth of people. You have, say, 26 boxes for storing files.

A hash function is a way to take an input (in this case, a person) and produce an output (in this case, a box), in such a way that the input can be long and complicated but the output always stays within some range (in this case, box #1 to box #26).

Since that’s a pretty general definition, hash functions are pretty general objects. For example, a valid hash function here might be “if the person’s name begins with A, their file goes in box 1. If it begins with B, put it in box 2. And so on”. Another valid – if not particularly useful – hash function would be something like “put everything in box 1”. However, since the hash function is a function **only of the value being input into it**, “put the file into the first empty box” is not a hash function. The outputs of the hash function – in this case, which box you’re putting a file into – are the *hash values* of the inputs.

—-

In practical use cases, hash functions usually want to have some or all of the following properties, depending on exactly what you’re doing:

* Uniformity: a *uniform* hash function evenly divides its possible inputs over its possible outputs. Neither of the hash functions described above is very uniform. The first fails because there are many more names beginning with A than there are beginning with X. The second fails because, well, it’s as concentrated on a single output as it possibly can be. In fact, it’s a good exercise to try come up with a good uniform hash function here! How could you take someone’s name and produce a number that could be anything from 1 to 26 with roughly even probability? (I’ll give an example at the end of this post.)

* Ease of computation: if you’re hashing many things, you want it to be easy to compute each hash value. It’s not very useful to know where a file is if it takes you an hour to compute it.

* Lack of collisions: to the greatest extent possible, you want different inputs to map to different outputs. That way (to continue our analogy), you don’t have to dig through a box – you can just take the one file that is in there! Our hash functions aren’t very good about this: Alex and Andrew both end up in box #1 with our first hash function, for example. When two inputs map to the same output, we say there is a **collision** in our hash function.

* Hard to reverse: sometimes you don’t want it to be possible to go backwards. If, for example, the files contain private data, you may not want it to be easy to tell what person the files in each box came from, while still being able to *look up* the files *given* a name. Our hash functions aren’t very good at this: anything in box #1 with the first hash function came from someone with the name A, which narrows things down a bunch.

—–

To give an example of a better hash function for this purpose, you could add up the letters in a name, wrapping back around to 0 if you hit 26 or higher. For example, the name ABE would go into box 1 (for A) + 2 (for B) + 5 (for E) = 8. The name MABEL would go into box 13 + 1 + 2 + 5 + 12 = 33, which wraps back around to 7. This hash function does a better job at uniformity and will therefore avoid collisions better in a lot of cases than just using the first letter, and it’s harder to reverse (for you, anyway, not for a computer). However, that comes at the cost of slightly harder computation.

You are viewing 1 out of 5 answers, click here to view all answers.