Let’s say you want to store files for 1000 medical patients in filing cabinets for quick lookup. If we have one cabinet for A, one cabinet for B, and so on, some cabinets are going to be stuffed, like maybe 100 files in the Js for Johnson and Jackson, and 100 in the Ss for Smith and Stevens, but others, like Q and Z, will be almost empty. The full cabinets will take more time to look through and slow down the time to retrieve a file, and what’s worse, most lookups require searching one of the fuller cabinets, because that’s where most of the files you look up are. You aren’t benefitting from having some emptier cabinets if you almost never have to look there.
What if we choose a different scheme, like filing by the last name’s *second* letter? Then Johnson, Jackson, Smith, and Stevens will all go into four different cabinets. Even if the Q cabinet might still be empty, other cabinets might be more evenly distributed, so you don’t have the issue that you are opening one or two cabinets with the most files the most often. This idea of “choose the second letter” could be called a hash function.
Hash functions are functions designed to spread out the filing of data as evenly as possible. They take in data and return a number from 0 to the maximum size of the table, and try to guarantee as best as possible that no two pieces of data are more likely to end up in the same cabinet than if you were randomly assigning them. But since the function is not random, you can still use it to figure out where to look when you need to lookup a file. Using a properly designed hash function would allow you to store files perfectly evenly distributed between 26 cabinets, with 1/26 of the files in each one. Then every lookup is fast, because no matter which cabinet you are looking in, it only has about 30-40 of the files, never 100.
A “hash” is an algorithm, piece of code, etc that converts something (text, object) into a number. For example my name, “DeHackEd” might hash to a number like 25872. Note that this number does have a range limit and there’s no guarantee that every number is unique. Some people will have the same number.
So from that number we can set up a system where we can say that each object gets stored into a slot based on its number. For example if we store users by name and there are 10,000 slots available then I would be designated slot 5872 based on my hash number above (and the number range available since my hash number is higher than the number of slots available)
When searching for information for the user “DeHackEd”, the fact that there are 10,000 possible slots and who know how many users are being saved doesn’t matter. If I’m in there at all then I’m definitely in slot 5872. So a search is extremely fast. If the number of stored items is kept in the 10,000 range as well then realistically the size of the list doesn’t matter to how long a search takes. At worst all users who go into the same slot need to be checked but if the number 10,000 was selected wisely then that should be a very short list.
The biggest downside is that inexact matches are hard. If you want all users whose names begin with the letter “D” then the hash is useless and you must go through the entire table entry by entry looking for matches. Also the size (in our example, 10000) should be chosen wisely. If you are storing 1 million entries then on average each cell would hold 100 user accounts which is going to start eating into the performance advantages of a hashtable. Furthermore you need a strategy of how to deal with users sharing the same slot – using something else like a sorted list or a small RB tree might be a good idea.
I will assume you know hash functions and general data storage
It’s a storage structure of key/value pairs where the key is always the hash of the value you’re storing. This allows storing a lot of very different datapoints in the same table and also efficiently finding your data (just calculate the hash and look for that cell)
Since hashes are not unique you’ll have to decide on a strategy when you have a collision. For example you can use a linked list in that case wich will add a bit of hassle finding the data you want inside that list.
Latest Answers