Let’s say you want to store files for 1000 medical patients in filing cabinets for quick lookup. If we have one cabinet for A, one cabinet for B, and so on, some cabinets are going to be stuffed, like maybe 100 files in the Js for Johnson and Jackson, and 100 in the Ss for Smith and Stevens, but others, like Q and Z, will be almost empty. The full cabinets will take more time to look through and slow down the time to retrieve a file, and what’s worse, most lookups require searching one of the fuller cabinets, because that’s where most of the files you look up are. You aren’t benefitting from having some emptier cabinets if you almost never have to look there.
What if we choose a different scheme, like filing by the last name’s *second* letter? Then Johnson, Jackson, Smith, and Stevens will all go into four different cabinets. Even if the Q cabinet might still be empty, other cabinets might be more evenly distributed, so you don’t have the issue that you are opening one or two cabinets with the most files the most often. This idea of “choose the second letter” could be called a hash function.
Hash functions are functions designed to spread out the filing of data as evenly as possible. They take in data and return a number from 0 to the maximum size of the table, and try to guarantee as best as possible that no two pieces of data are more likely to end up in the same cabinet than if you were randomly assigning them. But since the function is not random, you can still use it to figure out where to look when you need to lookup a file. Using a properly designed hash function would allow you to store files perfectly evenly distributed between 26 cabinets, with 1/26 of the files in each one. Then every lookup is fast, because no matter which cabinet you are looking in, it only has about 30-40 of the files, never 100.
Latest Answers