Eli5: Database Normalization


Every time I think I got it, I’m getting lost along the way. Please help!

In: 1

If you prefer answers in video form, [this](https://www.youtube.com/watch?v=GFQaEYEc8_8) one does a great job.

It is possible for a database to store data that is incorrect. Some incorrect data cannot be protected against just by designing the database better, but some errors can be prevented through good database design.

For example, if a database is storing usernames and passwords (hashed, hopefully) then it shouldn’t save two passwords for the same user. Users have only one password, so if the database is storing two passwords for them then that is *always* an error. It would be best to catch that error as close to where it happens as possible so that it can be fixed, rather than just letting the problematic code keep running until every user has a bunch of passwords stored.

Normalization is a set of rules that you can follow that help to prevent that sort of logical inconsistency within a database. It is broken down into levels, named as 1st Normal Form (1NF) up to 5th Normal Form (5NF). Each normal form imposes some extra rules on top of the one that came before it. 1NF is the weakest with 5NF the strongest.

The above linked video goes through each of the normal forms with clear and simple examples to call out scenarios where breaking the normal forms could lead to inconsistencies in data.

There’s a funny little maxim some people use to remember the first three normal forms. It’s based on the common courtroom phrase,

> I swear to tell the truth, the whole truth, and nothing but the truth, so help me God

The maxim goes,

> Every non-key must provide a fact about the key, the whole key, and nothing but the key, so help me Codd

([Edgar Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) is the one who came up with these normal forms, by the way.)

Let’s break it down:

“Key” in this context means “primary key”. It’s the column (or group of columns working together) that you can use to uniquely refer to every record in your table.

The first normal form, 1NF, is by far the easiest to understand (and upsettingly, the least clear in the maxim). All it says is that every column should have one and *only* one value in it. So basically, no lists or complex data structures allowed inside single table cells. Keep it simple. 1 cell = 1 piece of data.This is the “provide a fact about the key” part of the maxim. That is, every value should be its own non-key, and it should depend on the (primary) key.

Second normal form, 2NF, only affects tables with composite primary keys. Tables with single-column primary keys that are in 1NF automatically get 2NF for free. For other tables, 2NF means that every column in your table should require every part of the primary key. “The whole key”, as the maxim says.

Say… I had a database table storing every room of a multi-building apartment complex, and the tenants who live in them. Every room can be uniquely described with a building number and room number, and each room record will have the name of the tenant currently living there. The building number and room number combined could be my primary key.

But say every building had a separate contact office to call, and I wanted to be able to store the designated contact office for each tenant. In that case, the contact office depends only on the building number and has nothing at all to do with the room numbers. I *could* just store it in my table with the composite primary key anyway, and it *would* work, since the building number is still part of the primary key, but it wouldn’t be 2NF. To make things 2NF, I would have to store office contact details in a separate table, one where building number alone is the primary key.

Third normal form, or 3NF, is concerned with making sure that all the data in the table depends *only* and *directly* on the primary key. “Nothing but the key”, as the maxim goes.

Say I wanted to take my apartment complex database from the previous example and store the ages of all my tenants. I *could* store that in the rooms table, right along with the tenant names. And again, this *would* work. But it wouldn’t be 3NF. Because, does the building and room number itself *really* tell you about the age of the tenant living there? I mean, it does indirectly, since knowing the room means knowing who lives there, and if you know who lives there, you know their age. But that’s the thing, tenant age doesn’t directly depend on the room number at all. It only depends on who the tenant is. To make this 3NF, we would have to make a new table where tenant name is the primary key, and store tenant ages there.

There are other even higher forms of database normalization beyond 3NF, but those are mostly academic. Data scientists flexing their theories to solve extremely nitpicky problems that probably will not benefit a real world database unless you *really* knew what you were doing and you were hyper-optimizing things. 3NF is usually as far as anyone should ever need to go to have a well-behaved database.

Imagine you have a database which is just a text file, and you just put everything related in a single line:

ID Data
1 John Doe, aged 38, lives in Seattle, earns 58k a year, works at Meta
2 Jane Doe, aged 34, lives in Melbourne, earns 55k a year, works at IBM

This is not normalized – there are lots of stuff crammed per entry. What if IBM changed their company name? You’d have to iterate every single entry and correct the name as you go.

When database gets normalized, we pull out data to different tables. Let’s make a new table for each company:

ID Name
2 Meta
3 Alphabet

Now, we can remove the company information that’s written in the “Data” column and add a new column that references the company table:

ID Data Company
1 John Doe, aged 38, lives in Seattle, earns 58k a year 2
2 Jane Doe, aged 34, lives in Melbourne, earns 55k a year 1

Little by little, we can move stuff to their own tables. This is what database normalization is (lots of things omitted for simplicity’s sake).

There are various levels of normalization, but those aren’t really ELI5 stuff.

Normalization happens in relational databases, i.e. databases that store data in tables. The idea behind normalization is that the way you store the tables can be different than what you show the user. So it’s a way of rearranging the data you have into several tables that are efficient to search on and useful for the business.