Eli5: Database Normalization

220 views

Every time I think I got it, I’m getting lost along the way. Please help!

In: 1

4 Answers

Anonymous 0 Comments

There’s a funny little maxim some people use to remember the first three normal forms. It’s based on the common courtroom phrase,

> I swear to tell the truth, the whole truth, and nothing but the truth, so help me God

The maxim goes,

> Every non-key must provide a fact about the key, the whole key, and nothing but the key, so help me Codd

([Edgar Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) is the one who came up with these normal forms, by the way.)

Let’s break it down:

“Key” in this context means “primary key”. It’s the column (or group of columns working together) that you can use to uniquely refer to every record in your table.

The first normal form, 1NF, is by far the easiest to understand (and upsettingly, the least clear in the maxim). All it says is that every column should have one and *only* one value in it. So basically, no lists or complex data structures allowed inside single table cells. Keep it simple. 1 cell = 1 piece of data.This is the “provide a fact about the key” part of the maxim. That is, every value should be its own non-key, and it should depend on the (primary) key.

Second normal form, 2NF, only affects tables with composite primary keys. Tables with single-column primary keys that are in 1NF automatically get 2NF for free. For other tables, 2NF means that every column in your table should require every part of the primary key. “The whole key”, as the maxim says.

Say… I had a database table storing every room of a multi-building apartment complex, and the tenants who live in them. Every room can be uniquely described with a building number and room number, and each room record will have the name of the tenant currently living there. The building number and room number combined could be my primary key.

But say every building had a separate contact office to call, and I wanted to be able to store the designated contact office for each tenant. In that case, the contact office depends only on the building number and has nothing at all to do with the room numbers. I *could* just store it in my table with the composite primary key anyway, and it *would* work, since the building number is still part of the primary key, but it wouldn’t be 2NF. To make things 2NF, I would have to store office contact details in a separate table, one where building number alone is the primary key.

Third normal form, or 3NF, is concerned with making sure that all the data in the table depends *only* and *directly* on the primary key. “Nothing but the key”, as the maxim goes.

Say I wanted to take my apartment complex database from the previous example and store the ages of all my tenants. I *could* store that in the rooms table, right along with the tenant names. And again, this *would* work. But it wouldn’t be 3NF. Because, does the building and room number itself *really* tell you about the age of the tenant living there? I mean, it does indirectly, since knowing the room means knowing who lives there, and if you know who lives there, you know their age. But that’s the thing, tenant age doesn’t directly depend on the room number at all. It only depends on who the tenant is. To make this 3NF, we would have to make a new table where tenant name is the primary key, and store tenant ages there.

There are other even higher forms of database normalization beyond 3NF, but those are mostly academic. Data scientists flexing their theories to solve extremely nitpicky problems that probably will not benefit a real world database unless you *really* knew what you were doing and you were hyper-optimizing things. 3NF is usually as far as anyone should ever need to go to have a well-behaved database.

You are viewing 1 out of 4 answers, click here to view all answers.