Can someone please explain to me the difference between a primary key, foreign key, clustered index, natural key, and surrogate key?

597 views

Learning SQL right now and the definitions for these terms are a little confusing. Can someone please explain them with an example? Thank you.

In: 15

18 Answers

Anonymous 0 Comments

“Keys” are selections of column in a table that are indexed in some way, intended to allow for much faster searching. Typically the index is stored in a different file from the raw table records. With that said…

A “primary key” is the key designed to uniquely identify rows in a table. So if a primary key is defined on columns `(a,b,c)` then a search with condition “WHERE a=1 AND b=2 AND c=3” will turn up exactly 0 or 1 rows. 2+ rows are impossible. As a result, rules that may allow multiple row matches, especially where NULL is involved, are forbidden. For example, if you run a shopping store, then products could be uniquely identified by their barcode numbers. Duplicate of that should never happen, and if they do then yeah you have a problem that needs solving anyway.

A “foreign key” is a cross-reference between tables. Table X may have a column named A and it’s a foreign key reference to table Y which also has a column named A (typically the same name to help reinforce they are the same piece of information for that cross-reference). The database may refuse to allow you to delete information in one table because the other references it, or the act of deleting from one table may cause deletion of data in the other table to keep the foreign key requirements met, depending on settings. It also helps software visualize the relationship between the tables. For example, if you have a table of customers, and a table of outstanding orders, you may not delete a customer if they have any outstanding orders placed, but deleting orders is fine.

“Clustered index” means the actual raw records on disk are sorted to this ordering. There can be advantages to that if you are fetching large amounts of data and you want it delivered pre-sorted… if you ask for it in the same sort order that the clustered index is provided in, then the DB just does sequential disk reads and gets data back pre-sorted, solving that problem without needing to actually perform a sort in memory or anything.

Now let’s go back to the primary key: a unique way to identify rows. How do you actually do that? There’s 2 common strategies: a unique number typically starting at 1 and counting up indefinitely, or finding some set of columns that are actually enough to uniquely identify a row. The former is what we call a “surrogate” key – something we introduced to assist. The latter is what we call a natural key – something already provided. For our list of products, barcodes make a pretty good natural key. For our customers, names aren’t unique, even addresses aren’t uniqu and we don’t have anything we can give them (damned customers keep refusing our loyalty cards) so we just give them a customer number and each new customers gets the next number. That’s a surrogate key. Customers might see the number on their invoices, but by itself the number doesn’t mean thing.

Anonymous 0 Comments

primary key: the thing that is guaranteed to identify a row in a table. it might be a single column that’s an incremental number (an identity column), or it might be something like a guid, or it might be two columns that together make a unique value, but might not be unique in themselves.

foreign key: when one table links a column or columns to the primary key in a different table. this is good for referential integrity, you can’t put a value into the foreign key columns unless they actually refer to a row in the other table. and that row is prevented from being deleted if something references it. (or you do a cascading delete, that deletes all records pointing to the initial record you’re deleting).

clustered index: the columns in the table that define the order that data is stored in. normally it’s the primary key. sometimes it can be useful to have that as a timestamp, so records from similar times are all kept together in storage.

natural key: sometimes the data you’re given has a key that uniquely identifies the record already available in it. a (bad) example might be your social security number. that uniquely identifies you from everyone else in the usa. isn’t good for foreigners tho. the chassis number on a car might be another natural key.

surrogate key: if there’s no natural primary key in your data, you’ll need to give it one. if you just number the first inserted record as 1, the next as 2, etc, that key is not part of the natural data you’re storing, it’s surrogating for a natural key that isn’t present.

Anonymous 0 Comments

“Keys” are selections of column in a table that are indexed in some way, intended to allow for much faster searching. Typically the index is stored in a different file from the raw table records. With that said…

A “primary key” is the key designed to uniquely identify rows in a table. So if a primary key is defined on columns `(a,b,c)` then a search with condition “WHERE a=1 AND b=2 AND c=3” will turn up exactly 0 or 1 rows. 2+ rows are impossible. As a result, rules that may allow multiple row matches, especially where NULL is involved, are forbidden. For example, if you run a shopping store, then products could be uniquely identified by their barcode numbers. Duplicate of that should never happen, and if they do then yeah you have a problem that needs solving anyway.

A “foreign key” is a cross-reference between tables. Table X may have a column named A and it’s a foreign key reference to table Y which also has a column named A (typically the same name to help reinforce they are the same piece of information for that cross-reference). The database may refuse to allow you to delete information in one table because the other references it, or the act of deleting from one table may cause deletion of data in the other table to keep the foreign key requirements met, depending on settings. It also helps software visualize the relationship between the tables. For example, if you have a table of customers, and a table of outstanding orders, you may not delete a customer if they have any outstanding orders placed, but deleting orders is fine.

“Clustered index” means the actual raw records on disk are sorted to this ordering. There can be advantages to that if you are fetching large amounts of data and you want it delivered pre-sorted… if you ask for it in the same sort order that the clustered index is provided in, then the DB just does sequential disk reads and gets data back pre-sorted, solving that problem without needing to actually perform a sort in memory or anything.

Now let’s go back to the primary key: a unique way to identify rows. How do you actually do that? There’s 2 common strategies: a unique number typically starting at 1 and counting up indefinitely, or finding some set of columns that are actually enough to uniquely identify a row. The former is what we call a “surrogate” key – something we introduced to assist. The latter is what we call a natural key – something already provided. For our list of products, barcodes make a pretty good natural key. For our customers, names aren’t unique, even addresses aren’t uniqu and we don’t have anything we can give them (damned customers keep refusing our loyalty cards) so we just give them a customer number and each new customers gets the next number. That’s a surrogate key. Customers might see the number on their invoices, but by itself the number doesn’t mean thing.

Anonymous 0 Comments

“Keys” are selections of column in a table that are indexed in some way, intended to allow for much faster searching. Typically the index is stored in a different file from the raw table records. With that said…

A “primary key” is the key designed to uniquely identify rows in a table. So if a primary key is defined on columns `(a,b,c)` then a search with condition “WHERE a=1 AND b=2 AND c=3” will turn up exactly 0 or 1 rows. 2+ rows are impossible. As a result, rules that may allow multiple row matches, especially where NULL is involved, are forbidden. For example, if you run a shopping store, then products could be uniquely identified by their barcode numbers. Duplicate of that should never happen, and if they do then yeah you have a problem that needs solving anyway.

A “foreign key” is a cross-reference between tables. Table X may have a column named A and it’s a foreign key reference to table Y which also has a column named A (typically the same name to help reinforce they are the same piece of information for that cross-reference). The database may refuse to allow you to delete information in one table because the other references it, or the act of deleting from one table may cause deletion of data in the other table to keep the foreign key requirements met, depending on settings. It also helps software visualize the relationship between the tables. For example, if you have a table of customers, and a table of outstanding orders, you may not delete a customer if they have any outstanding orders placed, but deleting orders is fine.

“Clustered index” means the actual raw records on disk are sorted to this ordering. There can be advantages to that if you are fetching large amounts of data and you want it delivered pre-sorted… if you ask for it in the same sort order that the clustered index is provided in, then the DB just does sequential disk reads and gets data back pre-sorted, solving that problem without needing to actually perform a sort in memory or anything.

Now let’s go back to the primary key: a unique way to identify rows. How do you actually do that? There’s 2 common strategies: a unique number typically starting at 1 and counting up indefinitely, or finding some set of columns that are actually enough to uniquely identify a row. The former is what we call a “surrogate” key – something we introduced to assist. The latter is what we call a natural key – something already provided. For our list of products, barcodes make a pretty good natural key. For our customers, names aren’t unique, even addresses aren’t uniqu and we don’t have anything we can give them (damned customers keep refusing our loyalty cards) so we just give them a customer number and each new customers gets the next number. That’s a surrogate key. Customers might see the number on their invoices, but by itself the number doesn’t mean thing.

Anonymous 0 Comments

primary key: the thing that is guaranteed to identify a row in a table. it might be a single column that’s an incremental number (an identity column), or it might be something like a guid, or it might be two columns that together make a unique value, but might not be unique in themselves.

foreign key: when one table links a column or columns to the primary key in a different table. this is good for referential integrity, you can’t put a value into the foreign key columns unless they actually refer to a row in the other table. and that row is prevented from being deleted if something references it. (or you do a cascading delete, that deletes all records pointing to the initial record you’re deleting).

clustered index: the columns in the table that define the order that data is stored in. normally it’s the primary key. sometimes it can be useful to have that as a timestamp, so records from similar times are all kept together in storage.

natural key: sometimes the data you’re given has a key that uniquely identifies the record already available in it. a (bad) example might be your social security number. that uniquely identifies you from everyone else in the usa. isn’t good for foreigners tho. the chassis number on a car might be another natural key.

surrogate key: if there’s no natural primary key in your data, you’ll need to give it one. if you just number the first inserted record as 1, the next as 2, etc, that key is not part of the natural data you’re storing, it’s surrogating for a natural key that isn’t present.

Anonymous 0 Comments

primary key: the thing that is guaranteed to identify a row in a table. it might be a single column that’s an incremental number (an identity column), or it might be something like a guid, or it might be two columns that together make a unique value, but might not be unique in themselves.

foreign key: when one table links a column or columns to the primary key in a different table. this is good for referential integrity, you can’t put a value into the foreign key columns unless they actually refer to a row in the other table. and that row is prevented from being deleted if something references it. (or you do a cascading delete, that deletes all records pointing to the initial record you’re deleting).

clustered index: the columns in the table that define the order that data is stored in. normally it’s the primary key. sometimes it can be useful to have that as a timestamp, so records from similar times are all kept together in storage.

natural key: sometimes the data you’re given has a key that uniquely identifies the record already available in it. a (bad) example might be your social security number. that uniquely identifies you from everyone else in the usa. isn’t good for foreigners tho. the chassis number on a car might be another natural key.

surrogate key: if there’s no natural primary key in your data, you’ll need to give it one. if you just number the first inserted record as 1, the next as 2, etc, that key is not part of the natural data you’re storing, it’s surrogating for a natural key that isn’t present.

Anonymous 0 Comments

What helped me understand all this was what used to be called third normal form*: you have a table with columns. All the columns MUST have data in them. Nulls are not allowed. Duplicate rows are not allowed. How do you store an address with these restrictions?

You pick out what makes each part of an address unique. Start with the name. Now, this is a problem with third normal form: you can go really over the top and atomize the data too much. Which is to say, you can split a name up into three or more tables if you’re manic about it. For the sake of argument, let’s say all names have a first name and a last name, and that’s it. No exceptions.

Okay, so that’s a table. First and last name. But that’s not a very good unique identifier. Lots of people have the same first and last name. So you create a primary key. This is a number that is incremented each time a record is entered. Always. To the point that, if you delete a record, that ID is never used again.

Okay. So we have a person’s name as a record with a primary key that will always identify that person even if they have the same name as another person, because you can’t enter two people with the exact same name at the exact same time (does this sound like an edge case? It sure does! But it’s beyond this discussion.) A person can have many addresses.

When someone is using an Excel spreadsheet to keep track of it all, they just add columns. “Address2,City2,Address3,City3..YaddaYadda2..3.” But we have a restriction. We can’t have null values. ALL columns must have data.

And this is where the relationships start to matter. One person can have many addresses. This means we put a PERSONID column in our ADDRESS table. This is a foreign key relationship. The name primary key becomes an address foreign key because, in the name table, THERE CAN BE ONLY ONE! But in my address table, there can be many. Now, all our tables have columns full of data.

Once you think about it, the rest emerges, at least in the US (but this applies to all addresses, the columns and tables will change but not the concepts). There are only a set number of states, so they get unique IDs. Those go into the ADDRESS table. Likewise zip codes. Presto: you now have a database with, what, six tables? No rows are repeated. All columns have data. There are no nulls. After a few mental adjustments, this makes structured queries… well, that’s why it’s called SQL.

There are several drawbacks to this model. But they pale against the advantages. It’s like knowing how to write music, or a novel: it’s fine to break the rules. Sometimes you need to. But first, you have to know them.

*This is from memory, of me reading a textbook on a bus, in 1996. YMMV.

Anonymous 0 Comments

What helped me understand all this was what used to be called third normal form*: you have a table with columns. All the columns MUST have data in them. Nulls are not allowed. Duplicate rows are not allowed. How do you store an address with these restrictions?

You pick out what makes each part of an address unique. Start with the name. Now, this is a problem with third normal form: you can go really over the top and atomize the data too much. Which is to say, you can split a name up into three or more tables if you’re manic about it. For the sake of argument, let’s say all names have a first name and a last name, and that’s it. No exceptions.

Okay, so that’s a table. First and last name. But that’s not a very good unique identifier. Lots of people have the same first and last name. So you create a primary key. This is a number that is incremented each time a record is entered. Always. To the point that, if you delete a record, that ID is never used again.

Okay. So we have a person’s name as a record with a primary key that will always identify that person even if they have the same name as another person, because you can’t enter two people with the exact same name at the exact same time (does this sound like an edge case? It sure does! But it’s beyond this discussion.) A person can have many addresses.

When someone is using an Excel spreadsheet to keep track of it all, they just add columns. “Address2,City2,Address3,City3..YaddaYadda2..3.” But we have a restriction. We can’t have null values. ALL columns must have data.

And this is where the relationships start to matter. One person can have many addresses. This means we put a PERSONID column in our ADDRESS table. This is a foreign key relationship. The name primary key becomes an address foreign key because, in the name table, THERE CAN BE ONLY ONE! But in my address table, there can be many. Now, all our tables have columns full of data.

Once you think about it, the rest emerges, at least in the US (but this applies to all addresses, the columns and tables will change but not the concepts). There are only a set number of states, so they get unique IDs. Those go into the ADDRESS table. Likewise zip codes. Presto: you now have a database with, what, six tables? No rows are repeated. All columns have data. There are no nulls. After a few mental adjustments, this makes structured queries… well, that’s why it’s called SQL.

There are several drawbacks to this model. But they pale against the advantages. It’s like knowing how to write music, or a novel: it’s fine to break the rules. Sometimes you need to. But first, you have to know them.

*This is from memory, of me reading a textbook on a bus, in 1996. YMMV.

Anonymous 0 Comments

What helped me understand all this was what used to be called third normal form*: you have a table with columns. All the columns MUST have data in them. Nulls are not allowed. Duplicate rows are not allowed. How do you store an address with these restrictions?

You pick out what makes each part of an address unique. Start with the name. Now, this is a problem with third normal form: you can go really over the top and atomize the data too much. Which is to say, you can split a name up into three or more tables if you’re manic about it. For the sake of argument, let’s say all names have a first name and a last name, and that’s it. No exceptions.

Okay, so that’s a table. First and last name. But that’s not a very good unique identifier. Lots of people have the same first and last name. So you create a primary key. This is a number that is incremented each time a record is entered. Always. To the point that, if you delete a record, that ID is never used again.

Okay. So we have a person’s name as a record with a primary key that will always identify that person even if they have the same name as another person, because you can’t enter two people with the exact same name at the exact same time (does this sound like an edge case? It sure does! But it’s beyond this discussion.) A person can have many addresses.

When someone is using an Excel spreadsheet to keep track of it all, they just add columns. “Address2,City2,Address3,City3..YaddaYadda2..3.” But we have a restriction. We can’t have null values. ALL columns must have data.

And this is where the relationships start to matter. One person can have many addresses. This means we put a PERSONID column in our ADDRESS table. This is a foreign key relationship. The name primary key becomes an address foreign key because, in the name table, THERE CAN BE ONLY ONE! But in my address table, there can be many. Now, all our tables have columns full of data.

Once you think about it, the rest emerges, at least in the US (but this applies to all addresses, the columns and tables will change but not the concepts). There are only a set number of states, so they get unique IDs. Those go into the ADDRESS table. Likewise zip codes. Presto: you now have a database with, what, six tables? No rows are repeated. All columns have data. There are no nulls. After a few mental adjustments, this makes structured queries… well, that’s why it’s called SQL.

There are several drawbacks to this model. But they pale against the advantages. It’s like knowing how to write music, or a novel: it’s fine to break the rules. Sometimes you need to. But first, you have to know them.

*This is from memory, of me reading a textbook on a bus, in 1996. YMMV.

Anonymous 0 Comments

Let’s pretend we’ve got a set of tables for sorting books for like a library or something

Primary key: what makes your row unique? If you have data about <book title> by <author>, the row can have all sorts of extra info like review scores or lending status, but it’s still the same book regardless of those extra details. If someone takes out “The hunger games” by suzanne collins and we don’t have a good primary key, the system could potentially add a totally new hunger games row with the lending status column set to “unavailable” and leave the original row alone (saying the book is still available), which is obviously not what we want. It’s the same book, there’s just a different status on it now, so we should update the existing row because the primary key for the new data matches the primary key of an existing row.

Foreign key: A list of values from another table to limit valid entries in the original table. Let’s say you’ve got a table that lists all sorts of book genres; it’s just literally one column filled with rows like “fiction” “science fiction” etc. Your main table may put a foreign key on its own genre column, so that it doesn’t end up doing something confusing like calling Star wars a “scifi”-genre story since that isn’t how all the other science fiction stories are categorized; it would make searching for it a nightmare. It’s like a web form providing a dropdown menu of valid responses instead of letting the user type whatever they want

Clustered index: what’s the default sort order for your table? This affects how they are stored in disk. If you tell the table to be stored on the hard drive by alphabetizing based on title, then adding a new entry to the table won’t tack the new entry to the end of the table and will instead try to insert it into the right spot.

non-clustered index (you didn’t ask, but figured I would talk about it): It’s basically a piece of paper on the side that lists our table in a different order. If someone wants to come in and search by author, they will have a hell of a time trying to do that if the table is sorted by book title. If someone wants to find all books by Brandon Sanderson, the table itself would be putting “Elantris” and “Warbreaker” nowhere close to each other in the table, so the search will have to go through the entire table to make sure it doesn’t miss anything. A nonclustered index is like looking at a note that says “Brandon Sanderson is the author for rows 5, 27, 83, (etc) in the original table” and the search will be able to jump straight to them for the rest of the data

Surrogate vs natural key are different types of primary key. My primary key example from earlier that was based on title/author would be considered a natural key because it uses the data itself to form a *natural* way to sort the information. A surrogate key would be like assigning a number to each row and just sorting on that. Surrogate keys have essentially zero chance of accidental overlap if done correctly, but they don’t give you much benefit in actually trying to find anything unless that number means something else to you (like if you want to find something inserted in the last month, so you take the most recent 1000 rows or something)