Can someone please explain to me the difference between a primary key, foreign key, clustered index, natural key, and surrogate key?

578 views

Learning SQL right now and the definitions for these terms are a little confusing. Can someone please explain them with an example? Thank you.

In: 15

18 Answers

Anonymous 0 Comments

Let’s pretend we’ve got a set of tables for sorting books for like a library or something

Primary key: what makes your row unique? If you have data about <book title> by <author>, the row can have all sorts of extra info like review scores or lending status, but it’s still the same book regardless of those extra details. If someone takes out “The hunger games” by suzanne collins and we don’t have a good primary key, the system could potentially add a totally new hunger games row with the lending status column set to “unavailable” and leave the original row alone (saying the book is still available), which is obviously not what we want. It’s the same book, there’s just a different status on it now, so we should update the existing row because the primary key for the new data matches the primary key of an existing row.

Foreign key: A list of values from another table to limit valid entries in the original table. Let’s say you’ve got a table that lists all sorts of book genres; it’s just literally one column filled with rows like “fiction” “science fiction” etc. Your main table may put a foreign key on its own genre column, so that it doesn’t end up doing something confusing like calling Star wars a “scifi”-genre story since that isn’t how all the other science fiction stories are categorized; it would make searching for it a nightmare. It’s like a web form providing a dropdown menu of valid responses instead of letting the user type whatever they want

Clustered index: what’s the default sort order for your table? This affects how they are stored in disk. If you tell the table to be stored on the hard drive by alphabetizing based on title, then adding a new entry to the table won’t tack the new entry to the end of the table and will instead try to insert it into the right spot.

non-clustered index (you didn’t ask, but figured I would talk about it): It’s basically a piece of paper on the side that lists our table in a different order. If someone wants to come in and search by author, they will have a hell of a time trying to do that if the table is sorted by book title. If someone wants to find all books by Brandon Sanderson, the table itself would be putting “Elantris” and “Warbreaker” nowhere close to each other in the table, so the search will have to go through the entire table to make sure it doesn’t miss anything. A nonclustered index is like looking at a note that says “Brandon Sanderson is the author for rows 5, 27, 83, (etc) in the original table” and the search will be able to jump straight to them for the rest of the data

Surrogate vs natural key are different types of primary key. My primary key example from earlier that was based on title/author would be considered a natural key because it uses the data itself to form a *natural* way to sort the information. A surrogate key would be like assigning a number to each row and just sorting on that. Surrogate keys have essentially zero chance of accidental overlap if done correctly, but they don’t give you much benefit in actually trying to find anything unless that number means something else to you (like if you want to find something inserted in the last month, so you take the most recent 1000 rows or something)

Anonymous 0 Comments

Let’s pretend we’ve got a set of tables for sorting books for like a library or something

Primary key: what makes your row unique? If you have data about <book title> by <author>, the row can have all sorts of extra info like review scores or lending status, but it’s still the same book regardless of those extra details. If someone takes out “The hunger games” by suzanne collins and we don’t have a good primary key, the system could potentially add a totally new hunger games row with the lending status column set to “unavailable” and leave the original row alone (saying the book is still available), which is obviously not what we want. It’s the same book, there’s just a different status on it now, so we should update the existing row because the primary key for the new data matches the primary key of an existing row.

Foreign key: A list of values from another table to limit valid entries in the original table. Let’s say you’ve got a table that lists all sorts of book genres; it’s just literally one column filled with rows like “fiction” “science fiction” etc. Your main table may put a foreign key on its own genre column, so that it doesn’t end up doing something confusing like calling Star wars a “scifi”-genre story since that isn’t how all the other science fiction stories are categorized; it would make searching for it a nightmare. It’s like a web form providing a dropdown menu of valid responses instead of letting the user type whatever they want

Clustered index: what’s the default sort order for your table? This affects how they are stored in disk. If you tell the table to be stored on the hard drive by alphabetizing based on title, then adding a new entry to the table won’t tack the new entry to the end of the table and will instead try to insert it into the right spot.

non-clustered index (you didn’t ask, but figured I would talk about it): It’s basically a piece of paper on the side that lists our table in a different order. If someone wants to come in and search by author, they will have a hell of a time trying to do that if the table is sorted by book title. If someone wants to find all books by Brandon Sanderson, the table itself would be putting “Elantris” and “Warbreaker” nowhere close to each other in the table, so the search will have to go through the entire table to make sure it doesn’t miss anything. A nonclustered index is like looking at a note that says “Brandon Sanderson is the author for rows 5, 27, 83, (etc) in the original table” and the search will be able to jump straight to them for the rest of the data

Surrogate vs natural key are different types of primary key. My primary key example from earlier that was based on title/author would be considered a natural key because it uses the data itself to form a *natural* way to sort the information. A surrogate key would be like assigning a number to each row and just sorting on that. Surrogate keys have essentially zero chance of accidental overlap if done correctly, but they don’t give you much benefit in actually trying to find anything unless that number means something else to you (like if you want to find something inserted in the last month, so you take the most recent 1000 rows or something)

Anonymous 0 Comments

let’s say you have a table and you’re going to use it to keep track of your new friends that you just met. you each get a number, and that’s your ID. the table will have an ID column. you also have a name column. now let’s say instead of names, you all have stickers on your shirts that have your ID number on them. you see one new friend has 3 as their ID. you go to the table but you see two rows with an ID of 3, and both have different names. that makes the table useless, right? that’s what primary keys are for. (unique keys too, they’re basically the same thing) they require that the values in that column are unique. meaning no copies/duplicates. so in that case, each person needs a unique ID that nobody else has.

now let’s say you all have pets. (dogs, cats, etc) you could add a pets column to your first table, but if you have more than one pet, you would need more than one row, and that just won’t work. you could add extra columns for more pets, but that’s not smart, because you’re limited by the number of columns you have, and will always have to keep adding more if people keep getting more pet and more pets. so instead, you create a pets table. but how do you keep track of which pets are for which person? their ID! so you insert a row for each pet. and you make a column called owner_id, and put in the id of the owner from the first table. now in this table, you don’t want that column to be primary key, because if you have more than one pet, you will have that person’s id in that column more than once. so now your pets are all in the table and you can identify which pets go to which owner. that ties the two tables together, and this is the whole point of a relational database. you can write a join query to get owner names and pet names with one single query.

now one of your friends moves away. so you delete that friend from the friends table. later on, you decide to add a new column to the pets table called type. you’ll put things like bird, dog, cat, etc into that column. when you start going through row by row updating that column, you look up the owner using the ID column, and call the owner to ask them. when you get to one pet, you can’t find the ID in the owner_id column in the friends table. what the heck? then you realize, that’s your friend that moved away. you deleted them from the friends table, so that row is now an orphaned row (that’s the technical term). a foreign key is a type of constraint you put on that column in the pets table to tie it to the ID column in the parent table. it won’t let you delete a friend from the friends table without deleting the associated pet rows from the pet table first. that way you don’t leave orphaned rows. those can break a lot of apps.

those are 99% of what you’ll hear about/use. surrogates you’ll use but i never hear anybody even use that word. ok, so your pets table. you have that owner_id column. let’s say you have two pets. that means you’ll have two rows in the pet table with your id in the owner_id column. if you want to update one of your pet’s row with an update statement, how do you do it? if you do “where owner_id=1”, it will update both rows. this is when you want a separate ID column that’s unique to each pet row. so usually people just add an “id int primary key auto_increment” column. they usually start at 0 and each row you insert the number in that column automatically goes up by 1. that way each row has its own unique id, and you can update a pet’s row directly without updating the other pets if they have the same owner. that’s all. it’s common that you’ll have an id column in every table. you’ll often see the column that contains the child rows (like our pets table), will have a column that’s the name of the other table + the word id, and use that as the foreign keycolumn. so if you had a bosses table, and an employees table, you’d have a column in the employees table called “boss_id” that would match the id column in the bosses table. in our example, if we did it the more common way, instead of owner_id, it would be friend_id. or we’d rename the friends table to owners, and keep our current owner_id column.

but anyway, i hope that makes things clear

Anonymous 0 Comments

let’s say you have a table and you’re going to use it to keep track of your new friends that you just met. you each get a number, and that’s your ID. the table will have an ID column. you also have a name column. now let’s say instead of names, you all have stickers on your shirts that have your ID number on them. you see one new friend has 3 as their ID. you go to the table but you see two rows with an ID of 3, and both have different names. that makes the table useless, right? that’s what primary keys are for. (unique keys too, they’re basically the same thing) they require that the values in that column are unique. meaning no copies/duplicates. so in that case, each person needs a unique ID that nobody else has.

now let’s say you all have pets. (dogs, cats, etc) you could add a pets column to your first table, but if you have more than one pet, you would need more than one row, and that just won’t work. you could add extra columns for more pets, but that’s not smart, because you’re limited by the number of columns you have, and will always have to keep adding more if people keep getting more pet and more pets. so instead, you create a pets table. but how do you keep track of which pets are for which person? their ID! so you insert a row for each pet. and you make a column called owner_id, and put in the id of the owner from the first table. now in this table, you don’t want that column to be primary key, because if you have more than one pet, you will have that person’s id in that column more than once. so now your pets are all in the table and you can identify which pets go to which owner. that ties the two tables together, and this is the whole point of a relational database. you can write a join query to get owner names and pet names with one single query.

now one of your friends moves away. so you delete that friend from the friends table. later on, you decide to add a new column to the pets table called type. you’ll put things like bird, dog, cat, etc into that column. when you start going through row by row updating that column, you look up the owner using the ID column, and call the owner to ask them. when you get to one pet, you can’t find the ID in the owner_id column in the friends table. what the heck? then you realize, that’s your friend that moved away. you deleted them from the friends table, so that row is now an orphaned row (that’s the technical term). a foreign key is a type of constraint you put on that column in the pets table to tie it to the ID column in the parent table. it won’t let you delete a friend from the friends table without deleting the associated pet rows from the pet table first. that way you don’t leave orphaned rows. those can break a lot of apps.

those are 99% of what you’ll hear about/use. surrogates you’ll use but i never hear anybody even use that word. ok, so your pets table. you have that owner_id column. let’s say you have two pets. that means you’ll have two rows in the pet table with your id in the owner_id column. if you want to update one of your pet’s row with an update statement, how do you do it? if you do “where owner_id=1”, it will update both rows. this is when you want a separate ID column that’s unique to each pet row. so usually people just add an “id int primary key auto_increment” column. they usually start at 0 and each row you insert the number in that column automatically goes up by 1. that way each row has its own unique id, and you can update a pet’s row directly without updating the other pets if they have the same owner. that’s all. it’s common that you’ll have an id column in every table. you’ll often see the column that contains the child rows (like our pets table), will have a column that’s the name of the other table + the word id, and use that as the foreign keycolumn. so if you had a bosses table, and an employees table, you’d have a column in the employees table called “boss_id” that would match the id column in the bosses table. in our example, if we did it the more common way, instead of owner_id, it would be friend_id. or we’d rename the friends table to owners, and keep our current owner_id column.

but anyway, i hope that makes things clear

Anonymous 0 Comments

let’s say you have a table and you’re going to use it to keep track of your new friends that you just met. you each get a number, and that’s your ID. the table will have an ID column. you also have a name column. now let’s say instead of names, you all have stickers on your shirts that have your ID number on them. you see one new friend has 3 as their ID. you go to the table but you see two rows with an ID of 3, and both have different names. that makes the table useless, right? that’s what primary keys are for. (unique keys too, they’re basically the same thing) they require that the values in that column are unique. meaning no copies/duplicates. so in that case, each person needs a unique ID that nobody else has.

now let’s say you all have pets. (dogs, cats, etc) you could add a pets column to your first table, but if you have more than one pet, you would need more than one row, and that just won’t work. you could add extra columns for more pets, but that’s not smart, because you’re limited by the number of columns you have, and will always have to keep adding more if people keep getting more pet and more pets. so instead, you create a pets table. but how do you keep track of which pets are for which person? their ID! so you insert a row for each pet. and you make a column called owner_id, and put in the id of the owner from the first table. now in this table, you don’t want that column to be primary key, because if you have more than one pet, you will have that person’s id in that column more than once. so now your pets are all in the table and you can identify which pets go to which owner. that ties the two tables together, and this is the whole point of a relational database. you can write a join query to get owner names and pet names with one single query.

now one of your friends moves away. so you delete that friend from the friends table. later on, you decide to add a new column to the pets table called type. you’ll put things like bird, dog, cat, etc into that column. when you start going through row by row updating that column, you look up the owner using the ID column, and call the owner to ask them. when you get to one pet, you can’t find the ID in the owner_id column in the friends table. what the heck? then you realize, that’s your friend that moved away. you deleted them from the friends table, so that row is now an orphaned row (that’s the technical term). a foreign key is a type of constraint you put on that column in the pets table to tie it to the ID column in the parent table. it won’t let you delete a friend from the friends table without deleting the associated pet rows from the pet table first. that way you don’t leave orphaned rows. those can break a lot of apps.

those are 99% of what you’ll hear about/use. surrogates you’ll use but i never hear anybody even use that word. ok, so your pets table. you have that owner_id column. let’s say you have two pets. that means you’ll have two rows in the pet table with your id in the owner_id column. if you want to update one of your pet’s row with an update statement, how do you do it? if you do “where owner_id=1”, it will update both rows. this is when you want a separate ID column that’s unique to each pet row. so usually people just add an “id int primary key auto_increment” column. they usually start at 0 and each row you insert the number in that column automatically goes up by 1. that way each row has its own unique id, and you can update a pet’s row directly without updating the other pets if they have the same owner. that’s all. it’s common that you’ll have an id column in every table. you’ll often see the column that contains the child rows (like our pets table), will have a column that’s the name of the other table + the word id, and use that as the foreign keycolumn. so if you had a bosses table, and an employees table, you’d have a column in the employees table called “boss_id” that would match the id column in the bosses table. in our example, if we did it the more common way, instead of owner_id, it would be friend_id. or we’d rename the friends table to owners, and keep our current owner_id column.

but anyway, i hope that makes things clear

Anonymous 0 Comments

Here’s the shortest explanation (of a complicated subject) that actually works…

A primary key (PK) just means “this is how you uniquely identify one of these rows in the table”. It can be one field, or the combination of a couple of fields.

A foreign key (FK) is just a reference to a *primary* key on another table. This helps to establish relationships, and I’ll talk more about that later. A PK value from another table can appear in multiple rows on the table that refers to it in the FK field, but only once on the PK table. Think single Customer ID on multiple Orders.

A natural key just means that the key value is a natural property of the thing in the table. If everyone in the world had guaranteed unique names, then we could use their names as a natural key. It’s not reflective of reality though, as anyone named John Smith can testify.

A surrogate key replaces a *natural* key with a guaranteed-unique value. Most often, we use auto-numbered (auto-incremented) primary keys, where the database management system (DBMS) manages the addition of the next value for you when the record is added. After Employee 123 is added to the Employee table, the next Employee will be 124.

There are also candidate keys, which aren’t designated as keys in the DBMS, per se, but are alternative ways to identify a single row in the table. Think CityStateZip Code. Zip Code is the primary key, but you can identify the same place by naming the city and the state. (Bonus reality note: Zip Code is 100% artificial, and in no way describes the location.)

Here’s the critical part about keys: A primary key only needs to be *unique*, and does NOT have to reflect any quality of the object in that row. It supplants the full set of fields (columns) in that row for use in relationships in order to help ensure their accuracy. Put the customer information in one place, just once, and refer to it using it’s ID.

The *reason* we need all of these tables to have uniquely identifiable rows is so that we can associate pieces of information easily (through Relationships), which sets context for events that occur. Ex: I can set up a Customers table with its own auto-numbered primary key. We can now refer to that Customer ID on the Orders table (instead of their name address, shipping address, phone number, etc.) and know for certain which exact customer made that order. It reduces the chances to introduce mistakes like misspelling the customer’s name badly enough that you can’t find their order, and it lets you tie them to multiple orders over time.

OK, not so simple of an explanation, nor a short one. I hope this helps…

Anonymous 0 Comments

Here’s the shortest explanation (of a complicated subject) that actually works…

A primary key (PK) just means “this is how you uniquely identify one of these rows in the table”. It can be one field, or the combination of a couple of fields.

A foreign key (FK) is just a reference to a *primary* key on another table. This helps to establish relationships, and I’ll talk more about that later. A PK value from another table can appear in multiple rows on the table that refers to it in the FK field, but only once on the PK table. Think single Customer ID on multiple Orders.

A natural key just means that the key value is a natural property of the thing in the table. If everyone in the world had guaranteed unique names, then we could use their names as a natural key. It’s not reflective of reality though, as anyone named John Smith can testify.

A surrogate key replaces a *natural* key with a guaranteed-unique value. Most often, we use auto-numbered (auto-incremented) primary keys, where the database management system (DBMS) manages the addition of the next value for you when the record is added. After Employee 123 is added to the Employee table, the next Employee will be 124.

There are also candidate keys, which aren’t designated as keys in the DBMS, per se, but are alternative ways to identify a single row in the table. Think CityStateZip Code. Zip Code is the primary key, but you can identify the same place by naming the city and the state. (Bonus reality note: Zip Code is 100% artificial, and in no way describes the location.)

Here’s the critical part about keys: A primary key only needs to be *unique*, and does NOT have to reflect any quality of the object in that row. It supplants the full set of fields (columns) in that row for use in relationships in order to help ensure their accuracy. Put the customer information in one place, just once, and refer to it using it’s ID.

The *reason* we need all of these tables to have uniquely identifiable rows is so that we can associate pieces of information easily (through Relationships), which sets context for events that occur. Ex: I can set up a Customers table with its own auto-numbered primary key. We can now refer to that Customer ID on the Orders table (instead of their name address, shipping address, phone number, etc.) and know for certain which exact customer made that order. It reduces the chances to introduce mistakes like misspelling the customer’s name badly enough that you can’t find their order, and it lets you tie them to multiple orders over time.

OK, not so simple of an explanation, nor a short one. I hope this helps…

Anonymous 0 Comments

Here’s the shortest explanation (of a complicated subject) that actually works…

A primary key (PK) just means “this is how you uniquely identify one of these rows in the table”. It can be one field, or the combination of a couple of fields.

A foreign key (FK) is just a reference to a *primary* key on another table. This helps to establish relationships, and I’ll talk more about that later. A PK value from another table can appear in multiple rows on the table that refers to it in the FK field, but only once on the PK table. Think single Customer ID on multiple Orders.

A natural key just means that the key value is a natural property of the thing in the table. If everyone in the world had guaranteed unique names, then we could use their names as a natural key. It’s not reflective of reality though, as anyone named John Smith can testify.

A surrogate key replaces a *natural* key with a guaranteed-unique value. Most often, we use auto-numbered (auto-incremented) primary keys, where the database management system (DBMS) manages the addition of the next value for you when the record is added. After Employee 123 is added to the Employee table, the next Employee will be 124.

There are also candidate keys, which aren’t designated as keys in the DBMS, per se, but are alternative ways to identify a single row in the table. Think CityStateZip Code. Zip Code is the primary key, but you can identify the same place by naming the city and the state. (Bonus reality note: Zip Code is 100% artificial, and in no way describes the location.)

Here’s the critical part about keys: A primary key only needs to be *unique*, and does NOT have to reflect any quality of the object in that row. It supplants the full set of fields (columns) in that row for use in relationships in order to help ensure their accuracy. Put the customer information in one place, just once, and refer to it using it’s ID.

The *reason* we need all of these tables to have uniquely identifiable rows is so that we can associate pieces of information easily (through Relationships), which sets context for events that occur. Ex: I can set up a Customers table with its own auto-numbered primary key. We can now refer to that Customer ID on the Orders table (instead of their name address, shipping address, phone number, etc.) and know for certain which exact customer made that order. It reduces the chances to introduce mistakes like misspelling the customer’s name badly enough that you can’t find their order, and it lets you tie them to multiple orders over time.

OK, not so simple of an explanation, nor a short one. I hope this helps…