top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
Much to my surprise, my original design was correctly normalized, even before I knew what "normalization" meant. So I jotted down some notes about the steps I had taken to do the design. A few databases and a few refinements led me to an organized design method. Then I discovered that my method led immediately to Boyce-Codd normal form. (Boyce-Codd is superior to the then-standard third normal form.) I returned to normalization theory and was able to prove that my method led to fourth normal form, and then I found a proof that my designs were in another intermediate form stopping just short of the final fifth normal form. In the years since I've developed other successful databases using this procedure, and I've discovered this year (2005) the final proof that my designs are, in fact, in fifth normal form. This method yields a correctly normalized design, every time.
I've named it ORE&D (pronounced Ore 'n' Dee) after its four main table classes:
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
If you haven't already picked a database, Oracle and DB2 dominate the $8 billion RDBMS market. On the other hand, a free RDBMS such as MySQL running on an open source OS, such as Linux or FreeBSD, provides a remarkably capable solution for small and medium-sized databases.
In a relational database you'll have a table for every type of thing that you track. Let's say you sell products to customers. Begin with a customer table and a product table. These are some of the columns you'll probably have in your tables.
Product Table | Customer Table |
|
|
Here are a few rows from the Product table:
|
|
(Do you see the design flaw in this table? We'll get to it. Relational database pioneer Chris Date called normalization "formalized common sense." Everything known about relational design was summarized when Granny said, "A place for everything and everything in its place.")
![]() |
Here you'll meet keys, which are absolutely key to understanding the relational database system, the ORE&D design method and classic database normalization. If this is new to you, this is the section where you have to slow down and be absolutely sure you understand primary and foreign keys. | ||||||||||||||||||||
![]() |
The Product ID field is called the key. Given a key (10001235, for example), a computer can find an individual record (small, blue, 10 oz. widget) nearly instantaneously, even if there are millions of records. The key that identifies a particular record is called the primary key. | ||||||||||||||||||||
![]() |
The simplest key system, consecutive integers, is the best. Your first record is assigned key zero. The next is given key one, and so on. When a record is deleted its key is never reused. (Some databases use random integers for keys. Don't use these for anything more important than keeping track of your CD collection. This is explained in my java database book in more detail.) | ||||||||||||||||||||
![]() |
Let's show how these keys are used with a very simple Sales table - it allows customers to buy one product at a time. | ||||||||||||||||||||
![]() |
|
||||||||||||||||||||
![]() |
In this example, customer 123456 bought 3 small, blue widgets (product 10001235 looked up in the Product Table, above). We don't store data about the customer or product in the Sales table, just the keys to the appropriate records in the Customer and Product tables. A join process takes care of looking up the customer's address and the product's name, size and color when it's time to generate a shipping label or an invoice. In this example the Sale ID is the record's primary key. The Customer ID and Product ID fields are called foreign keys. | ||||||||||||||||||||
![]() |
A handy trick, if your database allows, is to use a large starting key to keep your records aligned when they're dumped to screen or print. Here's our Sales table, both ways: | ||||||||||||||||||||
![]() |
|
By keeping the customer data in one record, if the customer moves, we have exactly one customer record to update. If we find a way to pack those widgets that saves an ounce, we update the product table. Here it is again.
Product Table | ||||
Product ID | Name | Size | Color | Shipping Weight |
---|---|---|---|---|
. . . | ||||
10001234 | Widget | Small | Red | 10 |
10001235 | Widget | Small | Blue | 10 |
10001236 | Widget | Medium | White | 17 |
. . . |
Now we're up to the design flaw. Did you notice that the weight of small widgets is recorded in two records? We'd have to update them both. Granny didn't say, "... and everything in two places." If there were more colors, we'd repeat that weight in more places. In our example the small widgets are next to each other. In reality, you'll start with red, white and blue widgets. A month or a year later you'll add purple and gold widgets — the records won't be adjacent.
A worse problem is that if you stop carrying white widgets a delete of the single white widget also deletes the weight of medium widgets. Collectively, problems like these are called "update anomalies" and the goal of database design is to eliminate them.
We'll see when we get to the fifth ORE&D step how that should have been designed. Now let's actually start designing.
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
|
In our example above, we had customers and products as the two classes of objects about which we needed data. The first step in design is to list the different classes of objects. What's an object?
Anything that you can kick is an object. More formally, an object has a long or permanent duration and it exists independently from any other object. Abstract, non-kickable objects include music compositions and mathematical formulae. In practice, listing the kickable things generally works. |
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
|
In the example above, we had a Sale event. It created a relationship among objects at a point in time. A customer bought 3 small blue widgets, February 14, 2005. An event that doesn't involve our objects is not relevant. An important event that doesn't involve our objects means we'd better add to our list of objects. |
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
|
Begin every table with an ID. Then list the other characteristics. For customers you'll want names, probably prefix (Mr., Ms., Dr., etc.) and suffix ( Jr., III, etc.). Two lines of street address, a city or town, political subdivision (state, province, etc.), country and postal code, phone and email address and any other information about the customer (but not about your transactions with the customer — those will be entered in event tables).
How elaborate you get depends entirely on your needs. An Internet retailer may ship product all over the world. A pizza parlor probably won't need a "country" field in the address. For events, you'll have an ID (primary key), time stamp, IDs of the objects involved (foreign keys) and other characteristics. For a sale you'll have price, shipping method, payment characteristics and so on. Depending on your application, you might want a finer time stamp than just a day. A fast-food restaurant might track sales by minute, so they know how many burgers to have on the grill at noon and how many to have on the grill at 3:30 P.M. Sometimes you'll meet events that don't happen at a point in time, but happen over a period of time. (Nothing really happens at a single point in time but most database events can be thought of that way.) Homebuilders start a house one day and finish weeks later. This design method isn't a law. It's a guide. You have to adapt to your reality. If your data includes long-duration events, replace the time stamp with start and end times. |
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
What relationships do we need to track? That depends on your application. If Fred and Sally are married to each other and are both customers, you probably should store their relationship in your head, not in your database. On the other hand, if they both work for you and you provide health benefits, you'll need the relationship in the database. Without it, you'll pay to insure Fred, Fred's wife, Sally and Sally's husband.
Here's a relationship table:
Employee Relationship Table | |||||
Relation ID | Start Date | End Date | Employee ID | Related Employee ID |
Relationship |
---|---|---|---|---|---|
1012 | 19950615 | 101234 | 104321 | married | |
1013 | 19570123 | 105654 | 103898 | parent | |
1014 | 19810615 | 19920715 | 104567 | 103456 | married |
You've seen the ORE in ORE&D, now we're going on to the D, details. There are two main types of Detail table, repeating characteristics and histories.
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
Very Bad Design! | ||||||
Employee ID | Name | Other Data |
Sport 1 | Sport 2 | Sport 3 | Sport 4 |
---|---|---|---|---|---|---|
100001 | Martin Rinehart | . . . | cycling | kayaking | tennis | hiking |
What's wrong? First, you'll always have employees who participate in more than four sports (or whatever other number you've allowed). Second, to find out who plays tennis, you'll have queries like:"...where Employee.Sport1='tennis' OR Employee.Sport2='tennis' OR Employee.Sport3='tennis' OR ...".
(And as your poor staff creates these phrases for queries and reports, you then add four more sports fields and all those queries and reports need updating. You will not be loved.)
The right way to do it is to create a detail table for each repeating characteristic:
Employee Sports Table, Sound Design | ||
Employee Sports ID | Employee ID | Sport |
---|---|---|
1000001 | 100001 | cycling |
1000002 | 100001 | kayaking |
1000003 | 100001 | tennis |
1000004 | 100001 | mountain climbing |
1000005 | 100001 | snowshoeing |
1000006 | 101234 | jogging |
1000007 | 101234 | rock climbing |
1000008 | . . . |
Now that query becomes "...Sport = 'tennis'", which is what it should have been from the beginning. Also, you can add as many (or as few) sports for each employee as you need.
I'm not sure that I'd allow words in the sport field here. In addition to the simple typos, this will have one person "mountain climbing" while another goes "mountaineering." To prevent this, you could add a sport table, this way:
|
|
Now the query is "...Sport = 1007"
which is not so friendly. On the other hand, if you use the Sport table to populate a multi-selecting dropdown list and let your query text be generated behind the scenes the staff will love you as they click to select, shift-click for ranges and control-click for more selections.
Now about those widgets. Do all widgets come in the same colors, or do small widgets come in different colors from the medium and large ones? If they all come in the same colors, then add a color detail table and a size detail table, this way:
|
|
|
If the colors vary by size, create the same size detail table and another color detail table that is a detail about the size detail. In the widget table you have data about all widgets. In the widget-size table you have data about small widgets, such as their weight. In the widget-size-color table you have the available colors for each size of widget. It will look like this:
|
|
|
This design works for different sets of colors applying to different sizes of widgets. You see here that the small widgets come in red and white; the medium widgets come in blue.
Note here that however you do it, both these designs record the fact that small widgets weigh 10 ounces in exactly one place, fixing the original problem.
In general, you keep creating detail tables until you don't have any repeating characteristics remaining. It doesn't happen often in practice, but there are times when you'll need detail tables to support other detail tables. There are also more times when you'll need the other main type of detail table, the history table.
top |
Design |
RDBMS |
Step1 |
Step2 |
Step3 |
Step4 |
Step5 |
Step6 |
Final |
Normalize |
Prove |
If you will need to go back and regenerate an old invoice, you'll need to be able to get the correct price on the invoice date. For this, use a history table:
Widget Price History Table | |||
WPH ID | Widget-Size ID | Start Date | Price |
---|---|---|---|
. . . | |||
10123 | 1001 | 19990115 | 19,95 |
. . . | |||
10234 | 1001 | 20010412 | 29,95 |
. . . | |||
10345 | 1001 | 20030512 | 24.95 |
. . . | |||
10456 | 1001 | 20041214 | 26,95 |
. . . |
You see that it starts with its own ID, as do all our tables. It has the ID of the widget-size it is pricing (in this case, small widgets), a start date and a price. An end date is implied - the day before the next price starts.
Systems that I engineer automatically keep histories for every characteristic (to provide roll back and roll forward, just in case Murphy's Law still applies). If you do this you don't need to include history tables in the design — you're done. If you don't have this feature, design all the history tables that you'll need and your ORE&D design is done.
top |
Design |
Final |
Normalize |
Prove |
With a little head start and a little coaching, my clients do their own design, as they're closer to their business than any outsider. It's not rocket science.
Did I say that your design could be done? Well, the truth is that it can only be "done" if your enterprise keeps right on doing what it does, without change and your information needs don't change. As enterprises are dynamic, "done" doesn't usually last very long. The good news is that if your design is good, it will be easy to make changes.
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
Over the years many "normal" design forms have been described. Here I'll explain the five main forms and one intermediate form which will be very helpful when we get to proving ORE&D's designs are correctly normalized.
The higher normal forms each remove a class of design flaws in lower-numbered forms. Third normal form used to be considered "good enough" as the problems corrected going to fourth and fifth normal forms were not common. Since an ORE&D design is in fifth normal form, we're not going to settle for "good enough."
Now a warning. The jargon here comes from mathematics where relation is the term for a table. The jargon is dense and the concepts are formal — which is good if you're doing math, but not so good for passers by who want to understand the concepts. I'll translate all of this into English.
First, I'll propose the Granny test. Have "a place for everything and everything in its place." A design that doesn't make Granny happy has problems. If small widgets weigh 10 ounces, record that fact just once, not once with every color. If you do the latter, it's inevitable that you'll end up with small widgets with different weights.
These are the formal definitions:
First Normal Form | A relation is in 1NF if and only if all underlying domains contain atomic values only. http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/1nf.html |
Second Normal Form | A relation is in 2NF if it is in 1NF and every non-key attribute is fully [functionally] dependent on each candidate key of the relation. http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/2nf.html |
Third Normal Form | A relation R is in 3NF if it is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R. http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/3nf.html |
Boyce-Codd Normal Form | A relation R is said to be in BCNF if [it is in 1NF and] whenever X -> A holds in R, and A is not in X, then X is a candidate key for R. http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/bcnf.html |
Fourth Normal Form | A relation R is in 4NF if [it is in 3NF and], whenever a multivalued dependency
X [->>] Y holds then either
(a) the dependency is trivial, or
(b) X is a candidate key for R. http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/4nf.html |
Fifth Normal Form | A relation R is in 5NF [if it is in 4NF and]... if for all join dependencies at least one of the following holds.
(a) (R1, R2, ..., Rn) is a trivial join-dependency (that is, one of Ri is R)
(b) Every Ri is a candidate key for R. http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/5nf.html |
Let's turn these definitions into English that we can all understand. We'll start with first normal form.
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
A relation is in 1NF if and only if all underlying domains contain atomic values only.
A "relation" is the mathematical term for a table. "1NF" is shorthand notation for first normal form. A "domain," as used here, is simply a cell in the table. By "atomic" the definition means contains just a single value. So we could rewrite, this way:
A table is in first normal form if and only if each cell contains just one value.This is the classic definition of first normal form, and it's not a good one. First, there's atomicity.
If you remember the history of science, the atom was postulated as the indivisible building block from which all matter is constructed. Well, it turned out not to be so atomic. Our table's atoms can be like that, too. Consider a date, such as 20050519. That's composed of a year, month and day. But that doesn't disqualify a table with dates from being in first normal form.
Next, there's the issue of domains. The domain of a column in a table is the set of values allowed. One column could be restricted to dates, another to names and a third to integers within some range. A most common restriction: the foreign keys in this column must be primary keys in some other table. Columns aren't restricted to a single domain in this definition, but that shouldn't worry you. Your RDBMS will definitely impose this restriction.
If you try to put more than one fact in a cell, you flunk the Granny test since you'll be putting multiple things in one place.
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on each candidate key of the relation.
In an ORE&D design we use abstract IDs as our keys. There are other possibilities. For example, you could use names as keys to any table of people. This will work until the second "William Smith" appears in a list that already has a "William Smith". Keys must be unique — permanently unique. An RDBMS will not allow you to add the second "William Smith" if you specified that the name was the primary key. So you might specify a combination, such as name and phone number or name and address as a key. These other possibilities are "candidate keys."
"Functional dependency" simply means that the key determines a single value. In a people table key 1234 might identify "William Smith" at a particular address. Key 4321 might identify another "William Smith" at a different address. Both keys identify a single record with a single name, so the names are functionally dependent on the keys. The reverse is not true. The name "William Smith" identifies two different keys, so the keys are not functionally dependent on the name.
One more item and we're ready to put second normal form into English. Each record contains multiple fields, also called "attributes." A combination of two or more attributes could be a candidate key. In a people table, we might combine name and postal code to get a key. (Well, someone might. We wouldn't — we'd use an abstract, permanently unique key.) "Full functional dependency" specifies that the values must be unique for the whole key, and not for any part. You have multiple people in one postal code. You have more than one "William Smith." But if you have only one "William Smith" in a given postal code, then the attributes (city, phone number, etc.) of that "William Smith" are fully functionally dependent on the name plus postal code key. So:
A table is in second normal form if it is in first normal form and each candidate key, but not just a part of a candidate key, identifies just one record.
This is also expressed with more wit by many database normalizers this way (referring to Ted Codd, the inventor of relational databases):
All the data in the table depends on the key, the whole key and nothing but the key, so help me Codd.
If you're designing a database, you have to separate your attributes (fields) into different tables to satisfy this constraint. For example, you have customers and products. You'll put the product data into one table and the customer data into another table to go from first to second normal form.
In first normal form, you can have the product ID, product attributes, customer ID and customer attributes in one table. That's not second normal form because the customer attributes don't depend on the product ID; the product attributes don't depend on the customer ID. Granny would know that you put customer data in one table and product data in another table.
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
A relation R is in 3NF if it is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R.To violate 3NF takes deliberate violations of the Granny principle. Let's put the product data into the Sale table to see this in action.
Very Bad Design! | ||||||||
Sale ID | Sale Date | Product ID | Product Name | Other Product Data |
||||
. . . | ||||||||
10001234 | 20050519 | 100432 | Widget | . . . | ||||
. . . |
The sale ID (primary key) determines the product ID (foreign key). The product ID specifies the product name. The product name is transitively determined by the sale ID. The data about the product should be in the product table, where product ID is the primary key.
Putting the information identified by product ID, a foreign key, into this table means that every sale of product 100432 also carries the name "Widget". You'd need to update lots of records if you ever changed that name. Granny doesn't approve. In English,
A table is in third normal form if it includes foreign keys, but not any data that you'd find by looking up the foreign keys in their own tables.Foreign key lookups (joins) are fast, cheap and easy.
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
A relation R is said to be in BCNF if [it is in 1NF and] whenever X -> A holds in R, and A is not in X, then X is a candidate key for R.Let's start with "X -> A" — the spoken version is "X implies A" — meaning that for any X, which is one or more fields in a record used as a key, you find exactly one value for A, another attribute (field). Let's look at a single record in a simplified Customer table:
Simple Customer Table | |||||
Customer ID | Last Name | First Name | Address | Phone | |
. . . | |||||
100023 | Smith | William | 1234 Main St. | (123) 456-7890 | BillS@MailService.com |
. . . |
Here the Customer ID, 100023, since it is unique in the Customer table implies that the last name will be "Smith", that the first name is "William" and so on. There is another candidate key here, too. The Email should be unique. (Should be, but some families share a family mail address, so don't depend on it being unique.) Similarly, the combination of last name, first name and phone might be unique. Of course, Bill might have a son, Bill Jr., who lives with him.
So here the Customer ID implies each of the other attributes and is the only field that unambiguously implies each of the other attributes. This table is in Boyce-Codd normal form. When is a third normal form table not Boyce-Codd normal? That happens when you have multiple compound candidate keys (name plus address for one, name plus phone for another) and the candidate keys share at least one field. If we were using name plus address as a key, and name plus phone as another, the table would be in third normal form, but not in Boyce-Codd normal form.
Note that every Boyce-Codd normal form table is in third normal form (and therefore is in second normal form as well) but the definition only requires first normal form.
Mathematicians worry a lot about completeness, which is a good thing for mathematics, but from now on I'm going to ignore the trivial cases. Customer ID 100023 implies that the Customer ID is 100023. I'm glad that we have mathematicians who worry about such things. On the other hand, there are some other people who need to get on with designing databases.
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
A relation R is in 4NF if [it is in 3NF and], whenever a multivalued dependency X [->>] Y holds then either (a) the dependency is trivial, or (b) X is a candidate key for R.Note: the most common usage is "X -> Y" for "X implies Y" and "X ->> Y" for "X multivalued implies Y". The original quoted here used "multivalued dependency X -> Y" — I've edited in the more common notation.
Fourth and fifth normal forms go from properties of a single table to multi-table properties. Fourth normal introduces the concept of multivalued dependencies. When we were designing our databases I called these "repeating characteristics." Let's say widgets can be red, white or blue. So if the product is a widget, the color is one of red, white or blue. The product ID for widgets implies these three colors. Fourth normal form simply says that we'll not have more than one of these dependencies per table.
If all widgets come in red, white and blue, this is a good, fourth normal form, design:
|
|
|
How do we violate fourth normal form? Simple. Combine the color and size values into a single table, this way:
|
|
What's wrong with this design? Start with the first data record. It tells you that widgets can be red. It also tells that widgets can be small, and small widgets weigh 10 ounces. It puts two facts into one record, two things in one place, which flunks the Granny test. What happens when you stop carrying red widgets? If you delete this record, you also lose the facts that you carry small widgets and that small widgets weigh 10 ounces.
When the design is in fourth normal form, to find all the sizes of widgets you carry you ask the Widget-Size Table for all records where the Product ID is 10001234. With the bad design, you ask the same thing of the Widget Color and Size Table and you get as many as six records for two sizes and three colors (small red, small white, small blue, medium red, ...).
top |
Design |
Normalize |
1st NF |
2nd NF |
3rd NF |
Boyce-Codd |
4th NF |
5th NF |
Prove |
A relation R is in 5NF [if it is in 4NF and]... if for all join dependencies at least one of the following holds. (a) (R1, R2, ..., Rn) is a trivial join-dependency (that is, one of Ri is R) (b) Every Ri is a candidate key for R.
I had lots of trouble understanding fifth normal form. My two books left open the question of whether ORE&D designs were in fifth normal form. When I finally figured this form out I realized immediately that ORE&D is a fifth normal form design.
Fifth normal form requires that you break down tables, but only as long as the breakdown mirrors the real world reality. Here's an example. Assume that Ford and Chevy dealers carry the full line of Ford and Chevy products. This is a fifth normal form design (greatly simplified) for auto dealers.
|
|
What's good about this design? Consider how well it supports changes. If Chevy stops building cars, you delete one record. This will automatically delete Chevy cars from both Charley's Chevrolet and Huge Automall's product lines. Similarly, if Chevy introduces a new product it will automatically be part of Charley's Chevrolet and Huge Automall's product lines.
The next table is not in fifth normal form because you can split it into the two tables above:
Not Fifth Normal Dealer Table | ||
Dealer | Manufacturer | Product |
Charley's Chevrolet | Chevy | Cars |
Charley's Chevrolet | Chevy | Trucks |
Fanny's Ford | Ford | Cars |
Fanny's Ford | Ford | Trucks |
Huge Automall | Chevy | Cars |
Huge Automall | Chevy | Trucks |
Huge Automall | Ford | Cars |
Huge Automall | Ford | Trucks |
The non-fifth normal form table is the join of the two fifth normal form tables. Now, if we change the definition of the world we are modeling, we can use the same table as a correct, fifth normal form table. The key assumption was that the dealer carries the manufacturer's full line. This is why you can join the Manufacturer Table and the Dealer Table to get the list of all the products carried by the dealers. If we relax that restriction and allow the dealer to choose exactly which products they will carry, then we can no longer do this.
Assume that Charley only carries cars, and Fanny just sells trucks. Huge Automall sells everything. This is the correct, fifth normal form design that shows this:
Fifth Normal Form Dealer Table | ||
Dealer | Manufacturer | Product |
Charley's Chevrolet | Chevy | Cars |
Fanny's Ford | Ford | Trucks |
Huge Automall | Chevy | Cars |
Huge Automall | Chevy | Trucks |
Huge Automall | Ford | Cars |
Huge Automall | Ford | Trucks |
Again, you could just ask Granny. If the dealer carries all products from their manufacturer, then you list the manufacturer's products once and don't repeat the list for every dealer. If each dealer chooses the products it will carry, then you need to record each selected product for every dealer.
top |
Design |
Normalize |
Prove |
1st NF |
Boyce-Codd |
5th NF |
Steps |
To prove that ORE&D is the best way, we need to show that ORE&D designs are as good as classic normalization's designs. We need to prove that ORE&D designs are in fifth-normal form.
Let's begin with first-normal form.
It is possible to put non-atomic values into cells in an RDBMS, but I'm not going to explain how. Nice people just don't do such things. All our values are atomic and our tables are all in first normal form.
From here there are two paths to fifth normal form: leaping and one at a time. I'll start with the leaping proof, which takes little time, but relies on the brilliant work of the pioneers of this field. After that's done, we'll go one step at a time, just to show that you can depend on those brilliant pioneers.
top |
Design |
Normalize |
Prove |
1st NF |
Boyce-Codd |
5th NF |
Steps |
Our tables all include an ID, which is a unique, abstract and single-valued key. Our ID implies every other attribute (field) in its tuple (record). And it is a candidate key. (Actually, it's the primary and only key.) Therefore, our tables are in Boyce-Codd normal form.
"A key is simple if it consists of a single attribute. It is shown that if a relation schema is in third normal form and every key is simple, then it is in projection-join normal form (sometimes called fifth normal form),"In an ORE&D design, every key is simple and therefore, an ORE&D design is in fifth normal form. QED.
I'd like to take this opportunity to thank Boyce, Codd, Date and Fagin for having done all the hard work. And I'd like to thank IBM for its fine online library, where you can find Simple Conditions for Guaranteeing Higher Normal Forms for Relational Databases.
top |
Design |
Normalize |
Prove |
1st NF |
Boyce-Codd |
5th NF |
Steps |
The second normal form is achieved when our data is "about the key, the whole key and nothing but the key." It's here that we separate Product data from Customer data. The ORE&D method never intermingles — the design process starts by identifying object and event classes, which separates the data from the beginning. By having unique, abstract keys we guarantee functional dependence, and since we don't have compound fields, we also guarantee full functional dependence. We're in second normal form.
Every time we use a foreign key (Customer ID, outside the Customer table) we use it to refer to the customer data. We don't ever accompany the Customer ID with other data from the Customer table. We assume that we'll look it up in the Customer table (do a join), as required. So the design method eliminates transitive dependencies. We're in third normal form.
Fourth normal would be violated if we put multiple repeating characteristics into a single table. The design method is to create a separate detail table for each repeating characteristic, which means we're sticking to fourth normal form.
Finally, there's fifth normal form, where I'll defer to the math of Date and Fagin. If you go back to the original distinction made between objects (which have an independent existence) and events (which relate objects at a point in time) you'll see the areas in which the designer has been looking at the objects carefully to see how they get related. It was our good fortune to stick to the real world, and the Granny principle. That led us to fifth normal form.
© 2005 by Martin Rinehart