Database Design

Getting to Fifth Normal Form the Easy Way


This article has three parts:
top
Design
RDBMS
Step1
Step2
Step3
Step4
Step5
Step6
Final
Normalize
Prove

How to Design a Database

Some years ago I designed a relational database, although I had no training in the art. The design had to be a good one, so I got a database text and learned how to "normalize" the design. The normalization process gets rid of the design errors.

Much to my surprise, my original design was correctly normalized, even before I knew what "normalization" meant. So I jotted down some notes about the steps I had taken to do the design. A few databases and a few refinements led me to an organized design method. Then I discovered that my method led immediately to Boyce-Codd normal form. (Boyce-Codd is superior to the then-standard third normal form.) I returned to normalization theory and was able to prove that my method led to fourth normal form, and then I found a proof that my designs were in another intermediate form stopping just short of the final fifth normal form. In the years since I've developed other successful databases using this procedure, and I've discovered this year (2005) the final proof that my designs are, in fact, in fifth normal form. This method yields a correctly normalized design, every time.

I've named it ORE&D (pronounced Ore 'n' Dee) after its four main table classes:

If you just want to design a good database, this is the only section of this article that you need to read. If you have a "show me" and "prove it" attitude, the other two sections describe the traditional process and then prove that ORE&D designs are correctly normalized. Let's get started. There are six steps in the ORE&D method.
  1. List the Objects
  2. List the Events
  3. List the characteristics of each Object and Event
  4. Design Relationship Tables
  5. Design Repeating Characteristic Tables
  6. Design History Tables
Before we get started, let's take a look at relational database management systems (RDBMSs).

top
Design
  RDBMS
Step1
Step2
Step3
Step4
Step5
Step6
Final
Normalize
Prove

Relational Database Management Systems

A relational database is one where all the data is stored in tables. I'll skip the discussion of relational vs. other database types. All the leading database systems (Oracle, DB2, SQL Server, MySQL) are relational. Over the years they've been tuned for peak performance and they've grown a fine crop of tools, such as form painters and report writers. Unless you have very special needs, it pays to stick to the mainstream.

If you haven't already picked a database, Oracle and DB2 dominate the $8 billion RDBMS market. On the other hand, a free RDBMS such as MySQL running on an open source OS, such as Linux or FreeBSD, provides a remarkably capable solution for small and medium-sized databases.

In a relational database you'll have a table for every type of thing that you track. Let's say you sell products to customers. Begin with a customer table and a product table. These are some of the columns you'll probably have in your tables.

Product Table Customer Table
  • Product ID
  • Name
  • Size
  • Color
  • Shipping Weight
  • Customer ID
  • First Name
  • Last Name
  • Address
  • Home Phone
  • Mobile Phone

Here are a few rows from the Product table:

Product Table
Product ID Name Size Color Shipping Weight
. . .
10001234 Widget Small Red 10
10001235 Widget Small Blue 10
10001236 Widget Medium White 17
. . .
Terminology
  • record — a row in the table
  • field — a cell in a row
  • key — a unique row identifier

(Do you see the design flaw in this table? We'll get to it. Relational database pioneer Chris Date called normalization "formalized common sense." Everything known about relational design was summarized when Granny said, "A place for everything and everything in its place.")

drawing of a key Here you'll meet keys, which are absolutely key to understanding the relational database system, the ORE&D design method and classic database normalization. If this is new to you, this is the section where you have to slow down and be absolutely sure you understand primary and foreign keys.
drawing of a key The Product ID field is called the key. Given a key (10001235, for example), a computer can find an individual record (small, blue, 10 oz. widget) nearly instantaneously, even if there are millions of records. The key that identifies a particular record is called the primary key.
drawing of a key The simplest key system, consecutive integers, is the best. Your first record is assigned key zero. The next is given key one, and so on. When a record is deleted its key is never reused. (Some databases use random integers for keys. Don't use these for anything more important than keeping track of your CD collection. This is explained in my java database book in more detail.)
drawing of a key Let's show how these keys are used with a very simple Sales table - it allows customers to buy one product at a time.
drawing of a key
Sale Event Table
Sale ID Date Customer ID Product ID Quantity Price
. . .
1000123456 20050214 123456 10001235 3 59.85
. . .
picture of a house key In this example, customer 123456 bought 3 small, blue widgets (product 10001235 looked up in the Product Table, above). We don't store data about the customer or product in the Sales table, just the keys to the appropriate records in the Customer and Product tables. A join process takes care of looking up the customer's address and the product's name, size and color when it's time to generate a shipping label or an invoice. In this example the Sale ID is the record's primary key. The Customer ID and Product ID fields are called foreign keys.
picture of a house key A handy trick, if your database allows, is to use a large starting key to keep your records aligned when they're dumped to screen or print. Here's our Sales table, both ways:
picture of a house key
Start with Large Keys for Alignment
Aligned

10000000 20050214 123456 10001235 3 59.85
10000015 20050215 100031 10001768 1 19,95
10000123 20050301 100566 10000001 2 4.48
10001234 20050305 104512 10023455 3 66.11
Started at zero

0 20050214 123456 1235 3 59.85
15 20050215 31 1768 1 19.95
123 20050301 566 1 2 4.48
1234 20050305 4512 23455 3 66.11

By keeping the customer data in one record, if the customer moves, we have exactly one customer record to update. If we find a way to pack those widgets that saves an ounce, we update the product table. Here it is again.

Product Table
Product ID Name Size Color Shipping Weight
. . .
10001234 Widget Small Red 10
10001235 Widget Small Blue 10
10001236 Widget Medium White 17
. . .

Now we're up to the design flaw. Did you notice that the weight of small widgets is recorded in two records? We'd have to update them both. Granny didn't say, "... and everything in two places." If there were more colors, we'd repeat that weight in more places. In our example the small widgets are next to each other. In reality, you'll start with red, white and blue widgets. A month or a year later you'll add purple and gold widgets — the records won't be adjacent.

A worse problem is that if you stop carrying white widgets a delete of the single white widget also deletes the weight of medium widgets. Collectively, problems like these are called "update anomalies" and the goal of database design is to eliminate them.

We'll see when we get to the fifth ORE&D step how that should have been designed. Now let's actually start designing.

top
Design
RDBMS
  Step1
Step2
Step3
Step4
Step5
Step6
Final
Normalize
Prove

Step 1. List the Objects

Common Object Classes
Customers
Employees
Departments
Teachers
Languages
Products
Assemblies
Parts
In our example above, we had customers and products as the two classes of objects about which we needed data. The first step in design is to list the different classes of objects. What's an object?

Anything that you can kick is an object. More formally, an object has a long or permanent duration and it exists independently from any other object. Abstract, non-kickable objects include music compositions and mathematical formulae.

In practice, listing the kickable things generally works.

top
Design
RDBMS
Step1
  Step2
Step3
Step4
Step5
Step6
Final
Normalize
Prove

Step 2. List the Events

Common Event Classes
Sales
Shipping
Receiving
Insuring
Depositing
Withdrawing
Paying
Billing
In the example above, we had a Sale event. It created a relationship among objects at a point in time. A customer bought 3 small blue widgets, February 14, 2005. An event that doesn't involve our objects is not relevant. An important event that doesn't involve our objects means we'd better add to our list of objects.

top
Design
RDBMS
Step1
Step2
  Step3
Step4
Step5
Step6
Final
Normalize
Prove

Step 3. List the characteristics of each Object and Event

Customer Characteristics
Customer ID
Prefix
First Name
Middle Name
Last Name
Suffix
Address Line 1
Address Line 2
City
State or Province
Country
Postal Code
Home Phone
Office Phone
Mobile Phone
Email Address
Date met
Credit Rating
Begin every table with an ID. Then list the other characteristics. For customers you'll want names, probably prefix (Mr., Ms., Dr., etc.) and suffix ( Jr., III, etc.). Two lines of street address, a city or town, political subdivision (state, province, etc.), country and postal code, phone and email address and any other information about the customer (but not about your transactions with the customer — those will be entered in event tables).

How elaborate you get depends entirely on your needs. An Internet retailer may ship product all over the world. A pizza parlor probably won't need a "country" field in the address.

For events, you'll have an ID (primary key), time stamp, IDs of the objects involved (foreign keys) and other characteristics. For a sale you'll have price, shipping method, payment characteristics and so on.

Depending on your application, you might want a finer time stamp than just a day. A fast-food restaurant might track sales by minute, so they know how many burgers to have on the grill at noon and how many to have on the grill at 3:30 P.M.

Sometimes you'll meet events that don't happen at a point in time, but happen over a period of time. (Nothing really happens at a single point in time but most database events can be thought of that way.) Homebuilders start a house one day and finish weeks later.

This design method isn't a law. It's a guide. You have to adapt to your reality. If your data includes long-duration events, replace the time stamp with start and end times.

top
Design
RDBMS
Step1
Step2
Step3
  Step4
Step5
Step6
Final
Normalize
Prove

Step 4. Design Relationship Tables

Relational databases ought to be able to handle relationships, right? Actually "relational" refers to mathematical relations — you and I call them tables — and have nothing to do with relationships. Relationships are one of the trickier things to handle in a relational database. You handle them by using foreign keys.

What relationships do we need to track? That depends on your application. If Fred and Sally are married to each other and are both customers, you probably should store their relationship in your head, not in your database. On the other hand, if they both work for you and you provide health benefits, you'll need the relationship in the database. Without it, you'll pay to insure Fred, Fred's wife, Sally and Sally's husband.

Here's a relationship table:

Employee Relationship Table
Relation ID Start Date End Date Employee ID Related
Employee ID
Relationship
1012 19950615   101234 104321 married
1013 19570123   105654 103898 parent
1014 19810615 19920715 104567 103456 married

You've seen the ORE in ORE&D, now we're going on to the D, details. There are two main types of Detail table, repeating characteristics and histories.

top
Design
RDBMS
Step1
Step2
Step3
Step4
  Step5
Step6
Final
Normalize
Prove

Step 5. Design Repeating Characteristic Tables

An insurance company wants to know what sports its employees participate in to see if there is a relationship between sports and health costs. The beginner would modify the Employee table, this way:

Very Bad Design!
Employee ID Name Other
Data
Sport 1 Sport 2 Sport 3 Sport 4
100001 Martin Rinehart . . . cycling kayaking tennis hiking

What's wrong? First, you'll always have employees who participate in more than four sports (or whatever other number you've allowed). Second, to find out who plays tennis, you'll have queries like:"...where Employee.Sport1='tennis' OR Employee.Sport2='tennis' OR Employee.Sport3='tennis' OR ...". (And as your poor staff creates these phrases for queries and reports, you then add four more sports fields and all those queries and reports need updating. You will not be loved.)

The right way to do it is to create a detail table for each repeating characteristic:

Employee Sports Table, Sound Design
Employee Sports ID Employee ID Sport
1000001100001cycling
1000002100001kayaking
1000003100001tennis
1000004100001mountain climbing
1000005100001snowshoeing
1000006101234jogging
1000007101234rock climbing
1000008. . .  

Now that query becomes "...Sport = 'tennis'", which is what it should have been from the beginning. Also, you can add as many (or as few) sports for each employee as you need.

I'm not sure that I'd allow words in the sport field here. In addition to the simple typos, this will have one person "mountain climbing" while another goes "mountaineering." To prevent this, you could add a sport table, this way:

Sport Table
Sport IDSport
1001cycling
1002jogging
1003kayaking
1004mountain climbing
1005rock climbing
1006snowshoeing
1007tennis
Employee Sports Table
Employee Sports ID Employee ID Sport ID
10000011000011001
10000021000011003
10000031000011007
10000041000011004
10000051000011006
10000061012341002
10000071012341005
1000008. . .  

Now the query is "...Sport = 1007" which is not so friendly. On the other hand, if you use the Sport table to populate a multi-selecting dropdown list and let your query text be generated behind the scenes the staff will love you as they click to select, shift-click for ranges and control-click for more selections.

Now about those widgets. Do all widgets come in the same colors, or do small widgets come in different colors from the medium and large ones? If they all come in the same colors, then add a color detail table and a size detail table, this way:

Product Table
Product IDName
...  
10001234Widget
...  
Widget-Color Table
W-C IDProduct IDColor
. . .    
100110001234red
100210001234white
100310001234blue
. . .    
Widget-Size Table
W-S ID Product ID Size Weight
...      
100110001234small10
100210001234medium17
...      

If the colors vary by size, create the same size detail table and another color detail table that is a detail about the size detail. In the widget table you have data about all widgets. In the widget-size table you have data about small widgets, such as their weight. In the widget-size-color table you have the available colors for each size of widget. It will look like this:

Product Table
Product IDName
...  
10001234Widget
...  
Widget-Sizes Table
W-S ID Product ID Size Weight
...      
100110001234small10
100210001234medium17
...      
Widget-Size Color Table
W-S Colors IDW-S IDColor
. . .    
10011001red
10021001white
10031002blue
. . .    

This design works for different sets of colors applying to different sizes of widgets. You see here that the small widgets come in red and white; the medium widgets come in blue.

Note here that however you do it, both these designs record the fact that small widgets weigh 10 ounces in exactly one place, fixing the original problem.

In general, you keep creating detail tables until you don't have any repeating characteristics remaining. It doesn't happen often in practice, but there are times when you'll need detail tables to support other detail tables. There are also more times when you'll need the other main type of detail table, the history table.

top
Design
RDBMS
Step1
Step2
Step3
Step4
Step5
  Step6
Final
Normalize
Prove

Step 6. Design History Tables

History tables are just like the detail tables we use for repeating characteristics, but in this case we'll have a single characteristic varying over time. While almost all characteristics can change over time, we'll need history tables for only some of them. Prices are one example.

If you will need to go back and regenerate an old invoice, you'll need to be able to get the correct price on the invoice date. For this, use a history table:

Widget Price History Table
WPH IDWidget-Size IDStart DatePrice
. . .      
1012310011999011519,95
. . .      
1023410012001041229,95
. . .      
1034510012003051224.95
. . .      
1045610012004121426,95
. . .      

You see that it starts with its own ID, as do all our tables. It has the ID of the widget-size it is pricing (in this case, small widgets), a start date and a price. An end date is implied - the day before the next price starts.

Systems that I engineer automatically keep histories for every characteristic (to provide roll back and roll forward, just in case Murphy's Law still applies). If you do this you don't need to include history tables in the design — you're done. If you don't have this feature, design all the history tables that you'll need and your ORE&D design is done.

top
Design
  Final
Normalize
Prove

Final Design Notes

These neat steps in no way parallel the actual design process. I note objects and events as I learn about them and sketch characteristic lists as I go. If a characteristic is repeating, I immediately add its detail table instead of listing it with the non-repeating characteristics. And that's only to get a design started.

With a little head start and a little coaching, my clients do their own design, as they're closer to their business than any outsider. It's not rocket science.

Did I say that your design could be done? Well, the truth is that it can only be "done" if your enterprise keeps right on doing what it does, without change and your information needs don't change. As enterprises are dynamic, "done" doesn't usually last very long. The good news is that if your design is good, it will be easy to make changes.

top
Design
Normalize
1st NF
2nd NF
3rd NF
Boyce-Codd
4th NF
5th NF
Prove

Database Normalization

The classic design "methodology" is normalization. List everything you need to know, then put your design in first normal form. Once you're in first, proceed to second normal form, and so on, up to fifth.

Over the years many "normal" design forms have been described. Here I'll explain the five main forms and one intermediate form which will be very helpful when we get to proving ORE&D's designs are correctly normalized.

The higher normal forms each remove a class of design flaws in lower-numbered forms. Third normal form used to be considered "good enough" as the problems corrected going to fourth and fifth normal forms were not common. Since an ORE&D design is in fifth normal form, we're not going to settle for "good enough."

Now a warning. The jargon here comes from mathematics where relation is the term for a table. The jargon is dense and the concepts are formal — which is good if you're doing math, but not so good for passers by who want to understand the concepts. I'll translate all of this into English.

First, I'll propose the Granny test. Have "a place for everything and everything in its place." A design that doesn't make Granny happy has problems. If small widgets weigh 10 ounces, record that fact just once, not once with every color. If you do the latter, it's inevitable that you'll end up with small widgets with different weights.

These are the formal definitions:

First Normal FormA relation is in 1NF if and only if all underlying domains contain atomic values only.
http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/1nf.html
Second Normal FormA relation is in 2NF if it is in 1NF and every non-key attribute is fully [functionally] dependent on each candidate key of the relation.
http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/2nf.html
Third Normal FormA relation R is in 3NF if it is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R.
http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/3nf.html
Boyce-Codd Normal Form A relation R is said to be in BCNF if [it is in 1NF and] whenever X -> A holds in R, and A is not in X, then X is a candidate key for R.
http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/bcnf.html
Fourth Normal Form A relation R is in 4NF if [it is in 3NF and], whenever a multivalued dependency X [->>] Y holds then either (a) the dependency is trivial, or (b) X is a candidate key for R.
http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/4nf.html
Fifth Normal FormA relation R is in 5NF [if it is in 4NF and]... if for all join dependencies at least one of the following holds. (a) (R1, R2, ..., Rn) is a trivial join-dependency (that is, one of Ri is R) (b) Every Ri is a candidate key for R.
http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/5nf.html

Let's turn these definitions into English that we can all understand. We'll start with first normal form.

top
Design
Normalize
  1st NF
2nd NF
3rd NF
Boyce-Codd
4th NF
5th NF
Prove

First Normal Form

A relation is in 1NF if and only if all underlying domains contain atomic values only.

A "relation" is the mathematical term for a table. "1NF" is shorthand notation for first normal form. A "domain," as used here, is simply a cell in the table. By "atomic" the definition means contains just a single value. So we could rewrite, this way:

A table is in first normal form if and only if each cell contains just one value.
This is the classic definition of first normal form, and it's not a good one. First, there's atomicity.

If you remember the history of science, the atom was postulated as the indivisible building block from which all matter is constructed. Well, it turned out not to be so atomic. Our table's atoms can be like that, too. Consider a date, such as 20050519. That's composed of a year, month and day. But that doesn't disqualify a table with dates from being in first normal form.

Next, there's the issue of domains. The domain of a column in a table is the set of values allowed. One column could be restricted to dates, another to names and a third to integers within some range. A most common restriction: the foreign keys in this column must be primary keys in some other table. Columns aren't restricted to a single domain in this definition, but that shouldn't worry you. Your RDBMS will definitely impose this restriction.

If you try to put more than one fact in a cell, you flunk the Granny test since you'll be putting multiple things in one place.

top
Design
Normalize
1st NF
  2nd NF
3rd NF
Boyce-Codd
4th NF
5th NF
Prove

Second Normal Form

A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on each candidate key of the relation.

In an ORE&D design we use abstract IDs as our keys. There are other possibilities. For example, you could use names as keys to any table of people. This will work until the second "William Smith" appears in a list that already has a "William Smith". Keys must be unique — permanently unique. An RDBMS will not allow you to add the second "William Smith" if you specified that the name was the primary key. So you might specify a combination, such as name and phone number or name and address as a key. These other possibilities are "candidate keys."

"Functional dependency" simply means that the key determines a single value. In a people table key 1234 might identify "William Smith" at a particular address. Key 4321 might identify another "William Smith" at a different address. Both keys identify a single record with a single name, so the names are functionally dependent on the keys. The reverse is not true. The name "William Smith" identifies two different keys, so the keys are not functionally dependent on the name.

One more item and we're ready to put second normal form into English. Each record contains multiple fields, also called "attributes." A combination of two or more attributes could be a candidate key. In a people table, we might combine name and postal code to get a key. (Well, someone might. We wouldn't — we'd use an abstract, permanently unique key.) "Full functional dependency" specifies that the values must be unique for the whole key, and not for any part. You have multiple people in one postal code. You have more than one "William Smith." But if you have only one "William Smith" in a given postal code, then the attributes (city, phone number, etc.) of that "William Smith" are fully functionally dependent on the name plus postal code key. So:

A table is in second normal form if it is in first normal form and each candidate key, but not just a part of a candidate key, identifies just one record.

This is also expressed with more wit by many database normalizers this way (referring to Ted Codd, the inventor of relational databases):

All the data in the table depends on the key, the whole key and nothing but the key, so help me Codd.

If you're designing a database, you have to separate your attributes (fields) into different tables to satisfy this constraint. For example, you have customers and products. You'll put the product data into one table and the customer data into another table to go from first to second normal form.

In first normal form, you can have the product ID, product attributes, customer ID and customer attributes in one table. That's not second normal form because the customer attributes don't depend on the product ID; the product attributes don't depend on the customer ID. Granny would know that you put customer data in one table and product data in another table.

top
Design
Normalize
1st NF
2nd NF
  3rd NF
Boyce-Codd
4th NF
5th NF
Prove

Third Normal Form

A relation R is in 3NF if it is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R.
To violate 3NF takes deliberate violations of the Granny principle. Let's put the product data into the Sale table to see this in action.

Very Bad Design!
Sale ID Sale Date Product ID Product Name Other
Product Data
. . .        
10001234 20050519 100432 Widget . . .
. . .        

The sale ID (primary key) determines the product ID (foreign key). The product ID specifies the product name. The product name is transitively determined by the sale ID. The data about the product should be in the product table, where product ID is the primary key.

Putting the information identified by product ID, a foreign key, into this table means that every sale of product 100432 also carries the name "Widget". You'd need to update lots of records if you ever changed that name. Granny doesn't approve. In English,

A table is in third normal form if it includes foreign keys, but not any data that you'd find by looking up the foreign keys in their own tables.
Foreign key lookups (joins) are fast, cheap and easy.

top
Design
Normalize
1st NF
2nd NF
3rd NF
  Boyce-Codd
4th NF
5th NF
Prove

Boyce-Codd Normal Form

A relation R is said to be in BCNF if [it is in 1NF and] whenever X -> A holds in R, and A is not in X, then X is a candidate key for R.
Let's start with "X -> A" — the spoken version is "X implies A" — meaning that for any X, which is one or more fields in a record used as a key, you find exactly one value for A, another attribute (field). Let's look at a single record in a simplified Customer table:

Simple Customer Table
Customer ID Last Name First Name Address Phone Email
. . .          
100023 Smith William 1234 Main St. (123) 456-7890 BillS@MailService.com
. . .          

Here the Customer ID, 100023, since it is unique in the Customer table implies that the last name will be "Smith", that the first name is "William" and so on. There is another candidate key here, too. The Email should be unique. (Should be, but some families share a family mail address, so don't depend on it being unique.) Similarly, the combination of last name, first name and phone might be unique. Of course, Bill might have a son, Bill Jr., who lives with him.

So here the Customer ID implies each of the other attributes and is the only field that unambiguously implies each of the other attributes. This table is in Boyce-Codd normal form. When is a third normal form table not Boyce-Codd normal? That happens when you have multiple compound candidate keys (name plus address for one, name plus phone for another) and the candidate keys share at least one field. If we were using name plus address as a key, and name plus phone as another, the table would be in third normal form, but not in Boyce-Codd normal form.

Note that every Boyce-Codd normal form table is in third normal form (and therefore is in second normal form as well) but the definition only requires first normal form.

Mathematicians worry a lot about completeness, which is a good thing for mathematics, but from now on I'm going to ignore the trivial cases. Customer ID 100023 implies that the Customer ID is 100023. I'm glad that we have mathematicians who worry about such things. On the other hand, there are some other people who need to get on with designing databases.

top
Design
Normalize
1st NF
2nd NF
3rd NF
Boyce-Codd
  4th NF
5th NF
Prove

Fourth Normal Form

A relation R is in 4NF if [it is in 3NF and], whenever a multivalued dependency X [->>] Y holds then either (a) the dependency is trivial, or (b) X is a candidate key for R.
Note: the most common usage is "X -> Y" for "X implies Y" and "X ->> Y" for "X multivalued implies Y". The original quoted here used "multivalued dependency X -> Y" — I've edited in the more common notation.

Fourth and fifth normal forms go from properties of a single table to multi-table properties. Fourth normal introduces the concept of multivalued dependencies. When we were designing our databases I called these "repeating characteristics." Let's say widgets can be red, white or blue. So if the product is a widget, the color is one of red, white or blue. The product ID for widgets implies these three colors. Fourth normal form simply says that we'll not have more than one of these dependencies per table.

If all widgets come in red, white and blue, this is a good, fourth normal form, design:

Product Table
Product IDName
...  
10001234Widget
...  
Widget-Color Table
W-S Colors IDProduct IDColor
. . .    
100110001234red
100210001234white
100310001234blue
. . .    
Widget-Size Table
W-S ID Product ID Size Weight
...      
100110001234small10
100210001234medium17
...      

How do we violate fourth normal form? Simple. Combine the color and size values into a single table, this way:

Product Table
Product IDName
...  
10001234Widget
...  
Very Bad Design
Widget Color and Size Table
W-CS ID Product ID Color Size Weight
. . .        
100110001234redsmall10
100210001234whitemedium17
100310001234bluesmall10
. . .        

What's wrong with this design? Start with the first data record. It tells you that widgets can be red. It also tells that widgets can be small, and small widgets weigh 10 ounces. It puts two facts into one record, two things in one place, which flunks the Granny test. What happens when you stop carrying red widgets? If you delete this record, you also lose the facts that you carry small widgets and that small widgets weigh 10 ounces.

When the design is in fourth normal form, to find all the sizes of widgets you carry you ask the Widget-Size Table for all records where the Product ID is 10001234. With the bad design, you ask the same thing of the Widget Color and Size Table and you get as many as six records for two sizes and three colors (small red, small white, small blue, medium red, ...).

top
Design
Normalize
1st NF
2nd NF
3rd NF
Boyce-Codd
4th NF
  5th NF
Prove

Fifth Normal Form

A relation R is in 5NF [if it is in 4NF and]... if for all join dependencies at least one of the following holds. (a) (R1, R2, ..., Rn) is a trivial join-dependency (that is, one of Ri is R) (b) Every Ri is a candidate key for R.

I had lots of trouble understanding fifth normal form. My two books left open the question of whether ORE&D designs were in fifth normal form. When I finally figured this form out I realized immediately that ORE&D is a fifth normal form design.

Fifth normal form requires that you break down tables, but only as long as the breakdown mirrors the real world reality. Here's an example. Assume that Ford and Chevy dealers carry the full line of Ford and Chevy products. This is a fifth normal form design (greatly simplified) for auto dealers.

Manufacturer Table
ManufacturerProduct
ChevyCars
ChevyTrucks
FordCars
FordTrucks
Dealer Table
DealerManufacturer
Charley's ChevroletChevy
Fanny's FordFord
Huge AutomallChevy
Huge AutomallFord

What's good about this design? Consider how well it supports changes. If Chevy stops building cars, you delete one record. This will automatically delete Chevy cars from both Charley's Chevrolet and Huge Automall's product lines. Similarly, if Chevy introduces a new product it will automatically be part of Charley's Chevrolet and Huge Automall's product lines.

The next table is not in fifth normal form because you can split it into the two tables above:

Not Fifth Normal
Dealer Table
DealerManufacturerProduct
Charley's ChevroletChevyCars
Charley's ChevroletChevyTrucks
Fanny's FordFordCars
Fanny's FordFordTrucks
Huge AutomallChevyCars
Huge AutomallChevyTrucks
Huge AutomallFordCars
Huge AutomallFordTrucks

The non-fifth normal form table is the join of the two fifth normal form tables. Now, if we change the definition of the world we are modeling, we can use the same table as a correct, fifth normal form table. The key assumption was that the dealer carries the manufacturer's full line. This is why you can join the Manufacturer Table and the Dealer Table to get the list of all the products carried by the dealers. If we relax that restriction and allow the dealer to choose exactly which products they will carry, then we can no longer do this.

Assume that Charley only carries cars, and Fanny just sells trucks. Huge Automall sells everything. This is the correct, fifth normal form design that shows this:

Fifth Normal Form
Dealer Table
DealerManufacturerProduct
Charley's ChevroletChevyCars
Fanny's FordFordTrucks
Huge AutomallChevyCars
Huge AutomallChevyTrucks
Huge AutomallFordCars
Huge AutomallFordTrucks

Again, you could just ask Granny. If the dealer carries all products from their manufacturer, then you list the manufacturer's products once and don't repeat the list for every dealer. If each dealer chooses the products it will carry, then you need to record each selected product for every dealer.

top
Design
Normalize
Prove
1st NF
Boyce-Codd
5th NF
Steps

Proof that the Easy Way Is the Best Way

Given two ways to achieve a result, one simple and one complex, the simple way is the best way, if, and only if, it produces a result as good as the complex way's result.

To prove that ORE&D is the best way, we need to show that ORE&D designs are as good as classic normalization's designs. We need to prove that ORE&D designs are in fifth-normal form.

Let's begin with first-normal form.

First Normal Form

Our ORE&D designs are in first normal form. We only put single values into cells and our RDBMSs make sure that every column holds data from the domain we specify. It will be especially nice about ensuring that foreign keys in one table are always the primary keys in a target table.

It is possible to put non-atomic values into cells in an RDBMS, but I'm not going to explain how. Nice people just don't do such things. All our values are atomic and our tables are all in first normal form.

From here there are two paths to fifth normal form: leaping and one at a time. I'll start with the leaping proof, which takes little time, but relies on the brilliant work of the pioneers of this field. After that's done, we'll go one step at a time, just to show that you can depend on those brilliant pioneers.

top
Design
Normalize
Prove
1st NF
  Boyce-Codd
5th NF
Steps

Boyce-Codd Normal Form

It has been proven that a table in Boyce-Codd normal form is also in third normal form. A third normal form table is also in second normal form, so if we can prove that an ORE&D design is in Boyce-Codd normal form we'll be more than half done. Fortunately, the proof is simple.

Our tables all include an ID, which is a unique, abstract and single-valued key. Our ID implies every other attribute (field) in its tuple (record). And it is a candidate key. (Actually, it's the primary and only key.) Therefore, our tables are in Boyce-Codd normal form.

Fifth Normal Form

In 1991 two of the pioneers of the relational database, C. J. Date and Ronald Fagin prepared a paper which was published in ACM Transactions on Database Systems, Vol. 17, No. 3, September 1992. This is the first sentence of the abstract:
"A key is simple if it consists of a single attribute. It is shown that if a relation schema is in third normal form and every key is simple, then it is in projection-join normal form (sometimes called fifth normal form),"
In an ORE&D design, every key is simple and therefore, an ORE&D design is in fifth normal form. QED.

I'd like to take this opportunity to thank Boyce, Codd, Date and Fagin for having done all the hard work. And I'd like to thank IBM for its fine online library, where you can find Simple Conditions for Guaranteeing Higher Normal Forms for Relational Databases.

top
Design
Normalize
Prove
1st NF
Boyce-Codd
5th NF
  Steps

One Step at a Time

Now we can look a little more closely and not just cite the brilliant works of our pioneers. We've seen that first normal form is simply a matter of putting all the data into tables, which is what we do when we decide to use a relational database. We're in first normal form.

The second normal form is achieved when our data is "about the key, the whole key and nothing but the key." It's here that we separate Product data from Customer data. The ORE&D method never intermingles — the design process starts by identifying object and event classes, which separates the data from the beginning. By having unique, abstract keys we guarantee functional dependence, and since we don't have compound fields, we also guarantee full functional dependence. We're in second normal form.

Every time we use a foreign key (Customer ID, outside the Customer table) we use it to refer to the customer data. We don't ever accompany the Customer ID with other data from the Customer table. We assume that we'll look it up in the Customer table (do a join), as required. So the design method eliminates transitive dependencies. We're in third normal form.

Fourth normal would be violated if we put multiple repeating characteristics into a single table. The design method is to create a separate detail table for each repeating characteristic, which means we're sticking to fourth normal form.

Finally, there's fifth normal form, where I'll defer to the math of Date and Fagin. If you go back to the original distinction made between objects (which have an independent existence) and events (which relate objects at a point in time) you'll see the areas in which the designer has been looking at the objects carefully to see how they get related. It was our good fortune to stick to the real world, and the Granny principle. That led us to fifth normal form.

© 2005 by Martin Rinehart