Applying Normalization Rules - T-SQL

Using the Employees table shown in Figure (simple spreadsheet Employee Tablee), look for violations of first normal form. Does more than one column contain information about the same type of attributes? Beginning with the numbered Address and CityLine fields, each "location” consists of a column for the address and another column for the city, state, and zip code. Because there are two pairs of these columns, this may be a problem. Each phone number is a single column, designated as either the home or work phone. How would I make a single list of all phone numbers? What happens if I need to record a mobile phone for an employee? I could add a third column to the table. How about a fourth? How about the Title column? The Supervisor Name column may be viewed as a special case, but the fact is that the Employee Name and Supervisor Name columns store the same type of values. They both represent employees.

I can move all these columns into separate tables, but how do I keep them associated with the employee? This is accomplished through the use of keys. A key is just a simple value used to associate a record in one table to a record in another table (among other things). To satisfy first normal form, I'll move these columns to different tables and create key values to wire up the associations. In the following example, I have removed the address and city information and have placed it into a separate table.

I have devised a method to identify each employee with a six-character character key, using part of their last and first names. I chose this method because this was once a very popular method for assigning key values. This allows me to maintain the associations between employees and their addresses. In this first iteration (see Figure : Show in the below Table), I use this method to make a point. This is a relatively small database for a small company and I don't have any employees with similar first and last names, so this method ought to work just fine, right? Hold that thought for now.

Applying Normalization Rules

I do the same thing with the new Addresses table (see Figure (Show in the below Table)). Each address record is assigned an EmployeeKey value to link it back to the Employees table.

Applying-Normalization-Rules

I have lost a significant piece of information in doing this. I've flattened the address information so that I no longer have one address designated as either primary or secondary for an employee. I'll get to this later. For now, I'm only concerned with adhering to first normal form. Besides, does the information in the old Address1 and CityLine1 columns imply that this is the employee's primary residence? Did I have a complete understanding of the business rules when I began working with this data? Unfortunately, in most ad-hoc projects, it is more often a case of making things up as we go along.

For the phone numbers I'll do the same thing as before — move the phone number values into their own table and then add the corresponding key value to associate them with the employee record. I'm also going to add a column to designate the type of phone number this represents (see Figure (Show in the below Table)). I could use this as an argument to do the same thing with the addresses, but I'll hold off for now.

Applying-Normalization-Rules

Now that I have three tables with common column values, do I have a relational database? Although it may be true that this is related data, it's not a fully relational database. The key values only give me the ability to locate the related records in other tables, but this does nothing to ensure that my data stays intact. Take a look at what I have done so far (see Figure (Show in the below Table)). The presence of the same key value in all three of these tables is an implied relationship. There is currently no mechanism in place for the database to prevent users from making silly mistakes (such as deleting an employee record without also removing the corresponding address and phone information). This would create a condition, common in early database systems, called orphaned records.

Applying-Normalization-Rules

Before continuing, I must correct a horrible indiscretion. I told you that this business of using parts of different field values (such as the first and last name) to form a meaningful unique key was once a common practice. This is because database system designers in the past often had to create a system where users had to provide a special number to look up a record. To make this easier, they would come up with some kind of intelligent, unique value. It might include characters from a customer or patient's name, or perhaps a series of numbers with digits in specific positions representing an account type or region. For example, when was the last time you called the bank or the telephone company and were asked for your account number? This happens to me all the time. It amazes me that the companies in possession of the most sophisticated, state-of-the-art technology on the planet require me to memorize my account number. How about looking up my account using my name, address, phone number, mother's maiden name, or any of the other information they required when I set up my account?
Using this simple name-based key may have seemed like the right thing to do at the time, but the fact is that it will likely get me into a whole lot of trouble down the road. I worked for a company that used this approach in a small, commercial application. The program even appended numbers to the end of the keys so there could be nearly a hundred unique key values for a given last name/first name combination. What they didn't anticipate was that their product would eventually become the most popular medical billing software in the country and would be used in business environments they couldn't possibly have imagined. Eventually this got them into trouble, and they had to completely re-architect the application to get around this limitation. One customer, a medical office in the Chicago area, had so many patients with the same or similar names, that they actual ran out of key values.

Thinking Ahead

I'll resolve the EmployeeKey issue by changing it to an auto-sequencing integer called an identity (see Figure (Show in the below Table)). This is known as a surrogate key, which simply means that key values are absolutely meaningless as far as the user is concerned. The database assigns numbers that will always be unique within this column. The purpose of the key is to uniquely identify each row, not to give employees or users something to memorize.

Thinking Ahead

The next step is to designate the EmployeeKey in the Employees table as a primary key and the related keys as foreign keys. The foreign key constraints cause the database engine to validate any action that could cause these relationships to be violated. For example, the database would not allow an employee record to be deleted if there were existing, related address or phone records. Related tables are often documented using an entity-relationship diagram (ERD). The diagram in Figure (Show in the below Table) shows the columns and relationships between these tables.

Applying-Normalization-Rules

There is still work to do. The Supervisor Name is also a violation of first normal form because it duplicates some employee names. This is a special case, however, because these names already exist in the Employees table. This can be resolved using a self-join, or relationship, on the same table (see Figure (Show in the below Table)).

self-join, or relationship, on the same table

The supervisor designation within the Employees table is now just an integer value referring to another employee record.
The Title column is also in violation of first normal form and could be moved into its own table, as well. A title isn't uniquely owned by an employee, but each employee only has one title. To discern this relationship, you must look at it from both directions:

  • One employee has one title.
  • One title can have multiple employees.

This is a one-to-many relationship from the title to the employee. Resolving this is a simple matter of placing one instance of each title value in a separate table, identified by a unique primary key. A similar column is added to the Employees table as a non-unique foreign key (see Figure (Show in the below Table)).

Employees table as a non-unique foreign key

You should see a pattern developing. This is an iterative process that will typically send you in one of two directions in each cycle. You will either continue to move these values into related tables with related keys, or you will find discrepancies between your business rules and the data, and then head back to the drawing board to correct the data and table structure.

Multiple Associations

I know that a title can be associated with more than one employee, but what happens if an address is shared by more than one individual? This is a problem in the current database model. I can't use one primary key value and have multiple associations going in both directions. The only way I can do this is to create a primary key that includes two separate values: one for the employee key and one for the address key. However, I can't do this using either of these two tables. If I add the EmployeeKey to the Addresses table, I'm back to the original problem, where I would have duplicate address rows. Because a record in the Addresses table will no longer be directly tied to a record in the Employees table, I must remove the EmployeeKey and create a new primary key for this table and remove the duplicate values. Now the Addresses table conforms to first normal form and third normal form.

Many-to-many relationships are solved using a separate table, often called a join or bridge table. Often, this table contains no user-readable values, only keys to bridge one table to another. However, you may recall that we have a missing bit of information. Remember when I moved the address information from the Addressl/CityLinel columns and Address2/CityLine2 columns into the Address table? I said that we had no way to trace these back to their roots and recall which location was the employee's primary residence? I can now resolve this within the bridge table by adding an additional column (see Figure (Show in the below Table)).

Multiple Associations

The new AddressType column is used to indicate the type of residence. This allows employees to share addresses while eliminating redundant address records. Does the AddressType column violate first normal form? Technically, yes. This could be an opportunity to optimize the database even more by creating yet another table for these values. It looks like there would only be three address type records related to the nine employees (see Figure (Show in the below Table)).

three address type records related to the nine employees

It looks like the Vice President of Sales and the Inside Sales Coordinator share a residence only on weekends (see Figure (Show in the below Table)).

Sales Coordinator share a residence only on weekends

Multi-Valued Columns

The last issue to contend with is that of having multiple values stored in a single column. There are quite a few examples in these tables. For example, the EmployeeName column in the Employees table contains both the first and last name, the Address Line column in the Addresses table includes all parts of a street address, and the City Line contains the city name, U.S. state, and zip code/postal code. Before I just willy-nilly start parsing all the values into separate columns, it's important for me to consider how this data will be used and the advantages and disadvantages of breaking it into pieces. Here are some sample questions that can help to define these business requirements:

  • Will the employee first name and last name ever be used separately?
  • Will I ever need to sort on one single value (such as last name)?
  • Does every employee have a first name and last name? Do they only have a first name and last name (middle names/initials, hyphenated names, and so on)?
  • Is there any value or need in separating parts of the address line (will I need a list of streets, and so on)?
  • If I separate parts of the AddressLine or CityLine into separate columns, do I need to accommodate international addresses?

Apparently I do need to consider addresses in at least two locales because I have locations in the UK and the United States, so I will need to think beyond only one style of address. So, suppose that I have consulted my sponsoring customer and have learned that it would be useful to store separate first names and last names and we don't care about middle names or initials. We also don't plan to accommodate anyone without a first and last name. We have no need to break up the address line. This practice is highly uncommon outside of specialized systems and would be very cumbersome to maintain. We would benefit from storing the city, postal code or zip code, and state or province. It would also be useful to store the country, which is currently not included. Storing geographic information can be tricky because of the lack of consistency across international regions. This may require that you devise your own synonyms for different regional divisions (such as city, township, municipality, county, state, province, and country). In distributing these values into separate columns, you may find even more redundancies. Should these be further normalized and placed into separate tables? Does this ever end? I'll cite one example where the city, state, and zip code is normalized. I maintain a system that stores U.S. addresses and stores only the zip code on the individual's record. A separate table contains related city and state information obtained from the U.S. Postal Service.

I won't bore you will the mechanics of separating all these fields. The process is quite straightforward and very similar to what's already been done. Figure (Show in the below Table) shows the completed data model, based on the original flat table.

Applying-Normalization-Rules

To Normalize or To De-normalize?

Depending on how a database is to be used (generally, it will be used for data input or for reporting), it may or may not be appropriate to apply all the rules just presented. The fact of the matter is that fully normalized databases require some gnarly, complex queries to support reporting and business analysis requirements. Complying fully with all the rules of normal form often adds more overhead to the application. Without going into detail, here's something to think about: If you are designing a new database system to support a typical business process, you will usually want to follow these rules completely and normalize all your data structures. After a time, when you have a large volume of data to extract and analyze through reports and business intelligence tools, you may find it appropriate to create a second database system for this purpose. In this database, you will strategically break the rules of normal form, creating redundant values in fewer, larger tables. Here's the catch: Only after you fully understand the rules of normal form will you likely know when and where you should break them.

Question Authority

You should ask yourself an important question as you encounter each opportunity to normalize: "Why?" Know why you should apply the rules and what the benefits and cost are. One of the challenges of applying normalization rules is to know just how far to go and to what degree it makes sense to apply them. At times it just makes sense to break some of the rules. There are good arguments to support both sides of this issue and without a complete understanding of business requirements I would be hard pressed to make a general statement about how data elements (such as phone numbers, titles, or addresses) should always be managed. In short, you need to understand the business requirements for your application and then apply the appropriate level of database normalization to reach that goal. If ever in doubt, it's usually best to err on the side of keeping the rules.

That being said, don't let yourself be bullied into adhering strictly to normalization rules if they don't make sense. A case in point: I worked for a very smart guy who had virtually no database experience. Very often he would read an article or a chapter of a book and suddenly became an expert on that particular technology. When he noticed that duplicate state codes were showing up repeatedly in our database, he decided that we needed a State table so that the states were not duplicated and that only valid states were added to a record. Because the addition of state code data was completely controlled by the application and not by ad-hoc changes to the database tables, his reasoning was not very valid. The data team finally got him to relent after pretending to enthusiastically endorse his opinion and add to it by claiming we also needed a street number table to avoid erroneous addresses as well as an area code table and a phone prefix table. He got the message almost immediately and decided that separating the state codes out didn't make as much sense as he originally thought.

Normalization is very important to database integrity, but remember that the more tables involved in a query to extract data, the worse the performance of the query.
We have discussed some abstract normalization concepts, which can be a bit confusing to someone new to databases. It is important to understand the basics if you are ever going to be called on to build data entities and not just to query them. Another important aspect of T-SQL programming is to understand how queries are processed by SQL Server. This will give you insight on how you should formulate your queries. The next section gives a brief description of the query processing process.

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

T-SQL Topics