Normalization - IBM Mainframe

Normalization is the formal process for deciding which attributes should be grouped together in a relation. We can use commonsense to decide which fields or attributes should be grouped together, but normalization provides us with a systematic and scientific process for doing this. Before proceeding with the physical design of the database, we need to validate the logical design and normalization serves as a tool for validating and improving the logical design, so that the logical design satisfies certain constraints and avoids unnecessary duplication of data. In the normalization process we analyze and decompose the complex relations and transform them into smaller, simpler and well-structured relations.

The process of normalization was first developed by E.F. Codd. Normalization is often performed as a series of tests on a relation to determine whether it satisfies or violates the requirements of a given normal form. Three normal forms were initially proposed, namely first, second and third normal forms. Subsequently, a stronger definition of the third normal form was introduced by R. Boyce and E.F. Codd, referred to as Boyce-Codd Normal Form (BCNF). All of these normal forms are based on functional dependencies among the attributes of a table.

Higher normal forms that go beyond BCNF were introduced later such as fourth and fifth normal forms. However these later normal forms deal with practical solutions that are very rare.

Data normalization is a methodology to devise a collection of tables that represent real world business records in a database, avoiding any data duplication as storage was expensive. Avoiding data duplication also means that update anomalies cannot happen. Data normalization is very well and widely documented. It starts with a single large table that represents all the properties of a real world business record together with its main identifier (a key), then removes hierarchies (repeating groups) to simplify querying with a language such as SQL. Any duplicate data and functional dependencies in the resulting tables must then also be removed.

To achieve normalization, the single table with all the required properties is split into more tables that are linked through primary and foreign keys. The result of data normalization is that a single business record can be represented in tens or sometimes hundreds of tables. Many artificial keys (and associated indexes) are introduced that do not exist in the real world, but are needed to reconstruct the real world business record. Storing multiple versions of a business record, for example, an order, and then any revisions made to the order, requires versioning all the tables involved which makes querying and maintaining the tables complex. An alternative approach, to conserve storage, is to store deltas only, instead of cascading full versions through the tables, introducing more complexity for programmers.

In 1980, the cost of two MB of storage was roughly equivalent to the cost of one week's work of a computer programmer in the US. In 2010, even one GB of storage represents only a tiny fraction, not even minutes of a computer programmer's time, and the storage price continues to drop. Moreover, memory is becoming plentiful, and the cost (latency) of I/O operations continues to decrease as new kinds of storage –such as solid state disks- are being introduced. With the notable exception of relational databases, storage media is typically used to store non-normalized artifacts, for example in file servers, web servers, content repositories, application servers, and so on.

Contrast relational storage with the tablets, tally sticks, and paper records that were used for record keeping before computer systems were introduced and always stored "as is". For several reasons they were not broken up or converted to a different format for storage purposes. First, storage space was usually plentiful and did not have to be conserved. Second, any conversion (and reconstruction) of these artifacts would have been very expensive. And third, storing these records in their original form made it easy to use and understand them when they were retrieved from storage. The same reasons apply today for storing real world digital business records in a non-normalized form as discussed later in this article.

As the use of paper records increased rapidly in the 19th and 20th century, storage space became a problem for some libraries and archives. This triggered the invention of microfilm and microfiche to reduce the required storage space to between 0.25% and 3% of the original material. However, this is merely a form of compression without representing the information in a conceptually different way. Similarly, digital compression can be applied today to reduce the storage consumption of non-normalized business records.

Normalization as we have seen is accomplished in steps or stages, each of which corresponds to a normal form. A normal form is a state of a table that results from applying simple rules regarding functional dependencies (or relationships between attributes) to that table. We will see an overview of the normal forms in the following section and will discuss each normal form in detail in the next sections.

  • First Normal Form (INF) - The multi-valued attributes (called repeating groups) should be removed i.e. elimination of repeating groups.
  • Second Normal Form (2NF) - The partial functional dependencies have to be removed, i.e. elimination bf redundant data.
  • Third Normal Form (3NF) - The transitive dependencies have to be removed, i.e. eliminate columns not dependent on the key.
  • Boyce-Codd Normal Fcym (BCNF) - The remaining anomalies that result from functionaldependencies are removed.
  • Fourth Normal Form (4NF) - Multi-valued dependencies are removed, i.e. isolation of independent multiple relationships.
  • Fifth Normal Form (5NF) - Any remaining anomalies are removed. In this normal form we isolate semantically related multiple relationships.

Before discussing the normal forms, there are two concepts required by normal forms and that are integral to Data Modeling—Keys and Relationships.

Keys

A key uniquely identifies a row in a table. There are two types of keys—intelligent keys and non-intelligent keys. An intelligent key is based upon data values such as a date, a last name or a combination of values.A non-intelligent key is completely arbitrary, having no function or meaning other than identification of the row.

Intelligent keys are problematic because the key value must be changed to reflect changing data values. An updateable key presents its own problems, because all related tables must be updated every time the key is updated. As a general rule, intelligent keys must never be used, but it is not always possible to do that.

A primary key is a column in the table whose purpose is to uniquely identify records from the same table. A foreign key is a column in a table that uniquely identifies the records from a different table. Primary and foreign keys therefore have a symbiotic relationship, in which the foreign key is used to store values of another table's primary key to describe the relationship between data from different tables.

Relationships

There are three possible relationships between related entities or tables. They are:

  • One-to-one
  • One-to-many
  • Many-to-many

Relational Databases are designed to work most effectively with one-to-many relationships between tables and expressed using primary and foreign keys. Defining a column as a primary key in the database will ensure that no duplicate values are contained in that column. In other words, a primary key will be unique.

One-to-one relationships are rare, because data elements related in this way are normally placed in the same table. When a one-to-one relationship is desired, it can be expressed simply as a one-to-many relationship using primary and foreign keys.

One-to-many relationships are the most common. A primary key is the 'one' side of the relationship, and the foreign key is the 'many' side of the relationship. A primary key value is unique in its own table. Each primary key may have any number of foreign keys using the same value, in any number of tables. In other words each pair of primary and foreign key columns is a one-to-many relationship. The data values stored in these columns determine which rows from the table containing the foreign key column are associated with an individual row in the table containing the primary key column. The foreign key value refers to one, and only one, value in the table containing the primary key. Multiple foreign keys may reference the primary key value. This is how one-to-many relationships are expressed and maintained by the database. The database can perform special operations on primary and foreign key relationship between tables, including the ability to delete all rows in a table containing a foreign key value when the associated primary key row is deleted or to restrict deletion of a row containing a primary key that has references in another table. Defining primary and foreign key columns in the database promotes data integrity.

Many-to-many relationships are problematic and cannot be adequately expressed directly in a relational database. Many-to-many relationships are usually expressed using intersection tables. An intersection table contains two or more foreign keys, relating the primary key values of two or more tables to each other. The role of an intersection table is to convert the many-to-many relationship into two one-to-many relationships that can be easily handled by the database.


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

IBM Mainframe Topics