Relational-Database Design - Database system concepts

This chapter continues our discussion of design issues in relational databases. In general, the goal of a relational-database design is to generate a set of relation schemas that allows us to store information without unnecessary redundancy, yet also allows us to retrieve information easily. One approach is to design schemas that are in an appropriate normal form. To determine whether a relation schema is in one of thedesirable normal forms, we need additional information about the real-world enterprise that we are modelingwith the database. In this chapter,we introduce the notion of functional dependencies. We then define normal forms in terms of functional dependencies and other types of data dependencies.

First Normal Form

The first of the normal forms that we study, first normal form, imposes a very basic requirement on relations; unlike the other normal forms, it does not require additional information such as functional dependencies.A domain is atomic if elements of the domain are considered to be indivisible units. We say that a relation schema R is in first normal form (1NF) if the domains of all attributes of R are atomic.

A set of names is an example of a nonatomic value. For example, if the schema of a relation employee included an attribute children whose domain elements are sets of names, the schema would not be in first normal form. Composite attributes, such as an attribute address with component attributes street and city, also have nonatomic domains.

Integers are assumed to be atomic, so the set of integers is an atomic domain; the set of all sets of integers is a nonatomic domain. The distinction is that we do not normally consider integers to have subparts, but we consider sets of integers to have subparts namely, the integers making up the set. But the important issue is not what the domain itself is, but rather how we use domain elements in our database.

The domain of all integers would be nonatomic if we considered each integer to be an ordered list of digits. As a practical illustration of the above point, consider an organization that assigns employees identification numbers of the following form: The first two letters specify the department and the remaining four digits are a unique number within the department for the employee. Examples of such numbers would be CS0012 andEE1127. Such identification numbers can be divided into smaller units, and are therefore nonatomic. If a relation schema had an attribute whose domain consists of identification numbers encoded as above, the schema would not be in first normal form.

When such identification numbers are used, the department of an employee can be found by writing code that breaks up the structure of an identification number. Doing so requires extra programming, and information gets encoded in the application program rather than in the database. Further problems arise if such identification numbers are used as primary keys:When an employee changes department, the employee’s identification number must be changed everywhere it occurs, which can be a difficult task, or the code that interprets the number would give a wrong result.

The use of set valued attributes can lead to designs with redundant storage of data, which in turn can result in inconsistencies. For instance, instead of the relationship between accounts and customers being represented as a separate relation depositor, a database designer may be tempted to store a set of owners with each account, and a set of accounts with each customer. Whenever an account is created, or the set of owners of an account is updated, the update has to be performed at two places; failure to perform both updates can leave the database in an inconsistent state. Keeping only one of these sets would avoid repeated information, but would complicate some queries. Set valued attributes are also more complicated to write queries with, and more complicated to reason about.

In this chapter we consider only atomic domains, and assume that relations are in first normal form. Although we have not mentioned first normal form earlier, when we introduced the relational model in Relational Model we stated that attribute values must be atomic.

Some types of nonatomic values can be useful, although they should be used with care. For example, composite valued attributes are often useful, and set valued attributes are also useful in many cases, which is why both are supported in the E-R model. In many domains where entities have a complex structure, forcing a first normal form representation represents an unnecessary burden on the application programmer, who has to write code to convert data into atomic form. There is also a runtime overhead of converting data back and forth from the atomic form. Support for nonatomic values can thus be very useful in such domains. In fact, modern database systems do support many types of nonatomic values, as we will see in Object oriented Database and Object Relational Databases. However, in this chapter we restrict ourselves to relations in first normal form.

Pitfalls in Relational-Database Design

Before we continue our discussion of normal forms, let us look at what can go wrong in a bad database design. Among the undesirable properties that a bad design may have are:

  • Repetition of information
  • Inability to represent certain information

We shall discuss these problems with the help of a modified database design for our banking example: suppose the information concerning loans is kept in one single relation, lending, which is defined over the relation schema

Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)

Figure shows an instance of the relation lending (Lending-schema). A tuple t in the lending relation has the following intuitive meaning:

  • t[assets] is the asset figure for the branch named t[branch-name].
  • t[branch-city] is the city in which the branch named t[branch-name] is located.
  • t[loan-number] is the number assigned to a loan made by the branch named t[branch-name] to the customer named t[customer-name].
  • t[amount] is the amount of the loan whose number is t[loan-number].

Suppose that we wish to add a new loan to our database. Say that the loan is made by the Perryridge branch to Adams in the amount of $1500. Let the loan-number be L-31. In our design, we need a tuple with values on all the attributes of Lendingschema. Thus, we must repeat the asset and city data for the Perryridge branch, and must add the tuple to the lending relation. In general, the asset and city data for a branch must appear once for each loan made by that branch.

(Perryridge, Horseneck, 1700000, Adams, L-31, 1500)

Pitfalls in Relational-Database Design

Sample lending relation.

The repetition of information in our alternative design is undesirable. Repeating information wastes space. Furthermore, it complicates updating the database. Suppose, for example, that the assets of the Perryridge branch change from 1700000 to 1900000. Under our original design, one tuple of the branch relation needs to be changed. Under our alternative design, many tuples of the lending relation need to be changed. Thus, updates are more costly under the alternative design than under the original design. When we perform the update in the alternative database, we must ensure that every tuple pertaining to the Perryridge branch is updated, or else our database will show two different asset values for the Perryridge branch.

That observation is central to understanding why the alternative design is bad.We know that a bank branch has a unique value of assets, so given a branch name we can uniquely identify the assets value. On the other hand, we know that a branch may make many loans, so given a branch name, we cannot uniquely determine a loan number. In other words, we say that the functional dependency

branch-name → assets

holds on Lending-schema, but we do not expect the functional dependency branchname → loan-number to hold. The fact that a branch has a particular value of assets, and the fact that a branch makes a loan are independent, and, as we have seen, these facts are best represented in separate relations.We shall see that we can use functional dependencies to specify formally when a database design is good. Another problem with the Lending-schema design is that we cannot represent directly the information concerning a branch (branch-name, branch-city, assets) unless there exists at least one loan at the branch. This is because tuples in the lending relation require values for loan-number, amount, and customer-name.

One solution to this problem is to introduce null values, as we did to handle updates through views. Recall, however, that null values are difficult to handle. If we are not willing to deal with null values, then we can create the branch information only when the first loan application at that branch is made. Worse, we would have to delete this information when all the loans have been paid.Clearly, this situation is undesirable, since, under our original database design, the branch information would be available regardless of whether or not loans are currently maintained in the branch, and without resorting to null values.

Functional Dependencies

Functional dependencies play a key role in differentiating good database designs from bad database designs. A functional dependency is a type of constraint that is a generalization of the notion of key.

Basic Concepts

Functional dependencies are constraints on the set of legal relations. They allow us to express facts about the enterprise that we are modeling with our database.

In Entity Relationship Model, we defined the notion of a superkey as follows. Let R be a relation schema. A subset K of R is a superkey of R if, in any legal relation r(R), for all pairs t1 and t2 of tuples in r such that t1 ≠ t2, then t1[K] ≠ t2[K]. That is, no two tuples in any legal relation r(R) may have the same value on attribute set K. The notion of functional dependency generalizes the notion of superkey. Consider a relation schema R, and let α ⊆ R and β ⊆ R. The functional dependency

α →β

holds on schema R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], it is also the case that t1[β] = t2[β]. Using the functional-dependency notation, we say that K is a superkey of R if K → R. That is, K is a superkey if, whenever t1[K] = t2[K], it is also the case that

t1[R] = t2[R] (that is, t1 = t2).

Functional dependencies allow us to express constraints that we cannot express with superkeys. Consider the schema Loan-info-schema = (loan-number, branch-name, customer-name, amount) which is simplification of the Lending-schema that we saw earlier. The set of functional dependencies that we expect to hold on this relation schema is

loan-number →amount
loan-number →branch-name

We would not, however, expect the functional dependency

loan-number →customer-name

to hold, since, in general, a given loan can be made to more than one customer (for example, to both members of a husband–wife pair). We shall use functional dependencies in two ways:

  1. To test relations to see whether they are legal under a given set of functional dependencies. If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.
  2. To specify constraints on the set of legal relations. We shall thus concern ourselves with only those relations that satisfy a given set of functional dependencies.

If we wish to constrain ourselves to relations on schema R that satisfy a set F of functional dependencies, we say that F holds on R.

Let us consider the relation r of Figure , to see which functional dependencies are satisfied. Observe that A → C is satisfied. There are two tuples that have an A value of a1. These tuples have the same C value namely, c1. Similarly, the two tuples with an A value of a2 have the same C value, c2. There are no other pairs of distinct tuples that have the same A value. The functional dependency C → A is not satisfied, however. To see that it is not, consider the tuples t1 = (a2, b3, c2, d3) and t2 = (a3, b3, c2, d4). These two tuples have the same C values, c2, but they have different A values, a2 and a3, respectively. Thus, we have found a pair of tuples t1 and t2 such that t1[C] = t2[C], but t1[A] ≠ t2[A].

Sample relation r.

Sample relation r.

Many other functional dependencies are satisfied by r, including, for example, the functional dependency AB → D. Note that we use AB as a shorthand for {A,B}, to conform with standard practice. Observe that there is no pair of distinct tuples t1 and t2 such that t1[AB] = t2[AB]. Therefore, if t1[AB] = t2[AB], it must be that t1 = t2 and, thus, t1[D] = t2[D]. So, r satisfies AB → D.

Some functional dependencies are said to be trivial because they are satisfied by all relations. For example, A → A is satisfied by all relations involving attribute A. Reading the definition of functional dependency literally, we see that, for all tuples t1 and t2 such that t1[A] = t2[A], it is the case that t1[A] = t2[A]. Similarly, AB → A is satisfied by all relations involving attribute A. In general, a functional dependencyof the form α → β is trivial if β ⊆ α. To distinguish between the concepts of a relation satisfying a dependency and a dependency holding on a schema, we return to the banking example. If we consider the customer relation (on Customer-schema) in Figure , we see that customer-street → customer-city is satisfied. However, we believe that, in the real world, two cities can have streets with the same name. Thus, it is possible, at some time, to have an instance of the customer relation in which customer-street → customer-city is not satisfied.

So, we would not include customer-street→ customer-city in the set of functional dependencies that hold on Customer-schema.

customer relation.

The customer relation.

The loan relation.

The loan relation.

In the loan relation (on Loan-schema) of Figure , we see that the dependency loannumber → amount is satisfied. In contrast to the case of customer-city and customerstreet in Customer-schema, we do believe that the real-world enterprise that we are modeling requires each loan to have only one amount. Therefore, we want to require that loan-number→amount be satisfied by the loan relation at all times. In other words, we require that the constraint loan-number → amount hold on Loan-schema.

In the branch relation of Figure, we see that branch-name → assets is satisfied, as is assets → branch-name. We want to require that branch-name → assets hold on Branch-schema. However, we do not wish to require that assets → branch-name hold, since it is possible to have several branches that have the same asset value. In what follows, we assume that, when we design a relational database, we first list those functional dependencies that must always hold. In the banking example, our list of dependencies includes the following:

The branch relation.

The branch relation.

branch relation.

Closure of a Set of Functional Dependencies

It is not sufficient to consider the given set of functional dependencies. Rather, we need to consider all functional dependencies that hold.We shall see that, given a set F of functional dependencies, we can prove that certain other functional dependencies hold.We say that such functional dependencies are “logically implied” by F.

More formally, given a relational schema R, a functional dependency f on R is logically implied by a set of functional dependencies F on R if every relation instance r(R) that satisfies F also satisfies f.

Suppose we are given a relation schema R = (A, B, C, G, H, I) and the set of functional dependencies

A →B
A →C
CG→ H
CG→ I
B → H

The functional dependency

A→ H

is logically implied. That is, we can show that, whenever our given set of functional dependencies holds on a relation, A→H must also hold on the relation. Suppose that t1 and t2 are tuples such that

t1[A] = t2[A]

Since we are given that A→B, it follows from the definition of functional dependency that

t1[B] = t2[B]

Then, since we are given that B → H, it follows from the definition of functional dependency that

t1[H] = t2[H]

Therefore, we have shown that, whenever t1 and t2 are tuples such that t1[A] = t2[A], it must be that t1[H] = t2[H]. But that is exactly the definition of A → H. Let F be a set of functional dependencies. The closure of F, denoted by F+, is the set of all functional dependencies logically implied by F. Given F, we can compute F+ directly from the formal definition of functional dependency. If F were large, this process would be lengthy and difficult. Such a computation of F+ requires arguments of the type just used to show that A → H is in the closure of our example set of dependencies.

Axioms, or rules of inference, provide a simpler technique for reasoning about functional dependencies. In the rules that follow, we use Greek letters (α, β, γ, . . . ) for sets of attributes, and uppercase Roman letters from the beginning of the alphabet for individual attributes. We use αβ to denote α ∪ β.We can use the following three rules to find logically implied functional dependencies.

By applying these rules repeatedly, we can find all of F+, given F. This collection of rules is called Armstrong’s axioms in honor of the person who first proposed it.

  • Reflexivity rule. If α is a set of attributes and β ⊆ α, then α →β holds.
  • Augmentation rule. If α → β holds and γ is a set of attributes, then γα → γβ holds.
  • Transitivity rule. If α →β holds and β → γ holds, then α → γ holds.

Armstrong’s axioms are sound, because they do not generate any incorrect functional dependencies. They are complete, because, for a given set F of functional dependencies, they allow us to generate all F+. The bibliographical notes provide references for proofs of soundness and completeness.

Although Armstrong’s axioms are complete, it is tiresome to use them directly for the computation of F+. To simplify matters further, we list additional rules. It is possible to use Armstrong’s axioms to prove that these rules are correct.

  • Union rule. If α → β holds and α → γ holds, then α →βγ holds.
  • Decomposition rule. If α →βγ holds, then α → β holds and α →γ holds.
  • Pseudotransitivity rule. If α→β holds and γβ →δ holds, then αγ →δ

Let us apply our rules to the example of schema R = (A, B, C, G, H, I) and the set F of functional dependencies {A → B, A → C, CG → H, CG → I, B → H}. We list several members of F+ here:

  • A → H. Since A → B and B → H hold, we apply the transitivity rule. Observe that it was much easier to use Armstrong’s axioms to show that A → H holds than it was to argue directly from the definitions, as we did earlier in this section.
  • CG → HI . Since CG → H and CG → I , the union rule implies that CG → HI .
  • AG → I. Since A → C and CG → I, the pseudotransitivity rule implies that AG → I holds.

Another way of finding that AG → I holds is as follows. We use the augmentation rule on A → C to infer AG → CG. Applying the transitivity rule to this dependency and CG → I, we infer AG → I.

Figure shows a procedure that demonstrates formally how to use Armstrong’s axioms to compute F+. In this procedure, when a functional dependency is added to F+, it may be already present, and in that case there is no change to F+. The left-hand and right-hand sides of a functional dependency are both subsets of R. Since a set of size n has 2n subsets, there are a total of 2 × 2n = 2n+1 possible functional dependencies, where n is the number of attributes in R. Each iteration of the repeat loop of the procedure, except the last iteration, adds at least one functional dependency to F+. Thus, the procedure is guaranteed to terminate.

Closure of Attribute Sets

To test whether a set α is a superkey, we must devise an algorithm for computing the set of attributes functionally determined by α. One way of doing this is to compute F+, take all functional dependencies with α as the left-hand side, and take the union of the right-hand sides of all such dependencies.However, doing so can be expensive, since F+ can be large.

F+ = F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F+
for each pair of functional dependencies f1 and f2 in F+
if f1 and f2 can be combined using transitivity
add the resulting functional dependency to F+
until F+ does not change any further

Figure A procedure to compute F+.

An efficient algorithm for computing the set of attributes functionally determined by α is useful not only for testing whether α is a superkey, but also for several other tasks, as we will see later in this section.

Let α be a set of attributes.We call the set of all attributes functionally determined by α under a set F of functional dependencies the closure of α under F; we denote it by α+. Figure shows an algorithm, written in pseudocode, to compute α+. The input is a set F of functional dependencies and the set α of attributes. The output is stored in the variable result.

To illustrate how the algorithm works, we shall use it to compute (AG)+ with the functional dependencies .We start with result = AG. The first time that we execute the while loop to test each functional dependency, we find that

  • A → B causes us to include B in result. To see this fact, we observe that A → B is in F, A ⊆ result (which is AG), so result := result ∪ B.
  • A→ C causes result to become ABCG.
  • CG→H causes result to become ABCGH.
  • CG→I causes result to become ABCGHI.

The second time that we execute the while loop, no new attributes are added to result, and the algorithm terminates.

Let us see why the algorithm of Figure is correct. The first step is correct, since α→α always holds (by the reflexivity rule).We claim that, for any subset β of result, α→β. Sincewe start the while loop with α→result being true, we can add γ to result only if β ⊆ result and β → γ. But then result → β by the reflexivity rule, so α → β by transitivity. Another application of transitivity shows that α → γ (using α → β and β → γ). The union rule implies that α → result ∪ γ, so α functionally determines any new result generated in the while loop. Thus, any attribute returned by the algorithm is in α+.

It is easy to see that the algorithm finds all α+. If there is an attribute in α+ that is not yet in result, then there must be a functional dependency β → γ for which β ⊆ result, and at least one attribute in γ is not in result.It turns out that, in the worst case, this algorithm may take an amount of time quadratic in the size of F. There is a faster (although slightly more complex) algorithm that runs in time linear in the size of F;

result := α;
while (changes to result) do
for each functional dependency β → γ in F do
begin
if β ⊆ result then result := result ∪ γ;
end

Figure An algorithm to compute α+, the closure of α under F.

There are several uses of the attribute closure algorithm:

  • To test if α is a superkey, we compute α+, and check if α+ contains all attributes of R.
  • We can check if a functional dependency α → β holds (or, in other words, is in F+), by checking if β ⊆ α+. That is, we compute α+ by using attribute closure, and then check if it contains β. This test is particularly useful, as we will see later in this chapter.
  • It gives us an alternative way to compute F+: For each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+, we output a functional dependency γ → S.

Canonical Cover

Suppose that we have a set of functional dependencies F on a relation schema. Whenever a user performs an update on the relation, the database system must ensure that the update does not violate any functional dependencies, that is, all the functional dependencies in F are satisfied in the new database state.

The system must roll back the update if it violates any functional dependencies in the set F. We can reduce the effort spent in checking for violations by testing a simplified set of functional dependencies that has the same closure as the given set. Any database that satisfies the simplified set of functional dependencies will also satisfy the original set, and vice versa, since the two sets have the same closure. However, the simplifiedset is easier to test.We shall see how the simplified set can be constructed in a moment. First, we need some definitions.

An attribute of a functional dependency is said to be extraneous if we can remove it without changing the closure of the set of functional dependencies. The formal definition of extraneous attributes is as follows. Consider a set F of functional dependencies and the functional dependency α →β in F.

  • Attribute A is extraneous in α if A ∈ α, and F logically implies (F − {α → β}) ∪ {(α − A) → β}.
  • Attribute A is extraneous in β if A ∈ β, and the set of functional dependencies (F − {α →β}) ∪ {α → (β − A)} logically implies F.

For example, suppose we have the functional dependencies AB → C and A → C in F. Then, B is extraneous in AB → C. As another example, suppose we have the functional dependencies AB → CD and A → C in F. Then C would be extraneous in the right-hand side of AB → CD.

Beware of the direction of the implications when using the definition of extraneous attributes: If you exchange the left-hand side with right-hand side, the implication will always hold. That is, (F − {α → β}) ∪ {(α − A) → β} always logically implies F, and also F always logically implies (F − {α → β}) ∪ {α → (β − A)}

Here is how we can test efficiently if an attribute is extraneous. Let R be the relation schema, and let F be the given set of functional dependencies that hold on R. Consider an attribute A in a dependency α → β.

  • If A ∈ β, to check if A is extraneous consider the set F' = (F − {α → β}) ∪ {α → (β − A)} and check if α → A can be inferred fromF

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Database system concepts Topics