Recovery System - Database system concepts

A computer system, like any other device, is subject to failure from a variety of causes: disk crash, power outage, software error, a fire in the machine room, even sabotage. In any failure, information may be lost. Therefore, the database system must take actions in advance to ensure that the atomicity and durability properties of transactions, introduced in Transactions, are preserved. An integral part of a database system is a recovery scheme that can restore the database to the consistent state that existed before the failure. The recovery scheme must also provide high availability; that is, it must minimize the time for which the database is not usable after a crash.

Failure Classification

There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner. The simplest type of failure is one that does not result in the loss of information in the system. The failures that are more difficult to deal with are those that result in loss of information. In this chapter, we shall consider only the following types of failure:

  • Transaction failure. There are two types of errors that may cause a transaction to fail:
    1. Logical error. The transaction can no longer continue with its normal execution because of some internal condition, such as bad input, data not found, overflow, or resource limit exceeded.
    2. System error. The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be reexecuted at a later time.
  • System crash. There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage, and brings transaction processing to a halt. The content of nonvolatile storage remains intact, and is not corrupted.

The assumption that hardware errors and bugs in the software bring the system to a halt, but do not corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have numerous internal checks, at the hardware and the software level, that bring the system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.

  • Disk failure. A disk block loses its content as a result of either a head crash or failure during a data transfer operation. Copies of the data on other disks, or archival backups on tertiary media, such as tapes, are used to recover from the failure.

To determine how the system should recover from failures, we need to identify the failure modes of those devices used for storing data. Next, we must consider how these failure modes affect the contents of the database. We can then propose algorithms to ensure database consistency and transaction atomicity despite failures. These algorithms, known as recovery algorithms, have two parts:

  1. Actions taken during normal transaction processing to ensure that enough information exists to allow recovery from failures.
  2. Actions taken after a failure to recover the database contents to a state that ensures database consistency, transaction atomicity, and durability.

Storage Structure

As we saw in Storage and File Structure, the various data items in the database may be stored and accessed in a number of different storage media. To understand how to ensure the atomicity and durability properties of a transaction, we must gain a better understanding of these storage media and their access methods.

Storage Types

In Storage and File Structure, we saw that storage media can be distinguished by their relative speed, capacity, and resilience to failure, and classified as volatile storage or nonvolatile storage. We review these terms, and introduce another class of storage, called stable storage.

  • Volatile storage. Information residing in volatile storage does not usually survive system crashes. Examples of such storage are main memory and cache memory. Access to volatile storage is extremely fast, both because of the speed of the memory access itself, and because it is possible to access any data item in volatile storage directly.
  • Nonvolatile storage. Information residing in nonvolatile storage survives system crashes. Examples of such storage are disk and magnetic tapes. Disks are used for online storage, whereas tapes are used for archival storage. Both, however, are subject to failure (for example, head crash), which may result in loss of information. At the current state of technology, nonvolatile storage is slower than volatile storage by several orders of magnitude. This is because disk and tape devices are electromechanical, rather than based entirely on chips, as is volatile storage. In database systems, disks are used for most nonvolatile storage. Other nonvolatile media are normally used only for backup data. Flash storage, though nonvolatile, has insufficient capacity for most database systems.
  • Stable storage. Information residing in stable storage is never lost ("never" should be taken with a grain of salt, since theoretically "never" cannot be guaranteed; for example, it is possible, although extremely unlikely, that a black hole may envelop the earth and permanently destroy all data!). Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely. Stable-storage implementation is discussed in a later subsection.

The distinctions among the various storage types are often less clear in practice than in our presentation. Certain systems provide battery backup, so that some main memory can survive system crashes and power failures. Alternative forms of nonvolatile storage, such as optical media, provide an even higher degree of reliability than do disks.

Stable-Storage Implementation

To implement stable storage, we need to replicate the needed information in several nonvolatile storage media (usually disk) with independent failure modes, and to update the information in a controlled manner to ensure that failure during data transfer does not damage the needed information.

Recall that RAID systems guarantee that the failure of a single disk (even during data transfer) will not result in loss of data. The simplest and fastest form of RAID is the mirrored disk, which keeps two copies of each block, on separate disks. Other forms of RAID offer lower costs, but at the expense of lower performance.

RAID systems, however, cannot guard against data loss due to disasters such as fires or flooding. Many systems store archival backups of data on tapes off-site to guard against such disasters. However, since tapes cannot be carried off-site continually, updates since the most recent time that tapes were carried off-site could be lost in such a disaster. More secure systems keep a copy of each block of stable storage at a remote site, writing it out over a computer network, in addition to storing the block on a local disk system. Since the blocks are output to a remote system as and when they are output to local storage, once an output operation is complete, the output is not lost, even in the event of a disaster such as a fire or flood.

In the remainder of this section, we discuss how storage media can be protected from failure during data transfer. Block transfer between memory and disk storage can result in

  • Successful completion. The transferred information arrived safely at its destination.
  • Partial failure. A failure occurred in the midst of transfer, and the destination block has incorrect information.
  • Total failure. The failure occurred sufficiently early during the transfer that the destination block remains intact.

We require that, if a data-transfer failure occurs, the system detects it and invokes a recovery procedure to restore the block to a consistent state. To do so, the system must maintain two physical blocks for each logical database block; in the case of mirrored disks, both blocks are at the same location; in the case of remote backup, one of the blocks is local, whereas the other is at a remote site. An output operation is executed as follows:

  1. Write the information onto the first physical block.
  2. When the first write completes successfully, write the same information onto the second physical block.
  3. The output is completed only after the second write completes successfully.

During recovery, the system examines each pair of physical blocks. If both are the same and no detectable error exists, then no further actions are necessary. (Recall that errors in a disk block, such as a partial write to the block, are detected by storing a checksum with each block.) If the system detects an error in one block, then it replaces its content with the content of the other block. If both blocks contain no detectable error, but they differ in content, then the system replaces the content of the first block with the value of the second. This recovery procedure ensures that a write to stable storage either succeeds completely (that is, updates all copies) or results in no change.

The requirement of comparing every corresponding pair of blocks during recovery is expensive to meet. We can reduce the cost greatly by keeping track of block writes that are in progress, using a small amount of nonvolatile RAM. On recovery, only blocks for which writes were in progress need to be compared.
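As a rough illustration of these two protocols, the Python sketch below mirrors each logical block onto two physical copies, records in-progress writes in a set standing in for nonvolatile RAM, and repairs a pair on recovery. DiskBlock, the CRC-based checksum, and all the names are hypothetical simplifications, not a real storage driver.

# Minimal sketch of stable-storage output and recovery over two mirrored copies.
# DiskBlock, the CRC checksum, and the in_progress set are hypothetical stand-ins.

import zlib

class DiskBlock:
    def __init__(self):
        self.data = b""
        self.crc = zlib.crc32(b"")

    def write(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)

    def checksum_ok(self) -> bool:
        return self.crc == zlib.crc32(self.data)

class StableBlock:
    """One logical block backed by two physical copies with independent failures."""
    def __init__(self):
        self.first = DiskBlock()
        self.second = DiskBlock()

    def output(self, data: bytes, in_progress: set):
        in_progress.add(self)        # note the in-flight write in "nonvolatile RAM"
        self.first.write(data)       # 1. write the first physical block
        self.second.write(data)      # 2. then write the second physical block
        in_progress.discard(self)    # 3. the output is complete only now

    def recover(self):
        # Compare the pair: repair a copy with a detectable error from the other,
        # and if both are error-free but differ, copy the second onto the first.
        if self.first.checksum_ok() and not self.second.checksum_ok():
            self.second.write(self.first.data)
        elif self.second.checksum_ok() and not self.first.checksum_ok():
            self.first.write(self.second.data)
        elif self.first.data != self.second.data:
            self.first.write(self.second.data)

writes_in_progress = set()
block = StableBlock()
block.output(b"A = 950", writes_in_progress)
block.recover()    # after a crash, only blocks left in writes_in_progress need this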

The protocols for writing out a block to a remote site are similar to the protocols for writing blocks to a mirrored disk system, which we examined in Storage and File Structure. We can extend this procedure easily to allow the use of an arbitrarily large number of copies of each block of stable storage. Although a larger number of copies reduces the probability of failure even further than two copies do, it is usually reasonable to simulate stable storage with only two copies.

Data Access

As we saw in Storage and File Structure, the database system resides permanently on nonvolatile storage (usually disks), and is partitioned into fixed-length storage units called blocks. Blocks are the units of data transfer to and from disk, and may contain several data items. We shall assume that no data item spans two or more blocks. This assumption is realistic for most data-processing applications, such as our banking example.

Figure: Block storage operations.

Transactions input information from the disk to main memory, and then output the information back onto the disk. The input and output operations are done in block units. The blocks residing on the disk are referred to as physical blocks; the blocks residing temporarily in main memory are referred to as buffer blocks. The area of memory where blocks reside temporarily is called the disk buffer.

Block movements between disk and main memory are initiated through the following two operations:

  1. input(B) transfers the physical block B to main memory.
  2. output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.

Each transaction Ti has a private work area in which copies of all the data items accessed and updated by Ti are kept. The system creates this work area when the transaction is initiated; the system removes it when the transaction either commits or aborts. Each data item X kept in the work area of transaction Ti is denoted by xi. Transaction Ti interacts with the database system by transferring data to and from its work area to the system buffer. We transfer data by these two operations:

  1. read(X) assigns the value of data item X to the local variable xi. It executes this operation as follows:
    1. If block BX on which X resides is not in main memory, it issues input(BX).
    2. It assigns to xi the value of X from the buffer block.

  2. write(X) assigns the value of local variable xi to data item X in the buffer block. It executes this operation as follows:
    1. If block BX on which X resides is not in main memory, it issues input(BX).
    2. It assigns the value of xi to X in buffer BX.

Note that both operations may require the transfer of a block from disk to main memory. They do not, however, specifically require the transfer of a block from main memory to disk.
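As a loose illustration of these operations, the Python sketch below models input, output, read, and write over dictionaries standing in for the disk and the disk buffer. The block and item names are made up, and the buffer never evicts blocks as a real buffer manager would.

# Sketch of block movement and transaction data access.
# "disk", "buffer", and the block/item names are toy stand-ins; input() here
# deliberately mirrors the book's operation name and shadows Python's builtin.

disk = {"B_A": {"A": 1000}, "B_B": {"B": 2000}}    # physical blocks on disk
buffer = {}                                        # buffer blocks in main memory
block_of = {"A": "B_A", "B": "B_B"}                # which block holds each item

def input(block):
    buffer[block] = dict(disk[block])              # copy the physical block in

def output(block):
    disk[block] = dict(buffer[block])              # replace the physical block

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.work_area = {}                        # private copies x_i of data items

    def read(self, x):
        b = block_of[x]
        if b not in buffer:                        # bring the block in if needed
            input(b)
        self.work_area[x] = buffer[b][x]           # assign the value of X to x_i

    def write(self, x):
        b = block_of[x]
        if b not in buffer:
            input(b)
        buffer[b][x] = self.work_area[x]           # assign x_i to X in the buffer
        # no output(b) here: the block reaches disk only when it is (force-)output

t0 = Transaction("T0")
t0.read("A")
t0.work_area["A"] -= 50
t0.write("A")
output("B_A")    # force-output of the buffer block holding A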

A buffer block is eventually written out to the disk either because the buffer manager needs the memory space for other purposes or because the database system wishes to reflect the change to a buffer block B on the disk. We shall say that the database system performs a force-output of buffer B if it issues an output(B).

When a transaction needs to access a data item X for the first time, it must execute read(X). The system then performs all updates to X on xi. After the transaction accesses X for the final time, it must execute write(X) to reflect the change to X in the database itself.

The output(BX) operation for the buffer block BX on which X resides does not need to take effect immediately after write(X) is executed, since the block BX may contain other data items that are still being accessed. Thus, the actual output may take place later. Notice that, if the system crashes after the write(X) operation was executed but before output(BX) was executed, the new value of X is never written to disk and, thus, is lost.

Recovery and Atomicity

Consider again our simplified banking system and transaction Ti that transfers $50 from account A to account B, with initial values of A and B being $1000 and $2000, respectively. Suppose that a system crash has occurred during the execution of Ti, after output(BA) has taken place, but before output(BB) was executed, where BA and BB denote the buffer blocks on which A and B reside. Since the memory contents were lost, we do not know the fate of the transaction; thus, we could invoke one of two possible recovery procedures:

  • Reexecute Ti. This procedure will result in the value of A becoming $900, rather than $950. Thus, the system enters an inconsistent state.
  • Do not reexecute Ti. The current system state has values of $950 and $2000 for A and B, respectively. Thus, the system enters an inconsistent state.

In either case, the database is left in an inconsistent state, and thus this simple recovery scheme does not work. The reason for this difficulty is that we have modified the database without having assurance that the transaction will indeed commit. Our goal is to perform either all or no database modifications made by Ti. However, if Ti performed multiple database modifications, several output operations may be required, and a failure may occur after some of these modifications have been made, but before all of them are made.

To achieve our goal of atomicity, we must first output information describing the modifications to stable storage, without modifying the database itself. As we shall see, this procedure will allow us to output all the modifications made by a committed transaction, despite failures. There are two ways to perform such outputs; in describing them, we shall assume that transactions are executed serially; in other words, only a single transaction is active at a time.

Log-Based Recovery

The most widely used structure for recording database modifications is the log. The log is a sequence of log records, recording all the update activities in the database. There are several types of log records. An update log record describes a single database write. It has these fields:

  • Transaction identifier is the unique identifier of the transaction that performed the write operation.
  • Data-item identifier is the unique identifier of the data item written. Typically, it is the location on disk of the data item.
  • Old value is the value of the data item prior to the write.
  • New value is the value that the data item will have after the write.

Other special log records exist to record significant events during transaction processing, such as the start of a transaction and the commit or abort of a transaction. We denote the various types of log records as:

  • <Ti start>. Transaction Ti has started.
  • <Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj . Xj had value V1 before the write, and will have value V2 after the write.
  • <Ti commit>. Transaction Ti has committed.
  • <Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created before the database is modified. Once a log record exists, we can output the modification to the database if that is desirable. Also, we have the ability to undo a modification that has already been output to the database. We undo it by using the old-value field in log records.

For log records to be useful for recovery from system and disk failures, the log must reside in stable storage. For now, we assume that every log record is written to the end of the log on stable storage as soon as it is created. Observe that the log contains a complete record of all database activity. As a result, the volume of data stored in the log may become unreasonably large.
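For concreteness, a minimal in-memory representation of these record types might look like the sketch below. The class and field names, and the stable_log list that stands in for the log on stable storage, are illustrative rather than part of the book's notation.

# Hypothetical in-memory shapes for the log-record types; appending to
# stable_log stands in for writing the record to the end of the stable log.

from dataclasses import dataclass

stable_log = []

@dataclass
class StartRecord:        # <Ti start>
    tid: str

@dataclass
class UpdateRecord:       # <Ti, Xj, V1, V2>
    tid: str
    item: str
    old_value: int        # V1, used to undo the write
    new_value: int        # V2, used to redo the write

@dataclass
class CommitRecord:       # <Ti commit>
    tid: str

@dataclass
class AbortRecord:        # <Ti abort>
    tid: str

def append_log(record):
    stable_log.append(record)     # assumed to be forced to stable storage at once

append_log(StartRecord("T0"))
append_log(UpdateRecord("T0", "A", 1000, 950))
append_log(UpdateRecord("T0", "B", 2000, 2050))
append_log(CommitRecord("T0"))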

Deferred Database Modification

The deferred-modification technique ensures transaction atomicity by recording all database modifications in the log, but deferring the execution of all write operations of a transaction until the transaction partially commits. Recall that a transaction is said to be partially committed once the final action of the transaction has been executed.

The version of the deferred-modification technique that we describe in this section assumes that transactions are executed serially. When a transaction partially commits, the information on the log associated with the transaction is used in executing the deferred writes. If the system crashes before the transaction completes its execution, or if the transaction aborts, then the information on the log is simply ignored.

The execution of transaction Ti proceeds as follows. Before Ti starts its execution, a record <Ti start> is written to the log. A write(X) operation by Ti results in the writing of a new record to the log. Finally, when Ti partially commits, a record <Ti commit> is written to the log.

When transaction Ti partially commits, the records associated with it in the log are used in executing the deferred writes. Since a failure may occur while this updating is taking place, we must ensure that, before the start of these updates, all the log records are written out to stable storage. Once they have been written, the actual updating takes place, and the transaction enters the committed state.

Observe that only the new value of the data item is required by the deferred-modification technique. Thus, we can simplify the general update-log record structure that we saw in the previous section, by omitting the old-value field. To illustrate, reconsider our simplified banking system. Let T0 be a transaction that transfers $50 from account A to account B:

T0: read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).

Let T1 be a transaction that withdraws $100 from account C:

T1: read(C);
C := C − 100;
write(C).

Suppose that these transactions are executed serially, in the order T0 followed by T1, and that the values of accounts A, B, and C before the execution took place were $1000, $2000, and $700, respectively. The portion of the log containing the relevant information on these two transactions appears in the figure below. There are various orders in which the actual outputs can take place to both the database system and the log as a result of the execution of T0 and T1. One such order appears in a later figure. Note that the value of A is changed in the database only after the record <T0, A, 950> has been placed in the log.

<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0 commit>
<T1 start>
<T1, C, 600>
<T1 commit>

Figure: Portion of the database log corresponding to T0 and T1.

Using the log, the system can handle any failure that results in the loss of information on volatile storage. The recovery scheme uses the following recovery procedure:

  • redo(Ti) sets the value of all data items updated by transaction Ti to the new values.

The set of data items updated by Ti and their respective new values can be found in the log. The redo operation must be idempotent; that is, executing it several times must be equivalent to executing it once. This characteristic is required if we are to guarantee correct behavior even if a failure occurs during the recovery process.

After a failure, the recovery subsystem consults the log to determine which transactions need to be redone. Transaction Ti needs to be redone if and only if the log contains both the record <Ti start> and the record <Ti commit>. Thus, if the system crashes after the transaction completes its execution, the recovery scheme uses the information in the log to restore the system to a previous consistent state after the transaction had completed.

As an illustration, let us return to our banking example with transactions T0 and T1 executed one after the other in the order T0 followed by T1. The figure above shows the log that results from the complete execution of T0 and T1. Let us suppose that the system crashes before the completion of the transactions, so that we can see how the recovery technique restores the database to a consistent state. Assume that the crash occurs just after the log record for the step

write(B)

Figure: State of the log and database corresponding to T0 and T1.

Figure: The same log as that in the figure above, shown at three different times.

of transaction T0 has been written to stable storage. The log at the time of the crash appears in part (a) of the figure above. When the system comes back up, no redo actions need to be taken, since no commit record appears in the log. The values of accounts A and B remain $1000 and $2000, respectively. The log records of the incomplete transaction T0 can be deleted from the log.

Now, let us assume the crash comes just after the log record for the step

write(C)

of transaction T1 has been written to stable storage. In this case, the log at the time of the crash is as shown in part (b) of the figure. When the system comes back up, the operation redo(T0) is performed, since the record

<T0 commit>

appears in the log on the disk. After this operation is executed, the values of accounts A and B are $950 and $2050, respectively. The value of account C remains $700. As before, the log records of the incomplete transaction T1 can be deleted from the log. Finally, assume that a crash occurs just after the log record

<T1 commit>

is written to stable storage. The log at the time of this crash is as shown in part (c) of the figure. When the system comes back up, two commit records are in the log: one for T0 and one for T1. Therefore, the system must perform operations redo(T0) and redo(T1), in the order in which their commit records appear in the log. After the system executes these operations, the values of accounts A, B, and C are $950, $2050, and $600, respectively.

Finally, let us consider a case in which a second system crash occurs during recovery from the first crash. Some changes may have been made to the database as a result of the redo operations, but all changes may not have been made. When the system comes up after the second crash, recovery proceeds exactly as in the preceding examples. For each commit record

<Ti commit>

found in the log, the system performs the operation redo(Ti). In other words, it restarts the recovery actions from the beginning. Since redo writes values to the database independent of the values currently in the database, the result of a successful second attempt at redo is the same as though redo had succeeded the first time.
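The redo-only recovery pass described in this section can be sketched roughly as follows, using Python tuples for the log records and a dictionary for the database; both encodings are illustrative, and the log shown corresponds to crash case (b) above.

# Sketch of redo-only recovery under deferred modification.
# The log is a list of tuples and the database a dictionary (both illustrative);
# this is the state of crash case (b): T0 committed, T1 did not.

db = {"A": 1000, "B": 2000, "C": 700}      # database contents on disk at the crash

log = [
    ("start", "T0"),
    ("update", "T0", "A", 950),            # deferred modification logs only new values
    ("update", "T0", "B", 2050),
    ("commit", "T0"),
    ("start", "T1"),
    ("update", "T1", "C", 600),
]

def redo(tid):
    # Idempotent: re-applying the new values any number of times has the same effect.
    for rec in log:
        if rec[0] == "update" and rec[1] == tid:
            db[rec[2]] = rec[3]

committed = {rec[1] for rec in log if rec[0] == "commit"}
for tid in [rec[1] for rec in log if rec[0] == "start"]:
    if tid in committed:                   # redo exactly the committed transactions
        redo(tid)

print(db)    # {'A': 950, 'B': 2050, 'C': 700}; T1's records are simply ignored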

Immediate Database Modification

The immediate-modification technique allows database modifications to be output to the database while the transaction is still in the active state. Data modifications written by active transactions are called uncommitted modifications. In the event of a crash or a transaction failure, the system must use the old-value field of the log records, described earlier, to restore the modified data items to the values they had prior to the start of the transaction. The undo operation, described next, accomplishes this restoration.

Before a transaction Ti starts its execution, the system writes the record <Ti start> to the log. During its execution, any write(X) operation by Ti is preceded by the writing of the appropriate new update record to the log. When Ti partially commits, the system writes the record <Ti commit> to the log.

Since the information in the log is used in reconstructing the state of the database, we cannot allow the actual update to the database to take place before the corresponding log record is written out to stable storage. We therefore require that, before execution of an output(B) operation, the log records corresponding to B be written onto stable storage.

As an illustration, let us reconsider our simplified banking system, with transactions T0 and T1 executed one after the other in the order T0 followed by T1. The portion of the log containing the relevant information concerning these two transactions appears in the figure below.

The second figure below shows one possible order in which the actual outputs took place in both the database system and the log as a result of the execution of T0 and T1.

<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
<T0 commit>
<T1 start>
<T1, C, 700, 600>
<T1 commit>

Figure: Portion of the system log corresponding to T0 and T1.

Figure: State of system log and database corresponding to T0 and T1.

Using the log, the system can handle any failure that does not result in the loss of information in nonvolatile storage. The recovery scheme uses two recovery procedures:

  • undo(Ti) restores the value of all data items updated by transaction Ti to the old values.
  • redo(Ti) sets the value of all data items updated by transaction Ti to the new values.

The set of data items updated by Ti and their respective old and new values can be found in the log. The undo and redo operations must be idempotent to guarantee correct behavior even if a failure occurs during the recovery process. After a failure has occurred, the recovery scheme consults the log to determine which transactions need to be redone, and which need to be undone:

  • Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not contain the record <Ti commit>.
  • Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti commit>.

As an illustration, return to our banking example, with transactions T0 and T1 executed one after the other in the order T0 followed by T1. Suppose that the system crashes before the completion of the transactions. We shall consider three cases. The state of the logs for each of these cases appears in the figure below. First, let us assume that the crash occurs just after the log record for the step

write(B)

Figure: The same log, shown at three different times.

of transaction T0 has been written to stable storage. When the system comes back up, it finds the record <T0 start> in the log, but no corresponding <T0 commit> record. Thus, transaction T0 must be undone, so an undo(T0) is performed. As a result, the values in accounts A and B (on the disk) are restored to $1000 and $2000, respectively. Next, let us assume that the crash comes just after the log record for the step

write(C)

of transaction T1 has been written to stable storage. When the system comes back up, two recovery actions need to be taken. The operation undo(T1) must be performed, since the record <T1 start> appears in the log, but there is no record <T1 commit>. The operation redo(T0) must be performed, since the log contains both the record <T0 start> and the record <T0 commit>. At the end of the entire recovery procedure, the values of accounts A, B, and C are $950, $2050, and $700, respectively. Note that the undo(T1) operation is performed before the redo(T0). In this example, the same outcome would result if the order were reversed. However, the order of doing undo operations first, and then redo operations, is important for the recovery algorithm.

Finally, let us assume that the crash occurs just after the log record <T1 commit> has been written to stable storage. When the system comes back up, both T0 and T1 need to be redone, since the records <T0 start> and <T0 commit> appear in the log, as do the records <T1 start> and <T1 commit>. After the system performs the recovery procedures redo(T0) and redo(T1), the values in accounts A, B, and C are $950, $2050, and $600, respectively.
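Under the assumptions of this section (serial execution, update records that carry both the old and the new value), the redo-or-undo decision can be sketched as follows; the tuple encoding of the log and the assumed on-disk state at the crash are illustrative.

# Sketch of recovery under immediate modification: undo transactions that have a
# start record but no commit record, redo those that have both. The tuple log and
# the assumed on-disk state (no updates output before the crash) are illustrative.

db = {"A": 1000, "B": 2000, "C": 700}

log = [
    ("start", "T0"),
    ("update", "T0", "A", 1000, 950),      # old value, then new value
    ("update", "T0", "B", 2000, 2050),
    ("commit", "T0"),
    ("start", "T1"),
    ("update", "T1", "C", 700, 600),
    ("commit", "T1"),
]

def undo(tid):
    for rec in reversed(log):              # scan backward, restoring old values
        if rec[0] == "update" and rec[1] == tid:
            db[rec[2]] = rec[3]

def redo(tid):
    for rec in log:                        # scan forward, applying new values
        if rec[0] == "update" and rec[1] == tid:
            db[rec[2]] = rec[4]

committed = {rec[1] for rec in log if rec[0] == "commit"}
for tid in [rec[1] for rec in log if rec[0] == "start"]:
    if tid in committed:
        redo(tid)
    else:
        undo(tid)

print(db)    # {'A': 950, 'B': 2050, 'C': 600}, as in the third crash case above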

Checkpoints

When a system failure occurs, we must consult the log to determine those transactions that need to be redone and those that need to be undone. In principle, we need to search the entire log to determine this information. There are two major difficulties with this approach:

  1. The search process is time consuming.
  2. Most of the transactions that, according to our algorithm, need to be redone have already written their updates into the database. Although redoing them will cause no harm, it will nevertheless cause recovery to take longer.

To reduce these types of overhead, we introduce checkpoints. During execution, the system maintains the log, using one of the two techniques described above. In addition, the system periodically performs checkpoints, which require the following sequence of actions to take place:

  1. Output onto stable storage all log records currently residing in main memory.
  2. Output to the disk all modified buffer blocks.
  3. Output onto stable storage a log record <checkpoint>.

Transactions are not allowed to perform any update actions, such as writing to a buffer block or writing a log record, while a checkpoint is in progress.
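Under these rules, a checkpoint routine might be sketched as follows. The log buffer, dirty-block set, output() placeholder, and the lock used to pause updates are hypothetical stand-ins for the log manager and buffer manager of a real system.

# Sketch of the three checkpoint actions. The log buffer, dirty-block set,
# output() placeholder, and the lock that pauses updates are all hypothetical.

import threading

log_buffer = []                    # log records still in main memory
stable_log = []                    # log records already on stable storage
dirty_blocks = set()               # buffer blocks modified since last output
update_latch = threading.Lock()    # held so that no updates run during a checkpoint

def output(block):
    pass                           # placeholder: write one buffer block to disk

def checkpoint():
    with update_latch:
        stable_log.extend(log_buffer)         # 1. force resident log records out
        log_buffer.clear()
        for b in list(dirty_blocks):          # 2. output all modified buffer blocks
            output(b)
            dirty_blocks.discard(b)
        stable_log.append("<checkpoint>")     # 3. write the <checkpoint> record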

The presence of a <checkpoint> record in the log allows the system to streamline its recovery procedure. Consider a transaction Ti that committed prior to the checkpoint. For such a transaction, the <Ti commit> record appears in the log before the <checkpoint> record. Any database modifications made by Ti must have been written to the database either prior to the checkpoint or as part of the checkpoint itself. Thus, at recovery time, there is no need to perform a redo operation on Ti.

This observation allows us to refine our previous recovery schemes. (We continue to assume that transactions are run serially.) After a failure has occurred, the recovery scheme examines the log to determine the most recent transaction Ti that started executing before the most recent checkpoint took place. It can find such a transaction by searching the log backward, from the end of the log, until it finds the first <checkpoint> record (since we are searching backward, the record found is the final <checkpoint> record in the log); then it continues the search backward until it finds the next <Ti start> record. This record identifies a transaction Ti.

Once the system has identified transaction Ti, the redo and undo operations need to be applied to only transaction Ti and all transactions Tj that started executing after transaction Ti. Let us denote these transactions by the set T. The remainder (earlier part) of the log can be ignored, and can be erased whenever desired. The exact recovery operations to be performed depend on the modification technique being used. For the immediate-modification technique, the recovery operations are:

  • For all transactions Tk in T that have no <Tk commit> record in the log, execute undo(Tk).
  • For all transactions Tk in T such that the record <Tk commit> appears in the log, execute redo(Tk).

Obviously, the undo operation does not need to be applied when the deferred-modification technique is being employed.

As an illustration, consider the set of transactions {T0, T1, . . ., T100} executed in the order of the subscripts. Suppose that the most recent checkpoint took place during the execution of transaction T67. Thus, only transactions T67, T68, . . ., T100 need to be considered during the recovery scheme. Each of them needs to be redone if it has committed; otherwise, it needs to be undone.
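To make this scoping concrete, the fragment below scans a small illustrative log backward for the most recent <checkpoint> record and classifies the transactions that recovery must still consider; the tuple encoding and the transaction names are hypothetical.

# Sketch: scope recovery to the transactions after the most recent checkpoint.
# The tuple-encoded log and the transaction names are illustrative only.

log = [
    ("start", "T66"), ("commit", "T66"),
    ("start", "T67"),
    ("checkpoint",),                    # T67 was still running at the checkpoint
    ("commit", "T67"),
    ("start", "T68"), ("commit", "T68"),
    ("start", "T69"),
]

# Search backward for the most recent <checkpoint> record.
cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")

# Ti is the last transaction that started before the checkpoint; the set T
# consists of Ti and every transaction that started after it.
ti_index = max(i for i in range(cp) if log[i][0] == "start")
T = [rec[1] for rec in log[ti_index:] if rec[0] == "start"]

committed = {rec[1] for rec in log if rec[0] == "commit"}
for tid in T:
    print(tid, "redo" if tid in committed else "undo")
# prints: T67 redo, T68 redo, T69 undo; T66 is never looked at again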

Shadow Paging

An alternative to log-based crash-recovery techniques is shadow paging. The shadow-paging technique is essentially an improvement on the shadow-copy technique. Under certain circumstances, shadow paging may require fewer disk accesses than do the log-based methods discussed previously. There are, however, disadvantages to the shadow-paging approach, as we shall see, that limit its use. For example, it is hard to extend shadow paging to allow multiple transactions to execute concurrently.

As before, the database is partitioned into some number of fixed-length blocks, which are referred to as pages. The term page is borrowed from operating systems, since we are using a paging scheme for memory management. Assume that there are n pages, numbered 1 through n. (In practice, n may be in the hundreds of thousands.) These pages do not need to be stored in any particular order on disk. However, there must be a way to find the ith page of the database for any given i. We use a page table, as in the figure below, for this purpose. The page table has n entries, one for each database page. Each entry contains a pointer to a page on disk. The first entry contains a pointer to the first page of the database, the second entry points to the second page, and so on. The example in the figure shows that the logical order of database pages does not need to correspond to the physical order in which the pages are placed on disk.

The key idea behind the shadow-paging technique is to maintain two page tables during the life of a transaction: the current page table and the shadow page table. When the transaction starts, both page tables are identical. The shadow page table is never changed over the duration of the transaction. The current page table may be changed when a transaction performs a write operation. All input and output operations use the current page table to locate database pages on disk.

Suppose that the transaction Tj performs a write(X) operation, and that X resides on the ith page. The system executes the write operation as follows (a code sketch appears after the list):

  1. If the ith page (that is, the page on which X resides) is not already in main memory, then the system issues input(X).
  2. If this is the first write performed on the ith page by this transaction, then the system modifies the current page table as follows:
    1. It finds an unused page on disk. Usually, the database system has access to a list of unused (free) pages.
    2. It deletes the page found in step 2a from the list of free page frames; it copies the contents of the ith page to the page found in step 2a.
    3. It modifies the current page table so that the ith entry points to the page found in step 2a.
  3. It assigns the value of xj to X in the buffer page.
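A simplified rendering of these steps, where the page table is a plain Python list of disk page numbers and free_pages is a hypothetical free list, might look as follows; it ignores buffering and treats pages as dictionaries rather than disk blocks.

# Sketch of a shadow-paging write(X): copy the page on its first update and
# repoint the current page table at the copy. All data structures are toy ones.

disk_pages = {0: {"X": 100}, 1: {"Y": 7}, 2: {}, 3: {}}   # page number -> contents
free_pages = [2, 3]                                       # list of unused disk pages

shadow_table = [0, 1]                 # never changes during the transaction
current_table = list(shadow_table)    # starts out identical to the shadow table
copied = set()                        # logical pages already copied by this transaction

def write(logical_page, item, value):
    if logical_page not in copied:                    # first write to this page?
        new_page = free_pages.pop(0)                  # 2a. find an unused disk page
        disk_pages[new_page] = dict(disk_pages[current_table[logical_page]])  # 2b. copy
        current_table[logical_page] = new_page        # 2c. repoint the current table
        copied.add(logical_page)
    disk_pages[current_table[logical_page]][item] = value   # 3. update the copy

write(0, "X", 150)
print(shadow_table, current_table)    # [0, 1] [2, 1]: the shadow copy of X is untouched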

Figure: Sample page table.

Figure: Shadow and current page tables.

Intuitively, the shadow-page approach to recovery is to store the shadow page table in nonvolatile storage, so that the state of the database prior to the execution of the transaction can be recovered in the event of a crash, or transaction abort. When the transaction commits, the system writes the current page table to nonvolatile storage.

The current page table then becomes the new shadow page table, and the next transaction is allowed to begin execution. It is important that the shadow page table be stored in nonvolatile storage, since it provides the only means of locating database pages. The current page table may be kept in main memory (volatile storage). We do not care whether the current page table is lost in a crash, since the system recovers by using the shadow page table.

Successful recovery requires that we find the shadow page table on disk after a crash. A simple way of finding it is to choose one fixed location in stable storage that contains the disk address of the shadow page table. When the system comes back up after a crash, it copies the shadow page table into main memory and uses it for subsequent transaction processing. Because of our definition of the write operation, we are guaranteed that the shadow page table will point to the database pages corresponding to the state of the database prior to any transaction that was active at the time of the crash. Thus, aborts are automatic. Unlike our log-based schemes, shadow paging needs to invoke no undo operations. To commit a transaction, we must do the following (a code sketch appears after the list):

  1. Ensure that all buffer pages in main memory that have been changed by the transaction are output to disk. (Note that these output operations will not change database pages pointed to by some entry in the shadow page table.)
  2. Output the current page table to disk. Note that we must not overwrite the shadow page table, since we may need it for recovery from a crash.
  3. Output the disk address of the current page table to the fixed location in stable storage containing the address of the shadow page table. This action overwrites the address of the old shadow page table. Therefore, the current page table has become the shadow page table, and the transaction is committed.
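Continuing the same toy model (all names hypothetical), commit amounts to flushing the changed pages and then overwriting the single fixed location that records which page table is the shadow one:

# Sketch of shadow-paging commit in the same toy model; the one-element list
# fixed_location stands in for the fixed stable-storage location that holds
# the disk address of the shadow page table.

import copy

disk_pages = {0: {"X": 100}, 1: {"Y": 7}, 2: {"X": 150}}
shadow_table = [0, 1]
current_table = [2, 1]

fixed_location = [shadow_table]       # stable storage: address of the shadow table

def commit():
    # 1. changed buffer pages are assumed to be on disk already in this toy model
    committed_table = copy.deepcopy(current_table)   # 2. output the current page table
    fixed_location[0] = committed_table              # 3. one pointer switch commits
    return committed_table

shadow_table = commit()               # the current table has become the shadow table
print(fixed_location[0])              # [2, 1]
# A crash before the assignment to fixed_location[0] leaves the old shadow table
# in place, so the transaction's effects are invisible after restart.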

If a crash occurs prior to the completion of step 3, we revert to the state just prior to the execution of the transaction. If the crash occurs after the completion of step 3, the effects of the transaction will be preserved; no redo operations need to be invoked.

Shadow paging offers several advantages over log-based techniques. The overhead of log-record output is eliminated, and recovery from crashes is significantly faster (since no undo or redo operations are needed). However, there are drawbacks to the shadow-page technique:

  • Commit overhead. The commit of a single transaction using shadow paging requires multiple blocks to be output: the actual data blocks, the current page table, and the disk address of the current page table. Log-based schemes need to output only the log records, which, for typical small transactions, fit within one block.

The overhead of writing an entire page table can be reduced by implementing the page table as a tree structure, with page table entries at the leaves. We outline the idea below, and leave it to the reader to fill in the missing details. The nodes of the tree are pages and have a high fanout, like B+-trees. The current page table's tree is initially the same as the shadow page table's tree. When a page is to be updated for the first time, the system changes the entry in the current page table to point to the copy of the page. If the leaf page containing the entry has been copied already, the system directly updates it. Otherwise, the system first copies it, and updates the copy. In turn, the parent of the copied page needs to be updated to point to the new copy, which the system does by applying the same procedure to its parent, copying it if it was not already copied. The process of copying proceeds up to the root of the tree. Changes are made only to the copied nodes, so the shadow page table's tree does not get modified.

The benefit of the tree representation is that the only pages that need to be copied are the leaf pages that are updated, and all their ancestors in the tree. All the other parts of the tree are shared between the shadow and the current page table, and do not need to be copied. The reduction in copying costs can be very significant for large databases. However, several pages of the page table still need to be copied for each transaction, and the log-based schemes continue to be superior as long as most transactions update only small parts of the database.

  • Data fragmentation. In Storage and File Structure, we considered strategies to ensure locality, that is, to keep related database pages physically close on the disk. Locality allows for faster data transfer. Shadow paging causes database pages to change location when they are updated. As a result, either we lose the locality property of the pages or we must resort to more complex, higher-overhead schemes for physical storage management. (See the bibliographical notes for references.)
  • Garbage collection. Each time that a transaction commits, the database pages containing the old version of data changed by the transaction become inaccessible. In the figure above, the page pointed to by the fourth entry of the shadow page table will become inaccessible once the transaction of that example commits. Such pages are considered garbage, since they are not part of free space and do not contain usable information. Garbage may be created also as a side effect of crashes. Periodically, it is necessary to find all the garbage pages, and to add them to the list of free pages. This process, called garbage collection, imposes additional overhead and complexity on the system. There are several standard algorithms for garbage collection. (See the bibliographical notes for references.)

In addition to the drawbacks of shadow paging just mentioned, shadow paging is more difficult than logging to adapt to systems that allow several transactions to execute concurrently. In such systems, some logging is usually required, even if shadow paging is used. The System R prototype, for example, used a combination of shadow paging and a logging scheme. It is relatively easy to extend the log-based recovery schemes to allow concurrent transactions. For these reasons, shadow paging is not widely used.

Recovery with Concurrent Transactions

Until now, we considered recovery in an environment where only a single transaction at a time is executing. We now discuss how we can modify and extend the log-based recovery scheme to deal with multiple concurrent transactions. Regardless of the number of concurrent transactions, the system has a single disk buffer and a single log. All transactions share the buffer blocks. We allow immediate modification, and permit a buffer block to have data items updated by one or more transactions.

Interaction with Concurrency Control

The recovery scheme depends greatly on the concurrency-control scheme that is used. To roll back a failed transaction, we must undo the updates performed by the transaction. Suppose that a transaction T0 has to be rolled back, and a data item Q that was updated by T0 has to be restored to its old value. Using the log-based schemes for recovery, we restore the value by using the undo information in a log record. Suppose now that a second transaction T1 has performed yet another update on Q before T0 is rolled back. Then, the update performed by T1 will be lost if T0 is rolled back.

Therefore, we require that, if a transaction T has updated a data item Q, no other transaction may update the same data item until T has committed or been rolled back. We can ensure this requirement easily by using strict two-phase locking; that is, two-phase locking with exclusive locks held until the end of the transaction.

Transaction Rollback

We roll back a failed transaction, Ti, by using the log. The system scans the log backward; for every log record of the form <Ti, Xj, V1, V2> found in the log, the system restores the data item Xj to its old value V1. Scanning of the log terminates when the log record <Ti start> is found. Scanning the log backward is important, since a transaction may have updated a data item more than once. As an illustration, consider the pair of log records

<Ti, A, 10, 20>
<Ti, A, 20, 30>

The log records represent a modification of data item A by Ti, followed by another modification of A by Ti. Scanning the log backward sets A correctly to 10. If the log were scanned in the forward direction, A would be set to 20, which is incorrect.

If strict two-phase locking is used for concurrency control, locks held by a transaction T may be released only after the transaction has been rolled back as described. Once transaction T (that is being rolled back) has updated a data item, no other transaction could have updated the same data item, because of the concurrency-control requirement described above. Therefore, restoring the old value of the data item will not erase the effects of any other transaction.
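As a small, self-contained illustration of this backward scan, rollback of a single transaction might look like the sketch below, using the same illustrative tuple encoding for log records as in the earlier sketches.

# Sketch of rolling back one transaction by a backward scan of the log,
# using the same illustrative tuple encoding as the earlier sketches.

db = {"A": 30}

log = [
    ("start", "Ti"),
    ("update", "Ti", "A", 10, 20),    # <Ti, A, 10, 20>
    ("update", "Ti", "A", 20, 30),    # <Ti, A, 20, 30>
]

def rollback(tid):
    for rec in reversed(log):                     # newest record first
        if rec[0] == "update" and rec[1] == tid:
            db[rec[2]] = rec[3]                   # restore the old value V1
        elif rec == ("start", tid):               # stop at <Ti start>
            break

rollback("Ti")
print(db)    # {'A': 10}; a forward scan would have left A at 20, which is wrong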

Checkpoints

Earlier, we used checkpoints to reduce the number of log records that the system must scan when it recovers from a crash. Since we assumed no concurrency, it was necessary to consider only the following transactions during recovery:

  • Those transactions that started after the most recent checkpoint
  • The one transaction, if any, that was active at the time of the most recent checkpoint

The situation is more complex when transactions can execute concurrently, since several transactions may have been active at the time of the most recent checkpoint.

In a concurrent transaction-processing system, we require that the checkpoint log record be of the form <checkpoint L>, where L is a list of transactions active at the time of the checkpoint. Again, we assume that transactions do not perform updates either on the buffer blocks or on the log while the checkpoint is in progress. The requirement that transactions must not perform any updates to buffer blocks or to the log during checkpointing can be bothersome, since transaction processing will have to halt while a checkpoint is in progress. A fuzzy checkpoint is a checkpoint where transactions are allowed to perform updates even while buffer blocks are being written out.

Restart Recovery

When the system recovers from a crash, it constructs two lists: the undo-list consists of transactions to be undone, and the redo-list consists of transactions to be redone. The system constructs the two lists as follows: Initially, they are both empty. The system scans the log backward, examining each record, until it finds the first <checkpoint> record:

  • For each record found of the form <Ti commit>, it adds Ti to redo-list.
  • For each record found of the form <Ti start>, if Ti is not in redo-list, then it adds Ti to undo-list.

When the system has examined all the appropriate log records, it checks the list L in the checkpoint record. For each transaction Ti in L, if Ti is not in redo-list, then it adds Ti to the undo-list. Once the redo-list and undo-list have been constructed, the recovery proceeds as follows:

  1. The system rescans the log from the most recent record backward, and performs an undo for each log record that belongs to a transaction Ti on the undo-list. Log records of transactions on the redo-list are ignored in this phase. The scan stops when the <Ti start> records have been found for every transaction Ti in the undo-list.
  2. The system locates the most recent <checkpoint L> record on the log. Notice that this step may involve scanning the log forward, if the checkpoint record was passed in step 1.
  3. The system scans the log forward from the most recent <checkpoint L> record, and performs redo for each log record that belongs to a transaction Ti that is on the redo-list. It ignores log records of transactions on the undo-list in this phase.

It is important in step 1 to process the log backward, to ensure that the resulting state of the database is correct.

After the system has undone all transactions on the undo-list, it redoes those transactions on the redo-list. It is important, in this case, to process the log forward. When the recovery process has completed, transaction processing resumes.

It is important to undo the transactions in the undo-list before redoing transactions in the redo-list, using the algorithm in steps 1 to 3; otherwise, a problem may occur. Suppose that data item A initially has the value 10. Suppose that a transaction Ti updated data item A to 20 and aborted; transaction rollback would restore A to the value 10. Suppose that another transaction Tj then updated data item A to 30 and committed, following which the system crashed. The state of the log at the time of the crash is

<Ti, A, 10, 20>
<Tj, A, 10, 30>
<Tj commit>

If the redo pass is performed first, A will be set to 30; then, in the undo pass, A will be set to 10, which is wrong. The final value of A should be 30, which we can ensure by performing undo before performing redo.
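Putting the list construction and the recovery passes together, a compressed sketch of restart recovery might read as follows; the tuple-encoded log is illustrative, and it includes the <Ti start>, <Tj start>, and <checkpoint L> records that the algorithm needs but that the excerpt above omits.

# Sketch of restart recovery with concurrent transactions: build the undo-list
# and redo-list by a backward scan to the checkpoint, then undo backward and
# redo forward. The tuple-encoded log is illustrative only.

db = {"A": 30}                          # value on disk at the time of the crash

log = [
    ("checkpoint", []),                 # <checkpoint L>: no transaction was active
    ("start", "Ti"),
    ("update", "Ti", "A", 10, 20),
    ("start", "Tj"),
    ("update", "Tj", "A", 10, 30),
    ("commit", "Tj"),
]

redo_list, undo_list = set(), set()
for i in range(len(log) - 1, -1, -1):   # backward scan to the <checkpoint L> record
    rec = log[i]
    if rec[0] == "commit":
        redo_list.add(rec[1])
    elif rec[0] == "start" and rec[1] not in redo_list:
        undo_list.add(rec[1])
    elif rec[0] == "checkpoint":
        undo_list.update(t for t in rec[1] if t not in redo_list)
        cp_index = i
        break

for rec in reversed(log):               # pass 1: undo, scanning backward
    if rec[0] == "update" and rec[1] in undo_list:
        db[rec[2]] = rec[3]             # restore the old value

for rec in log[cp_index:]:              # pass 2: redo, scanning forward from the checkpoint
    if rec[0] == "update" and rec[1] in redo_list:
        db[rec[2]] = rec[4]             # reapply the new value

print(undo_list, redo_list, db)         # {'Ti'} {'Tj'} {'A': 30}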

Buffer Management

In this section, we consider several subtle details that are essential to the implementation of a crash-recovery scheme that ensures data consistency and imposes a minimal amount of overhead on interactions with the database.

Log-Record Buffering

So far, we have assumed that every log record is output to stable storage at the time it is created. This assumption imposes a high overhead on system execution for several
