Assessing Fault Tolerance and Disaster Recovery Needs

Before implementing fault tolerance or disaster recovery, you should determine how critical your systems are to daily business operations. Additionally, you should determine how long each system could afford to be nonfunctional (down). Making these determinations will dictate which fault tolerance and disaster recovery methods you implement and to what extent. The more vital the system, the greater lengths (and, thus, the greater expense) you should go to in order to protect it from downtime. Less-critical systems may call for simpler measures. For example, banks, insurance companies, the U.S. government, and airlines all run highly critical computer and network systems. Thus, they all have complex and expensive fault tolerance and disaster recovery systems in place.

In terms of how fault tolerance and disaster recovery are implemented, sites can be described as hot, warm, or cold. As the temperature decreases, so does the level of fault tolerance and disaster recovery implemented at a site.

Hot Site

In a hot site, every computer system and piece of information has a redundant copy (possibly multiple redundancies). This level of fault tolerance is used when systems must be up 100 percent of the time. Hot sites are strictly fault-tolerant implementations, not disaster recovery implementations (as no downtime is allowed). Budgets for this type of fault-tolerant implementation are typically large.

In a system that has 100-percent redundancy, the redundant system(s) will take over for the failed system without any downtime. The technology used to implement hot sites is clustering, which is the process of grouping multiple computers in order to provide increased performance and fault tolerance.

Although servers are commonly clustered, workstations are normally not because they are simple and cheap to replace. Each computer in the cluster is connected to the other computers in the cluster by high-speed, redundant links (usually multiple fiber-optic cables). Each computer runs special clustering software that makes the cluster of computers appear as a single entity to clients.

There are two levels of cluster service: failover and true.

Failover Clustering

A failover cluster includes two entities (usually servers). The first is the active device (the device that responds to network requests), and the second is the failover device. The failover device is an exact duplicate of the active device, but it is inactive and connected to the active device by a high-speed link. The failover device monitors the active device and its condition by using what is known as a heartbeat. A heartbeat is a signal that comes from the active device at a specified interval. If the failover device doesn’t receive a heartbeat from the active device within the specified interval, it considers the active device to be down, comes online, and takes over as the active device.

When the previously active device comes back online, it starts sending out the heartbeat. The failover device, which currently is responding to requests as the active device, hears the heartbeat and detects that the active device is now back online. The failover device then goes back into standby mode and starts listening to the heartbeat of the active device again.
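The heartbeat logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular clustering product implements it; the class name FailoverMonitor, its fields, and the 3-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Standby-side view of a two-device failover pair (hypothetical model)."""

    timeout: float                   # seconds of silence tolerated before takeover
    last_heartbeat: float = 0.0      # time the most recent heartbeat was heard
    standby_is_active: bool = False  # True while the failover device is serving

    def on_heartbeat(self, now: float) -> None:
        # A heartbeat means the active device is alive. If the failover
        # device had taken over, it returns to standby mode.
        self.last_heartbeat = now
        self.standby_is_active = False

    def tick(self, now: float) -> None:
        # Called periodically. If the heartbeat is overdue, the failover
        # device comes online and starts answering requests.
        if now - self.last_heartbeat > self.timeout:
            self.standby_is_active = True


monitor = FailoverMonitor(timeout=3.0)  # 3.0 s is an assumed interval
monitor.on_heartbeat(now=0.0)
monitor.tick(now=2.0)          # heartbeat still fresh: stays in standby
monitor.tick(now=5.0)          # 5 s of silence exceeds the timeout: takes over
monitor.on_heartbeat(now=6.0)  # active device is back: returns to standby
```

The gap between the last heartbeat and the moment tick() notices the silence is part of the delay that clients experience during a failover, so a shorter timeout shortens that delay at the price of more false alarms.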

In a failover cluster, both servers must be running failover clustering software. Examples include Novell’s System Fault Tolerance, Level III (SFTIII), Standby Server and High Availability Server (with Novell’s High Availability software, either of the servers can fail and the other will take over), and Microsoft Cluster Server (MSCS) for Windows NT servers. This functionality is built into Microsoft Windows 2000 and later operating systems. Each software package provides failover functionality.

Here are some advantages of this approach to fault tolerance:

  • Resources are almost always available. This approach ensures that the network service(s) that the device provides will be available as much as 99 percent of the time. Each network service and all data are exactly duplicated on each device, and when one experiences problems, the other takes over for virtually uninterrupted service.
  • It is relatively inexpensive when compared with true clustering (discussed in the next section).

But, as with any technology, there are disadvantages, and failover clustering has its fair share:

  • There is only one level of fault tolerance. This technology works great if the active device fails, but if the failover device fails as well, the network will totally lose that device’s functionality.
  • There is no load balancing. Servers in a failover-clustering configuration are in either active or standby mode. There is no balancing of network service load across both servers in the cluster. The active server responds to network requests, and the failover server simply monitors the active server, wasting its processor resources.
  • There is a cutover time. Failover clusters take anywhere from a few seconds to a few minutes to detect and recover from a failed server, a delay referred to as cutover time. During cutover time, the server can’t respond to network client requests, so the server is effectively down. This time is short, but clients can’t get access to their services in the meantime.
  • Hardware and software must be exactly duplicated. In most failover configurations, the hardware for both active and failover devices must be identical. If it’s not, the transition of the failover device to active device may be hindered. These differences may even cause the failover to fail. This is a disadvantage because it involves checking all aspects of the hardware. (For servers, this means disk types and sizes, NICs, processor speed and type, and RAM.)

Even though Microsoft Cluster Server (MSCS) was described earlier as a failover clustering technology, it does have some capability for load balancing (according to Microsoft). It currently supports only a two-device configuration, so it primarily fits into this category of clustering.

True Clustering

True clustering differs from failover clustering in two major ways:

  • It supports multiple devices.
  • It provides load balancing.

In true clustering (also called multiple server clustering), multiple servers (or any network devices) act together as a kind of super server. True clusters must provide load balancing. For example, 20 servers can act as one big server. All network services are duplicated across all servers, and network requests are distributed across all servers. Each server is connected to the other servers through a high-speed, dedicated link. If one server in the cluster malfunctions, the other servers automatically take over the burden of the failed server. When the failed server comes back online, it resumes responding to requests as part of the cluster. This technology can provide greater than 99-percent availability for network services hosted by the cluster.
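To make the load-balancing behavior concrete, here is a minimal Python sketch of a cluster that hands requests to its servers in round-robin order and skips any server that is marked as down. Round-robin is just one simple distribution policy chosen for the illustration; real clustering software may use something quite different. The Cluster class and its method names are hypothetical.

```python
class Cluster:
    """Toy model of a load-balanced cluster (all names are hypothetical)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)  # servers currently able to respond
        self._next = 0                    # round-robin position

    def mark_down(self, server):
        # A malfunctioning server stops receiving requests.
        self.healthy.discard(server)

    def mark_up(self, server):
        # A recovered server resumes responding as part of the cluster.
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        # Try each server at most once, starting at the round-robin
        # position, and return the first healthy one.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers in the cluster")


cluster = Cluster(["srv1", "srv2", "srv3"])
cluster.mark_down("srv2")
targets = [cluster.route() for _ in range(4)]  # srv2 is skipped every time
```

Because the remaining servers keep answering while one is down, the only visible effect in this model is that each surviving server receives a larger share of the requests.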

Several advantages are associated with true clustering:

  • There is more than 99-percent availability for network services. With multiple servers, the impact of a single server, or even more than one server, in the cluster going down is minimized because other servers take over the functionality.
  • It offers increased performance. Because each server is taking part of the load of the cluster, much higher total performance is possible.
  • There is no cutover time. Because multiple servers are always responding to network requests, true clusters don’t suffer from cutover time, even when a server goes down. The remaining servers do receive an increased load, and clients may see a Server Busy or Not Found error message if they should, by some chance, try to communicate with the server that is going down. But if the user tries the operation again, one of the remaining servers will respond to the request.
  • It provides for replication. If the clustering software in use supports it, a few servers can be located offsite in case the main site is destroyed by fire, flood, or other disaster. Because there is a replica (copy) of all data in a different location, this technology is known as replication.

But these advantages don’t come without a price. Here are a couple of disadvantages to true clustering:

  • The more servers, the more complex the cluster. As you add servers to the cluster to increase performance, you also increase the complexity. For this reason, most clustering software is limited to a maximum of 64 servers. As technology develops, this limit will increase. The minimum number of servers in a true cluster is 2.
  • It is much more expensive. Because of the hardware involved and the complexity of the clustering software, true clustering requires a serious financial commitment. To justify the expense, ask the keepers of the purse strings how much money would be lost if the system were down for a day.
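The cost question can be turned into simple arithmetic. The Python sketch below converts an availability percentage into expected hours of downtime per year and multiplies an outage length by an hourly loss figure. The function names and the 10,000-per-hour loss are assumptions made for the example; substitute your own organization's numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def downtime_cost(hours_down: float, loss_per_hour: float) -> float:
    """Money lost during an outage, assuming a constant hourly loss."""
    return hours_down * loss_per_hour


# 99-percent availability still allows roughly 87.6 hours of downtime a year.
hours_at_99 = annual_downtime_hours(99.0)

# Hypothetical figure: the business loses 10,000 for every hour of downtime.
one_day_outage = downtime_cost(hours_down=24, loss_per_hour=10_000)  # 240000
```

Comparing a figure like one_day_outage with the price of the cluster gives the keepers of the purse strings a concrete basis for the decision.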

Warm Site

In a warm site, the network service and data are available most of the time. The data and services are less critical than those in a hot site. With hot-site technologies, all fault tolerance procedures are automatic and are controlled by the NOS. Warm-site technologies require a little more administrator intervention, but they aren’t as expensive.

The most commonly used warm-site technology is a duplicate server. A duplicate server, as its name suggests, is one that is currently not being used and is available to replace any server that fails. When a server fails, the administrator installs the duplicate server and restores the data; the network services are available to users with a minimum of downtime. The administrator sends the failed server out to be repaired. Once the repaired server comes back, it becomes the spare server and is available when another server fails.

Using a duplicate server is a disaster recovery method because the entire server is replaced, but in a shorter time than if all the components had to be ordered and configured at the time of the system failure. The major advantage of using duplicate servers rather than clustering is that it’s less expensive. A single duplicate server costs much less than a comparable clustering solution. Corporate networks don’t often use duplicate servers, however, because there are some major disadvantages associated with using them:

  • You must keep current backups. Because the duplicate server relies on a current backup, you must back up every day and verify every backup, which is time-consuming. To stay as current as possible, some companies run continuous backups.
  • You can lose data. If a server fails in mid-afternoon and the backup was run the evening before, you will lose any data that was placed on the server since the last backup. This may not be a big problem on servers that aren’t updated frequently.
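The data-loss exposure described above is simply the time between the last good backup and the failure. As a small Python illustration, the function below computes that window; the function name and the specific dates and times are made up for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of time whose changes are lost when restoring from backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


# Hypothetical times: backup ran at 10 p.m., server failed at 3 p.m. next day.
last_backup = datetime(2024, 3, 4, 22, 0)
failure = datetime(2024, 3, 5, 15, 0)
window = data_loss_window(last_backup, failure)  # 17 hours of changes at risk
```

Running backups more often shrinks this window, which is why some companies run continuous backups.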

Cold Site

A cold site cannot guarantee server uptime. Generally speaking, cold sites have little or no fault tolerance and rely completely on efficient disaster recovery methods to ensure data integrity. If a server fails, the IT personnel will do their best to recover and fix the problem. If a major component needs to be replaced, the server stays down until the component is replaced. Errors and failures are handled as they occur. Apart from regular system backups, no fault tolerance or disaster recovery methods are implemented.

This type of site has one major advantage: It is the cheapest way to deal with errors and system failures. No extra hardware is required (except the hardware required for backing up). Any disadvantages of implementing a cold site would stem from having an application that cannot afford the downtime associated with service-affecting faults and disasters.

The term nearline refers to a storage method that is neither online nor offline but somewhere in the middle, like tape backup. It involves material that is not likely to be needed except in cases of disaster recovery. While there is not a one-to-one correspondence between any type of site (hot, warm, or cold) and nearline storage, which is not actively accessed during normal operation, you can see that nearline storage comes in handy when recovering from disasters in warm and cold sites.

