
Fault Tolerance and High Availability


Note: Many topics at this site are reduced versions of the text in "The Encyclopedia of Networking and Telecommunications." Search results will not be as extensive as a search of the book's CD-ROM.

Fault tolerance and high availability are about keeping systems up and running 24 hours a day, 7 days a week, or at least keeping them running with a reasonable level of performance. Downed systems can cost an organization thousands of dollars per hour, as outlined in the following table:

The Cost of Internet Commerce Downtime
(Source: Forrester Research)

Web Site | Daily Internet Commerce Revenue as of 1/15/99 (U.S. $) | Lost Revenue per Hour of Downtime as of 1/15/99 (U.S. $)*

*Lost revenue assumes a U.S. $1-million-per-day site where 20 percent of transactions are lost during downtime.
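
The footnote's assumption can be worked through as simple arithmetic. The sketch below is only an illustration: it assumes revenue is spread evenly over 24 hours and reads "20 percent of transactions are lost" as 20 percent of hourly revenue. The function name `lost_revenue_per_hour` is hypothetical and is not taken from the Forrester data.

```python
def lost_revenue_per_hour(daily_revenue: float, lost_fraction: float) -> float:
    """Estimate revenue lost per hour of downtime.

    Assumes revenue arrives evenly across the day (an assumption,
    not something the source table states).
    """
    hourly_revenue = daily_revenue / 24
    return hourly_revenue * lost_fraction


# The footnote's example: a $1-million-per-day site losing 20 percent
# of its transactions while it is down.
loss = lost_revenue_per_hour(1_000_000, 0.20)
print(f"${loss:,.2f} lost per hour of downtime")  # $8,333.33 lost per hour of downtime
```

Under these assumptions, the site loses roughly $8,333 for every hour it is down.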

A fault-tolerant system is designed to keep running even after a fault has occurred. Fault-tolerant features in early network operating systems included mirrored disks, with both disks reading and writing the same information. If one disk failed, the other kept running in what is called "failover" mode. This fault tolerance was expanded to disk duplexing, in which the disks and disk controllers were duplicated. These redundant components not only provided fault tolerance, but also improved performance since disk reads could come from either disk (writes still had to be performed by both disks). Of course, fault-tolerant systems must provide more than just disk failover. Some other examples of redundant systems include the following:
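
The mirroring behavior described above can be sketched in a few lines. This is a simplified in-memory model, not how any particular network operating system implemented it; the `Disk` and `MirroredPair` classes are hypothetical names used only for illustration. Writes go to every healthy disk, reads alternate between healthy disks, and a single disk failure is survived.

```python
class Disk:
    """A toy disk: a dictionary of block number -> data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[block] = data

    def read(self, block):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[block]


class MirroredPair:
    """Two disks holding the same data (illustrative only)."""

    def __init__(self, first, second):
        self.disks = [first, second]
        self._reads = 0  # used to alternate reads between disks

    def _healthy(self):
        healthy = [d for d in self.disks if not d.failed]
        if not healthy:
            raise IOError("both disks have failed")
        return healthy

    def write(self, block, data):
        # Every write must be performed by each working disk.
        for disk in self._healthy():
            disk.write(block, data)

    def read(self, block):
        # A read can be served by either disk, which spreads the load.
        healthy = self._healthy()
        disk = healthy[self._reads % len(healthy)]
        self._reads += 1
        return disk.read(block)


disk0, disk1 = Disk("disk0"), Disk("disk1")
mirror = MirroredPair(disk0, disk1)
mirror.write(7, b"payload")

disk0.failed = True              # simulate a disk fault
print(mirror.read(7))            # still served by disk1: b'payload'
```

The sketch leaves out resynchronization: a real mirror must copy current data back onto a replaced disk before trusting it for reads again.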

  • RAID disk systems combine multiple hard drives into fault-protected arrays.

  • Redundant components (power supplies, I/O boards, and so on) take over when a primary component fails.

  • Multiple servers are clustered to minimize problems if any of the servers should fail.

  • Alternate pathing and load balancing improve throughput and provide redundant links.

  • Multiple data centers protect against local disasters.
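
As a rough illustration of how clustering and load balancing work together, the sketch below rotates requests across a set of servers and skips any server marked as down. It is a minimal model under stated assumptions: the `Cluster` class, its method names, and the server names are hypothetical, and real products add health checks, session handling, and much more.

```python
from itertools import cycle


class Cluster:
    """Round-robin selection over servers, skipping failed ones."""

    def __init__(self, servers):
        self.healthy = {name: True for name in servers}
        self._rotation = cycle(servers)
        self._size = len(servers)

    def mark_down(self, name):
        self.healthy[name] = False

    def mark_up(self, name):
        self.healthy[name] = True

    def pick(self):
        # Look at each server at most once per call.
        for _ in range(self._size):
            name = next(self._rotation)
            if self.healthy[name]:
                return name
        raise RuntimeError("no healthy servers available")


cluster = Cluster(["web1", "web2", "web3"])
print([cluster.pick() for _ in range(3)])  # ['web1', 'web2', 'web3']

cluster.mark_down("web2")                  # simulate a server failure
print([cluster.pick() for _ in range(3)])  # ['web1', 'web3', 'web1']
```

While all servers are up, the rotation spreads the load; when one fails, the remaining servers absorb its share without the caller needing to know.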

This topic continues in "The Encyclopedia of Networking and Telecommunications" with a discussion of the following:

  • High availability (resiliency) and ways of measuring it (mean time to failure and mean time to recover)
  • Classes of availability, including two nines, three nines, four nines, five nines, and six nines
  • Ways to achieve fault tolerance and high availability, including:
      • Disk-level protection
      • Transaction-monitoring systems
      • Redundant components
      • Uninterruptible power
      • Disk mirroring and duplexing
      • RAIDs (redundant arrays of inexpensive disks)
      • Mirrored servers
      • Clustering
      • Load balancing
      • Redundant communication links
      • Distributed computing
      • Duplicate data centers
      • Outsourcing and colocation
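
The availability measures named in the list above can be made concrete with a short calculation. The sketch below uses the commonly cited steady-state formula, availability = MTTF / (MTTF + MTTR), and treats "N nines" as an availability of 1 - 10^-N (two nines is 99 percent, five nines is 99.999 percent). These definitions are assumed here for illustration; the book's own treatment may differ. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time to failure
    and mean time to recover (both in hours)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(nines: int) -> float:
    """Expected yearly downtime for a given class of availability."""
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


# A system that runs 999 hours between failures and takes 1 hour to
# recover is available 99.9 percent of the time (three nines).
print(f"{availability(999, 1):.3%}")

for n in range(2, 7):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours:.4f} hours ({hours * 60:.2f} minutes) of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, from about 87.6 hours per year at two nines to a little over 5 minutes per year at five nines.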

Copyright (c) 2001 Tom Sheldon and Big Sur Multimedia.
All rights reserved under Pan American and International copyright conventions.