Emmanuel García: High Availability Cluster and Fault Tolerance

Previously, while looking at some information about clusters, I stumbled upon a certain kind of cluster: the High Availability Cluster. I was a little curious, so I decided to look further into those clusters, and I found out that they are closely tied to fault tolerance (which is the topic for this week), so then I had a reason to really research them.

Fault Tolerance

Fault tolerance is the ability of a system to perform its function correctly even in the presence of internal faults. The purpose of fault tolerance is to increase the dependability of a system. A complementary but separate approach to increasing dependability is fault prevention. This consists of techniques, such as inspection, whose intent is to eliminate the circumstances by which faults arise.

Classifications

Based on duration, faults can be classified as transient or permanent. A transient fault will eventually disappear without any apparent intervention, whereas a permanent one will remain unless it is removed by some external agency. While it may seem that permanent faults are more severe, from an engineering perspective, they are much easier to diagnose and handle. A particularly problematic type of transient fault is the intermittent fault that recurs, often unpredictably.
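To make the transient/permanent distinction a bit more concrete, here is a minimal sketch of a retry loop. It assumes that a fault which clears within a few attempts can be treated as transient, and that one which survives every attempt should be handed to an external agent as permanent. The names TransientError, call_with_retry and make_flaky are hypothetical and exist only for this illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a fault that may clear on its own."""


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Run `operation`, retrying when it raises TransientError.

    If the fault clears within `attempts` tries, it behaved like a
    transient fault. If it persists through every try, it is treated as
    permanent and re-raised so that something external can handle it.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            time.sleep(delay_s)  # give the fault a chance to disappear
    raise RuntimeError("fault persisted; treating it as permanent") from last_error


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("temporary glitch")
        return "ok"

    return operation
```

Note that a retry loop like this cannot really tell an intermittent fault from one that has gone away for good, which is part of why intermittent faults are so troublesome.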

A different way to classify faults is by their underlying cause. Design faults are the result of design failures, such as coding errors. While it may appear that in a carefully designed system all such faults should be eliminated through fault prevention, this is usually not realistic in practice. For this reason, many fault-tolerant systems are built with the assumption that design faults are inevitable, and that mechanisms need to be put in place to protect the system against them. Operational faults, on the other hand, are faults that occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes.

Finally, based on how a failed component behaves once it has failed, faults can be classified into the following categories:

Crash faults -- the component either completely stops operating or never returns to a valid state;
Omission faults -- the component completely fails to perform its service;
Timing faults -- the component does not complete its service on time;
Byzantine faults -- the component behaves in an arbitrary way, for example by producing incorrect or inconsistent output.

Error Detection

The most common techniques for error detection are:

Replication checks -- In this case, multiple replicas of a component perform the same service simultaneously. The outputs of the replicas are compared, and any discrepancy is an indication of an error in one or more components. A particular form of this that is often used in hardware is called triple-modular redundancy (TMR), in which the outputs of three independent components are compared, and the output of the majority is the one actually passed on. In software, this can be achieved by providing multiple independently developed realizations of the same component. This is called N-version programming.
Timing checks -- These are used for detecting timing faults. Typically a timer is started, set to expire at the point at which a given service is expected to be complete. If the service terminates successfully before the timer expires, the timer is cancelled. However, if the timer times out, then a timing error has occurred. Timers are problematic when the execution time of a function varies. In such cases, it is dangerous to set the timer too tightly, since it may produce false positives. However, setting it too loosely would delay the detection of the error, allowing its effects to propagate much more widely.
Run-time constraints checking -- This involves checking at run time that certain constraints hold, such as variables staying within their boundary values. The problem is that such checks introduce both code and performance overhead. A particular form is robust data structures, which have built-in redundancy (e.g., a checksum). Every time these data structures are modified, the redundancy checks are performed to detect any inconsistencies. Some programming languages also support an assertion mechanism.
Diagnostic checks -- These are typically background audits that determine whether a component is functioning correctly. In many cases, the diagnostic consists of driving a component with a known input for which the correct output is also known.
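Two of the checks above, replication checks and timing checks, can be sketched in a few lines of Python. This is only an illustration under some assumptions: the replicas are plain functions with hashable outputs, and the helper names majority_vote, tmr and timed_call are made up for this post. The timing check only detects a missed deadline; it does not interrupt the slow service.

```python
import threading
from collections import Counter


def majority_vote(outputs):
    """Return (value, disagreed) for the strict-majority output.

    `disagreed` is True when at least one replica produced a different
    output, which signals an error in one or more replicas.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    return value, count != len(outputs)


def tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def timed_call(service, deadline_s):
    """Run `service` and report whether it overran `deadline_s` seconds.

    A timer is started before the call and cancelled when the call
    returns. If the timer fired first, a timing error is reported.
    """
    timed_out = threading.Event()
    timer = threading.Timer(deadline_s, timed_out.set)
    timer.start()
    try:
        result = service()
    finally:
        timer.cancel()
    return result, timed_out.is_set()
```

For example, tmr([good, good, faulty], 3) would still pass on the answer of the two agreeing replicas while flagging the disagreement, and timed_call(service, 0.5) returns the result together with a flag saying whether the half-second deadline was missed.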

High Availability Clusters

High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of down-time. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it.
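The failover process described above can be modelled roughly like this. It is a toy, in-memory sketch and not how any real clustering software works: Node, Cluster, start and handle_failure are hypothetical names, and "restarting the application" is reduced to recording which node it runs on.

```python
class Node:
    """A cluster member that can host applications while it is healthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.running = set()

    def start(self, app):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.running.add(app)


class Cluster:
    """Keeps track of which node each application is placed on."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.placement = {}  # application name -> Node

    def start(self, app):
        """Start `app` on the first healthy node and record the placement."""
        for node in self.nodes:
            if node.healthy:
                node.start(app)
                self.placement[app] = node
                return node
        raise RuntimeError("no healthy node available")

    def handle_failure(self, failed):
        """Fail over: restart every application of `failed` elsewhere."""
        failed.healthy = False
        moved = {}
        for app, node in list(self.placement.items()):
            if node is failed:
                moved[app] = self.start(app).name
        failed.running.clear()
        return moved
```

With two nodes "a" and "b", an application started on "a" ends up on "b" after handle_failure is called for "a", without any administrator stepping in.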

Some features of HA clusters:

HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.
HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure. This includes multiple network connections and data storage that is redundantly connected via storage area networks.
HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster.
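The heartbeat idea from the last point can be sketched as follows. This is a simplified model that assumes each node reports in periodically and that a node counts as failed once it has been silent for longer than a timeout. HeartbeatMonitor, beat and failed_nodes are hypothetical names, and the clock is injectable so the logic can be tried without waiting in real time.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat of each node and flags silent ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}  # node name -> time of last heartbeat

    def beat(self, node):
        """Record a heartbeat received from `node`."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

In a real cluster the heartbeats would travel over the private network link, and the list returned by failed_nodes is what would trigger a failover. Choosing the timeout has the same trade-off as the timing checks earlier: too tight gives false alarms, too loose delays detection.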

Some known High Availability Clusters

References:

Fault tolerance techniques for distributed systems

High-availability cluster