Friday, March 31, 2017

What are failure detection and failure masking in a distributed system?

Distributed System

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.


Fault Tolerance

A system is said to "fail" when it cannot deliver the service it is expected to provide. What distinguishes a distributed system from a stand-alone system is the notion of partial failure: some components fail while the rest keep running. Failures come in different types, such as crash, omission, timing and arbitrary (Byzantine) failures. Fault tolerance means dealing successfully with partial failure. If the system is fault tolerant, it can keep providing its services even in the presence of faults.

Failure Detection Phase

There are two failure detection mechanisms that can be used to identify failures.
Messaging is one failure detection mechanism. Each process sends a message ("are you alive?") to the other processes and expects a reply (acknowledgement) from each of them within a finite amount of time. A process that does not reply within that time is considered to have failed.
Failures can also be identified through redundant information in transmissions. One approach is to number the frames: if a frame is lost in transit, the receiver can detect the loss from the gap in the received frame numbers. Another approach is to frame the packets with extra bits that allow bit errors to be detected (for example, a parity bit).
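Both mechanisms can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: the names `HeartbeatDetector`, `missing_frames`, `add_even_parity` and `parity_ok` are made up for this post, replies are recorded by calling a method instead of arriving over a real network, and frames are assumed to be numbered from 0.

```python
import time


class HeartbeatDetector:
    """Suspects a process when its last "are you alive?" reply is too old."""

    def __init__(self, process_ids, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        now = clock()
        self.last_reply = {p: now for p in process_ids}

    def record_reply(self, process_id):
        # Called whenever an acknowledgement from process_id arrives.
        self.last_reply[process_id] = self.clock()

    def suspected(self):
        # Every process whose last reply is older than the timeout.
        now = self.clock()
        return sorted(p for p, t in self.last_reply.items()
                      if now - t > self.timeout_s)


def missing_frames(received_numbers, expected_count):
    """Frame numbers in 0..expected_count-1 that never arrived."""
    return sorted(set(range(expected_count)) - set(received_numbers))


def add_even_parity(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(frame):
    """True when the frame still has an even number of 1s."""
    return sum(frame) % 2 == 0
```

Keep in mind that a timeout can only suspect a process: a slow reply looks the same as a crash. A single parity bit detects any odd number of flipped bits but cannot tell which bit is wrong.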

 

Failure Recovery Phase

For a system to be fault tolerant, it must deal successfully with partial failure. The key technique for this is redundancy. There are three kinds of redundancy:
  • Time redundancy
  • Information redundancy
  • Physical redundancy

Time Redundancy
Time redundancy relies on transactions. An action is performed and, if necessary, performed again. If a transaction is aborted, it can be redone with no harm. Time redundancy is especially helpful when the faults are transient or intermittent.
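A retry loop is the simplest form of time redundancy. The sketch below is an assumption-laden illustration: `TransientError` and `retry` are hypothetical names, and the action is assumed to be safe to repeat (for example an aborted transaction that left no partial effects).

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a transient fault."""


def retry(action, attempts=3, delay_s=0.0):
    """Run action(); on a transient fault, wait and run it again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)  # give the fault time to clear
    # Every attempt failed, so the fault is probably not transient.
    raise last_error
```

Retrying helps only when the fault goes away by itself. For a permanent fault the loop just delays the error, which is why the number of attempts is bounded.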

Information Redundancy
Extra bits are added to the data to allow for error detection and recovery.
Ex – Hamming codes
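As an illustration, here is a small sketch of the classic Hamming(7,4) code, which adds 3 parity bits to 4 data bits and can correct any single flipped bit. The function names are made up for this post, and the bit layout (parity bits at positions 1, 2 and 4) is the usual textbook one.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three check results, read as a binary number, give the
    # 1-based position of the flipped bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position
```

For example, `hamming74_encode([1, 0, 1, 1])` gives `[0, 1, 1, 0, 0, 1, 1]`. If any one of those seven bits is flipped on the way, `hamming74_decode` still returns the original four data bits. Two flipped bits are beyond what this code can repair.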

Physical Redundancy
Extra (duplicate) software and/or hardware is added to the system.
Ex -
  • Replicating processes
  • Adding extra processes
  • Backup servers
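The backup server case can be sketched as a simple failover call. This is a minimal sketch with assumptions: each replica is modelled as a plain Python callable, a crashed replica is assumed to raise `ConnectionError`, and `call_with_failover` is a hypothetical helper name.

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful answer."""
    failures = 0
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            failures += 1  # this replica is down, move on to the next one
    raise ConnectionError(f"all {failures} replicas failed")
```

The caller lists the primary first and the backups after it. As long as one replica is still up, the client gets an answer and never sees the failure.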

Process Resilience 

 

Process resilience can be achieved by organizing processes into groups. Process groups may be dynamic: new groups can be created and old groups can be destroyed. A process can be a member of several groups at the same time, and it can join or leave a group while the system is running. When a message is sent to the group, all group members receive a copy of it, but only one of them performs the required service. If one process in a group fails, some other process can take over for it.
With respect to their internal structure, there are two main types of groups.

  Flat Groups
 
Flat groups are symmetrical and have no single point of failure. All processes in the group are equal, and there is no distinguished leader. Decision making is more complicated, because a vote has to be carried out to reach a decision. This costs more time and overhead.
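The voting step in a flat group can be sketched as a strict-majority count. This is only an illustration: `majority_decision` is a hypothetical name, and the votes are assumed to be already collected into a dictionary (collecting them over the network is where the extra time and overhead come from).

```python
from collections import Counter


def majority_decision(votes):
    """Return the value backed by a strict majority of members, else None.

    votes maps each group member to the value it voted for.
    """
    if not votes:
        return None
    value, count = Counter(votes.values()).most_common(1)[0]
    return value if count > len(votes) // 2 else None
```

With three members voting "commit", "commit" and "abort", the decision is "commit". With a 2 against 2 split there is no majority and the function returns None, so the group would have to vote again.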

Hierarchical Groups


There is a coordinator, and it makes decisions without consulting the others. When a request comes in, the coordinator decides which worker is best suited to carry it out. Failure of the coordinator brings the entire group to a halt, so a hierarchical group has a single point of failure. If the coordinator fails, another process has to be elected as coordinator. The new coordinator is chosen by an election algorithm.
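The post does not name a particular election algorithm, so the sketch below assumes one common rule, borrowed from the bully algorithm: the live process with the highest ID becomes the coordinator. The message exchange of a real election is left out, and `elect_coordinator` and `is_alive` are hypothetical names.

```python
def elect_coordinator(process_ids, is_alive):
    """Pick the live process with the highest ID as the new coordinator.

    is_alive is a callable that reports whether a process still responds.
    """
    alive = [p for p in process_ids if is_alive(p)]
    if not alive:
        raise RuntimeError("no live process left to elect")
    return max(alive)
```

If processes 1 to 5 form the group and the old coordinator 5 has crashed, the call `elect_coordinator([1, 2, 3, 4, 5], lambda p: p != 5)` selects process 4.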

Failure Masking and Replication

A single vulnerable process can be protected by replicating it and organizing the replicas into a fault-tolerant group, so that the failure of one member is masked by the others. There are two ways to achieve replication.
  • Primary based protocols

The group of processes is organized in a hierarchical manner. There is a fixed primary, and it coordinates all the write operations. When the primary crashes, the backups hold an election and select a new primary.

  • Replicated write protocols

Replicated-write protocols take the form of active replication or of quorum-based protocols. They organize a collection of identical processes into a flat group, which has no single point of failure.
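A quorum-based protocol can be sketched with version numbers. The code below is a simplified, single-threaded illustration with made-up names (`Replica`, `QuorumStore`). It assumes the usual quorum conditions: with N replicas, a read quorum R and a write quorum W must satisfy R + W > N and 2W > N, so that every read quorum overlaps every write quorum and any two write quorums overlap.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None
        self.up = True  # set to False to simulate a crash


class QuorumStore:
    def __init__(self, n, read_quorum, write_quorum):
        if read_quorum + write_quorum <= n or 2 * write_quorum <= n:
            raise ValueError("need R + W > N and 2W > N")
        self.replicas = [Replica() for _ in range(n)]
        self.r = read_quorum
        self.w = write_quorum

    def _quorum(self, needed):
        # Any `needed` live replicas will do; here we take the first ones.
        live = [rep for rep in self.replicas if rep.up]
        if len(live) < needed:
            raise RuntimeError("quorum unavailable")
        return live[:needed]

    def write(self, value):
        members = self._quorum(self.w)
        # Write quorums overlap, so the newest version is among the members.
        new_version = max(m.version for m in members) + 1
        for m in members:
            m.version, m.value = new_version, value

    def read(self):
        members = self._quorum(self.r)
        # The read quorum overlaps the last write quorum, so the
        # highest version seen is the latest written value.
        return max(members, key=lambda m: m.version).value
```

With N = 5 and R = W = 3, the store keeps working while any two replicas are down, and the failure of those two replicas is masked from the client. Once three replicas are down, no quorum can be formed and the operation fails.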
