A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.
Fault Tolerance
A system is said to “fail” when it cannot meet its expected target. What distinguishes a distributed system from a stand-alone system is the notion of partial failure: some components can fail while the rest keep running. There are different types of failures in a distributed system. Fault tolerance means dealing successfully with partial failure. If the system is fault tolerant, it can provide its services even in the presence of faults.
Failure Detection Phase
There are two failure detection mechanisms that can be used to identify failures.
A messaging system is one failure detection mechanism. Here, each process sends a message ("Are you alive?") to every other process and expects a reply (acknowledgement) within a finite amount of time. A process that does not reply within that time is considered to have failed.
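The probe-and-timeout idea can be sketched in a few lines of Python. This is only an illustration: the class name `FailureDetector`, the `timeout` parameter and the optional `now` arguments (used to make the example deterministic) are assumptions, not part of any particular system.

```python
import time


class FailureDetector:
    """Tracks when each peer last answered an "Are you alive?" probe."""

    def __init__(self, peers, timeout, now=None):
        # timeout: seconds a peer may stay silent before it is suspected
        # (an assumed tuning value, chosen by the operator).
        self.timeout = timeout
        start = time.monotonic() if now is None else now
        self.last_reply = {peer: start for peer in peers}

    def record_reply(self, peer, now=None):
        """Call this whenever an acknowledgement arrives from `peer`."""
        self.last_reply[peer] = time.monotonic() if now is None else now

    def suspected(self, now=None):
        """Return the peers that have been silent for longer than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            peer
            for peer, seen in self.last_reply.items()
            if now - seen > self.timeout
        )
```

Note that a timeout can only suspect a failure: a process that is slow, or whose reply was lost on the network, looks the same as a crashed one.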
The other mechanism is to identify failures through redundant information in transmissions. One approach is to number the frames: if a frame is lost during transmission, the receiver can detect this by checking the received frame numbers. Another approach is to frame the packets so that bit errors can be detected (for example, with a parity bit).
Failure Recovery Phase
For a system to be fault tolerant, it must deal successfully with partial failure. The key technique for this is redundancy. There are three possible ways to provide redundancy:
- Time redundancy
- Information redundancy
- Physical redundancy
Time redundancy
Here, transactions are used. An action is performed and, if necessary, performed again. If a transaction is aborted, it can be redone with no harm. Time redundancy is especially helpful when the faults are transient or intermittent.
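As a rough sketch of time redundancy, the Python helper below repeats an action until it succeeds or a retry budget runs out. The names `retry`, `TransientError` and `make_flaky` are hypothetical, and the sketch assumes the action is safe to repeat, as an aborted transaction is.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a transient fault."""


def retry(action, attempts=3, delay=0.0):
    """Run action(); on TransientError, run it again, up to `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except TransientError as err:
            last_error = err
            time.sleep(delay)  # optional pause before the next try
    raise last_error


def make_flaky(failures):
    """Build a demo action that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def action():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("temporary fault")
        return "committed"

    return action
```

With `retry(make_flaky(2), attempts=3)` the first two tries fail and the third returns the result. A permanent fault is not helped by retrying, which is why this approach suits transient and intermittent faults.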
Information redundancy
Add extra bits to allow for error detection and recovery.
Example: Hamming codes
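To make the Hamming code example concrete, here is a small sketch of the classic Hamming(7,4) code in Python: 4 data bits are protected by 3 parity bits, which is enough to locate and correct any single flipped bit. The function names are chosen for this illustration only.

```python
def hamming_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(code):
    """Correct at most one flipped bit; return (data bits, error position).

    The error position is 1-based; 0 means no error was detected.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # the syndrome names the bad position
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]], error_pos
```

For instance, encoding `[1, 0, 1, 1]` and then flipping any one of the seven bits still decodes back to `[1, 0, 1, 1]`. Two or more flipped bits are beyond what this code can correct.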
Physical redundancy
Add extra (duplicate) software and/or hardware to the system.
Examples:
- Replicating processes
- Adding extra processes
- Backup servers
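The backup server case can be sketched as a simple failover loop. In this illustration each replica is modelled as a plain Python callable, and `call_with_failover`, `crashed_server` and `backup_server` are hypothetical names; a real system would make network calls instead.

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as err:
            errors.append(err)  # remember the failure, move to the next copy
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def crashed_server(request):
    """Demo replica that is always unreachable."""
    raise ConnectionError("primary is down")


def backup_server(request):
    """Demo replica that answers normally."""
    return f"backup handled {request}"
```

Calling `call_with_failover([crashed_server, backup_server], "read x")` skips the crashed copy and is answered by the backup, so the client never sees the fault.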
Process Resilience
Process resilience can be achieved by organizing processes into groups. Process groups may be dynamic: new groups can be created and old groups can be destroyed. A process can be a member of several groups at the same time, and it can join or leave a group while the system is running. When a message is sent to the group, all group members receive a copy of it, but only one of them performs the required service. If one process in a group fails, some other process can take over for it.
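A toy model of such a group might look like the sketch below. The classes `Process` and `ProcessGroup` are hypothetical, and the rule that the first live member in join order performs the service is an assumption made only to keep the example short.

```python
class Process:
    """A group member with a mailbox and a liveness flag."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []


class ProcessGroup:
    """A dynamic group: members may join or leave while it is in use."""

    def __init__(self):
        self.members = []

    def join(self, process):
        if process not in self.members:
            self.members.append(process)

    def leave(self, process):
        self.members.remove(process)

    def send(self, message):
        """Deliver a copy to every live member; return who serves the request."""
        live = [p for p in self.members if p.alive]
        if not live:
            raise RuntimeError("no live member can serve the request")
        for p in live:
            p.inbox.append(message)
        # Assumed policy: the first live member performs the service.
        return live[0].name
```

If the member that normally serves requests fails, the next call to `send` is served by another member without the sender having to change anything.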
There are two main types of groups, distinguished by their internal structure: hierarchical groups and flat groups.
Hierarchical groups
There is a coordinator, and it makes decisions without consulting the other members. When a request comes in, the coordinator decides which worker is best suited to carry it out. If the coordinator fails, the entire group comes to a halt, so the coordinator is a single point of failure in a hierarchical group. To recover, another process must be elected as the coordinator, using an election algorithm.
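The post does not name a specific election algorithm, so the sketch below simply assumes the rule used by bully-style elections: the live process with the highest id becomes the coordinator. The "least loaded worker" rule in `pick_worker` is likewise an assumption about what "best suited" could mean.

```python
def elect_coordinator(alive_by_id):
    """Return the highest id among live processes (assumed election rule)."""
    live = [pid for pid, alive in alive_by_id.items() if alive]
    if not live:
        raise RuntimeError("no live process to elect")
    return max(live)


def pick_worker(load_by_worker):
    """Coordinator's decision: hand the request to the least loaded worker."""
    return min(load_by_worker, key=load_by_worker.get)
```

For example, if processes 1 and 2 are alive and the old coordinator 3 has crashed, `elect_coordinator({1: True, 2: True, 3: False})` picks process 2.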
Failure Masking and Replication
We can protect a single vulnerable process by organizing a fault-tolerant group of processes in its place. There are two ways to achieve replication.
- Primary-based protocols
The group of processes is organized in a hierarchical manner. There is a fixed primary, and it coordinates all the write operations. When the primary crashes, the backups hold an election and select a new primary.
- Replicated-write protocols
Replicated-write protocols take the form of active replication or of quorum-based protocols. They organize a collection of identical processes into a flat group. Such groups have no single point of failure.
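As an illustration of the quorum-based variant, the sketch below checks the usual quorum conditions for N replicas: the read and write quorums must overlap (N_R + N_W > N) and two writes must overlap (N_W > N/2). The function names and the `(version, value)` reply format are assumptions made for this example.

```python
def valid_quorum(n, read_quorum, write_quorum):
    """Check that the quorum sizes prevent read-write and write-write conflicts."""
    reads_see_latest_write = read_quorum + write_quorum > n
    writes_overlap = 2 * write_quorum > n
    return reads_see_latest_write and writes_overlap


def quorum_read(replies, read_quorum):
    """Return the value with the highest version among the replies.

    `replies` is a list of (version, value) pairs from replicas that answered.
    """
    if len(replies) < read_quorum:
        raise RuntimeError("read quorum not reached")
    return max(replies, key=lambda reply: reply[0])[1]
```

With 12 replicas, a read quorum of 3 and a write quorum of 10 is valid, while a read quorum of 7 with a write quorum of 6 is not, because two writes could then each reach 6 replicas without sharing one.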