|
Our current Raft implementation, which follows the spec, relies on simple heartbeats (AppendEntries) to detect whether the Leader is still alive. If a Follower misses 'n' heartbeats (configurable), it considers the Leader to be down and becomes a Candidate. Often, however, heartbeats are missed simply because a major GC on the Leader pauses it and stops it from sending AppendEntries.
Akka Cluster, which uses the Phi Accrual failure detector algorithm, is smarter and can take GC delays into consideration when reporting a cluster member as Reachable/Unreachable. It makes sense to use feedback from Akka Cluster to determine whether a Follower should switch to Candidate.
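To illustrate why an accrual detector tolerates pauses better than a fixed missed-heartbeat count, here is a minimal sketch of the phi calculation. It assumes heartbeat inter-arrival times are roughly normally distributed; the function name `phi`, the `min_std_ms` floor and the sample values are illustrative assumptions, and Akka's real implementation differs in its details.

```python
import math
from statistics import mean, pstdev


def phi(elapsed_ms, intervals_ms, min_std_ms=100.0):
    """Suspicion level for a peer, given the time since its last heartbeat.

    intervals_ms is a history of observed heartbeat inter-arrival times.
    min_std_ms is a hypothetical floor on the standard deviation so that a
    very regular history does not make the detector overly sensitive.
    """
    mu = mean(intervals_ms)
    sigma = max(pstdev(intervals_ms), min_std_ms)
    # Probability that the next heartbeat arrives later than elapsed_ms,
    # under a normal model of inter-arrival times.
    p_later = 0.5 * math.erfc((elapsed_ms - mu) / (sigma * math.sqrt(2)))
    p_later = max(p_later, 1e-300)  # guard against log10(0)
    return -math.log10(p_later)


history = [1000.0] * 10           # heartbeats observed about once a second
on_time = phi(1000.0, history)    # low suspicion
overdue = phi(2000.0, history)    # much higher suspicion
```

A member would be reported as Unreachable only once phi crosses a configured threshold, so a history with more jitter (for example from earlier GC pauses) widens the tolerated delay.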
The basic idea is that when an election timeout occurs on the Follower, it should check whether the Leader is Reachable according to Akka Cluster. If it is, the Follower should not become a Candidate and should instead reschedule the election timeout. The new election timeout, however, should be 1/2 (or some other fraction) of the previous election timeout. This ensures that if Akka reports the Leader as Unreachable just after an election timeout, we can still trigger a new election quickly.
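The proposed behaviour can be sketched roughly as follows. This is only an illustration of the idea, not the actual implementation: the `Follower` class, the `is_leader_reachable` hook (standing in for a query against Akka Cluster) and the field names are all hypothetical. The sketch also assumes that the shortened timeout is a fraction of the configured timeout rather than compounding, and that a heartbeat restores the full timeout.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Follower:
    base_timeout_ms: float
    # Hypothetical hook: returns True if the cluster layer still
    # considers the Leader Reachable.
    is_leader_reachable: Callable[[], bool]
    timeout_fraction: float = 0.5   # "1/2 (or some other fraction)"
    state: str = "Follower"
    current_timeout_ms: float = 0.0

    def __post_init__(self) -> None:
        self.current_timeout_ms = self.base_timeout_ms

    def on_heartbeat(self) -> None:
        # Assumption: a received AppendEntries restores the full timeout.
        self.current_timeout_ms = self.base_timeout_ms

    def on_election_timeout(self) -> Optional[float]:
        """Return the next timeout to schedule, or None if an election starts."""
        if self.is_leader_reachable():
            # Leader is probably just paused (e.g. GC): stay a Follower,
            # but check again sooner than the normal timeout.
            self.current_timeout_ms = self.base_timeout_ms * self.timeout_fraction
            return self.current_timeout_ms
        self.state = "Candidate"
        return None


reachable = {"leader": True}
follower = Follower(1000.0, lambda: reachable["leader"])

first = follower.on_election_timeout()    # Leader reachable: reschedule
state_after_first = follower.state

reachable["leader"] = False
second = follower.on_election_timeout()   # Leader unreachable: start election
state_after_second = follower.state
```

With the shorter rescheduled timeout, a Leader that becomes Unreachable right after the first check is detected after half the usual wait instead of a full election timeout.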
|