[CONTROLLER-1789] Recover from shard stopped condition on akka persistence circuit-breaker failure Created: 14/Nov/17  Updated: 18/May/21

Status: Confirmed
Project: controller
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Ajay Lele Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Under heavy transaction load in a cluster setup, an Akka persistence circuit-breaker failure is sometimes observed. When this happens, the shard is stopped and never recovers. This bug requests that the shard be recovered when it gets into this condition.

The problem can be reproduced fairly easily:
1. On node-1, install the controller in an NFS-mounted directory
2. Start all nodes and make sure node-1 is the leader of the target shard
3. Issue write transactions at a high rate against this shard
4. Stop and then start the NFS service on the server from which the directory is mounted. This makes the directory unwritable for some time
5. During this period, Akka persistence of journal records fails. When this failure happens, the shard is stopped and is never started again

Increasing the call-timeout value of the Akka persistence journal circuit-breaker (default is 10s) does make the shard more tolerant of such an outage.
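To judge how large a call-timeout is needed, it can help to measure how long synchronous writes actually take on the directory that holds the journal. The following is a minimal sketch, not part of the controller: the function names, the probe approach and the directory used in the demo are assumptions made for illustration. Note that on a hard-mounted NFS directory the probe itself may block for the duration of the outage.

```python
import os
import tempfile
import time


def measure_write_latency(directory, payload_size=4096, samples=5):
    """Time `samples` synchronous writes of `payload_size` bytes in `directory`.

    Returns a list of per-write durations in seconds.
    """
    payload = b"\0" * payload_size
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        # Create a scratch file in the directory under test and force it to
        # disk, so that a stalled mount shows up as a long (or failed) write.
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)
        finally:
            os.close(fd)
            os.remove(path)
        latencies.append(time.monotonic() - start)
    return latencies


def exceeds_call_timeout(latencies, call_timeout_s=10.0):
    """Return True if any observed write took longer than the call-timeout."""
    return any(latency > call_timeout_s for latency in latencies)


if __name__ == "__main__":
    # Hypothetical location: point this at the directory holding the journal.
    journal_dir = tempfile.gettempdir()
    observed = measure_write_latency(journal_dir)
    print("slowest write: %.3fs" % max(observed))
    print("exceeds 10s call-timeout:", exceeds_call_timeout(observed))
```

Running the probe while step 4 is in progress gives a rough lower bound for the call-timeout needed to ride out the outage.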

A few others have seen this issue. Ref. https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013777.html



 Comments   
Comment by Ajay Lele [ 13/Aug/18 ]

The basic cause of circuit-breaker issues is that a disk I/O operation takes longer than the circuit-breaker timeouts used by the Akka persistence plugins. This could be because the disk is slow or saturated, the amount of data in the data store is very large (on the order of hundreds of MB), transactions are arriving at a very high rate, or a combination of these factors. Suggested steps for troubleshooting this problem:

  1. Monitor disk performance using tools like sar [0], then identify and fix bottlenecks
  2. Tune the Akka persistence plugin circuit-breaker settings. In controller/configuration/initial/akka.conf, override the defaults, especially the call-timeout. Refer to [1] and [2] for more details. Example:
akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 60s
        reset-timeout = 30s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 120s
        reset-timeout = 60s
      }
    }
  }
}
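
How the three settings above interact can be illustrated with a small simulation. This is a simplified model of the general closed / open / half-open circuit-breaker pattern described in [2], not Akka's implementation. The class and method names are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
class CircuitBreakerModel:
    """Toy model of a closed / open / half-open circuit breaker."""

    def __init__(self, max_failures, call_timeout, reset_timeout):
        self.max_failures = max_failures
        self.call_timeout = call_timeout      # seconds
        self.reset_timeout = reset_timeout    # seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def record_call(self, now, duration):
        """Feed one call that starts at `now` and would take `duration` seconds.

        Returns "ok", "timeout" or "rejected".
        """
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout:
                return "rejected"             # fail fast while the breaker is open
            self.state = "half-open"          # reset-timeout elapsed: allow one trial call
        if duration > self.call_timeout:
            failed_at = now + self.call_timeout
            if self.state == "half-open":
                self._trip(failed_at)         # trial call failed: open again
            else:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self._trip(failed_at)
            return "timeout"
        # A successful call clears the failure count and closes the breaker.
        self.failures = 0
        self.state = "closed"
        return "ok"

    def _trip(self, at):
        self.state = "open"
        self.opened_at = at
        self.failures = 0


if __name__ == "__main__":
    # Ten consecutive 45-second writes, one starting every 15 seconds.
    for call_timeout in (10.0, 60.0):
        breaker = CircuitBreakerModel(max_failures=10,
                                      call_timeout=call_timeout,
                                      reset_timeout=30.0)
        for i in range(10):
            breaker.record_call(now=i * 15.0, duration=45.0)
        print("call-timeout=%ss -> breaker is %s" % (call_timeout, breaker.state))
```

In this model, writes that stall for 45 seconds trip a breaker with a 10s call-timeout once max-failures consecutive calls have timed out, while the same writes pass under a 60s call-timeout. This is why raising call-timeout is the setting that matters most for slow disks.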

[0] https://linux.die.net/man/1/sar

[1] https://github.com/akka/akka/blob/master/akka-persistence/src/main/resources/reference.conf

[2] https://doc.akka.io/docs/akka/2.5/common/circuitbreaker.html

Comment by Robert Varga [ 14/Nov/18 ]

ajayslele any progress on this? Are you still working on it?

Comment by Ajay Lele [ 14/Nov/18 ]

rovarga yes, I will be proposing a patch soon

Comment by Ajay Lele [ 27/Jun/20 ]

The PR which I had opened for this [0] was in pretty good shape, but could not get through the reviews. Whoever wants to work on this can pick up from where it is right now.

[0] https://git.opendaylight.org/gerrit/c/controller/+/79328
