Bug
Resolution: Unresolved
Medium
Under heavy transaction load in a cluster setup, an Akka persistence circuit-breaker failure is sometimes observed. When this happens, the shard is stopped and never recovers. This bug requests that the shard recover automatically when it gets into this condition.
The problem can be reproduced fairly easily:
1. On node-1, install the controller in an NFS-mounted directory.
2. Start all nodes and make sure node-1 is the leader of the target shard.
3. Issue write transactions at a high rate against this shard.
4. Stop and restart the NFS service on the server that exports the directory. This makes the directory unwritable for some time.
5. During this period, Akka persistence of journal records fails. When this failure happens, the shard is stopped and is never started again.
Increasing the call-timeout value of the Akka journal persistence circuit breaker (default is 10s) helps make the shard more tolerant of such an outage.
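For reference, a minimal sketch of what raising that timeout might look like in the controller's akka.conf. It assumes the LevelDB journal plugin and the odl-cluster-data wrapper section; both are assumptions about the deployment and may differ in yours. The 30s value is only illustrative and was not tested as part of this report. max-failures and reset-timeout are shown at their Akka defaults for context.

```
odl-cluster-data {
  akka {
    persistence {
      journal {
        # Assumption: the LevelDB journal plugin is in use
        leveldb {
          circuit-breaker {
            # Akka default is 10s; 30s is an illustrative value only
            call-timeout = 30s
            # Akka defaults, shown for context
            max-failures = 10
            reset-timeout = 30s
          }
        }
      }
    }
  }
}
```

This only widens the window the journal has to complete a write before the breaker counts a failure. It does not make a stopped shard restart.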
A few others have seen this issue. See https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013777.html