[CONTROLLER-1789] Recover from shard stopped condition on akka persistence circuit-breaker failure Created: 14/Nov/17 Updated: 18/May/21 |
|
| Status: | Confirmed |
| Project: | controller |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Ajay Lele | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Under heavy transaction load in a cluster setup, an akka persistence circuit-breaker failure is sometimes observed. When this happens, the shard is stopped and never recovers from this condition. This bug is opened to request that the shard recover when it gets into this condition.
The problem can be forced to happen fairly easily.
Increasing the akka journal persistence circuit-breaker call-timeout value (default is 10s) does help make the shard more tolerant of an outage.
A few others have seen this issue. Ref. https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013777.html |
| Comments |
| Comment by Ajay Lele [ 13/Aug/18 ] |
|
The basic cause of the circuit-breaker issues is that disk I/O operations take longer than the (circuit-breaker) timeouts used by the akka persistence plugins. This can happen because the disk is slow or choked, the amount of data in the data-store is very large (on the order of hundreds of MBs), transactions are happening at a very high rate, or a combination of these factors. Suggested steps for troubleshooting this problem:
akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 60s
        reset-timeout = 30s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 120s
        reset-timeout = 60s
      }
    }
  }
}
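To make the roles of max-failures, call-timeout and reset-timeout in the block above easier to picture, here is a minimal Python sketch of the generic closed / open / half-open circuit-breaker state machine. It is an illustration only, not akka's implementation and not the controller's code: the class name `CircuitBreakerSketch`, its methods and the injectable `clock` parameter are all hypothetical. One simplification to be aware of: the sketch measures elapsed time after the operation returns, whereas a real breaker fails the call as soon as call-timeout expires.

```python
import time


class CircuitBreakerSketch:
    """Toy model of a circuit breaker (hypothetical, for illustration only)."""

    def __init__(self, max_failures, call_timeout, reset_timeout,
                 clock=time.monotonic):
        self.max_failures = max_failures    # failures tolerated before opening
        self.call_timeout = call_timeout    # seconds a call may take
        self.reset_timeout = reset_timeout  # seconds to stay open before a retry
        self.clock = clock                  # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # While open, calls fail fast without touching the disk.
                raise RuntimeError("circuit breaker is open, failing fast")
            # reset-timeout has elapsed: allow one trial call through.
            self.state = "half-open"

        started = self.clock()
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise

        # Simplification: a slow call is detected only after it returns.
        if self.clock() - started > self.call_timeout:
            self._record_failure()
            raise TimeoutError("call exceeded call_timeout")

        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the breaker.
        if self.state == "half-open" or self.failures >= self.max_failures:
            self.state = "open"
            self.opened_at = self.clock()

    def _record_success(self):
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.state = "closed"
```

In this model, a disk write that takes 15s against a 10s call-timeout counts as a failure even though it eventually completes, which is why raising call-timeout makes the breaker more tolerant of slow I/O. The sketch does not model the shard being stopped, which is the behaviour this issue asks to recover from.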
[0] https://linux.die.net/man/1/sar
[1] https://github.com/akka/akka/blob/master/akka-persistence/src/main/resources/reference.conf
[2] https://doc.akka.io/docs/akka/2.5/common/circuitbreaker.html |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
ajayslele any progress on this? Are you still working on it? |
| Comment by Ajay Lele [ 14/Nov/18 ] |
|
rovarga yes, I will be proposing a patch soon |
| Comment by Ajay Lele [ 27/Jun/20 ] |
|
The PR which I had opened for this [0] was in pretty good shape, but could not get through the reviews. Whoever wants to work on this can pick up from where it is right now. [0] https://git.opendaylight.org/gerrit/c/controller/+/79328 |