[CONTROLLER-1866] circuit breaker timed out; datastore shutdown Created: 17/Oct/18 Updated: 20/Oct/18 Resolved: 20/Oct/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | Neon, Fluorine SR1, Oxygen SR4 |
| Type: | Bug | Priority: | Medium |
| Reporter: | Jamo Luhrsen | Assignee: | Jamo Luhrsen |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | csit, csit:failures | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
There is a failure showing up in netvirt 1node CSIT where the output of the karaf console shows the datastore in an error state:

Timestamp: Tue Oct 09 21:02:42 UTC 2018
Node IP Address: 10.30.170.157
System is operational: false
System ready state: ACTIVE
OPENFLOW : OPERATIONAL
IFM : OPERATIONAL
ITM : OPERATIONAL
ELAN : OPERATIONAL
OVSDB : OPERATIONAL
DATASTORE : ERROR java.lang.reflect.UndeclaredThrowableException

Looking at the karaf.log, it seems the reason for this is that we hit:

2018-10-09T20:58:22,469 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-39 | Shard | 228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [78318] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2018-10-09T20:58:22,515 | INFO | opendaylight-cluster-data-shard-dispatcher-215 | Shard | 228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | Stopping Shard member-1-shard-default-config
2018-10-09T20:58:22,517 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-70 | LocalThreePhaseCommitCohort | 235 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4 | Failed to prepare transaction member-1-datastore-config-fe-0-txn-65215-0 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.

This is not necessarily a heavy job, so I am not suspecting that the job itself is overloading the system.
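For CSIT triage it can help to confirm quickly whether a given karaf.log contains this signature. Below is a minimal sketch, assuming a local karaf.log path and using the marker strings from the messages quoted above (neither the path nor the script is part of the original report):

#!/usr/bin/env python3
# Rough scan of a karaf.log for the circuit-breaker / shard-shutdown signature.
# The default path and the marker strings are assumptions based on this report.
import sys

MARKERS = [
    "Circuit Breaker Timed out",              # akka persistence circuit breaker tripped
    "Failed to persist event type",           # shard could not write its journal entry
    "Stopping Shard",                         # shard shut itself down afterwards
    "Transaction aborted due to shutdown",    # later transactions fail as a consequence
]

def scan(path):
    hits = {m: 0 for m in MARKERS}
    with open(path, errors="replace") as log:
        for line in log:
            for marker in MARKERS:
                if marker in line:
                    hits[marker] += 1
    return hits

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "karaf.log"
    for marker, count in scan(path).items():
        print(f"{count:6d}  {marker}")
|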
| Comments |
| Comment by Tom Pantelis [ 17/Oct/18 ] |
|
This failure usually means the disk is busy/overloaded or slow. The CSIT env uses shared disks, I assume, so perhaps there were other jobs running that were using the disk heavily. It could also be due in part to a really high rate of transactions: the sequence number was 78318, which seems pretty high, especially for the default config shard. You say it's not a heavy job, but that seems a bit heavy, although I don't know how long the job runs for or what it does. If this failure is suddenly happening repeatedly, then I would suggest looking into the app code or the test - perhaps some recent change is causing an unusually high transaction rate?

Either way, the only potential remedy to alleviate the failure from the CDS/akka/levelDB side is to increase the circuit breaker timeout (I think the default is 5 sec). In controller/configuration/initial/akka.conf:

akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        call-timeout = 300s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 300s
      }
    }
  }
}
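For a rough sense of scale, that sequence number implies roughly the following sustained journal write rate; the job duration used here is an assumption, not a figure from this report:

# Back-of-the-envelope persist-rate estimate for member-1-shard-default-config.
# seq_number is from the error above; job_duration_s is an assumed placeholder,
# substitute the actual CSIT job runtime.
seq_number = 78318
job_duration_s = 2 * 60 * 60   # assumption: a ~2 hour job

rate = seq_number / job_duration_s
print(f"~{rate:.1f} journal writes/sec")   # ~10.9/sec if the run really was ~2 h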
|
| Comment by Jamo Luhrsen [ 20/Oct/18 ] |
|
ok, I searched through ~300 job results and this was the only instance of this failure. Let's go with the theory that this was a one-off caused by an overloaded/slow disk in the CSIT environment. |
| Comment by Jamo Luhrsen [ 20/Oct/18 ] |
|
Closing as Cannot Reproduce; assuming this was caused by an overloaded environment. |