Details
Type: Bug
Status: Resolved
Priority: Medium
Resolution: Cannot Reproduce
None
Description
A failure is showing up in the netvirt 1node CSIT job: the output of the Karaf
CLI command "showSvcStatus" reports the DATASTORE in the ERROR state:
Timestamp: Tue Oct 09 21:02:42 UTC 2018
Node IP Address: 10.30.170.157
System is operational: false
System ready state: ACTIVE
OPENFLOW : OPERATIONAL
IFM : OPERATIONAL
ITM : OPERATIONAL
ELAN : OPERATIONAL
OVSDB : OPERATIONAL
DATASTORE : ERROR java.lang.reflect.UndeclaredThrowableException
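For anyone trying to catch this in CSIT automatically, here is a minimal Python sketch of a check over the "showSvcStatus" output. It assumes the CLI prints one "NAME : STATE" pair per line, as the console does; the function name failed_services and the SAMPLE constant are made up for illustration and are not part of any ODL tooling.

```python
# Hypothetical helper: flag every service that showSvcStatus does not
# report as OPERATIONAL. Assumes one "NAME : STATE" pair per line.

SAMPLE = """\
Timestamp: Tue Oct 09 21:02:42 UTC 2018
Node IP Address: 10.30.170.157
System is operational: false
System ready state: ACTIVE
OPENFLOW : OPERATIONAL
IFM : OPERATIONAL
ITM : OPERATIONAL
ELAN : OPERATIONAL
OVSDB : OPERATIONAL
DATASTORE : ERROR java.lang.reflect.UndeclaredThrowableException
"""


def failed_services(status_text):
    """Return {service: state} for each service line that is not OPERATIONAL."""
    failed = {}
    for line in status_text.splitlines():
        # Service lines use " : " (spaces on both sides); header lines such as
        # "Timestamp: ..." use ": " and are skipped by this partition.
        name, sep, state = line.partition(" : ")
        if not sep:
            continue
        name, state = name.strip(), state.strip()
        if name.isupper() and state != "OPERATIONAL":
            failed[name] = state
    return failed


if __name__ == "__main__":
    for service, state in failed_services(SAMPLE).items():
        print(f"{service}: {state}")
```

Run against the output pasted above, this should flag only DATASTORE.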
Judging by karaf.log, the cause appears to be an Akka persistence
"Circuit Breaker Timed out" failure, after which some cluster/akka logic
stops the shard and shuts down the datastore:
2018-10-09T20:58:22,469 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-39 | Shard | 228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [78318] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2018-10-09T20:58:22,515 | INFO | opendaylight-cluster-data-shard-dispatcher-215 | Shard | 228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | Stopping Shard member-1-shard-default-config
2018-10-09T20:58:22,517 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-70 | LocalThreePhaseCommitCohort | 235 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4 | Failed to prepare transaction member-1-datastore-config-fe-0-txn-65215-0 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.
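To see how often this happens across job runs, a small Python sketch like the following could pull the persist failures out of a karaf.log. The regular expression is written against the one ERROR line quoted above, so the message format is an assumption based on that single sample; persist_failures and SAMPLE_LINE are illustrative names only.

```python
import re

# Pattern taken from the single "Failed to persist event type" line quoted
# in this issue; other ODL versions may word the message differently.
PERSIST_FAILURE = re.compile(
    r"Failed to persist event type \[(?P<event>[^\]]+)\] "
    r"with sequence number \[(?P<seq>\d+)\] "
    r"for persistenceId \[(?P<shard>[^\]]+)\]"
)

SAMPLE_LINE = (
    "2018-10-09T20:58:22,469 | ERROR | "
    "opendaylight-cluster-data-akka.actor.default-dispatcher-39 | Shard | "
    "228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | "
    "Failed to persist event type "
    "[org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] "
    "with sequence number [78318] for persistenceId "
    "[member-1-shard-default-config]."
)


def persist_failures(log_lines):
    """Return (timestamp, shard, sequence_number) for each persist failure."""
    hits = []
    for line in log_lines:
        match = PERSIST_FAILURE.search(line)
        if match:
            # karaf.log fields are separated by " | "; the first is the timestamp.
            timestamp = line.split(" | ", 1)[0].strip()
            hits.append((timestamp, match.group("shard"), int(match.group("seq"))))
    return hits


if __name__ == "__main__":
    for timestamp, shard, seq in persist_failures([SAMPLE_LINE]):
        print(timestamp, shard, seq)
```

In practice the input would be the lines of the job's karaf.log rather than the single sample line.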
This is not necessarily a heavy job, so I do not suspect that it is unable
to keep up with writing to disk, which I think is one reason this might happen.