[CONTROLLER-1866] circuit breaker timed out; datastore shutdown Created: 17/Oct/18  Updated: 20/Oct/18  Resolved: 20/Oct/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: Neon, Fluorine SR1, Oxygen SR4

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Cannot Reproduce Votes: 0
Labels: csit, csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

There is a failure showing up in netvirt 1node CSIT where the output of the karaf
CLI command "showSvcStatus" shows the DATASTORE in ERROR state:

Timestamp: Tue Oct 09 21:02:42 UTC 2018
Node IP Address: 10.30.170.157
System is operational: false
System ready state: ACTIVE
  OPENFLOW : OPERATIONAL
  IFM : OPERATIONAL
  ITM : OPERATIONAL
  ELAN : OPERATIONAL
  OVSDB : OPERATIONAL
  DATASTORE : ERROR java.lang.reflect.UndeclaredThrowableException

Looking at the karaf.log, it seems the reason for this is that we hit a
circuit-breaker timeout, after which the cluster/akka logic shuts down
the datastore.

2018-10-09T20:58:22,469 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-39 | Shard                            | 228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [78318] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2018-10-09T20:58:22,515 | INFO  | opendaylight-cluster-data-shard-dispatcher-215 | Shard                            | 228 - org.opendaylight.controller.sal-clustering-commons - 1.7.4 | Stopping Shard member-1-shard-default-config
2018-10-09T20:58:22,517 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-70 | LocalThreePhaseCommitCohort      | 235 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4 | Failed to prepare transaction member-1-datastore-config-fe-0-txn-65215-0 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.

This is not necessarily a heavy job, so I am not suspecting that it is unable
to keep up with writing to disk, which I understand is one reason this can happen.
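
For context on the failure mode, here is a minimal standalone sketch against
akka.pattern.CircuitBreaker (assuming the Akka 2.5-era classic Java API; the
"demo" system and class names are hypothetical, not controller code). A call
that does not complete within call-timeout is failed by the breaker with the
same "Circuit Breaker Timed out." TimeoutException the journal hit above:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import akka.actor.ActorSystem;
import akka.pattern.CircuitBreaker;
import scala.concurrent.duration.Duration;

public class BreakerTimeoutDemo {
  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("demo");

    CircuitBreaker breaker = new CircuitBreaker(
        system.dispatcher(),
        system.scheduler(),
        10,                              // max-failures before the breaker opens
        Duration.create(5, "seconds"),   // call-timeout, the limit the persist call blew
        Duration.create(30, "seconds")); // reset-timeout before a half-open retry

    // Stand-in for a journal write stuck behind a busy/slow disk.
    CompletionStage<String> persist = breaker.callWithCircuitBreakerCS(() ->
        CompletableFuture.supplyAsync(() -> {
          try {
            Thread.sleep(6_000);         // deliberately longer than call-timeout
          } catch (InterruptedException ignored) {
          }
          return "persisted";
        }));

    try {
      persist.toCompletableFuture().join();
    } catch (Exception e) {
      // The breaker fails the future with an anonymous TimeoutException, hence
      // "akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out."
      System.out.println(e.getCause().getMessage());
    } finally {
      system.terminate();
    }
  }
}

Note the breaker does not interrupt the slow task; it just stops waiting after
call-timeout and records a failure, and after max-failures it opens and fails
subsequent calls immediately.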
 



 Comments   
Comment by Tom Pantelis [ 17/Oct/18 ]

This failure usually means the disk is busy/overloaded or slow. The CSIT env uses shared disks, I assume, so perhaps there were other jobs or other processes running that were using the disk heavily. It could also be due in part to a really high rate of transactions.

The sequence number was 78318, which seems pretty high, especially for the default config shard. You say it's not a heavy job, but that seems a bit heavy, although I don't know how long the job runs for or what it does. If this failure has suddenly started happening repeatedly, then I would suggest looking into the app code or the test - perhaps some recent change is causing an unusually high transaction rate?

Either way, the only potential remedy to alleviate the failure from the CDS or akka or LevelDB side is to increase the timeout (I think the default timeout is 5 sec) in controller/configuration/initial/akka.conf:

akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        # allow journal writes up to 5 minutes before the breaker trips
        call-timeout = 300s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        # same idea for snapshot writes
        max-failures = 10
        call-timeout = 300s
      }
    }
  }
}
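
Since akka.conf is plain HOCON, the override can be sanity-checked with the Typesafe Config library (the same library Akka loads its configuration with). A hypothetical standalone check, parsing the journal part of the snippet above:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class AkkaConfCheck {
  public static void main(String[] args) {
    // The journal fallback block from the snippet above, as one HOCON string.
    Config conf = ConfigFactory.parseString(String.join("\n",
        "akka {",
        "  persistence {",
        "    journal-plugin-fallback {",
        "      circuit-breaker {",
        "        call-timeout = 300s",
        "      }",
        "    }",
        "  }",
        "}"));

    // Prints PT5M (300 seconds) once the override is in place.
    System.out.println(conf.getDuration(
        "akka.persistence.journal-plugin-fallback.circuit-breaker.call-timeout"));
  }
}

The new timeouts only take effect after the controller is restarted, since akka.conf is read at startup.
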
Comment by Jamo Luhrsen [ 20/Oct/18 ]

OK, I searched through ~300 job results and this was the only instance of this failure. Let's go with the theory that it
was caused by some external processes in this shared environment consuming/overloading the disk. I'll close it out.

Comment by Jamo Luhrsen [ 20/Oct/18 ]

Assumed to be caused by an overloaded environment.
