Uploaded image for project: 'controller'
  1. controller
  2. CONTROLLER-1737

Current transaction in state READY leads to RequestTimeoutException (120s) after make-leader-local

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Cannot Reproduce
    • 4.0.0
    • None
    • clustering
    • Operating System: All
      Platform: All

    • 8792

      The Robot symptom is basically the same as in CONTROLLER-1707.
      There is test which started failing [1] quite reliably, the difference from original CONTROLLER-1707 description is that there is a transaction writer on each member and there is a data tree change listener on the old leader member. Once again, the only failing writer is on the old leader member:

      {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Submit failed","error-info":"TransactionCommitFailedException

      {message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds]]}

      \n\tat org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat
      ...

      Contrary to CONTROLLER-1707, "No transactions enqueued while attempting to start canCommit" is not seen on the new leader.
      Instead, the following is seen in karaf.log [2] of the writer (member-1, old leader and listener). Not sure which log message is the critical one (if any)

      Very many "rejecting request" messages right after "Received LeaderStateChanged":

      2017-07-02 15:57:40,058 | INFO | rd-dispatcher-32 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-1-shard-default-config: not currently leader, rejecting request Envelope{sessionId=0, txSequence=3115, message=ModifyTransactionRequest{target=member-2-datastore-config-fe-0-chn-2-txn-4416-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.2:2550/user/$a#-603757263], modifications=1, protocol=READY}}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.

      The NotLeaderException has been received at 15:57:41,474. The new connection has been established at 15:57:46,839. The next reconnect (without a visible reason) has been initiated at 15:59:16,939 and finished at 15:59:16,945. Another one initiated at 15:59:46,958 and finished at 15:59:46,962. Finally request timeout happened:

      2017-07-02 15:59:51,878 | WARN | lt-dispatcher-43 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Tx: DOM-CHAIN-1-5382 Error during phase CAN_COMMIT, starting Abort
      org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds
      at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:372)[197:org.opendaylight.controller.cds-access-client:1.1.1.SNAPSHOT]
      ...

      In the whole segment, there were multiple (arguably harmless) snapshot persist events.

      On the new leader [3], there is only one suspicious message:

      2017-07-02 15:58:06,430 | WARN | rd-dispatcher-31 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-5382-0 has timed out after 15000 ms in state READY

      Update:
      While trying to replicate on Sandbox, two [4] tests [5] have failed. In the latter test, the failing writer was on the old leader (without listener listener), but in the former one, the failing writer was actually on the member which stayed follower (also without listener).
      In both cases, the leader was member-2. Karaf log [6] is huge, but I have confirmed in both tests there was a "Current transaction ... has timed out after 15000 ms in state READY" message during the time the test was waiting for writer response:

      2017-07-03 11:29:50,834 | WARN | rd-dispatcher-44 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-11757-0 has timed out after 15000 ms in state READY

      [1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/log.html.gz#s1-s36-t1-k2-k12
      [2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/odl1_karaf.log.gz
      [3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/odl2_karaf.log.gz
      [4] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/log.html.gz#s1-s2-t1-k2-k12
      [5] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/log.html.gz#s1-s2-t3-k2-k12
      [6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/odl2_karaf.log.gz

            oleksii.mozghovyi Oleksii Mozghovyi
            vrpolak Vratko Polak
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: