CONTROLLER-1697

entity-owners read too slow: Transport failure detector triggered.


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Component: clustering
    • Operating System: All
      Platform: All

    • 8547

      This is a Carbon-SR1 bug in the sense that it has been seen on a build based on a change [0] which has +2 but will not be merged into the Carbon release.

      This is similar to CONTROLLER-1672, but some symptoms are different.
      From the Robot point of view, this is still just a restconf read [1] not finishing in time (30 seconds). This time we are reading entity-owners (a module-based shard), as we are testing Singleton Service (tell-based protocol) by repeatedly isolating the member with the active instance.
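
      For context, the failing test step is effectively just the following kind of GET. This is a minimal sketch; the restconf URL form (Carbon-era draft-bierman02 style), the credentials, and the member address are assumptions for illustration, not values taken from the suite itself:

      # Minimal sketch of the entity-owners read the Robot suite performs.
      # URL form, credentials and member address are assumptions, not taken
      # from the suite itself.
      import requests

      MEMBER_1 = "member-1.example.net"  # placeholder; member-1's real address is not in this report
      URL = "http://%s:8181/restconf/operational/entity-owners:entity-owners" % MEMBER_1

      # The test step fails when this GET does not complete within 30 seconds.
      response = requests.get(URL, auth=("admin", "admin"), timeout=30)
      response.raise_for_status()
      print(response.json())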

      The test was run with debug log verbosity on some loggers. Member-1 karaf log is here [2], but it is from Sandbox, so it will be deleted over the weekend.

      The first iteration passed.
      At its start, the member-1 client connected to the member-2 backend:
      2017-05-24 14:13:34,374 | INFO | lt-dispatcher-20 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: replaced connection ConnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=1, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}}

      Member-1 was isolated in the first iteration. After rejoin, the test spent 1 minute making sure responses from the active singleton instance were not changing.
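
      The per-iteration flow is roughly the following sketch; the helper functions are hypothetical stand-ins for the Robot keywords (the isolation mechanism itself is not shown in this report):

      # Rough outline of one test iteration as described in this report.
      # isolate(), rejoin(), wait_for_new_leader() and get_active_instance()
      # are hypothetical stand-ins for Robot keywords, not the suite's code.
      import time

      def run_iteration(isolate, rejoin, wait_for_new_leader, get_active_instance):
          owner = get_active_instance()      # member currently hosting the singleton
          isolate(owner)                     # partition it away from the cluster
          wait_for_new_leader()              # remaining members elect a new shard leader
          new_owner = get_active_instance()  # the entity-owners read; this is what hangs
          rejoin(owner)                      # heal the partition
          # For one minute, keep checking that the reported active instance
          # does not change before starting the next iteration.
          deadline = time.time() + 60
          while time.time() < deadline:
              assert get_active_instance() == new_owner
              time.sleep(5)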

      The next read of entity-owners is at the start of the second iteration. The log shows the member-1 client detected a mismatch:
      2017-05-24 14:16:02,976 | INFO | ult-dispatcher-3 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-1-shard-entity-ownership-operational status sync done true
      2017-05-24 14:17:09,242 | DEBUG | qtp841096586-310 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | org.opendaylight.controller.cluster.databroker.actors.dds.SingleClientHistory@6bed1f87: creating a new snapshot member-1-datastore-operational-fe-0-txn-1-0
      2017-05-24 14:17:09,242 | DEBUG | qtp841096586-310 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Allocated proxy member-1-datastore-operational-fe-0-txn-1-1 for transaction member-1-datastore-operational-fe-0-txn-1-0
      2017-05-24 14:17:09,242 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-1-1, state=open} allocated sequence 0
      2017-05-24 14:17:09,243 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-1-1, state=open} sending request ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-1-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction$$Lambda$165/912139580@3c26e7ca
      2017-05-24 14:17:09,243 | DEBUG | qtp841096586-310 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: scheduling timeout in 30 seconds
      2017-05-24 14:17:09,243 | DEBUG | qtp841096586-310 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | Queue ConnectionEntry{request=ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-1-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners}, enqueuedTicks=684758707440} transmitting entry {}
      2017-05-24 14:17:09,267 | INFO | ult-dispatcher-3 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=1, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}} indicated no sequencing mismatch on member-1-datastore-operational-fe-0-txn-1-1 sequence 0, reconnecting it
      org.opendaylight.controller.cluster.access.commands.OutOfSequenceEnvelopeException: Expecting envelope 0

      and after some replays it reconnected to the member-2 backend (again):
      2017-05-24 14:17:09,279 | INFO | ult-dispatcher-3 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=1, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=2, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}}

      In the second iteration, the active instance was on member-2, so member-2 got isolated; after waiting for a new leader (member-1 got elected), member-1 was asked about the new active instance. This is the entity-owners read that got stuck for at least 30 seconds. Here is the complete end of karaf.log:
      2017-05-24 14:17:20,110 | INFO | lt-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational: Received role changed for member-1-shard-entity-ownership-operational from Candidate to Leader
      2017-05-24 14:17:20,110 | INFO | lt-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational: All Shards are ready - data store operational is ready, available count is 0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | org.opendaylight.controller.cluster.databroker.actors.dds.SingleClientHistory@6bed1f87: creating a new snapshot member-1-datastore-operational-fe-0-txn-2-0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Allocated proxy member-1-datastore-operational-fe-0-txn-2-1 for transaction member-1-datastore-operational-fe-0-txn-2-0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-2-1, state=open} allocated sequence 0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-2-1, state=open} sending request ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-2-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction$$Lambda$165/912139580@1482dd4a
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: timer already scheduled
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | Queue ConnectionEntry{request=ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-2-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners}, enqueuedTicks=697074151026} transmitting entry {}
      2017-05-24 14:17:28,858 | INFO | lt-dispatcher-32 | ProtocolStateActor | 174 - com.typesafe.akka.slf4j - 2.4.17 | No response from remote. Transport failure detector triggered. (internal state was Open)
      2017-05-24 14:17:28,861 | WARN | lt-dispatcher-32 | ReliableDeliverySupervisor | 174 - com.typesafe.akka.slf4j - 2.4.17 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.14.172:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
      2017-05-24 14:17:32,778 | INFO | ult-dispatcher-6 | ProtocolStateActor | 174 - com.typesafe.akka.slf4j - 2.4.17 | No response from remote. Transport failure detector triggered. (internal state was Open)
      2017-05-24 14:17:39,288 | DEBUG | ult-dispatcher-3 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: scheduling timeout in 30 seconds
      2017-05-24 14:17:48,899 | INFO | ult-dispatcher-5 | ProtocolStateActor | 174 - com.typesafe.akka.slf4j - 2.4.17 | No response from remote for outbound association. Associate timed out after [15000 ms].
      2017-05-24 14:17:48,900 | WARN | ult-dispatcher-5 | ReliableDeliverySupervisor | 174 - com.typesafe.akka.slf4j - 2.4.17 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.14.172:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.29.14.172:2550]] Caused by: [No response from remote for outbound association. Associate timed out after [15000 ms].]

      Things to note:

      • No reconnect attempted.
      • "timer already scheduled" at 14:17:21,558 followed by "scheduling timeout in 30 seconds" at 14:17:39,288
      • "No response from remote. Transport failure detector triggered. (internal state was Open)" twice.

      This time there is no transaction stream keeping anything busy; this seems like a straight bug caused by wrong timers or by missing reachability processing.
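
      If the connection really keeps only one pending timeout timer for the whole transmit queue, the second request's effective deadline can drift well past its own 30-second budget. The following toy model (an assumption for illustration, not the actual cds-access-client code) reproduces the timeline seen in the log:

      # Toy timeline model of a connection that arms at most one timeout timer
      # for its transmit queue.  This is an assumption used for illustration,
      # not the actual cds-access-client implementation.
      REQUEST_TIMEOUT = 30.0

      class ConnectionModel:
          def __init__(self):
              self.timer_deadline = None   # at most one timer armed at a time

          def enqueue(self, now, name):
              if self.timer_deadline is None:
                  self.timer_deadline = now + REQUEST_TIMEOUT
                  print("t=%4.1f  %s: scheduling timeout in 30 seconds" % (now, name))
              else:
                  # The new entry does not get its own timer.
                  print("t=%4.1f  %s: timer already scheduled" % (now, name))

          def timer_fired(self, now):
              # If the timer is simply re-armed from "now", an entry queued while
              # the previous timer was pending is not checked until up to
              # 2 * REQUEST_TIMEOUT after it was enqueued.
              self.timer_deadline = now + REQUEST_TIMEOUT
              print("t=%4.1f  re-armed: scheduling timeout in 30 seconds" % now)

      conn = ConnectionModel()
      conn.enqueue(0.0, "txn-1-1")    # 14:17:09,243 in the log
      conn.enqueue(12.3, "txn-2-1")   # 14:17:21,558 -> "timer already scheduled"
      conn.timer_fired(30.0)          # 14:17:39,288 -> new 30 s timeout
      # txn-2-1, enqueued at t=12.3, is now covered by a timer expiring at
      # t=60.0, i.e. it can sit unanswered for almost 48 seconds before
      # anything reacts.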

      [0] https://git.opendaylight.org/gerrit/57751
      [1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-partnheal-longevity-only-carbon/12/archives/log.html.gz#s1-s2-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k4-k6-k1-k1-k3-k1-k3-k1-k3-k1
      [2] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-partnheal-longevity-only-carbon/12/archives/odl1_karaf.log.gz

            Assignee: Unassigned
            Reporter: Vratko Polak (vrpolak)