CONTROLLER-1697

entity-owners read too slow: Transport failure detector triggered.


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Component: clustering
    • Operating System: All
      Platform: All

    • 8547

      This is a Carbon-SR1 bug in the sense that it has been seen on a build based on a change [0] which has +2 but will not be merged into the Carbon release.

      This is similar to CONTROLLER-1672, but some symptoms are different.
      From the Robot point of view, this is still just a restconf read [1] not finishing in time (30 seconds). This time we are reading entity-owners (a module-based shard), as we are testing Singleton Service (tell-based protocol) by repeatedly isolating the member with the active instance.
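
      For context, the failing test step is effectively just the following kind of GET. This is a minimal sketch; the restconf URL form (Carbon-era draft-bierman02 style), the credentials, and the member address are assumptions for illustration, not values taken from the suite itself:

      # Minimal sketch of the entity-owners read the Robot suite performs.
      # URL form, credentials and member address are assumptions, not taken
      # from the suite itself.
      import requests

      MEMBER_1 = "member-1.example.net"  # placeholder; member-1's real address is not in this report
      URL = "http://%s:8181/restconf/operational/entity-owners:entity-owners" % MEMBER_1

      # The test step fails when this GET does not complete within 30 seconds.
      response = requests.get(URL, auth=("admin", "admin"), timeout=30)
      response.raise_for_status()
      print(response.json())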

      The test was run with debug log verbosity on some loggers. Member-1 karaf log is here [2], but it is from Sandbox, so it will be deleted over the weekend.

      The first iteration passed.
      At its start, the member-1 client connected to the member-2 backend:
      2017-05-24 14:13:34,374 | INFO | lt-dispatcher-20 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: replaced connection ConnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=1, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}}

      Member-1 was isolated in the first iteration. After rejoin, the test spent 1 minute making sure responses from the active singleton instance were not changing.
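
      The per-iteration flow is roughly the following sketch; the helper functions are hypothetical stand-ins for the Robot keywords (the isolation mechanism itself is not shown in this report):

      # Rough outline of one test iteration as described in this report.
      # isolate(), rejoin(), wait_for_new_leader() and get_active_instance()
      # are hypothetical stand-ins for Robot keywords, not the suite's code.
      import time

      def run_iteration(isolate, rejoin, wait_for_new_leader, get_active_instance):
          owner = get_active_instance()      # member currently hosting the singleton
          isolate(owner)                     # partition it away from the cluster
          wait_for_new_leader()              # remaining members elect a new shard leader
          new_owner = get_active_instance()  # the entity-owners read; this is what hangs
          rejoin(owner)                      # heal the partition
          # For one minute, keep checking that the reported active instance
          # does not change before starting the next iteration.
          deadline = time.time() + 60
          while time.time() < deadline:
              assert get_active_instance() == new_owner
              time.sleep(5)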

      The next read of entity-owners is at the start of the second iteration. The log shows the member-1 client detected a mismatch:
      2017-05-24 14:16:02,976 | INFO | ult-dispatcher-3 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-1-shard-entity-ownership-operational status sync done true
      2017-05-24 14:17:09,242 | DEBUG | qtp841096586-310 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | org.opendaylight.controller.cluster.databroker.actors.dds.SingleClientHistory@6bed1f87: creating a new snapshot member-1-datastore-operational-fe-0-txn-1-0
      2017-05-24 14:17:09,242 | DEBUG | qtp841096586-310 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Allocated proxy member-1-datastore-operational-fe-0-txn-1-1 for transaction member-1-datastore-operational-fe-0-txn-1-0
      2017-05-24 14:17:09,242 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-1-1, state=open} allocated sequence 0
      2017-05-24 14:17:09,243 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-1-1, state=open} sending request ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-1-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction$$Lambda$165/912139580@3c26e7ca
      2017-05-24 14:17:09,243 | DEBUG | qtp841096586-310 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: scheduling timeout in 30 seconds
      2017-05-24 14:17:09,243 | DEBUG | qtp841096586-310 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | Queue ConnectionEntry{request=ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-1-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners}, enqueuedTicks=684758707440} transmitting entry {}
      2017-05-24 14:17:09,267 | INFO | ult-dispatcher-3 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=1, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}} indicated no sequencing mismatch on member-1-datastore-operational-fe-0-txn-1-1 sequence 0, reconnecting it
      org.opendaylight.controller.cluster.access.commands.OutOfSequenceEnvelopeException: Expecting envelope 0

      and after some replays it reconnected to the member-2 backend (again):
      2017-05-24 14:17:09,279 | INFO | ult-dispatcher-3 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=1, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.172:2550/user/shardmanager-operational/member-2-shard-entity-ownership-operational#-605622033], sessionId=2, version=BORON, maxMessages=1000, cookie=1, shard=entity-ownership, dataTree present=false}}

      In the second iteration, the active instance was on member-2, so member-2 got isolated; after waiting for a new leader (member-1 got elected), member-1 was asked about the new active instance. This is the entity-owners read that got stuck for at least 30 seconds. Here is the complete end of karaf.log:
      2017-05-24 14:17:20,110 | INFO | lt-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational: Received role changed for member-1-shard-entity-ownership-operational from Candidate to Leader
      2017-05-24 14:17:20,110 | INFO | lt-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational: All Shards are ready - data store operational is ready, available count is 0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | org.opendaylight.controller.cluster.databroker.actors.dds.SingleClientHistory@6bed1f87: creating a new snapshot member-1-datastore-operational-fe-0-txn-2-0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Allocated proxy member-1-datastore-operational-fe-0-txn-2-1 for transaction member-1-datastore-operational-fe-0-txn-2-0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-2-1, state=open} allocated sequence 0
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-1-datastore-operational-fe-0-txn-2-1, state=open} sending request ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-2-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction$$Lambda$165/912139580@1482dd4a
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: timer already scheduled
      2017-05-24 14:17:21,558 | DEBUG | qtp841096586-310 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | Queue ConnectionEntry{request=ReadTransactionRequest{target=member-1-datastore-operational-fe-0-txn-2-1, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$b#-460505189], snapshotOnly=true, path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:entity-owners?revision=2015-08-04)entity-owners}, enqueuedTicks=697074151026} transmitting entry {}
      2017-05-24 14:17:28,858 | INFO | lt-dispatcher-32 | ProtocolStateActor | 174 - com.typesafe.akka.slf4j - 2.4.17 | No response from remote. Transport failure detector triggered. (internal state was Open)
      2017-05-24 14:17:28,861 | WARN | lt-dispatcher-32 | ReliableDeliverySupervisor | 174 - com.typesafe.akka.slf4j - 2.4.17 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.14.172:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
      2017-05-24 14:17:32,778 | INFO | ult-dispatcher-6 | ProtocolStateActor | 174 - com.typesafe.akka.slf4j - 2.4.17 | No response from remote. Transport failure detector triggered. (internal state was Open)
      2017-05-24 14:17:39,288 | DEBUG | ult-dispatcher-3 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: scheduling timeout in 30 seconds
      2017-05-24 14:17:48,899 | INFO | ult-dispatcher-5 | ProtocolStateActor | 174 - com.typesafe.akka.slf4j - 2.4.17 | No response from remote for outbound association. Associate timed out after [15000 ms].
      2017-05-24 14:17:48,900 | WARN | ult-dispatcher-5 | ReliableDeliverySupervisor | 174 - com.typesafe.akka.slf4j - 2.4.17 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.14.172:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.29.14.172:2550]] Caused by: [No response from remote for outbound association. Associate timed out after [15000 ms].]

      Things to note:

      • No reconnect attempted.
      • "timer already scheduled" at 14:17:21,558 followed by "scheduling timeout in 30 seconds" at 14:17:39,288
      • "No response from remote. Transport failure detector triggered. (internal state was Open)" twice.

      This time there is no transaction stream keeping anything busy; this seems like a straight bug caused by wrong timers or by missing reachability processing.
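
      If the connection really keeps only one pending timeout timer for the whole transmit queue, the second request's effective deadline can drift well past its own 30-second budget. The following toy model (an assumption for illustration, not the actual cds-access-client code) reproduces the timeline seen in the log:

      # Toy timeline model of a connection that arms at most one timeout timer
      # for its transmit queue.  This is an assumption used for illustration,
      # not the actual cds-access-client implementation.
      REQUEST_TIMEOUT = 30.0

      class ConnectionModel:
          def __init__(self):
              self.timer_deadline = None   # at most one timer armed at a time

          def enqueue(self, now, name):
              if self.timer_deadline is None:
                  self.timer_deadline = now + REQUEST_TIMEOUT
                  print("t=%4.1f  %s: scheduling timeout in 30 seconds" % (now, name))
              else:
                  # The new entry does not get its own timer.
                  print("t=%4.1f  %s: timer already scheduled" % (now, name))

          def timer_fired(self, now):
              # If the timer is simply re-armed from "now", an entry queued while
              # the previous timer was pending is not checked until up to
              # 2 * REQUEST_TIMEOUT after it was enqueued.
              self.timer_deadline = now + REQUEST_TIMEOUT
              print("t=%4.1f  re-armed: scheduling timeout in 30 seconds" % now)

      conn = ConnectionModel()
      conn.enqueue(0.0, "txn-1-1")    # 14:17:09,243 in the log
      conn.enqueue(12.3, "txn-2-1")   # 14:17:21,558 -> "timer already scheduled"
      conn.timer_fired(30.0)          # 14:17:39,288 -> new 30 s timeout
      # txn-2-1, enqueued at t=12.3, is now covered by a timer expiring at
      # t=60.0, i.e. it can sit unanswered for almost 48 seconds before
      # anything reacts.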

      [0] https://git.opendaylight.org/gerrit/57751
      [1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-partnheal-longevity-only-carbon/12/archives/log.html.gz#s1-s2-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k4-k6-k1-k1-k3-k1-k3-k1-k3-k1
      [2] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-partnheal-longevity-only-carbon/12/archives/odl1_karaf.log.gz

            Assignee: Unassigned
            Reporter: Vratko Polak (vrpolak)