[CONTROLLER-1708] write-transactions does not respond (no evident reason) Created: 07/Jun/17 Updated: 25/Jul/23 Resolved: 24/Jul/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| External issue ID: | 8619 |
| Description |
|
This is basically …

This happened on Sandbox, when the original module-based leader member-1 was isolated for a brief time. The unresponsive member was member-2, which stayed a Follower, but it is not clear whether it considered member-3 a Leader. The writers on the other two members reported [1] success.

[0] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/odl2_karaf.log.gz |
| Comments |
| Comment by Vratko Polak [ 09/Jun/17 ] |
|
This is now happening [2] on RelEng. The unresponsive member-2 is the one which stayed follower.

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/737/log.html.gz#s1-s28-t1-k2-k25-k1-k1 |
| Comment by Vratko Polak [ 09/Jun/17 ] |
|
Note that this time Robot waits for the duration the writer should work, plus three 125 s timeouts, plus 10 seconds just to be sure something is really wrong. |
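As a trivial worked sum (a sketch only; the writer duration is test-specific, and the 60 s below is an assumed example value, not taken from this issue):

    // Hedged sketch of the Robot wait budget described above; the writer
    // duration varies per test, so WRITER_DURATION_S is only an assumed example.
    final class WaitBudget {
        static final int WRITER_DURATION_S = 60;   // assumed example value
        static final int REQUEST_TIMEOUT_S = 125;  // one request timeout
        static final int SAFETY_MARGIN_S = 10;     // "just to be sure"
        // 60 + 3 * 125 + 10 = 445 seconds with the assumed writer duration
        static final int TOTAL_WAIT_S =
            WRITER_DURATION_S + 3 * REQUEST_TIMEOUT_S + SAFETY_MARGIN_S;
    }
|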
| Comment by Vratko Polak [ 15/Jun/17 ] |
|
This is still happening; it does not matter whether the isolation time is brief [4] or long [5]. Only the member which stays follower acts this way, and only for module-based shards (only the tell-based protocol is tested). Prefix-based shard tests either pass, or run into …

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/749/log.html.gz#s1-s28-t1-k2-k25-k1-k2-k3 |
| Comment by Vratko Polak [ 27/Jun/17 ] |
|
Sandbox run [6] with verbose debug logs, codebase [7], with the first two failures as they happen on releng.

[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/5/ |
| Comment by Vratko Polak [ 30/Jun/17 ] |
|
This is still happening, but in the test (3 writers, previous Leader isolated for longer than 2 minutes) where …

[8] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/767/log.html.gz#s1-s28-t3-k2-k25-k1-k8 |
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
Recent Sandbox occurrence: [10]. |
| Comment by Vratko Polak [ 07/Jul/17 ] |
|
Based on one Sandbox run, code [11] may be fixing [12] this Bug.

[11] https://git.opendaylight.org/gerrit/#/c/60033/5 |
| Comment by Vratko Polak [ 10/Jul/17 ] |
|
> code [11] may be fixing [12] this Bug.

No. This week's Sandbox run on the same .zip shows [13] the read timeout again. |
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
This week's Sandbox run (codebase [14]) fails [15]; huge log [16]. As far as I can see, every future completed successfully, so I have no idea why the writer has not returned success. Will retry with extended logs [17].

[14] https://git.opendaylight.org/gerrit/#/c/59506/32 |
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
> every future completed successfully

That was a case where this Bug hit the member which stayed follower. In the next run [18], however, this bug happened on the old leader after rejoin (after a long isolation).

[18] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/4/ |
| Comment by Vratko Polak [ 18/Jul/17 ] |
|
Sandbox keeps reproducing [19] the variant with the old-leader failure. I will keep looking into this some more.

[19] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/8/log.html.gz#s1-s2-t1-k2-k25-k1-k8 |
| Comment by Vratko Polak [ 20/Jul/17 ] |
|
> ReconnectingClientConnection has never reconnected after rejoin.

Looks like in lockedCheckTimeout we are completing (failing) connection entries, which leads to a purge request entering the queue and sleeping the thread. Segment of karaf.log [22]:

2017-07-20 10:38:35,854 | DEBUG | lt-dispatcher-25 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | … backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#879061049], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} timed out entry TransmittedConnectionEntry{request=CommitLocalTransactionRequest{target=member-1-datastore-config-fe-0-chn-2-txn-4402-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-451310230], coordinated=false}, enqueuedTicks=443234413299, sessionId=0, txSequence=7862}
2017-07-20 10:38:35,854 | WARN | lt-dispatcher-25 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Tx: DOM-CHAIN-1-4402 Error during phase CAN_COMMIT, starting Abort
org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.022340136 seconds
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:390)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.runTimer(AbstractClientConnection.java:321)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
    at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:123)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
    at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:46)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
    at org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:60)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)[173:com.typesafe.akka.actor:2.4.18]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=FLUSHED} directCommit completed
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=FLUSHED}: initiating purge
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=DONE} allocated sequence 1
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=DONE} enqueing request TransactionPurgeRequest{target=member-1-datastore-config-fe-0-chn-2-txn-4402-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-451310230]} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$152/951909138@6759f4d3
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#879061049], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} has a successor, not scheduling timer
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-config: Sleeping for 498ms on connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#879061049], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} |
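To make the suspected sequence easier to follow, here is a simplified, self-contained sketch; the names mirror the classes in the log above, but this is an illustration under assumed semantics, not the actual cds-access-client code:

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.CompletableFuture;

    // Illustrative sketch only: NOT the real AbstractClientConnection.
    final class ConnectionSketch {
        // Each transmitted entry carries the future the application awaits.
        record Entry(CompletableFuture<Void> result, long enqueuedTicks) {}

        private final Queue<Entry> pending = new ArrayDeque<>();
        private final Queue<String> outbound = new ArrayDeque<>();

        void lockedCheckTimeout(final long nowTicks, final long timeoutTicks) {
            while (!pending.isEmpty()
                    && nowTicks - pending.peek().enqueuedTicks() >= timeoutTicks) {
                final Entry timedOut = pending.poll();
                // Failing the entry completes the commit future exceptionally ...
                timedOut.result().completeExceptionally(
                    new RuntimeException("request timed out (sketch)"));
                // ... and the transaction front-end reacts by enqueueing a purge
                // request on this same, still-reconnecting connection, where it
                // lingers while the timer keeps sleeping instead of progressing.
                outbound.add("TransactionPurgeRequest");
            }
        }
    }

The point illustrated: completing the timed-out entry synchronously produces the purge request, which lands back on the same unconnected connection, matching the "initiating purge" followed by "Sleeping for 498ms" lines above.
|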
| Comment by Robert Varga [ 20/Jul/17 ] |
|
The problem now is TransmitQueue:233, which switches from enqueue to send mechanics. This needs more investigation. |
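For context, the enqueue-versus-send distinction being referenced, as a hedged sketch (the real TransmitQueue also handles sequencing, throttling and reconnect hand-off; none of this is its actual code):

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustrative sketch only: NOT the real TransmitQueue.
    final class TransmitQueueSketch {
        private final Queue<String> deferred = new ArrayDeque<>();
        private boolean backendPresent;                // is a backend connected?
        private int inFlight;
        private static final int MAX_IN_FLIGHT = 1000; // cf. maxMessages=1000 in the log

        void enqueueOrSend(final String request) {
            if (!backendPresent || inFlight >= MAX_IN_FLIGHT || !deferred.isEmpty()) {
                // "enqueue" mechanics: hold the request until a backend appears
                // or earlier requests complete; if the connection never finishes
                // reconnecting, nothing ever transmits it.
                deferred.add(request);
            } else {
                // "send" mechanics: transmit immediately and track as in-flight.
                inFlight++;
                transmit(request);
            }
        }

        private void transmit(final String request) {
            // the real code does an actor tell() here
        }
    }
|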
| Comment by Robert Varga [ 20/Jul/17 ] |
| Comment by Vratko Polak [ 21/Jul/17 ] |
|
> https://git.opendaylight.org/gerrit/60614

Merged, but the Bug is still present. Big karaf.log [24] with additional [25] logs.

The scenario starts with transaction writers on each node. Then the Leader is isolated for more than 120 seconds, and its writer fails as expected. Then the member is rejoined and a new writer is started on that member. That is the writer which never returns any RPC result. Previously that was because it got stuck in the "write only the top list" step (possibly because ReconnectingClientConnection never connected after the rejoin). More details after I take a look into the log.

[24] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/16/odl1_karaf.log.gz |
| Comment by Vratko Polak [ 21/Jul/17 ] |
|
> More details after I take a look into the log.

>> ReconnectingClientConnection never connected after the rejoin

Still happens. Contrary to the previous attempt at 11:02:32,894, the attempt at 11:02:53,934 never failed (nor resolved the new backend), so ReconnectingClientConnection has never been replaced by ConnectedClientConnection after the rejoin (started at 11:03:16,111).

2017-07-21 11:02:53,934 | DEBUG | lt-dispatcher-26 | ClientActorBehavior | 197 - org.opendaylight.control… , cookie=0, backend…

>> it got stuck in "write only the top list" step

This time it got stuck after "Submitting list merge".

2017-07-21 11:03:29,669 | TRACE | qtp608302457-332 | WriteTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Submitting list merge.

Currently I have no idea why any of this happens; I will experiment some more. |
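The expected transition can be restated as a sketch (assumed semantics and hypothetical names, not the actual cds-access-client code): once the new backend resolves, the ReconnectingClientConnection should be superseded by a ConnectedClientConnection and its queued requests replayed; in this run the resolution step neither completed nor failed, so the swap never happened.

    import java.util.concurrent.CompletionStage;

    // Illustrative sketch only; ConnectionState and its variants are hypothetical.
    interface ConnectionState {}
    record Reconnecting() implements ConnectionState {}
    record Connected(String backendActorPath) implements ConnectionState {}

    final class ReconnectSketch {
        private volatile ConnectionState state = new Reconnecting();

        void onRejoin(final CompletionStage<String> backendLookup) {
            // Expected: the lookup completes, we swap in a connected state and
            // replay the requests queued while reconnecting.
            backendLookup.thenAccept(path -> {
                state = new Connected(path);
                replayQueuedRequests();
            });
            // Observed in this run: the lookup never completed (nor failed), so
            // the state stayed Reconnecting and queued requests never moved.
        }

        private void replayQueuedRequests() {
            // re-transmit entries held while reconnecting
        }
    }
|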
| Comment by Vratko Polak [ 21/Jul/17 ] |
|
I have partially good news. The … Should a separate bug be opened if the aforementioned fix gets merged? |
| Comment by Vratko Polak [ 24/Jul/17 ] |
|
> client application which handles callbacks from a single thread

That should not make a difference. The test passed [27] today on releng, but the test was not always failing before, and I believe this Bug is not fixed yet. Will create more runs on Sandbox. |
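For context, the single-threaded callback pattern under discussion, as a hedged sketch (details are assumed, not taken from the actual WriteTransactionsHandler): even when every completion funnels through one thread, an already-completed future still gets observed eventually, which is why it should not, by itself, stop the RPC from returning.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative sketch only; the real test handler differs in detail.
    final class SingleThreadCallbackClient {
        private final ExecutorService callbackExecutor =
            Executors.newSingleThreadExecutor();

        void track(final CompletableFuture<Void> commitFuture) {
            // All completions are observed on the same single thread: a slow
            // callback delays later ones, but completed futures stay completed,
            // so the RPC should still finish once all callbacks have run.
            commitFuture.whenCompleteAsync((ok, failure) -> {
                if (failure != null) {
                    // record the failure and finish the RPC with an error
                } else {
                    // count the success; finish the RPC when all are done
                }
            }, callbackExecutor);
        }
    }
|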
| Comment by Vratko Polak [ 24/Jul/17 ] |
|
>>> https://git.opendaylight.org/gerrit/60614

> I believe this Bug is not fixed yet.

Under the weight of Sandbox passes, I now believe that last week I was testing some different build by mistake, and [28] has indeed fixed this Bug. |