[CONTROLLER-1674] Timeout waiting for task in producer when moving shard leader Created: 15/May/17 Updated: 25/Jul/23 Resolved: 03/Jul/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||
| External issue ID: | 8445 | ||||||||
| Description |
|
This happened [0] in a suite for module-based shards with the tell-based protocol. Karaf.log [1] shows an unexpected UnreachableMember at 06:17:00,808, followed by many warnings such as this:
[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/684/archives/log.html.gz#s1-s29-t1-k2-k9-k1 |
| Comments |
| Comment by Vratko Polak [ 17/May/17 ] |
|
Today the writer timed out [2] at the RESTCONF level, without returning a timeout error of its own. Karaf.log [3] shows no suspicious message within the test case duration (between 03:01:22,254 and 03:02:52,753). Note that this run still did not include a potential fix [4].
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/692/archives/log.html.gz#s1-s24-t1-k2-k8 |
| Comment by Vratko Polak [ 18/May/17 ] |
|
This is still happening. The good news is that now only one writer reports this error [5]; the others pass (albeit with lower performance, as they wait for submits). |
| Comment by Vratko Polak [ 18/May/17 ] |
|
Something similar [6] happened in another scenario. (The previous scenario has a listener and moves the leader gracefully; this one has no listener but isolates the leader for a brief time. Both scenarios have 3 transaction writers, 2 of which now finish correctly, and both use module-based shards with the tell-based protocol.) The exceptions are slightly different, though.

The graceful movement response starts with:

The brief isolation response starts with:

    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:272)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
    at org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.maybeFinish(WriteTransactionsHandler.java:238)
    at org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.run(WriteTransactionsHandler.java:112)
    at |
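For context on how the writer can end up reporting a bare TimeoutException: below is a minimal, hypothetical sketch of the producer-side pattern, assuming the writer simply records its submit futures and waits on each one with a bounded get. The class and method names (SubmitPoller, record, awaitAll) are illustrative only and are not the actual WriteTransactionsHandler code.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only: a writer that records its submit futures and fails
// the whole run if any of them does not complete within a fixed per-task timeout.
final class SubmitPoller {
    private final Queue<CompletableFuture<Void>> outstanding = new ArrayDeque<>();

    void record(final CompletableFuture<Void> submitFuture) {
        outstanding.add(submitFuture);
    }

    // A backend that stalls (for example while the shard leader is moving) surfaces
    // here as java.util.concurrent.TimeoutException rather than as an explicit
    // leadership-change error, which matches the symptom reported above.
    void awaitAll(final long perTaskTimeoutSeconds)
            throws ExecutionException, InterruptedException, TimeoutException {
        CompletableFuture<Void> next;
        while ((next = outstanding.poll()) != null) {
            next.get(perTaskTimeoutSeconds, TimeUnit.SECONDS);
        }
    }
}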
| Comment by Viera Zelcamova [ 29/May/17 ] |
|
Vratko, please add a comment when you have a new run. Thank you. |
| Comment by Vratko Polak [ 29/May/17 ] |
|
This is still happening. In a recent Sandbox test [7] (which failed to upload the archive with logs), one writer failed with "TimeoutException: Timeout waiting for task" (as opposed to RequestTimeoutException), one finished successfully, and one overstepped the generous (440 s) timeout for the RESTCONF response. This was the brief isolation scenario (module-based shards, tell-based protocol), with the [8] codebase.
[7] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon/6/robot/report/log.html#s1-s10-t1-k2-k25-k1-k1 |
| Comment by Vratko Polak [ 02/Jun/17 ] |
|
2017-06-01 23:37:21,958 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Local_Leader_Shutdown
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557] is not the current leader
    at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:455)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:293)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
    at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.18]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
2017-06-01 23:37:38,662 | INFO | lt-dispatcher-32 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}}
2017-06-01 23:37:38,662 | INFO | lt-dispatcher-32 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: refreshing backend for shard 0
2017-06-01 23:37:38,703 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(AbstractShardBackendResolver.java:142)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)[:1.8.0_121]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)[:1.8.0_121]
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443)[:1.8.0_121]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)[:1.8.0_121]
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)[:1.8.0_121]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)[:1.8.0_121]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)[:1.8.0_121]
Caused by: org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557] is not the current leader
    at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:436)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:306)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.1.SNAPSHOT]
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
    at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.18]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
2017-06-01 23:37:39,688 | INFO | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-2-shard-default-config (Leader): Term 3 in "AppendEntriesReply [term=3, success=false, followerId=member-3-shard-default-config, logLastIndex=1047, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]" message is greater than leader's term 2 - switching to Follower
2017-06-01 23:37:39,689 | INFO | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-2-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 3
2017-06-01 23:37:39,689 | INFO | lt-dispatcher-20 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
2017-06-01 23:37:39,689 | INFO | ult-dispatcher-4 | RoleChangeNotifier | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | RoleChangeNotifier for member-2-shard-default-config , received role change from Leader to Follower
2017-06-01 23:37:39,689 | INFO | ult-dispatcher-4 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received role changed for member-2-shard-default-config from Leader to Follower
2017-06-01 23:37:40,765 | INFO | lt-dispatcher-22 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-2-shard-default-config: Successfully transferred leadership to null in 4.320 s
2017-06-01 23:37:40,768 | INFO | ult-dispatcher-3 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | Stopping Shard member-2-shard-default-config
2017-06-01 23:37:50,032 | INFO | lt-dispatcher-32 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Capping ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} throttle delay from 6 to 5 seconds |
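The "Failed to resolve shard ... Connection attempt failed" entries above show a NotLeaderException arriving as the cause of a TimeoutException thrown from a CompletableFuture completion callback. A minimal sketch of that wrapping pattern follows; it only illustrates the shape visible in the stack trace, and ConnectFailureWrapping/resolveBackend are made-up names, not the AbstractShardBackendResolver API.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the wrapping pattern visible in the stack trace above:
// an asynchronous connect attempt fails with a domain exception (a stand-in for
// NotLeaderException here) and the completion callback re-surfaces it as a
// TimeoutException("Connection attempt failed") with the original as the cause.
final class ConnectFailureWrapping {

    static CompletableFuture<String> resolveBackend(final CompletableFuture<String> connectAttempt) {
        final CompletableFuture<String> result = new CompletableFuture<>();
        connectAttempt.whenComplete((backend, failure) -> {
            if (failure != null) {
                // The caller only ever sees the TimeoutException; the real reason
                // (e.g. "not the current leader") is hidden one level down as the cause.
                result.completeExceptionally(
                    new TimeoutException("Connection attempt failed").initCause(failure));
            } else {
                result.complete(backend);
            }
        });
        return result;
    }
}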
| Comment by Vratko Polak [ 05/Jun/17 ] |
|
The karaf log segment above is from a Sandbox run against 57996/5 with 57699/5 cherry-picked. The scenario uses a module-based shard (default-config) with the tell-based protocol. The transaction writer is started on the leader member, and then remove-shard-replica is called. A similar test case (writer on a follower, still calling remove-shard-replica on the leader) passed (the writer finished without reporting any failure). A similar failure happened on a recent RelEng run [9], but there the karaf.log [10] segment between 13:53:54,938 and 13:56:47,140 shows that an UnreachableMember did occur.
[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/733/log.html.gz#s1-s20-t1-k2-k10-k1 |
| Comment by Vratko Polak [ 16/Jun/17 ] |
|
This symptom has re-appeared; it now happens [11] consistently, but only in one test case. That case tests module-based shards (tell-based protocol) with a single writer started on a follower; cluster-admin:make-leader-local then attempts to move the leader to the writer's member. The test expects the writer to finish writing successfully. The full writer response:

]}} |
| Comment by Vratko Polak [ 21/Jun/17 ] |
|
> consistently but only in one test case

Now I would say 50% of the time, still only in one [12] test case. The writer took 160 seconds to respond. A few suspicious lines from the writer member's karaf.log [13] since 07:45:22,131 follow. The first one is especially confusing, since the connection has already been reconnected to the new (local, member-2) leader, yet it still reports a NotLeaderException from the previous (remote, member-1) leader.

2017-06-21 07:45:38,519 | INFO | lt-dispatcher-20 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#494504499], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} indicated no leadership, reconnecting it
2017-06-21 07:45:38,624 | ERROR | lt-dispatcher-22 | OneForOneStrategy | 174 - com.typesafe.akka.slf4j - 2.4.18 | Transaction member-2-datastore-config-fe-0-chn-4-txn-5365-0 has already been sealed
2017-06-21 07:45:38,657 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: Ignoring unknown response Envelope{sessionId=1, txSequence=1, message=TransactionPurgeResponse{target=member-2-datastore-config-fe-0-chn-4-txn-5346-0, sequence=0}}
2017-06-21 07:48:18,136 | WARN | pool-30-thread-1 | WriteTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Future #5362/7971 not completed yet

[12] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/log.html.gz#s1-s24-t5-k2-k10 |
| Comment by Vratko Polak [ 26/Jun/17 ] |
|
> the connection is now reconnected to the new (local, member-2) leader

Comparing log messages to the code, I think this is a case of multiple FailureEnvelope messages arriving from the same old backend. Specifically, this [14] line confirms the connection to the new backend is established, but then this [15] line says the repeated message from the old backend is causing a reconnect of the new connection. In karaf.log [16], both log messages occur at 09:14:37,757. In theory, spurious reconnects should not cause system test failures, so there is something to fix in other parts of the code as well. A cosmetic remark: ReconnectingClientConnection probably only needs to know the maxMessages value; printing the whole ShardBackendInfo pointing to the old backend is slightly confusing, as that is not the backend ODL is likely to connect to.
[14] https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/cds-access-client/src/main/java/org/opendaylight/controller/cluster/access/client/ClientActorBehavior.java;h=554ffe97c77ae7c3011daca2a28c22f660a20e78;hb=refs/heads/stable/carbon#l307 |
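One way to picture the mitigation implied by the comment above is for the client to drop failure envelopes whose session no longer matches the connection's current backend before deciding to reconnect. The sketch below is a hypothetical guard under that assumption; StaleFailureGuard and its methods are illustrative names, not the ClientActorBehavior code referenced in [14] and [15].

// Hypothetical guard, not the actual ClientActorBehavior code: drop failure
// envelopes carrying a stale session id so that a repeated failure from an
// already-replaced backend cannot force yet another reconnect of the connection
// that has just been re-established.
final class StaleFailureGuard {
    private volatile long currentSessionId = -1;

    void onBackendConnected(final long sessionId) {
        currentSessionId = sessionId;
    }

    boolean shouldReconnect(final long envelopeSessionId) {
        // Only a failure reported by the currently connected backend should
        // trigger a reconnect; anything else is treated as stale and ignored.
        return envelopeSessionId == currentSessionId;
    }
}

Whether the actual fix takes this form is up to the patch for this issue; the sketch only captures the "ignore stale backend failures" idea.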
| Comment by Robert Varga [ 26/Jun/17 ] |
| Comment by Vratko Polak [ 29/Jun/17 ] |
|
After the rework of the transaction writer, the error message [17] is different, but I believe it is the same bug. This time it happened on RelEng when the old leader got remove-shard-replica and the single writer was on the member which remained a follower. The new writer response:

]}}

The karaf.log [18] reports 2011 futures as not completed; nothing else is visible at INFO level.
[17] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/765/log.html.gz#s1-s20-t3-k2-k8 |