[CONTROLLER-1690] Frontend client tries to reconnect to the same member which is no longer leader Created: 19/May/17  Updated: 25/Jul/23  Resolved: 22/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
duplicates CONTROLLER-1689 stopping resolution of shard 0 on sta... Resolved
External issue ID: 8513

 Description   

I had most of this written when I realized that member-1 received UnreachableMember just around the time the shard replica removal was happening, so in the end this might be just another symptom of the general CONTROLLER-1645 behavior.
Opening this anyway, just to have this particular symptom documented.

Scenario: a prefix-based shard (thus using the tell-based protocol) is created, with a transaction producer active on the leader member. Then the replica on that member is removed; the transaction producer is expected to continue once one of the other members becomes the new leader. The flow is sketched below.
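
For concreteness, here is the shape of that test flow as a Java sketch. Every type in it is a hypothetical stand-in; this is not the actual CSIT suite or MD-SAL API, only an illustration of the steps.

import java.util.Arrays;
import java.util.List;

final class ScenarioSketch {

    // Hypothetical stand-in for a prefix-based shard handle.
    interface PrefixShard {
        void removeReplica(String memberName); // drop the replica hosted on one member
        String currentLeader();
    }

    // Hypothetical producer; the real one uses the tell-based protocol.
    interface TransactionProducer {
        void writeAndCommit(String data) throws Exception;
    }

    // Hypothetical cluster facade.
    interface Cluster {
        PrefixShard createPrefixShard(String prefix, List<String> replicas);
        TransactionProducer createProducer(String memberName, String prefix);
    }

    static void scenario(final Cluster cluster) throws Exception {
        // 1. Prefix-based shard replicated on all three members.
        PrefixShard shard = cluster.createPrefixShard("id-ints",
                Arrays.asList("member-1", "member-2", "member-3"));

        // 2. Producer active on the current leader (member-3 in this report).
        String oldLeader = shard.currentLeader();
        TransactionProducer producer = cluster.createProducer(oldLeader, "id-ints");
        producer.writeAndCommit("before-removal");

        // 3. Remove the leader's replica; leadership has to move to another member.
        shard.removeReplica(oldLeader);

        // 4. Expected: the producer keeps working once a remaining member takes
        //    over. Observed: it fails with NotLeaderException from the old leader.
        producer.writeAndCommit("after-removal");
    }
}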

Robot sees the producer fail [0].
The message corresponds to what is in karaf.log [1]:
2017-05-19 05:13:04,133 | ERROR | ult-dispatcher-2 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.Carbon | member-3-frontend-datastore-Shard-id-ints!: failed to resolve shard 0
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-id-ints!-config#300839828] is not the current leader
at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:435)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.Carbon]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:305)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.Carbon]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.0.Carbon]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.0.Carbon]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.17]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.0.Carbon]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.17]
at akka.actor.Actor$class.aroundReceive(Actor.scala:497)[173:com.typesafe.akka.actor:2.4.17]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.17]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.17]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.17]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]

In this case member-3 was the original leader, and the (expected) message at 05:13:04,063 shows the client detected that member-3 is no longer the leader.
The client should then attempt to contact the new leader, but as I mentioned at the start, the new leader might have been unreachable at the time.
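
For illustration only, a minimal sketch of the retry behavior the frontend is expected to perform. The names (ShardResolver, resolveLeader, resolveWithBackoff) are hypothetical, not the actual cds-access-client API; the real logic lives in ClientActorBehavior.

import java.time.Duration;

final class LeaderResolutionSketch {

    // Hypothetical stand-in for the real NotLeaderException.
    static final class NotLeaderException extends Exception {
        NotLeaderException(final String message) {
            super(message);
        }
    }

    // Hypothetical view of shard leadership as seen by the frontend.
    interface ShardResolver {
        // Returns the member currently believed to lead the given shard.
        String resolveLeader(long shardId) throws NotLeaderException;
    }

    // Retry leader resolution with exponential backoff instead of reconnecting
    // to the member that just lost leadership.
    static String resolveWithBackoff(final ShardResolver resolver, final long shardId,
            final int maxAttempts) throws InterruptedException, NotLeaderException {
        Duration backoff = Duration.ofMillis(100);
        NotLeaderException last = new NotLeaderException("no attempts made");
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Ask who leads the shard now; a member that just lost
                // leadership answers with NotLeaderException instead.
                return resolver.resolveLeader(shardId);
            } catch (NotLeaderException e) {
                last = e;
                // The new leader may itself be unreachable for a while
                // (cf. CONTROLLER-1645), so back off before asking again.
                Thread.sleep(backoff.toMillis());
                backoff = backoff.multipliedBy(2);
            }
        }
        throw last;
    }
}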

Regardless, this fix [2] should make this situation less likely.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/697/archives/log.html.gz#s1-s22-t1-k2-k10
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/697/archives/odl3_karaf.log.gz
[2] https://git.opendaylight.org/gerrit/57074
