[CONTROLLER-2036] Failure of initial removal of candidates from previous iteration Created: 05/Apr/22  Updated: 06/Apr/22  Resolved: 06/Apr/22

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: 5.0.1
Fix Version/s: 5.0.2

Type: Bug Priority: Medium
Reporter: Sangwook Ha Assignee: Sangwook Ha
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates CONTROLLER-2035 ODL Cluster - Akka Cluster Singleton ... Resolved

 Description   

Found in openflowplugin-csit-3node-clustering-bulkomatic-only-sulfur/191/

After the cluster is restarted, one instance repeatedly logs the following warning for about 5 minutes, until the leader is killed in a different test case:

https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-bulkomatic-only-sulfur/191/odl_3/odl3_karaf.log.gz

2022-04-05T02:58:18,683 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-45 | CandidateRegistryInit            | 202 - org.opendaylight.controller.eos-dom-akka - 5.0.1 | member-3 : Initial removal of candidates from previous iteration failed. Rescheduling.
java.util.concurrent.TimeoutException: Ask timed out on [Actor[akka://opendaylight-cluster-data/system/singletonProxyOwnerSupervisor-no-dc#2033451887]] after [5000 ms]. Message of type [org.opendaylight.controller.eos.akka.owner.supervisor.command.ClearCandidatesForMember]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.actor.typed.scaladsl.AskPattern$.$anonfun$onTimeout$1(AskPattern.scala:131) ~[bundleFile:?]
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:730) ~[bundleFile:?]
	at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[bundleFile:?]
	at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[bundleFile:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
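
For readers unfamiliar with the pattern behind this warning, the following is a minimal, stdlib-only sketch of "ask with a timeout, reschedule on failure". It is not the eos-dom-akka implementation and does not use Akka: the class AskRetrySketch, the method askWithRetry, and the bounded attempt count are hypothetical names and choices made for illustration only. It shows that when the recipient never replies, every rescheduled attempt times out in the same way, which is consistent with the warning repeating until something else in the cluster changes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical illustration only; not taken from the controller code base.
final class AskRetrySketch {

    // Sends the request up to maxAttempts times, waiting timeoutMs for each reply.
    // Returns the attempt number that succeeded, or -1 if no attempt got a reply.
    public static int askWithRetry(Supplier<CompletableFuture<String>> ask,
                                   long timeoutMs, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt issues a fresh request and waits for its reply.
                ask.get().get(timeoutMs, TimeUnit.MILLISECONDS);
                return attempt;
            } catch (TimeoutException e) {
                // Mirrors the "failed. Rescheduling." step: log and try again.
                System.out.println("attempt " + attempt + " timed out, rescheduling");
            } catch (ExecutionException e) {
                // The recipient replied with a failure; retrying is not attempted here.
                throw new IllegalStateException(e.getCause());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A recipient that never replies: the future is never completed.
        int silent = askWithRetry(() -> new CompletableFuture<String>(), 50, 3);
        System.out.println("silent recipient: " + silent);

        // A recipient that replies immediately.
        int replying = askWithRetry(() -> CompletableFuture.completedFuture("ack"), 50, 3);
        System.out.println("replying recipient: " + replying);
    }
}
```

The sketch caps the number of attempts so that it terminates; whether the real code should bound its retries or react differently to an unreachable recipient is a separate question that this example does not answer.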


 Comments   
Comment by Robert Varga [ 06/Apr/22 ]

Hmm, this looks rather weird. We are getting these:

2022-04-05T03:03:15,442 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | LocalActorRef                    | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Message [akka.actor.ReceiveTimeout$] to Actor[akka://opendaylight-cluster-data/system/IO-TCP/selectors/$a/9#-1115372167] was not delivered. [295] dead letters encountered, of which 284 were not logged. The counter will be reset now. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/system/IO-TCP/selectors/$a/9#-1115372167] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

Hunting through int/test, it seems openflowplugin is using an ancient akka.conf.template. Perhaps it needs an update?

Comment by Sangwook Ha [ 06/Apr/22 ]

The template file is in the tools directory, outside of csit, and I don't think the test suite actually uses it.
Clustering configuration doesn't seem to be generated/modified once the configuration files are created during the initial controller installation for openflowplugin clustering tests.

Comment by Robert Varga [ 06/Apr/22 ]

Interesting, I wonder where that TCP reference is coming from.
Anyway, it would seem the problem is that the target actor is unreachable – and all we have is an ActorRef. We probably need to restart when the supervisor.
I'll need to dig into it a bit more.

Comment by Robert Varga [ 06/Apr/22 ]

Actually https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-bulkomatic-only-sulfur/191/odl_1/odl1_karaf.log.gz has:

2022-04-05T02:58:10,693 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-32 | Behavior$                        | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Supervisor StopSupervisor saw failure: Ask timed out on [Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#380268694]] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Get]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
java.util.concurrent.TimeoutException: Ask timed out on [Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#380268694]] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Get]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.actor.typed.scaladsl.AskPattern$.$anonfun$onTimeout$1(AskPattern.scala:131) ~[bundleFile:?]
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:730) ~[bundleFile:?]
	at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[bundleFile:?]
	at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[bundleFile:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[bundleFile:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-04-05T02:58:10,700 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | ClusterSingletonManager          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Singleton actor [akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor] was terminated