[CONTROLLER-2032] Phosphorus SR1 - AskTimeoutException while trying to mount multiple Netconf devices at a time Created: 22/Feb/22  Updated: 06/Apr/22

Status: In Progress
Project: controller
Component/s: clustering
Affects Version/s: 4.0.10
Fix Version/s: None

Type: Bug Priority: Highest
Reporter: Shibu Vijayakumar Assignee: Shibu Vijayakumar
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File member1_karaf.log     Text File member2_karaf.log     Text File member3_karaf.log    

 Description   

Build Version: Phosphorus SR1

3 Node Cluster

From the Logs - member_1 is chosen as the Master for the Devices 

Scenario:

--------------------

Triggered 200 Mount requests at the same time to ODL.

In the logs we could see that the requests were received and the mount process started for a few of the devices, but after some time we start getting AskTimeoutException continuously, which blocks the remaining device mounts.

When we checked the mount status, only around 78 of the 200 devices were Connected; the mounts of the other devices got stuck because of the above, and ODL did not even spawn the Netconf actor for them.
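For context, below is a minimal sketch (not the actual test harness used here) of driving N parallel mounts against the draft-02 RESTCONF endpoint that the sample APIs later in this ticket also use. The controller address, device names/IPs, credentials and thread-pool size are placeholders, and the payload follows the commonly documented netconf-node-topology mount body.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelMount {
    // Placeholder controller address and credentials.
    private static final String ODL = "http://127.0.0.1:8181";
    private static final String AUTH =
        "Basic " + Base64.getEncoder().encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(50);   // pool size is arbitrary

        for (int i = 1; i <= 200; i++) {
            String nodeId = "vmx-" + i;          // placeholder device name
            String host = "10.0.0." + i;         // placeholder device address
            String body = "{\"node\":[{"
                + "\"node-id\":\"" + nodeId + "\","
                + "\"netconf-node-topology:host\":\"" + host + "\","
                + "\"netconf-node-topology:port\":830,"
                + "\"netconf-node-topology:username\":\"admin\","
                + "\"netconf-node-topology:password\":\"admin\","
                + "\"netconf-node-topology:tcp-only\":false}]}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ODL + "/restconf/config/network-topology:network-topology/"
                    + "topology/topology-netconf/node/" + nodeId))
                .header("Content-Type", "application/json")
                .header("Authorization", AUTH)
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

            // Submit all 200 PUT requests to the pool so they hit ODL at roughly the same time.
            pool.submit(() -> {
                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(nodeId + " -> HTTP " + response.statusCode());
                return null;
            });
        }
        pool.shutdown();
    }
}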

 

2022-02-22T20:43:22,747 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-46 | SupervisorStrategy | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Ask timed out on Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Get]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
java.util.concurrent.TimeoutException: Ask timed out on Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Get]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.actor.typed.scaladsl.AskPattern$.$anonfun$onTimeout$1(AskPattern.scala:131) ~[bundleFile:?]
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:730) ~[bundleFile:?]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[bundleFile:?]
    at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[bundleFile:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]

2022-02-22T20:43:22,788 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | RemoteActorRefProvider$RemoteDeadLetterActorRef | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Message [akka.cluster.ddata.typed.javadsl.Replicator$GetSuccess] to Actor[akka://opendaylight-cluster-data/deadLetters] was not delivered. [5] dead letters encountered. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/deadLetters] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

2022-02-22T20:43:22,726 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Behavior$ | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Supervisor StopSupervisor saw failure: Ask timed out on Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Update]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
java.util.concurrent.TimeoutException: Ask timed out on Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Update]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.actor.typed.scaladsl.AskPattern$.$anonfun$onTimeout$1(AskPattern.scala:131) ~[bundleFile:?]
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:730) ~[bundleFile:?]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[bundleFile:?]
    at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[bundleFile:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]

2022-02-22T20:43:22,805 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-16 | ClusterSingletonManager | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Singleton actor [akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor] was terminated

2022-02-22T20:43:22,806 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-16 | LocalActorRef | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Message [akka.cluster.ddata.typed.javadsl.Replicator$Changed] wrapped in [akka.actor.typed.internal.AdaptMessage] to Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] was not delivered. [6] dead letters encountered. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

2022-02-22T20:43:22,807 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-16 | LocalActorRef | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Message [akka.cluster.ddata.typed.javadsl.Replicator$Changed] wrapped in [akka.actor.typed.internal.AdaptMessage] to Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] was not delivered. [7] dead letters encountered. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

 Comments   
Comment by Shibu Vijayakumar [ 22/Feb/22 ]

The major concern with this issue is that once we hit this timeout, no further mount request is processed by ODL.

We tried a new POST mount request for a device outside the above 200 (device vmx-201); the API call succeeded, but the actual mount process was not started by ODL.

Even after deleting the 200 devices, a new mount did not go through, although the delete process itself was successful.

 

The single mount of device 201, performed after the 200 parallel mounts, was logged at:
2022-02-22T20:48:14,434 | INFO | opendaylight-cluster-data-notification-dispatcher-59 | DefaultSchemaResourceManager | 287 - org.opendaylight.netconf.sal-netconf-connector - 2.0.11 | Netconf connector for device RemoteDevice{vmx-201} will use schema cache directory junos instead of schema

From the logs we could see the DOMEntity state in the EntityTypeListenerActor as:

_nodeId=Uri{_value=vmx-201}}]]}}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false]

Comment by Shibu Vijayakumar [ 23/Feb/22 ]

rovarga Another observation: once we hit this error on the ODL instance that holds the Master role for the device mounts, if we bring that instance down, another election happens and one of the other ODL instances takes the Leader role; after that, all the stuck mounts resume and all devices move to Connected.

Comment by Robert Varga [ 03/Mar/22 ]

So the ATE is troublesome, as it seems to indicate something in Akka replication. Two things:

  1. try to monitor the VMs/JVMs for signs of issues, most notably GC pauses and similar
  2. try the test case with odl-netconf-topology, i.e. without cluster, to bring down the amount of moving pieces
Comment by Shibu Vijayakumar [ 03/Mar/22 ]

rovarga Sure, will check on point 1.
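Regarding point 1, a minimal sketch (not something from this ticket) of sampling GC activity of each member's Karaf JVM remotely over JMX; the JMX service URL and the karaf/karaf credentials are placeholders for the actual deployment.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcWatch {
    public static void main(String[] args) throws Exception {
        // Placeholder Karaf JMX endpoint and credentials; adjust to the actual cluster member.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:1099/karaf-root");
        Map<String, Object> env = new HashMap<>();
        env.put(JMXConnector.CREDENTIALS, new String[] {"karaf", "karaf"});

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            Set<ObjectName> gcBeans =
                    mbsc.queryNames(new ObjectName("java.lang:type=GarbageCollector,*"), null);

            // Poll cumulative GC counts and accumulated pause time every 5 seconds;
            // a large jump in the time column between two polls points at a long GC pause.
            for (int i = 0; i < 60; i++) {
                for (ObjectName name : gcBeans) {
                    GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                            mbsc, name.getCanonicalName(), GarbageCollectorMXBean.class);
                    System.out.printf("%s: collections=%d, totalPauseMillis=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(5_000);
            }
        } finally {
            connector.close();
        }
    }
}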

W.r.t. point 2, we executed the test case and it was a SUCCESS with no ATE on a single-instance ODL with odl-netconf-topology enabled, and also on the 3-node clustered ODL setup with the odl-netconf-topology feature installed instead of "odl-netconf-clustered-topology".

But the concern with the 3-node clustered ODL in the above case is that the Netconf device mount point is created only on the instance that receives the POST mount API call, not on the other two, so the GET and PUT netconf APIs below fail on those instances, saying "mount point doesn't exists".

Any workaround or suggestions on this scenario to make it work as a Cluster with the 
odl-netconf-topology feature installed?

Sample apis:

GET: curl -v -X GET -u admin:admin http://<ip>:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/<device_name>/yang-ext:mount

PUT: curl -H 'Content-Type: application/json' -H 'Accept: application/json' -v -X PUT -T policy-options.json -u admin:admin http://<ip>:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/<device_name>/yang-ext:mount/junos-conf-root:configuration/junos-conf-policy-options:policy-options

 

Comment by Robert Varga [ 03/Mar/22 ]

Right, and I would expect single-node deployments not to be susceptible to timing (the quoted 5000 ms). Distributed systems are pesky.

In a cluster, odl-netconf-topology has a snowflake's chance in hell to work – the three instances will end up tripping over each other.

But, if that works okay, it is a matter of clustering integration and we need to understand what is going wrong. Unfortunately I cannot look at this closely enough because of other (long overdue) stuff...

Comment by Shibu Vijayakumar [ 07/Mar/22 ]

rovarga In the above point, are you suggesting we could check and enhance "odl-netconf-topology" to work as a cluster and create mount points on all of the ODL instances?

Comment by Shibu Vijayakumar [ 09/Mar/22 ]

rovarga I think I have figured out the root cause of the above stuck-mount issue.

Reason: Singleton actor [akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor] was terminated

And this actor does not resume. Ideally the ClusterSingletonManager should make sure that this singleton actor is always running, right? Even if it terminates, it should be restarted, I think.

 

I understand controller/EOSMain.java is the one creating this ClusterSingleton actor:

// start the initial sync behavior that switches to the regular one after syncing
ownerSupervisor = clusterSingleton.init(
        SingletonActor.of(IdleSupervisor.create(iidCodec), "OwnerSupervisor"));

 

And the class of this actor is controller/OwnerSupervisor.java

As this actor is terminated, the DOMEntities are stuck in the REMOTE_OWNERSHIP_LOST_NO_OWNER state, and no ODL instance takes master ownership (LOCAL_OWNERSHIP_GRANTED) and proceeds with the mount.

 

If we restart that ODL instance, the singleton OwnerSupervisor gets started on one of the other ODL instances, and that resumes the on-hold mounts.

 

I could see akka.cluster.singleton.hand-over-retry-interval = 1s is configured in factory/akka.conf, but I am still not sure why the OwnerSupervisor actor does not restart on another instance when it gets terminated.
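For illustration only, a minimal sketch of wrapping the supervisor behavior with a restart supervision strategy before registering it as the cluster singleton, using the Akka Typed API. This is a hedged sketch against the EOSMain.java snippet quoted above, not the actual CONTROLLER-2035 change; the backoff parameters are assumptions.

// Illustrative sketch only, not the actual CONTROLLER-2035 patch.
// Requires: akka.actor.typed.SupervisorStrategy, akka.actor.typed.javadsl.Behaviors,
// akka.cluster.typed.SingletonActor and java.time.Duration.
// clusterSingleton, iidCodec and ownerSupervisor are the names from the EOSMain.java
// snippet quoted earlier; the restart/backoff values below are assumptions.
ownerSupervisor = clusterSingleton.init(SingletonActor.of(
        Behaviors.supervise(IdleSupervisor.create(iidCodec))
                .onFailure(SupervisorStrategy.restartWithBackoff(
                        Duration.ofSeconds(1),   // min backoff (assumed)
                        Duration.ofSeconds(30),  // max backoff (assumed)
                        0.2)),                   // random factor (assumed)
        "OwnerSupervisor"));

With this kind of wrapping, a failure such as the AskTimeoutException above would restart the OwnerSupervisor behavior instead of stopping it, which is the symptom described in this comment.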

Comment by Shibu Vijayakumar [ 04/Apr/22 ]

rovarga Found a solution for the above issue: when we increased the below "distributed-data" intervals from ms to 5s in akka.conf, we were able to mount 250 netconf devices at the same time to ODL.

 

akka.cluster.distributed-data.gossip-interval = 5s

akka.cluster.distributed-data.notify-subscribers-interval = 5s

 

akka {
  cluster {
    distributed-data {
      # How often the Replicator should send out gossip information.
      # This value controls how quickly Entity Ownership Service data is replicated
      # across cluster nodes.
      gossip-interval = 5s

      # How often the subscribers will be notified of changes, if any.
      # This value controls how quickly Entity Ownership Service decisions are
      # propagated within a node.
      notify-subscribers-interval = 5s
    }
  }
}

 

Comment by Shibu Vijayakumar [ 04/Apr/22 ]

Added a fix for restarting the EOS singleton actor under CONTROLLER-2035 (ODL Cluster - Akka Cluster Singleton Manager Actor - OwnerSupervisor Getting Terminated and never restarts).

 
