[CONTROLLER-2032] Phosphorus SR1 - AskTimeoutException while trying to mount multiple Netconf devices at a time Created: 22/Feb/22 Updated: 06/Apr/22 |
|
| Status: | In Progress |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | 4.0.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Highest |
| Reporter: | Shibu Vijayakumar | Assignee: | Shibu Vijayakumar |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
Build Version: Phosphorus SR1, 3-node cluster. From the logs, member_1 is chosen as the master for the devices.

Scenario:
--------------------
Triggered 200 mount requests to ODL at the same time. The logs show that the requests were received and the mount process started for a few of the devices, but after some time we continuously get AskTimeoutException, which blocks the remaining device mounts. When the mount status was checked, only around 78 of the 200 devices were Connected; the mounts of the other devices got stuck because of the above, and the Netconf actor for mounting them was never even spawned.
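For reference, each mount was triggered through the netconf-topology RESTCONF API, and the test issued 200 such requests concurrently. Below is a minimal Java sketch of a single request; it is an illustration only, not the reporter's test harness: the controller address, credentials, node-id and device address are placeholders, and the payload fields are assumed to follow the netconf-node-topology model.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Hypothetical helper: issues one netconf mount request against the
    // draft-bierman02 RESTCONF endpoint; the test triggered 200 of these at once.
    public final class MountOneDevice {
        public static void main(String[] args) throws Exception {
            String nodeId = "vmx-1";
            String body = "{\"node\":[{"
                    + "\"node-id\":\"" + nodeId + "\","
                    + "\"netconf-node-topology:host\":\"192.0.2.1\","
                    + "\"netconf-node-topology:port\":830,"
                    + "\"netconf-node-topology:username\":\"admin\","
                    + "\"netconf-node-topology:password\":\"admin\","
                    + "\"netconf-node-topology:tcp-only\":false}]}";
            String basicAuth = Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://127.0.0.1:8181/restconf/config/"
                            + "network-topology:network-topology/topology/topology-netconf/node/" + nodeId))
                    .header("Content-Type", "application/json")
                    .header("Authorization", "Basic " + basicAuth)
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(nodeId + " -> HTTP " + response.statusCode());
        }
    }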
2022-02-22T20:43:22,747 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-46 | SupervisorStrategy | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Ask timed out on [Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317]] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Get]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
java.util.concurrent.TimeoutException: Ask timed out on [Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317]] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Get]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.actor.typed.scaladsl.AskPattern$.$anonfun$onTimeout$1(AskPattern.scala:131) ~[bundleFile:?]
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:730) ~[bundleFile:?]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[bundleFile:?]
    at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[bundleFile:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-02-22T20:43:22,788 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | RemoteActorRefProvider$RemoteDeadLetterActorRef | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Message [akka.cluster.ddata.typed.javadsl.Replicator$GetSuccess] to Actor[akka://opendaylight-cluster-data/deadLetters] was not delivered. [5] dead letters encountered. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/deadLetters] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2022-02-22T20:43:22,726 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Behavior$ | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Supervisor StopSupervisor saw failure: Ask timed out on [Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317]] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Update]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
java.util.concurrent.TimeoutException: Ask timed out on [Actor[akka://opendaylight-cluster-data/system/typedDdataReplicator#-1603400317]] after [5000 ms]. Message of type [akka.cluster.ddata.typed.javadsl.Replicator$Update]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.actor.typed.scaladsl.AskPattern$.$anonfun$onTimeout$1(AskPattern.scala:131) ~[bundleFile:?]
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:730) ~[bundleFile:?]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[bundleFile:?]
    at scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[bundleFile:?]
    at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[bundleFile:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-02-22T20:43:22,805 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-16 | ClusterSingletonManager | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Singleton actor [akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor] was terminated
2022-02-22T20:43:22,806 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-16 | LocalActorRef | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Message [akka.cluster.ddata.typed.javadsl.Replicator$Changed] wrapped in [akka.actor.typed.internal.AdaptMessage] to Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] was not delivered. [6] dead letters encountered. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2022-02-22T20:43:22,807 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-16 | LocalActorRef | 213 - org.opendaylight.controller.repackaged-akka - 4.0.7 | Message [akka.cluster.ddata.typed.javadsl.Replicator$Changed] wrapped in [akka.actor.typed.internal.AdaptMessage] to Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] was not delivered. [7] dead letters encountered. If this is not an expected behavior then Actor[akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor#901657858] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
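The [5000 ms] in the traces above is the timeout of the typed ask used when talking to the ddata Replicator: when the Replicator does not answer within that window, the ask fails with the AskTimeoutException shown. The Java sketch below is a generic illustration of that mechanic only; the GetState/StateReply protocol is hypothetical and is not the controller's own message protocol.

    import java.time.Duration;
    import java.util.concurrent.CompletionStage;
    import akka.actor.typed.ActorRef;
    import akka.actor.typed.ActorSystem;
    import akka.actor.typed.javadsl.AskPattern;

    final class AskTimeoutSketch {
        // Hypothetical request/response protocol; only the ask/timeout mechanics matter here.
        static final class GetState {
            final ActorRef<StateReply> replyTo;
            GetState(ActorRef<StateReply> replyTo) { this.replyTo = replyTo; }
        }
        static final class StateReply {
            final String value;
            StateReply(String value) { this.value = value; }
        }

        static CompletionStage<StateReply> query(ActorSystem<Void> system, ActorRef<GetState> target) {
            // If 'target' does not reply within 5 seconds, the returned CompletionStage fails
            // with an AskTimeoutException -- the same failure mode the log above reports for
            // the typedDdataReplicator Get/Update messages.
            return AskPattern.ask(target, GetState::new, Duration.ofSeconds(5), system.scheduler());
        }
    }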
|
| Comments |
| Comment by Shibu Vijayakumar [ 22/Feb/22 ] |
|
The major concern with this issue is: once we hit this timeout, no further mount request is processed by ODL. We tried issuing a new POST mount request for a device other than the above 200 (device vmx-201); the API call succeeded, but ODL never started the actual mount process. Even after deleting the 200 devices, a new mount did not go through, although the delete itself was successful.
For the single mount of device 201 performed after the 200 parallel mounts, the logs show the DOMEntity of the EntityTypeListenerActor as _nodeId=Uri{_value=vmx-201}}]]}}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false] |
| Comment by Shibu Vijayakumar [ 23/Feb/22 ] |
|
rovarga Another observation: once we get this error on the ODL instance that holds the master role for the device mounts, if we bring that instance down, another election takes place and one of the other ODL instances takes the leader role. After that, it resumes all the stuck mounts above and all devices move to Connected. |
| Comment by Robert Varga [ 03/Mar/22 ] |
|
So the ATE is troublesome, as it seems to indicate something in Akka replication. Two things:
|
| Comment by Shibu Vijayakumar [ 03/Mar/22 ] |
|
rovarga Sure, will check on point 1. Regarding point 2, I have executed the test case and it passed with no ATE on a single-instance ODL with odl-netconf-topology enabled, and also on the 3-node clustered ODL setup with the odl-netconf-topology feature installed instead of "odl-netconf-clustered-topology". The concern with the 3-node clustered ODL in that case is that the Netconf device mount point is created only on the instance that received the POST mount API call, not on the other two; because of that, the GET and PUT netconf APIs below fail on those instances with "mount point doesn't exist". Any workaround or suggestions for this scenario to make it work as a cluster? Sample APIs:

GET: curl -v -X GET -u admin:admin http://<ip>:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/<device_name>/yang-ext:mount

PUT: curl -H 'Content-Type: application/json' -H 'Accept: application/json' -v -X PUT -T policy-options.json -u admin:admin http://<ip>:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/<device_name>/yang-ext:mount/junos-conf-root:configuration/junos-conf-policy-options:policy-options
|
| Comment by Robert Varga [ 03/Mar/22 ] |
|
Right, and I would expect single-node deployments not to be susceptible to timing (the quoted 5000 ms). Distributed systems are pesky. In a cluster, odl-netconf-topology has a snowflake's chance in hell of working – the three instances will end up tripping over each other. But if that works okay, it is a matter of clustering integration and we need to understand what is going wrong. Unfortunately I cannot look at this closely enough because of other (long overdue) stuff... |
| Comment by Shibu Vijayakumar [ 07/Mar/22 ] |
|
rovarga In the above point, are you suggesting that we could check and enhance "odl-netconf-topology" to work as a cluster, so that mount points are created in all of the ODL instances? |
| Comment by Shibu Vijayakumar [ 09/Mar/22 ] |
|
rovarga I think I have more or less figured out the root cause of the stuck-mount issue above. Reason: Singleton actor [akka://opendaylight-cluster-data/system/singletonManagerOwnerSupervisor/OwnerSupervisor] was terminated, and this actor does not resume. Ideally the ClusterSingletonManager should make sure that this singleton actor is always running, right? Even if it terminates, I think it should be restarted.
From what I can tell, controller/EOSMain.java is the one creating this ClusterSingleton actor: // start the initial sync behavior that switches to the regular one after syncing
The class behind this actor is controller/OwnerSupervisor.java. Because this actor is terminated, the DOMEntities are stuck in the REMOTE_OWNERSHIP_LOST_NO_OWNER state and no ODL instance takes master ownership (LOCAL_OWNERSHIP_GRANTED) and proceeds with the mount.
If we restart the ODL instance, this singleton OwnerSupervisor gets started on one of the other ODL instances, and that resumes the on-hold mounts.
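If the root cause is indeed the singleton behavior stopping permanently after such a failure, one possible direction is to wrap the singleton behavior in restart supervision when it is initialized, so a failure restarts it instead of terminating it. The sketch below is an illustration under that assumption, not the actual EOSMain/OwnerSupervisor code; the class and method names are placeholders.

    import akka.actor.typed.ActorRef;
    import akka.actor.typed.ActorSystem;
    import akka.actor.typed.Behavior;
    import akka.actor.typed.SupervisorStrategy;
    import akka.actor.typed.javadsl.Behaviors;
    import akka.cluster.typed.ClusterSingleton;
    import akka.cluster.typed.SingletonActor;

    // Hypothetical sketch: start a cluster singleton whose behavior is restarted on
    // failure instead of being stopped, so ownership decisions keep being made.
    final class OwnerSupervisorSingletonSketch {
        static <T> ActorRef<T> start(ActorSystem<?> system, Behavior<T> ownerSupervisorBehavior) {
            Behavior<T> supervised = Behaviors.supervise(ownerSupervisorBehavior)
                    .onFailure(SupervisorStrategy.restart());
            return ClusterSingleton.get(system)
                    .init(SingletonActor.of(supervised, "OwnerSupervisor"));
        }
    }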
I could see that akka.cluster.singleton.hand-over-retry-interval = 1s is configured in factory/akka.conf, but I am still not sure why the OwnerSupervisor actor does not restart on another instance when it gets terminated. |
| Comment by Shibu Vijayakumar [ 04/Apr/22 ] |
|
rovarga Found a solution for the above issue: after increasing the "distributed-data" intervals below from milliseconds to 5 s in akka.conf, we were able to mount 250 netconf devices to ODL at the same time.
akka.cluster.distributed-data.gossip-interval = 5s
akka.cluster.distributed-data.notify-subscribers-interval = 5s

akka {
  cluster {
    distributed-data {
      # How often the Replicator should send out gossip information.
      # This value controls how quickly Entity Ownership Service data is replicated
      # across cluster nodes.
      gossip-interval = 5s

      # How often the subscribers will be notified of changes, if any.
      # This value controls how quickly Entity Ownership Service decisions are
      # propagated within a node.
      notify-subscribers-interval = 5s
    }
  }
}
|
| Comment by Shibu Vijayakumar [ 04/Apr/22 ] |
|
Added fix for restarting the EOS Singleton Actor -
|