[CONTROLLER-2029] ODL Clustering - Shard Has no Leader or AskTimeoutException Error - Silicon SR4 Created: 11/Feb/22  Updated: 26/Jul/23  Resolved: 18/May/23

Status: Resolved
Project: controller
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Highest
Reporter: Shibu Vijayakumar Assignee: Samuel Schneider
Resolution: Cannot Reproduce Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File karaf_member1.log     Text File karaf_member2.log     Text File karaf_member3.log    

 Description   

In a 3-node cluster, ODL version: Silicon SR4

Features enabled: odl-restconf-all and odl-netconf-clustered-topology

Scenario:

Tried to mount 250 NETCONF devices via the mount POST REST API; each mount request was sent synchronously, with a gap of around 2 seconds between requests.
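A minimal sketch of that loop is shown below, assuming the bierman02 RESTCONF endpoint on port 8181, default admin/admin credentials, and placeholder device names, addresses and ports; it approximates the reported scenario and is not the exact script that was used.

import time

import requests  # illustrative helper script; requires the 'requests' package

ODL = "http://localhost:8181"
TOPOLOGY = "/restconf/config/network-topology:network-topology/topology/topology-netconf"

def mount_payload(node_id, host, port):
    # Standard netconf-node-topology fields for a NETCONF mount request;
    # credentials and addresses below are placeholders
    return {
        "node": [{
            "node-id": node_id,
            "netconf-node-topology:host": host,
            "netconf-node-topology:port": port,
            "netconf-node-topology:username": "admin",
            "netconf-node-topology:password": "admin",
            "netconf-node-topology:tcp-only": False,
        }]
    }

for i in range(250):
    resp = requests.post(
        ODL + TOPOLOGY,
        json=mount_payload(f"device-{i}", "10.0.0.1", 17830 + i),
        auth=("admin", "admin"),
        headers={"Content-Type": "application/json"},
    )
    print(i, resp.status_code)
    time.sleep(2)  # roughly 2 seconds between requests, as in the reported scenario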

 

After around 40-50 successful requests, all subsequent mount POST requests fail with the error:

Shard Has no Leader or AskTimeoutException

2022-02-08T21:06:07,310 | INFO  | qtp1545674178-437 | RestconfImpl                     | 279 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.13.8 | Error creating data config/network-topology:network-topology/topology/topology-netconf
2022-02-08T21:06:07,310 | INFO  | qtp1545674178-437 | RestconfImpl                     | 279 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.13.8 | Error creating data config/network-topology:network-topology/topology/topology-netconfjava.util.concurrent.ExecutionException: TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.mdsal.common.api.DataStoreUnavailableException: Could not process forwarded ready transaction member-1-datastore-config-fe-0-txn-28-0. Shard member-3-shard-topology-config currently has no leader. Try again later.]]}
 at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564) ~[bundleFile:?]
 at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545) ~[bundleFile:?]
 at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:966) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:906) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.createConfigurationData(StatisticsRestconfServiceWrapper.java:161) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.createConfigurationData(RestconfCompositeWrapper.java:86) ~[bundleFile:?]
 at jdk.internal.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) ~[?:?]
 at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
 at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
2022-02-11T13:15:23,853 | INFO  | qtp1228907817-891 | RestconfImpl                     | 279 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.13.8 | Error creating data config/network-topology:network-topology/topology/topology-netconfjava.util.concurrent.ExecutionException: TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-topology-config#-1826243286)] after [30000 ms]. Message of type [org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.]]}
 at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564) ~[bundleFile:?]
 at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545) ~[bundleFile:?]
 at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:966) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:906) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.createConfigurationData(StatisticsRestconfServiceWrapper.java:161) ~[bundleFile:?]
 at org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.createConfigurationData(RestconfCompositeWrapper.java:86) ~[bundleFile:?]
 at jdk.internal.reflect.GeneratedMethodAccessor80.invoke(Unknown Source) ~[?:?]
 at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
 at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]

 

 

When the shard role assignments were checked:

Initially, when ODL is started, member-1 is assigned the Leader role; later, after the mount requests are pushed, a re-election takes place and the role changes from Leader to IsolatedLeader.
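
For reference, the shard role can be inspected over JMX/Jolokia with something like the following sketch (assuming the odl-jolokia feature is installed and default admin/admin credentials; the member and shard names are examples):

import requests

MEMBER = "member-1"
MBEAN = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
         f"Category=Shards,name={MEMBER}-shard-topology-config")

resp = requests.get(f"http://localhost:8181/jolokia/read/{MBEAN}",
                    auth=("admin", "admin"))
value = resp.json().get("value", {})
# RaftState is expected to be Leader/Follower/Candidate/IsolatedLeader;
# Leader names the shard member currently acting as leader
print(value.get("RaftState"), value.get("Leader"))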

 

 

 



 Comments   
Comment by Shibu Vijayakumar [ 11/Feb/22 ]

The above issue is not seen in the current ODL release (Phosphorus SR1), but it does exist in both Silicon SR3 and SR4.

The difference regarding the role change is that in Silicon, some time after the mount requests are pushed, the shard members change role (Leader changing to IsolatedLeader), whereas in Phosphorus no such RoleChangeNotifier events are seen: after initialization, no role changes are observed while the mount requests are being processed.

Comment by Robert Varga [ 12/Feb/22 ]

Well, the causes can be quite diverse and without karaf logs from all members I cannot determine the root cause.

Without a root cause there is just no way I could pinpoint the difference.

Comment by Shibu Vijayakumar [ 12/Feb/22 ]

Attached the karaf logs of all three members.

Comment by Samuel Schneider [ 18/May/23 ]

I wasn't able to reproduce this issue on an ODL Silicon SR4 3-node cluster.
I tried to connect 250 devices simulated by the ODL NETCONF testtool, and all of them connected successfully.

We would need a reproducer test or consistent steps to reproduce this issue.

After looking through the logs I noticed that all devices were connected by the same member-3; I could not reproduce this behavior. member-3 ended up without a leader, so this may be related to the issue.

Also, it seems the bierman02 endpoint was used, which was removed in later versions of ODL. You can try using the RFC 8040 endpoint instead.
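
For reference, a rough sketch of the same mount request against the RFC 8040 endpoint (restconf-nb-rfc8040) might look like this; the port, credentials and device address are placeholders:

import requests

node_id = "device-0"
url = ("http://localhost:8181/rests/data/"
       "network-topology:network-topology/topology=topology-netconf/node=" + node_id)
payload = {
    "node": [{
        "node-id": node_id,
        "netconf-node-topology:host": "10.0.0.1",
        "netconf-node-topology:port": 17830,
        "netconf-node-topology:username": "admin",
        "netconf-node-topology:password": "admin",
        "netconf-node-topology:tcp-only": False,
    }]
}
# PUT creates (or replaces) the node entry under topology-netconf
resp = requests.put(url, json=payload, auth=("admin", "admin"),
                    headers={"Content-Type": "application/json"})
print(resp.status_code)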
