[CONTROLLER-2029] ODL Clustering - Shard Has no Leader or AskTimeoutException Error - Silicon SR4 Created: 11/Feb/22 Updated: 26/Jul/23 Resolved: 18/May/23 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Highest |
| Reporter: | Shibu Vijayakumar | Assignee: | Samuel Schneider |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | pt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
In a 3-node cluster, ODL version: Silicon SR4. Features enabled: odl-restconf-all and odl-netconf-clustered-topology. Scenario: tried to mount 250 NETCONF devices via the mount POST REST API; each mount request was sent synchronously, with a time gap of around 2 seconds between requests.
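For reference, the mount loop was roughly equivalent to the following sketch (hostname, credentials and device addresses are placeholders, not taken from this ticket); it POSTs one netconf-node-topology payload at a time to the bierman02 config endpoint and waits about 2 seconds between requests:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Hedged sketch of the mount loop described above: one POST per device,
// ~2 s apart, against the bierman02 config endpoint. Host, credentials and
// device addresses are placeholders.
public class MountDevices {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        for (int i = 1; i <= 250; i++) {
            String body =
                "<node xmlns=\"urn:TBD:params:xml:ns:yang:network-topology\">"
                + "<node-id>device-" + i + "</node-id>"
                + "<host xmlns=\"urn:opendaylight:netconf-node-topology\">10.0.0." + i + "</host>"
                + "<port xmlns=\"urn:opendaylight:netconf-node-topology\">830</port>"
                + "<username xmlns=\"urn:opendaylight:netconf-node-topology\">admin</username>"
                + "<password xmlns=\"urn:opendaylight:netconf-node-topology\">admin</password>"
                + "<tcp-only xmlns=\"urn:opendaylight:netconf-node-topology\">false</tcp-only>"
                + "</node>";
            HttpRequest request = HttpRequest.newBuilder(URI.create(
                    "http://controller:8181/restconf/config/network-topology:network-topology/topology/topology-netconf"))
                .header("Content-Type", "application/xml")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("device-" + i + " -> HTTP " + response.statusCode());
            Thread.sleep(2000); // ~2 s gap between mount requests, as in the scenario
        }
    }
}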
After around 40-50 requests pass, all subsequent mount POST requests fail with the error "Shard has no leader" or AskTimeoutException:
2022-02-08T21:06:07,310 | INFO | qtp1545674178-437 | RestconfImpl | 279 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.13.8 | Error creating data config/network-topology:network-topology/topology/topology-netconf
java.util.concurrent.ExecutionException: TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.mdsal.common.api.DataStoreUnavailableException: Could not process forwarded ready transaction member-1-datastore-config-fe-0-txn-28-0. Shard member-3-shard-topology-config currently has no leader. Try again later.]]}
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564) ~[bundleFile:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545) ~[bundleFile:?]
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88) ~[bundleFile:?]
at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:966) ~[bundleFile:?]
at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:906) ~[bundleFile:?]
at org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.createConfigurationData(StatisticsRestconfServiceWrapper.java:161) ~[bundleFile:?]
at org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.createConfigurationData(RestconfCompositeWrapper.java:86) ~[bundleFile:?]
at jdk.internal.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
2022-02-11T13:15:23,853 | INFO | qtp1228907817-891 | RestconfImpl | 279 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.13.8 | Error creating data config/network-topology:network-topology/topology/topology-netconf
java.util.concurrent.ExecutionException: TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-topology-config#-1826243286)] after [30000 ms]. Message of type [org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.]]}
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564) ~[bundleFile:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545) ~[bundleFile:?]
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88) ~[bundleFile:?]
at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:966) ~[bundleFile:?]
at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.createConfigurationData(RestconfImpl.java:906) ~[bundleFile:?]
at org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.createConfigurationData(StatisticsRestconfServiceWrapper.java:161) ~[bundleFile:?]
at org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.createConfigurationData(RestconfCompositeWrapper.java:86) ~[bundleFile:?]
at jdk.internal.reflect.GeneratedMethodAccessor80.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
When checking the shard roles assigned: initially, when ODL is started, member-1 is assigned the Leader role; later, after pushing the mount requests, a re-election happens and the role changes from Leader to IsolatedLeader.
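The shard role can be checked on each member through the Jolokia read endpoint; a minimal sketch is below, assuming the default Jolokia port and the shard MBean naming from the ODL clustering documentation (host, credentials and member name are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Hedged sketch: read the topology-config shard MBean via Jolokia on one member.
// The response JSON includes attributes such as RaftState ("Leader", "Follower",
// "IsolatedLeader") and the current Leader. Host, credentials and the member name
// in the MBean are placeholders.
public class CheckShardRole {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "http://controller:8181/jolokia/read/org.opendaylight.controller:"
                + "type=DistributedConfigDatastore,Category=Shards,"
                + "name=member-1-shard-topology-config"))
            .header("Authorization", "Basic " + auth)
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}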
|
| Comments |
| Comment by Shibu Vijayakumar [ 11/Feb/22 ] |
|
The above issue is not seen in the current ODL release (Phosphorus SR1), but it exists in Silicon SR3 and SR4. The difference regarding the role change: in Silicon, some time after the mount requests are pushed, the shard members change role (Leader changing to IsolatedLeader), while in Phosphorus those RoleChangeNotifier events are not seen; after initialization, no role changes are observed while the mount requests are being processed. |
| Comment by Robert Varga [ 12/Feb/22 ] |
|
Well, the causes can be quite diverse and without karaf logs from all members I cannot determine the root cause. Without a root cause there is just no way I could pinpoint the difference. |
| Comment by Shibu Vijayakumar [ 12/Feb/22 ] |
|
Attached the karaf logs of all 3 members. |
| Comment by Samuel Schneider [ 18/May/23 ] |
|
I wasn't able to reproduce this issue on an ODL Silicon SR4 3-node cluster. We would need a reproducer test or consistent steps to reproduce this issue. After looking through the logs I noticed that all devices were connected by the same member-3; I could not reproduce this behavior. member-3 ended up being without a leader, so this can be related to the issue. Also, it seems the bierman02 endpoint was used, which was removed in later versions of ODL. You can try using the RFC-8040 endpoint instead.
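For illustration, a roughly equivalent mount request against the RFC-8040 endpoint (the /rests/data root in ODL) could look like the following sketch; host, credentials and node name are placeholders, not taken from this ticket:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Hedged sketch of the same mount against the RFC-8040 endpoint (/rests/data)
// instead of the removed bierman02 one. Host, credentials and node name are
// placeholders.
public class MountViaRfc8040 {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        String body = "{\"network-topology:node\": [{"
            + "\"node-id\": \"device-1\","
            + "\"netconf-node-topology:host\": \"10.0.0.1\","
            + "\"netconf-node-topology:port\": 830,"
            + "\"netconf-node-topology:username\": \"admin\","
            + "\"netconf-node-topology:password\": \"admin\","
            + "\"netconf-node-topology:tcp-only\": false}]}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "http://controller:8181/rests/data/network-topology:network-topology/"
                + "topology=topology-netconf/node=device-1"))
            .header("Content-Type", "application/json")
            .header("Authorization", "Basic " + auth)
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}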