[NETCONF-470] Device access can fail shortly after cluster member is killed Created: 12/Sep/17  Updated: 31/Jan/22  Resolved: 31/Jan/22

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Kostiantyn Nosach
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File [NETCONF-470] Steps to reproduce.rtf    
External issue ID: 9148

 Description   

This manifests as Robot test failures, especially in the "all" suite, on both Carbon and Nitrogen.

For example, in failure [0] a POST fails with:
java.lang.IllegalStateException: Can't create ProxyReadTransaction
at org.opendaylight.netconf.topology.singleton.impl.ProxyDOMDataBroker.newReadWriteTransaction(ProxyDOMDataBroker.java:92)
...

The real cause is:
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at scala.concurrent.Await.result(package.scala)
at org.opendaylight.netconf.topology.singleton.impl.ProxyDOMDataBroker.newReadWriteTransaction(ProxyDOMDataBroker.java:84)
at org.opendaylight.netconf.sal.restconf.impl.BrokerFacade.commitConfigurationDataPost(BrokerFacade.java:475)
...

Looking at karaf.log [1], new leaders had not yet been elected at that time, so the akka ask is expected to fail.
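
The failure mode can be shown in miniature with plain JDK futures: a reply that never arrives, awaited with a bounded timeout. This is only an analogy to the blocking ask seen in the stack trace, not the netconf code; the class and method names below are hypothetical, and the 100 ms timeout stands in for the 5 seconds from the log.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration only; not the ProxyDOMDataBroker implementation. */
final class AskTimeoutSketch {

    /** Blocks for a reply for a bounded time, as a blocking ask would. */
    static String awaitReply(CompletableFuture<String> reply, long timeoutMillis) {
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Same shape as the reported error: the timeout becomes the cause.
            throw new IllegalStateException("Can't create ProxyReadTransaction", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for reply", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Reply failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // With no leader elected, nobody ever completes the reply.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        try {
            awaitReply(neverAnswered, 100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + " caused by "
                    + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Any request issued in the window between the member being killed and the new leader being elected ends this way, regardless of how healthy the device itself is.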

The data broker now supports a tell-based protocol, which is designed to work in such cases. Since netconf does not use the data broker, it should improve its own code to offer similar functionality, or at least document that access to mounted devices can fail intermittently during cluster HA events.

Robot tests can be relaxed (by waiting for new leaders to be elected) if the Netconf behavior is not going to be improved soon.
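
A minimal sketch of what such a relaxation could look like on the caller side, assuming the caller can simply repeat the request until a deadline passes. It is written in Java rather than Robot only for illustration; `retryUntil` and its parameters are hypothetical names, not part of netconf or the CSIT suites.

```java
import java.util.function.Supplier;

/** Hypothetical helper: repeat a request until it succeeds or a deadline passes. */
final class WaitForLeaderSketch {

    static <T> T retryUntil(Supplier<T> operation, long deadlineMillis, long pauseMillis) {
        final long deadline = System.nanoTime() + deadlineMillis * 1_000_000L;
        while (true) {
            try {
                return operation.get();
            } catch (IllegalStateException e) {
                // Give up once the deadline has passed; rethrow the last failure.
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

The deadline would have to be longer than the expected leader election time; failures that persist past it are still reported, so a real regression is not hidden.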

[0] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-all-carbon/399/log.html.gz#s1-s9-t7-k2-k1-k1-k4-k7-k1
[1] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-all-carbon/399/odl2_karaf.log.gz



 Comments   
Comment by Tomas Cere [ 12/Oct/17 ]

For now this is expected. Switching to a tell-based protocol would require rewriting the singleton again. However, we should be able to leverage the tell-based code from controller if we switched mountpoints to Shards that behave differently from their datastore counterparts: non-persistent, without replication, and without a backend that actually stores data, only one that forwards it to the device.
This is a pretty big task, but it should have the best payoff for clustered netconf, and we would benefit from any future improvements to the tell-based protocol.
