[NETCONF-621] Un-mounting a netconf device does not clean up MD-SAL DOM mountpoints Created: 21/May/19  Updated: 22/Jan/24

Status: Confirmed
Project: netconf
Component/s: netconf
Affects Version/s: Sodium SR3, Aluminium SR1, 5.0.2
Fix Version/s: 7.0.0, 5.0.10, 6.0.7

Type: Bug Priority: Medium
Reporter: Prathap krishnamurthy Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: pick-next, pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File Full logs cluster(3 nodes).7z     Text File Logs from debugging cluster.TXT     File Node1_karaf.log     File Node2_karaf.log     File Node3_karaf.log     PNG File cluster breakpoints for logs.png     Text File device_add.log     Text File deviceunmount.log    
Issue Links:
Relates
relates to NETCONF-784 ReconnectPromise keep reconnecting af... Resolved

 Description   


Un-mounting a netconf device does not clean up MD-SAL DOM mountpoints

Topology:

ODL-Neon---------(junos netconf device)

 

Steps to reproduce:

  1. Mount the Juniper MX device to ODL with the payload below.

Mount URL (operation PUT):

http://<IP>:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/Man-2

XML payload:

<node xmlns="urn:TBD:params:xml:ns:yang:network-topology">
 <node-id>Man-2</node-id>
 <host xmlns="urn:opendaylight:netconf-node-topology"><device IP></host>
 <password xmlns="urn:opendaylight:netconf-node-topology">pwd</password>
 <username xmlns="urn:opendaylight:netconf-node-topology">user</username>
 <port xmlns="urn:opendaylight:netconf-node-topology">830</port>
 <tcp-only xmlns="urn:opendaylight:netconf-node-topology">false</tcp-only>
 <keepalive-delay xmlns="urn:opendaylight:netconf-node-topology">2</keepalive-delay>
 </node>

2. Observe that the device status changes to connected within a couple of minutes: GET http://<IP>:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/Man-2

3. Delete the device.

 

Operation DELETE:

http://<IP>:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/Man-2
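For reference, the whole mount/delete/remount cycle can be scripted against RESTCONF. The sketch below is illustrative only: the controller address, the admin:admin credentials and the man-2.xml file holding the payload from step 1 are assumptions, not part of this report.

// Minimal reproduction helper (sketch, Java 11+): PUT the node, later DELETE it,
// then repeat the PUT to trigger the "Mount point already exists" failure.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class MountCycle {
    // Assumed controller address and default credentials; adjust for your setup.
    private static final String NODE_URL =
        "http://127.0.0.1:8181/restconf/config/network-topology:network-topology"
            + "/topology/topology-netconf/node/Man-2";
    private static final String AUTH = "Basic "
        + Base64.getEncoder().encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: PUT the node payload (stored locally in the hypothetical man-2.xml).
        String payload = Files.readString(Path.of("man-2.xml"));
        HttpRequest put = HttpRequest.newBuilder(URI.create(NODE_URL))
            .header("Content-Type", "application/xml")
            .header("Authorization", AUTH)
            .PUT(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        System.out.println("PUT -> " + client.send(put, HttpResponse.BodyHandlers.ofString()).statusCode());

        // Step 2: poll the operational datastore until the node reports connected (omitted here).

        // Step 3: DELETE the node, then run the PUT again to observe the failure reported below.
        HttpRequest delete = HttpRequest.newBuilder(URI.create(NODE_URL))
            .header("Authorization", AUTH)
            .DELETE()
            .build();
        System.out.println("DELETE -> " + client.send(delete, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}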

 

Observe that ODL fails to clean up the mounted device properly.

If the device is remounted, it fails with the exception below:

2019-05-21T14:15:08,756 | ERROR | remote-connector-processing-executor-14 | NetconfDevice | 293 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0 | RemoteDevice{Man-1}: Initialization in sal failed, disconnecting from device
 java.lang.IllegalStateException: Mount point already exists
 at com.google.common.base.Preconditions.checkState(Preconditions.java:507) ~[32:com.google.guava:25.1.0.jre]
 at org.opendaylight.mdsal.dom.broker.DOMMountPointServiceImpl.createMountPoint(DOMMountPointServiceImpl.java:48) ~[253:org.opendaylight.mdsal.dom-broker:3.0.6]
 at Proxy0f2bd717_d90b_4019_b6ab_2db555b4c1be.createMountPoint(Unknown Source) ~[?:?]
 at Proxy86cb11c3_dc67_4420_95b9_7eb4b50e7d58.createMountPoint(Unknown Source) ~[?:?]
 at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalProvider$MountInstance.onTopologyDeviceConnected(NetconfDeviceSalProvider.java:130) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalProvider$MountInstance.onTopologyDeviceConnected(NetconfDeviceSalProvider.java:120) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.registerMasterMountPoint(MasterSalFacade.java:148) ~[?:?]
 at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.onDeviceConnected(MasterSalFacade.java:92) ~[?:?]
 at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.onDeviceConnected(MasterSalFacade.java:79) ~[?:?]
 at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.onDeviceConnected(MasterSalFacade.java:41) ~[?:?]
 at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade.onDeviceConnected(KeepaliveSalFacade.java:144) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade.onDeviceConnected(KeepaliveSalFacade.java:51) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice.handleSalInitializationSuccess(NetconfDevice.java:243) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice.access$500(NetconfDevice.java:79) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$SchemaSetup.setUpSchema(NetconfDevice.java:522) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$SchemaSetup.run(NetconfDevice.java:481) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
 at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) [32:com.google.guava:25.1.0.jre]
 at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57) [32:com.google.guava:25.1.0.jre]
 at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) [32:com.google.guava:25.1.0.jre]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
 at java.lang.Thread.run(Thread.java:748) [?:?]
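For context, the failing call is the duplicate-registration guard in DOMMountPointServiceImpl.createMountPoint(). The snippet below is a simplified illustration of that contract, not the actual mdsal code: whatever registration is produced when the device connects has to be closed when the device is unmounted, otherwise the next createMountPoint() for the same identifier trips the same checkState().

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import static com.google.common.base.Preconditions.checkState;

// Illustrative stand-in for the mount point registry (names are hypothetical).
final class MountPointRegistry {
    private final Map<String, AutoCloseable> mountPoints = new ConcurrentHashMap<>();

    AutoCloseable createMountPoint(String path) {
        // The guard that throws in the stack trace above: a second registration
        // for the same path is rejected.
        checkState(!mountPoints.containsKey(path), "Mount point already exists");
        AutoCloseable registration = () -> mountPoints.remove(path);
        mountPoints.put(path, registration);
        return registration;
    }
}

The report suggests that on device removal the registration created for the deleted device is not closed, so the stale entry blocks the remount.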

 



 Comments   
Comment by Eric Moore [ 07/Aug/19 ]

Has there been any progress on this issue, and are there any plans to backport it into the Oxygen release?

Comment by Sanjana Babu [ 27/Dec/19 ]

Prathap krishnamurthy, is this specific to the Juniper device? I have validated the scenario with Netopeer but I'm not able to reproduce it.
Please provide your inputs on this.

Comment by Ravi Pendurty [ 12/Oct/20 ]

metaljackL demskeq8 ... FYI...

rovarga, just wondering if you already know whether this is fixed.

We have the exact same problem in Sodium SR4.

2020-10-12T08:19:25,925 | ERROR | remote-connector-processing-executor-39 | AbstractFuture | 36 - com.google.guava - 27.1.0.jre | - | RuntimeException while executing runnable CallbackListener{org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$1@e3fcead} with executor MoreExecutors.directExecutor()
java.lang.IllegalStateException: Mount point already exists
at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[36:com.google.guava:27.1.0.jre]
at org.opendaylight.mdsal.dom.broker.DOMMountPointServiceImpl.createMountPoint(DOMMountPointServiceImpl.java:46) ~[305:org.opendaylight.mdsal.dom-broker:4.0.17]
at Proxy66a84a00_3d40_4cf5_b701_878f56fd23ae.createMountPoint(Unknown Source) ~[?:?]
at Proxye8cffb62_cb69_49c6_880b_190848679992.createMountPoint(Unknown Source) ~[?:?]
at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalProvider$MountInstance.onTopologyDeviceConnected(NetconfDeviceSalProvider.java:130) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalFacade.onDeviceConnected(NetconfDeviceSalFacade.java:84) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalFacade.onDeviceConnected(NetconfDeviceSalFacade.java:42) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade.onDeviceConnected(KeepaliveSalFacade.java:142) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade.onDeviceConnected(KeepaliveSalFacade.java:48) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice.handleSalInitializationSuccess(NetconfDevice.java:259) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice.access$000(NetconfDevice.java:87) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$1.onSuccess(NetconfDevice.java:182) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$1.onSuccess(NetconfDevice.java:179) ~[343:org.opendaylight.netconf.sal-netconf-connector:1.10.4]
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1076) ~[36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) [36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) [36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:789) [36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.setResult(AbstractTransformFuture.java:224) [36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.setResult(AbstractTransformFuture.java:202) [36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163) [36:com.google.guava:27.1.0.jre]
at com.google.common.util.concurrent.MoreExecutors$5$1.run(MoreExecutors.java:982) [36:com.google.guava:27.1.0.jre]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]

 

 

Comment by Robert Varga [ 12/Oct/20 ]

I am sorry, the Sodium release stream is no longer supported by the community.

Comment by Michael Dürre [ 12/Oct/20 ]

rovarga Sorry, but I have to make this clear: is it fixed in Magnesium? Aluminium? Or have you not seen this yet?

Comment by Robert Varga [ 13/Oct/20 ]

I don't know; someone will have to triage it to ascertain that. What I was implying is that even if it gets fixed, it will not get fixed on the Sodium stream.

Comment by Herbert Eiselt [ 13/Oct/20 ]

Thanks for the answer. Related ONAP Guilin (ODL Sodium) ticket: https://jira.onap.org/browse/SDNC-1380

We plan to do the same test with Aluminium.

Comment by Ravi Pendurty [ 19/Jan/21 ]

We tested this with Aluminium SR1 and observed the same error.

Comment by Robert Varga [ 13/Jul/21 ]

Preliminarily targeting 2.0.1 to keep this on at least some radar.

Comment by Tomas Cere [ 15/Jul/21 ]

I'm unable to reproduce this locally; mount -> unmount -> mount seems to be working for me with the testtool.

We will need debug logs for all of netconf (org.opendaylight.netconf) to proceed further. They also need to contain the previous iteration of the same device (basically, we need to see how the unregister went, with debug logging enabled).

It would also be helpful if you could check whether the same happens with the non-clustered version of the netconf-topology feature, so we can pinpoint which module the issue lies in.

If odl-netconf-topology looks fine, I would suggest using it in the meantime when running in a single-node scenario, while we get this figured out.

Comment by Robert Varga [ 16/Jul/21 ]

I think the problem here is lifecycle confusion between sal-netconf-connector and netconf-client regarding the reconnecting client. NETCONF-784 might solve something, but it is certainly not a complete solution, as we are violating the API contract set out for the returned Promise.

Comment by Ivan Hrasko [ 18/Jan/23 ]

The problem may be caused by the fact that ODL/netconf is slow to mount a Junos device (a device with a lot of models); it takes minutes to move from connecting to connected status. I expect there can be problems with synchronization of the mount point creation/deletion process which are not discoverable when testing with a device that has only a few models.

With my local cluster (3 nodes with 1 GiB of memory) I haven't even been able to connect such a device; my cluster nodes went down while connecting.

Comment by Peter Suna [ 09/Feb/23 ]

I tested this issue with three cluster nodes, each equipped with 1 GB of RAM. The issue still persists in the current master, NETCONF 5.0.2-SNAPSHOT. I have attached logs from all three cluster nodes [Node1_karaf.log, Node2_karaf.log, Node3_karaf.log]; the exception is only present on the elected leader, node 1.

To reproduce the issue, I followed these steps:
 1) Initialize the cluster with three nodes.
 2) Start the NETCONF test tool with the following command: 

java -jar netconf-testtool-5.0.2-SNAPSHOT-executable.jar --device-count 1 --debug false --md-sal true --schemas-dir PATH_TO_JUNIPER_MODELS --controller-ip 192.168.56.103 --ip 192.168.56.101 --generate-config-address 192.168.56.101 

 3) Verify the connected device.
 4) Remove the connected device.

 

Comment by Peter Suna [ 09/Feb/23 ]

With the steps described in this comment, I was able to encounter this error even though the netconf-testtool only had one model (ietf-inet-types@2013-07-15.yang) in the folder specified by the --schemas-dir parameter in step 2.

 

 

Comment by Robert Varga [ 01/Mar/23 ]

The logs are a bit weird, as the exception is logged after the dust has settled, so the timing is ... interesting.
I will add a better log message to the checkState(), but I think we also need to pepper some more debugs into NetconfNodeActor – note that executeInSelf() gets scheduled through an explicit message to self, and hence we may be acting on outdated state by that time.
There is some form of a guard, but it is not exactly clear that it is effective.
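A minimal sketch of what such an improved message could look like, assuming the registrations are kept in a map keyed by the mount point path; the class and field names here are illustrative, not the actual DOMMountPointServiceImpl internals.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import static com.google.common.base.Preconditions.checkState;

final class MountPointPaths {
    private final ConcurrentMap<String, Object> mountPoints = new ConcurrentHashMap<>();

    Object createMountPoint(String path) {
        Object mountPoint = new Object();
        // Name the offending path so the log identifies which device left a
        // stale registration behind.
        checkState(mountPoints.putIfAbsent(path, mountPoint) == null,
            "Mount point %s already exists", path);
        return mountPoint;
    }
}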

Comment by Ivan Hrasko [ 29/Mar/23 ]

The logs provided by Prathap krishnamurthy show that the mount point is created by registerMasterMountPoint (MasterSalFacade). On the other hand, the logs provided by Peter Suna show that here we call registerSlaveMountPoint (SlaveSalFacade).

There can be only one mount point per device. Creating it is the responsibility of the leader, not a follower. Followers create their follower mount points as an action on a message in NetconfNodeActor.

The fix could be to remove the call that creates the mount point from SlaveSalFacade.

Comment by Oleksandr Zharov [ 06/Apr/23 ]

After some testing, the results show that the proposed solution will not work; this issue will require deeper investigation of the Master ↔ Slave interactions. Removing the call that creates the mount point from SlaveSalFacade breaks the cluster. Since the problem occurs after disconnecting and connecting the device for a second time, it seems like something is wrong with the close() method of the SalFacade, and the problem might lie there.

Comment by Oleksandr Zharov [ 19/Apr/23 ]

Debugging on the cluster, with additional logs added via breakpoints, showed that the problem occurs when we send the DELETE request. I uploaded a log file with a little bit of formatting for easier reading. There you can see that for some reason one of the slaves (in my case; someone else got this error log on the leader node) started behaving strangely. The second time I deleted the device, it for some reason called MasterSalFacade.registerMasterMountPoint() on the slave node. After that point it stopped emitting any logs for the DELETE request, but the device was still deleted from the cluster. So the problem is there, in the delete. I am also attaching the same full logs from all nodes without formatting, in case someone needs them.

Comment by Oleksandr Zharov [ 20/Apr/23 ]

I am also adding a screenshot with the locations of the breakpoints I used for logging. They mostly log the time when we go through the create/delete process and some variables like the YangInstanceIdentifier. This was used to see the order of these steps and the timing between nodes.
