[NETCONF-621] Un-mounting a netconf device does not clean up MD-SAL DOM mountpoints Created: 21/May/19 Updated: 22/Jan/24 |
|
| Status: | Confirmed |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | Sodium SR3, Aluminium SR1, 5.0.2 |
| Fix Version/s: | 7.0.0, 5.0.10, 6.0.7 |
| Type: | Bug | Priority: | Medium |
| Reporter: | Prathap krishnamurthy | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | pick-next, pt |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | device_add.log |
| Issue Links: |
| Description |
|
Un-mounting a netconf device does not clean up MD-SAL DOM mountpoints (see the attached device_add.log).

Topology: ODL-Neon ---------- (junos netconf device)

Steps to reproduce:

1. Mount the device (operation PUT) with this XML payload:

   <node xmlns="urn:TBD:params:xml:ns:yang:network-topology">
     <node-id>Man-2</node-id>
     <host xmlns="urn:opendaylight:netconf-node-topology"><device IP></host>
     <password xmlns="urn:opendaylight:netconf-node-topology">pwd</password>
     <username xmlns="urn:opendaylight:netconf-node-topology">user</username>
     <port xmlns="urn:opendaylight:netconf-node-topology">830</port>
     <tcp-only xmlns="urn:opendaylight:netconf-node-topology">false</tcp-only>
     <keepalive-delay xmlns="urn:opendaylight:netconf-node-topology">2</keepalive-delay>
   </node>

2. Observe that the device status changes to connected within a couple of minutes:

   GET http://<IP>:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/Man-2

3. Delete the device (operation DELETE).

Observe that ODL fails to clean up the mounted device properly. If the device is mounted again, it complains as below:

2019-05-21T14:15:08,756 | ERROR | remote-connector-processing-executor-14 | NetconfDevice | 293 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0 | RemoteDevice{Man-1}: Initialization in sal failed, disconnecting from device
java.lang.IllegalStateException: Mount point already exists
at com.google.common.base.Preconditions.checkState(Preconditions.java:507) ~[32:com.google.guava:25.1.0.jre]
at org.opendaylight.mdsal.dom.broker.DOMMountPointServiceImpl.createMountPoint(DOMMountPointServiceImpl.java:48) ~[253:org.opendaylight.mdsal.dom-broker:3.0.6]
at Proxy0f2bd717_d90b_4019_b6ab_2db555b4c1be.createMountPoint(Unknown Source) ~[?:?]
at Proxy86cb11c3_dc67_4420_95b9_7eb4b50e7d58.createMountPoint(Unknown Source) ~[?:?]
at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalProvider$MountInstance.onTopologyDeviceConnected(NetconfDeviceSalProvider.java:130) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.sal.connect.netconf.sal.NetconfDeviceSalProvider$MountInstance.onTopologyDeviceConnected(NetconfDeviceSalProvider.java:120) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.registerMasterMountPoint(MasterSalFacade.java:148) ~[?:?]
at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.onDeviceConnected(MasterSalFacade.java:92) ~[?:?]
at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.onDeviceConnected(MasterSalFacade.java:79) ~[?:?]
at org.opendaylight.netconf.topology.singleton.impl.MasterSalFacade.onDeviceConnected(MasterSalFacade.java:41) ~[?:?]
at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade.onDeviceConnected(KeepaliveSalFacade.java:144) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade.onDeviceConnected(KeepaliveSalFacade.java:51) ~[293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice.handleSalInitializationSuccess(NetconfDevice.java:243) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice.access$500(NetconfDevice.java:79) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$SchemaSetup.setUpSchema(NetconfDevice.java:522) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$SchemaSetup.run(NetconfDevice.java:481) [293:org.opendaylight.netconf.sal-netconf-connector:1.9.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) [32:com.google.guava:25.1.0.jre]
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57) [32:com.google.guava:25.1.0.jre]
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) [32:com.google.guava:25.1.0.jre]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
at java.lang.Thread.run(Thread.java:748) [?:?]
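For context, a minimal sketch of the registration lifecycle involved, assuming the mdsal API shapes visible in the stack trace (DOMMountPointService.createMountPoint() returning a builder whose register() yields a closeable registration); the class itself is hypothetical, not the actual NetconfDeviceSalProvider code:

    // Hypothetical holder, sketching how a leaked registration reproduces
    // the "Mount point already exists" failure above.
    import org.opendaylight.mdsal.dom.api.DOMMountPointService;
    import org.opendaylight.yangtools.concepts.ObjectRegistration;
    import org.opendaylight.yangtools.yang.data.api.YangInstanceIdentifier;

    final class DeviceMountLifecycle {
        private final DOMMountPointService mountService;
        private ObjectRegistration<?> registration;

        DeviceMountLifecycle(final DOMMountPointService mountService) {
            this.mountService = mountService;
        }

        void onConnected(final YangInstanceIdentifier path) {
            // DOMMountPointServiceImpl.createMountPoint() checkState()s that no
            // registration exists for the path, hence the IllegalStateException
            // when a previous one was never closed.
            registration = mountService.createMountPoint(path).register();
        }

        void onUnmounted() {
            // The un-mount path must always reach this close(); if it is
            // skipped or raced, the next mount of the same device fails.
            if (registration != null) {
                registration.close();
                registration = null;
            }
        }
    }

The delete/un-mount handling therefore has to guarantee the close() runs and completes before any re-mount of the same node-id is processed.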
|
| Comments |
| Comment by Eric Moore [ 07/Aug/19 ] |
|
Has there been any progress on this issue, and are there any plans to backport it into the Oxygen release? |
| Comment by Sanjana Babu [ 27/Dec/19 ] |
|
Prathap krishnamurthy, is this specific to a Juniper device? I have validated the scenario with Netopeer but I'm not able to reproduce it. |
| Comment by Ravi Pendurty [ 12/Oct/20 ] |
|
metaljackL demskeq8 ... FYI... rovarga, just wondering if you already know whether this is fixed. We have the exact same problem in Sodium SR4.

2020-10-12T08:19:25,925 | ERROR | remote-connector-processing-executor-39 | AbstractFuture | 36 - com.google.guava - 27.1.0.jre | - | RuntimeException while executing runnable CallbackListener{org.opendaylight.netconf.sal.connect.netconf.NetconfDevice$1@e3fcead} with executor MoreExecutors.directExecutor()
|
| Comment by Robert Varga [ 12/Oct/20 ] |
|
I am sorry, the Sodium release stream is no longer supported by the community. |
| Comment by Michael Dürre [ 12/Oct/20 ] |
|
rovarga Sorry, but I have to make this clear: is it fixed in Magnesium? In Aluminium? Or have you not seen this yet? |
| Comment by Robert Varga [ 13/Oct/20 ] |
|
I don't know; it will have to be triaged by someone to ascertain that. What I was implying is that even if it gets fixed, it will not be fixed on the Sodium stream. |
| Comment by Herbert Eiselt [ 13/Oct/20 ] |
|
Thanks for the answer. Related ONAP Guilin (ODL Sodium) ticket: https://jira.onap.org/browse/SDNC-1380. We plan to do the same test with Aluminium. |
| Comment by Ravi Pendurty [ 19/Jan/21 ] |
|
We tested this with Aluminium SR1 and observed the same error. |
| Comment by Robert Varga [ 13/Jul/21 ] |
|
Preliminarily targeting 2.0.1 to keep this on at least some radar. |
| Comment by Tomas Cere [ 15/Jul/21 ] |
|
I'm unable to reproduce this locally; mount -> unmount -> mount seems to be working for me with testtool. We will need debug logs for the entire netconf subsystem (org.opendaylight.netconf) to proceed further. They also need to contain the previous iteration of the same device (basically, we need to see how the unregister went, with debug logs). It would also be helpful if you could check whether the same happens with the non-clustered version of the netconf-topology feature, so we can pinpoint which module the issue lies in. If odl-netconf-topology looks fine, I would suggest using it in the meantime when running in a single-node scenario, while we get this figured out. |
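For anyone gathering the requested data: the debug logging can be enabled from the Karaf console with the standard log commands (plain Karaf facilities, nothing specific to this issue), e.g.:

    log:set DEBUG org.opendaylight.netconf

The output then lands in data/log/karaf.log by default, and log:set DEFAULT org.opendaylight.netconf reverts the level afterwards.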
| Comment by Robert Varga [ 16/Jul/21 ] |
|
I think the problem here is lifecycle confusion between sal-netconf-connector and netconf-client regarding the reconnecting client. |
| Comment by Ivan Hrasko [ 18/Jan/23 ] |
|
The problems can be caused by the fact that ODL/netconf is slow to mount a Junos device (a device with a lot of models): it takes minutes to move from the connecting to the connected status. I expect that there can be problems with synchronization of the mount-point creation/deletion process which are not discoverable when testing with a device that has only a few models. With my local cluster (3 nodes with 1 GiB memory each) I haven't even been able to connect such a device: my cluster nodes went down while connecting. |
| Comment by Peter Suna [ 09/Feb/23 ] |
|
I tested this issue with three cluster nodes, each equipped with 1 GB of RAM. The issue still persists in the current master, NETCONF 5.0.2-SNAPSHOT. I have attached logs from all three cluster nodes [Node1_karaf.log, Node2_karaf.log, Node3_karaf.log]; the exception is only present on the elected leader, node 1. To reproduce the issue, I followed these steps:

java -jar netconf-testtool-5.0.2-SNAPSHOT-executable.jar --device-count 1 --debug false --md-sal true --schemas-dir PATH_TO_JUNIPER_MODELS --controller-ip 192.168.56.103 --ip 192.168.56.101 --generate-config-address 192.168.56.101

3) Verify the connected device.
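For step 3), the check is presumably the operational-datastore read already shown in the issue description (URL pattern copied from there; the host is the controller IP used in the command above, and the node-id depends on the generated configuration):

    GET http://192.168.56.103:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/<node-id>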
|
| Comment by Peter Suna [ 09/Feb/23 ] |
|
With the steps described in this comment, I was able to encounter this error even though the netconf-testtool had only one model (ietf-inet-types@2013-07-15.yang) in the folder specified by the --schemas-dir path parameter in step 2).
|
| Comment by Robert Varga [ 01/Mar/23 ] |
|
The logs are a bit weird, as the exception is logged after the dust has settled, so the timing is ... interesting. |
| Comment by Ivan Hrasko [ 29/Mar/23 ] |
|
The logs provided by Prathapkrishnamurthy show that the mount point is going to be created by registerMasterMountPoint (MasterSalFacade). On the other hand, the logs provided by PeterSuna show that we call registerSlaveMountPoint (SlaveSalFacade) there. There can be only one mount point for one device, and creating it is the responsibility of the leader, not a follower. Followers create their own mount points as an action on a message in NetconfNodeActor. The fix could be to remove the call that creates the mount point from SlaveSalFacade. |
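To make the intended invariant concrete, a hedged sketch (all names below are hypothetical; this is not the actual MasterSalFacade/SlaveSalFacade code): a node's mount-point service admits exactly one registration per path, mirroring the Preconditions.checkState() in DOMMountPointServiceImpl.createMountPoint() from the stack trace, so exactly one component per node may own that registration.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical registry demonstrating the one-owner-per-path rule.
    final class MountPointRegistry {
        private final ConcurrentMap<String, AutoCloseable> mounts = new ConcurrentHashMap<>();

        void create(final String path, final AutoCloseable registration) {
            // A second create() for the same path without an intervening
            // destroy() reproduces "Mount point already exists".
            if (mounts.putIfAbsent(path, registration) != null) {
                throw new IllegalStateException("Mount point already exists");
            }
        }

        void destroy(final String path) throws Exception {
            final AutoCloseable registration = mounts.remove(path);
            if (registration != null) {
                registration.close();
            }
        }
    }

If the master-side registration (leader) and the slave-side registration (reacting to NetconfNodeActor messages) ever both run on the same node for the same device, the second create() throws exactly this exception.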
| Comment by Oleksandr Zharov [ 06/Apr/23 ] |
|
After some testing, the results show that the proposed solution will not work; this issue will require a deeper investigation of the Master ↔ Slave interactions. Removing the call that creates the mount point from SlaveSalFacade breaks the cluster. Since the problem occurs after disconnecting and connecting a device for the second time, it seems something is wrong with the close() method of the SalFacade, and the problem might lie there. |
| Comment by Oleksandr Zharov [ 19/Apr/23 ] |
|
Debugging on a cluster, with additional logs added via breakpoints, showed that the problem occurs when we send the DELETE request. I uploaded a log file with a little bit of formatting for easier understanding. There you can see that for some reason one of the slaves started behaving strangely (in my case; someone else got this error log on the leader node). The second time I deleted the device, it for some reason called MasterSalFacade.registerMasterMountPoint() on the slave node. After this point it stopped emitting any logs for the DELETE request, but the device was still deleted from the cluster. So the problem is there, in the delete path. I am also attaching the same full logs from all nodes, without formatting, in case someone needs them. |
| Comment by Oleksandr Zharov [ 20/Apr/23 ] |
|
Also adding a screenshot with the locations of the breakpoints that I used for logging. They mostly log the time as we go through the create/delete process, and some variables such as the YangInstanceIdentifier. This was used to see the order of these steps and the timing between the nodes. |