[NETCONF-880] Netconf does not close device mountpoint properly Created: 27/May/22  Updated: 31/May/22  Resolved: 31/May/22

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: 2.0.14
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Igor Foltin Assignee: Robert Varga
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File netconf-logs-after-device-reconnect-rpc-call.txt     Text File netconf-rpc-transport-error.txt     Text File netconf-session-promise-complete-already-error.txt     Text File reproducer-debug-logs.txt    
Issue Links:
Duplicate
duplicates NETCONF-827 Mount loop when setting too low conne... Resolved

 Description   

We have an ODL-based app which manages netconf devices in a DC.
Recently, we have been experiencing netconf connectivity/sync issues between the ODL app and netconf devices in some of our prod environments.
Here is what has been happening:

1. We start the ODL app which connects to devices and fetches data from them.
    All is good, devices are connected properly and are in sync (data was fetched correctly).
2. After a short time, ODL loses connection to some devices.
3. ODL netconf attempts automatic reconnect and the device gets reconnected.
4. ODL app tries to load data from device again using a netconf get RPC call.
5. ODL netconf (not the device) returns a "transport error" and the ODL app now marks the device with a sync-failed status. See the attachment "netconf-rpc-transport-error.txt".
6. User then calls ODL app RPC for reconnecting the device.
   This RPC first deletes the node from ODL netconf-topology and then recreates it.
   We added a 5-second delay (for investigation purposes only) between the topology node deletion and recreation, to be sure ODL netconf has enough time to unmount and mount the device again.
   We also added a check (using DOMMountPointService) whether the ODL netconf mountpoint still exists, performed after the topology node was deleted and before it is recreated.
7. ODL app asks ODL netconf to close the netconf session.
   Then we see the ODL netconf logs as they are in the attachment "netconf-logs-after-device-reconnect-rpc-call.txt".
8. ODL app asks ODL netconf to start the netconf session again, but it seems that the previous session is stuck.
   Our check whether the mountpoint still exists throws an exception, because the mountpoint does still exist even though it should not at this point.
   The customer's devops team also reported many hanging netconf sessions between the device and ODL.
9. Afterwards we see many "netconf session established" log entries followed by "netconf session promise complete already" errors at short intervals (a few seconds).
   See the logs in the attachment "netconf-session-promise-complete-already-error.txt".
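For reference, the mountpoint check in step 6 can be sketched as a polling loop with a deadline. This is only an illustration of the approach, not the app's actual code: the class and method names here are made up, and the real check would call DOMMountPointService.getMountPoint(path) and test whether the returned Optional is present.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public final class MountpointAwait {
    // Polls until the mountpoint check reports "gone" or the deadline passes.
    static boolean awaitRemoved(BooleanSupplier mountpointExists, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (mountpointExists.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // still mounted: recreating the node now would race the old session
            }
            Thread.sleep(100);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for DOMMountPointService.getMountPoint(path).isPresent();
        // here the mountpoint is already gone, so the wait succeeds immediately.
        boolean removed = awaitRemoved(() -> false, Duration.ofSeconds(5));
        System.out.println("mountpoint removed: " + removed);
    }
}
```

In the failure described above, such a wait never succeeds: the mountpoint stays registered because the old session is stuck.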



 Comments   
Comment by Igor Foltin [ 30/May/22 ]

I added more logs:
reproducer-debug-logs.txt

They capture the scenario in which a user starts the device reconnection via the ODL app RPC (mountpoint removal and recreation).

The device name is: DCS:DCS:DCS-N9K-LEAF04
I tried to remove all parts of the logs related to other devices present in the DC, so hopefully there are no irrelevant entries left.

Comment by Ivan Hrasko [ 31/May/22 ]

Are you able to reproduce with netconf 3?

Comment by Robert Varga [ 31/May/22 ]

So if I am reading this right, the session goes up after the node has already been deleted:

[DEBUG] 2022-05-30 09:27:50.266 [opendaylight-cluster-data-notification-dispatcher-44] AbstractNetconfTopology - Disconnecting RemoteDevice{DCS:DCS:DCS-N9K-LEAF04}
[TRACE] 2022-05-30 09:27:50.267 [opendaylight-cluster-data-notification-dispatcher-44] NetconfDeviceSalProvider - RemoteDevice{DCS:DCS:DCS-N9K-LEAF04}: Not removing TOPOLOGY mountpoint from MD-SAL, mountpoint was not registered yet
[...]
[TRACE] 2022-05-30 09:27:50.269 [opendaylight-cluster-data-notification-dispatcher-44] NetconfDeviceSalProvider - RemoteDevice{DCS:DCS:DCS-N9K-LEAF04}: TransactionChain(org.opendaylight.mdsal.binding.dom.adapter.BindingDOMTransactionChainAdapter@7b6c02b7) SUCCESSFUL
[DEBUG] 2022-05-30 09:27:51.714 [nioEventLoopGroup-3-1] NetconfDeviceCommunicator - RemoteDevice{DCS:DCS:DCS-N9K-LEAF04}: Session established

I think the problem lies somewhere between NetconfSessionPromise's retry logic and AsyncSshHandler. Most notably, the former does not check isFailure() and the latter calls setFailure() explicitly. The setFailure() call is problematic because it renders Future.cancel() inoperative, and hence reconnection attempts are not stopped when the node is deleted.
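Netty promises follow the same contract as java.util.concurrent futures here: once a future has completed (even exceptionally via setFailure()), cancel() is a no-op that returns false, so cancellation-driven cleanup never runs. A minimal stdlib sketch of that contract, with CompletableFuture standing in for NetconfSessionPromise (the names here are illustrative only):

```java
import java.util.concurrent.CompletableFuture;

public final class FailedFutureCancel {
    public static void main(String[] args) {
        CompletableFuture<Void> sessionFuture = new CompletableFuture<>();

        // Analogue of AsyncSshHandler calling setFailure() on the session promise.
        sessionFuture.completeExceptionally(new RuntimeException("connection refused"));

        // Analogue of node deletion trying to cancel the pending connection:
        // cancel() cannot transition an already-failed future, so nothing is stopped.
        boolean cancelled = sessionFuture.cancel(true);
        System.out.println("cancel() after failure: " + cancelled);          // false
        System.out.println("isCancelled(): " + sessionFuture.isCancelled()); // false
    }
}
```

If the retry logic keys its shutdown on the future being cancelled, a failed future silently defeats it, which would explain the reconnect attempts continuing after the node is deleted.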

We need debug logs for netconf-netty-util including initial connection setup to ascertain what went down.

Comment by Robert Varga [ 31/May/22 ]

The recorded failure in NetconfSessionPromise is the same as the one reported in NETCONF-827, hence I believe these are duplicates.

Generated at Wed Feb 07 20:16:08 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.