[NETCONF-961] CallHomeSessionContext session cannot be closed, so device could not make a new call-home connection Created: 09/Feb/23  Updated: 21/Jun/23  Resolved: 19/Apr/23

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: High
Reporter: Serkan Yıldırım Assignee: Ivan Hrasko
Resolution: Won't Do Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:
  • Lighty.io version -> 15.1.0
  • Lighty.io 15.1.0 uses ODL netconf 2.0.9
  • Java 11

Attachments: File reproduced-2.0.9.karaf.log    
Issue Links:
Duplicate
duplicates NETCONF-681 Netconf Callhome SSH: drop connections Resolved
Relates
relates to NETCONF-989 Reconnection failure after deleting a... In Review
Priority: High

 Description   

Hi,

We are using the lighty.io 15.1.0 version in our application. We have devices that uses netopeer2server for making call-home connection to the application. Sometimes, device connects and disconnects repeatedly when we start loading network traffic on device. At this step, after a time, opendaylight starts to log as below:

2023-02-01 06:47:19.588 ERROR 1 .m.CallHomeMountSessionManager : Server Host Key/Certificate RSA Public Key [30:4e:62:da:08:41:35:80:85:7a:4b:86:3e:b6:ff:54:fe:07:f7:80],[56:66:d1:a4]
        modulus: bc625370b9fbdc38589892c0aba9c9ff1503662e6ea13194f847622d2c342a5979d0838be8dbfc9b08c8f8995257e6b77403a87f3c696ba6225a6cf03854c0d41bffa8346a054813931d6d518b1493fd0c12df92f8b3a13dc58a05ed6faaa262373fe8051162838f79c65c1d79d968528e1e3cb0b6a38168129dd6c86814472d991fbbe4b33ab182e825e67557a5f7b1165c672d1ef4ec3f6ac9d5a2615e076c32bd0c241f9a7414ab3920d9e11eb620489a983381cdf38b1875208a33a17302739166e0ac1119083221a3b292324eea92d88b74f27fc8eb342fbbea604d1011eeeaba525344269e7fb3f68e8092e45b3692a9e75b8618176de69ea0d8816cf3
public exponent: 10001
 is associated with existing session CallHomeMountSessionContext{address=/172.23.0.5:41032, hostKey=RSA Public Key [30:4e:62:da:08:41:35:80:85:7a:4b:86:3e:b6:ff:54:fe:07:f7:80],[56:66:d1:a4]
        modulus: bc625370b9fbdc38589892c0aba9c9ff1503662e6ea13194f847622d2c342a5979d0838be8dbfc9b08c8f8995257e6b77403a87f3c696ba6225a6cf03854c0d41bffa8346a054813931d6d518b1493fd0c12df92f8b3a13dc58a05ed6faaa262373fe8051162838f79c65c1d79d968528e1e3cb0b6a38168129dd6c86814472d991fbbe4b33ab182e825e67557a5f7b1165c672d1ef4ec3f6ac9d5a2615e076c32bd0c241f9a7414ab3920d9e11eb620489a983381cdf38b1875208a33a17302739166e0ac1119083221a3b292324eea92d88b74f27fc8eb342fbbea604d1011eeeaba525344269e7fb3f68e8092e45b3692a9e75b8618176de69ea0d8816cf3
public exponent: 10001
}, closing session org.opendaylight.netconf.callhome.protocol.CallHomeSessionContext@2a79b789 

Log says that there is an existing session with the public key, so I guess it is rejecting and closing the incoming session.

After we start to see this log, device cannot makes a call-home connection anymore. Even restarting netopeer2server or rebooting device do not change the situation. Only solution is restarting our application. But that is not wanted in production enviroment.

Log is printed by CallHomeMountSessionManager class. When I checked the code, I saw that deviceContext is kept in the below ConcurrentMap by device's SSH public key:

private final ConcurrentMap<PublicKey, CallHomeMountSessionContext> contextByPublicKey = new ConcurrentHashMap<>(); 

I think, the entry related with public key stays in the map and cannot be removed somehow. Maybe, this causes the problem.

Note: We changed ODL core property: IDLE_TIMEOUT to 2 minutes.

 



 Comments   
Comment by Ivan Hrasko [ 16/Feb/23 ]

The logs look very similar to NETCONF-942. Version 2.0.9 however is not affected by that bug.

Comment by Ivan Hrasko [ 16/Feb/23 ]

Steps to reproduce with version 2.0.9.

  • Setup netopeer2 call-home device
  • Add netopeer2 to allowed call-home devices == device is connected successfully
  • Delete all allowed devices
  • Add netopeer2 to allowed call-home devices again == device is connected but failed in call-home configuration
  • After 10 minutes there will be reconnect, block execution by debugger in CallHomeMountSessionManager#createSession
  • Session fails to establish and we get logs as described in bug report

Even after more than 1 hour the /rests/data/odl-netconf-callhome-server:netconf-callhome-server shows FAILED status and node in netconf topology is connecting.

Comment by Serkan Yıldırım [ 18/Feb/23 ]

Hi ivanhrasko ,

We solved this by adding the lines below to the createSession method in 
CallHomeMountSessionManager class.
 

CallHomeMountSessionContext createSession(final CallHomeProtocolSessionContext session,
        final CallHomeChannelActivator activator, final CloseCallback onCloseHandler) {
    final CallHomeMountSessionContext deviceContext = new CallHomeMountSessionContext(session.getSessionId(),
        session, activator, devCtxt -> onClosed(devCtxt, onCloseHandler));

    final PublicKey remoteKey = session.getRemoteServerKey();
    final CallHomeMountSessionContext existing = contextByPublicKey.putIfAbsent(remoteKey, deviceContext);
    if (existing != null) {
        // Check if the sshkey or certificate of the incoming netconf server is present. If present return null,
        // else store the session. The sshkey/certificate is the uniqueness of the callhome sessions not the
        // uniqueid/devicename
        LOG.error("Server Host Key/Certificate {} is associated with existing session {}, closing session {}",
            remoteKey, existing, session);
        session.terminate();
        // Added code - start
        existing.getProtocol().terminate();
        contextByPublicKey.remove(remoteKey);
        // Added code - finish
        return null;
    }

    final SocketAddress remoteAddress = session.getRemoteAddress();
    final CallHomeMountSessionContext prev = contextByAddress.put(remoteAddress, deviceContext);
    if (prev != null) {
        LOG.warn("Remote {} replaced context {} with {}", remoteAddress, prev, deviceContext);
    }
    return deviceContext;
} 
Comment by Ivan Hrasko [ 20/Feb/23 ]

Thanks, I will check it. Anyway I thing that this problem has been solved in NETCONF-681. Thus in recent netconf/lighty.io versions the problem should be fixed.

This needs to be verified. I have not tried to reproduce the issue on current releases yet.

Comment by Ivan Hrasko [ 20/Feb/23 ]

Are you able to make unit test for this? So we can test your changes in 2.0.x and current 5.x?

Comment by Serkan Yıldırım [ 27/Feb/23 ]

Sorry for late response Ivan. I could not see a unit test for class 
CallHomeMountSessionManager class. Therefore, I did not write it from scratch. Is there a doc for implementing new unit tests?

Comment by Yaroslav Lastivka [ 05/Apr/23 ]

Testing 2.0.x with patch from NETCONF-681

  • Setup netopeer2 call-home device
  • Add netopeer2 to allowed call-home devices == device is connected successfully
  • Delete all allowed devices
  • Add netopeer2 to allowed call-home devices again == device is connected but failed in call-home configuration
    After that action we don't see successful reconnection:
12:46:50.047 INFO [opendaylight-cluster-data-notification-dispatcher-35] Deleting the entry for callhome device InstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.netconf.callhome.server.rev201015.netconf.callhome.server.AllowedDevices, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.netconf.callhome.server.rev201015.NetconfCallhomeServer, org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.netconf.callhome.server.rev201015.netconf.callhome.server.AllowedDevices]}
13:46:10.284 INFO [nioEventLoopGroup-4-3] requestNewKeysExchange(ClientSessionImpl[netconf@/172.17.0.2:42848]) Initiating key re-exchange
14:46:12.928 INFO [nioEventLoopGroup-4-3] requestNewKeysExchange(ClientSessionImpl[netconf@/172.17.0.2:42848]) Initiating key re-exchange
15:46:15.220 INFO [nioEventLoopGroup-4-3] requestNewKeysExchange(ClientSessionImpl[netconf@/172.17.0.2:42848]) Initiating key re-exchange 

After more than 3 hours the /rests/data/odl-netconf-callhome-server:netconf-callhome-server shows DISCONNECTED status although the node in netconf topology is connected.

Comment by Ivan Hrasko [ 12/Apr/23 ]

We expect device connects again, please see new issue here: NETCONF-989.

Comment by Ivan Hrasko [ 19/Apr/23 ]

By cherry-picking of https://git.opendaylight.org/gerrit/c/netconf/+/105212 we have verified that NETCONF-681 has fixed this issue. Unfortunately netconf 2.x is no more maintained and we recommend users to upgrade.

During investigating this issue we have found out new one NETCONF-989 which is present also on current master branch. It will be addressed separately.

Comment by Ivan Hrasko [ 19/Apr/23 ]

The netconf 2.x branch is no more maintained. Current releases are not affected by this issue.

Generated at Wed Feb 07 20:16:20 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.