[NETCONF-961] CallHomeSessionContext session cannot be closed, so device could not make a new call-home connection Created: 09/Feb/23 Updated: 21/Jun/23 Resolved: 19/Apr/23 |
|
| Status: | Resolved |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | High |
| Reporter: | Serkan Yıldırım | Assignee: | Ivan Hrasko |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | pt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
|
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Priority: | High | ||||||||||||||||
| Description |
|
Hi, We are using the lighty.io 15.1.0 version in our application. We have devices that uses netopeer2server for making call-home connection to the application. Sometimes, device connects and disconnects repeatedly when we start loading network traffic on device. At this step, after a time, opendaylight starts to log as below:
2023-02-01 06:47:19.588 ERROR 1 .m.CallHomeMountSessionManager : Server Host Key/Certificate RSA Public Key [30:4e:62:da:08:41:35:80:85:7a:4b:86:3e:b6:ff:54:fe:07:f7:80],[56:66:d1:a4]
modulus: bc625370b9fbdc38589892c0aba9c9ff1503662e6ea13194f847622d2c342a5979d0838be8dbfc9b08c8f8995257e6b77403a87f3c696ba6225a6cf03854c0d41bffa8346a054813931d6d518b1493fd0c12df92f8b3a13dc58a05ed6faaa262373fe8051162838f79c65c1d79d968528e1e3cb0b6a38168129dd6c86814472d991fbbe4b33ab182e825e67557a5f7b1165c672d1ef4ec3f6ac9d5a2615e076c32bd0c241f9a7414ab3920d9e11eb620489a983381cdf38b1875208a33a17302739166e0ac1119083221a3b292324eea92d88b74f27fc8eb342fbbea604d1011eeeaba525344269e7fb3f68e8092e45b3692a9e75b8618176de69ea0d8816cf3
public exponent: 10001
is associated with existing session CallHomeMountSessionContext{address=/172.23.0.5:41032, hostKey=RSA Public Key [30:4e:62:da:08:41:35:80:85:7a:4b:86:3e:b6:ff:54:fe:07:f7:80],[56:66:d1:a4]
modulus: bc625370b9fbdc38589892c0aba9c9ff1503662e6ea13194f847622d2c342a5979d0838be8dbfc9b08c8f8995257e6b77403a87f3c696ba6225a6cf03854c0d41bffa8346a054813931d6d518b1493fd0c12df92f8b3a13dc58a05ed6faaa262373fe8051162838f79c65c1d79d968528e1e3cb0b6a38168129dd6c86814472d991fbbe4b33ab182e825e67557a5f7b1165c672d1ef4ec3f6ac9d5a2615e076c32bd0c241f9a7414ab3920d9e11eb620489a983381cdf38b1875208a33a17302739166e0ac1119083221a3b292324eea92d88b74f27fc8eb342fbbea604d1011eeeaba525344269e7fb3f68e8092e45b3692a9e75b8618176de69ea0d8816cf3
public exponent: 10001
}, closing session org.opendaylight.netconf.callhome.protocol.CallHomeSessionContext@2a79b789
Log says that there is an existing session with the public key, so I guess it is rejecting and closing the incoming session. After we start to see this log, device cannot makes a call-home connection anymore. Even restarting netopeer2server or rebooting device do not change the situation. Only solution is restarting our application. But that is not wanted in production enviroment. Log is printed by CallHomeMountSessionManager class. When I checked the code, I saw that deviceContext is kept in the below ConcurrentMap by device's SSH public key: private final ConcurrentMap<PublicKey, CallHomeMountSessionContext> contextByPublicKey = new ConcurrentHashMap<>(); I think, the entry related with public key stays in the map and cannot be removed somehow. Maybe, this causes the problem. Note: We changed ODL core property: IDLE_TIMEOUT to 2 minutes.
|
| Comments |
| Comment by Ivan Hrasko [ 16/Feb/23 ] |
|
The logs look very similar to |
| Comment by Ivan Hrasko [ 16/Feb/23 ] |
|
Steps to reproduce with version 2.0.9.
Even after more than 1 hour the /rests/data/odl-netconf-callhome-server:netconf-callhome-server shows FAILED status and node in netconf topology is connecting. |
| Comment by Serkan Yıldırım [ 18/Feb/23 ] |
|
Hi ivanhrasko , We solved this by adding the lines below to the createSession method in CallHomeMountSessionContext createSession(final CallHomeProtocolSessionContext session, final CallHomeChannelActivator activator, final CloseCallback onCloseHandler) { final CallHomeMountSessionContext deviceContext = new CallHomeMountSessionContext(session.getSessionId(), session, activator, devCtxt -> onClosed(devCtxt, onCloseHandler)); final PublicKey remoteKey = session.getRemoteServerKey(); final CallHomeMountSessionContext existing = contextByPublicKey.putIfAbsent(remoteKey, deviceContext); if (existing != null) { // Check if the sshkey or certificate of the incoming netconf server is present. If present return null, // else store the session. The sshkey/certificate is the uniqueness of the callhome sessions not the // uniqueid/devicename LOG.error("Server Host Key/Certificate {} is associated with existing session {}, closing session {}", remoteKey, existing, session); session.terminate(); // Added code - start existing.getProtocol().terminate(); contextByPublicKey.remove(remoteKey); // Added code - finish return null; } final SocketAddress remoteAddress = session.getRemoteAddress(); final CallHomeMountSessionContext prev = contextByAddress.put(remoteAddress, deviceContext); if (prev != null) { LOG.warn("Remote {} replaced context {} with {}", remoteAddress, prev, deviceContext); } return deviceContext; } |
| Comment by Ivan Hrasko [ 20/Feb/23 ] |
|
Thanks, I will check it. Anyway I thing that this problem has been solved in This needs to be verified. I have not tried to reproduce the issue on current releases yet. |
| Comment by Ivan Hrasko [ 20/Feb/23 ] |
|
Are you able to make unit test for this? So we can test your changes in 2.0.x and current 5.x? |
| Comment by Serkan Yıldırım [ 27/Feb/23 ] |
|
Sorry for late response Ivan. I could not see a unit test for class |
| Comment by Yaroslav Lastivka [ 05/Apr/23 ] |
|
Testing 2.0.x with patch from
12:46:50.047 INFO [opendaylight-cluster-data-notification-dispatcher-35] Deleting the entry for callhome device InstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.netconf.callhome.server.rev201015.netconf.callhome.server.AllowedDevices, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.netconf.callhome.server.rev201015.NetconfCallhomeServer, org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.netconf.callhome.server.rev201015.netconf.callhome.server.AllowedDevices]} 13:46:10.284 INFO [nioEventLoopGroup-4-3] requestNewKeysExchange(ClientSessionImpl[netconf@/172.17.0.2:42848]) Initiating key re-exchange 14:46:12.928 INFO [nioEventLoopGroup-4-3] requestNewKeysExchange(ClientSessionImpl[netconf@/172.17.0.2:42848]) Initiating key re-exchange 15:46:15.220 INFO [nioEventLoopGroup-4-3] requestNewKeysExchange(ClientSessionImpl[netconf@/172.17.0.2:42848]) Initiating key re-exchange After more than 3 hours the /rests/data/odl-netconf-callhome-server:netconf-callhome-server shows DISCONNECTED status although the node in netconf topology is connected. |
| Comment by Ivan Hrasko [ 12/Apr/23 ] |
|
We expect device connects again, please see new issue here: NETCONF-989. |
| Comment by Ivan Hrasko [ 19/Apr/23 ] |
|
By cherry-picking of https://git.opendaylight.org/gerrit/c/netconf/+/105212 we have verified that During investigating this issue we have found out new one NETCONF-989 which is present also on current master branch. It will be addressed separately. |
| Comment by Ivan Hrasko [ 19/Apr/23 ] |
|
The netconf 2.x branch is no more maintained. Current releases are not affected by this issue. |