[NETCONF-966] Long delay in connection status update after connection issue Created: 11/Feb/23  Updated: 23/Jan/24  Resolved: 15/Nov/23

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: 3.0.9, 6.0.0, 4.0.9, 6.0.5, 5.0.9
Fix Version/s: 7.0.0, 5.0.10, 6.0.6

Type: Bug Priority: Medium
Reporter: Sangwook Ha Assignee: Martin Sunal
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by NETCONF-1175 TCP session is hanging when NETCONF s... Resolved
Relates
relates to NETCONF-1195 Device Remain Connected after it get ... Confirmed

 Description   

When the network connectivity between controller & NETCONF device is lost and packets are dropped because of network disruption along the path, it takes a long time until connection loss is detected and connection status is updated.

The following logs show that it took more than 15 minutes after sending a NETCONF keepalive probe before the connection timed out ('keepalive-delay' was 30 seconds):

22:31:29.954 DEBUG [globalWorkerGroup-3-7] Consume RemoteWindow[client](NettyAwareChannelSubsystem[id=0, recipient=43]-NetconfClientSessionImpl[netconf@/10.100.18.43:830][netconf]) by 325 down to 1277183
22:31:29.954 DEBUG [globalWorkerGroup-3-7] encode(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) packet #17 sending command=94[SSH_MSG_CHANNEL_DATA] len=334
22:31:29.955 DEBUG [globalWorkerGroup-3-7] writeBuffer(Nio2Session[local=/172.17.0.2:33432, remote=/10.100.18.43:830]) writing 388 bytes

22:47:01.413 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] handleReadCycleFailure(Nio2Session[local=/172.17.0.2:33432, remote=/10.100.18.43:830]) IOException after 961462127645 nanos at read cycle=20: Connection timed out
22:47:01.416 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] exceptionCaught(Nio2Session[local=/172.17.0.2:33432, remote=/10.100.18.43:830]) caught IOException[Connection timed out] - calling handler
22:47:01.417 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] signalAuthFailure(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) type=IOException, signalled=false, first=false: Connection timed out
22:47:01.417 WARN [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] exceptionCaught(NetconfClientSessionImpl[netconf@/10.100.18.43:830])[state=Opened] IOException: Connection timed out
java.io.IOException: Connection timed out
	at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:?]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47) ~[?:?]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:296) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:266) ~[?:?]
	at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(UnixAsynchronousSocketChannelImpl.java:400) ~[?:?]
	at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:195) ~[?:?]
	at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:217) ~[?:?]
	at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:306) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
22:47:01.419 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] sessionException(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) IOException: Connection timed out
22:47:01.420 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] close(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) Closing immediately
22:47:01.420 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] signalAuthFailure(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) type=SshException, signalled=false, first=false: Session is being closed
22:47:01.421 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] close(ClientConnectionService[NetconfClientSessionImpl[netconf@/10.100.18.43:830]]) Closing immediately

When keepalive is enabled, it is reasonable to expect that the connection status is updated in the order of 'keepalive-delay'.



 Comments   
Comment by Vlad [ 14/Dec/23 ]

Hi msunal,

 

I just deployed one of the latest versions of the karaf -> and I still facing this issue: mounted 6k(5k real device and 1k simulated devices) - (stopped 1k simulated and 1k real device and the behavior is:

  • real devices 902 get unable to connect and 98 remains connected network topology 
  • simulated devices 1000 in status connecting in network topology 

 

https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/karaf/0.20.0-SNAPSHOT/

 

Comment by Martin Sunal [ 19/Dec/23 ]

Hi vladutd ,

However it sounds like the same problem from the surface, it can be something totally different. I think the latest versions have other problems which also show up like improper reporting of connection status.

Do you have complete karaf logs?

Can you confirm if TCP connection was still open for 98 "connected" devices?

What was different in terms of netconf node connection config options (node in config datastore) between real and simulated devices?

I would say you should create new ticket with all detailed information.

Comment by Vlad [ 20/Dec/23 ]

Hi,

You are right there is an improper reporting of the connection status. The faster way to update it is to try retrieving something from the device.

Generated at Wed Feb 07 20:16:21 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.