[NETCONF-966] Long delay in connection status update after connection issue Created: 11/Feb/23 Updated: 23/Jan/24 Resolved: 15/Nov/23 |
|
| Status: | Resolved |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | 3.0.9, 6.0.0, 4.0.9, 6.0.5, 5.0.9 |
| Fix Version/s: | 7.0.0, 5.0.10, 6.0.6 |
| Type: | Bug | Priority: | Medium |
| Reporter: | Sangwook Ha | Assignee: | Martin Sunal |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Description |
|
When the network connectivity between controller & NETCONF device is lost and packets are dropped because of network disruption along the path, it takes a long time until connection loss is detected and connection status is updated. The following logs show that it took more than 15 minutes after sending a NETCONF keepalive probe before the connection timed out ('keepalive-delay' was 30 seconds): 22:31:29.954 DEBUG [globalWorkerGroup-3-7] Consume RemoteWindow[client](NettyAwareChannelSubsystem[id=0, recipient=43]-NetconfClientSessionImpl[netconf@/10.100.18.43:830][netconf]) by 325 down to 1277183 22:31:29.954 DEBUG [globalWorkerGroup-3-7] encode(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) packet #17 sending command=94[SSH_MSG_CHANNEL_DATA] len=334 22:31:29.955 DEBUG [globalWorkerGroup-3-7] writeBuffer(Nio2Session[local=/172.17.0.2:33432, remote=/10.100.18.43:830]) writing 388 bytes 22:47:01.413 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] handleReadCycleFailure(Nio2Session[local=/172.17.0.2:33432, remote=/10.100.18.43:830]) IOException after 961462127645 nanos at read cycle=20: Connection timed out 22:47:01.416 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] exceptionCaught(Nio2Session[local=/172.17.0.2:33432, remote=/10.100.18.43:830]) caught IOException[Connection timed out] - calling handler 22:47:01.417 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] signalAuthFailure(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) type=IOException, signalled=false, first=false: Connection timed out 22:47:01.417 WARN [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] exceptionCaught(NetconfClientSessionImpl[netconf@/10.100.18.43:830])[state=Opened] IOException: Connection timed out java.io.IOException: Connection timed out at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:?] at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47) ~[?:?] at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330) ~[?:?] at sun.nio.ch.IOUtil.read(IOUtil.java:296) ~[?:?] at sun.nio.ch.IOUtil.read(IOUtil.java:266) ~[?:?] at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(UnixAsynchronousSocketChannelImpl.java:400) ~[?:?] at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:195) ~[?:?] at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:217) ~[?:?] at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:306) ~[?:?] at java.lang.Thread.run(Thread.java:833) ~[?:?] 22:47:01.419 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] sessionException(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) IOException: Connection timed out 22:47:01.420 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] close(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) Closing immediately 22:47:01.420 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] signalAuthFailure(NetconfClientSessionImpl[netconf@/10.100.18.43:830]) type=SshException, signalled=false, first=false: Session is being closed 22:47:01.421 DEBUG [sshd-NetconfSshClient[42908dcc]-nio2-thread-6] close(ClientConnectionService[NetconfClientSessionImpl[netconf@/10.100.18.43:830]]) Closing immediately When keepalive is enabled, it is reasonable to expect that the connection status is updated in the order of 'keepalive-delay'. |
| Comments |
| Comment by Vlad [ 14/Dec/23 ] |
|
Hi msunal,
I just deployed one of the latest versions of the karaf -> and I still facing this issue: mounted 6k(5k real device and 1k simulated devices) - (stopped 1k simulated and 1k real device and the behavior is:
|
| Comment by Martin Sunal [ 19/Dec/23 ] |
|
Hi vladutd , However it sounds like the same problem from the surface, it can be something totally different. I think the latest versions have other problems which also show up like improper reporting of connection status. Do you have complete karaf logs? Can you confirm if TCP connection was still open for 98 "connected" devices? What was different in terms of netconf node connection config options (node in config datastore) between real and simulated devices? I would say you should create new ticket with all detailed information. |
| Comment by Vlad [ 20/Dec/23 ] |
|
Hi, You are right there is an improper reporting of the connection status. The faster way to update it is to try retrieving something from the device. |