[NETCONF-1067] Unrecoverable NETCONF session negotiation failure Created: 27/Jun/23 Updated: 18/Oct/23 Resolved: 18/Oct/23 |
|
| Status: | Resolved |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | 5.0.6, 4.0.8 |
| Fix Version/s: | 4.0.9, 5.0.9 |
| Type: | Bug | Priority: | High |
| Reporter: | Sangwook Ha | Assignee: | Sangwook Ha |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
An exception may get raised from the channel pipeline during NETCONF session negotiation and once this happens the channel negotiation cannot complete successfully even without the cause that originally triggered the exception. This is a such case where AbstractNetconfSessionNegotiator.exceptionCaught() is called during a session negotiation: 2023-06-27T00:32:36,057 | INFO | globalWorkerGroup-3-3 | AbstractNetconfSessionNegotiator | 265 - org.opendaylight.netconf.netty-util - 6.0.0.SNAPSHOT | Unexpected error during negotiation on channel [id: 0x2d4c45fc] io.netty.handler.codec.DecoderException: org.xml.sax.SAXParseException; lineNumber: 28; columnNumber: 3; The element type "nc:rpc-reply" must be terminated by the matching end-tag "</nc:rpc-reply>". at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499) ~[bundleFile:4.1.94.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[bundleFile:4.1.94.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[bundleFile:4.1.94.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[bundleFile:4.1.94.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[bundleFile:4.1.94.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[bundleFile:4.1.94.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:333) ~[bundleFile:4.1.94.Final] at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:454) ~[bundleFile:4.1.94.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[bundleFile:4.1.94.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[bundleFile:4.1.94.Final] at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61) ~[bundleFile:4.1.94.Final] at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:425) ~[bundleFile:4.1.94.Final] at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[bundleFile:4.1.94.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[bundleFile:4.1.94.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[bundleFile:4.1.94.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[bundleFile:4.1.94.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[bundleFile:4.1.94.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[bundleFile:4.1.94.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[bundleFile:4.1.94.Final] at java.lang.Thread.run(Thread.java:833) ~[?:?] This was triggered by a corrupted hello message from the device: <!-- No zombies were killed during the creation of this user interface -->
<!-- user MSEFULL, class j-MSEFULL -->
<nc:hello xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0">
<nc:capabilities>
<nc:capability>urn:ietf:params:netconf:base:1.0</nc:capability>
<nc:capability>urn:ietf:params:netconf:capability:candidate:1.0</nc:capability>
<nc:capability>urn:ietf:params:netconf:capability:confirmed-commit:1.0</nc:capability>
<nc:capability>urn:ietf:params:netconf:capability:validate:1.0</nc:capability>
<nc:capability>urn:ietf:params:netconf:capability:url:1.0?scheme=http,ftp,file</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:netconf:base:1.0?module=ietf-netconf&revision=2011-06-01</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:netconf:capability:candidate:1.0</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:netconf:capability:confirmed-commit:1.0</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:netconf:capability:validate:1.0</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:netconf:capability:url:1.0?scheme=http,ftp,file</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:yang:ietf-inet-types?module=ietf-inet-types&revision=2013-07-15</nc:capability>
<nc:capability>urn:ietf:params:xml:ns:yang:ietf-netconf-monitoring</nc:capability>
<nc:capability>http://xml.juniper.net/netconf/junos/1.0</nc:capability>
<nc:capability>http://xml.juniper.net/dmi/system/1.0</nc:capability>
<nc:rpc-reply xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0">
<rpc-error>
<error-type>protocol</error-type>
<error-tag>operation-failed</error-tag>
<error-severity>error</error-severity>
<error-message>
Failed to unlink YANG path : /var/run/db/yangs
</error-message>
</rpc-error>
</rpc-reply>
</nc:capabilities>
<nc:session-id>90451</nc:session-id>
</nc:hello>
This exception causes negotiation failure and in turn changes the state of NetconfSessionPromise to 'failure'. Once this happens the state cannot be changed while controller keeps retrying session negotiations meaninglessly & indefinitely with max-connection-attempts set to 0. In most cases the only way to get out of this repeated session negotiation failure and properly reestablish a NETCONF session with the device is to restart the controller, which is very disruptive in the production environment. |
| Comments |
| Comment by Robert Varga [ 03/Jul/23 ] |
|
sangwookha so... how does this work wiith max-connection-attempts = 2? I believe I found a bug in TimedReconnectStrategy integration as part of |
| Comment by Sangwook Ha [ 03/Jul/23 ] |
|
When max-connection-attempts is set to 2, ReconnectStrategy will return a failed future after 2 connection attempts and no more connection will be tried from NetconfSessionPromise. reconnectFutureComplete() And when the parameter is set to 0, it appears that TimedReconnecStrategy.maxAttempts is set to null by TimedReconnectStrategyFactory, hence the number of attempts becomes irrelevant. |
| Comment by Sangwook Ha [ 03/Jul/23 ] |
|
Given the change 106786 being developed, a proposed fix in change 106744 won't be relevant for netconf 6.0.0. |