[NETCONF-1067] Unrecoverable NETCONF session negotiation failure Created: 27/Jun/23  Updated: 18/Oct/23  Resolved: 18/Oct/23

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: 5.0.6, 4.0.8
Fix Version/s: 4.0.9, 5.0.9

Type: Bug Priority: High
Reporter: Sangwook Ha Assignee: Sangwook Ha
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

An exception may get raised from the channel pipeline during NETCONF session negotiation and once this happens the channel negotiation cannot complete successfully even without the cause that originally triggered the exception.

This is a such case where AbstractNetconfSessionNegotiator.exceptionCaught() is called during a session negotiation:

2023-06-27T00:32:36,057 | INFO  | globalWorkerGroup-3-3 | AbstractNetconfSessionNegotiator | 265 - org.opendaylight.netconf.netty-util - 6.0.0.SNAPSHOT | Unexpected error during negotiation on channel [id: 0x2d4c45fc]
io.netty.handler.codec.DecoderException: org.xml.sax.SAXParseException; lineNumber: 28; columnNumber: 3; The element type "nc:rpc-reply" must be terminated by the matching end-tag "</nc:rpc-reply>".
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499) ~[bundleFile:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[bundleFile:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[bundleFile:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:333) ~[bundleFile:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:454) ~[bundleFile:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:425) ~[bundleFile:4.1.94.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[bundleFile:4.1.94.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[bundleFile:4.1.94.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[bundleFile:4.1.94.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[bundleFile:4.1.94.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[bundleFile:4.1.94.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[bundleFile:4.1.94.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[bundleFile:4.1.94.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

This was triggered by a corrupted hello message from the device:

<!-- No zombies were killed during the creation of this user interface -->
<!-- user MSEFULL, class j-MSEFULL -->
<nc:hello xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0">
   <nc:capabilities>
    <nc:capability>urn:ietf:params:netconf:base:1.0</nc:capability>
    <nc:capability>urn:ietf:params:netconf:capability:candidate:1.0</nc:capability>
    <nc:capability>urn:ietf:params:netconf:capability:confirmed-commit:1.0</nc:capability>
    <nc:capability>urn:ietf:params:netconf:capability:validate:1.0</nc:capability>
    <nc:capability>urn:ietf:params:netconf:capability:url:1.0?scheme=http,ftp,file</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:netconf:base:1.0?module=ietf-netconf&amp;revision=2011-06-01</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:netconf:capability:candidate:1.0</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:netconf:capability:confirmed-commit:1.0</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:netconf:capability:validate:1.0</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:netconf:capability:url:1.0?scheme=http,ftp,file</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:yang:ietf-inet-types?module=ietf-inet-types&amp;revision=2013-07-15</nc:capability>
    <nc:capability>urn:ietf:params:xml:ns:yang:ietf-netconf-monitoring</nc:capability>
    <nc:capability>http://xml.juniper.net/netconf/junos/1.0</nc:capability>
    <nc:capability>http://xml.juniper.net/dmi/system/1.0</nc:capability>
<nc:rpc-reply xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0">
<rpc-error>
<error-type>protocol</error-type>
<error-tag>operation-failed</error-tag>
<error-severity>error</error-severity>
<error-message>
Failed to unlink YANG path : /var/run/db/yangs
</error-message>
</rpc-error>
</rpc-reply>
  </nc:capabilities>
  <nc:session-id>90451</nc:session-id>
</nc:hello>

This exception causes negotiation failure and in turn changes the state of NetconfSessionPromise to 'failure'. Once this happens the state cannot be changed while controller keeps retrying session negotiations meaninglessly & indefinitely with max-connection-attempts set to 0.

In most cases the only way to get out of this repeated session negotiation failure and properly reestablish a NETCONF session with the device is to restart the controller, which is very disruptive in the production environment.



 Comments   
Comment by Robert Varga [ 03/Jul/23 ]

sangwookha  so... how does this work wiith max-connection-attempts = 2? I believe I found a bug in TimedReconnectStrategy integration as part of NETCONF-1070 – where 0 is not interpreted as it should be...

Comment by Sangwook Ha [ 03/Jul/23 ]

When max-connection-attempts is set to 2, ReconnectStrategy will return a failed future after 2 connection attempts and no more connection will be tried from NetconfSessionPromise. reconnectFutureComplete()

And when the parameter is set to 0, it appears that TimedReconnecStrategy.maxAttempts is set to null by TimedReconnectStrategyFactory, hence the number of attempts becomes irrelevant.

Comment by Sangwook Ha [ 03/Jul/23 ]

Given the change 106786 being developed, a proposed fix in change 106744 won't be relevant for netconf 6.0.0.
Other development streams that do not adopt the change would need some form of fix.

Generated at Wed Feb 07 20:16:35 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.