[NETCONF-323] NETCONF client does not close connections after errors in chunked encoding Created: 25/Nov/16  Updated: 15/Mar/19  Resolved: 10/Oct/17

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Giles Heron Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: PC


External issue ID: 7245

 Description   

So I posted to the netconf-dev mailing list about this the other day.

I'm seeing very strange and inconsistent behaviour with Boron-SR1 mounting XR 6.1.2.

Sometimes I see multiple TCP connections on port 830, sometimes just one. When there are multiple connections, the mount sometimes fails due to mismatched message IDs.

Sometimes the mount fails with java.lang.IllegalStateException (the logs make it look like the device sent an unexpected hello message).

Other times the mount fails due to errors in chunked encoding.

It'd probably be helpful to run through some of these scenarios and my various logs.



 Comments   
Comment by Jakub Morvay [ 28/Nov/16 ]

Hi Giles,

This behavior is indeed strange.

I will try to run through the described scenarios.

Comment by Giles Heron [ 28/Nov/16 ]

Let me know if you need access to a router or to my logs.

Comment by Jakub Morvay [ 28/Nov/16 ]

Logs would be really helpful.

Comment by Jakub Morvay [ 06/Dec/16 ]

From the logs and the discussion I've had with Giles, it seems that we are getting wrong data from the device. It looks like the device sends two NETCONF messages mixed together, and hence mounting fails due to errors in chunked encoding.

I suspect the NETCONF client handles such errors incorrectly, and because of that we can see multiple TCP connections open (the connection is not closed properly before the reconnect attempt).
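
For illustration, here is a minimal sketch in Java (emphatically not the actual NetconfChunkAggregator code) of RFC 6242 chunked framing. It shows why bytes from a second, interleaved message surface as a framing error: after the declared chunk-size octets are consumed, the very next byte must start the next chunk header or the end-of-chunks footer.

// Minimal sketch of RFC 6242 chunked framing, for illustration only.
// Framing: 1*( LF '#' chunk-size LF chunk-data ) LF '#' '#' LF
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ChunkFramingSketch {

    static String decode(InputStream in) throws IOException {
        StringBuilder message = new StringBuilder();
        while (true) {
            expect(in, '\n');
            expect(in, '#');
            int first = in.read();
            if (first == '#') {                 // end-of-chunks marker "##"
                expect(in, '\n');
                return message.toString();
            }
            int size = first - '0';
            int b;
            while ((b = in.read()) != '\n') {   // remaining chunk-size digits
                size = size * 10 + (b - '0');
            }
            for (int i = 0; i < size; i++) {    // exactly chunk-size octets of data
                message.append((char) in.read());
            }
            // The next byte must begin the next chunk header or the footer.
            // A stray byte from an interleaved message lands here and trips the check.
        }
    }

    static void expect(InputStream in, char wanted) throws IOException {
        int got = in.read();
        if (got != wanted) {
            throw new IllegalStateException("Malformed chunk framing encountered (byte " + got + ")");
        }
    }

    public static void main(String[] args) throws IOException {
        String wire = "\n#4\n<rpc\n#3\n />\n##\n";
        System.out.println(decode(new ByteArrayInputStream(wire.getBytes(StandardCharsets.US_ASCII))));
        // Prints: <rpc />
    }
}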

Comment by Giles Heron [ 06/Dec/16 ]

So the million-dollar question is: why does this work fine from my Vagrant box on my local PC, but fail from an ODL instance running in the same lab as my routers?

And why does Einar's Python test tool work OK?

Perhaps there's a way we can get more diagnostics? I can get router logs and packet dumps, of course, but with NETCONF over SSH being encrypted it might be hard to decode the packet dumps unless we can somehow pass the encryption keys in. Or maybe there's a way to get the session to use a null encryptor, or to get debug output from ODL post-decryption but pre-parsing?
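
On the last point, one option would be to hook Netty's LoggingHandler into the channel pipeline between the SSH layer and the NETCONF framing handlers, which hex-dumps every buffer after decryption and before any parsing. This is only a sketch; the handler name "aggregator" below is an assumption, and the real name in ODL's pipeline may differ.

import io.netty.channel.Channel;
import io.netty.handler.logging.LogLevel;
import io.netty.handler.logging.LoggingHandler;

public final class WireDumpSketch {
    static void dumpDecryptedBytes(Channel channel) {
        // Logs a hex dump of inbound/outbound buffers at DEBUG level, after the
        // SSH handler has decrypted them and before the chunk aggregator sees them.
        // "aggregator" is an assumed pipeline handler name, not confirmed from ODL.
        channel.pipeline().addBefore("aggregator", "wire-dump", new LoggingHandler(LogLevel.DEBUG));
    }
}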

Comment by Jakub Morvay [ 08/Dec/16 ]

When the NETCONF client encounters a malformed chunk footer, it does not close the NETCONF session and the underlying connection properly.

2016-11-25 10:18:33,226 | WARN | oupCloseable-3-3 | AbstractNetconfSessionNegotiator | 98 - org.opendaylight.netconf.netty-util - 1.1.1.Boron-SR1 | An exception occurred during negotiation with null
io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: Malformed chunk footer encountered (byte 0)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:442)[52:io.netty.codec:4.0.37.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)[52:io.netty.codec:4.0.37.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)[51:io.netty.transport:4.0.37.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:33)[51:io.netty.transport:4.0.37.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:333)[51:io.netty.transport:4.0.37.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)[50:io.netty.common:4.0.37.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:394)[51:io.netty.transport:4.0.37.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)[50:io.netty.common:4.0.37.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)[50:io.netty.common:4.0.37.Final]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_91]
Caused by: java.lang.IllegalStateException: Malformed chunk footer encountered (byte 0)
at org.opendaylight.netconf.nettyutil.handler.NetconfChunkAggregator.checkNewLine(NetconfChunkAggregator.java:46)[98:org.opendaylight.netconf.netty-util:1.1.1.Boron-SR1]
at org.opendaylight.netconf.nettyutil.handler.NetconfChunkAggregator.decode(NetconfChunkAggregator.java:129)[98:org.opendaylight.netconf.netty-util:1.1.1.Boron-SR1]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)[52:io.netty.codec:4.0.37.Final]
... 9 more

We can see that the client immediately tries to reconnect the session, but it does not close the previous connection correctly, which leads to a state where multiple connections are open.
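
For reference, a hedged sketch (not the actual ODL code or fix) of the behaviour the client should have: when the decoder blows up on malformed framing, the underlying channel should be closed before any reconnect attempt, so half-open TCP connections do not accumulate on port 830.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.codec.DecoderException;

// Illustrative only: a handler that tears the connection down on decode errors.
public class CloseOnDecodeErrorHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        if (cause instanceof DecoderException) {
            // Close the channel first; reconnect logic can run afterwards.
            ctx.close();
        }
        ctx.fireExceptionCaught(cause);
    }
}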

Comment by Tomas Cere [ 15/Dec/16 ]

(In reply to Giles Heron from comment #5)
> so the million dollar question is why does this work fine from my vagrant
> box on my local PC, but fail from an ODL instance running in the same lab as
> my routers?
>
> and why does Einar's python test tool work ok?
>
> perhaps there's a way we can get more diagnostics? I can get router logs
> and packet dumps of course - but with NETCONF over SSH being encrypted it
> might be hard to decode the packet dumps unless somehow we can pass the
> encryption keys in. Or maybe there's a way to get the session to use a
> null encryptor or to get debugs out of ODL post-decryption but pre-parsing?

I suspect the reason is that when you are running in your lab, the network doesn't slow anything down. The trigger seems to be a transport throttling error sent from XR, and if I'm correct, this is some throttling built into XR once a certain threshold is reached?

It seems like XR handles sending of the rpc-error incorrectly and just immediately writes it to the NETCONF session, regardless of whether it is already sending something else, hence the botched message.
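
For illustration only, a hypothetical reconstruction (the sizes and payloads below are made up, not taken from the capture): RFC 6242 framing promises exactly chunk-size octets after each "\n#<chunk-size>\n" header, so if XR injects the rpc-error's own framing before an in-flight chunk is finished, roughly like this:

\n#256\n<rpc-reply message-id="101"> ...           (256 octets of the reply promised)
\n#120\n<rpc-error> ... </rpc-error>\n##\n         (throttling error injected before those octets finish)

then the aggregator counts the injected framing as part of the 256-octet chunk, runs out of octets in the middle of the other message, and the "Malformed chunk footer" check fires on whatever byte happens to be there.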

Comment by Giles Heron [ 19/Dec/16 ]

Right, the delay to my lab might slow the connection down enough that XR doesn't need to throttle.

I'll see what the XR guys say.

Comment by Tomas Cere [ 10/Oct/17 ]

This was a problem with the device.
