[OPNFLWPLUG-1018] OVS not able to connect to ODL - contextChain stuck in CLOSED state Created: 08/Jun/18  Updated: 25/Jun/18  Resolved: 25/Jun/18

Status: Resolved
Project: OpenFlowPlugin
Component/s: openflowplugin-impl
Affects Version/s: None
Fix Version/s: Oxygen, Fluorine

Type: Bug Priority: High
Reporter: Victor Pickard Assignee: Anil Vishnoi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File karaf.log.debug     File ovs-vswitchd.log     File test.sh     File test2.sh    
Issue Links:
Duplicate
duplicates OPNFLWPLUG-970 Unable to establish connections with ... Resolved

 Description   

We have seen sporadic failures where OVS on a compute node fails to connect to ODL when set-controller is executed on the compute node. These failures usually occur during deployment.

When this failure occurs, OVS on the compute node keeps attempting to connect to ODL. Openflowplugin immediately closes each new incoming connection because it believes the device is already in termination state. Once we are in this state, the cycle repeats endlessly.

 

Since this bug is intermittent and difficult to reproduce (hitting it by redeploying every time would take far too long), I wrote a small test script to reproduce it. The script is attached as test.sh.
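
For readers without access to the attachment, the general idea of such a reproduction loop can be sketched as follows. This is not the attached test.sh, whose contents are not shown here; it is a minimal sketch that assumes the bridge and controller address seen in the ovs-vswitchd log, and assumes that `ovs-vsctl get controller <bridge> is_connected` reports the connection state. The function names, iteration count and settle time are hypothetical.

```python
import subprocess
import time

BRIDGE = "br-int"                      # bridge name as seen in the ovs-vswitchd log
CONTROLLER = "tcp:10.8.125.230:6653"   # controller target as seen in the ovs-vswitchd log


def run_cmd(args):
    """Run a command and return its stdout as text (raises if the command fails)."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


def is_connected(run=run_cmd):
    """Assumption: the controller record can be looked up by bridge name."""
    out = run(["ovs-vsctl", "get", "controller", BRIDGE, "is_connected"])
    return out.strip() == "true"


def toggle_controller(iterations, settle_seconds=5.0, run=run_cmd, sleep=time.sleep):
    """Repeatedly remove and re-add the controller.

    Returns the first iteration after which OVS was not connected,
    or None if it reconnected every time.
    """
    for i in range(1, iterations + 1):
        run(["ovs-vsctl", "del-controller", BRIDGE])
        run(["ovs-vsctl", "set-controller", BRIDGE, CONTROLLER])
        sleep(settle_seconds)  # hypothetical settle time; tune for the environment
        if not is_connected(run):
            return i
    return None
```

Calling `toggle_controller(100)` on a compute node would then stop at the first iteration where the switch failed to reconnect, which is the point at which the karaf and ovs-vswitchd logs are worth inspecting.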

I enabled TRACE logs for openflowplugin.impl.lifecycle and was able to reproduce the issue. I've been looking at the logs and code, but it would be great if someone on the openflowplugin team could jump in and take a look.

I've included a snippet of some relevant karaf and ovs-vswitchd logs below. Full karaf log and ovs-vswitchd logs are attached. 

The region of interest in the karaf log starts around timestamp 2018-06-08T22:32:11,465, which is line number 4661.

 

Setup

Devstack setup, with 1 control node (VM) and 2 compute nodes (VMs) on 1 bare-metal server.

Devstack is stable/queens.

ODL is stable/oxygen.

 

Karaf logs

2018-06-08T22:32:11,465 | WARN  | epollEventLoopGroup-9-6 | ContextChainHolderImpl           | 401 - org.opendaylight.openflowplugin.impl - 0.7.0.SNAPSHOT | Device openflow:220918713701936 is already in termination state, closing all incoming connections.
2018-06-08T22:32:11,870 | INFO  | epollEventLoopGroup-9-7 | AbstractConnectionAdapter        | 410 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.7.0.SNAPSHOT | The channel outbound queue size:1024
2018-06-08T22:32:11,871 | INFO  | epollEventLoopGroup-9-7 | ConnectionAdapterImpl            | 410 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.7.0.SNAPSHOT | Hello received
2018-06-08T22:32:11,873 | INFO  | epollEventLoopGroup-9-7 | ContextChainHolderImpl           | 401 - org.opendaylight.openflowplugin.impl - 0.7.0.SNAPSHOT | Device openflow:220918713701936 connected.
2018-06-08T22:32:11,873 | WARN  | epollEventLoopGroup-9-7 | ContextChainHolderImpl           | 401 - org.opendaylight.openflowplugin.impl - 0.7.0.SNAPSHOT | Device openflow:220918713701936 is already in termination state, closing all incoming connections.

2018-06-08T22:32:13,868 | INFO  | epollEventLoopGroup-9-8 | AbstractConnectionAdapter        | 410 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.7.0.SNAPSHOT | The channel outbound queue size:1024
2018-06-08T22:32:13,869 | INFO  | epollEventLoopGroup-9-8 | ConnectionAdapterImpl            | 410 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.7.0.SNAPSHOT | Hello received
2018-06-08T22:32:13,872 | INFO  | epollEventLoopGroup-9-8 | ContextChainHolderImpl           | 401 - org.opendaylight.openflowplugin.impl - 0.7.0.SNAPSHOT | Device openflow:220918713701936 connected.
2018-06-08T22:32:13,873 | WARN  | epollEventLoopGroup-9-8 | ContextChainHolderImpl           | 401 - org.opendaylight.openflowplugin.impl - 0.7.0.SNAPSHOT | Device openflow:220918713701936 is already in termination state, closing all incoming connections.
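
To check a full karaf log for this pattern, a small script along the following lines may help. It is only a sketch: the two regular expressions are derived from the message text in the excerpt above, and the function name and return shape are hypothetical.

```python
import re
from collections import Counter

# Patterns taken from the ContextChainHolderImpl messages in the excerpt above.
CONNECTED = re.compile(r"Device (openflow:\d+) connected\.")
REJECTED = re.compile(r"Device (openflow:\d+) is already in termination state")


def summarize_rejections(lines):
    """Return {device: (connected_count, rejected_count)} for the given log lines."""
    connected, rejected = Counter(), Counter()
    for line in lines:
        match = CONNECTED.search(line)
        if match:
            connected[match.group(1)] += 1
            continue
        match = REJECTED.search(line)
        if match:
            rejected[match.group(1)] += 1
    devices = connected.keys() | rejected.keys()
    return {dev: (connected[dev], rejected[dev]) for dev in devices}
```

A device whose rejected count keeps pace with its connected count, as in the excerpt, is a candidate for being stuck in this state. Usage could be as simple as `summarize_rejections(open("karaf.log"))`, where the file name is whatever the local log is called.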

 

ovs-vswitchd logs

2018-06-08T23:04:42.855Z|13767|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:04:42.858Z|13768|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:04:50.854Z|13769|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:04:50.857Z|13770|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:04:58.854Z|13771|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:04:58.857Z|13772|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:05:06.855Z|13773|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:05:06.858Z|13774|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:05:14.853Z|13775|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:05:14.856Z|13776|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:05:22.854Z|13777|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:05:22.856Z|13778|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:05:30.855Z|13779|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:05:30.858Z|13780|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:05:38.853Z|13781|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
2018-06-08T23:05:38.855Z|13782|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connection closed by peer
2018-06-08T23:05:46.855Z|13783|rconn|INFO|br-int<->tcp:10.8.125.230:6653: connected
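
The same flapping can be measured from the switch side. The sketch below assumes the pipe-separated rconn line format shown above and reports how long each connection lasted before the peer closed it; in the excerpt every connection is closed within a few milliseconds and retried about 8 seconds later. The function names are hypothetical.

```python
from datetime import datetime


def parse_rconn(line):
    """Split an ovs-vswitchd rconn line into (timestamp, event), or None if it does not match."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5 or parts[2] != "rconn":
        return None
    timestamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%fZ")
    # The message looks like "br-int<->tcp:10.8.125.230:6653: connected".
    event = parts[4].rsplit(": ", 1)[-1]
    return timestamp, event


def connection_lifetimes(lines):
    """Return the lifetime in seconds of every connection that was closed by the peer."""
    lifetimes = []
    opened = None
    for line in lines:
        parsed = parse_rconn(line)
        if parsed is None:
            continue
        timestamp, event = parsed
        if event == "connected":
            opened = timestamp
        elif event == "connection closed by peer" and opened is not None:
            lifetimes.append((timestamp - opened).total_seconds())
            opened = None
    return lifetimes
```

A long run of lifetimes well under a second is a strong hint that the controller, not the network, is dropping the connection.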

 

 

 

 



 Comments   
Comment by Anil Vishnoi [ 11/Jun/18 ]

vpickard OPNFLWPLUG-970 is a similar issue, and patches have been raised for it on stable/carbon. We will cherry-pick them to stable/oxygen; you can then test with those patches to see whether they resolve the issue.

Comment by Victor Pickard [ 11/Jun/18 ]

Anil,

Thanks. I will test with the new changes once they are cherry-picked.

Just FYI, I also saw the ConcurrentModificationException as noted in https://jira.opendaylight.org/browse/OPNFLWPLUG-970, so would need 

https://git.opendaylight.org/gerrit/#/c/70093/

to be cherry-picked as well.

 

Thanks,

Vic

 

Comment by Victor Pickard [ 11/Jun/18 ]

I cherry-picked the patches and did some local testing. I was not able to reproduce this issue with these patches, so it looks good. Thanks Anil!

 

Vic

 

Comment by Victor Pickard [ 11/Jun/18 ]

Attached an updated version of the test script (test2.sh). I noticed that sometimes, after doing a set-controller, it took a little longer than usual for OVS to connect to ODL. I updated the script to account for those extra few seconds.
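
One common way to tolerate a slow connection is to poll with a deadline instead of sleeping for a fixed time. The sketch below illustrates that idea only; it is not taken from test2.sh, and the function name, timeout and poll interval are hypothetical.

```python
import time


def wait_until(check, timeout_seconds=30.0, poll_seconds=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_seconds has elapsed.

    Returns True as soon as check() succeeds, False if the deadline passes first.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

Here `check` is any zero-argument callable, for example a probe that asks OVS whether the controller connection is up. A slow but successful connection then passes, while a connection that never comes up is still reported as a failure.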

Comment by Anil Vishnoi [ 25/Jun/18 ]

vpickard I think this issue is resolved? Can you open another Jira to track the issue you mentioned, and can you please provide the logs as well?

Comment by Victor Pickard [ 25/Jun/18 ]

Anil,

Yes, this issue is resolved with the 2 patches. 

The issue I mentioned (the exception) was also resolved by the patch, so there is no new issue to open a Jira for.

Should be good to mark this bug as resolved.

 

Thanks
