[OPNFLWPLUG-970] Unable to establish connections with switches after frequent connecting and disconnecting Created: 08/Jan/18  Updated: 19/Jun/18  Resolved: 19/Jun/18

Status: Resolved
Project: OpenFlowPlugin
Component/s: openflowplugin-impl
Affects Version/s: Carbon-SR3, Carbon-SR2
Fix Version/s: Oxygen

Type: Bug Priority: Highest
Reporter: Chien-Hsin Chen Assignee: Chien-Hsin Chen
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Ubuntu 14.04
Platform: PC


Attachments: File karaf.log     PDF File stack trace.pdf    
Issue Links:
Duplicate
is duplicated by OPNFLWPLUG-1018 OVS not able to connect to ODL - cont... Resolved

 Description   

My environment has three physical switches connecting to the opendaylight controller of carbon-sr2.

After keeping making two of the switches connected and disconnected frequently for a while, the opendaylight controller couldn't set up the connections with switches any longer. Please see the attachment for karaf log.

 

One fact is that everything started to go wrong when the following log message showed up:

2018-01-08 14:21:35,601 | WARN | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 is already in termination state, closing all incoming connections.

 

Then all connections will be closed after that like this:

2018-01-08 14:21:35,601 | INFO | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 connected.
2018-01-08 14:21:35,601 | WARN | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 is already in termination state, closing all incoming connections.

 

Here is part of the log (filter by openflow:35776560030) when it went wrong:

2018-01-08 14:19:24,615 | WARN | lt-dispatcher-19 | ContextChainImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Not able to start clustering services for node openflow:35776560030
2018-01-08 14:19:24,615 | WARN | ofppool-2 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Not able to set MASTER role on device openflow:35776560030, reason: Failed to initialize device openflow:35776560030 in 9s: java.util.concurrent.TimeoutException: Timeout waiting for task.
2018-01-08 14:19:24,615 | WARN | ofppool-2 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | This mastering is mandatory, destroying context chain and closing connection for device openflow:35776560030.
2018-01-08 14:19:24,616 | INFO | ofppool-2 | ContextChainImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Closed clustering services registration for node openflow:35776560030
2018-01-08 14:19:24,621 | INFO | ult-dispatcher-3 | ContextChainImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Closing clustering services for node openflow:35776560030
2018-01-08 14:19:24,621 | INFO | ult-dispatcher-3 | StatisticsContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Stopping statistics context cluster services for node openflow:35776560030
2018-01-08 14:19:24,621 | INFO | ult-dispatcher-3 | StatisticsContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Stopping running statistics gathering for node openflow:35776560030
2018-01-08 14:19:24,621 | INFO | ult-dispatcher-3 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Stopping rpc context cluster services for node openflow:35776560030
2018-01-08 14:19:24,621 | INFO | ult-dispatcher-3 | DeviceContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Stopping device context cluster services for node openflow:35776560030
2018-01-08 14:19:24,622 | INFO | ult-dispatcher-3 | ContextChainImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Closed clustering services for node openflow:35776560030
2018-01-08 14:19:24,625 | DEBUG | lt-dispatcher-20 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Entity openflow:35776560030 has no owner
2018-01-08 14:19:24,625 | INFO | lt-dispatcher-20 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Removing device Uri [_value=openflow:35776560030] from operational DS
2018-01-08 14:19:24,626 | DEBUG | CommitFutures-1 | DeviceManagerImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Delete Node KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:35776560030]]]]} was successful
2018-01-08 14:19:32,642 | INFO | ofppool-2 | SystemNotificationsListenerImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | ConnectionEvent:Closing connection as device is idle. Echo sent at Mon Jan 08 14:19:30 CST 2018. Device:/192.168.1.103:29990, NodeId:openflow:35776560030
2018-01-08 14:19:32,642 | DEBUG | ofppool-2 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Actively closing connection: device=openflow:35776560030 | auxiliaryId=0 | connectionState=RIP
2018-01-08 14:19:32,642 | DEBUG | ofppool-2 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Trying unregister outbound queue handler registration for node openflow:35776560030
2018-01-08 14:19:32,642 | DEBUG | ofppool-2 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Trying closing handshake context for node openflow:35776560030
2018-01-08 14:19:32,642 | DEBUG | ofppool-2 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Propagating connection closed event: /192.168.1.103:29990, datapathId:35776560030.
2018-01-08 14:19:32,643 | INFO | ofppool-2 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 disconnected.
2018-01-08 14:19:32,644 | DEBUG | ofppool-2 | ContextChainImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | ContextChain for node openflow:35776560030 is already in TERMINATION state.
2018-01-08 14:21:35,601 | DEBUG | entLoopGroup-7-8 | HandshakeListenerImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | succeeded by getting sweep barrier after post-handshake for device openflow:35776560030
2018-01-08 14:21:35,601 | INFO | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 connected.
2018-01-08 14:21:35,601 | WARN | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 is already in termination state, closing all incoming connections.
2018-01-08 14:21:35,601 | DEBUG | entLoopGroup-7-8 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Actively closing connection: device=openflow:35776560030 | auxiliaryId=0 | connectionState=RIP
2018-01-08 14:21:35,601 | DEBUG | entLoopGroup-7-8 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Trying unregister outbound queue handler registration for node openflow:35776560030
2018-01-08 14:21:35,601 | DEBUG | entLoopGroup-7-8 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Trying closing handshake context for node openflow:35776560030
2018-01-08 14:21:43,691 | DEBUG | ntLoopGroup-7-10 | HandshakeListenerImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | succeeded by getting sweep barrier after post-handshake for device openflow:35776560030
2018-01-08 14:21:43,692 | INFO | ntLoopGroup-7-10 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 connected.
2018-01-08 14:21:43,692 | WARN | ntLoopGroup-7-10 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 is already trying to connect, wait until succeeded or disconnected.
2018-01-08 14:21:43,692 | DEBUG | ntLoopGroup-7-10 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Actively closing connection: device=openflow:35776560030 | auxiliaryId=0 | connectionState=RIP
2018-01-08 14:21:43,692 | DEBUG | ntLoopGroup-7-10 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Trying unregister outbound queue handler registration for node openflow:35776560030
2018-01-08 14:21:43,692 | DEBUG | ntLoopGroup-7-10 | ConnectionContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Trying closing handshake context for node openflow:35776560030
2018-01-08 14:21:51,812 | DEBUG | ntLoopGroup-7-12 | HandshakeListenerImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | succeeded by getting sweep barrier after post-handshake for device openflow:35776560030
2018-01-08 14:21:51,812 | INFO | ntLoopGroup-7-12 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 connected.
2018-01-08 14:21:51,813 | WARN | ntLoopGroup-7-12 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 is already trying to connect, wait until succeeded or disconnected.

 

I tried to analyze it, it seems sometimes the failure in setting master role on switches would cause this issue.

The pattern is like: Failure in setting master role -> destroying context chain -> Closing connection as device is idle. -> disconnected -> conencted -> already in termination state -> connected -> already trying to connect -> connected -> already trying to connect -> ...



 Comments   
Comment by Anil Vishnoi [ 09/Jan/18 ]

jamescch Are you planning to work on this issue?

Comment by Chien-Hsin Chen [ 09/Jan/18 ]

I'm afraid I can't because of limited knowledge on openflowplugin.

But I can assist in debugging.

Comment by Kevin Wang [ 28/Feb/18 ]

I am able to reproduce this issue with just one switch disconnect and reconnect. It doesn't happen all the time, but once it happen, the state cannot recover. I will look into this issue.

Comment by Chien-Hsin Chen [ 23/Mar/18 ]

kevinxw I have found the root cause. I'm going to fix this.

Do you mind assigning back to me?

Comment by Kevin Wang [ 23/Mar/18 ]

Please go ahead. I am busy with other task recently and haven't really made much progress so far.

Comment by Chien-Hsin Chen [ 27/Mar/18 ]

Change has been opened here.

Comment by Paul Greenberg [ 05/Apr/18 ]

Experiencing what appears to be the same issue with Nitrogen:

2018-04-04 22:03:14,179 | WARN  | ntLoopGroup-16-7 | ContextChainHolderImpl           | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | Device openflow:218639756982094 already connected. Closing previous connection
2018-04-04 22:03:16,412 | WARN  | ntLoopGroup-16-4 | ContextChainImpl                 | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | Not able to start clustering services for node openflow:231865136161864
2018-04-04 22:03:16,414 | WARN  | ofppool-11       | ContextChainHolderImpl           | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | Not able to set MASTER role on device openflow:231865136161864, reason: java.lang.RuntimeException: Failed to initialize device openflow:231865136161864 in 9s: java.util.concurrent.TimeoutException
2018-04-04 22:03:16,414 | WARN  | ofppool-11       | ContextChainHolderImpl           | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | This mastering is mandatory, destroying context chain and closing connection for device openflow:231865136161864.

jamescch, what is the root cause of this issue?

Comment by Chien-Hsin Chen [ 10/Apr/18 ]

The root cause is that when destoryContextChain in class ContextChainHolderImpl is called, class DeviceContextImpl will try to iterate all requestContexts it has created in a collection to close them.

However, those requestContexts will remove themselves from the collection.

And this caused the CurrentModificationException and the thread was aborted.

The status in ContextChainImpl for this switch remained CLOSING and changed to ALREADY_CONNECTED later and blocked all incoming connections.

And the reason why we didn't see the error in the karaf console is that thread is executed using ExecutorService.submit() which didn't expose the exception.

I uploaded the stack trace so that you can see how it goes into this situation.

Comment by Anil Vishnoi [ 11/Jun/18 ]

jamescch Can you please cherry-pick the changes to stable/oxygen and master branch as well. We have seen similar issue with those branches as well.

Generated at Wed Feb 07 20:33:52 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.