[OPNFLWPLUG-970] Unable to establish connections with switches after frequent connecting and disconnecting Created: 08/Jan/18 Updated: 19/Jun/18 Resolved: 19/Jun/18 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | openflowplugin-impl |
| Affects Version/s: | Carbon-SR3, Carbon-SR2 |
| Fix Version/s: | Oxygen |
| Type: | Bug | Priority: | Highest |
| Reporter: | Chien-Hsin Chen | Assignee: | Chien-Hsin Chen |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: Ubuntu 14.04 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
My environment has three physical switches connecting to the opendaylight controller of carbon-sr2. After keeping making two of the switches connected and disconnected frequently for a while, the opendaylight controller couldn't set up the connections with switches any longer. Please see the attachment for karaf log.
One fact is that everything started to go wrong when the following log message showed up: 2018-01-08 14:21:35,601 | WARN | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 is already in termination state, closing all incoming connections.
Then all connections will be closed after that like this: 2018-01-08 14:21:35,601 | INFO | entLoopGroup-7-8 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Device openflow:35776560030 connected.
Here is part of the log (filter by openflow:35776560030) when it went wrong: 2018-01-08 14:19:24,615 | WARN | lt-dispatcher-19 | ContextChainImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.2.Carbon | Not able to start clustering services for node openflow:35776560030
I tried to analyze it, it seems sometimes the failure in setting master role on switches would cause this issue. The pattern is like: Failure in setting master role -> destroying context chain -> Closing connection as device is idle. -> disconnected -> conencted -> already in termination state -> connected -> already trying to connect -> connected -> already trying to connect -> ... |
| Comments |
| Comment by Anil Vishnoi [ 09/Jan/18 ] |
|
jamescch Are you planning to work on this issue? |
| Comment by Chien-Hsin Chen [ 09/Jan/18 ] |
|
I'm afraid I can't because of limited knowledge on openflowplugin. But I can assist in debugging. |
| Comment by Kevin Wang [ 28/Feb/18 ] |
|
I am able to reproduce this issue with just one switch disconnect and reconnect. It doesn't happen all the time, but once it happen, the state cannot recover. I will look into this issue. |
| Comment by Chien-Hsin Chen [ 23/Mar/18 ] |
|
kevinxw I have found the root cause. I'm going to fix this. Do you mind assigning back to me? |
| Comment by Kevin Wang [ 23/Mar/18 ] |
|
Please go ahead. I am busy with other task recently and haven't really made much progress so far. |
| Comment by Chien-Hsin Chen [ 27/Mar/18 ] |
|
Change has been opened here. |
| Comment by Paul Greenberg [ 05/Apr/18 ] |
|
Experiencing what appears to be the same issue with Nitrogen: 2018-04-04 22:03:14,179 | WARN | ntLoopGroup-16-7 | ContextChainHolderImpl | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | Device openflow:218639756982094 already connected. Closing previous connection 2018-04-04 22:03:16,412 | WARN | ntLoopGroup-16-4 | ContextChainImpl | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | Not able to start clustering services for node openflow:231865136161864 2018-04-04 22:03:16,414 | WARN | ofppool-11 | ContextChainHolderImpl | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | Not able to set MASTER role on device openflow:231865136161864, reason: java.lang.RuntimeException: Failed to initialize device openflow:231865136161864 in 9s: java.util.concurrent.TimeoutException 2018-04-04 22:03:16,414 | WARN | ofppool-11 | ContextChainHolderImpl | 305 - org.opendaylight.openflowplugin.impl - 0.5.1 | This mastering is mandatory, destroying context chain and closing connection for device openflow:231865136161864. jamescch, what is the root cause of this issue? |
| Comment by Chien-Hsin Chen [ 10/Apr/18 ] |
|
The root cause is that when destoryContextChain in class ContextChainHolderImpl is called, class DeviceContextImpl will try to iterate all requestContexts it has created in a collection to close them. However, those requestContexts will remove themselves from the collection. And this caused the CurrentModificationException and the thread was aborted. The status in ContextChainImpl for this switch remained CLOSING and changed to ALREADY_CONNECTED later and blocked all incoming connections. And the reason why we didn't see the error in the karaf console is that thread is executed using ExecutorService.submit() which didn't expose the exception. I uploaded the stack trace so that you can see how it goes into this situation. |
| Comment by Anil Vishnoi [ 11/Jun/18 ] |
|
jamescch Can you please cherry-pick the changes to stable/oxygen and master branch as well. We have seen similar issue with those branches as well. |