[OPNFLWPLUG-823] [Boron] Switch connection sometimes bounces on non-owner node when OWNER is shut down Created: 14/Nov/16  Updated: 27/Sep/21  Resolved: 01/Aug/17

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Evan Zeller Assignee: Anil Vishnoi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 7184

 Description   

In a 3-node cluster we check the entity-ownership API to determine which node is the owner of each openflow device. When we stop the OWNER controller, we observe the connections bounce on one of the non-owner nodes.
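
For anyone reproducing this, ownership can be inspected from the entity-owners operational subtree over RESTCONF. A minimal check (assuming the Boron-era entity-owners RESTCONF path, the default RESTCONF port 8181, default admin/admin credentials, and the controller IP from the netstat output below) looks like:

# Dump entity-ownership data to see which cluster member owns each openflow device
$ curl -s -u admin:admin http://10.120.22.207:8181/restconf/operational/entity-owners:entity-owners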

In netstat the issue appears as a growing number of failed connections:
[vagrant@localhost ~]$ netstat -an | grep 6633
tcp6 0 0 :::6633 :::* LISTEN
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50811 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50791 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50772 CLOSE_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50768 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50813 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50784 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50785 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50778 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50780 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50797 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50788 CLOSE_WAIT
tcp6 8 0 10.120.22.207:6633 10.120.22.184:50823 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50774 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50806 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50792 CLOSE_WAIT
tcp6 271 0 10.120.22.207:6633 10.120.22.184:46788 CLOSE_WAIT
tcp6 8 0 10.120.22.207:6633 10.120.22.184:46785 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50786 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50787 TIME_WAIT
tcp6 16 0 10.120.22.207:6633 10.120.22.184:46772 ESTABLISHED
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50808 CLOSE_WAIT
tcp6 16 0 10.120.22.207:6633 10.120.22.184:46769 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50799 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50782 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50795 TIME_WAIT
tcp6 8 0 10.120.22.207:6633 10.120.22.184:46766 ESTABLISHED
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50776 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50777 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50790 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50818 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50824 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50766 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50820 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50803 TIME_WAIT
tcp6 271 0 10.120.22.207:6633 10.120.22.184:46770 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:46786 ESTABLISHED
tcp6 533 0 10.120.22.207:6633 10.120.22.184:46783 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50814 TIME_WAIT
tcp6 7168 0 10.120.22.207:6633 10.120.22.184:46778 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50793 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50815 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:46775 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50801 TIME_WAIT
tcp6 16 0 10.120.22.207:6633 10.120.22.184:50828 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50798 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:46781 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:46790 ESTABLISHED
tcp6 32 0 10.120.22.207:6633 10.120.22.184:50826 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50819 TIME_WAIT
tcp6 16 0 10.120.22.207:6633 10.120.22.184:50827 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50767 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50822 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50796 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50810 TIME_WAIT
tcp6 8 0 10.120.22.207:6633 10.120.22.184:46774 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50809 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50807 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50800 CLOSE_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50816 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:46779 ESTABLISHED
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50821 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50802 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50789 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50794 TIME_WAIT
tcp6 433 0 10.120.22.207:6633 10.120.22.184:46777 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50805 TIME_WAIT
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50812 CLOSE_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:50817 TIME_WAIT
tcp6 0 0 10.120.22.207:6633 10.120.22.184:46784 ESTABLISHED
tcp6 17 0 10.120.22.207:6633 10.120.22.184:50804 CLOSE_WAIT
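
A quick way to summarize the listing above by TCP state (and watch the CLOSE_WAIT count grow over time) is a one-liner like:

# Count connections on the OpenFlow port grouped by TCP state (6th netstat column)
$ netstat -an | grep 6633 | awk '{print $6}' | sort | uniq -c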

In the karaf.log we see the following sequence:
2016-11-09 00:10:28,235 | INFO | ntLoopGroup-11-1 | ConnectionAdapterImpl | 277 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.8.1.Boron-SR1 | Hello received / branch
2016-11-09 00:10:28,242 | WARN | ntLoopGroup-11-1 | DeviceManagerImpl | 286 - org.opendaylight.openflowplugin.impl - 0.3.1.Boron-SR1 | Node openflow:102 already connected disconnecting device. Rejecting connection
2016-11-09 00:10:28,242 | WARN | ntLoopGroup-11-1 | DeviceManagerImpl | 286 - org.opendaylight.openflowplugin.impl - 0.3.1.Boron-SR1 | Node openflow:102 context state not in TERMINATION state.
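
The number of rejected reconnect attempts can be estimated by counting these WARN lines; a rough check, assuming the default Karaf log location under the controller's data/ directory, is:

# Count how many times the plugin rejected a reconnecting device
$ grep -c "already connected disconnecting device" data/log/karaf.log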

Here is the full TRACE log for openflowplugin & openflowjava:

https://gist.github.com/evanzeller/371612c402ef856ad5153ec3ec635817
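
For anyone collecting an equivalent trace, the log levels can be raised with the standard Karaf log:set command (nothing specific to this bug), for example via the Karaf client from the distribution directory:

# Raise openflowplugin and openflowjava logging to TRACE
$ ./bin/client "log:set TRACE org.opendaylight.openflowplugin"
$ ./bin/client "log:set TRACE org.opendaylight.openflowjava"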

In this state, traffic relying on existing flows stops working. The condition persists until the offending node is restarted, after which the controller recovers.



 Comments   
Comment by Evan Zeller [ 15/Nov/16 ]

Thanks to Luis, I discovered this only occurs when you attempt a graceful shutdown of one of the cluster nodes; when using 'kill' it is not reproducible. Perhaps attempting a graceful shutdown of one node in a 3-node cluster is a no-no; if so, is it documented somewhere?
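
To make the comparison concrete, the two shutdown paths being compared are roughly the following (a sketch assuming a default Karaf distribution layout; exact paths and process matching may differ per setup):

# Graceful shutdown (the path that triggers the bounce)
$ ./bin/stop

# Hard kill of the Karaf JVM (issue not reproducible this way)
$ kill -9 $(pgrep -f karaf)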

Comment by Luis Gomez [ 12/Dec/16 ]

I added a test to catch this bug, and it reproduced 2 times in 28 runs:

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-boron/820/archives/log.html.gz

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-boron/843/archives/log.html.gz

So this issue does not happen very often, but it has bad consequences: the switch stays stalled on that node until the node is rebooted.

Comment by Abhijit Kumbhare [ 23/Feb/17 ]

Please re-check after the latest clustering changes by Anil.

Comment by Luis Gomez [ 14/Apr/17 ]

It is happening more rarely now, but I still see the issue in a recent test:

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-boron/1054/archives/log.html.gz

So I will still track this for a while.

Comment by Luis Gomez [ 25/Apr/17 ]

Boron still shows the issue once in a while:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-clustering-only-boron/1072/robot/openflowplugin-clustering.txt/Cluster%20HA%20Owner%20Restart/

Carbon does not show it:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-clustering-only-carbon/599/robot/openflowplugin-clustering.txt/Cluster%20HA%20Owner%20Restart/

So I think the split connection patch helped alleviate the problem.

BR/Luis

Comment by Luis Gomez [ 01/Aug/17 ]

This seems to be fixed now.
