[OPNFLWPLUG-889] Switch disconnect while mastership is being negotiated produces stale switch entry Created: 10/May/17  Updated: 27/Sep/21  Resolved: 17/May/17

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez
Assignee: Miroslav Macko
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8411

 Description   

This only happens in Carbon and it is tracked in the reconciliation suite:

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-carbon/614/archives/log.html.gz#s1-s7

To reproduce:

1) Start a 3-node cluster

2) Start 1 switch (s1) and connect it to 2 nodes (.101 and .102):

sudo ovs-vsctl set-controller s1 "tcp:192.168.0.101:6633" "tcp:192.168.0.102:6633"

3) Block communication from the owner to the switch (in this example .101):

sudo iptables -A INPUT --source 192.168.0.101 -j DROP

4) Immediately afterwards, block communication from the remaining node to the switch:

sudo iptables -A INPUT --source 192.168.0.102 -j DROP

5) Now observe the following after a few seconds:

  • The switch closes the owner connection (.101).
  • A new mastership negotiation starts in the remaining node (.102).
  • The switch closes the remaining connection (.102).

The result of the above is a stale switch entry in the operational datastore; the only workaround is to reboot the entire cluster.
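
The steps above can be sketched as a small script run on the switch host. This is only a minimal sketch, not the CSIT test itself: the function names, the `DRY_RUN` switch and the `unblock_controller` cleanup helper are assumptions added for illustration, while the addresses, the switch name and the commands are the ones from the steps. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of reproduction steps 2-4. Helper names and DRY_RUN are hypothetical.
# Which node is actually the owner must be checked first; .101 is only the
# example used in the steps.
OWNER_IP="${OWNER_IP:-192.168.0.101}"
REMAINING_IP="${REMAINING_IP:-192.168.0.102}"
BRIDGE="${BRIDGE:-s1}"
PORT="${PORT:-6633}"

# With DRY_RUN=1 (the default) the command is printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: point the switch at two of the three cluster nodes.
connect_switch() {
    run sudo ovs-vsctl set-controller "$BRIDGE" "tcp:$OWNER_IP:$PORT" "tcp:$REMAINING_IP:$PORT"
}

# Steps 3 and 4: drop traffic arriving from a controller (controller to
# switch direction, hence the INPUT chain on the switch host).
block_controller() {
    run sudo iptables -A INPUT --source "$1" -j DROP
}

# Cleanup (not part of the original steps): delete the matching DROP rule.
unblock_controller() {
    run sudo iptables -D INPUT --source "$1" -j DROP
}

reproduce() {
    connect_switch
    # In a real run, wait here until the switch is connected and an owner
    # has been elected before blocking anything.
    block_controller "$OWNER_IP"
    block_controller "$REMAINING_IP"
}
```

Calling `reproduce` prints the three commands in order; setting `DRY_RUN=0` executes them. The second block is issued right after the first on purpose, since the problem depends on the remaining connection closing while the new mastership negotiation is still in progress.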

BR/Luis



 Comments   
Comment by Luis Gomez [ 16/May/17 ]

Also raising this to blocker: this issue is not seen in Boron, so we should fix it if we can.

Comment by Miroslav Macko [ 16/May/17 ]

Hello Luis,

I am not able to reproduce it locally on stable/carbon following your instructions.

I have checked /restconf/operational/network-topology:network-topology after blocking both controllers: the switch closes the connections, and operational is clean.

I have tried blocking the SLAVE right after the MASTER without waiting for a new MASTER to be elected, and also after waiting for a new MASTER to be elected. The topology was cleaned up both times.

I have also checked the Jenkins logs.
Is "Check No Switches After Disconnect" the test that is failing?
It probably has no impact, but it looks like there are more switches connected.
I only want to be sure that I am looking in the right place.

Thanks,
Miro

Comment by Luis Gomez [ 16/May/17 ]

Hi Miroslav, the issue is still there, see this recent build:

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-gate-clustering-only-carbon/49/archives/log.html.gz#s1-s7

I also realized that sometimes the switch is cleared from operational but not from entity owner, so when the switch reconnects it does not work.

Please repeat the steps above and, after you have disconnected the slave member just after the master member, check the entity owner API (I always see the slave as master in entity owner when it is not connected).
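
A minimal sketch of that check, assuming the controller's RESTCONF interface listens on port 8181 and accepts the `admin:admin` credentials (both are assumptions about the test setup, as are the helper names). It reads the entity owner and network topology data from the operational datastore so the two can be compared after both connections are blocked.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the operational datastore.
# Port and credentials are assumptions; override them for your setup.
ODL_PORT="${ODL_PORT:-8181}"
ODL_AUTH="${ODL_AUTH:-admin:admin}"

# $1 = controller host, $2 = path below /restconf/operational/
operational_url() {
    echo "http://$1:$ODL_PORT/restconf/operational/$2"
}

# Entity ownership as seen by the given controller.
get_entity_owners() {
    curl -s -u "$ODL_AUTH" "$(operational_url "$1" entity-owners:entity-owners)"
}

# Operational topology as seen by the given controller.
get_topology() {
    curl -s -u "$ODL_AUTH" "$(operational_url "$1" network-topology:network-topology)"
}
```

For example, `get_entity_owners 192.168.0.101` followed by `get_topology 192.168.0.101`: if `openflow:1` no longer appears in the topology output but is still listed with an owner in the entity owner output, the entry is stale.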

Also make sure you block controller-to-switch communication, not the other way around. If you have problems reproducing it, I can set up a call and share my desktop.

BR/Luis

Comment by Miroslav Macko [ 17/May/17 ]

Hello Luis,

Yes, I am following your instructions exactly. I am blocking controller-to-switch communication.

Operational is still clean. But you are right that entity owner is not.

http://10.0.42.201:8181/restconf/operational/entity-owners:entity-owners
— TYPE [org.opendaylight.mdsal.AsyncServiceCloseEntityType]
ID : /a:entity[a:name='openflow:1']
+-- OWNER:
— TYPE [org.opendaylight.mdsal.ServiceEntityType]
ID : /a:entity[a:name='openflow:1']
+-- OWNER:

Is this what you meant? Or do you check it another way?

Jozef will try to prepare a patch for it.

Thank you,
Miro

Comment by A H [ 17/May/17 ]

We are looking to build Carbon RC2 tomorrow, 5/18, at 23:59 UTC, assuming there are no blocker bugs. Is there an ETA for when a fix can be merged and this bug resolved for the stable/carbon branch?

Comment by Jozef Bacigal [ 17/May/17 ]

https://git.opendaylight.org/gerrit/#/c/57232/9

Here is the patch. Luis, can you test it, please?

Jozef

Comment by A H [ 17/May/17 ]

(In reply to Jozef Bacigal from comment #6)

This patch is failing to build in Jenkins and is still missing a +2 from committers.

Comment by Luis Gomez [ 17/May/17 ]

Let me test this, although the patch's test results do not look very good.

Comment by Luis Gomez [ 17/May/17 ]

OK guys, bad news and good news.

BAD:
The proposed patch: https://git.opendaylight.org/gerrit/#/c/57232 has serious issues as shown here: https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-1node-gate-flow-services-only-nitrogen/60/archives/log.html.gz

GOOD:
It looks like this other patch: https://git.opendaylight.org/gerrit/#/c/57096 fixed the issue.

So I am closing this bug.
