[OPNFLWPLUG-767] Switch connection bounce generates wrong entity owner in cluster env Created: 08/Sep/16  Updated: 27/Sep/21  Resolved: 09/Dec/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Luis Gomez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File karaf_log_device_connection_bounce.txt    
Issue Links:
Blocks
is blocked by MDSAL-197 Switch connection bounce results in w... Resolved
External issue ID: 6672

 Description   

Bouncing a switch connection produces a stale entry in the entity owner, and further connection requests are rejected.

Karaf log with DEBUG is attached.

BR/Luis



 Comments   
Comment by Luis Gomez [ 08/Sep/16 ]

Attachment karaf_log_device_connection_bounce.txt has been added with description: Karaf log

Comment by Luis Gomez [ 09/Sep/16 ]

To reproduce the above:

1) Start mininet with 1 switch (s1) pointing to any of the cluster instances (e.g. 192.168.0.101)
2) Run the following script on the mininet system to bounce the connection quickly:

#!/bin/bash
sudo ovs-vsctl del-controller s1
sleep 0.2
sudo ovs-vsctl set-controller s1 "tcp:192.168.0.101"
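
A repeated-bounce variant of the script above may make the problem easier to hit. This is only a sketch: the function names (run, bounce_controller), the loop count and the DRY_RUN switch are assumptions added for illustration and were not part of the original reproduction. Only the two ovs-vsctl commands and the 0.2 second sleep come from the script above.

```shell
#!/bin/bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: bounce the controller connection of a switch.
# Usage: bounce_controller SWITCH TARGET [COUNT] [DELAY_SECONDS]
bounce_controller() {
    bc_switch="$1"
    bc_target="$2"
    bc_count="${3:-1}"      # number of bounces, defaults to a single bounce
    bc_delay="${4:-0.2}"    # pause between del-controller and set-controller
    bc_i=0
    while [ "$bc_i" -lt "$bc_count" ]; do
        run sudo ovs-vsctl del-controller "$bc_switch"
        run sleep "$bc_delay"
        run sudo ovs-vsctl set-controller "$bc_switch" "$bc_target"
        bc_i=$((bc_i + 1))
    done
}
```

For example, `bounce_controller s1 "tcp:192.168.0.101" 10` would bounce s1 ten times; setting DRY_RUN=1 first only prints the commands, which is handy for checking the sequence before touching the switch.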

Comment by Tomas Slusny [ 13/Sep/16 ]

Posted a patch that will hopefully fix this issue: https://git.opendaylight.org/gerrit/#/c/45526 . Luis, can you recheck it?

Comment by Tomas Slusny [ 13/Sep/16 ]

So I partially fixed this issue with the patch I posted earlier, but I think there is a problem with Singleton (sometimes it sends SLAVE instead of MASTER), so I raised a bug in mdsal here: https://bugs.opendaylight.org/show_bug.cgi?id=6710 and added it as a blocker for this issue.

Comment by Tom Pantelis [ 15/Sep/16 ]

I see some NPEs:

2016-09-08 23:35:05,409 | ERROR | pool-31-thread-1 | ExecutionList | 65 - com.google.guava - 18.0.0 | RuntimeException while executing runnable com.google.common.util.concurrent.Futures$6@66487d12 with executor INSTANCE
java.lang.NullPointerException
at org.opendaylight.openflowplugin.impl.device.DeviceContextImpl.shutdownConnection(DeviceContextImpl.java:568)

but it looks like that is fixed by Tomas's patch.

The attached log is from member-1 which wasn't the EOS shard leader - member-2 was. When device connections were dropped, I see the candidates for the ServiceEntityType removed for member-3 and member-2. I don't see the candidate removed for member-1 - I assume that's not expected? Maybe the NPE prevented it? In any event, it seems member-1 remained as candidate and thus the owner. As far as the EOS is concerned, this is correct.

I don't see any evidence of an entity with an owner that isn't a candidate as we saw in the CI test a couple weeks ago. However I did find an issue that can result in that scenario that I fixed with https://git.opendaylight.org/gerrit/#/c/45516/.

Comment by Tomas Slusny [ 30/Sep/16 ]

Added another patch on Gerrit: https://git.opendaylight.org/gerrit/#/c/46321
It will hopefully make the unregistering process of cluster singleton services better and more stable, which means that connection bounce should also work better.

Comment by Tomas Slusny [ 07/Oct/16 ]

Gerrit: https://git.opendaylight.org/gerrit/#/c/46390/

This should fix all errors with fast connection and disconnection of a device. This patch depends on https://git.opendaylight.org/gerrit/#/c/45638/ in controller and https://git.opendaylight.org/gerrit/#/c/46175/ in mdsal. Once these 3 are merged, I think everything should be fine.

Comment by Luis Gomez [ 07/Oct/16 ]

Thanks Tomas for spending time on this.

Comment by Tomas Slusny [ 13/Oct/16 ]

So, both the mdsal (https://git.opendaylight.org/gerrit/#/c/46175/) and controller (https://git.opendaylight.org/gerrit/#/c/45638/) changes that were required to solve this issue (and other related issues) have been merged in master. Luis, if you can, could you test this connection bounce with my patch (https://git.opendaylight.org/gerrit/#/c/46390/) to see if it is really working?

Comment by Luis Gomez [ 24/Oct/16 ]

I am testing your patch today, I will let you know the results.

Comment by Shuva Jyoti Kar [ 08/Dec/16 ]

(In reply to Luis Gomez from comment #9)
> I am testing your patch today, I will let you know the results.

Luis, any updates on this?

Comment by Luis Gomez [ 09/Dec/16 ]

This is fixed now.

Generated at Wed Feb 07 20:33:20 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.