[OPNFLWPLUG-767] Switch connection bounce generates wrong entity owner in cluster env Created: 08/Sep/16 Updated: 27/Sep/21 Resolved: 09/Dec/16 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Luis Gomez | Assignee: | Luis Gomez |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| External issue ID: | 6672 | ||||||||
| Description |
|
Bouncing a switch connection leaves a stale entry in the entity owner, and further connection requests are rejected. A Karaf log at DEBUG level is attached. BR/Luis |
| Comments |
| Comment by Luis Gomez [ 08/Sep/16 ] |
|
Attachment karaf_log_device_connection_bounce.txt has been added with description: Karaf log |
| Comment by Luis Gomez [ 09/Sep/16 ] |
|
To reproduce the above: 1) Start Mininet with one switch (s1) pointing to any of the cluster instances (e.g. 192.168.0.101) #!/bin/bash |
| Comment by Tomas Slusny [ 13/Sep/16 ] |
|
Posted a patch that will hopefully fix this issue: https://git.opendaylight.org/gerrit/#/c/45526 . Luis, can you recheck it? |
| Comment by Tomas Slusny [ 13/Sep/16 ] |
|
So I partially fixed this issue with the patch I posted earlier, but I think there is a problem with Singleton (sometimes it sends SLAVE instead of MASTER), so I raised a bug in mdsal here: https://bugs.opendaylight.org/show_bug.cgi?id=6710 and added it as a blocker for this issue. |
| Comment by Tom Pantelis [ 15/Sep/16 ] |
|
I see some NPEs: 2016-09-08 23:35:05,409 | ERROR | pool-31-thread-1 | ExecutionList | 65 - com.google.guava - 18.0.0 | RuntimeException while executing runnable com.google.common.util.concurrent.Futures$6@66487d12 with executor INSTANCE but it looks like that is fixed by Tomas's patch. The attached log is from member-1, which wasn't the EOS shard leader; member-2 was. When the device connections were dropped, I see the candidates for the ServiceEntityType removed for member-3 and member-2. I don't see the candidate removed for member-1. I assume that's not expected? Maybe the NPE prevented it? In any event, it seems member-1 remained as a candidate and thus as the owner. As far as the EOS is concerned, this is correct. I don't see any evidence of an entity with an owner that isn't a candidate, as we saw in the CI test a couple of weeks ago. However, I did find an issue that can result in that scenario, which I fixed with https://git.opendaylight.org/gerrit/#/c/45516/. |
| Comment by Tomas Slusny [ 30/Sep/16 ] |
|
Added another patch on Gerrit: https://git.opendaylight.org/gerrit/#/c/46321 |
| Comment by Tomas Slusny [ 07/Oct/16 ] |
|
Gerrit: https://git.opendaylight.org/gerrit/#/c/46390/ This should fix all errors with fast connection and disconnection of a device. This patch depends on the patch https://git.opendaylight.org/gerrit/#/c/45638/ in controller and the patch https://git.opendaylight.org/gerrit/#/c/46175/ in mdsal. Once all three are merged, I think everything should be fine. |
| Comment by Luis Gomez [ 07/Oct/16 ] |
|
Thanks, Tomas, for spending time on this |
| Comment by Tomas Slusny [ 13/Oct/16 ] |
|
So, both the mdsal (https://git.opendaylight.org/gerrit/#/c/46175/) and controller (https://git.opendaylight.org/gerrit/#/c/45638/) changes that were required to solve this issue (and other related issues) have been merged into master. Luis, can you test the connection bounce with my patch (https://git.opendaylight.org/gerrit/#/c/46390/) to confirm that it really works? |
| Comment by Luis Gomez [ 24/Oct/16 ] |
|
I am testing your patch today; I will let you know the results. |
| Comment by Shuva Jyoti Kar [ 08/Dec/16 ] |
|
(In reply to Luis Gomez from comment #9) Luis, any updates on this? |
| Comment by Luis Gomez [ 09/Dec/16 ] |
|
This is fixed now. |