[OPNFLWPLUG-758] Li plugin: Controller rejects switch connections Created: 26/Aug/16 Updated: 27/Sep/21 Resolved: 08/Sep/16

| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Luis Gomez | Assignee: | Jozef Bacigal |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| External issue ID: | 6554 |
| Description |
As this test shows, the list of candidates varies between 1, 2, and 3 when the switch connects to only 1 or 2 members. BR/Luis
| Comments |
| Comment by Luis Gomez [ 30/Aug/16 ] |
I changed the title of the bug because I think the entity-owner instability is caused by a quick switch disconnect + connect transition. Looking at this log:

2016-08-30 08:47:41,330 | INFO | entLoopGroup-5-1 | SystemNotificationsListenerImpl | 207 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | ConnectionEvent: Connection closed by device, Device:/192.168.0.20:56480, NodeId:null

A hello packet is received before the device is fully removed from the controller. The hello packet stops the device unregistration, and at the same time the switch connection gets rejected because the device is still registered. This makes the switch connection unstable.
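The disconnect/reconnect race described in this comment can be modeled in a few lines. This is a minimal illustrative sketch, not the plugin's actual Java code; the class and method names (`DeviceManager`, `on_hello`, `finish_teardown`) are invented for the example:

```python
# Hypothetical model of the race: a HELLO from a reconnecting switch
# arrives before the old device context finishes its async teardown,
# so the new connection is rejected. Names here are illustrative only.

class DeviceManager:
    def __init__(self):
        self.registered = set()      # node ids with a live device context
        self.tearing_down = set()    # node ids whose cleanup is in flight

    def disconnect(self, node_id):
        """Device closed the connection; cleanup runs asynchronously."""
        if node_id in self.registered:
            self.tearing_down.add(node_id)

    def finish_teardown(self, node_id):
        """Async cleanup completes: device is fully unregistered."""
        self.tearing_down.discard(node_id)
        self.registered.discard(node_id)

    def on_hello(self, node_id):
        """Handle a HELLO; reject if the old context still exists."""
        if node_id in self.registered:
            return "rejected: device still registered"
        self.registered.add(node_id)
        return "accepted"


mgr = DeviceManager()
assert mgr.on_hello("openflow:1") == "accepted"

# Fast disconnect + reconnect: the HELLO beats the async teardown.
mgr.disconnect("openflow:1")
print(mgr.on_hello("openflow:1"))   # rejected: device still registered

# Once teardown completes, the same switch can connect again.
mgr.finish_teardown("openflow:1")
print(mgr.on_hello("openflow:1"))   # accepted
```

The sketch shows why a delay between stop and start masks the bug: given enough time, teardown finishes before the next HELLO arrives.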
| Comment by Luis Gomez [ 31/Aug/16 ] |
This same issue is seen often during the cluster test, and the result is always an unstable controller with stale entity-owner, inventory, and topology entries when the switch in reality no longer exists.

2016-08-31 00:21:20,553 | INFO | entLoopGroup-7-4 | ConnectionAdapterImpl | 270 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.8.0.Boron-RC1 | Hello received / branch
| Comment by Luis Gomez [ 31/Aug/16 ] |
Raising this to blocker; we cannot afford an unstable OpenFlow cluster in Boron.
| Comment by Luis Gomez [ 01/Sep/16 ] |
The easiest way to reproduce this issue is to stop and start mininet with no delay a few times. After a while you will see stale entries in entity-owner, topology, and inventory.
| Comment by Luis Gomez [ 01/Sep/16 ] |
The ERRORs described in this bug show up very often in 2 scenarios:

1) OpenFlow cluster test: that is why the test suite is not stable.
2) Switch scalability test, single node: that is why we do not see good switch scalability.

In general this issue is more reproducible:
| Comment by A H [ 02/Sep/16 ] |
Is there an ETA for this bug, and is someone assigned to fix it?
| Comment by Colin Dixon [ 02/Sep/16 ] |
Do we think this bug is related to (or caused by) BUG-6540?
| Comment by Luis Gomez [ 02/Sep/16 ] |
No, this bug occurs without cluster member isolation, and even in the single-instance test. The bug you point to seems to be related to https://bugs.opendaylight.org/show_bug.cgi?id=6177, which is not a blocker (it is a major bug).
| Comment by Luis Gomez [ 05/Sep/16 ] |
I also see this ERROR very often prior to the device reject:

2016-09-03 20:03:06,703 | WARN | pool-32-thread-1 | RoleContextImpl | 183 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | New role BECOMESLAVE was not propagated to device openflow:160 during 10 sec fail . Reason java.util.concurrent.CancellationException: Task was cancelled.

The above ERROR is fairly common in the single-node switch scalability and longevity tests. In the cluster test, there is an easy way to reproduce it:

1) start mininet with 1 switch (s1) pointing to any of the instances (e.g. 192.168.0.101)

#!/bin/bash
| Comment by Jozef Bacigal [ 06/Sep/16 ] |
| Comment by A H [ 06/Sep/16 ] |
To better assess the impact of this bug and its fix, could someone from your team please help us identify the following:
| Comment by Luis Gomez [ 06/Sep/16 ] |
After this merge I see a more stable cluster test, but still some sporadic issues in the switch scalability test. So far we cannot close this issue, but maybe we can reduce its importance if the cluster behaves more stably (we need a few more runs to confirm). BR/Luis
| Comment by Luis Gomez [ 07/Sep/16 ] |
OK, after a day of testing I downgraded the importance from Blocker to Major. The reason is that the cluster test seems stable now. The switch scalability test, even though it still shows the issue sometimes, has stabilized at its maximum switch number. BR/Luis
| Comment by Colin Dixon [ 07/Sep/16 ] |
So, can we note the maximum number of switch connections somewhere? I guess that would be part of the performance report, which we'll hopefully have time to update...
| Comment by A H [ 08/Sep/16 ] |
Has this bug been verified as fixed in the latest Boron RC 3.1 build?
| Comment by Jozef Bacigal [ 08/Sep/16 ] |
Closing this as solved and creating a new one for the switch scalability test. Jozef