[OPNFLWPLUG-848] Controllers with no connectivity to the switch get device ownership Created: 05/Feb/17 Updated: 27/Sep/21 Resolved: 23/Mar/17 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Jon Castro | Assignee: | Anil Vishnoi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| External issue ID: | 7736 |
| Description |
|
When a switch connects to a controller, the controller registers a cluster singleton service and MDSAL clustering gives ownership to one of the controllers. All controllers are registered as candidates, so if all controllers that hold a valid connection to the switch are rebooted, another controller that has no connection to the switch gets the ownership. Reproducing this error is straightforward: in a three-node cluster, connect the switch to node A and node B (but not node C), then restart nodes A and B.
The ownership of the switch will move to node C. When node A and/or node B are active again with valid OpenFlow sessions, the ownership does not move back to one of the controllers with an active connection. Making an HTTP GET to the clustering service at http://bsc:8181/restconf/operational/entity-owners:entity-owners before restarting any controller will show that all 3 controllers are "candidates" for the switch even though only two have valid connections. For example, the returned candidate list contains entries such as { "name": "member-2" } and { "name": "member-3" }. The source of the problem is probably that "member-3" is listed as a valid candidate even though it does not hold any active OpenFlow connection to the switch. |
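For context, a minimal sketch of how a per-device cluster singleton registration looks against the Boron-era MD-SAL singleton API. The class and its wiring are illustrative (DeviceSingleton is not a real plugin class); only the ClusterSingletonService, ClusterSingletonServiceProvider, and ServiceGroupIdentifier types are from MD-SAL:

```java
// Illustrative sketch, not actual openflowplugin code: a per-switch
// cluster singleton registration with the Boron-era MD-SAL API.
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceRegistration;
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

public class DeviceSingleton implements ClusterSingletonService {
    // One service group per switch, e.g. "openflow:1". Every node that
    // registers a service for this identifier becomes an ownership candidate.
    private final ServiceGroupIdentifier identifier;

    public DeviceSingleton(String nodeId) {
        this.identifier = ServiceGroupIdentifier.create(nodeId);
    }

    @Override
    public ServiceGroupIdentifier getIdentifier() {
        return identifier;
    }

    @Override
    public void instantiateServiceInstance() {
        // Invoked on whichever node wins ownership, even if that node
        // has no OpenFlow session to the switch, which is this bug.
    }

    @Override
    public ListenableFuture<Void> closeServiceInstance() {
        return Futures.immediateFuture(null);
    }

    public ClusterSingletonServiceRegistration register(ClusterSingletonServiceProvider provider) {
        return provider.registerClusterSingletonService(this);
    }
}
```

Every node that calls registerClusterSingletonService() for the same ServiceGroupIdentifier becomes a candidate in entity-owners, which is why a node without an OpenFlow session can still win ownership.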
| Comments |
| Comment by Jon Castro [ 06/Feb/17 ] |
|
The OpenFlow plugin consumes the clustering singleton service. It seems the MDSAL clustering singleton service adds all controllers as candidates, independently of whether each controller registers with the service or not. So if just one controller creates a service, all controllers will appear as candidates. |
| Comment by Jon Castro [ 06/Feb/17 ] |
|
(In reply to Jon Castro from comment #1) Disregard my previous comment. The source of the problem is that there are two classes that register the cluster singleton service.
My first guess is that the DeviceMastership class should not register the cluster singleton service, because LifecycleServiceImpl is the one that knows when the connection to the switch is opened or closed. |
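To make the conflict concrete, a hypothetical sketch of the double registration described above (the class, method, and parameter names are invented for illustration; only the MD-SAL API calls are real):

```java
// Hypothetical sketch of the described conflict: two unrelated classes
// each register a cluster singleton under the same group identifier,
// so MD-SAL treats every registering node as a candidate for that id.
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;

public final class DuplicateRegistrationSketch {
    static void illustrate(ClusterSingletonServiceProvider provider,
                           ClusterSingletonService lifecycleService,    // registers when the session opens
                           ClusterSingletonService deviceMastership) {  // registers on a data store event
        // Both services report the same ServiceGroupIdentifier, e.g.
        // "openflow:1". A node that only saw the data store event (and
        // has no OpenFlow session) still becomes an ownership candidate.
        provider.registerClusterSingletonService(lifecycleService);
        provider.registerClusterSingletonService(deviceMastership);
    }
}
```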
| Comment by Jon Castro [ 06/Feb/17 ] |
|
Gerrit proposed change: DeviceMastership and LifecycleServiceImpl cannot both use the same cluster singleton name. In the following change we rename the cluster singleton used by DeviceMastership to openflow:<id>:frm https://git.opendaylight.org/gerrit/#/c/51489/ If the intent is for the Forwarding Rules Manager to be mounted on the same controller that holds mastership of the switch, then more changes will be required, such as tracking the cluster status of the switch. |
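In spirit, the proposed change amounts to something like the following sketch (illustrative, not the Gerrit patch verbatim; FrmGroupId is an invented name):

```java
// Sketch of the proposed rename: DeviceMastership stops sharing the
// device's singleton group and registers under its own ":frm" group.
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;

public final class FrmGroupId {
    static ServiceGroupIdentifier forDevice(String nodeId) {
        // Before: ServiceGroupIdentifier.create(nodeId)          yields "openflow:1"
        // After:  ServiceGroupIdentifier.create(nodeId + ":frm") yields "openflow:1:frm"
        return ServiceGroupIdentifier.create(nodeId + ":frm");
    }
}
```

With distinct group identifiers, ownership of the device and ownership of the FRM instance are elected independently, which is exactly the trade-off discussed in the next comment.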
| Comment by Anil Vishnoi [ 11/Feb/17 ] |
|
Using the same cluster singleton id is intentional, because we want FRM to run locally on the same controller instance that owns the device. It was done this way to use the local RPC registration and avoid routing RPC calls to another controller. Although this patch fixes the problem, it enables routing of RPC calls from the FRM owner instance to the device owner instance (when the FRM owner and the device owner differ). Routed RPCs add latency to flow installation and will create performance issues in a clustered environment. The root cause of this problem lies in the FRM clustering registration. Currently FRM does the clustering registration based on the data store change event for the node addition, but this notification goes to all controller instances, so FRM on all controller nodes ends up registering for ownership. Ideally FRM should only register on the instance where the device is connected. In my personal opinion, this is also a bug in the singleton clustering implementation: it should throw an error at registration time if the registration is happening on a node where the other services did not register. I also see that this is not easy to implement at the clustering service level. We need to fix the FRM registration to get the correct behavior. I will push a patch with the correct fix. |
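A hedged sketch of the fix direction described here, assuming a hypothetical ConnectionTracker helper that knows whether the local node holds the OpenFlow session; this illustrates the idea of gating registration on a local connection, not the actual patch:

```java
// Sketch of the fix direction: gate the FRM singleton registration on
// having a local OpenFlow session, instead of registering from the
// data-store-change listener on every cluster member.
// ConnectionTracker and isDeviceConnectedLocally() are hypothetical,
// not openflowplugin API.
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;

public final class FrmRegistrationSketch {
    interface ConnectionTracker {
        boolean isDeviceConnectedLocally(String nodeId);
    }

    static void onNodeAdded(String nodeId,
                            ClusterSingletonServiceProvider provider,
                            ClusterSingletonService frmService,
                            ConnectionTracker tracker) {
        // The node-added notification still fires on every cluster member,
        // but only the member with a live session registers as a candidate,
        // so FRM ownership stays co-located with device ownership.
        if (tracker.isDeviceConnectedLocally(nodeId)) {
            provider.registerClusterSingletonService(frmService);
        }
    }
}
```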
| Comment by Abhijit Kumbhare [ 23/Feb/17 ] |
|
Anil to move the patch to master and then close it. |
| Comment by Luis Gomez [ 23/Feb/17 ] |
|
Let me know when this is in master so I can adjust the system test for Boron/Carbon at once. |
| Comment by Anil Vishnoi [ 23/Feb/17 ] |
|
Master: https://git.opendaylight.org/gerrit/#/c/52225/ Luis, the above patch is for the master branch, so you can now fix the CSIT tests. |
| Comment by Anil Vishnoi [ 23/Feb/17 ] |
|
Stable/boron: https://git.opendaylight.org/gerrit/#/c/51489/7 |