[OPNFLWPLUG-1114] Incorrect controller role after multiple connections from a single switch Created: 20/Oct/21 Updated: 02/Nov/21 Resolved: 02/Nov/21 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | clustering |
| Affects Version/s: | Phosphorus |
| Fix Version/s: | Phosphorus |
| Type: | Bug | Priority: | High |
| Reporter: | Sangwook Ha | Assignee: | Sangwook Ha |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
In Phosphorus, the controller in a single-controller cluster does not become master when there are multiple OpenFlow channel connections from a single switch, even after the controller settings on the switch are reset and a single OpenFlow channel connection is re-established. This does not happen every time, but it happens more often than not.

Steps to reproduce:

1. Start the controller.
2. Install the OpenFlow plugin feature:
feature:install odl-openflowplugin-flow-services-rest
3. Create an Open vSwitch bridge:
sudo ovs-vsctl add-br s1
4. Watch the controller connection status:
watch 'sudo ovs-vsctl --columns=target,role,status list controller'
5. Open another terminal and run the following commands, replacing <CONTROLLER_IP> with the IP address of the controller:
CONTROLLER_IP="<CONTROLLER_IP>"
sudo ovs-vsctl set-controller s1 "tcp:$CONTROLLER_IP:6633" "tcp:$CONTROLLER_IP:6653"
sleep 30
sudo ovs-vsctl del-controller s1
sudo ovs-vsctl set-controller s1 "tcp:$CONTROLLER_IP:6633"

During step 5 the controller's role does not become master (only other or slave). Normally, the following get-entities RPC returns the owner of each switch when a switch is connected to the controller:
curl --location --request POST 'http://<CONTROLLER_IP>:8181/rests/operations/odl-entity-owners:get-entities' \
--header 'Accept: application/yang-data+json' \
--header 'Content-Type: application/yang-data+json' \
--header 'Authorization: Basic YWRtaW46YWRtaW4='
For example:
{
"odl-entity-owners:output": {
"entities": [
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"name": "openflow:81985529216486895",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"name": "openflow:81985529216486895",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
}
]
}
}
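For a quick check, the switch entries can be filtered out of this response on the command line. This is a minimal sketch, assuming jq is available and the default admin:admin credentials (the Base64 Authorization header above encodes admin:admin):
CONTROLLER_IP="<CONTROLLER_IP>"
curl -s --request POST "http://$CONTROLLER_IP:8181/rests/operations/odl-entity-owners:get-entities" \
  --header 'Accept: application/yang-data+json' \
  --header 'Content-Type: application/yang-data+json' \
  --user admin:admin \
  | jq '."odl-entity-owners:output".entities[] | select(.name | startswith("openflow:"))'
In the healthy state this prints the two openflow:<datapath-id> entities shown above; in the failure case below it prints nothing.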
However, after step 5 the RPC output shows that the switch is not registered in the Entity Ownership Service (EOS), even though the switch is connected to the controller:
{
"odl-entity-owners:output": {
"entities": [
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
}
]
}
}
This is a regression causing CSIT test Bug_Validation/8723.robot to fail. |
| Comments |
| Comment by Tomas Cere [ 22/Oct/21 ] |
|
Heya, can you try to check whether this config helps in this case?
I was able to reproduce this locally, and with the above config applied I saw the roles change to master on the OF channels. However, I'm not familiar with OpenFlow at all, so I can't confirm whether that's the correct behavior. |
| Comment by Sangwook Ha [ 22/Oct/21 ] |
|
That doesn't seem to help: when I tried the configuration change, there was no change in the behavior. After adding the configuration, it appears in sal-clustering-config-4.0.3-factoryakkaconf.xml:
$ grep -A 3 distributed ./target/assembly/system/org/opendaylight/controller/sal-clustering-config/4.0.3/sal-clustering-config-4.0.3-factoryakkaconf.xml
distributed-data {
gossip-interval = 100 ms
notify-subscribers-interval = 20 ms
}
And after installing the openflowplugin feature, in configuration/factory/akka.conf:
$ grep -A 3 distributed target/assembly/configuration/factory/akka.conf
distributed-data {
gossip-interval = 100 ms
notify-subscribers-interval = 20 ms
}
But the role was still not changing to master. |
| Comment by Tomas Cere [ 25/Oct/21 ] |
|
Default Phosphorus behavior for me:
target : "tcp:127.0.0.1:6653"
role   : other
status : {last_error="End of file", sec_since_connect="0", sec_since_disconnect="1", state=ACTIVE}
target : "tcp:127.0.0.1:6633"
role   : other
status : {last_error="End of file", sec_since_connect="1", sec_since_disconnect="0", state=BACKOFF}
This doesn't change; both connections stay stuck on "other". The get-entities output:
{
"odl-entity-owners:output": {
"entities": [
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
}
]
}
}
With the config applied, this is the output from the watch command:
target : "tcp:127.0.0.1:6653"
role   : other
status : {last_error="End of file", sec_since_connect="1", sec_since_disconnect="0", state=BACKOFF}
target : "tcp:127.0.0.1:6633"
role   : master
status : {last_error="End of file", sec_since_connect="0", sec_since_disconnect="1", state=ACTIVE}
and
target : "tcp:127.0.0.1:6653"
role   : master
status : {last_error="End of file", sec_since_connect="0", sec_since_disconnect="1", state=ACTIVE}
target : "tcp:127.0.0.1:6633"
role   : other
status : {last_error="End of file", sec_since_connect="1", sec_since_disconnect="0", state=BACKOFF}
The master role seems to switch between the two connections, which looks correct to me. The switch entities can be seen in get-entities as well:
{
"odl-entity-owners:output": {
"entities": [
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"name": "openflow:2439020958528",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"name": "openflow:2439020958528",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
},
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"name": "ofp-topology-manager",
"candidate-nodes": [
"member-1"
],
"owner-node": "member-1"
}
]
}
}
I also attempted to verify this on Silicon (to see what the correct behavior is with the old entity-ownership service), and there it looks pretty much the same, with one difference.
Can you please attach the entire contents of both <distribution>/configuration/factory/akka.conf and <distribution>/configuration/factory/akka.conf?
|
| Comment by Sangwook Ha [ 25/Oct/21 ] |
|
Attached two files:
The OpenFlow plugin drops the existing connection and accepts the new one, so the expected behavior would be the master role going back and forth. But in this case there is no stable connection anyway, so I don't think the behavior while the two connections are competing is very critical. |
| Comment by Sangwook Ha [ 26/Oct/21 ] |
|
I tried this again after upgrading the controller to 4.0.4, along with the other MRI upstreams, and now I see the controller's role change to master. Previously, I used 4.0.3 with the Akka configuration file updated with the distributed-data parameter settings. Are there any other changes between 4.0.3 and 4.0.4 that could affect this behavior? |
| Comment by Tomas Cere [ 26/Oct/21 ] |
|
The config you attached is incorrect; the distributed-data section needs to be nested inside the cluster section, like this:
cluster {
seed-node-timeout = 12s
# Following is an excerpt from Akka Cluster Documentation
# link - http://doc.akka.io/docs/akka/snapshot/java/cluster-usage.html
# Warning - Akka recommends against using the auto-down feature of Akka Cluster in production.
# This is crucial for correct behavior if you use Cluster Singleton or Cluster Sharding,
# especially together with Akka Persistence.
allow-weakly-up-members = on
use-dispatcher = cluster-dispatcher
failure-detector.acceptable-heartbeat-pause = 3 s
distributed-data {
gossip-interval = 100 ms
notify-subscribers-interval = 20 ms
}
}
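After editing akka.conf, the nesting can be sanity-checked with the same grep approach used earlier in this thread, this time printing some leading context. A sketch; the exact surrounding lines will differ between distributions:
$ grep -B 10 'distributed-data' configuration/factory/akka.conf
The distributed-data block should show up underneath the cluster { line, not at the top level of the file.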
|
| Comment by Sangwook Ha [ 26/Oct/21 ] |
|
Oh, I missed that. I just retried the configuration with controller 4.0.3, and confirmed that it works. |
| Comment by Tomas Cere [ 02/Nov/21 ] |
|
Already fixed with https://jira.opendaylight.org/browse/CONTROLLER-2004 |
| Comment by Sangwook Ha [ 02/Nov/21 ] |
|
The controller version has been upgraded to 4.0.5:
CSIT tests are passing: |