[OPNFLWPLUG-1114] Incorrect controller role after multiple connections from a single switch Created: 20/Oct/21  Updated: 02/Nov/21  Resolved: 02/Nov/21

Status: Resolved
Project: OpenFlowPlugin
Component/s: clustering
Affects Version/s: Phosphorus
Fix Version/s: Phosphorus

Type: Bug Priority: High
Reporter: Sangwook Ha Assignee: Sangwook Ha
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File akka.factory.conf     File akka.initial.conf    

 Description   

In Phosphorus, the controller in a single-controller cluster does not become master when there are multiple OpenFlow channel connections from a single switch, even after the controller settings on the switch are reset and a single OpenFlow channel connection is established. This does not always happen, but it happens more often than not.

Steps to reproduce:

1. Start the controller
2. In the Karaf console enable odl-openflowplugin-flow-services-rest feature

feature:install odl-openflowplugin-flow-services-rest

3. Create an Open vSwitch bridge

sudo ovs-vsctl add-br s1

4. Watch the controller connection status

watch 'sudo ovs-vsctl --columns=target,role,status list controller'

5. Open another terminal and run the following commands, replacing <CONTROLLER_IP> with the IP address of the controller

CONTROLLER_IP="<CONTROLLER_IP>"
sudo ovs-vsctl set-controller s1 "tcp:$CONTROLLER_IP:6633" "tcp:$CONTROLLER_IP:6653"
sleep 30
sudo ovs-vsctl del-controller s1
sudo ovs-vsctl set-controller s1 "tcp:$CONTROLLER_IP:6633"

During step 5 the controller's role does not become master (it stays at other or slave).

Normally, the following get-entities RPC returns the owner of each switch when a switch is connected to the controller.

curl --location --request POST 'http://<CONTROLLER_IP>:8181/rests/operations/odl-entity-owners:get-entities' \
                        --header 'Accept: application/yang-data+json' \
                        --header 'Content-Type: application/yang-data+json' \
                        --header 'Authorization: Basic YWRtaW46YWRtaW4='
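(The Authorization header above is HTTP Basic auth for the default admin:admin credentials; anyone adapting the request to other credentials can derive the token the same way:)

```python
import base64

# Basic auth token is just base64("<user>:<password>");
# for the default admin:admin credentials this yields the
# value used in the curl command above.
token = base64.b64encode(b"admin:admin").decode("ascii")
print(token)  # YWRtaW46YWRtaW4=
```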

For example,

{
  "odl-entity-owners:output": {
    "entities": [
      {
        "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
        "name": "ofp-topology-manager",
        "candidate-nodes": [
          "member-1"
        ],
        "owner-node": "member-1"
      },
      {
        "type": "org.opendaylight.mdsal.ServiceEntityType",
        "name": "openflow:81985529216486895",
        "candidate-nodes": [
          "member-1"
        ],
        "owner-node": "member-1"
      },
      {
        "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
        "name": "openflow:81985529216486895",
        "candidate-nodes": [
          "member-1"
        ],
        "owner-node": "member-1"
      },
      {
        "type": "org.opendaylight.mdsal.ServiceEntityType",
        "name": "ofp-topology-manager",
        "candidate-nodes": [
          "member-1"
        ],
        "owner-node": "member-1"
      }
    ]
  }
}
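(The numeric part of the openflow:&lt;n&gt; entity name is the switch's datapath ID in decimal; converting it back to 16 hex digits makes it easy to match against the datapath-id reported by ovs-vsctl. A small sketch, assuming the usual OpenFlowPlugin node-id convention:)

```python
# OpenFlowPlugin node-ids have the form "openflow:<dpid as decimal>".
# Formatting the decimal part as 16 hex digits recovers the familiar
# OVS datapath-id string.
node_id = "openflow:81985529216486895"
dpid = int(node_id.split(":", 1)[1])
print(format(dpid, "016x"))  # 0123456789abcdef
```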

However, after step 5 the RPC shows that the switch is not registered in EOS even though the switch is connected to the controller.

{
  "odl-entity-owners:output": {
    "entities": [
      {
        "type": "org.opendaylight.mdsal.ServiceEntityType",
        "name": "ofp-topology-manager",
        "candidate-nodes": [
          "member-1"
        ],
        "owner-node": "member-1"
      },
      {
        "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
        "name": "ofp-topology-manager",
        "candidate-nodes": [
          "member-1"
        ],
        "owner-node": "member-1"
      }
    ]
  }
}
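For scripted checks (e.g. in CSIT), the symptom can be detected by parsing the get-entities response; a minimal sketch, assuming the JSON structure shown above (the connected_switches helper is hypothetical, not part of OpenFlowPlugin):

```python
# Hypothetical helper: given the JSON body returned by the
# get-entities RPC, list the openflow:* entities registered in EOS.
# An empty result while a switch is connected is the symptom above.
def connected_switches(get_entities_output):
    entities = get_entities_output["odl-entity-owners:output"]["entities"]
    return {e["name"] for e in entities if e["name"].startswith("openflow:")}

# Sample response after step 5, where only the topology-manager
# entities remain and the switch entity is missing:
broken = {
    "odl-entity-owners:output": {
        "entities": [
            {"type": "org.opendaylight.mdsal.ServiceEntityType",
             "name": "ofp-topology-manager",
             "candidate-nodes": ["member-1"], "owner-node": "member-1"},
        ]
    }
}
print(connected_switches(broken))  # set()
```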

This is a regression causing CSIT test Bug_Validation/8723.robot to fail.



 Comments   
Comment by Tomas Cere [ 22/Oct/21 ]

Heya,

can you try to check whether this config helps in this case?

https://git.opendaylight.org/gerrit/c/controller/+/97973/3/opendaylight/md-sal/sal-clustering-config/src/main/resources/initial/factory-akka.conf#136

 

I was able to reproduce this locally, and with the above config applied I've seen the roles change to master on the OF channels. However, I'm not familiar with OpenFlow at all, so I can't confirm whether that's the correct behavior.

Comment by Sangwook Ha [ 22/Oct/21 ]

That doesn't seem to help. When I tried the configuration change, there was no change in the behavior.

After adding the configuration, I see the configuration appears in sal-clustering-config-4.0.3-factoryakkaconf.xml:

$ grep -A 3 distributed ./target/assembly/system/org/opendaylight/controller/sal-clustering-config/4.0.3/sal-clustering-config-4.0.3-factoryakkaconf.xml
    distributed-data {
      gossip-interval = 100 ms
      notify-subscribers-interval = 20 ms
    }

And after installing openflowplugin feature, in configuration/factory/akka.conf:

$ grep -A 3 distributed target/assembly/configuration/factory/akka.conf
    distributed-data {
      gossip-interval = 100 ms
      notify-subscribers-interval = 20 ms
    }

But it was not changing to master.

Comment by Tomas Cere [ 25/Oct/21 ]

Default phosphorus behavior for me:

target : "tcp:127.0.0.1:6653"
role : other
status : {last_error="End of file", sec_since_connect="0", sec_since_disconnect="1", state=ACTIVE}
target : "tcp:127.0.0.1:6633"
role : other
status : {last_error="End of file", sec_since_connect="1", sec_since_disconnect="0", state=BACKOFF}

Doesn't change; both are stuck on other.

{
    "odl-entity-owners:output": {
        "entities": [
            {
                "type": "org.opendaylight.mdsal.ServiceEntityType",
                "name": "ofp-topology-manager",
                "candidate-nodes": [
                    "member-1"
                ],
                "owner-node": "member-1"
            },
            {
                "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
                "name": "ofp-topology-manager",
                "candidate-nodes": [
                    "member-1"
                ],
                "owner-node": "member-1"
            }
        ]
    }
}

With the config applied this is the output from the watch command:

target              : "tcp:127.0.0.1:6653"
role                : other
status              : {last_error="End of file", sec_since_connect="1", sec_since_disconnect="0", state=BACKOFF}

target              : "tcp:127.0.0.1:6633"
role                : master
status              : {last_error="End of file", sec_since_connect="0", sec_since_disconnect="1", state=ACTIVE}

and

target : "tcp:127.0.0.1:6653"
role : master
status : {last_error="End of file", sec_since_connect="0", sec_since_disconnect="1", state=ACTIVE}
target : "tcp:127.0.0.1:6633"
role : other
status : {last_error="End of file", sec_since_connect="1", sec_since_disconnect="0", state=BACKOFF}

They seem to switch the master role between the two connections, which looks correct to me?

Also they can be seen in get-entities as well:

{
    "odl-entity-owners:output": {
        "entities": [
            {
                "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
                "name": "openflow:2439020958528",
                "candidate-nodes": [
                    "member-1"
                ],
                "owner-node": "member-1"
            },
            {
                "type": "org.opendaylight.mdsal.ServiceEntityType",
                "name": "openflow:2439020958528",
                "candidate-nodes": [
                    "member-1"
                ],
                "owner-node": "member-1"
            },
            {
                "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
                "name": "ofp-topology-manager",
                "candidate-nodes": [
                    "member-1"
                ],
                "owner-node": "member-1"
            },
            {
                "type": "org.opendaylight.mdsal.ServiceEntityType",
                "name": "ofp-topology-manager",
                "candidate-nodes": [
                    "member-1"
                ],
                "owner-node": "member-1"
            }
        ]
    }
}

I also attempted to verify (to see what's the correct behavior with the old entity-ownership service) on Silicon, and there it looks pretty much the same, with one difference:
one channel keeps switching between master/other, while the other one remains other the entire time.
This looks to me like a difference in the owner-picking strategy used by the two implementations, but it shouldn't cause issues
since it shouldn't matter which channel is picked as the master?

 

Can you please attach the entire contents of both <distribution>/configuration/factory/akka.conf and <distribution>/configuration/initial/akka.conf?

 

Comment by Sangwook Ha [ 25/Oct/21 ]

Attached two files:

  • akka.factory.conf: configuration/factory/akka.conf
  • akka.initial.conf: configuration/initial/akka.conf

The OpenFlow plugin drops the existing connection and accepts the new one, so the expected behavior would be the master role going back and forth between the two connections. But in this case there is no stable connection anyway, so I think the behavior while the two connections are competing is not very critical.

Comment by Sangwook Ha [ 26/Oct/21 ]

I tried this again after upgrading the controller to 4.0.4 (along with the other MRI upstreams), and now I see the controller's role change to master.

Previously I used 4.0.3 with the Akka configuration file updated with the distributed-data parameter settings. Are there any other changes between 4.0.3 and 4.0.4 that could affect this behavior?

Comment by Tomas Cere [ 26/Oct/21 ]

The config you attached is incorrect: the distributed-data section needs to be nested inside the cluster section, like this:

cluster {
      seed-node-timeout = 12s

      # Following is an excerpt from Akka Cluster Documentation
      # link - http://doc.akka.io/docs/akka/snapshot/java/cluster-usage.html
      # Warning - Akka recommends against using the auto-down feature of Akka Cluster in production.
      # This is crucial for correct behavior if you use Cluster Singleton or Cluster Sharding,
      # especially together with Akka Persistence.
      allow-weakly-up-members = on
      use-dispatcher = cluster-dispatcher
      failure-detector.acceptable-heartbeat-pause = 3 s

      distributed-data {
        gossip-interval = 100 ms
        notify-subscribers-interval = 20 ms
      }
    }

 

 

Comment by Sangwook Ha [ 26/Oct/21 ]

Oh, I missed that. I just retried the configuration with controller 4.0.3, and confirmed that it works.

Comment by Tomas Cere [ 02/Nov/21 ]

Already fixed with https://jira.opendaylight.org/browse/CONTROLLER-2004

Comment by Sangwook Ha [ 02/Nov/21 ]

The controller version has been upgraded to 4.0.5, and the CSIT tests are passing.

Generated at Wed Feb 07 20:34:14 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.