[OPNFLWPLUG-668] [Clustering] Switch state resync after cluster restart. Created: 05/Apr/16  Updated: 27/Sep/21  Resolved: 29/Jul/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Saibal Roy Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive 15switches.zip    
External issue ID: 5659
Priority: High

 Description   

Hi,

I was testing the cluster restart scenario with the latest Beryllium code and the Helium plugin.
Tested in a 3-node cluster with stable Beryllium + Helium plugin + JDK8 with G1GC. Tested with OVS 2.3.2.

Build used :
===================
Karaf distro from latest ODL stable Beryllium code

Objective of test :
===================
To validate the cluster restart and see if all the flows get configured in the switch.

Configuration and topology:
===========================

i. Controller VMs (c1, c2 and c3) are running on one Dell machine, say h1; each VM has 8 vCPUs and 16 GB RAM.

ii. Mininet VMs (m1, m2 and m3) with OVS version 2.3.2 are running on a different Dell machine, say h2; each VM has 8 vCPUs and 16 GB RAM.

m1 with 5 switches (1 to 5) connected to c1
m2 with 5 switches (6 to 10) connected to c2
m3 with 5 switches (11 to 15) connected to c3
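For reference, a topology like each Mininet VM's can be brought up with a single mn invocation; the controller IP below is a placeholder for the respective cluster node, and the linear topology is an assumption (the report does not state the topology used):

```shell
# On m1: 5 OVS switches in a linear topology, all pointed at c1.
# <c1-ip> is a placeholder; replace with the actual controller address.
sudo mn --topo linear,5 --controller remote,ip=<c1-ip>,port=6633 --switch ovsk
```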

Test Steps :
============
Pre-requisite: Pushed 10K flows into each of the 15 switches (150K flows in the cluster); the config DS shows 150K flows.
Note: all 15 switches were connected first, and then the flows were pushed.
So the cluster has 10K flows per switch and 5 switches per node: 15 switches in total, 150K flows configured in the 3-node cluster.
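The pre-requisite flow push was presumably done over RESTCONF; a hedged sketch of pushing a single flow to one switch (the controller address, credentials, node id, and match/action body are illustrative, not the exact ones used in this test):

```shell
# Illustrative: PUT one flow into table 0 of switch openflow:1 via c1's RESTCONF.
# <c1-ip> is a placeholder; admin:admin is the default ODL credential pair.
curl -u admin:admin -H "Content-Type: application/xml" -X PUT \
  http://<c1-ip>:8181/restconf/config/opendaylight-inventory:nodes/node/openflow:1/table/0/flow/1 \
  -d '<flow xmlns="urn:opendaylight:flow:inventory">
        <id>1</id>
        <table_id>0</table_id>
        <priority>2</priority>
        <match>
          <ethernet-match><ethernet-type><type>2048</type></ethernet-type></ethernet-match>
        </match>
        <instructions>
          <instruction><order>0</order>
            <apply-actions><action><order>0</order><drop-action/></action></apply-actions>
          </instruction>
        </instructions>
      </flow>'
```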

Steps:
i. Stop all the nodes c1, c2, c3.
ii. Disconnect the 15 switches from nodes c1, c2, c3.
iii. Reconnect the 15 switches to nodes c1, c2, c3.
iv. Start nodes c1, c2, c3.
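The steps above can be sketched as shell commands; the karaf path, switch name (s1) and bridge-per-switch layout are assumptions, and <c1-ip> is a placeholder:

```shell
# i. Stop the controller (run on each of c1, c2, c3):
$KARAF_HOME/bin/stop

# ii. Detach a switch from its controller (run on each Mininet VM, once per switch):
sudo ovs-vsctl del-controller s1

# iii. Point the switch back at its controller:
sudo ovs-vsctl set-controller s1 tcp:<c1-ip>:6633

# iv. Start the controller again (run on each of c1, c2, c3):
$KARAF_HOME/bin/start
```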

Observations
============
After the cluster restart we see that some switches have fewer than 10K flows. Out of the 15 switches, 6 have 0 flows and the remaining 9 show 10K flows.

Attaching the logs for more clarity.

Thanks & Regards,
Saibal Roy.



 Comments   
Comment by Saibal Roy [ 05/Apr/16 ]

Attachment 15switches.zip has been added with description: logs for Switch state resync after cluster restart

Comment by Muthukumaran Kothandaraman [ 05/Apr/16 ]

Hi Saibal,

Looking at logs and the symptoms you have observed, this could be a case where datastore may not be fully available (all persisted data restored consistently across the cluster) when switches reconnect.

Specifically in this case, the switches are constantly hunting for port 6633 to be opened on all cluster nodes. So, as soon as the ports open (rather prematurely), the switches pounce on the controller nodes like hungry tigers.

But at that juncture, the datastore is perhaps still coming up (restoring persisted state, etc.).

One quick way to verify is to add a Linux firewall rule, as part of karaf.sh, that blocks port 6633 for 3-5 minutes.

This can prevent the switches from connecting prematurely, before the datastore becomes fully "ready".
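A minimal sketch of that hack, assuming karaf.sh runs with root privileges (or can sudo) and that 6633 is the only OpenFlow listen port in use:

```shell
# Drop incoming OpenFlow connections before karaf starts listening.
iptables -I INPUT -p tcp --dport 6633 -j DROP

# Remove the rule after 4 minutes, in the background, so the switches can
# reconnect once the datastore has (hopefully) finished restoring.
( sleep 240 && iptables -D INPUT -p tcp --dport 6633 -j DROP ) &
```

The 240-second delay is just a point inside the suggested 3-5 minute window; it would need tuning to the actual datastore recovery time.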

If we can confirm clean behavior with this hack, we can discuss a cleaner solution for preventing 6633 from opening before all backends are in a clean "ready" state.

Regards
Muthu

Comment by Muthukumaran Kothandaraman [ 29/Jul/16 ]

To be retested on the latest Boron master with the Lithium plugin combination to re-establish this, mainly because the Boron release is moving with Lithium.

Generated at Wed Feb 07 20:33:04 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.