[OPNFLWPLUG-591] [Clustering]: Openflow connections unstable with Lithium plugin Created: 11/Jan/16  Updated: 27/Sep/21  Resolved: 01/Apr/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Saibal Roy Assignee: Saibal Roy
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive 20-1-2016.zip     File Bugs.rar    
Issue Links:
Blocks
is blocked by OPNFLWPLUG-593 Openflow Clustering stabilization Resolved
External issue ID: 4925
Priority: High

 Description   

Build used :
===================
Karaf distro from latest ODL Beryllium master code

Test Type :
===================
Switch connection stability in a cluster.

Objective of test :
===================
Verify whether the switch connections are stable in a cluster deployment without using OF-HA.

Test Steps :
============
1. Bring up a healthy 3-node cluster, say c1, c2, and c3.
2. Bring up Mininet with the following command and connect to controllers c1 (10.183.181.41), c2 (10.183.181.42), and c3 (10.183.181.43). Each controller is connected to 20 switches, so there are 60 switches overall.
The command below is used to connect 20 switches to one controller.

sudo mn --custom /home/mininet/mininet/custom/mytopo.py --topo mytopo --controller remote,ip=10.183.181.41,port=6633 --switch ovsk,protocols=OpenFlow13

Note: mytopo.py is attached for quick reference.
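The fan-out over the three controllers in step 2 can be sketched as below. This is a hedged illustration, not the exact test harness: each Mininet instance is expected to run on its own Mininet VM, so the sketch only prints the per-controller command (drop the echo to actually run it on the respective host).

```shell
# Print the mn command for each controller node (c1, c2, c3 from the test
# setup). Each command attaches the 20-switch custom topology from
# mytopo.py to one controller; dropping "echo" would execute it.
print_mn_commands() {
  for ctrl in 10.183.181.41 10.183.181.42 10.183.181.43; do
    echo sudo mn --custom /home/mininet/mininet/custom/mytopo.py \
      --topo mytopo --controller remote,ip="$ctrl",port=6633 \
      --switch ovsk,protocols=OpenFlow13
  done
}
print_mn_commands
```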

3. Check whether the configured count of switches have established connections with the controller instances, using netstat:

netstat -pan | grep 6633 | grep ESTABLISHED | grep ovs-vswitchd | wc -l
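The check in step 3 can be wrapped into a small polling sketch so that connection flaps show up as count changes. The netstat pipeline is the one from step 3; the 10-iteration, once-per-second poll is an illustrative assumption, not part of the original test.

```shell
# Count ESTABLISHED OpenFlow sessions (port 6633) owned by ovs-vswitchd
# from netstat output supplied on stdin.
count_established() {
  grep 6633 | grep ESTABLISHED | grep ovs-vswitchd | wc -l
}

# Poll the count and report every change, so flapping connections are
# visible as a sequence of "count changed" lines.
watch_connections() {
  prev=""
  i=0
  while [ "$i" -lt 10 ]; do
    cur=$(netstat -pan 2>/dev/null | count_established)
    if [ "$cur" != "$prev" ]; then
      echo "connection count changed: ${prev:-n/a} -> $cur"
      prev=$cur
    fi
    sleep 1
    i=$((i + 1))
  done
}
```

On a stable deployment, `watch_connections` should print exactly one change (from n/a to the configured switch count) and then stay silent.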

Test Results:
=============

1. Even without any flow traffic or LLDP traffic, the switch connections keep flapping with respect to the controllers.
So the connections are not stable.
In a real scenario, this can trigger unnecessary reconciliation cycles with the switches.

2. The Operational Datastore reflects a completely different count of switches in the inventory shards (perhaps this is a side effect of BZ-4576 - https://bugs.opendaylight.org/show_bug.cgi?id=4576).

Attaching all the karaf logs of the 3 controller nodes and the netstat snapshot of switch connectivity.

Thanks & Regards,
Saibal Roy.



 Comments   
Comment by Saibal Roy [ 11/Jan/16 ]

Attachment Bugs.rar has been added with description: karaf logs of 3 controller and netstat snapshot of switch connectivity.

Comment by Michal Rehak [ 18/Jan/16 ]

Hi, I looked briefly over the attached logs and it seems to me that shard members 1 and 2 got stuck while connecting to the DS master. This resulted in many transaction submit failures and finally in a "too many open files" exception.

On member-3 there are conflicting transactions trying to write data, which results in optimistic lock exceptions. It looks like there are many reconnects on this node, and after each one there is a DS cleanup phase which probably went wrong once and from that point on keeps blocking new connections.

Could you retest and check the logs, before connecting OVS, for any DS-master connection related issues?

Could you also test with just 3 devices, to narrow down the possible causes?

Thank you.
(and please do not use rar compression - zip or tgz are much wider supported)
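The "too many open files" exception mentioned in the comment above can be checked for on a controller node with a quick descriptor-pressure sketch. This is an assumed diagnostic, not part of the reported test: the `karaf` pgrep pattern and the Linux /proc layout are assumptions, and the karaf JVM's own limit may differ from the shell's.

```shell
# Report the shell's open-file-descriptor limit.
fd_limit() {
  ulimit -n
}

# Count the open descriptors of the first process matching "karaf"
# (assumed pattern); prints 0 if no such process is found.
karaf_open_fds() {
  pid=$(pgrep -f karaf 2>/dev/null | head -n 1)
  if [ -n "$pid" ]; then
    ls "/proc/$pid/fd" 2>/dev/null | wc -l
  else
    echo 0
  fi
}

echo "fd limit: $(fd_limit), karaf open fds: $(karaf_open_fds)"
```

A karaf descriptor count creeping toward the limit during the reconnect storms would be consistent with the exception in the logs.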

Comment by Vratko Polak [ 19/Jan/16 ]

Off topic:

> please do not use rar compression

What about tar.xz?

Comment by Saibal Roy [ 20/Jan/16 ]

Attachment 20-1-2016.zip has been added with description: logs for switch stability connectivity

Comment by Saibal Roy [ 20/Jan/16 ]

Hi,
The following behavior was observed.

I brought the 3 controllers up and saw from JConsole that member-1 had become the leader. Please find the details in the attached logs.

1. With 1 switch connected per controller (3 switches total), the switch connectivity was not lost. I checked for 30 minutes and the connectivity was persistent.

2. With 5 switches connected per controller (15 switches total), I again could not see the connectivity being lost. I observed the karaf logs and everything looked fine.

3. With 10 switches connected per controller (30 switches total), observing for 1 hour, I could see the connectivity being lost. In the logs I could also see OptimisticLockFailedException.

Attaching the logs for more details.

Thanks & Regards,
Saibal Roy.

Comment by Vaclav Demcak [ 02/Mar/16 ]

Could you please confirm whether the behavior changes with the current stable/lithium or stable/beryllium code base?

Comment by Muthukumaran Kothandaraman [ 07/Mar/16 ]

Hi Vaclav,

We ran this on Beryllium master as of 11-Jan-2016. Now that we have stable/beryllium, and a lot of water has flowed under the bridge since 11-Jan, we will retry this with the latest stable/beryllium and update the status to see how we can take this forward.

Comment by Muthukumaran Kothandaraman [ 01/Apr/16 ]

The latest Beryllium stable build was run with JDK 1.8, with the rest of the scenario remaining the same as above.

Observation : For the same count of 20 switches (OVS 2.3.2), this issue was not observed.

Closing this bug.

If we encounter scaling issues with an increased number of switches, we can treat that as a separate issue.

Hope we are in sync.

Generated at Wed Feb 07 20:32:52 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.