[OPNFLWPLUG-588] [Clustering]: Switch state resync is not happening after controller restart [Routed RPC issue] Created: 04/Jan/16  Updated: 27/Sep/21  Resolved: 11/Feb/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Anil Gujele
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File resyncFailed.rar    
Issue Links:
Duplicate
is duplicated by CONTROLLER-1467 Clustering: Flows are not re-installe... Resolved
External issue ID: 4866
Priority: Highest

 Description   

Build used :
===================
Karaf distro from latest ODL Beryllium master code

Test Type :
===================
switch state resync after cluster node restart.

Objective of test :
===================
verify switch resync when node is restarted.

Test Steps :
============
1. Bring up a healthy 3-node cluster (c1, c2, and c3), with c1 as leader.
2. Connect the OVS switch to node c3. c3 is the entity owner, as shown in the attached snapshot.
3. Push 10 flows using the flow-config-blaster script from follower c3.
4. Verify that the switch, config DS, and operational DS each show 10 flows.
5. Restart node c3.
6. Connect the OVS switch to node c3. c3 is the entity owner, as shown in the attached snapshot.
7. The switch shows 2 flows.
8. The config DS shows 10 flows and the operational DS shows 2 flows.

Note: The flow counts seen on the switch and in the operational DS in steps 7 and 8 are not consistent between runs. In 4 runs, I saw 4, 0, 10, and 2 flows.
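
The pass/fail condition implied by steps 4, 7, and 8 can be sketched as a small check, assuming the three flow counts have already been collected by hand. The names FlowCounts and resync_mismatches are hypothetical and are not part of OpenDaylight or the flow-config-blaster script.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FlowCounts:
    """Flow counts observed at one point in the test (hypothetical helper)."""
    switch: int          # flows reported by the OVS switch
    config_ds: int       # flows in the config datastore
    operational_ds: int  # flows in the operational datastore


def resync_mismatches(counts: FlowCounts) -> List[str]:
    """Describe every view that disagrees with the config DS."""
    problems = []
    if counts.switch != counts.config_ds:
        problems.append(
            f"switch has {counts.switch} flows, config DS has {counts.config_ds}"
        )
    if counts.operational_ds != counts.config_ds:
        problems.append(
            f"operational DS has {counts.operational_ds} flows, "
            f"config DS has {counts.config_ds}"
        )
    return problems


# Counts taken from the report: step 4, then steps 7 and 8.
before_restart = FlowCounts(switch=10, config_ds=10, operational_ds=10)
after_restart = FlowCounts(switch=2, config_ds=10, operational_ds=2)

print(resync_mismatches(before_restart))
print(resync_mismatches(after_restart))
```

An empty list means the resync succeeded. Any entry means the switch or the operational DS did not converge to the config DS.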

Controllers (to cross-check logs):
===================================
c1 - Controller 1 with IP 10.183.181.41 - config-inventory-shard leader
c2 - Controller 2 with IP 10.183.181.42 - config-inventory-shard follower
c3 - Controller 3 with IP 10.183.181.43 - config-inventory-shard follower

Enclosed Logs:
==============
c1.karaf.log for controller c1
c2.karaf.log for controller c2
c3.karaf.log for controller c3
snapshot for entity owner from c3
snapshot for flows in OVS switch



 Comments   
Comment by Anil Gujele [ 04/Jan/16 ]

Attachment resyncFailed.rar has been added with description: attached snapshot and logs from the c1, c2, and c3 nodes.

Comment by Muthukumaran Kothandaraman [ 04/Jan/16 ]

In the description above, "leader" indicates the shard leader of the inventory-config shard and "follower" indicates the follower(s) of the inventory-config shard.

Comment by Tom Pantelis [ 07/Jan/16 ]

Looks like this should be filed against the openflow or ovsdb project.

Comment by Anil Gujele [ 28/Jan/16 ]

changed product from controller to openflowplugin

Comment by Anil Vishnoi [ 30/Jan/16 ]

Tom/Moiz,

This is another scenario of Routed RPC failure, which we are discussing through some other bugs.

Comment by Anil Vishnoi [ 05/Feb/16 ]

Hi Muthu,

I pushed the following patch to openflowplugin that should solve or work around this issue. The patch avoids routed RPC by using clustering DCN plus local RPC registration.

https://git.opendaylight.org/gerrit/#/c/34115/

Can you please test with this patch and see if it works for you?
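
One way to picture the approach described in this comment is a toy model, assuming that every controller node receives the config change notification and that only the node holding a locally registered RPC for the switch acts on it. This is a conceptual sketch, not the code in the patch. ControllerNode, register_local_rpc, on_config_change, publish_config_change, and the switch id "openflow:1" are all hypothetical names.

```python
class ControllerNode:
    """Toy stand-in for one cluster member (hypothetical, not an ODL class)."""

    def __init__(self, name):
        self.name = name
        self.local_rpcs = {}  # switch id -> handler registered on this node only
        self.installed = []   # (switch id, flow) pairs this node pushed

    def register_local_rpc(self, switch_id):
        """Register a local handler, as the node connected to the switch would."""
        self.local_rpcs[switch_id] = (
            lambda flow: self.installed.append((switch_id, flow))
        )

    def on_config_change(self, switch_id, flow):
        """Handle a change notification; act only if the RPC is local."""
        handler = self.local_rpcs.get(switch_id)
        if handler is None:
            return False  # another node owns this switch, so do nothing
        handler(flow)     # plain local call, no remote routing involved
        return True


def publish_config_change(nodes, switch_id, flow):
    """Deliver the notification to every node; return the names that acted."""
    return [n.name for n in nodes if n.on_config_change(switch_id, flow)]


c1, c2, c3 = ControllerNode("c1"), ControllerNode("c2"), ControllerNode("c3")
c3.register_local_rpc("openflow:1")  # the switch is connected to c3

print(publish_config_change([c1, c2, c3], "openflow:1", "flow-1"))
```

In this model no node needs to look up a remote route at all, so a route that is registered late can no longer cause flows to be dropped.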

Comment by Muthukumaran Kothandaraman [ 05/Feb/16 ]

Patch looks fine to me, Anil. We will pick this up and test it. So, we have fully eliminated the need to force reconciliation through routed RPC.

Comment by Anil Vishnoi [ 05/Feb/16 ]

Yes, and I think it's probably better in terms of performance. Hopefully you will see some improvement in flows/second in a clustered setup, given that we are avoiding remote RPC now and assuming that ClusteredData(Change/Tree)Listener doesn't create much of a problem. Once we get rid of DataChangeListener and use TreeListener, things might improve further.

Comment by Tom Pantelis [ 05/Feb/16 ]

Although Anil's patch removes the use of routed RPCs in OF, we should still fix the timing issue with RPCs, so I submitted https://git.opendaylight.org/gerrit/#/c/34175/ to add wait/retries in the RPC code.
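
The wait/retry idea mentioned here can be sketched generically, assuming the failure shows up as an error meaning "no route registered yet". This is not the code in the Gerrit change. RouteNotReady, invoke_with_retries, and make_flaky_rpc are hypothetical names used only for illustration.

```python
import time


class RouteNotReady(Exception):
    """Hypothetical error: no RPC route is registered for the target yet."""


def invoke_with_retries(rpc, attempts=3, wait_seconds=0.01, sleep=time.sleep):
    """Call rpc(); on RouteNotReady, wait and retry, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except RouteNotReady:
            if attempt == attempts:
                raise  # out of attempts, surface the failure to the caller
            sleep(wait_seconds)


def make_flaky_rpc(failures):
    """Build a fake RPC that fails `failures` times and then succeeds."""
    state = {"calls": 0}

    def rpc():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RouteNotReady("route not registered yet")
        return "reconciled"

    return rpc, state


rpc, state = make_flaky_rpc(failures=2)
print(invoke_with_retries(rpc, attempts=3, wait_seconds=0))
```

The attempt limit and wait time are arbitrary here. A real fix would choose them based on how long route registration takes after a node restart.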

Comment by Ryan Goulding [ 09/Feb/16 ]

Is this "Waiting for Review" now? It looks like Tom Pantelis pushed a patch to fix this.

Also, are we targeting stable/beryllium as well?

Comment by Ryan Goulding [ 09/Feb/16 ]

Spawning a separate bug for:

https://git.opendaylight.org/gerrit/#/c/34175/

This adds wait/retries in the RPC code.

Comment by Anil Gujele [ 11/Feb/16 ]

I have verified this with the latest code build.
The switch, config DS, and operational DS show the same number of flows after a follower node restart.

Generated at Wed Feb 07 20:32:52 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.