[CONTROLLER-1478] Switch reconciliation is not happening after leader node restart in 3 node cluster Created: 29/Jan/16  Updated: 19/Oct/17  Resolved: 25/Feb/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Beryllium
Fix Version/s: None

Type: Bug
Reporter: Anil Gujele Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File resyncFailed3SwitchLeaderRestart.rar    
External issue ID: 5135

 Description   

Build used :
===================
Karaf distro from ODL Beryllium master code

Test Type :
===================
Switch reconciliation is not happening after leader node restart in 3 node cluster.

Objective of test :
===================
Verify switch reconciliation when the leader node is restarted.

Test Steps :
============
1. Bring up a healthy 3-node cluster.
2. Nodes c1, c2 and c3 are up. c1 is the leader for the config and operational inventory shards.
3. Push 1000 flows from leader node c1 to each of the 3 switches (s1, s2 and s3).
4. Connect 1 switch per node: switch s1 to c1, switch s2 to c2 and switch s3 to c3.
5. Total flows across the 3 switches is 3000.
6. Disconnect switch s1 from node c1.
7. Stop node c1.
8. Push 200 more flows from leader node c1 to the 3 switches.
9. Start node c1.
10. Node c3 is now the leader for the config inventory shard and c2 is the leader for the operational inventory shard.
11. Total flows in the config data store is 3000, but 3600 is expected.
12. Reconnect switch s1 to node c1.
13. Total flows in the reconnected switch s1 is 0, but 1200 flows are expected after resync.
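
The expected counts in steps 11 and 13 follow from the per-switch pushes. A minimal sketch of that arithmetic, assuming the 200 extra flows in step 8 are pushed per switch (as the 3600 total implies); all names are illustrative and not part of any ODL API:

```python
# Figures come from the test steps above; all names here are illustrative.
SWITCHES = ("s1", "s2", "s3")
INITIAL_FLOWS_PER_SWITCH = 1000  # step 3 (3000 in total, per step 5)
EXTRA_FLOWS_PER_SWITCH = 200     # step 8 (assumed to be per switch)


def expected_flows_per_switch() -> int:
    """Flows each switch should hold once both pushes are reconciled."""
    return INITIAL_FLOWS_PER_SWITCH + EXTRA_FLOWS_PER_SWITCH


def expected_flows_in_config_datastore() -> int:
    """Flows expected in the config data store across all switches."""
    return expected_flows_per_switch() * len(SWITCHES)


if __name__ == "__main__":
    print("per switch:", expected_flows_per_switch())
    print("config data store:", expected_flows_in_config_datastore())
```

The observed values (3000 in the config data store, 0 on s1) are 600 and 1200 short of these figures respectively.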

Note:
1. There is an AskTimeoutException in node c3 after step 7 (refer to the logs in folder jan29_till_step7).
2. After step 12, there is an AskTimeoutException in node c1 (refer to the logs in folder jan29_till_step12).

Controllers (to cross-check logs):
===================================
c1 - Controller 1 with IP 10.183.181.41
c2 - Controller 2 with IP 10.183.181.42
c3 - Controller 3 with IP 10.183.181.43

Enclosed Logs:
==============
c1.log for controller c1
c2.log for controller c2
c3.log for controller c3



 Comments   
Comment by Anil Gujele [ 29/Jan/16 ]

Attachment resyncFailed3SwitchLeaderRestart.rar has been added with description: attached logs from c1, c2 and c3 nodes.

Comment by Anil Gujele [ 29/Jan/16 ]

In step 8, the flows are pushed from node c3.
Corrected step 8:
8. Push 200 more flows from leader node c3 to the 3 switches.

Comment by Luis Gomez [ 31/Jan/16 ]

I can easily reproduce the issue in step 13 by:

1) Push a normal switch flow (do not connect a switch for now)

2) Kill and recover the inventory shard leader

3) Connect switch to old shard leader - Flow is not programmed

Comment by Luis Gomez [ 31/Jan/16 ]

Anil, is this the cluster RPC issue you mentioned to me?

Comment by Anil Vishnoi [ 31/Jan/16 ]

Yes, Luis.

Comment by Anil Gujele [ 02/Feb/16 ]

I have verified this defect with a build from the latest ODL Beryllium master code; reconciliation is working in this scenario.

I see the log messages below on the other two nodes every 5 seconds once the leader node is down.

2016-02-02 03:51:05,862 | WARN | ds-oper-thread-0 | OperationLimiter | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to acquire operation permit for transaction member-3-chn-11-txn-2021
2016-02-02 03:51:10,863 | WARN | ds-oper-thread-0 | OperationLimiter | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to acquire operation permit for transaction member-3-chn-11-txn-2021
2016-02-02 03:51:15,863 | WARN | ds-oper-thread-0 | OperationLimiter | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to acquire operation permit for transaction member-3-chn-11-txn-2021
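
The 5-second spacing can be checked from the timestamps. A rough sketch, assuming the Karaf log layout shown above (fields separated by " | ", timestamp first); the helper names are hypothetical:

```python
from datetime import datetime

# The three WARN lines quoted above, copied verbatim.
LOG_LINES = [
    "2016-02-02 03:51:05,862 | WARN | ds-oper-thread-0 | OperationLimiter | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to acquire operation permit for transaction member-3-chn-11-txn-2021",
    "2016-02-02 03:51:10,863 | WARN | ds-oper-thread-0 | OperationLimiter | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to acquire operation permit for transaction member-3-chn-11-txn-2021",
    "2016-02-02 03:51:15,863 | WARN | ds-oper-thread-0 | OperationLimiter | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to acquire operation permit for transaction member-3-chn-11-txn-2021",
]

MARKER = "Failed to acquire operation permit"


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' field of a log line."""
    stamp = line.split(" | ", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def warning_intervals(lines):
    """Seconds between consecutive OperationLimiter warnings."""
    stamps = [parse_timestamp(line) for line in lines if MARKER in line]
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(warning_intervals(LOG_LINES))
```

All three warnings name the same transaction (member-3-chn-11-txn-2021), so the same operation appears to be retried each time rather than a new one failing.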

Comment by Ryan Goulding [ 09/Feb/16 ]

So this works?

Comment by Anil Vishnoi [ 09/Feb/16 ]

Hi Anil,

Can you please test with the latest stable/beryllium? It should be fixed with

https://git.opendaylight.org/gerrit/#/c/34115/

Thanks

Comment by Anil Gujele [ 11/Feb/16 ]

Hi Anil,

I have tested it with your patch and it happened again on the second attempt.

Thanks
Anil

Comment by Ryan Goulding [ 16/Feb/16 ]

Is this actually fixed? Is this still a blocker? If this is still causing issues, can you please reopen it?

Comment by Tom Pantelis [ 25/Feb/16 ]

I haven't heard anything wrt this still being an issue after Anil's patch. Closing...
