[CONTROLLER-1570] Leader state changed events received in reversed order Created: 15/Dec/16  Updated: 25/Jul/23  Resolved: 22/Dec/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vinh Nguyen Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive ODL1_karaf.zip     Zip Archive ODL2_karaf.zip     Zip Archive ODL3_karaf.zip     Zip Archive ODL4_karaf.zip     Zip Archive ODL5_karaf.zip     Zip Archive ODL6_karaf.zip    
External issue ID: 7391

 Description   

When the ODL which is the leader of some shard is shutdown, one of the remaining ODL first receives two leader changed events, first event indicating new leader is selected and then the second event indicating the shard leader is null. The correct order of events received should be reversed.

The scenario is described below:

1) 6-nodes cluster (I am not sure the fact that cluster has more than 3 nodes is causing the problem because we always test with 3-nodes cluster)
2) ODL node 1 is leader of some shard, say ‘shard-default-config’
3) Now shutdown ODL1, the leader for ‘shard-default-config’ is transferred to ODL2
4) There are two events received by each node, first one indicates leader is null (current leader is no longer valid), the second for the new leader (ODL2) is selected.
5) Four of the remaining ODL nodes receive the ‘onLeaderStateChanged’ in that order
6) Problem: ODL3 node receives the ‘onLeaderStateChanged’ in reversed order. First it receives the event for new leader,
and then it receives the event for leader == null. This ODL node, based on the last event (leader==null), thinks that the ‘shard-default-config’ shard has no leader and throws exception when there is a read/write to the shard.

ReadFailedException

{message=Error executeRead ReadData for path /(urn:opendaylight:neutron?revision=2015-07-12)neutron/networks, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:neutron?revision=2015-07-12)neutron/networks, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-3-shard-default-config currently has no leader. Try again later.]]}

Below is the excerpt from karaf log filtered on the ‘onLeaderStateChanged’ events when ODL1 is shut down. Please note the events for ‘shard-default-config’:

2016-11-14 04:32:35,263 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-operational, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,263 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-entity-ownership-operational, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,263 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-toaster-operational, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,263 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-inventory-operational, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,263 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-topology-operational, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,318 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-inventory-operational, leaderId=member-4-shard-inventory-operational, leaderPayloadVersion=5]
2016-11-14 04:32:35,318 | INFO | ult-dispatcher-6 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-operational, leaderId=member-4-shard-default-operational, leaderPayloadVersion=5]
2016-11-14 04:32:35,322 | INFO | ult-dispatcher-2 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-entity-ownership-operational, leaderId=member-4-shard-entity-ownership-operational, leaderPayloadVersion=5]
2016-11-14 04:32:35,330 | INFO | ult-dispatcher-5 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-toaster-operational, leaderId=member-2-shard-toaster-operational, leaderPayloadVersion=5]
2016-11-14 04:32:35,331 | INFO | ult-dispatcher-5 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-topology-operational, leaderId=member-2-shard-topology-operational, leaderPayloadVersion=5]
2016-11-14 04:32:35,441 | INFO | lt-dispatcher-16 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-inventory-config, leaderId=member-5-shard-inventory-config, leaderPayloadVersion=5]
2016-11-14 04:32:35,441 | INFO | lt-dispatcher-16 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-inventory-config, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,442 | INFO | lt-dispatcher-18 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-toaster-config, leaderId=member-5-shard-toaster-config, leaderPayloadVersion=5]
2016-11-14 04:32:35,442 | INFO | lt-dispatcher-18 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=member-2-shard-default-config, leaderPayloadVersion=5]
2016-11-14 04:32:35,442 | INFO | lt-dispatcher-18 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-toaster-config, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,442 | INFO | lt-dispatcher-18 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-topology-config, leaderId=member-5-shard-topology-config, leaderPayloadVersion=5]
2016-11-14 04:32:35,442 | INFO | ult-dispatcher-4 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-topology-config, leaderId=null, leaderPayloadVersion=5]
2016-11-14 04:32:35,455 | INFO | lt-dispatcher-16 | ShardManager | 195 - org.opendaylight.controller.sal-distributed-datastore - 1.4.1.Boron-SR1 | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=null, leaderPayloadVersion=5]



 Comments   
Comment by Vinh Nguyen [ 16/Dec/16 ]

Attachment ODL1_karaf.zip has been added with description: karaf.log ODL1

Comment by Vinh Nguyen [ 16/Dec/16 ]

Attachment ODL2_karaf.zip has been added with description: karaf.log ODL2

Comment by Vinh Nguyen [ 16/Dec/16 ]

Attachment ODL3_karaf.zip has been added with description: karaf.log ODL3 - This ODL received leader changed events in wrong order

Comment by Vinh Nguyen [ 16/Dec/16 ]

Attachment ODL4_karaf.zip has been added with description: karaf.log ODL4

Comment by Vinh Nguyen [ 16/Dec/16 ]

Attachment ODL5_karaf.zip has been added with description: karaf.log ODL5

Comment by Vinh Nguyen [ 16/Dec/16 ]

Attachment ODL6_karaf.zip has been added with description: karaf.log ODL6

Comment by Tom Pantelis [ 21/Dec/16 ]

Was ODL3/member-3 a non-voting member by any chance?

Comment by Tom Pantelis [ 21/Dec/16 ]

Strike that. I think I know the problem. On leader transition, the current leader first sends out LeaderTransitioning events to each follower to tell them the current leader is being deposed. The followers then send out a LeaderStateChange event with a null leaderId which is picked up by the ShardManager to delay subsequent transactions activity to the shard until a new leader is elected. However it's possible the LeaderStateChange message does not reach a follower until after the leader transition occurs (eg due to dispatching delay in the caller or the network). This results in a LeaderStateChange event with a null leaderId being delivered after the LeaderStateChange with the new leaderId. I wrote a unit test that reproduces it.

We need to handle LeaderTransitioning events in a CAS-like manner, ie include the leaderId with the LeaderTransitioning message and only issue the LeaderStateChange event with a null leaderId if the current leaderId matches the leaderId in the LeaderTransitioning message.

(In reply to Tom Pantelis from comment #7)
> Was ODL3/member-3 a non-voting member by any chance?

Comment by Tom Pantelis [ 21/Dec/16 ]

Submitted https://git.opendaylight.org/gerrit/#/c/49701/

Generated at Wed Feb 07 19:55:53 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.