[CONTROLLER-1900] Performance regression in cluster registration Created: 06/Jun/19  Updated: 12/Jun/19  Resolved: 12/Jun/19

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Fluorine SR3, Neon SR2
Fix Version/s: Sodium, Fluorine SR3, Neon SR2

Type: Bug Priority: Highest
Reporter: Luis Gomez Assignee: Luis Gomez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to CONTROLLER-1897 Improve QNameFactory performance Resolved
relates to OPNFLWPLUG-1072 ContextChainHolderImpl encroaches on ... Confirmed

 Description   

The problem seems to be in all branches:

https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-ask-all-neon/247/robot/controller-clustering-ask.txt/Chasing%20The%20Leader/Unregister_Candidates_And_Validate_Criteria/

https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-tell-all-fluorine/220/robot/controller-clustering-tell.txt/Chasing%20The%20Leader/Unregister_Candidates_And_Validate_Criteria/

In Fluorine it started around May 23rd, so there are multiple suspect patches:

https://git.opendaylight.org/gerrit/#/q/branch:stable/fluorine+project:controller

The test itself does:

1) Start a singleton registration flap on every controller instance with this RPC: /restconf/operations/odl-mdsal-lowlevel-control:register-flapping-singleton

2) Maintain the flap for 60 secs.

3) Stop the flap on every controller instance: /restconf/operations/odl-mdsal-lowlevel-control:unregister-flapping-singleton

4) Get flap count from above RPC response: <output xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control"><flap-count>83</flap-count></output>

5) Add all the flaps for the 3 controller instances and divide the total by 60 secs.

Before the regression the controller handled ~50 flaps/sec; after the regression it handles fewer than 5 flaps/sec.
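
For reference, the sketch below shows roughly what this procedure boils down to when driven over RESTCONF. It is illustrative only (not the actual Robot suite): the member IP addresses are made up, and it assumes the default RESTCONF port and credentials (8181, admin:admin).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative driver for the five steps above -- not the Robot suite itself.
// Assumes default RESTCONF settings (port 8181, admin:admin); the member IPs are made up.
public class FlapRateCheck {
    private static final String AUTH =
            "Basic " + Base64.getEncoder().encodeToString("admin:admin".getBytes());
    private static final Pattern FLAP_COUNT = Pattern.compile("<flap-count>(\\d+)</flap-count>");

    public static void main(String[] args) throws Exception {
        List<String> members = List.of("10.30.170.1", "10.30.170.2", "10.30.170.3");
        HttpClient client = HttpClient.newHttpClient();

        for (String member : members) {                        // 1) start the flap on every instance
            invoke(client, member, "register-flapping-singleton");
        }
        Thread.sleep(60_000);                                  // 2) maintain the flap for 60 secs

        long totalFlaps = 0;
        for (String member : members) {                        // 3) stop the flap on every instance
            String body = invoke(client, member, "unregister-flapping-singleton");
            Matcher m = FLAP_COUNT.matcher(body);              // 4) flap count from the RPC response
            if (m.find()) {
                totalFlaps += Long.parseLong(m.group(1));
            }
        }
        System.out.println("flaps/sec: " + totalFlaps / 60.0); // 5) aggregate rate over 60 secs
    }

    private static String invoke(HttpClient client, String host, String rpc) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":8181/restconf/operations/odl-mdsal-lowlevel-control:" + rpc))
                .header("Authorization", AUTH)
                .header("Accept", "application/xml")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}

The per-member flap counts are summed and divided by the 60-second window, which gives the flaps/sec figure quoted above.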



 Comments   
Comment by Robert Varga [ 06/Jun/19 ]

A couple of observations here... I took the baseline from https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-ask-all-neon/239/

The weird thing is that for some reason OFP is reacting to entity changes that belong to others, which should not really happen:

2019-05-30T06:42:59,075 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-37 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device pcep-topology-service-group from operational DS

2019-05-30T07:22:50,531 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS

Those things are definitely none of OFP's business, so I wonder how we are getting these. They may end up interfering a bit with CSS, but I am not quite sure – https://git.opendaylight.org/gerrit/#/c/80707/ landed in the area.

In any case, yes, there is a notable drop in the number of these events (239 vs. 247):

nite@nitebug : ~/ctrl1900/x$ cat odl* | fgrep 'Try to remove device flapping' | sort | wc -l
16038
nite@nitebug : ~/ctrl1900$ cat odl* | fgrep 'Try to remove device flapping' | sort | wc -l
660

Now looking at the details of those logs, it seems the problem is that we are getting pauses:

2019-05-30T07:23:19,461 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:19,479 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:19,493 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS

>> 1 second

2019-05-30T07:23:20,502 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-40 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:20,519 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-40 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:20,527 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-40 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS

>> 1 second

2019-05-30T07:23:21,542 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:21,576 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-7 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:21,586 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS
2019-05-30T07:23:22,583 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-39 | ContextChainHolderImpl           | 415 - org.opendaylight.openflowplugin.impl - 0.8.2 | Try to remove device flapping-singleton-service from operational DS

ecelgp can you try running this test suite with the OFP patch reverted, please?

In the meantime I will look into why exactly OFP is even getting these events, as that definitely should not be happening. It is also possible we are leaking metadata in the Shard; I will look at that, too.

 

Comment by Robert Varga [ 06/Jun/19 ]

Ah, but Fluorine does not have that OFP patch.

Comment by Robert Varga [ 06/Jun/19 ]

ecelgp can we re-run that job with debug enabled on org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipShard, please?

Comment by Robert Varga [ 06/Jun/19 ]

So OFP is explicitly registering for CSS-internal events, hence this is not a bug in CSS or EOS but in OFP – filed OPNFLWPLUG-1072 for that.

Comment by Luis Gomez [ 08/Jun/19 ]

This run has the DEBUG activated:

https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-ask-all-fluorine/231/

This fixes the GC logs:

https://git.opendaylight.org/gerrit/#/c/82459/

Comment by Robert Varga [ 10/Jun/19 ]

Okay, I pulled the logs and started looking. ODL1 is the leader here, and it sees a significant pause:

2019-06-08T04:23:43,277 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational: Transaction member-1-entity-ownership-internal-fe-0-txn-80-0 committed as 0, sending response to Actor[akka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-entity-ownership-operational#340224012]
2019-06-08T04:23:43,277 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational: fakeSnapshot purging log to 114 for term 1
2019-06-08T04:23:43,277 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational (Leader): Checking sendAppendEntries for follower member-2-shard-entity-ownership-operational: active: true, followerNextIndex: 116, leaderLastIndex: 115, leaderSnapShotIndex: 114
2019-06-08T04:23:43,277 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational (Leader): handleAppendEntriesReply - FollowerLogInformation for member-3-shard-entity-ownership-operational updated: matchIndex: 115, nextIndex: 116
2019-06-08T04:23:43,278 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational (Leader): Checking sendAppendEntries for follower member-3-shard-entity-ownership-operational: active: true, followerNextIndex: 116, leaderLastIndex: 115, leaderSnapShotIndex: 114
2019-06-08T04:23:43,279 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational: Persisting ApplyJournalEntries with index=115
2019-06-08T04:23:43,279 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | BatchedModifications commit member-1-entity-ownership-internal-fe-0-txn-80-0 succeeded
2019-06-08T04:23:44,083 | DEBUG | opendaylight-cluster-data-shard-dispatcher-30 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational: Applying 1 batched modifications for Tx member-3-entity-ownership-internal-fe-0-txn-17-0

So we have a processing delay of 804 ms. During this time ODL2 does:

2019-06-08T04:23:43,282 | DEBUG | opendaylight-cluster-data-shard-dispatcher-46 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Applying state for log index 114 data org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@5478d243
2019-06-08T04:23:43,283 | DEBUG | opendaylight-cluster-data-shard-dispatcher-46 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: fakeSnapshot purging log to 113 for term 1
2019-06-08T04:23:43,283 | DEBUG | opendaylight-cluster-data-shard-dispatcher-46 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Persisting ApplyJournalEntries with index=114
2019-06-08T04:23:44,084 | DEBUG | opendaylight-cluster-data-shard-dispatcher-31 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Follower): Commit index set to 115

and ODL3 does:

2019-06-08T04:23:43,282 | DEBUG | opendaylight-cluster-data-shard-dispatcher-46 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Applying state for log index 114 data org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@5478d243
2019-06-08T04:23:43,283 | DEBUG | opendaylight-cluster-data-shard-dispatcher-46 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: fakeSnapshot purging log to 113 for term 1
2019-06-08T04:23:43,283 | DEBUG | opendaylight-cluster-data-shard-dispatcher-46 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Persisting ApplyJournalEntries with index=114
2019-06-08T04:23:44,084 | DEBUG | opendaylight-cluster-data-shard-dispatcher-31 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Follower): Commit index set to 115

Overall it seems that building consensus on the followers takes 0.8 seconds every now and then – and thus we cannot maintain the expected rate. Note this is the operational datastore, so the persist call does not really touch disk.

Comment by Robert Varga [ 10/Jun/19 ]

This might be related to the test driver processing, as it would seem that for 0.8 seconds there really is nothing to do on the master.

Comment by Robert Varga [ 10/Jun/19 ]

These pauses occur at:

  • 2019-06-08T04:23:43,282 (114)
  • 2019-06-08T04:23:44,307 (137)
  • 2019-06-08T04:23:45,342 (160)
  • 2019-06-08T04:23:46,389 (183)
  • 2019-06-08T04:23:47,422 (206)
  • 2019-06-08T04:23:48,455 (229)
  • 2019-06-08T04:23:49,606 (252)
  • 2019-06-08T04:23:50,561 (275)
  • 2019-06-08T04:23:51,642 (298)
  • 2019-06-08T04:23:52,659 (321)
  • 2019-06-08T04:23:53,654 (344)
  • (to be filled in)

The numbers are commit indices, i.e. every 23 commits there seems to be a pause of ~0.8 seconds. A typical non-pause time from the above is about 200ms, during which the burst is processed at about 9ms per commit (i.e. at about 111 commits per second), which translates to about 55 flaps a second (I think).
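
A quick sanity check on that arithmetic (a throwaway calculation; the 2-commits-per-flap ratio is an assumption, used only to relate commit rate to flap rate):

// Throwaway arithmetic behind the estimate above.  The 2-commits-per-flap ratio
// is an assumption used to relate commit rate to flap rate.
public class BurstRateEstimate {
    public static void main(String[] args) {
        double commitsPerCycle = 23;    // commit-index delta between consecutive pauses
        double burstSeconds = 0.2;      // typical non-pause time per cycle

        double perCommitMillis = burstSeconds * 1000 / commitsPerCycle;   // ~8.7 ms per commit
        double commitsPerSecond = commitsPerCycle / burstSeconds;         // ~115 commits/sec during the burst
        double flapsPerSecond = commitsPerSecond / 2;                     // ~57 flaps/sec, close to the pre-regression ~50

        System.out.printf("%.1f ms/commit, ~%.0f commits/s, ~%.0f flaps/s%n",
                perCommitMillis, commitsPerSecond, flapsPerSecond);
    }
}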

Comment by Luis Gomez [ 10/Jun/19 ]

So are we saying the issue is in the test RPC producing the flap?

Comment by Robert Varga [ 10/Jun/19 ]

Not sure yet, I'll have to investigate a bit more. Both '23' and 800ms are quite constant and I don't remember any CDS/EOS mechanics which would be related to this, so I'll need to dig through the code and also get debugs from all components in the mix...

Comment by Robert Varga [ 11/Jun/19 ]

https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-ask-all-fluorine/236/ is a run with full controller debugs

Comment by Robert Varga [ 11/Jun/19 ]

There seems to be a delay in propagating the latest state from the master (ODL2) to the others. ODL3 is holding ownership, hence the delay translates into ODL3 not relinquishing state.

ODL1:

 

2019-06-11T08:25:19,505 | DEBUG | opendaylight-cluster-data-shard-dispatcher-33 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational (Follower): handleAppendEntries: AppendEntries [leaderId=member-2-shard-entity-ownership-operational, prevLogIndex=99, prevLogTerm=1, leaderCommit=99, replicatedToAllIndex=98, payloadVersion=9, recipientRaftVersion=4, leaderRaftVersion=4, leaderAddress=null, entries=[SimpleReplicatedLogEntry [index=100, term=1, payload=org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@2e722827]]]
2019-06-11T08:25:19,509 | DEBUG | opendaylight-cluster-data-notification-dispatcher-69 | EntityOwnerChangeListener        | 288 - org.opendaylight.controller.sal-distributed-datastore - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational: New owner: member-3, Original owner: member-3
2019-06-11T08:25:19,510 | DEBUG | opendaylight-cluster-data-notification-dispatcher-69 | ShardDataTreeChangePublisherActor | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational: Elapsed time for generation of change events for member-1-shard-entity-ownership-operational-DTCL-publisher: 1.312 ms
[...]
2019-06-11T08:25:20,247 | DEBUG | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-1-shard-entity-ownership-operational (Follower): Commit index set to 100

ODL2:

2019-06-11T08:25:19,499 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Leader): handleAppendEntriesReply - FollowerLogInformation for member-1-shard-entity-ownership-operational updated: matchIndex: 99, nextIndex: 100
2019-06-11T08:25:19,499 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Leader): Applying to log - commitIndex: 99, lastAppliedIndex: 98
2019-06-11T08:25:19,499 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Leader): Setting last applied to 99
[...]
2019-06-11T08:25:19,502 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Persisting ApplyJournalEntries with index=99
[...]
2019-06-11T08:25:19,512 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Leader): handleAppendEntriesReply - FollowerLogInformation for member-1-shard-entity-ownership-operational updated: matchIndex: 100, nextIndex: 101
[...]
2019-06-11T08:25:19,514 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational (Leader): handleAppendEntriesReply - FollowerLogInformation for member-3-shard-entity-ownership-operational updated: matchIndex: 100, nextIndex: 101
[...]
2019-06-11T08:25:19,515 | DEBUG | opendaylight-cluster-data-shard-dispatcher-32 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Persisting ApplyJournalEntries with index=100
[...]
2019-06-11T08:25:19,517 | DEBUG | opendaylight-cluster-data-notification-dispatcher-53 | ShardDataTreeChangePublisherActor | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Elapsed time for generation of change events for member-2-shard-entity-ownership-operational-DTCL-publisher: 2.777 ms
[...]
2019-06-11T08:25:20,261 | DEBUG | opendaylight-cluster-data-shard-dispatcher-26 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-2-shard-entity-ownership-operational: Applying 1 batched modifications for Tx member-3-entity-ownership-internal-fe-0-txn-14-0

ODL3:

2019-06-11T08:25:19,506 | DEBUG | opendaylight-cluster-data-shard-dispatcher-27 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-3-shard-entity-ownership-operational (Follower): handleAppendEntries: AppendEntries [leaderId=member-2-shard-entity-ownership-operational, prevLogIndex=99, prevLogTerm=1, leaderCommit=99, replicatedToAllIndex=98, payloadVersion=9, recipientRaftVersion=4, leaderRaftVersion=4, leaderAddress=null, entries=[SimpleReplicatedLogEntry [index=100, term=1, payload=org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@1d897866]]]
2019-06-11T08:25:19,510 | DEBUG | opendaylight-cluster-data-notification-dispatcher-66 | EntityOwnerChangeListener        | 288 - org.opendaylight.controller.sal-distributed-datastore - 1.8.3.SNAPSHOT | member-3-shard-entity-ownership-operational: New owner: member-3, Original owner: member-3
2019-06-11T08:25:19,510 | DEBUG | opendaylight-cluster-data-notification-dispatcher-66 | ShardDataTreeChangePublisherActor | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-3-shard-entity-ownership-operational: Elapsed time for generation of change events for member-3-shard-entity-ownership-operational-DTCL-publisher: 916.3 μs
[...]
2019-06-11T08:25:20,248 | DEBUG | opendaylight-cluster-data-shard-dispatcher-31 | EntityOwnershipShard             | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | member-3-shard-entity-ownership-operational (Follower): Commit index set to 100
[...]
2019-06-11T08:25:20,250 | DEBUG | opendaylight-cluster-data-notification-dispatcher-66 | EntityOwnerChangeListener        | 288 - org.opendaylight.controller.sal-distributed-datastore - 1.8.3.SNAPSHOT | member-3-shard-entity-ownership-operational: New owner: member-3, Original owner:
[...]
2019-06-11T08:25:20,250 | DEBUG | opendaylight-cluster-data-shard-dispatcher-51 | SyncStatusTracker                | 273 - org.opendaylight.controller.sal-akka-raft - 1.8.3.SNAPSHOT | member-3-shard-car-operational: Lagging 0 entries behind leader and reached member-1-shard-car-operational (of expected -1)
2019-06-11T08:25:20,251 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | EntityOwnershipListenerActor     | 280 - org.opendaylight.controller.sal-clustering-commons - 1.8.3.SNAPSHOT | Notifying EntityOwnershipListener org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl@2b220cd9: DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false]
2019-06-11T08:25:20,252 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | FlappingSingletonService         | 292 - org.opendaylight.controller.samples.clustering-it-provider - 1.8.3.SNAPSHOT | Instantiating flapping-singleton-service.

 

Comment by Robert Varga [ 11/Jun/19 ]

Alright, I have figured out what is going on and why things broke with the removal of PurgeTransaction entries.

State changes here are driven by DTCL, which is hooked onto the replica. That DTCL is driven by the Leader's commitIndex, which is propagated to the Followers via AppendEntries.

In the pause case, ODL2 is master and ODL3 is going to be the new owner. Consensus is established when ODL1 replies to AppendEntries, at which point the leader rolls forward and finishes the transaction. After that the code processing AppendEntriesReply checks if it should send another AppendEntries request. As the journal content has not changed, it will not. A similar thing happens when ODL3 responds.

The system is in steady state from ODL2's perspective, but there is still state hanging – as ODL3's DTCL has not fired, because ODL3 does not know the transaction has been applied. This continues until the next heartbeat occurs, at which point ODL2 sends an empty AppendEntries message to both ODL1 and ODL3, which contains the updated commitIndex. As the commitIndex is bumped, ODL3 fires the DTCL, which causes more stuff to happen, unblocking the system again.

Now, the reason this used to work is that during transaction processing ShardDataTree persisted a PurgeTransactionPayload, hence the check in the AppendEntriesReply code saw a new unreplicated entry being present and immediately sent it to ODL1 (and then ODL3 when it came back), reflecting the commitIndex update – thus the followers fired the DTCL immediately.
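
To make the mechanism concrete, here is a toy model of the leader-side behaviour described above. It is not the sal-akka-raft code and the names are made up; it only shows why, without the extra purge entry, followers learn the new commitIndex only on the next heartbeat:

import java.util.List;

// Toy model of the behaviour described above -- NOT the sal-akka-raft code.
public class CommitIndexPropagationSketch {

    static final class Follower {
        final String name;
        long matchIndex = -1;        // highest journal index the follower has replicated
        long knownCommitIndex = -1;  // commitIndex last seen in an AppendEntries from the leader
        Follower(String name) { this.name = name; }
    }

    long lastJournalIndex = -1;      // leader's journal
    long commitIndex = -1;           // advanced once a majority has replicated the entry

    void appendEntry() { lastJournalIndex++; }

    // Called when a follower acknowledges replication up to ackedIndex.
    void handleAppendEntriesReply(Follower f, long ackedIndex, List<Follower> all) {
        f.matchIndex = ackedIndex;
        long replicas = 1 + all.stream().filter(x -> x.matchIndex >= ackedIndex).count();
        if (replicas > (all.size() + 1) / 2) {
            commitIndex = Math.max(commitIndex, ackedIndex);   // consensus reached, leader applies
        }
        // Key point: a new AppendEntries is only sent from here if the follower still
        // lacks journal entries.  With the purge payload gone there is nothing more to
        // send, so the follower does not learn the new commitIndex at this point.
        if (f.matchIndex < lastJournalIndex) {
            sendAppendEntries(f);
        }
    }

    void heartbeat(List<Follower> all) {             // empty AppendEntries, one heartbeat later
        all.forEach(this::sendAppendEntries);
    }

    void sendAppendEntries(Follower f) {
        f.knownCommitIndex = commitIndex;            // AppendEntries always carries commitIndex
    }

    public static void main(String[] args) {
        CommitIndexPropagationSketch leader = new CommitIndexPropagationSketch();  // plays ODL2
        Follower odl1 = new Follower("ODL1");
        Follower odl3 = new Follower("ODL3");
        List<Follower> followers = List.of(odl1, odl3);

        leader.appendEntry();                                 // the ownership-change transaction
        followers.forEach(leader::sendAppendEntries);         // entry shipped, commitIndex still old
        leader.handleAppendEntriesReply(odl1, 0, followers);  // consensus -> leader applies the tx
        leader.handleAppendEntriesReply(odl3, 0, followers);

        System.out.println("leader commitIndex: " + leader.commitIndex);                 // 0
        System.out.println(odl3.name + " knows commitIndex: " + odl3.knownCommitIndex);  // still -1, DTCL cannot fire
        leader.heartbeat(followers);                          // next heartbeat carries the new commitIndex
        System.out.println(odl3.name + " after heartbeat: " + odl3.knownCommitIndex);    // 0
    }
}

Running it shows the leader at commitIndex 0 while ODL3 still reports -1, and only the heartbeat call brings ODL3 up to date – which corresponds to the pauses seen in the traces.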

Comment by Robert Varga [ 11/Jun/19 ]

ecelgp can you give https://git.opendaylight.org/gerrit/82486 a try, please?

Comment by Robert Varga [ 11/Jun/19 ]

It seems to have been on the right track: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-ask-all-fluorine/239/robot-plugin/log.html.gz#s1-s14

Latest patch runs here: https://jenkins.opendaylight.org/releng/job/controller-patch-test-core-fluorine/2/

 

Comment by Robert Varga [ 11/Jun/19 ]

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-ask-all-fluorine/240/robot-plugin/log.html.gz#s1-s14-t3-k2-k4 reports the rate at 84.38 flaps/second, so I think this is fixed.

Comment by Luis Gomez [ 11/Jun/19 ]

Yes it is.
