[CONTROLLER-1757] Singleton leader chasing exhausts heap space in few hours Created: 25/Aug/17  Updated: 06/Sep/17  Resolved: 06/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks MDSAL-275 VerifyException from flapping service... Resolved
External issue ID: 9054

 Description   

This bug is not (yet) present in the Carbon code. It affects changes proposed around the SR2 branch lock. Reporting it now, as it will probably prevent some fixes from being merged into the SR2 candidate build.

The exact build where this bug happens is [0], which was intended to fix MDSAL-275. It does fix that issue, but apparently there is a memory leak somewhere.

Logs for the Sandbox run are here [1]. The karaf.log files show UnreachableMember events starting around three and a half hours into the test (corresponding to GC pauses of 5 or more seconds), and the gclogs directories show that members 1 and 3 end with an allocation failure not recoverable by GC around 19 hours after the test starts. It is not clear whether heap dumps were created; they certainly have not been archived.

Patches that were included in the build are: [2], [3] (with its ancestors) and [4].

[0] https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/integration/distribution/distribution-karaf/0.6.2-SNAPSHOT/distribution-karaf-0.6.2-20170823.082806-47.zip
[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/14/
[2] https://git.opendaylight.org/gerrit/#/c/61420/18
[3] https://git.opendaylight.org/gerrit/#/c/62170/4
[4] https://git.opendaylight.org/gerrit/#/c/62140/1



 Comments   
Comment by Robert Varga [ 25/Aug/17 ]

The patch for MDSAL-275 allows us to run the cluster singleton much hotter than before (60-110 tps vs. 5 tps). A 19-hour run at 60 tps translates roughly to 19*3600*60*4 ~= 16M transactions.
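For reference, the arithmetic behind that estimate: 19 * 3600 * 60 = 4,104,000 transactions at the base rate, and the additional factor of 4 used above brings it to 19 * 3600 * 60 * 4 = 16,416,000, i.e. roughly 16M.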

I suspect this is an EOS-specific rehash of CONTROLLER-1746, as EOS is a subclass of Shard and I suspect it interacts directly with ShardDataTree (and hence needs to play nice and purge its transactions).

Comment by Robert Varga [ 30/Aug/17 ]

I have re-created this in a unit test. It seems to be coming from FrontendHistoryMetadataBuilder.purgedTransactions, which is not contiguous as expected, and hence the RangeSet is not compressing properly.
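A standalone illustration of why that matters, assuming the set behaves like Guava's TreeRangeSet (this is demo code, not the controller's): contiguous purged IDs coalesce into a single range, while skipped IDs leave one disconnected range each, so the metadata grows with the transaction count instead of staying constant.

import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;

// Demo only, not controller code: contiguous vs. gappy purged transaction IDs.
public class PurgedTxnRangeDemo {
    public static void main(String[] args) {
        RangeSet<Long> contiguous = TreeRangeSet.create();
        RangeSet<Long> withGaps = TreeRangeSet.create();

        for (long txn = 0; txn < 100_000; txn++) {
            // Adjacent ranges coalesce, so the whole sequence collapses to [0, 100000).
            contiguous.add(Range.closedOpen(txn, txn + 1));
            // Skipping every other ID (standing in for the gaps visible in the
            // EOS log below) leaves one disconnected singleton range per purge.
            if (txn % 2 == 0) {
                withGaps.add(Range.closedOpen(txn, txn + 1));
            }
        }

        System.out.println(contiguous.asRanges().size()); // 1
        System.out.println(withGaps.asRanges().size());   // 50000 -- grows without bound
    }
}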

This seems to be coming from EntityOwnershipShard and its CommitCoordinator, which manually allocate transaction IDs for BatchedModifications, but those IDs are not contiguous:

03:08:29,127 PM [cluster-test-shard-dispatcher-14] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59606-0, size 2
03:08:29,134 PM [cluster-test-shard-dispatcher-14] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59608-0, size 1
03:08:29,177 PM [cluster-test-shard-dispatcher-11] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59612-0, size 2
03:08:29,197 PM [cluster-test-shard-dispatcher-19] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59617-0, size 3
03:08:29,215 PM [cluster-test-shard-dispatcher-19] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59619-0, size 1
03:08:29,243 PM [cluster-test-shard-dispatcher-19] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59622-0, size 2
03:08:29,281 PM [cluster-test-shard-dispatcher-20] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59626-0, size 2
03:08:29,324 PM [cluster-test-shard-dispatcher-19] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59629-0, size 1
03:08:29,348 PM [cluster-test-shard-dispatcher-20] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59632-0, size 2
03:08:29,367 PM [cluster-test-shard-dispatcher-20] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59634-0, size 1
03:08:29,399 PM [cluster-test-shard-dispatcher-11] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59637-0, size 2
03:08:29,413 PM [cluster-test-shard-dispatcher-11] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59639-0, size 1
03:08:29,439 PM [cluster-test-shard-dispatcher-11] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59643-0, size 3
03:08:29,473 PM [cluster-test-shard-dispatcher-11] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59646-0, size 1
03:08:29,513 PM [cluster-test-shard-dispatcher-11] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59650-0, size 3

The fix for BUG-8858 is just flushing this out, because it can tear through many more transitions and hence generates many more transactions.

Comment by Robert Varga [ 30/Aug/17 ]

https://git.opendaylight.org/gerrit/62449 is the unit test showing the problem.

Comment by Robert Varga [ 30/Aug/17 ]

I think the problem is coming from the early allocation of the transaction ID in EntityOwnershipShardCommitCoordinator.newBatchedModifications(), where the batch is then state-compressed away and never committed.

Patch https://git.opendaylight.org/gerrit/62453 modifies EOS to allocate the BatchedModifications (and the transaction ID) only just before we send it to the backend.
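A minimal sketch of that idea, with hypothetical stand-in types rather than the actual classes touched by gerrit/62453: defer minting the transaction ID until the batch is definitely sent, so batches that get merged or dropped by state compression never consume an ID and the purged-ID sequence stays contiguous.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; Mod and Batch are stand-ins, not the real
// Modification/BatchedModifications classes.
final class LazyTxnAllocator {
    record Mod(String path) { }
    record Batch(long txnId, List<Mod> mods) { }

    private final Queue<Mod> pending = new ArrayDeque<>();
    private long nextTxnId; // assumed monotonic per-frontend-history counter

    void enqueue(Mod mod) {
        // No transaction ID is allocated here, so state compression can
        // merge or drop queued modifications without burning an ID.
        pending.add(mod);
    }

    Batch flushToBackend() {
        // The ID is minted only when the batch is actually sent, keeping
        // the committed/purged ID sequence contiguous.
        Batch batch = new Batch(nextTxnId++, new ArrayList<>(pending));
        pending.clear();
        return batch;
    }
}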

Comment by Robert Varga [ 30/Aug/17 ]

This is a Carbon -> Carbon SR1 regression, although in normal operation the memory leaks very slowly.

Comment by Vratko Polak [ 06/Sep/17 ]

> Patch https://git.opendaylight.org/gerrit/62453

Looks like everything has been merged.
Verification will be this CSIT job [7] passing, but I believe FIXED is the correct status right now.

[7] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/21/
