[CONTROLLER-1757] Singleton leader chasing exhausts heap space in a few hours Created: 25/Aug/17 Updated: 06/Sep/17 Resolved: 06/Sep/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 9054 |
| Description |
|
This bug is not (yet) present in Carbon code. This Bug affects changes proposed around the SR2 branch lock. Reporting, as this will probably prevent some fixes from being merged into the SR2 candidate build. The exact build where this Bug happens is [0], which was intended to fix

Logs for the Sandbox run are here [1]. The karaf.log files show UnreachableMember starting to appear around three and a half hours into the test duration (corresponding to GC pauses of 5 or more seconds), and the gclogs directories show that members 1 and 3 end with an allocation failure not recoverable by GC around 19 hours after the test starts. It is not clear whether heap dumps were created; they certainly have not been archived.

Patches included in the build are: [2], [3] (with its ancestors) and [4].

[0] https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/integration/distribution/distribution-karaf/0.6.2-SNAPSHOT/distribution-karaf-0.6.2-20170823.082806-47.zip |
| Comments |
| Comment by Robert Varga [ 25/Aug/17 ] |
|
The patch for

I suspect this is an EOS-specific rehash of |
| Comment by Robert Varga [ 30/Aug/17 ] |
|
I have re-created this in a unit test. It seems that this is coming from FrontendHistoryMetadataBuilder.purgedTransactions, which is not contiguous as expected, hence the RangeSet is not compressing properly. This seems to be coming from EntityOwnershipShard and its CommitCoordinator, which manually allocate transaction IDs for BatchedModifications, but those IDs are not contiguous:

03:08:29,127 PM [cluster-test-shard-dispatcher-14] [DEBUG] EntityOwnershipShard - Committing next BatchedModifications member-1-entity-ownership-internal-fe-0-txn-59606-0, size 2

The fix for BUG-8858 is just flushing this out, because it can tear through many more transitions and hence generates many more transactions. |
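A minimal, self-contained sketch of the coalescing behaviour described above, using Guava's TreeRangeSet as a stand-in for the purged-transactions tracking; this is an assumption for illustration only, not the actual FrontendHistoryMetadataBuilder code. With contiguous purged transaction IDs the set collapses into a single range; when IDs are skipped, the set keeps one range per transaction and its footprint grows with every transaction, consistent with the observed heap exhaustion.

{code:java}
// Sketch only: Guava TreeRangeSet used as a stand-in, NOT the real
// FrontendHistoryMetadataBuilder implementation.
import com.google.common.collect.Range;
import com.google.common.collect.TreeRangeSet;

public class PurgedTxnRangeSketch {
    public static void main(String[] args) {
        // Contiguous transaction IDs: each new ID is adjacent to the previous
        // range, so everything coalesces into a single range.
        TreeRangeSet<Long> contiguous = TreeRangeSet.create();
        for (long txn = 0; txn < 100_000; txn++) {
            contiguous.add(Range.closedOpen(txn, txn + 1));
        }
        System.out.println("contiguous: " + contiguous.asRanges().size() + " range(s)"); // 1

        // Non-contiguous IDs (every other ID never committed/purged): ranges
        // never touch, so the set keeps one entry per transaction and grows.
        TreeRangeSet<Long> sparse = TreeRangeSet.create();
        for (long txn = 0; txn < 100_000; txn += 2) {
            sparse.add(Range.closedOpen(txn, txn + 1));
        }
        System.out.println("sparse: " + sparse.asRanges().size() + " range(s)"); // 50000
    }
}
{code}

This is why the EOS shard's manually allocated, non-contiguous IDs keep the RangeSet from compressing, and why the BUG-8858 fix, which tears through far more transactions, makes the growth visible within hours.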
| Comment by Robert Varga [ 30/Aug/17 ] |
|
https://git.opendaylight.org/gerrit/62449 is the unit test showing the problem. |
| Comment by Robert Varga [ 30/Aug/17 ] |
|
I think the problem is coming from early allocation of the transaction ID in EntityOwnershipShardCommitCoordinator.newBatchedModifications(), which is then state compressed and not committed. Patch https://git.opendaylight.org/gerrit/62453 modifies EOS to allocate BatchedModifications (and the transaction ID) only just before we send it to the backend. |
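A minimal sketch of the deferred-allocation idea behind that patch, assuming a simplified, hypothetical coordinator (the names DeferredAllocationSketch, queueBatch and flush are invented; this is not the actual EntityOwnershipShardCommitCoordinator code): the transaction ID counter is advanced only when a batch is actually sent, so batches dropped by state compression never consume an ID and the committed/purged ID sequence stays contiguous.

{code:java}
// Sketch only: illustrates allocating the transaction ID at send time
// rather than at queue time. Not the real EOS coordinator code.
import java.util.ArrayDeque;
import java.util.Queue;

final class DeferredAllocationSketch {
    private long nextTxnId;                                      // monotonically increasing counter
    private final Queue<String> pendingBatches = new ArrayDeque<>();

    // Queue a batch of modifications without touching the ID counter.
    void queueBatch(String modifications) {
        pendingBatches.add(modifications);
    }

    // State compression may merge or drop queued batches; since no ID has
    // been assigned yet, no ID is wasted on batches that are never sent.
    void compress() {
        if (pendingBatches.size() > 1) {
            String merged = String.join(";", pendingBatches);
            pendingBatches.clear();
            pendingBatches.add(merged);
        }
    }

    // Only batches that are actually sent get an ID, so the sequence of
    // committed (and later purged) IDs has no holes and the frontend's
    // RangeSet can coalesce it into a single range.
    void flush() {
        String batch;
        while ((batch = pendingBatches.poll()) != null) {
            long txnId = nextTxnId++;
            System.out.println("sending txn-" + txnId + ": " + batch);
        }
    }
}
{code}

The design choice is simply to move the allocation from queue time to send time; everything else about batching stays the same.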
| Comment by Robert Varga [ 30/Aug/17 ] |
|
This is a Carbon -> Carbon SR1 regression, although memory usually leaks very slowly. |
| Comment by Vratko Polak [ 06/Sep/17 ] |
|
> Patch https://git.opendaylight.org/gerrit/62453

Looks like everything has been merged. |