[CONTROLLER-1864] An 8s "non-GC world stop JVM pause" during snapshot writes Created: 20/Sep/18 Updated: 22/Sep/18 Resolved: 22/Sep/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | None |
| Affects Version/s: | Boron |
| Fix Version/s: | Carbon |
| Type: | Bug | Priority: | Medium |
| Reporter: | Michael Vorburger | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Description |
|
Attached GCLog.log.0.current has this:

2018-09-05T18:23:40.277+0000: 39892.345: Total time for which application threads were stopped: 8.3785112 seconds, Stopping threads took: 8.3765855 seconds

Attached Run.jfr (7z) was provided by Ericsson, from their internal downstream distribution based on Boron. (This is the same file that they made available via the filehosting.org link on this SO QA. It is best analyzed using the JDK Mission Control from OpenJDK v11 from http://jdk.java.net/jmc/. It was produced via https://wiki.opendaylight.org/view/HowToProfilePerformance.) |
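The log line above already hints at the cause: nearly all of the 8.38 s was spent in "Stopping threads took", i.e. waiting for application threads to reach a safepoint, and only about 1.9 ms in the stop-the-world operation itself. A minimal sketch of pulling those two numbers out of such a line follows; the class name `SafepointPauseScan` and its `parse` helper are hypothetical, and the pattern assumes the JDK 8 style log format shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ODL or the JDK.
class SafepointPauseScan {
    // Matches the JDK 8 style "application stopped" record quoted in the description.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds, "
        + "Stopping threads took: ([0-9.]+) seconds");

    /** Returns {totalSeconds, stoppingSeconds}, or null if the line is not such a record. */
    static double[] parse(String line) {
        Matcher m = STOPPED.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2))
        };
    }

    public static void main(String[] args) {
        String line = "2018-09-05T18:23:40.277+0000: 39892.345: "
            + "Total time for which application threads were stopped: 8.3785112 seconds, "
            + "Stopping threads took: 8.3765855 seconds";
        double[] p = parse(line);
        // total - stopping = time spent in the actual stop-the-world operation
        System.out.printf("total=%.7f s, reaching safepoint=%.7f s, operation=%.7f s%n",
            p[0], p[1], p[0] - p[1]);
    }
}
```

When the second number is almost as large as the first, as here, the pause is dominated by time-to-safepoint rather than by GC work, which matches the "non-GC" wording in the issue title.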
| Comments |
| Comment by Michael Vorburger [ 22/Sep/18 ] |
|
There was no "fix" here, but I'm filing this analysis as an (immediately closed) JIRA issue for future reference:

Therefore they were still using the default akka.persistence.snapshot.local.LocalSnapshotStore (seen in the stack trace yesterday, below) instead of ODL's custom, and optimized, org.opendaylight.controller.cluster.persistence.LocalSnapshotStore! Either you "just forgot" that, or, more likely, that was made after Boron... for the full history of this story, see:

* https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-clustering-commons/src/main/java/org/opendaylight/controller/cluster/persistence/LocalSnapshotStore.java

Before this fix with a custom LocalSnapshotStore, the default one from Akka would create (x3 ?) a HUGE byte[] instead of "streaming directly to the file like the new custom one, so the JVM was not able to park it mid-stream, as Akka's version may hold up the safepoint while serializing to a byte[]" (summarized from private email exchange with Tom).

PS: An earlier private email about this problem suspected mmap (memory-mapped file) related issues from LevelDB (native or pure Java impl) blocking the world before GC, but that was a wrong initial conclusion; the snapshot (contrary to the journal) actually has absolutely nothing to do with LevelDB. |
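To illustrate the difference described above, here is a minimal sketch contrasting the two write shapes: serializing the whole snapshot into a byte[] first, versus streaming it straight to the file. This is not the code of Akka's or ODL's LocalSnapshotStore; the class `SnapshotWriteSketch`, both method names, and the use of plain Java serialization are assumptions made for illustration only.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;

// Hypothetical illustration; neither method is taken from Akka or ODL.
class SnapshotWriteSketch {

    /** Buffering shape: the whole serialized snapshot lives on the heap before any file I/O. */
    static void writeBuffered(Serializable snapshot, Path file) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(snapshot);
        }
        // toByteArray() makes another full copy of the internal buffer.
        byte[] all = buf.toByteArray();
        Files.write(file, all);
    }

    /** Streaming shape: bytes go to the file in small chunks as they are produced. */
    static void writeStreaming(Serializable snapshot, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(snapshot);
        }
    }

    public static void main(String[] args) throws IOException {
        ArrayList<String> snapshot = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            snapshot.add("entry-" + i);
        }
        Path buffered = Files.createTempFile("snapshot-buffered", ".bin");
        Path streamed = Files.createTempFile("snapshot-streamed", ".bin");
        writeBuffered(snapshot, buffered);
        writeStreaming(snapshot, streamed);
        System.out.println("buffered bytes: " + Files.size(buffered)
            + ", streamed bytes: " + Files.size(streamed));
        Files.delete(buffered);
        Files.delete(streamed);
    }
}
```

Both methods produce the same file content; the difference is only in peak heap usage and in how large the in-memory array operations are, which is the aspect the comment above ties to the long pause.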