[CONTROLLER-1864] A 8s "non-GC world stop JVM pause" during snapshot writes Created: 20/Sep/18  Updated: 22/Sep/18  Resolved: 22/Sep/18

Status: Resolved
Project: controller
Component/s: None
Affects Version/s: Boron
Fix Version/s: Carbon

Type: Bug Priority: Medium
Reporter: Michael Vorburger Assignee: Michael Vorburger
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File GCLog.log.0.current     File Run.7z     File configuration_factory_akka.conf     File configuration_initial_akka.conf    
Issue Links:
Relates
relates to CONTROLLER-1576 OOM when persisting a snapshot of the... Resolved
relates to OVSDB-469 Very high String allocation pressure ... Resolved

 Description   

Attached GCLog.log.0.current has this:

2018-09-05T18:23:40.277+0000: 39892.345: Total time for which application threads were stopped: 8.3785112 seconds, Stopping threads took: 8.3765855 seconds

Attached Run.jfr (7z) provided by Ericsson, from their internal downstream distribution based on Boron. (This is the same that they made available via the filehosting.org link on this SO QA. Best analyzed using the JDK Mission Control from OpenJDK v11 from http://jdk.java.net/jmc/. Produced via https://wiki.opendaylight.org/view/HowToProfilePerformance.)



 Comments   
Comment by Michael Vorburger [ 22/Sep/18 ]

There was no "fix" here, but I'm filing this analysis here as an (immediately closed) JIRA issue for future ref:
 If we very carefully comb over  E// attached initial/akka.conf, we can see that it is missing this line which upstream has:
 
snapshot-store.local.class = "org.opendaylight.controller.cluster.persistence.LocalSnapshotStore"

Thereforefore they were still using the default akka.persistence.snapshot.local.LocalSnapshotStore (seen in the stack trace yesterdday, below) instead of ODL's custom, and optimzed, org.opendaylight.controller.cluster.persistence.LocalSnapshotStore! Either you "just forgot" that, or, more likely, that was made after Boron... see here for the full history of this story in CONTROLLER-1576 and:

https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-clustering-commons/src/main/java/org/opendaylight/controller/cluster/persistence/LocalSnapshotStore.java
https://github.com/opendaylight/controller/commits/master/opendaylight/md-sal/sal-clustering-commons/src/main/java/org/opendaylight/controller/cluster/persistence/LocalSnapshotStore.java
https://git.opendaylight.org/gerrit/#/q/topic:bug/7521

Before this fix with a custom LocalSnapshotStore, the default one from Akka would create (x3 ?) a HUGE byte[] instead of "streaming directly to the file like the new custom one, so the JVM was not able to park it mid-stream, as Akka's version may hold up the safepoint while serializing to a byte[]" (summarized from private email exchange with Tom).
 
They als ran ODL with a huge 24 GB heap, which does not help for GC, saying "[Dhiman] The reason for huge heap size is i) to tackle with huge memory allocation during PM counter’s serialization (jolokia)", and that problem was solved in Jolokia 2.5 years ago, see https://github.com/rhuss/jolokia/issues/391; "and ii) Humongous allocation for huge byte array creations before writing the snapshots" which was solved by CONTROLLER-1576 (above) 1.5 years ago, according to the JavaDoc on org.opendaylight.controller.cluster.persistence.LocalSnapshotStore.

PS: Earlier private email about this problem suspected mmp memory mapping files related issues from LevelDB (native or pure Java impl) blocking the world before GC, but that was a wrong initial conclusion; the snapshot (contrary to the journal) actually has absolutely nothing to do with LevelDB.

Generated at Wed Feb 07 19:56:38 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.