[CONTROLLER-1845] Karaf takes 7 minutes to start Created: 02/Jul/18 Updated: 05/Oct/18 Resolved: 05/Oct/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | karaf |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Lowest |
| Reporter: | Victor Pickard | Assignee: | Arun Sharma |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
| Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
| Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Sub-Tasks: |
|
| Description |
|
I've been running the ODL Clustering CSIT job from stable/oxygen and analyzing failures, as part of the clustering stabilization effort. During one of these runs, the cluster fails to reach sync status after ODL is restarted (the Tell Based False variant). I looked at the node that failed to sync and saw that karaf took just over 7 minutes to start up for this test case.
Here are the relevant karaf logs:
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
I'll also attach the (annotated) karaf log.
|
| Comments |
| Comment by Victor Pickard [ 02/Jul/18 ] |
|
Here are the steps from the controller CSIT job for restarting ODL (Restart Odl With Tell Based False).
6. Copy logs back to new snapshot dir, as below: # mkdir -p '/tmp/karaf-0.8.3-SNAPSHOT/data' && rm -vrf '/tmp/karaf-0.8.3-SNAPSHOT/log' && mv -vf '/tmp/log' '/tmp/karaf-0.8.3-SNAPSHOT/data/ |
| Comment by Victor Pickard [ 02/Jul/18 ] |
|
Tom, I'll have to get this to fail again in CSIT, and will share the link to the run, which will have the karaf thread dumps you asked for to help debug this issue.
|
| Comment by Tom Pantelis [ 02/Jul/18 ] |
|
If it is due to karaf taking 7 minutes to re-install features: feature installation does hit the disk a lot, so I wonder if it's an intermittent disk performance issue in the Jenkins environment. |
| Comment by Victor Pickard [ 03/Jul/18 ] |
|
Would need a thread dump when this issue occurs.
Need to find a reliable way to reproduce this issue.
kill -3 on the JVM pid will send the thread dump to stdout (karaf.out)
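A minimal sketch of that thread-dump step (the pgrep pattern is an assumption about how the karaf JVM appears in the process list; adjust it to your install):

```shell
# Send SIGQUIT (kill -3) to the karaf JVM; the JVM prints a full thread
# dump to its stdout, which karaf redirects to karaf.out.
# The [M] bracket keeps this pattern from matching our own command line.
KARAF_PID=$(pgrep -f 'org.apache.karaf.main.[M]ain' | head -n1)
if [ -n "$KARAF_PID" ]; then
    kill -3 "$KARAF_PID"
    echo "sent SIGQUIT to $KARAF_PID; check karaf.out for the dump"
else
    echo "no karaf process found"
fi
```

Note that kill -3 does not terminate the JVM; it only asks it to print all thread stacks.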
Did CSIT start karaf twice? |
| Comment by Jamo Luhrsen [ 03/Jul/18 ] |
|
vorburger, tpantelis, vpickard,
which part of the karaf log in the description is "weird" and maybe a problem in csit? I will investigate. |
| Comment by Michael Vorburger [ 03/Jul/18 ] |
|
I don't have the full context of what is going on here or what this CSIT does, but the first thing that pops out at me is that what is shown in the description of this issue above looks like two Karaf instances started, one at 3:43:47 and then another at 3:50:48 - not one Karaf that took just a little over 7 minutes to start up. (Look closely, and compare it with the output when you start ODL locally - you'll find "half" of what is shown above.) Does this observation help you? |
| Comment by Tom Pantelis [ 03/Jul/18 ] |
|
yes, there should only be one set. It's as if karaf was started, then killed, and then started again. |
| Comment by Jamo Luhrsen [ 03/Jul/18 ] |
|
We don't have the accompanying csit (e.g. robot) logs for the karaf log in the description, but I was hoping, vpickard, that we can get another example of this karaf.log to go with the robot logs. I'll keep looking. |
| Comment by Muthukumaran Kothandaraman [ 04/Jul/18 ] |
|
From the logs, I can see JoinSeedNodeProcess messages failing with retries before each instance of the Karaf restart (org.apache.karaf.main.Main launch). The order of seed nodes must be uniform across all ODL nodes, and the availability of the first seed node is a prerequisite. We can check whether at least the first condition is satisfied in the target deployment.
One clarification: we do not have quarantine-based suicide and restart (QuarantinedMonitorActor) of the ODL process in master - is my understanding correct? I am not seeing any such symptoms in the logs, unless QuarantinedMonitorActor has been refactored under some other name while retaining the suicide + restart behavior. |
| Comment by Tom Pantelis [ 04/Jul/18 ] |
|
The quarantined/restart mechanism is still there. But how is what you stated above relevant to this JIRA? This is about what looks to be a 7 min gap during karaf startup; it's not related to our clustering. |
| Comment by Victor Pickard [ 10/Jul/18 ] |
|
I saw something similar in another run, but in this case, ODL/karaf appears to just hang, i.e., no logs after the first initial startup sequence.
Sandbox link:
Final logs in odl3 karaf log:
Jul 04, 2018 12:43:35 PM org.apache.karaf.main.Main launch
Normal startups look something like this:
Jul 04, 2018 12:24:53 PM org.apache.karaf.main.Main launch
|
| Comment by Tom Pantelis [ 10/Jul/18 ] |
|
Strange. Even in the normal run it took 6s before "Adding features..." was logged. What is it doing during that period? Maybe adding feature repos, installing boot features... we'd need a thread dump (or several) during that period to get an idea. Does CSIT configure org.ops4j.pax.url.mvn.cfg to only use the local system dir, or does it allow external resolution of artifacts? I'm wondering if it's trying to download artifact(s) over the network. |
| Comment by Victor Pickard [ 10/Jul/18 ] |
|
{noformat}
+ cat /tmp/karaf-0.8.3-SNAPSHOT/etc/org.ops4j.pax.url.mvn.cfg
################################################################################
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
################################################################################

#
# If set to true, the following property will not allow any certificate to be used
# when accessing Maven repositories through SSL
#
#org.ops4j.pax.url.mvn.certificateCheck=

#
# Path to the local Maven settings file.
# The repositories defined in this file will be automatically added to the list
# of default repositories if the 'org.ops4j.pax.url.mvn.repositories' property
# below is not set.
# The following locations are checked for the existence of the settings.xml file
# * 1. looks for the specified url
# * 2. if not found looks for ${user.home}/.m2/settings.xml
# * 3. if not found looks for ${maven.home}/conf/settings.xml
# * 4. if not found looks for ${M2_HOME}/conf/settings.xml
#
#org.ops4j.pax.url.mvn.settings=

#
# Path to the local Maven repository which is used to avoid downloading
# artifacts when they already exist locally.
# The value of this property will be extracted from the settings.xml file
# above, or defaulted to:
# System.getProperty( "user.home" ) + "/.m2/repository"
#
org.ops4j.pax.url.mvn.localRepository=${karaf.home}/${karaf.default.repository}

#
# Default this to false. It's just weird to use undocumented repos
#
org.ops4j.pax.url.mvn.useFallbackRepositories=false

#
# Uncomment if you don't wanna use the proxy settings
# from the Maven conf/settings.xml file
#
# org.ops4j.pax.url.mvn.proxySupport=false

#
# Comma separated list of repositories scanned when resolving an artifact.
# Those repositories will be checked before iterating through the
# below list of repositories and even before the local repository
# A repository url can be appended with zero or more of the following flags:
# @snapshots : the repository contains snaphots
# @noreleases : the repository does not contain any released artifacts
#
# The following property value will add the system folder as a repo.
#
org.ops4j.pax.url.mvn.defaultRepositories=\
file:${karaf.home}/${karaf.default.repository}@id=system.repository@snapshots,\
file:${karaf.data}/kar@id=kar.repository@multi@snapshots,\
file:${karaf.base}/${karaf.default.repository}@id=child.system.repository@snapshots

# Use the default local repo (e.g.~/.m2/repository) as a "remote" repo
#org.ops4j.pax.url.mvn.defaultLocalRepoAsRemote=false

#
# Comma separated list of repositories scanned when resolving an artifact.
# The default list includes the following repositories:
# http://repo1.maven.org/maven2@id=central
# http://repository.springsource.com/maven/bundles/release@id=spring.ebr
# http://repository.springsource.com/maven/bundles/external@id=spring.ebr.external
# http://zodiac.springsource.com/maven/bundles/release@id=gemini
# http://repository.apache.org/content/groups/snapshots-group@id=apache@snapshots@noreleases
# https://oss.sonatype.org/content/repositories/snapshots@id=sonatype.snapshots.deploy@snapshots@noreleases
# https://oss.sonatype.org/content/repositories/ops4j-snapshots@id=ops4j.sonatype.snapshots.deploy@snapshots@noreleases
# To add repositories to the default ones, prepend '+' to the list of repositories
# to add.
# A repository url can be appended with zero or more of the following flags:
# @snapshots : the repository contains snapshots
# @noreleases : the repository does not contain any released artifacts
# @id=repository.id : the id for the repository, just like in the settings.xml this is optional but recommended
#
org.ops4j.pax.url.mvn.repositories=https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot@id=opendaylight-snapshot@snapshots, https://nexus.opendaylight.org/content/repositories/public@id=opendaylight-mirror, http://repo1.maven.org/maven2@id=central, http://repository.springsource.com/maven/bundles/release@id=spring.ebr.release, http://repository.springsource.com/maven/bundles/external@id=spring.ebr.external, http://zodiac.springsource.com/maven/bundles/release@id=gemini, http://repository.apache.org/content/groups/snapshots-group@id=apache@snapshots@noreleases, https://oss.sonatype.org/content/repositories/snapshots@id=sonatype.snapshots.deploy@snapshots@noreleases, https://oss.sonatype.org/content/repositories/ops4j-snapshots@id=ops4j.sonatype.snapshots.deploy@snapshots@noreleases
{noformat} |
| Comment by Victor Pickard [ 16/Jul/18 ] |
|
I'm lowering the priority on this issue. This is very sporadic and difficult to reproduce. I do have a patch that should detect when this issue occurs, and do a thread dump: |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
my local docker container running ODL seems to be stuck in this state. I'm not sure what to do, but I don't want to get out of this state without at least trying to figure out some clues. anyone with ideas, please pass them my way. |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
btw, I'm just now realizing that we are not the only place this kind of thing is seen: https://stackoverflow.com/questions/50734332/karaf-does-not-start |
| Comment by Luis Gomez [ 24/Jul/18 ] |
|
In this state, can you use fuser to find which process holds /tmp/karaf-0.8.3-SNAPSHOT/lock?
fuser -v /tmp/karaf-0.8.3-SNAPSHOT/lock
Just to see if this is a file lock problem. |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
yeah, user seems fine:

with karaf stopped:
$ pwd
/home/jluhrsen/jamo/repos/OpenDaylight/netvirt-ha-docker/mounts/1/assembly
$ sudo fuser lock

starting karaf:
$ sudo fuser lock
/home/jluhrsen/jamo/repos/OpenDaylight/netvirt-ha-docker/mounts/1/assembly/lock: 15756
$ ps -elf | rg 15756
0 S root 15756 15585 99 80 0 - 2038025 - 15:52 pts/5 00:04:17 /usr/bin/java -Djava.security.properties=/odlha/karaf/target/assembly/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.172-9.b11.fc28.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.172-9.b11.fc28.x86_64/jre/lib/endorsed:/odlha/karaf/target/assembly/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.172-9.b11.fc28.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.172-9.b11.fc28.x86_64/jre/lib/ext:/odlha/karaf/target/assembly/lib/ext -Dkaraf.instances=/odlha/karaf/target/assembly/instances -Dkaraf.home=/odlha/karaf/target/assembly -Dkaraf.base=/odlha/karaf/target/assembly -Dkaraf.data=/odlha/karaf/target/assembly/data -Dkaraf.etc=/odlha/karaf/target/assembly/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/odlha/karaf/target/assembly/data/tmp -Djava.util.logging.config.file=/odlha/karaf/target/assembly/etc/java.util.logging.properties -Dkaraf.startLocalConsole=true -Dkaraf.startRemoteShell=true -classpath /odlha/karaf/target/assembly/lib/boot/org.apache.karaf.diagnostic.boot-4.1.5.jar:/odlha/karaf/target/assembly/lib/boot/org.apache.karaf.jaas.boot-4.1.5.jar:/odlha/karaf/target/assembly/lib/boot/org.apache.karaf.main-4.1.5.jar:/odlha/karaf/target/assembly/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main
$ cat karaf.log
Jul 24, 2018 10:21:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 10:21:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jul 24, 2018 10:21:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
Jul 24, 2018 10:34:45 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 10:34:45 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jul 24, 2018 10:34:45 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
Jul 24, 2018 10:37:19 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 10:37:19 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jul 24, 2018 10:37:19 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
Jul 24, 2018 10:52:11 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 10:52:11 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jul 24, 2018 10:52:11 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100 |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
since my reproduction is in a docker container, I'm able to start the container and then bounce karaf as needed. Every time I bounce it I get the same little lockAquired messages in karaf.log, but that's it. However, it seems it's still loading the karaf features and maybe working fine. I see that port 6653 (openflow) is listening. It's still a mystery why the karaf.log is not getting any output. The karaf shell "log:display" is also empty. I did see one stdout error when starting the karaf shell, but no idea if it's related at all. here it is: opendaylight-user@root>23:42:39.944 [Blueprint Extender: 2] ERROR org.opendaylight.netvirt.vpnmanager.LearntVpnVipToPortEventProcessor - Failed to register the entity arpmonitoring |
| Comment by Victor Pickard [ 25/Jul/18 ] |
|
One thing to try that may give us some clues is strace.
strace -p <karaf.pid> |
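A slightly fuller invocation of the same idea (the output path and the pgrep-based PID lookup are illustrative, not from the CSIT job):

```shell
# Attach strace to the running karaf JVM: -f follows all threads,
# -tt timestamps every syscall, -o writes to a file for later analysis.
# The [M] bracket keeps the pattern from matching our own command line.
KARAF_PID=$(pgrep -f 'org.apache.karaf.main.[M]ain' | head -n1)
if [ -n "$KARAF_PID" ]; then
    strace -f -tt -o /tmp/karaf.strace -p "$KARAF_PID"
else
    echo "no karaf process found"
fi
```

If karaf is hung on disk or network I/O, the tail of /tmp/karaf.strace should show the blocked syscall.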
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
can't do this in my current setup. docker security stops me. I can maybe start a new container with flags to allow ptrace. |
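For the docker restriction: by default Docker drops CAP_SYS_PTRACE and its seccomp profile blocks the ptrace(2) syscall, which is what strace needs. A container started with flags along these lines (image and container names are placeholders) should allow strace to run inside it:

```shell
# Placeholder image/name; the flags are what matter:
# --cap-add=SYS_PTRACE grants the ptrace capability, and
# --security-opt seccomp=unconfined lifts the default seccomp profile
# that otherwise blocks ptrace(2) inside the container.
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --name odl-debug odl-image:latest
```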
| Comment by Stephen Kitt [ 25/Jul/18 ] |
|
It turns out this is caused by the Pax Logging configuration file having the wrong contents, so the logs aren’t stored in the Karaf log file. Jamo reproduced this, and had a configuration file which only contained the lines from qosservice.cfg. So it seems that QosAlertGenerator in NetVirt is involved — it rewrites the configuration file... When this happens, the container itself is functional. That explains the seven minute thing: the container started, with no Pax logging (so only the initial lines made it), but the container worked so CSIT continued, ended up killing Karaf and starting it again. There is no double-start causing locking issues, it’s two Karaf instances being started in succession but with no logging. |
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
nice find. I assume it's getting overwritten with log4j settings instead of log4j2. |
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
yeah, this is starting to explain a lot of seeming weirdness. It explains why I couldn't figure out my logs and timestamps. We (skitt, thapar and I) saw a corrupted logging config file when I started karaf and then quickly exited with a CTRL-C. So we have something troublesome that can pop up if we stop (kill -9, CTRL-C, maybe other ways) a node at the wrong time. I'll see if I can find a more reliable way to reproduce this, and see if someone can think of a safer way to have netvirt rewrite this logging config. |
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
ok, I can reproduce this locally. The bottom line, I think, is that if we kill the karaf process at just the wrong time, the logging config file ends up corrupted. My most recent reproduction saw the log file completely empty, but I've seen it in various truncated states. This bash script can be used for reproduction; just replace the ${LOCATION} env var with the root of your karaf distribution:

#!/bin/bash
while true
do
    # start karaf
    ${LOCATION}/karaf/target/assembly/bin/start

    # pick a random value between 20-40 seconds for when we kill karaf
    sleepy_time=$((RANDOM % (40 - 20 + 1) + 20))
    echo "Waiting $sleepy_time seconds before killing karaf"
    sleep $sleepy_time

    # kill it
    ps aux | grep karaf.main | grep -v grep | awk '{print "kill -9",$2}' | sh

    # check the line count is right in the logging config file
    linecount=`cat ${LOCATION}/karaf/target/assembly/etc/org.ops4j.pax.logging.cfg | wc -l`
    if [ "$linecount" -ne "107" ]; then
        echo "might have caught the mysterious config file corruption monster"
        exit 1
    fi
done |
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
It seems odd for QosAlertGenerator to add fixed logging configuration in code at runtime; usually such configuration is set up on deployment if desired. I don't know the history of this, but perhaps we should consider removing it. |
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
yeah. Or could there be a more atomic way to modify the file, such that a badly timed kill won't cause this? |
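One conventional answer, as a minimal sketch with demo paths (not how QosAlertGenerator actually does it): write the complete new contents to a temporary file on the same filesystem, then rename it over the original. POSIX rename(2) is atomic, so a kill -9 at any instant leaves either the whole old file or the whole new file, never a truncated mix.

```shell
CFG=/tmp/demo-org.ops4j.pax.logging.cfg   # demo path, not the real etc/ file
echo "old contents" > "$CFG"

# Build the replacement next to the target so mv is a same-filesystem rename
TMP="${CFG}.tmp.$$"
printf '%s\n' "new contents" > "$TMP"
sync                    # flush the new data to disk before publishing it
mv -f "$TMP" "$CFG"     # atomic: readers see old or new, never partial

cat "$CFG"              # prints "new contents"
```

The key constraint is that the temp file and the target live on the same filesystem; a cross-filesystem mv degrades to copy+delete and loses the atomicity.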
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
That's the risk/danger of killing any process - could leave something in a corrupted state. The levelDB files could get corrupted if it was in the middle of updating a table... |
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
of course. and a power outage, OOM crash, kernel panic... |
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
You could try deleting the etc/org.opendaylight.netvirt.qosservice.cfg file for your testing if this issue is blocking it. However, looking at QosAlertGenerator, it looks like it would blow up with an NPE if the file doesn't exist, but I can push a quick patch for that. |
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
ok, what's the theory here? Without that file, there will be no actual attempt to modify that logging config file, so we won't have the chance of getting a corrupted file at all? Not sure if an NPE will hurt me, will it? Not in the case where I'm trying to reproduce the corruption, anyway. |
| Comment by Tom Pantelis [ 26/Jul/18 ] |
|
The qos service bundle would bomb out and fail to start - you don't care about that for your testing, but it could affect the diag status checks. For your testing I would suggest not installing integration-all - all you need is odl-clustering-test-app (i.e. cars/people). |
| Comment by Jamo Luhrsen [ 26/Jul/18 ] |
|
yes. good point. I got ahead of myself here, but at least we found some answers to some mysteries. back to |
| Comment by Ariel Adam [ 26/Jul/18 ] |
|
Given that this isn't a clustering problem but a NetVirt bug, moving to Sam. |
| Comment by Michael Vorburger [ 26/Jul/18 ] |
|
https://lists.opendaylight.org/pipermail/netvirt-dev/2018-July/007293.html |
| Comment by Michael Vorburger [ 26/Jul/18 ] |
|
|