Details
Type: Bug
Status: Verified
Priority: Highest
Resolution: Done
Labels: None
Description
This issue is seen in 3-node (a.k.a. cluster) CSIT and happens when bringing up a node
that had previously been killed. With a trimmed-down version of this ovsdb job
I was only able to hit the problem once in ~75 tries, so it does seem to be very
infrequent.
I have noticed this in a 3-node openflowplugin job, as well as on the sodium branch.
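For anyone wanting to poke at this by hand, the CSIT sequence boils down to something like the following (a rough sketch with assumed commands, not the actual robot keywords; the member name and karaf path are taken from the logs below):

```sh
# Rough sketch of the manual reproduction (assumed commands, not the
# actual CSIT robot keywords). The member (ODL2) and the
# /tmp/karaf-0.12.0-SNAPSHOT path come from the logs below.
ssh "$ODL2" 'kill -9 $(pgrep -f org.apache.karaf.main.Main)'  # hard-kill the member
sleep 30                                                      # "started a short time later"
ssh "$ODL2" '/tmp/karaf-0.12.0-SNAPSHOT/bin/start'            # bring it back up
ssh "$ODL2" 'tail -f /tmp/karaf-0.12.0-SNAPSHOT/data/log/karaf.log'
```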
The problem reproduced in the sandbox (logs saved for 6 months):
- robot log
- karaf log of the node having trouble
The full log is linked above, but here is a rough walk-through of what I can see:
- karaf is killed, then started a short time later:
2020-03-17T09:53:47,235 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Start OVS Multiple Connections" | core | 119 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Kill Candidate Instance
2020-03-17T09:53:47,535 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Start OVS Multiple Connections" | core | 119 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Killing ODL2 10.30.170.78
Mar 17, 2020 9:54:16 AM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /tmp/karaf-0.12.0-SNAPSHOT/lock
Mar 17, 2020 9:54:17 AM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Mar 17, 2020 9:54:17 AM org.apache.karaf.main.Main$KarafLockCallback lockAcquired
INFO: Lock acquired. Setting startlevel to 100
2020-03-17T09:54:18,567 | INFO | Start Level: Equinox Container: 19e92b2f-ab93-4cb9-a1bd-1a16ef11e074 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.apache.aries.blueprint.cm/1.3.1 has been started
Just glossing over the logs after startup, it appears that normal startup operations are
underway, but after some time the log starts to get flooded with these:
2020-03-17T09:54:32,561 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:54:32,590 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:54:32,591 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
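As an aside, the remedy the warning itself suggests is what the "Do not use passive connections" patches listed at the bottom ended up applying. A minimal sketch of that override, assuming it goes under the odl-cluster-data block of configuration/initial/akka.conf:

```hocon
# Minimal sketch of the override the warning suggests (and which the
# "Do not use passive connections" patches below apply). Exact placement
# inside configuration/initial/akka.conf is an assumption.
odl-cluster-data {
  akka {
    remote {
      # Do not reuse the peer-initiated (inbound) TCP connection for
      # outbound traffic, which avoids these read-only-association discards.
      use-passive-connections = off
    }
  }
}
```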
Amid the flood of those warnings, this was noticed:
2020-03-17T09:56:00,372 | ERROR | Blueprint Extender: 3 | AbstractDataStore | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up
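If I read the datastore code right, that 90-second figure is the product of two knobs; a sketch of the presumed defaults in etc/org.opendaylight.controller.cluster.datastore.cfg (key names and defaults are my assumption for this release):

```properties
# Presumed source of the 90 s in the message above (assumed key names and
# defaults; verify against this release's datastore cfg):
# settle wait = shard-leader-election-timeout * initial-settle-timeout-multiplier
shard-leader-election-timeout-in-seconds=30
initial-settle-timeout-multiplier=3
```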
Eventually some TimeoutExceptions come approximately 2.5 minutes after starting:
2020-03-17T09:56:21,549 | WARN | ForkJoinPool.commonPool-worker-3 | AbstractShardBackendResolver | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader
A few more of those TimeoutException warnings show up, then a blueprint-related
ERROR:
2020-03-17T09:58:01,620 | ERROR | Blueprint Extender: 3 | PrefixedShardConfigWriter | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Unable to write initial shard config parent.
java.util.concurrent.ExecutionException: org.opendaylight.controller.cluster.access.client.RequestTimeoutException: ModifyTransactionRequest{target=member-2-datastore-Shard-prefix-configuration-shard-fe-2-chn-1-txn-0-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$c#222048557], modifications=0, protocol=SIMPLE} timed out after 120.005185571 seconds. The backend for prefix-configuration-shard is not available.
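The 120-second figure looks like the CDS frontend request timeout; if someone wants to stretch it while debugging, the knob would be something like this (the key name is my assumption, so verify it against this release's datastore cfg):

```properties
# Assumed key name; the 120 s default matches the "timed out after
# 120.005185571 seconds" in the exception above.
frontend-request-timeout-in-seconds=120
```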
Finally, approximately 5.5 minutes in, aaa bundles show up with ERRORs for failing to start, although
my guess is that they are just the first in line to fail, as there is some mention of a timeout
waiting for dependencies. Examples:
2020-03-17T09:59:21,848 | ERROR | Blueprint Extender: 1 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
java.util.concurrent.TimeoutException: null
    at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
    at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
2020-03-17T09:59:21,851 | WARN | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.cert_0.11.0.SNAPSHOT [192] timed out waiting for dependencies - restarting it
2020-03-17T09:59:21,854 | INFO | BlueprintContainerRestartService | BlueprintExtender | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT
2020-03-17T09:59:21,864 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
java.util.concurrent.TimeoutException: null
    at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
    at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
2020-03-17T09:59:21,865 | WARN | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.encrypt-service-impl_0.11.0.SNAPSHOT [194] timed out waiting for dependencies - restarting it
2020-03-17T09:59:21,869 | INFO | BlueprintContainerRestartService | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT is waiting for dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
2020-03-17T09:59:21,870 | INFO | BlueprintContainerRestartService | BlueprintExtender | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT
2020-03-17T09:59:21,875 | INFO | BlueprintContainerRestartService | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT is waiting for dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
2020-03-17T09:59:21,900 | INFO | awaitility[checkBundleDiagInfos] | KarafSystemReady | 239 - org.opendaylight.infrautils.ready-impl - 1.7.0.SNAPSHOT | checkBundleDiagInfos: Elapsed time 298s, remaining time 1s, diag: Booting {Installed=0, Resolved=7, Unknown=0, GracePeriod=17, Waiting=0, Starting=0, Active=356, Stopping=0, Failure=0}
2020-03-17T09:59:21,959 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:59:21,998 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:59:22,037 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.password-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
java.util.concurrent.TimeoutException: null
    at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
    at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
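For context, those unresolved-dependency filters are what the ODL blueprint extension generates for service references; a hypothetical sketch of the kind of declaration involved (not copied from the actual aaa blueprint files):

```xml
<!-- Hypothetical sketch: with ODL's use-default-for-reference-types
     extension, a <reference> like this gets the (|(type=default)(!(type=*)))
     clause appended to its service filter, and the bundle blocks until a
     matching DataBroker service appears, which never happens here because
     the datastore never gets shard leaders. -->
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0"
           xmlns:odl="http://opendaylight.org/blueprint"
           odl:use-default-for-reference-types="true">
  <reference id="dataBroker"
             interface="org.opendaylight.mdsal.binding.api.DataBroker"/>
</blueprint>
```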
The log continues to show these kinds of troubles until the job gives up, approximately 8.5 minutes
after the controller was started.
Attachments
# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
88301,1 | Revert "AAA-180: Fix Dynamic authorization" | master | aaa | ABANDONED | 0 | -1
88302,1 | Revert "AAA-180: Fix Dynamic authorization" | stable/magnesium | aaa | ABANDONED | 0 | +1
88304,1 | Do not use passive connections | stable/magnesium | controller | MERGED | +2 | +1
88305,1 | Do not use passive connections | stable/sodium | controller | MERGED | +2 | +1