Uploaded image for project: 'aaa'
  1. aaa
  2. AAA-195

aaa bundles failing to start

XMLWordPrintable

      This issue is seen in 3node (aka cluster) CSIT and happens when bringing up a node
      that had previously been killed. With a trimmed down version of this ovsdb job
      I was only able to hit the problem once in ~75 tries so it does seem to be very
      infrequent.

      I have noticed this in a 3node openflowplugin job, as well as in the sodium branch.

      The problem reproduced in the sandbox (logs saved for 6 months):

      robot log
      karaf log of node having trouble

      the full log is above, but sort of a walk through from what I can see:

      • karaf is killed then started a short time later:
        2020-03-17T09:53:47,235 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Start OVS Multiple Connections" | core                             | 119 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Kill Candidate Instance
        2020-03-17T09:53:47,535 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Start OVS Multiple Connections" | core                             | 119 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Killing ODL2 10.30.170.78
        Mar 17, 2020 9:54:16 AM org.apache.karaf.main.lock.SimpleFileLock lock
        INFO: Trying to lock /tmp/karaf-0.12.0-SNAPSHOT/lock
        Mar 17, 2020 9:54:17 AM org.apache.karaf.main.lock.SimpleFileLock lock
        INFO: Lock acquired
        Mar 17, 2020 9:54:17 AM org.apache.karaf.main.Main$KarafLockCallback lockAcquired
        INFO: Lock acquired. Setting startlevel to 100
        2020-03-17T09:54:18,567 | INFO  | Start Level: Equinox Container: 19e92b2f-ab93-4cb9-a1bd-1a16ef11e074 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.apache.aries.blueprint.cm/1.3.1 has been started
        

      Just glossing over the logs after startup, it appears that normal startup operations are
      underway, but after some time the log starts to get flooded with these:

      020-03-17T09:54:32,561 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
      2020-03-17T09:54:32,590 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
      2020-03-17T09:54:32,591 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
      

      This was noticed in the flood of the above logs:

      2020-03-17T09:56:00,372 | ERROR | Blueprint Extender: 3 | AbstractDataStore                | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up
      

      eventually some timeoutexceptions come aprox 2.5m after starting:

      2020-03-17T09:56:21,549 | WARN  | ForkJoinPool.commonPool-worker-3 | AbstractShardBackendResolver     | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Failed to resolve shard
      java.util.concurrent.TimeoutException: Shard has no current leader
      

      a few more of those timeoutexception warnings show up, then some blueprint
      related ERROR:

      2020-03-17T09:58:01,620 | ERROR | Blueprint Extender: 3 | PrefixedShardConfigWriter        | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Unable to write initial shard config parent.
      java.util.concurrent.ExecutionException: org.opendaylight.controller.cluster.access.client.RequestTimeoutException: ModifyTransactionRequest{target=member-2-datastore-Shard-prefix-configuration-shard-fe-2-chn-1-txn-0-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$c#222048557], modifications=0, protocol=SIMPLE} timed out after 120.005185571 seconds. The backend for prefix-configuration-shard is not available.
      

      finally after aprox 5.5m, aaa bundles show up with ERRORs for failing to start, although
      my guess is that they are just the first in line to fail as there is some mention of a timeout
      waiting for dependencies. examples:

      2020-03-17T09:59:21,848 | ERROR | Blueprint Extender: 1 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
      java.util.concurrent.TimeoutException: null
      	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
      	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      2020-03-17T09:59:21,851 | WARN  | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker           | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.cert_0.11.0.SNAPSHOT [192] timed out waiting for dependencies - restarting it
      2020-03-17T09:59:21,854 | INFO  | BlueprintContainerRestartService | BlueprintExtender                | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT
      2020-03-17T09:59:21,864 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
      java.util.concurrent.TimeoutException: null
      	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
      	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      2020-03-17T09:59:21,865 | WARN  | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker           | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.encrypt-service-impl_0.11.0.SNAPSHOT [194] timed out waiting for dependencies - restarting it
      2020-03-17T09:59:21,869 | INFO  | BlueprintContainerRestartService | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT is waiting for dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
      2020-03-17T09:59:21,870 | INFO  | BlueprintContainerRestartService | BlueprintExtender                | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT
      2020-03-17T09:59:21,875 | INFO  | BlueprintContainerRestartService | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT is waiting for dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
      2020-03-17T09:59:21,900 | INFO  | awaitility[checkBundleDiagInfos] | KarafSystemReady                 | 239 - org.opendaylight.infrautils.ready-impl - 1.7.0.SNAPSHOT | checkBundleDiagInfos: Elapsed time 298s, remaining time 1s, diag: Booting {Installed=0, Resolved=7, Unknown=0, GracePeriod=17, Waiting=0, Starting=0, Active=356, Stopping=0, Failure=0}
      2020-03-17T09:59:21,959 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
      2020-03-17T09:59:21,998 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
      2020-03-17T09:59:22,037 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.password-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
      java.util.concurrent.TimeoutException: null
      	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
      	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      

      The log continues to show these kinds of troubles until the job gives up which is aprox 8.5m
      after the controller was started.

            rovarga Robert Varga
            jluhrsen Jamo Luhrsen
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: