[AAA-195] aaa bundles failing to start Created: 17/Mar/20  Updated: 23/Mar/20  Resolved: 23/Mar/20

Status: Verified
Project: aaa
Component/s: General
Affects Version/s: None
Fix Version/s: Magnesium SR1

Type: Bug Priority: Highest
Reporter: Jamo Luhrsen Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: CSIT, csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

This issue is seen in 3node (aka cluster) CSIT and happens when bringing up a node
that had previously been killed. With a trimmed-down version of this ovsdb job
I was only able to hit the problem once in ~75 tries, so it does seem to be very
infrequent.

I have noticed this in a 3node openflowplugin job, as well as in the sodium branch.

The problem was reproduced in the sandbox (logs saved for 6 months):

robot log
karaf log of the node having trouble

The full log is linked above, but here is a rough walk-through of what I can see:

  • Karaf is killed, then started again a short time later:
    2020-03-17T09:53:47,235 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Start OVS Multiple Connections" | core                             | 119 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Kill Candidate Instance
    2020-03-17T09:53:47,535 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test ovsdb-upstream-clustering.txt.Start OVS Multiple Connections" | core                             | 119 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Killing ODL2 10.30.170.78
    Mar 17, 2020 9:54:16 AM org.apache.karaf.main.lock.SimpleFileLock lock
    INFO: Trying to lock /tmp/karaf-0.12.0-SNAPSHOT/lock
    Mar 17, 2020 9:54:17 AM org.apache.karaf.main.lock.SimpleFileLock lock
    INFO: Lock acquired
    Mar 17, 2020 9:54:17 AM org.apache.karaf.main.Main$KarafLockCallback lockAcquired
    INFO: Lock acquired. Setting startlevel to 100
    2020-03-17T09:54:18,567 | INFO  | Start Level: Equinox Container: 19e92b2f-ab93-4cb9-a1bd-1a16ef11e074 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.apache.aries.blueprint.cm/1.3.1 has been started
    

Just glossing over the logs after startup, it appears that normal startup operations are
underway, but after some time the log starts to get flooded with these:

2020-03-17T09:54:32,561 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:54:32,590 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:54:32,591 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.

Amid the flood of the above warnings, this was noticed:

2020-03-17T09:56:00,372 | ERROR | Blueprint Extender: 3 | AbstractDataStore                | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up

Eventually some TimeoutExceptions show up, approximately 2.5 minutes after starting:

2020-03-17T09:56:21,549 | WARN  | ForkJoinPool.commonPool-worker-3 | AbstractShardBackendResolver     | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader

A few more of those TimeoutException warnings show up, then a blueprint-related
ERROR:

2020-03-17T09:58:01,620 | ERROR | Blueprint Extender: 3 | PrefixedShardConfigWriter        | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Unable to write initial shard config parent.
java.util.concurrent.ExecutionException: org.opendaylight.controller.cluster.access.client.RequestTimeoutException: ModifyTransactionRequest{target=member-2-datastore-Shard-prefix-configuration-shard-fe-2-chn-1-txn-0-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$c#222048557], modifications=0, protocol=SIMPLE} timed out after 120.005185571 seconds. The backend for prefix-configuration-shard is not available.

Finally, approximately 5.5 minutes in, aaa bundles show up with ERRORs for failing to start, although
my guess is that they are just the first in line to fail, since there is some mention of a timeout
waiting for dependencies. Examples:

2020-03-17T09:59:21,848 | ERROR | Blueprint Extender: 1 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
java.util.concurrent.TimeoutException: null
	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
2020-03-17T09:59:21,851 | WARN  | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker           | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.cert_0.11.0.SNAPSHOT [192] timed out waiting for dependencies - restarting it
2020-03-17T09:59:21,854 | INFO  | BlueprintContainerRestartService | BlueprintExtender                | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT
2020-03-17T09:59:21,864 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
java.util.concurrent.TimeoutException: null
	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
2020-03-17T09:59:21,865 | WARN  | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker           | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.encrypt-service-impl_0.11.0.SNAPSHOT [194] timed out waiting for dependencies - restarting it
2020-03-17T09:59:21,869 | INFO  | BlueprintContainerRestartService | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT is waiting for dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
2020-03-17T09:59:21,870 | INFO  | BlueprintContainerRestartService | BlueprintExtender                | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT
2020-03-17T09:59:21,875 | INFO  | BlueprintContainerRestartService | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT is waiting for dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
2020-03-17T09:59:21,900 | INFO  | awaitility[checkBundleDiagInfos] | KarafSystemReady                 | 239 - org.opendaylight.infrautils.ready-impl - 1.7.0.SNAPSHOT | checkBundleDiagInfos: Elapsed time 298s, remaining time 1s, diag: Booting {Installed=0, Resolved=7, Unknown=0, GracePeriod=17, Waiting=0, Starting=0, Active=356, Stopping=0, Failure=0}
2020-03-17T09:59:21,959 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:59:21,998 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:59:22,037 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl           | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.password-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
java.util.concurrent.TimeoutException: null
	at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
	at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

The log continues to show these kinds of failures until the job gives up, approximately 8.5 minutes
after the controller was started.



 Comments   
Comment by Jamo Luhrsen [ 18/Mar/20 ]

I finally have a semi-reliable way to reproduce this. This patch essentially strips all ovsdb
work/test-cases from the suite and just repeats the kill/restart steps 10x. The per-iteration frequency
of occurrence is not any higher, but after running only 30 jobs I saw the problem 6 times; the 30 jobs
take approximately 8 hours to complete if run back to back.

I will start a job with a revert of our first suspect from AAA-180 and see what that looks like after repeating
the job.

Comment by Jamo Luhrsen [ 19/Mar/20 ]

The problem is still there both with the distro from the revert patch (which is Magnesium-based) and in
Sodium SR2. I am trying the original Sodium release now to see whether it is there as well.

Comment by Venkatrangan Govindarajan [ 19/Mar/20 ]

Akka ports are unreachable:

,103 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-41 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.23:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.68:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.8:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.8:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.68:2550: Unreachable [Unreachable] (3)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.23:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.68:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.8:2550 Up seen=true]

Comment by Jamo Luhrsen [ 19/Mar/20 ]

Wondering if it's because of the broken controller doing this:

2020-03-17T09:54:30,760 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader                   | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.

I see what looks like all three controllers noticing the restarted controller coming up and starting to sync with each
other. I see Leadership decision messages (Candidate to Follower or Leader, etc.). I also see messages
like this coming from the broken controller:

2020-03-17T09:54:28,670 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.78:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1614826351]] to [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550]
2020-03-17T09:54:28,671 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.78:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1614826351]] (version [2.5.26])

rovarga has a patch to disable "passive" connections, which is exactly what the discard log above suggests. I'll try
reproducing this with that patch next. It's merged in Aluminium/master now.
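For reference, here is a minimal sketch of what disabling passive remoting connections looks like in an
akka.conf (HOCON). It is illustrative only, not copied from rovarga's actual patch; the odl-cluster-data
wrapper and the file placement are assumptions about a typical ODL member configuration:

    # Hypothetical excerpt of a cluster member's akka.conf; the wrapper block and
    # placement are assumed, not taken from the actual controller change.
    odl-cluster-data {
      akka {
        remote {
          # The knob named in the EndpointReader warning: stop classic remoting from
          # sending outbound traffic over inbound ("passive") connections, so each
          # side only uses associations it initiated itself.
          use-passive-connections = off
        }
      }
    }

(The warning's other suggestion, switching to Artery TCP, would change the remoting transport itself rather
than just this setting.)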

BTW, this issue was also seen in the original Sodium release, so clearly it's been around for a while now and we are
just now noticing and following up. Originally I thought it might be some new issue, but I was wrong.

Comment by Jamo Luhrsen [ 20/Mar/20 ]

The patch to disable passive connections seems to resolve this: 400+ iterations without the failure now.
Recall that before, in 300 tries, I saw the problem 6 times. I'll let it keep running through the day just in
case.

Here are the cherry-picks to Sodium and Magnesium that I think we want to get in as well:

https://git.opendaylight.org/gerrit/c/controller/+/88304
https://git.opendaylight.org/gerrit/c/controller/+/88305

Comment by Jamo Luhrsen [ 21/Mar/20 ]

Totally convinced now: 80+ iterations without a failure. Once the cherry-picks are merged I will close this as resolved.

Comment by Jamo Luhrsen [ 23/Mar/20 ]

Thanks for the fix. I love it when we fix these random, sporadic failures that end up annoying us at the worst times (like when releasing).
