[AAA-195] aaa bundles failing to start Created: 17/Mar/20 Updated: 23/Mar/20 Resolved: 23/Mar/20 |
|
| Status: | Verified |
| Project: | aaa |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | Magnesium SR1 |
| Type: | Bug | Priority: | Highest |
| Reporter: | Jamo Luhrsen | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | CSIT, csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
This issue is seen in 3node (aka cluster) CSIT and happens when bringing up a node. I have noticed this in a 3node openflowplugin job, as well as on the sodium branch. The problem reproduced in the sandbox (logs saved for 6 months): robot log. The full log is linked above, but here is a rough walkthrough of what I can see:
Just glossing over the logs after startup, it appears that normal startup operations are under way, but the log is flooded with warnings like these:

2020-03-17T09:54:32,561 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:54:32,590 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:54:32,591 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.

This was noticed in the flood of the above logs:

2020-03-17T09:56:00,372 | ERROR | Blueprint Extender: 3 | AbstractDataStore | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up

Eventually, some TimeoutExceptions come approx. 2.5 minutes after starting:

2020-03-17T09:56:21,549 | WARN | ForkJoinPool.commonPool-worker-3 | AbstractShardBackendResolver | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader

A few more of those TimeoutException warnings show up, then some blueprint-related errors:
2020-03-17T09:58:01,620 | ERROR | Blueprint Extender: 3 | PrefixedShardConfigWriter | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.11.0.SNAPSHOT | Unable to write initial shard config parent.
java.util.concurrent.ExecutionException: org.opendaylight.controller.cluster.access.client.RequestTimeoutException: ModifyTransactionRequest{target=member-2-datastore-Shard-prefix-configuration-shard-fe-2-chn-1-txn-0-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$c#222048557], modifications=0, protocol=SIMPLE} timed out after 120.005185571 seconds. The backend for prefix-configuration-shard is not available.
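As an aside for anyone triaging this: when the "Shard has no current leader" timeouts show up, the per-shard Raft state can usually be read over Jolokia. A minimal sketch of that check; the host, the admin/admin credentials, the shard name (member-2-shard-default-config) and the MBean naming are assumptions based on a stock ODL setup, so adjust them for the member and shard actually being inspected:

    import requests  # assumes the requests library is available on the checking host

    # Assumed defaults for a stock ODL member; host taken from the logs above.
    HOST = "10.30.170.82"
    MBEAN = ("org.opendaylight.controller:Category=Shards,"
             "name=member-2-shard-default-config,type=DistributedConfigDatastore")

    resp = requests.get(f"http://{HOST}:8181/jolokia/read/{MBEAN}",
                        auth=("admin", "admin"), timeout=10)
    resp.raise_for_status()
    value = resp.json().get("value", {})
    # RaftState should settle to Leader or Follower; anything else (or an empty Leader)
    # lines up with the "Shard has no current leader" timeouts above.
    print("RaftState:", value.get("RaftState"), "Leader:", value.get("Leader"))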
Finally, after approx. 5.5 minutes, the aaa bundles show up with ERRORs for failing to start:

2020-03-17T09:59:21,848 | ERROR | Blueprint Extender: 1 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
java.util.concurrent.TimeoutException: null
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]

2020-03-17T09:59:21,851 | WARN | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.cert_0.11.0.SNAPSHOT [192] timed out waiting for dependencies - restarting it
2020-03-17T09:59:21,854 | INFO | BlueprintContainerRestartService | BlueprintExtender | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT

2020-03-17T09:59:21,864 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
java.util.concurrent.TimeoutException: null
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]

2020-03-17T09:59:21,865 | WARN | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker | 204 - org.opendaylight.controller.blueprint - 0.12.0.SNAPSHOT | Blueprint container for bundle org.opendaylight.aaa.encrypt-service-impl_0.11.0.SNAPSHOT [194] timed out waiting for dependencies - restarting it
2020-03-17T09:59:21,869 | INFO | BlueprintContainerRestartService | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.cert/0.11.0.SNAPSHOT is waiting for dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker)), (objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (objectClass=org.opendaylight.aaa.encrypt.AAAEncryptionService)]
2020-03-17T09:59:21,870 | INFO | BlueprintContainerRestartService | BlueprintExtender | 82 - org.apache.aries.blueprint.core - 1.10.2 | Destroying container for blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT
2020-03-17T09:59:21,875 | INFO | BlueprintContainerRestartService | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Blueprint bundle org.opendaylight.aaa.encrypt-service-impl/0.11.0.SNAPSHOT is waiting for dependencies [(objectClass=org.opendaylight.mdsal.binding.api.DataBroker), (&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
2020-03-17T09:59:21,900 | INFO | awaitility[checkBundleDiagInfos] | KarafSystemReady | 239 - org.opendaylight.infrautils.ready-impl - 1.7.0.SNAPSHOT | checkBundleDiagInfos: Elapsed time 298s, remaining time 1s, diag: Booting {Installed=0, Resolved=7, Unknown=0, GracePeriod=17, Waiting=0, Starting=0, Active=356, Stopping=0, Failure=0}
2020-03-17T09:59:21,959 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.
2020-03-17T09:59:21,998 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-5 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.

2020-03-17T09:59:22,037 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl | 82 - org.apache.aries.blueprint.core - 1.10.2 | Unable to start container for blueprint bundle org.opendaylight.aaa.password-service-impl/0.11.0.SNAPSHOT due to unresolved dependencies [(&(|(type=default)(!(type=*)))(objectClass=org.opendaylight.mdsal.binding.api.DataBroker))]
java.util.concurrent.TimeoutException: null
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:393) [82:org.apache.aries.blueprint.core:1.10.2]
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:45) [82:org.apache.aries.blueprint.core:1.10.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]

The log continues to show these kinds of troubles until the job gives up, which is approx. 8.5 minutes in. |
| Comments |
| Comment by Jamo Luhrsen [ 18/Mar/20 ] |
|
I finally have a semi-reliable way to reproduce this. This patch essentially strips all ovsdb … I will start a job with a revert of our first suspect from AAA-180 and see what that looks like after repeating it. |
| Comment by Jamo Luhrsen [ 19/Mar/20 ] |
|
The problem is still there with the distro from the revert patch (which is Magnesium based), as well as in … |
| Comment by Venkatrangan Govindarajan [ 19/Mar/20 ] |
|
Akka ports are unreachable:

…,103 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-41 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.23:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.68:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.8:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.8:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.68:2550: Unreachable [Unreachable] (3)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.23:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.68:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.8:2550 Up seen=true] |
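If it helps to narrow this down, a quick sanity check for the symptom above is to probe the Akka remoting port (2550, per the logs) between members. A minimal sketch, assuming the member IPs from the log above and that a plain TCP connect is a fair first-pass check (it won't catch gossip/heartbeat problems, but it rules out basic network or firewall issues):

    import socket

    # Member addresses taken from the reachability log above; adjust for the run under test.
    MEMBERS = ["10.30.170.23", "10.30.170.68", "10.30.170.8"]
    AKKA_PORT = 2550  # Akka remoting port used by opendaylight-cluster-data

    def is_reachable(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for member in MEMBERS:
        state = "reachable" if is_reachable(member, AKKA_PORT) else "UNREACHABLE"
        print(f"{member}:{AKKA_PORT} {state}")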
| Comment by Jamo Luhrsen [ 19/Mar/20 ] |
|
Wondering if it's because of the broken controller doing this:

2020-03-17T09:54:30,760 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | EndpointReader | 47 - com.typesafe.akka.slf4j - 2.5.26 | Discarding inbound message to [Actor[akka://opendaylight-cluster-data/]] in read-only association to [akka.tcp://opendaylight-cluster-data@10.30.170.82:2550]. If this happens often you may consider using akka.remote.use-passive-connections=off or use Artery TCP.

I see what looks like all three controllers noticing the controller coming up and starting to sync with each other:

2020-03-17T09:54:28,670 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.78:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1614826351]] to [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550]
2020-03-17T09:54:28,671 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.59:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.78:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1614826351]] (version [2.5.26])

rovarga has a patch to disable "passive" connections, which is clearly suggested in that discard log above. I'll try it.

BTW, this issue was also seen in the original Sodium release, so clearly it's been around for a while now and we are … |
| Comment by Jamo Luhrsen [ 20/Mar/20 ] |
|
The patch to disable passive connections seems to resolve this. 400+ iterations without the failure now. Here are the cherry-picks to sodium and magnesium that I think we want to get in as well: https://git.opendaylight.org/gerrit/c/controller/+/88304
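For reference, the knob being flipped is the one the EndpointReader warning itself suggests. A minimal sketch of what that looks like in a member's akka.conf; the configuration/initial/akka.conf location and the odl-cluster-data wrapper are my assumptions from a stock ODL layout, so the gerrit change above is the authoritative version:

    # configuration/initial/akka.conf (assumed stock location; snippet is illustrative only)
    odl-cluster-data {
      akka {
        remote {
          # Disable read-only ("passive") inbound associations, per the
          # "Discarding inbound message ... in read-only association" warnings above.
          use-passive-connections = off
        }
      }
    }

Of the two options that warning offers, flipping this flag is the much smaller change; switching to Artery TCP would be a separate, larger migration.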
| Comment by Jamo Luhrsen [ 21/Mar/20 ] |
|
totally convinced now |
| Comment by Jamo Luhrsen [ 23/Mar/20 ] |
|
thanks for the fix. I love when we fix those random sporadic failures that end up annoying us at the worst times (like when releasing) |