[CONTROLLER-1763] On restarting ODL on one node, ODL on another node dies in a clustered setup Created: 28/Aug/17  Updated: 15/Sep/17  Resolved: 15/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Sai Sindhur Malleni Assignee: Michael Vorburger
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File controller0-karaf.tar.gz     File controller1-karaf.tar.gz     File controller2-karaf.tar.gz    
Issue Links:
Blocks
is blocked by CONTROLLER-1755 RaftActor lastApplied index moves bac... Resolved
is blocked by CONTROLLER-1756 OOM due to huge Map in ShardDataTree Resolved
External issue ID: 9064
Priority: Highest

 Description   

Description of problem:
On running low stress longevity tests using Browbeat+Rally (creating 40 neutron resources 2 at a time and deleting them, over and over again), in a clustered ODL setup, ODL on controller-1 hits OOM after about 42 hours into the test. ODL on controller-2 is functional at that point but ODL on controller-0 seems to be running and ports are up but is non-functional (see BZ https://bugzilla.redhat.com/show_bug.cgi?id=1486060). When ODL on controller-0 is restarted to make it functional again at around 16:01 UTC 08/28/2017, ODL on controller-2 dies at around 16:04 UTC 08/28/2017. ODL on controller-1 which hit OOM is left alone.

Here we can see the karaf process count going to 0 on controller-2 around 16:04 UTC 08/28/2017: https://snapshot.raintank.io/dashboard/snapshot/chxdQkhAw3X8l9LS2HNNzCZGQHQvWubO?orgId=2

The heap is dumped before the process dies; however, it can be clearly seen that the 2G heap limit is not reached here: https://snapshot.raintank.io/dashboard/snapshot/RMuDksXZ61ql2kMA47wqBHUXQeYWG05g?orgId=2

Max heap used is around 1.4G

Setup:
3 ODLs
3 OpenStack Controllers
3 Compute nodes

ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

Test:
Create 40 neutron resources (routers, networks, etc.) 2 at a time using Rally and delete them, over and over again. This is a long-running, low-stress test.

Additional info:
ODL Controller-0 Logs:
http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz
ODL Controller-1 Logs:
http://8.43.86.1:8088/smalleni/karaf-controller-1.log.tar.gz
http://8.43.86.1:8088/smalleni/karaf-controller-1-rollover.log.tar.gz
ODL Controller-2 Logs:
http://8.43.86.1:8088/smalleni/karaf-controller-2.log.tar.gz



 Comments   
Comment by Michael Vorburger [ 29/Aug/17 ]

> ODL on controller-2 dies
> ODL Controller-2 Logs: karaf-controller-2.log.tar.gz

Interestingly, this log does NOT even have the "famous last words before suicide" re. "... shutting down JVM...", as in CONTROLLER-1761 ... so it would be interesting to first better understand what it actually died of. For example, if it were an "OutOfMemoryError: unable to create new native thread", that would not be in the regular logs but would probably have gone to STDOUT by the JVM, which means that on an RPM-installed ODL started by a systemd service it should appear e.g. in something like "systemctl status opendaylight" (or a better command to consult the full systemd journal of that service) - can you provide us with what that shows?
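
For example, something along these lines could be used to pull that out of the systemd journal (a sketch; the unit name "opendaylight" is assumed here and depends on the RPM packaging):

# recent journal entries for the ODL unit, including anything the JVM wrote to STDOUT/STDERR
journalctl -u opendaylight --no-pager -n 500

# or narrowed to the window around the death of controller-2 (times are the node's local time)
journalctl -u opendaylight --since "2017-08-28 15:50" --until "2017-08-28 16:10"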

Also, that log does have a ton of these:

akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-default-operational#236873956)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction

and then this:

186 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to null failed with java.net.ConnectException: Connection refused: /172.16.0.18:2550

meaning the cluster is in really bad shape here anyway...

Comment by Sai Sindhur Malleni [ 29/Aug/17 ]

Michael,

We had to redeploy the environment, so I no longer have access to it. And it doesn't seem to be OOM, since we have the Grafana chart showing that the heap didn't go anywhere close to 2G.

Comment by Sai Sindhur Malleni [ 29/Aug/17 ]

ODL RPM used for the sake of completeness: opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch

Comment by Michael Vorburger [ 06/Sep/17 ]

Wondering if CONTROLLER-1755 may have helped fix this - let's re-test and confirm whether it is still seen.

Comment by Michael Vorburger [ 13/Sep/17 ]

> Jamo Luhrsen 2017-09-12 17:45:28 UTC
> *** This bug has been marked as a duplicate of CONTROLLER-1756 ***

Jamo, I'm not 100% convinced that Sridhar and Stephen agree ...

Comment by Michael Vorburger [ 13/Sep/17 ]

Based on internal discussions, we would like to re-open this, remove the dupe link to the OOM bug, which is possibly causing more confusion (we're NOT sure this is linked to OOM), and let Sridhar re-confirm whether this is indeed solved now (by the OOM fix) or still an issue...

Comment by Michael Vorburger [ 13/Sep/17 ]

CONTROLLER-1751 seems related, or this may even be a dupe of 9006, except that Sridhar can now apparently reproduce it easily; it's not as "sporadic" as that one is reported to be (full details coming up from him)...

Comment by Sridhar Gaddam [ 13/Sep/17 ]

In a fresh ODL cluster setup with 3 controllers and 3 compute nodes, we observed that after bringing down two controllers, the third controller was not responding to any curl requests [#], though "systemctl status opendaylight" shows that it's running.

Steps used to reproduce the scenario:
1. In the stable cluster the leader was on controller-1
2. Stopped (i.e., systemctl stop opendaylight) ODL on controller-1 (i.e., cluster leader)
3. Observed that cluster leader moved to controller-2
4. Started (i.e., systemctl start opendaylight) ODL on controller-1. Observed that cluster leader remained on controller-2
5. Stopped/started ODL on controller-1 a couple of times. Did not see any issue; things [#] looked normal.
6. Now stopped ODL on controller-1 and controller-2 (which was the leader).
7. I could see that ODL running on controller-0 is no longer responding to any curl requests [#], even though systemctl shows that it's running (see also the shard-status check sketched below, after the footnotes).

PFA [*] the logs from controller-0, 1 and 2.
Step 6 (i.e., ODL on controller-2 was stopped) happened at "2017-09-13 12:13:03".

[#] curl -s -u "admin:admin" -X GET http://$ODL_IP:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 | python -mjson.tool
[*] controller0-karaf.log, controller1-karaf.log and controller2-karaf.log
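
As an additional check beyond the restconf curl above, the state of the default operational shard on the surviving node can be read over Jolokia - a sketch, assuming Jolokia is reachable on the same port and with the same credentials as restconf in this setup (it may be on a different port, e.g. 8181, in other deployments) and that the local member on controller-0 is member-0; the RaftState and Leader attributes in the response are the interesting ones:

# read the MBean of the default operational shard on this node (adjust member name per node)
curl -s -u "admin:admin" "http://$ODL_IP:8081/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=Shards,name=member-0-shard-default-operational" | python -mjson.tool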

PS: We could see the following exceptions in Karaf.log
307:2017-09-13 10:35:34,770 | WARN | saction-28-30'}} | DeadlockMonitor | 134 - org.opendaylight.controller.config-manager - 0.6.2.SNAPSHOT | ModuleIdentifier{factoryName='binding-broker-impl', instanceName='binding-broker-impl'} did not finish after 9964 ms

akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-0-shard-default-operational#1759700434)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[185:com.typesafe.akka.actor:2.4.18]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[185:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[185:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[185:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[185:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[185:com.typesafe.akka.actor:2.4.18]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]

2017-09-13 12:13:29,166 | WARN | lt-dispatcher-67 | NettyTransport | 186 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to null failed with java.net.ConnectException: Connection refused: /172.16.0.21:2550

Comment by Tom Pantelis [ 13/Sep/17 ]

With 2 of the 3 nodes down, there's no longer a majority (the Raft-based datastore needs at least 2 of the 3 voting members to elect a shard leader and commit), so datastore requests will fail.
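
For completeness, the cluster membership as seen by the surviving node can be checked via the Akka Cluster MBean over Jolokia (a sketch, under the same port/credential assumptions as the shard check above); with the other two members stopped, they should show up as unreachable in the output:

# dump the Akka cluster view (members, unreachable members, leader) of this node
curl -s -u "admin:admin" "http://$ODL_IP:8081/jolokia/read/akka:type=Cluster" | python -mjson.tool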

Comment by Sridhar Gaddam [ 13/Sep/17 ]

Attachment controller0-karaf.tar.gz has been added with description: Karaf.log on Controller0 node

Comment by Sridhar Gaddam [ 13/Sep/17 ]

Attachment controller1-karaf.tar.gz has been added with description: Karaf.log on Controller1 node

Comment by Sridhar Gaddam [ 13/Sep/17 ]

Attachment controller2-karaf.tar.gz has been added with description: Karaf.log on Controller2 node

Comment by Tom Pantelis [ 13/Sep/17 ]

2017-09-13 12:13:48,723 | ERROR | lt-dispatcher-57 | LocalThreePhaseCommitCohort | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Failed to prepare transaction member-0-datastore-operational-fe-0-chn-8-txn-0-0 on backend
ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/8f8c62e7-b501-46ef-b462-c430bedaa2a2}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/8f8c62e7-b501-46ef-b462-c430bedaa2a2}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Could not commit transaction member-0-datastore-operational-fe-0-chn-8-txn-0-0. Shard member-0-shard-default-operational currently has no leader. Try again later.]]}

That's expected with 2 nodes down. Datastore requests will retry for a period of time and then time out.

Comment by Tom Pantelis [ 14/Sep/17 ]

Based on my last comment, is there still an issue here that needs to be addressed or can we close it?

Comment by Michael Vorburger [ 14/Sep/17 ]

This issue now seems to have conflated 2 different things, as far as I understand:

Sridhar's #c9 is a non-issue according to Tom's explanation in #c10 & #c14.

Sai's original #c0 is about a mysterious death of a cluster node. That seems curious and would be interesting to understand better, but it is not actionable without the further information I requested in #c1.

To unblock the Nitrogen release, and to reduce general confusion, I'm therefore temporarily closing this as RESOLVED INVALID. I intend to re-open this issue if and when new information (requested in #c1) re. the mysterious death of a node becomes available.

Comment by Sai Sindhur Malleni [ 15/Sep/17 ]

Agree with Michael in Comment #16, these are two separate issues. I haven't been able to reproduce the scenario of mysterious death of ODL on one node when ODL on another node is restarted (original issue addressed by this bug). I am fine with resolving the bug as invalid for now and will hopefully get more actionable data in the next round of scale testing.

Comment by Sridhar Gaddam [ 15/Sep/17 ]

(In reply to Sai Sindhur Malleni from comment #17)
> Agree with Michael in Comment #16, these are two separate issues. I haven't
> been able to reproduce the scenario of mysterious death of ODL on one node
> when ODL on another node is restarted (original issue addressed by this
> bug). I am fine with resolving the bug as invalid for now and will hopefully
> get more actionable data in the next round of scale testing.

I'm fine with closing this bug as Invalid; we can revisit it later (if it is reproduced again).
