[CONTROLLER-1763] On restarting ODL on one node, ODL on another node dies in a clustered setup Created: 28/Aug/17 Updated: 15/Sep/17 Resolved: 15/Sep/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Sai Sindhur Malleni | Assignee: | Michael Vorburger |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Attachments: | controller0-karaf.tar.gz, controller1-karaf.tar.gz, controller2-karaf.tar.gz |
| Issue Links: | |
| External issue ID: | 9064 |
| Priority: | Highest |
| Description |
|
Description of problem:
Here we can see the karaf process count going to 0 on controller-2 around 16:04 UTC 08/28/2017: https://snapshot.raintank.io/dashboard/snapshot/chxdQkhAw3X8l9LS2HNNzCZGQHQvWubO?orgId=2
The heap is dumped before the process dies; however, it can be clearly seen that the 2G heap limit is not reached: https://snapshot.raintank.io/dashboard/snapshot/RMuDksXZ61ql2kMA47wqBHUXQeYWG05g?orgId=2
Max heap used is around 1.4G.

Setup:
ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

Test:

Additional info: |
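A minimal triage sketch for the next occurrence, assuming the RPM install described above (the log path and systemd unit name are assumptions and may differ on your deployment): check karaf.log for a Java-level out-of-memory or a deliberate shutdown, and the system logs for an external kill such as the kernel OOM killer.

# Assumed log location for the opendaylight RPM; adjust if your layout differs.
KARAF_LOG=/opt/opendaylight/data/log/karaf.log
# 1. Did the JVM itself report an out-of-memory condition or a deliberate shutdown?
grep -iE "OutOfMemoryError|shutting down JVM" "$KARAF_LOG"
# 2. Did the kernel OOM killer terminate the process instead?
journalctl -k | grep -iE "out of memory|killed process"
# 3. Was the service stopped or restarted by systemd?
journalctl -u opendaylight | tail -n 50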
| Comments |
| Comment by Michael Vorburger [ 29/Aug/17 ] |
|
> ODL on controller-2 dies

Interestingly, this log does NOT even have the "famous last words before suicide" re. "... shutting down JVM...", as in

Also, that log does however have a ton of these:

akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-default-operational#236873956)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction"

and then this:

186 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to null failed with java.net.ConnectException: Connection refused: /172.16.0.18:2550

meaning the cluster is in real bad shape here anyway... |
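The "Connection refused" on port 2550 (the default Akka remoting port in ODL's akka.conf) means the peer's clustering transport was not listening at all, which is consistent with the remote karaf process already being gone. A quick hedged check, reusing the IP from the log above:

# From a surviving controller: is the peer's Akka remoting port reachable?
nc -zv 172.16.0.18 2550
# On the peer itself: is anything still listening on the clustering port?
ss -tlnp | grep 2550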
| Comment by Sai Sindhur Malleni [ 29/Aug/17 ] |
|
Michael, we had to redeploy the environment, so I no longer have access to it. It also doesn't seem to be OOM, since the Grafana chart shows that heap usage didn't go anywhere close to 2G. |
| Comment by Sai Sindhur Malleni [ 29/Aug/17 ] |
|
ODL RPM used for the sake of completeness: opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch |
| Comment by Michael Vorburger [ 06/Sep/17 ] |
|
Wondering if |
| Comment by Michael Vorburger [ 13/Sep/17 ] |
|
> Jamo Luhrsen 2017-09-12 17:45:28 UTC

Jamo, I'm not 100% convinced that Sridhar and Stephen agree ... |
| Comment by Michael Vorburger [ 13/Sep/17 ] |
|
Based on internal discussions, we would like to re-open this, remove the dupe to OOM (which is possibly causing more confusion; we're NOT sure it is linked to OOM), and let Sridhar re-confirm whether this is indeed solved now (by OOM) or still an issue. |
| Comment by Michael Vorburger [ 13/Sep/17 ] |
|
|
| Comment by Sridhar Gaddam [ 13/Sep/17 ] |
|
In a fresh ODL cluster setup with 3 controllers and 3 compute nodes, we observed that after bringing down two controllers, the third controller was not responding to any curl requests[#], even though "systemctl status opendaylight" shows that it is running.

Steps used to reproduce the scenario:

PFA [*] the logs from controller-0, 1 and 2.

[#] curl -s -u "admin:admin" -X GET http://$ODL_IP:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 | python -mjson.tool

PS: We could see the following exceptions in karaf.log:

did not finish after 9964 ms

akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-0-shard-default-operational#1759700434)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".

2017-09-13 12:13:29,166 | WARN | lt-dispatcher-67 | NettyTransport | 186 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to null failed with java.net.ConnectException: Connection refused: /172.16.0.21:2550 |
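When the surviving controller stops answering restconf, the shard state can also be checked directly through the Jolokia shard-manager MBean that ODL exposes for clustering diagnostics. This is a hedged sketch: the port (8081 here, matching the restconf example above; 8181 on a default install) and the exact MBean name may vary per deployment and release.

# List the local operational shards and their status (assumed port and MBean name).
curl -s -u "admin:admin" "http://$ODL_IP:8081/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore" | python -mjson.tool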
| Comment by Tom Pantelis [ 13/Sep/17 ] |
|
With 2 nodes down, there's no longer a majority, so datastore requests will fail. |
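For reference, the majority arithmetic behind this (assuming the Raft-style majority voting used by the clustered datastore):

# Majority needed for an N-member cluster is floor(N/2) + 1.
N=3
echo "majority: $(( N / 2 + 1 ))"   # prints 2; with 2 of 3 members down, the single surviving member cannot elect a shard leader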
| Comment by Sridhar Gaddam [ 13/Sep/17 ] |
|
Attachment controller0-karaf.tar.gz has been added with description: Karaf.log on Controller0 node |
| Comment by Sridhar Gaddam [ 13/Sep/17 ] |
|
Attachment controller1-karaf.tar.gz has been added with description: Karaf.log on Controller1 node |
| Comment by Sridhar Gaddam [ 13/Sep/17 ] |
|
Attachment controller2-karaf.tar.gz has been added with description: Karaf.log on Controller2 node |
| Comment by Tom Pantelis [ 13/Sep/17 ] |
|
2017-09-13 12:13:48,723 | ERROR | lt-dispatcher-57 | LocalThreePhaseCommitCohort | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Failed to prepare transaction member-0-datastore-operational-fe-0-chn-8-txn-0-0 on backend ]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/8f8c62e7-b501-46ef-b462-c430bedaa2a2}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/8f8c62e7-b501-46ef-b462-c430bedaa2a2}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Could not commit transaction member-0-datastore-operational-fe-0-chn-8-txn-0-0. Shard member-0-shard-default-operational currently has no leader. Try again later.]]}

That's expected with 2 nodes down. Datastore requests will retry for a period of time and then time out. |
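To confirm the "no leader" condition directly, the per-shard MBean can be read the same way. The shard and member names below are taken from the error above; the port and MBean naming are assumptions that may vary by deployment and release.

# Check the Raft state of member-0's default operational shard (assumed names).
curl -s -u "admin:admin" "http://$ODL_IP:8081/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-0-shard-default-operational,type=DistributedOperationalDatastore" | python -mjson.tool
# Look for the RaftState and Leader attributes; with no quorum, no leader is reported.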
| Comment by Tom Pantelis [ 14/Sep/17 ] |
|
Based on my last comment, is there still an issue here that needs to be addressed or can we close it? |
| Comment by Michael Vorburger [ 14/Sep/17 ] |
|
This issue now seems to have conflated 2 different things, as far as I understand:

Sridhar's #c9 is a non-issue, according to Tom's explanation in #c10 & #c14.

Sai's original #c0 is about a mysterious death of a cluster node. That seems curious and would be interesting to understand better, but it is not actionable without the further information requested by me in #c1.

To unblock the Nitrogen release, and to reduce general confusion, I'm therefore temporarily closing this as RESOLVED INVALID. I intend to re-open this issue if and when new information (requested in #c1) regarding the mysterious death of a node becomes available. |
| Comment by Sai Sindhur Malleni [ 15/Sep/17 ] |
|
I agree with Michael in Comment #16: these are two separate issues. I haven't been able to reproduce the scenario of the mysterious death of ODL on one node when ODL on another node is restarted (the original issue addressed by this bug). I am fine with resolving the bug as invalid for now and will hopefully get more actionable data in the next round of scale testing. |
| Comment by Sridhar Gaddam [ 15/Sep/17 ] |
|
(In reply to Sai Sindhur Malleni from comment #17)

I'm fine with closing this bug as Invalid, and we can revisit it later (if it is reproduced again). |