[CONTROLLER-1848] Sporadic CSIT failure in RPC tests (501 when hitting /restconf/operations/basic-rpc-test:basic-global) Created: 03/Jul/18  Updated: 01/Aug/18  Resolved: 30/Jul/18

Status: Resolved
Project: controller
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Victor Pickard Assignee: Tom Pantelis
Resolution: Duplicate Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to NETVIRT-1315 Troubleshooting Controller CSIT In Progress

 Description   

The suite seems to kill and restart the "brt" (not sure what brt means yet) owner, and at that point there is
some trouble (a 501 response) when trying to hit the "/restconf/operations/basic-rpc-test:basic-global"
endpoint.

The response text is:

 	{"errors":{"error":[{"error-type":"application","error-tag":"operation-not-supported","error-message":"No implementation of RPC AbsoluteSchemaPath{path=[(urn:opendaylight:controller:basic-rpc-test?revision=2016-01-20)basic-global]} available"}]}}


 Comments   
Comment by Tom Pantelis [ 03/Jul/18 ]

That message is normal from akka - it's reporting that a node is unreachable, which I assume is part of the test.

Comment by Jamo Luhrsen [ 05/Jul/18 ]

As I commented in the kernel call, this jira was opened because of a failing test. I think vpickard was just trying to find a
more specific reason for the failure (to be helpful) and not necessarily calling out that error message.

Maybe it would be better to use more generic jira summaries like "CSIT test case XYZ intermittently failing", with comments
showing certain suspect logs like "Leader can not perform its duties", so we can keep better track of our issues? Just an idea,
but I'm not sure whether this one getting closed resulted in a new one being opened, and I doubt the intermittent test failure
has been addressed yet.

Comment by Tom Pantelis [ 05/Jul/18 ]

Sure. You can reopen this and change the summary, etc., or open a new one - it doesn't really matter. I thought this was created because you saw that ominous message in the log and thought it was an issue.

Comment by Jamo Luhrsen [ 05/Jul/18 ]

Here is the original bug description, for posterity:

Controller clustering CSIT is failing intermittently. I see this in the karaf logs 7 times for this particular run [1]:

2018-07-03T06:49:24,867 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-33 | Cluster(akka://opendaylight-cluster-data) | 51 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.209:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.209:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.221:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.220:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.221:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.209:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.220:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.221:2550 Up seen=false]

[1] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-vpickard-all-oxygen/13/

Comment by Jamo Luhrsen [ 05/Jul/18 ]

Thanks tpantelis, I just reopened this jira. The description had a link to the failing job, so I used that to
add a few details.

We need to keep digging from here.

Comment by Victor Pickard [ 06/Jul/18 ]

Archived logs, since this is a sandbox job:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/163/controller-csit-3node-clustering-vpickard-all-oxygen/13/

Comment by Josh Hershberg [ 15/Jul/18 ]

BRT stands for "Basic RPC Test"

Comment by Tom Pantelis [ 16/Jul/18 ]

We need to understand what the test does, and we need the log files in order to do post-mortem analysis.

Comment by Victor Pickard [ 16/Jul/18 ]

The archived logs are here:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/163/controller-csit-3node-clustering-vpickard-all-oxygen/13/

Comment by Tom Pantelis [ 16/Jul/18 ]

So the "old owner" was odl3. There are no errors in the odl3 karaf log after restart around 2018-07-03T06:50:20. In fact there's no log messages at all after 6:52:28 - the robot log.html indicates the basic-global RPC was sent around 6:54. We need to enable debug for org.opendaylight.controller.remote.rpc and try to reproduce.

If possible, in the Restart_Killed_Member step, it would be helpful to save off the karaf.log and delete it so the node starts with a clean log.

Comment by Tom Pantelis [ 17/Jul/18 ]

Looking at the Run_Rpc keyword, it issues the RPC once. When a node restarts, it takes a bit of time for it to learn about RPC registrations from other nodes via the gossip functionality. So I suspect the test issued the RPC before odl3 knew about the remote RPC implementation, hence the 501 "No implementation" error. So the test should retry via Wait Until Keyword Succeeds.
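
The retry Tom suggests lives in the Robot suite (Wait Until Keyword Succeeds wrapped around the Run_Rpc keyword). As a language-neutral illustration of the same idea, here is a small polling helper in Java; the 60s timeout, 5s interval and the trivial check are assumptions for illustration only. The point is that a single failed attempt right after restart is expected while the gossip-based RPC registry catches up, and only a failure that persists past the timeout is a real failure.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public final class Retry {

    // Poll "check" until it returns true or "timeout" elapses - roughly what
    // Robot's "Wait Until Keyword Succeeds    60s    5s    Run_Rpc" would do.
    public static boolean retryUntilTrue(Supplier<Boolean> check,
                                         Duration timeout, Duration interval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (Boolean.TRUE.equals(check.get())) {
                return true;   // e.g. the basic-global POST finally returned 200
            }
            Thread.sleep(interval.toMillis());
        }
        return false;          // still failing after the timeout => genuine failure
    }

    public static void main(String[] args) throws InterruptedException {
        // Illustrative check only; in the test this would be "POST basic-global
        // and verify the status code is not 501" (see the sketch in the description).
        boolean ok = retryUntilTrue(() -> true,
                Duration.ofSeconds(60), Duration.ofSeconds(5));
        System.out.println("RPC eventually succeeded: " + ok);
    }
}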

Comment by Jamo Luhrsen [ 18/Jul/18 ]

What's a reasonable time for this, though? Looking at the log vpickard posted in the comment above, the
killed member is restarted, it polls for ~2 minutes until syncstatus is true, and then it's another 1.5m before
the RPC is issued. So, 3.5m from the time of restart, but probably more importantly, 1.5m after the cluster
sync was reported as ok.

We can certainly add a WUKS to that step, but the RPC is already being issued quite a long time after restart -
long enough that a cluster member shouldn't still be unaware of the RPC, right?

Comment by Tom Pantelis [ 18/Jul/18 ]

I didn't notice that 1.5m had passed. That should be plenty of time. Does it clean the data dir before restarting? If so, then it's re-installing the world, so that can take time. It was just an idea...

We need a failure run with org.opendaylight.controller.remote.rpc debug enabled. Maybe pull out that one suite and run it over and over in the sandbox, or whatever?

Comment by Josh Hershberg [ 19/Jul/18 ]

The few times I've seen this reproduced, it seems to be caused by 2/3 of the nodes being down or otherwise unreachable. Manual explicit tests have produced the same outcome - when 2/3 of the nodes are down (or in "deadlock"), RPC calls return 501.

Comment by Tom Pantelis [ 19/Jul/18 ]

So BasicRpcTestProvider implements the basic-global RPC. It is also a ClusterSingletonService - the RPC is registered when it gets the singleton lock and is spun up. Therefore, with 2/3 of the nodes down and the prior lock owner being one of those two, consensus is lost and the remaining node isn't able to become the lock owner. As a result, there is no BasicRpcTestProvider instance running and no basic-global RPC implementation.
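
A simplified sketch of that pattern follows; the interfaces and names below are illustrative stand-ins for MD-SAL's ClusterSingletonService and the RPC registry, not the real OpenDaylight signatures. The RPC implementation exists only while some node holds the singleton lock; if no node can take ownership (for example, 2/3 of the members are down and consensus is lost), nothing is registered anywhere and RESTCONF answers 501.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-in for the RPC registry: maps RPC name -> implementation.
class RpcRegistry {
    private final ConcurrentMap<String, Runnable> impls = new ConcurrentHashMap<>();

    void register(String rpcName, Runnable impl) { impls.put(rpcName, impl); }
    void unregister(String rpcName)              { impls.remove(rpcName); }

    // Returns false (the RESTCONF 501 case) when nothing is registered.
    boolean invoke(String rpcName) {
        Runnable impl = impls.get(rpcName);
        if (impl == null) {
            return false;    // "No implementation of RPC ... available"
        }
        impl.run();
        return true;
    }
}

// Stand-in for a singleton-scoped provider such as BasicRpcTestProvider.
class SingletonRpcProvider {
    private final RpcRegistry registry;

    SingletonRpcProvider(RpcRegistry registry) { this.registry = registry; }

    // Called only on the node that wins the singleton (EOS) lock.
    void instantiateServiceInstance() {
        registry.register("basic-rpc-test:basic-global",
                () -> System.out.println("basic-global handled"));
    }

    // Called when ownership is lost; the registration goes away with it.
    void closeServiceInstance() {
        registry.unregister("basic-rpc-test:basic-global");
    }
}

public class SingletonRpcDemo {
    public static void main(String[] args) {
        RpcRegistry registry = new RpcRegistry();
        SingletonRpcProvider provider = new SingletonRpcProvider(registry);

        // No singleton owner (e.g. consensus lost): invocation fails -> 501.
        System.out.println("before ownership: " + registry.invoke("basic-rpc-test:basic-global"));

        provider.instantiateServiceInstance();   // this node became the owner
        System.out.println("after ownership:  " + registry.invoke("basic-rpc-test:basic-global"));
    }
}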

Comment by Josh Hershberg [ 19/Jul/18 ]

Thanks!

Comment by Jamo Luhrsen [ 30/Jul/18 ]

jhershbe looked back through the history of Oxygen jobs and found that the failures
seen are coming from two existing bugs:

CONTROLLER-1849 and CONTROLLER-1742

Marking this as a duplicate because of that.

Comment by Josh Hershberg [ 01/Aug/18 ]

From what I can tell this works via EOS. Not a question, just recording this for posterity.

Comment by Tom Pantelis [ 01/Aug/18 ]

Yes - EOS via ClusterSingletonService.
