[CONTROLLER-1848] Sporadic CSIT failure in RPC tests (501 when hitting /restconf/operations/basic-rpc-test:basic-global) Created: 03/Jul/18 Updated: 01/Aug/18 Resolved: 30/Jul/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Victor Pickard | Assignee: | Tom Pantelis |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
The suite seems to kill and restart the "brt" (not sure what brt means yet) owner, and at that point the response text is:
{"errors":{"error":[{"error-type":"application","error-tag":"operation-not-supported","error-message":"No implementation of RPC AbsoluteSchemaPath{path=[(urn:opendaylight:controller:basic-rpc-test?revision=2016-01-20)basic-global]} available"}]}}
|
| Comments |
| Comment by Tom Pantelis [ 03/Jul/18 ] |
|
That message is normal from akka - it's reporting a node is unreachable which I assume is part of the test. |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
as I commented in the kernel call, this jira was opened because of a failing test. I think vpickard was just trying to find some help with it. maybe it will be better to use more generic jira summaries like "CSIT test case XYZ intermittently failing" and put the details in the comments
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
Sure. You can reopen this and change summary etc or open a new one - doesn't really matter. I thought this was created b/c you saw that ominous message in the log and thought it was an issue. |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
here is the original bug description for posterity
|
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
thanks tpantelis, I just reopened this jira. the description had a link to the failing job, so we can use that to keep digging from here.
| Comment by Victor Pickard [ 06/Jul/18 ] |
|
Archived logs since this is a sandbox job
|
| Comment by Josh Hershberg [ 15/Jul/18 ] |
|
BRT stands for "Basic RPC Test" |
| Comment by Tom Pantelis [ 16/Jul/18 ] |
|
We need to understand what the test does and need the log files in order to do post-mortem analysis. |
| Comment by Victor Pickard [ 16/Jul/18 ] |
|
The archived logs are here: |
| Comment by Tom Pantelis [ 16/Jul/18 ] |
|
So the "old owner" was odl3. There are no errors in the odl3 karaf log after restart around 2018-07-03T06:50:20. In fact there's no log messages at all after 6:52:28 - the robot log.html indicates the basic-global RPC was sent around 6:54. We need to enable debug for org.opendaylight.controller.remote.rpc and try to reproduce. If possible, in the Restart_Killed_Member step, it would be helpful to save off the karaf.log and delete it so the node starts with a clean log. |
| Comment by Tom Pantelis [ 17/Jul/18 ] |
|
Looking at the Run_Rpc keyword, it issues the RPC once. When a node restarts, it takes a bit of time for it to learn about RPC registrations from other nodes via the gossip functionality. So I suspect the test issued the RPC before odl3 knew about the remote RPC implementation, hence the 501 "No implementation" error. So the test should retry via Wait Until Keyword Succeeds. |
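To make the suggested retry concrete, here is a rough Java sketch of the same idea - keep re-issuing the RPC until the restarted member has learned the remote registration via gossip, rather than failing on the first 501. This is only an illustration; the actual fix would be wrapping the Robot Run_Rpc step in Wait Until Keyword Succeeds. The host, port, credentials, timeout and retry interval below are assumptions, not values from the suite.

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicGlobalRpcRetry {
    public static void main(String[] args) throws Exception {
        // RESTCONF URL from the failure report; host/port and credentials are assumptions.
        URL url = new URL("http://127.0.0.1:8181/restconf/operations/basic-rpc-test:basic-global");
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        long deadline = System.currentTimeMillis() + 90_000; // overall timeout, like a WUKS timeout
        while (true) {
            int status = -1;
            try {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Authorization", "Basic " + auth);
                status = conn.getResponseCode();
                conn.disconnect();
            } catch (java.io.IOException e) {
                // The node may still be restarting; treat connection errors like a failed attempt.
            }

            if (status == 200 || status == 204) {
                System.out.println("RPC succeeded with status " + status);
                return;
            }
            // 501 ("No implementation of RPC ...") is expected for a short window after the
            // killed member restarts, until gossip propagates the remote registration.
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("RPC still failing, last status " + status);
            }
            Thread.sleep(5_000); // retry interval
        }
    }
}
{code}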
| Comment by Jamo Luhrsen [ 18/Jul/18 ] |
|
what's a reasonable time for this though? looking at the log vpickard posted in the comment above, the RPC isn't sent until roughly 1.5 minutes after the restart. we can certainly add a WUKS to that step, but it seems like it's already happening quite a long time later
| Comment by Tom Pantelis [ 18/Jul/18 ] |
|
I didn't notice that 1.5m had passed - that should be plenty of time. Does it clean the data dir before restarting? If so, then it's re-installing the world, so that can take time. It was just an idea... We need a failure run with org.opendaylight.controller.remote.rpc debug enabled. Maybe pull out that one suite and run it over and over in the sandbox or whatever? |
| Comment by Josh Hershberg [ 19/Jul/18 ] |
|
The few times I've seen this reproduced, it seems to be caused by 2 of the 3 nodes being down or otherwise unreachable. Explicit manual tests have produced the same outcome - when 2 of the 3 nodes are down (or in "deadlock"), RPC calls return 501. |
| Comment by Tom Pantelis [ 19/Jul/18 ] |
|
So the BasicRpcTestProvider implements the basic-global RPC. It is also a ClusterSingletonService - the RPC is registered when it acquires the singleton lock and is spun up. With 2 of the 3 nodes down and the prior lock owner being one of those 2, consensus is lost and the remaining node isn't able to become the lock owner. Therefore there is no BasicRpcTestProvider instance running and thus no basic-global RPC implementation. |
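To make that failure mode concrete, here is a minimal, hypothetical sketch of the pattern described above - not the actual BasicRpcTestProvider source. It assumes the MD-SAL ClusterSingletonService API and the binding RPC registration API of that era, and BasicRpcTestService stands in for the interface generated from the basic-rpc-test YANG model. The point is that the RPC implementation exists only while some node owns the singleton, so with no owner every call returns the 501 "No implementation" error.

{code:java}
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.opendaylight.controller.sal.binding.api.BindingAwareBroker.RpcRegistration;
import org.opendaylight.controller.sal.binding.api.RpcProviderRegistry;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;
import org.opendaylight.yangtools.yang.binding.RpcService;

public class BasicRpcSingletonSketch implements ClusterSingletonService {

    /** Placeholder for the interface generated from the basic-rpc-test YANG model (assumed name). */
    public interface BasicRpcTestService extends RpcService {
    }

    private static final ServiceGroupIdentifier IDENTIFIER =
            ServiceGroupIdentifier.create("basic-rpc-test");

    private final RpcProviderRegistry rpcProviderRegistry;
    private final BasicRpcTestService rpcImplementation;
    private RpcRegistration<BasicRpcTestService> rpcRegistration;

    public BasicRpcSingletonSketch(ClusterSingletonServiceProvider singletonProvider,
            RpcProviderRegistry rpcProviderRegistry, BasicRpcTestService rpcImplementation) {
        this.rpcProviderRegistry = rpcProviderRegistry;
        this.rpcImplementation = rpcImplementation;
        // Register as a cluster singleton - only the current owner gets instantiateServiceInstance().
        singletonProvider.registerClusterSingletonService(this);
    }

    @Override
    public void instantiateServiceInstance() {
        // Runs only on the node that currently holds the singleton "lock" (entity ownership).
        // This is the only place the basic-global RPC implementation gets registered.
        rpcRegistration = rpcProviderRegistry.addRpcImplementation(
                BasicRpcTestService.class, rpcImplementation);
    }

    @Override
    public ListenableFuture<Void> closeServiceInstance() {
        // Runs when ownership is lost (node killed/isolated). Once closed, no node provides the
        // RPC until another member wins ownership - which can't happen without cluster quorum.
        if (rpcRegistration != null) {
            rpcRegistration.close();
            rpcRegistration = null;
        }
        return Futures.immediateFuture(null);
    }

    @Override
    public ServiceGroupIdentifier getIdentifier() {
        return IDENTIFIER;
    }
}
{code}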
| Comment by Josh Hershberg [ 19/Jul/18 ] |
|
Thanks! |
| Comment by Jamo Luhrsen [ 30/Jul/18 ] |
|
jhershbe looked back through the history of Oxygen jobs and found that the failures there are already tracked by another jira.
marking this as a duplicate because of that. |
| Comment by Josh Hershberg [ 01/Aug/18 ] |
|
From what I can tell this works via EOS (the Entity Ownership Service). Not a question, just recording this for posterity. |
| Comment by Tom Pantelis [ 01/Aug/18 ] |
|
yes - EOS via ClusterSingletonService |