[CONTROLLER-1906] OFP RPC does not work from all instances in the cluster Created: 18/Jul/19 Updated: 14/Aug/19 Resolved: 14/Aug/19 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Sodium |
| Fix Version/s: | Sodium |
| Type: | Bug | Priority: | Highest |
| Reporter: | Luis Gomez | Assignee: | Emmett Cox |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
Regression was detected here: To reproduce, just connect an OVS switch to 3 controllers and fire an RPC like this from all instances (a curl sketch follows the error output below):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<input xmlns="urn:opendaylight:flow:service">
    <node xmlns:inv="urn:opendaylight:inventory">/inv:nodes/inv:node[inv:id="openflow:1"]</node>
    <table_id>0</table_id>
    <priority>2</priority>
    <match>
        <ethernet-match>
            <ethernet-type>
                <type>2048</type>
            </ethernet-type>
        </ethernet-match>
        <ipv4-destination>10.0.1.0/24</ipv4-destination>
    </match>
    <instructions>
        <instruction>
            <order>0</order>
            <apply-actions>
                <action>
                    <output-action>
                        <output-node-connector>1</output-node-connector>
                    </output-action>
                    <order>0</order>
                </action>
            </apply-actions>
        </instruction>
    </instructions>
</input>

At least 1 instance will complain with this message:

<errors xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf">
    <error>
        <error-type>application</error-type>
        <error-tag>operation-failed</error-tag>
        <error-message>The operation encountered an unexpected error while executing.</error-message>
        <error-info>Ask timed out on [Actor[akka.tcp://opendaylight-cluster-data@10.18.130.162:2550/user/rpc/broker#-516941188]] after [15000 ms]. Message of type [org.opendaylight.controller.remote.rpc.messages.ExecuteRpc]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.</error-info>
    </error>
</errors>
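For illustration only, a minimal sketch of firing this add-flow RPC against every instance over the bierman02 RESTCONF endpoint. The controller addresses 10.0.0.1-3 are hypothetical, port 8181 and the admin:admin credentials are the usual defaults, and the payload above is assumed to be saved as add-flow.xml:

for ctrl in 10.0.0.1 10.0.0.2 10.0.0.3; do
  # POST the add-flow input to each cluster member in turn
  curl -u admin:admin -H "Content-Type: application/xml" -X POST \
       -d @add-flow.xml "http://${ctrl}:8181/restconf/operations/sal-flow:add-flow"
done

On a healthy cluster all three requests should succeed; with this regression at least one of them times out with the AskTimeoutException shown above. |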
| Comments |
| Comment by Robert Varga [ 18/Jul/19 ] |
|
Candidate in this area is https://git.opendaylight.org/gerrit/#/c/82283/ – EmmettCox, can you take a look? |
| Comment by Emmett Cox [ 24/Jul/19 ] |
|
Taking a look now. Trying to recreate the bug first and setting up the environment; I might be missing a few things, so if anyone could point out mistakes in my flow/logic it would be useful:
- spun up 3 VMs, each running an OpenDaylight controller, clustered together
- installed the feature odl-openflowplugin-flow-services-rest on each ODL
- created the bridge and pointed it at the controllers:
  sudo ovs-vsctl add-br br-test
  sudo ovs-vsctl set-controller br-test < controller ips ports etc >
- ran the add-flow RPC above through Postman and got an error back saying "No implementation of RPC AbsoluteSchemaPath <SchemaPath>add-flow available"

I think I am either missing a feature install or configuring the bridge wrong... any ideas? (A fuller sketch of the bridge setup follows below.)
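For illustration only, a minimal sketch of the multi-controller bridge setup being described, assuming hypothetical controller addresses 10.0.0.1-3 and the default OpenFlow port 6653 (the actual addresses used above are not recorded):

# create the test bridge and point it at all three cluster members
sudo ovs-vsctl add-br br-test
sudo ovs-vsctl set-controller br-test tcp:10.0.0.1:6653 tcp:10.0.0.2:6653 tcp:10.0.0.3:6653
# confirm the controller targets and connection state
sudo ovs-vsctl show

Once the switch connects it should appear as an openflow:* node in each controller's operational inventory; a "No implementation of RPC ... add-flow available" error typically means the flow-services feature is not installed, or the switch is not actually connected, on the instance receiving the request. |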
| Comment by Luis Gomez [ 30/Jul/19 ] |
|
Just to confirm, the regression was introduced by the patch reverted here: https://git.opendaylight.org/gerrit/c/controller/+/83212

Any ETA for the fix? I believe OFP patch verification is impacted by this problem. |
| Comment by Robert Varga [ 31/Jul/19 ] |
|
Got a pointer to the verify job failing? Aside from that, I do not currently have the cycles over the next two weeks to test this out locally, but if I get some data at logs.opendaylight.org, I may be able to look into the cause. Can we get a CSIT run with sal-remoterpc-connector debugs stored there?
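For illustration only, a minimal sketch of enabling those debugs from the Karaf console, assuming the sal-remoterpc-connector code lives under the org.opendaylight.controller.remote.rpc package (which the ExecuteRpc class name in the error above suggests):

log:set DEBUG org.opendaylight.controller.remote.rpc
log:tail

The same level can also be persisted in etc/org.ops4j.pax.logging.cfg so it survives a restart. |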
| Comment by Luis Gomez [ 31/Jul/19 ] |
|
The failing job in the ticket description is a good example of a job used to verify patches. I can run the test with the debug on. |
| Comment by Luis Gomez [ 31/Jul/19 ] |
|
This run has the DEBUG enabled, but I do not see anything when the test fails: |
| Comment by Emmett Cox [ 31/Jul/19 ] |
|
Got some logs from running the CSIT tests in my local environment: [^karaf_debug.log]

Will point out that I had to run the CSIT tests twice before it failed; it succeeded perfectly the first time. Also, looking through the previous runs on Jenkins, the number of failed tests can range from 2-6, with only two tests failing consistently every time. That makes it annoying to debug, and implies the problem is an intermittent one... |
| Comment by Emmett Cox [ 31/Jul/19 ] |
|
The specific error is listed below:

2019-07-31T09:42:10,746 | TRACE | qtp533277569-75 | BrokerFacade | 298 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.10.0.SNAPSHOT | Read OPERATIONAL via Restconf: /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology
|
| Comment by Emmett Cox [ 01/Aug/19 ] |
|
I wonder about the warning regarding the Connection refused for one of the nodes... I've noticed the same warning appear for some of the other tests that succeed, but those tests take a few seconds longer to execute... Could it simply be Akka timing out when it would work if given a few extra seconds? Not that it should take so long, but... |
| Comment by Emmett Cox [ 06/Aug/19 ] |
|
Discovered that I was missing the debug option for the remote RPC logs, so there's a little bit more being logged now... Going to include the logs from all 3 nodes; give me a minute to add them...
|
| Comment by Emmett Cox [ 08/Aug/19 ] |
|
Discovered that the root of the issue is the OpsRegistrar changes made as part of my commit. Part of the changes removed the functionality that removed and closed old RPC registrations, which caused the RPCs not to update correctly and to fail when a node was shut down. I'm in the midst of making some code changes to fix the bug. |
| Comment by Luis Gomez [ 12/Aug/19 ] |
|
Hi Emmett, FYI, I started a verification on this patch, considering this is the candidate fix: |
| Comment by Luis Gomez [ 13/Aug/19 ] |
|
I think the issue is fixed now: https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-sodium/ |