[CONTROLLER-1906] OFP RPC does not work from all instances in the cluster Created: 18/Jul/19  Updated: 14/Aug/19  Resolved: 14/Aug/19

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Sodium
Fix Version/s: Sodium

Type: Bug Priority: Highest
Reporter: Luis Gomez Assignee: Emmett Cox
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File karaf.log     File karaf_original_owner.log     File karaf_successor.log    

 Description   

Regression was detected here:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-sodium/

To reproduce just connect OVS switch to 3 controllers and file an RPC like this from all instances:
POST http://controller:8181/restconf/operations/sal-flow:add-flow

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<input xmlns="urn:opendaylight:flow:service">
    <node xmlns:inv="urn:opendaylight:inventory">/inv:nodes/inv:node[inv:id="openflow:1"]</node>
    <table_id>0</table_id>
    <priority>2</priority>
    <match>
        <ethernet-match>
            <ethernet-type>
                <type>2048</type>
            </ethernet-type>
        </ethernet-match>
        <ipv4-destination>10.0.1.0/24</ipv4-destination>
    </match>
    <instructions>
        <instruction>
            <order>0</order>
            <apply-actions>
                <action>
                    <output-action>
                        <output-node-connector>1</output-node-connector>
                    </output-action>
                    <order>0</order>
                </action>
            </apply-actions>
        </instruction>
    </instructions>
</input>

At least 1 instance will complain with this message:

<errors xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf">
    <error>
        <error-type>application</error-type>
        <error-tag>operation-failed</error-tag>
        <error-message>The operation encountered an unexpected error while executing.</error-message>
        <error-info>Ask timed out on [Actor[akka.tcp://opendaylight-cluster-data@10.18.130.162:2550/user/rpc/broker#-516941188]] after [15000 ms]. Message of type [org.opendaylight.controller.remote.rpc.messages.ExecuteRpc]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.</error-info>
    </error>
</errors>


 Comments   
Comment by Robert Varga [ 18/Jul/19 ]

Candidate in this area is https://git.opendaylight.org/gerrit/#/c/82283/EmmettCox can you take a look?

Comment by Emmett Cox [ 24/Jul/19 ]

Taking a look now - trying to recreate bug first, setting up environment, might be missing a few things - if anyone could point out some mistakes in my flow/logic, would be useful:

-spun up 3 vms, each running an opendaylight controller and clustered together

-installed feature odl-openflowplugin-flow-services-rest on each odl

  • created a test-bridge on one of the vm's and added the 3 controllers to bridge:

    sudo ovs-vsctl dd-br br-test

    sudo ovs-vsctl set-controller br-test < controller ips ports etc >

-ran above command through postman explorer, got an error back saying  No implementation of RPC AbsolouteSchemaPath <SchemaPath>add-flow available

 

Think either I might be missing a feature install or might be configuring the bridge wrong... any ideas?

Comment by Luis Gomez [ 30/Jul/19 ]

Just to confirm the regression was introduced by the patch reverted here:

https://git.opendaylight.org/gerrit/c/controller/+/83212

Any ETA for the fix? I believe OFP patch verification is impacted by this problem.

Comment by Robert Varga [ 31/Jul/19 ]

Got a pointer to the verify job failing?

Aside from that, I do not currently have the cycles over the next two weeks to test this out locally, but if I get some data at logs.opendaylight.org, I may be able to look into the cause. Can we get an CSIT run with sal-remoterpc-connector debugs store there?

Comment by Luis Gomez [ 31/Jul/19 ]

The failing job in the ticket description is a good example if a job used to verify patches:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-sodium/

I can run the test with the debug on.

Comment by Luis Gomez [ 31/Jul/19 ]

This run has the DEBUG but I do not see anything when the test fails:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-sodium/104/

Comment by Emmett Cox [ 31/Jul/19 ]

got some logs from running csit tests in my local environment:

[^karaf_debug.log]

 

WIll point out that i had to run the csit test tests twice before it failed - it succeeded perfectly the first time. Also from looking through the previous runs on jenkins the number of fail tests can range from 2-6, with only two tests failing consistently every time - makes it annoying to debug, and implies the problem is an inconsistent one...

Comment by Emmett Cox [ 31/Jul/19 ]

The specific error is listed below:

2019-07-31T09:42:10,746 | TRACE | qtp533277569-75  | BrokerFacade                     | 298 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.10.0.SNAPSHOT | Read OPERATIONAL via Restconf: /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology
2019-07-31T09:42:11,423 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-30 | NettyTransport                   | 88 - com.typesafe.akka.slf4j - 2.5.21 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /192.168.67.48:2550
2019-07-31T09:42:11,431 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-30 | ReliableDeliverySupervisor       | 88 - com.typesafe.akka.slf4j - 2.5.21 | Association with remote system [akka.tcp://opendaylight-cluster-data@192.168.67.48:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@192.168.67.48:2550]] Caused by: [java.net.ConnectException: Connection refused: /192.168.67.48:2550]
2019-07-31T09:42:12,458 | DEBUG | qtp533277569-68  | RestconfImpl                     | 298 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.10.0.SNAPSHOT | Execution RpcError:
java.util.concurrent.ExecutionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://opendaylight-cluster-data@192.168.67.48:2550/user/rpc/broker#1272405160]] after [15000 ms]. Message of type [org.opendaylight.controller.remote.rpc.messages.ExecuteRpc]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

 

 

Comment by Emmett Cox [ 01/Aug/19 ]

I wonder about the warning regarding the Connection refused for one of the nodes.... I've noticed the same warning appear for some of the other tests that succeed, but those tests take a few seconds longer to execute.... could it simply be akka timing out when it' would work if given a few extra seconds? not that it should take so long, but...

Comment by Emmett Cox [ 06/Aug/19 ]

discovered that I was missing the debug option for remote rpc logs, so there's a little bit more being logged now....

going to include the logs from all 3 nodes, give me a min to add them.... 

 

Comment by Emmett Cox [ 08/Aug/19 ]

Discovered the root of the issue is to do with the OpsRegistrar changes made as part of my commit.

Part of the changes removed functionality that removed and closed old rpc registrations, which caused the rpc's to not update correctly and fail when a node was shutdown.

I'm in the midst of making some code changes to fix the bug.

Comment by Luis Gomez [ 12/Aug/19 ]

Hi Emmett,

FYI, I started a verification on this patch, considering this is the candidate fix:

https://git.opendaylight.org/gerrit/c/controller/+/83530

Comment by Luis Gomez [ 13/Aug/19 ]

I think the issue is fixed now: https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-sodium/

Generated at Wed Feb 07 19:56:45 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.