[CONTROLLER-1558] Routed RPCs in cluster breaks after isolation/heal Created: 13/Oct/16 Updated: 25/Jul/23 Resolved: 08/Mar/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Michal Rehak | Assignee: | Tomas Cere |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 6937 |
| Description |
|
If a routed RPC is registered on one node in the cluster, it is routed to that node from any other cluster node (using restconf-rpc). Restconf output: or <errors xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf"><error><error-type>application</error-type><error-tag>operation-not-supported</error-tag><error-message>No local or remote implementation available for rpc AbsoluteSchemaPath {path=[(urn:opendaylight:groupbasedpolicy:base_endpoint?revision=2016-04-27)register-endpoint]}</error-message></error></errors> Tested on a 3-node cluster, branch: master (mvn -U @ 2016-10-13). |
| Comments |
| Comment by Michal Rehak [ 13/Oct/16 ] |
|
Attachment cluster-isolationX2-20161013.zip has been added with description: logs, scenario overview, restconf outputs |
| Comment by Michal Rehak [ 18/Oct/16 ] |
|
What happened, in a nutshell:
On further attempts, the response was: Rpc implementation for {} was removed during processing. |
| Comment by Michal Rehak [ 18/Oct/16 ] |
|
Attachment cluster-isolationX2-20161018.zip has been added with description: logs, scenario overview, restconf outputs + DEBUG remoterpc |
| Comment by Robert Varga [ 13/Jan/17 ] |
|
Is this still reproducible on current Carbon? The reported exception seems to be impossible with the current codebase (and if it does occur, it points to classpath badness). |
| Comment by Michal Rehak [ 17/Jan/17 ] |
|
This is still broken on current master (mvn -U @ 20170117 08:00 UTC). |
| Comment by Robert Varga [ 17/Jan/17 ] |
|
I think this will be addressed with the patch in BUG-3128, at least partially. The first RESTCONF output points to a sal-remoterpc-connector routing loop, i.e. the RPC request is being invoked on a remote node, but that node tracks that RPC as remote, as if the surviving node were pointing to a previous (but not current) owner. Patch to correct the format string: https://git.opendaylight.org/gerrit/50574. The second output points to a failure to find a router for something we are registered for, but which is no longer present in RpcRegistry's RoutingTable. This codepath is eliminated in BUG-3128. The logs point towards the akka cluster not reforming (failing to associate). We need to reproduce this in CSIT and make sure the akka cluster is working as expected. A thread dump could be useful (maybe something is stuck). Is the data store working okay? |
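The two failure modes described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the sal-remoterpc-connector implementation: `Node`, `invoke`, `RoutingLoopError`, and the node names are hypothetical, and each node's routing table is reduced to a plain dictionary. It shows how a stale route to a previous owner produces a loop-like failure, and how a missing route produces the "No local or remote implementation available" error.

```python
from dataclasses import dataclass, field


class RoutingLoopError(Exception):
    """A forwarded request reached a node that also tracks the RPC as remote."""


@dataclass
class Node:
    name: str
    local_rpcs: set = field(default_factory=set)       # RPCs implemented on this node
    remote_routes: dict = field(default_factory=dict)  # rpc name -> believed owner node


def invoke(cluster, node_name, rpc, forwarded=False):
    """Invoke rpc on node_name, forwarding at most once to the believed owner."""
    node = cluster[node_name]
    if rpc in node.local_rpcs:
        return f"{rpc} executed on {node.name}"
    owner = node.remote_routes.get(rpc)
    if owner is None:
        # Second failure mode: the routing table has no entry at all.
        raise LookupError(f"No local or remote implementation available for rpc {rpc}")
    if forwarded:
        # First failure mode: the sender pointed at a node that is not the owner.
        raise RoutingLoopError(f"{node.name} received {rpc} but tracks it as remote on {owner}")
    return invoke(cluster, owner, rpc, forwarded=True)


RPC = "register-endpoint"

# Healthy cluster: node1 owns the RPC, the other nodes route to it.
healthy = {
    "node1": Node("node1", local_rpcs={RPC}),
    "node2": Node("node2", remote_routes={RPC: "node1"}),
    "node3": Node("node3", remote_routes={RPC: "node1"}),
}

# Hypothetical state after isolation/heal: node3 still points at node2,
# a previous owner, while node2 itself tracks the RPC as remote.
stale = {
    "node1": Node("node1", local_rpcs={RPC}),
    "node2": Node("node2", remote_routes={RPC: "node1"}),
    "node3": Node("node3", remote_routes={RPC: "node2"}),
}

print(invoke(healthy, "node2", RPC))
```

In this model, calling `invoke(stale, "node3", RPC)` raises `RoutingLoopError`, because node3 forwards to node2, which is not the current owner. Refusing a second hop is a design choice of the sketch; it makes the stale entry visible instead of letting the request bounce between nodes.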
| Comment by Michal Rehak [ 19/Jan/17 ] |
|
Yes, the data store worked, although I expected not to get any response from the isolated node. |
| Comment by Robert Varga [ 26/Jan/17 ] |
| Comment by Tomas Cere [ 27/Jan/17 ] |
|
The above patch should fix the isolated node not having any RPCs registered after the cluster heals. There is still an issue where one of the nodes in the cluster loses remotely registered RPCs once the cluster is healed; this will need further analysis and most likely some additional logging in Gossiper. The EndpointAssociationException in the first post is harmless: all it actually says is that the node cannot communicate with a cluster member, which is expected on node isolation. |
| Comment by Tomas Cere [ 10/Feb/17 ] |
|
Seems to be fixed now, according to CSIT. https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-carbon/ |