[NETCONF-454] Three node cluster does not recover from isolation + rejoin of a member. Created: 16/Aug/17 Updated: 22/Jan/24 |
|
| Status: | Confirmed |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | 2.0.2 |
| Fix Version/s: | 7.0.0, 5.0.10, 6.0.7 |
| Type: | Bug | Priority: | High |
| Reporter: | RichardHill | Assignee: | Ivan Hrasko |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | pt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | |
| Issue Links: | |
| External issue ID: | 8999 |
| Description |
Logs are attached. Member-1 was rejoined at approx 04:44. Results when trying to view the netconf-topology after rejoin and results when looking beyond the mount point after rejoin follow below.
Member-1: 10.10.199.105
Set the log levels for clustering.
odl-jolokia odl-netconf-clustered-topology odl-netconf-connector-all odl-restconf odl-netconf-mdsal
curl --request GET \ Result:
curl --request GET \ Result:
curl --request GET \ Result:
java -Xmx1G -XX:MaxPermSize=256M -Dorg.apache.sshd.registerBouncyCastle=false -jar netconf-testtool-1.2-Carbon-executable.jar --device-count 1 --debug true --schemas-dir ./schemas --md-sal true
curl --request PUT \ Result:
curl --request GET \ Result:
curl --request GET \ Result:
Member-XXX |
| Comments |
| Comment by RichardHill [ 16/Aug/17 ] |
|
The postman collection was run to investigate the state of the cluster.
The collection was run before and after isolation. |
| Comment by RichardHill [ 16/Aug/17 ] |
|
Attachment postman-collection-and-results-isolation-and-rejoin.zip has been added with description: postman results and collection used to examine cluster |
| Comment by RichardHill [ 16/Aug/17 ] |
|
| Comment by RichardHill [ 16/Aug/17 ] |
|
Attachment logs-carbon-isolation-rejoin.zip has been added with description: logs from each member of cluster |
| Comment by RichardHill [ 16/Aug/17 ] |
|
Error seen on member-1 after rejoin 2017-08-16 04:44:18,623 | ERROR | Bundle Shutdown | NetconfTopologyManager | 303 - org.opendaylight.netconf.topology-singleton - 1.2.0.Carbon | Error at closing topology context. InstanceIdentifier: KeyedInstanceIdentifier {targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=topology-netconf]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=netconf-test-device]]]]}java.lang.NullPointerException: RemoteDevice {netconf-test-device}: Device communicator was not created. |
| Comment by RichardHill [ 16/Aug/17 ] |
2017-08-16 04:45:09,241 | ERROR | qtp666241474-424 | ContainerResponse | 179 - com.sun.jersey.jersey-server - 1.17.0 | The RuntimeException could not be mapped to a response, re-throwing to the HTTP container |
| Comment by RichardHill [ 17/Aug/17 ] |
|
java -version |
| Comment by Robert Varga [ 17/Aug/17 ] |
|
This is a problem in netconf, not controller. |
| Comment by Tomas Olvecky [ 17/Aug/17 ] |
|
I think there is also a problem in clustering, because node-2 did not respond to restconf requests that were querying the md-sal topology: Reading the logs of node-2 I see a TimeoutException in blueprint. This happened after the node got quarantined: Fix for |
| Comment by Vratko Polak [ 21/Aug/17 ] |
|
> Try 1) Member 1: Null Member-2: 404 Member-3: 500
What is the Jolokia response in this case?
> Fix for
We will test the fix.
[0] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-only-carbon/630/log.html.gz#s1-s5-t13-k2-k2-k8-k1-k2-k1-k1-k3-k1 |
| Comment by RichardHill [ 22/Aug/17 ] |
|
Regarding
>> What is the Jolokia response in this case?
>> Member 1: Null Member-2: 404 Member-3: 500
The response is recorded in the postman results attached. It's possibly not clear from the bug report which result files map to which Try. Netconf-testtool-reconnecton.postman_test_run_after_rejoin.json = Try 1. Netconf-testtool-reconnecton.postman_test_run_after_rejoin.json: Line 693 shows the results for Try 1. |
| Comment by Vratko Polak [ 22/Aug/17 ] |
|
I am still not sure if there are two different bugs, or just two descriptions of the same bug, so I have opened |
| Comment by RichardHill [ 22/Aug/17 ] |
|
Regarding
>> I am still not sure if there are two different bugs,
What information do you need to help investigate? |
| Comment by Vratko Polak [ 25/Aug/17 ] |
|
> odl-jolokia odl-netconf-clustered-topology odl-netconf-connector-all odl-restconf odl-netconf-mdsal
I think odl-netconf-connector-all is not really compatible with odl-netconf-clustered-topology. Does this happen without any odl-netconf-connector- feature?
> What information do you need to help investigate?
Ideally, we would want a Robot suite contributed to Integration/Test so that we can run it with various ODL builds on Sandbox. I think this suite: Sadly, I am on PTO next week, otherwise I would attempt to create such a suite myself. |
| Comment by RichardHill [ 05/Sep/17 ] |
|
I'll push something |
| Comment by RichardHill [ 20/Sep/17 ] |
|
I've tried to push this but I struggled working out why coala was rejecting the submission. https://git.opendaylight.org/gerrit/#/c/62765/ TOX validation failed and I was not sure why. Can you point me to some documents about configuring "coala", please? After trial and error I eventually removed some commented-out code and added a linebreak in the git command, which got it to pass. |
| Comment by Vratko Polak [ 21/Sep/17 ] |
|
> https://git.opendaylight.org/gerrit/#/c/62765/
I have adapted that to run on Sandbox (without other suites). The test failed as described. See run [4] and archived logs [5] (both will get deleted over the weekend). That was run against the current Carbon snapshot build, and errors in logs look somewhat different. I will try to set DEBUG verbosity (on appropriate packages I have yet to narrow down) to see more details.
[4] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-all-carbon/1/
> TOX validation failed and I was not sure why.
If you have Bash with an activated virtualenv with Python 2.7 and tox, you can type "tox" in your test repository to run the same checks. Error messages are easier to locate that way. |
| Comment by Vratko Polak [ 21/Sep/17 ] |
|
> see more details
From the INFO logs I got the impression that the device owner had moved after rejoin, so this may be hitting the same Singleton deficiency as |
| Comment by Vratko Polak [ 21/Sep/17 ] |
|
> I will try to set DEBUG verbosity
Run #3: [6].
> may be hitting the same Singleton deficiency as
Nope. According to member-1 log [7], RELEASING_OWNERSHIP state did not get stuck: switching from RELEASING_OWNERSHIP to STANDBY
Currently, I am suspicious about the way the new master removes [8] a slave mount point (not sure on which member):
2017-09-21 15:22:57,627 | DEBUG | lt-dispatcher-50 | NetconfNodeManager | 282 - org.opendaylight.netconf.topology-singleton - 1.2.2.SNAPSHOT | RemoteDevice{netconf-test-device} : Sending poison pill to Actorakka://opendaylight-cluster-data/user/$v#-411647032
[6] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-all-carbon/3/ |
| Comment by Vratko Polak [ 25/Sep/17 ] |
|
> Run #3: [6]. For this week, [9] seems to be the same, although I have not looked into karaf.log yet. [9] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-only-carbon/1/ |
| Comment by Vratko Polak [ 26/Sep/17 ] |
|
> https://git.opendaylight.org/gerrit/#/c/62765/ The suite is merged. I am still investigating whether the failure is in Singleton implementation. |
| Comment by Vratko Polak [ 18/Oct/17 ] |
|
Finally I think I understand what is going on well enough to write a comment. First, Netconf clustered topology is based on the Singleton service, which is based on the Entity ownership service, which uses a specific Shard to store its data; when the previous Leader of that shard is isolated, it enters the IsolatedLeader state and cannot make progress. Only after rejoining does the affected member learn that it has lost its ownership and that another member is already the owner. The isolation suite frequently passes because the Entity ownership service takes some time to notify the member, and until that happens the member still considers itself the master, keeps an open SSH connection to the device, and everything looks good from the Robot point of view.

I am less sure on the details of what happens when this bug is present: That would work well if it were true that there is always at most one owner across the cluster. But in the isolation scenario, the new owner updates the operational topology first, and then the old owner deletes that without the new owner realizing it. Also, the other slaves interpret that as the device disconnecting and remove their mount points as well, resulting in this bug. I am not sure whether the Entity ownership service could be modified to guarantee at most one owner across the cluster even in isolation scenarios.

P.S.: This week's Sandbox job with perhaps too many traces: [10].
[10] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-only-oxygen/ |
| Comment by Vratko Polak [ 18/Oct/17 ] |
|
> This week Sandbox job
Selected lines from karaf log [11]:
2017-10-18 12:12:49,195 | INFO | nsole user karaf | core | 167 - org.apache.karaf.log.core - 4.0.10 | ROBOT MESSAGE: Starting test Find_And_Isolate_Device_Entity_Owner
2017-10-18 12:12:51,052 | INFO | nsole user karaf | core | 167 - org.apache.karaf.log.core - 4.0.10 | ROBOT MESSAGE: Starting test Wait_For_New_Owner_To_Appear
2017-10-18 12:13:01,731 | INFO | nsole user karaf | core | 167 - org.apache.karaf.log.core - 4.0.10 | ROBOT MESSAGE: Starting test Rejoin_Original_Entity_Owner
2017-10-18 12:13:02,385 | DEBUG | lt-dispatcher-21 | sterSingletonServiceProviderImpl | 298 - org.opendaylight.mdsal.singleton-dom-impl - 2.4.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=topology-netconf]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=netconf-test-device]]]]}}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] |
| Comment by Vratko Polak [ 18/Oct/17 ] |
|
> I am not sure whether Entity ownership service could be modified to
If I am reading this [12] comment correctly, having two owners during a cluster partition is an expected outcome. |
| Comment by Tomas Cere [ 19/Oct/17 ] |
|
I am pretty sure that the singleton service should be trying to shut down the service on the isolated node. However, if we are doing any datastore operations in the close of the Netconf service based on singleton, we won't shut down until the datastore is available, so if I'm right we will have two owners on rejoin. No idea at the moment what the appropriate solution would be; I would have to think about this. |
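A minimal illustration of that hazard, in plain Java instead of the real datastore API; the class and member names here are hypothetical stand-ins for a commit that cannot complete while the member is isolated:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BlockingCloseExample {
    // Stands in for a datastore commit; on an isolated member this future
    // cannot complete because the shard has no quorum.
    private final CompletableFuture<Void> pendingCommit = new CompletableFuture<>();

    void closeServiceInstance() throws Exception {
        try {
            // Waiting here is the problem: the singleton cannot finish closing,
            // so the old owner keeps its instance alive until after rejoin.
            pendingCommit.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("close stuck: datastore unavailable while isolated");
        }
    }
}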
| Comment by Vratko Polak [ 26/Oct/17 ] |
|
Workaround attempt (master branch): [13]. |
| Comment by Vratko Polak [ 26/Oct/17 ] |
|
> [13]
Testing shows two issues. One is specific to the current patch not cleaning up operational state [14] (3 occurrences), the other looks like a controller bug |
| Comment by Ivan Hrasko [ 31/Mar/21 ] |
|
The problem is partially caused by mdsal-clustering behavior and partially by netconf-clustering behavior. The problem in mdsal-clustering is that after the isolation the previous leader becomes leader again, and in general there are more leader transitions than necessary. I believe the sequence should be as follows: But after applying this patch https://git.opendaylight.org/gerrit/c/netconf/+/95511 we get logs like this:
21
2021-03-24T15:52:32,600 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
21
2021-03-24T15:52:36,819 | INFO | remote-connector-processing-executor-10 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
ISOLATION 15:52:43
23
2021-03-24T15:52:52,229 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
23
2021-03-24T15:52:53,496 | INFO | remote-connector-processing-executor-12 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
NEW OWNER 15:52:53.660
21
2021-03-24T15:52:55,988 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
21
2021-03-24T15:52:55,991 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} disconnected - unregistering master mount point
REJOIN 15:53:57
21
2021-03-24T15:54:00,702 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
21
2021-03-24T15:54:00,909 | INFO | remote-connector-processing-executor-1 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
21
2021-03-24T15:54:08,477 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-23 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
RESYNC 15:54:08.844
22
2021-03-24T15:54:09,253 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
23
2021-03-24T15:54:09,261 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-37 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
23
2021-03-24T15:54:09,268 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-37 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} disconnected - unregistering master mount point
22
2021-03-24T15:54:09,478 | INFO | remote-connector-processing-executor-1 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
22
2021-03-24T15:54:11,938 | INFO | opendaylight-cluster-data-notification-dispatcher-49 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
22
2021-03-24T15:54:11,961 | INFO | opendaylight-cluster-data-notification-dispatcher-49 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} disconnected - unregistering master mount point
The problem in netconf clustering is that the NetconfTopologyContext class uses the RemoteDeviceConnector class to control the leader-follower transition. Its methods work well for connecting and disconnecting the device, but they are misused for leadership changes. When a node becomes the leader, RemoteDeviceConnector#startRemoteDeviceConnection is called. This method starts the "master" mount point and writes status "connected" for the device to the operational datastore. When a node becomes a follower, RemoteDeviceConnector#stopRemoteDeviceConnection is called. This method stops the "master" mount point, writes "connecting" for the device to the operational datastore, and finally deletes the node data from the operational datastore. And that is the problem. After the previous leader is isolated and a new leader is elected, the new leader writes status "connected" and the followers establish their "slave" mount points. But after rejoin we can get a disconnect on receiving "connecting", after which all node data are deleted from the operational datastore. I have made a test run without removing the config data for the device in the teardown, but I ended up in a state where all "slave" mount points were down and the operational data for the node were gone. The controller was not able to recover the connection; the only solution was to restart the test device!

I suggest we need becomeLeader() and becomeFollower() methods, which will do the following (see the sketch below): becomeLeader() will write "connecting" to the operational datastore to cause a reconnect on the followers (they need to drop "slave" mount points redirected to the previous leader), then it will establish the "master" mount point, and if that succeeds it can write "connected" to the operational datastore, thus notifying the followers to create their "slave" mount points. becomeFollower() will just stop the "master" mount point and write nothing to the operational datastore. I think that the current implementation, by writing to the operational datastore after the node is no longer the leader, breaks the netconf clustering idea that only the leader can write the device status. That looks simple, but the implementation of RemoteDeviceConnector using the NetconfConnectorDTO class makes this a hard problem. We will need to rework some of the classes and unit tests. |
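The following is a rough sketch of the proposed becomeLeader()/becomeFollower() split. The helper names (writeOperationalStatus, startMasterMountPoint, stopMasterMountPoint) are hypothetical placeholders, not the actual RemoteDeviceConnector or NetconfConnectorDTO API:

final class LeadershipHandler {

    void becomeLeader() {
        // 1. Force followers to drop slave mount points that still point at the old leader.
        writeOperationalStatus(ConnectionStatus.Connecting);
        // 2. Establish the master mount point locally.
        final boolean masterUp = startMasterMountPoint();
        // 3. Only a successful master announces "connected", which tells followers
        //    to create their slave mount points against the new leader.
        if (masterUp) {
            writeOperationalStatus(ConnectionStatus.Connected);
        }
    }

    void becomeFollower() {
        // Stop the local master mount point but write nothing to the operational
        // datastore: only the leader is allowed to write the device status.
        stopMasterMountPoint();
    }

    enum ConnectionStatus { Connecting, Connected }

    // Illustrative stubs only.
    private void writeOperationalStatus(final ConnectionStatus status) { }
    private boolean startMasterMountPoint() { return true; }
    private void stopMasterMountPoint() { }
}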
| Comment by Ivan Hrasko [ 07/Apr/21 ] |
|
After fixing the previous problems we occasionally get to the following problem: Node 22 was the leader. We have broken the isolation and noticed the following logs:
node 21:
node 22:
2021-04-07T14:35:42.142869+02:00[Europe/Bratislava]: I am master
2021-04-07T14:35:42.289366+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.22:2550
2021-04-07T14:35:47.022939+02:00[Europe/Bratislava]: I am follower
2021-04-07T14:35:47.039052+02:00[Europe/Bratislava]: modification...
2021-04-07T14:35:47.049861+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.22:2550
node 23:
2021-04-07T14:35:46.612573+02:00[Europe/Bratislava]: modification...
2021-04-07T14:35:46.626221+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-07T14:35:47.040825+02:00[Europe/Bratislava]: modification...
2021-04-07T14:35:47.048678+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.22:2550
We see that node 21 is somehow out of the game and nodes 22 and 23 were trying to form a leader-follower relationship. Note:
instantiateServiceInstance and closeServiceInstance methods are implemented in: .. and here is the output of /rests/data/entity-owners:entity-owners:
{
"entity-owners:entity-owners": {
"entity-type": [
{
"type": "org.opendaylight.mdsal.ServiceEntityType",
"entity": [
{
"id": "/odl-general-entity:entity[name='KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey{_topologyId=Uri{_value=topology-netconf}}], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey{_nodeId=Uri{_value=netconf-test-device}}]]}']",
"candidate": [
{
"name": "member-1"
},
{
"name": "member-3"
},
{
"name": "member-2"
}
],
"owner": "member-1"
}
]
},
{
"type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
"entity": [
{
"id": "/odl-general-entity:entity[name='KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey{_topologyId=Uri{_value=topology-netconf}}], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey{_nodeId=Uri{_value=netconf-test-device}}]]}']",
"candidate": [
{
"name": "member-1"
}
],
"owner": "member-1"
}
]
}
]
}
}
|
| Comment by Ivan Hrasko [ 13/Apr/21 ] |
|
We have reproduced this problem again. The problem is as follows: But the isolated leader node 21 after isolation got notification that The thing that makes Netconf clustering crash is that node 21 This results in an incorrect leader address stored in operational DS as you can
21
2021-04-12T14:16:16.658098+02:00[Europe/Bratislava]: Registering node manager: null
2021-04-12T14:16:16.672608+02:00[Europe/Bratislava]: registering..
2021-04-12T14:16:16.692363+02:00[Europe/Bratislava]: I am leader
2021-04-12T14:16:16.699679+02:00[Europe/Bratislava]: Unregistering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@58d4a02
2021-04-12T14:16:18.089251+02:00[Europe/Bratislava]: connecting device.. true
2021-04-12T14:16:18.106551+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:16:34.811554+02:00[Europe/Bratislava]: I am follower
2021-04-12T14:16:34.826534+02:00[Europe/Bratislava]: Disconnecting device.. false
2021-04-12T14:16:34.841825+02:00[Europe/Bratislava]: Registering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@58d4a02
2021-04-12T14:16:34.856144+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:34.863838+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:17:36.064744+02:00[Europe/Bratislava]: I am leader
2021-04-12T14:17:36.076016+02:00[Europe/Bratislava]: Unregistering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@9603f02
2021-04-12T14:17:36.263286+02:00[Europe/Bratislava]: connecting device.. true
2021-04-12T14:17:36.292559+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:17:40.090901+02:00[Europe/Bratislava]: I am follower
2021-04-12T14:17:40.118272+02:00[Europe/Bratislava]: Disconnecting device.. false
2021-04-12T14:17:40.130646+02:00[Europe/Bratislava]: Registering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@9603f02
2021-04-12T14:17:40.144148+02:00[Europe/Bratislava]: modification...
2021-04-12T14:17:40.151471+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
Disconnected from the target VM, address: '192.168.56.21:5005', transport: 'socket'
22
2021-04-12T14:16:17.677425+02:00[Europe/Bratislava]: Registering node manager: null
2021-04-12T14:16:17.700051+02:00[Europe/Bratislava]: registering..
2021-04-12T14:16:17.703418+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:17.725131+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:16:18.140807+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:18.147951+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:16:20.613060+02:00[Europe/Bratislava]: master: Actor[akka.tcp://opendaylight-cluster-data@192.168.56.21:2550/user/akka.tcp:opendaylight-cluster-data@192.168.56.21:2550_netconf-test-device#770934166]
2021-04-12T14:16:33.702581+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:33.712800+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:16:33.885460+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:33.891728+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.23:2550
2021-04-12T14:17:31.334643+02:00[Europe/Bratislava]: master: Actor[akka.tcp://opendaylight-cluster-data@192.168.56.23:2550/user/akka.tcp:opendaylight-cluster-data@192.168.56.23:2550_netconf-test-device#-189182088]
2021-04-12T14:17:40.008555+02:00[Europe/Bratislava]: modification...
2021-04-12T14:17:40.018033+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:17:40.121272+02:00[Europe/Bratislava]: modification...
2021-04-12T14:17:40.127474+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
Disconnected from the target VM, address: '192.168.56.22:5005', transport: 'socket'
23
2021-04-12T14:16:17.676719+02:00[Europe/Bratislava]: Registering node manager: null
2021-04-12T14:16:17.693843+02:00[Europe/Bratislava]: registering..
2021-04-12T14:16:17.696230+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:17.714960+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:16:18.142020+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:18.149924+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:16:33.668206+02:00[Europe/Bratislava]: I am leader
2021-04-12T14:16:33.675129+02:00[Europe/Bratislava]: Unregistering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@3254b5b4
2021-04-12T14:16:33.847092+02:00[Europe/Bratislava]: connecting device.. true
2021-04-12T14:16:33.861299+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.23:2550
Disconnected from the target VM, address: '192.168.56.23:5005', transport: 'socket'
We have created a workaround for this in https://git.opendaylight.org/gerrit/c/netconf/+/95755
We conclude that the root of this problem is not in Netconf clustering but in mdsal clustering. |
| Comment by Ivan Hrasko [ 03/Aug/21 ] |
|
We have to stop the work on this task because ClusterSingletonService currently does not work, for the following reasons: The base logic of the Netconf clustering is located in:
class NetconfTopologyContext implements ClusterSingletonService, AutoCloseable
The principle is based on the ClusterSingletonService#instantiateServiceInstance and ClusterSingletonService#closeServiceInstance methods. The NetconfTopologyContext class expects that the #instantiateServiceInstance method is invoked when The same is expected with #closeServiceInstance. Since changes in clustering have been made for task: https://jira.opendaylight.org/browse/CONTROLLER-1982 Similarly, #closeServiceInstance is called only when deleting the topology node on the original leader. Note: /rests/data/entity-owners:entity-owners?content=nonconfig is no longer available, thus to find out which node is the leader we have switched to using:
/jolokia/read/org.opendaylight.controller:Category=Shards,name={member}-shard-topology-operational,type=DistributedOperationalDatastore
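For reference, a minimal sketch of the contract described above, assuming the mdsal singleton API shapes roughly as used here (package names and method signatures vary between releases); this is not the actual NetconfTopologyContext code:

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;

final class ExampleTopologyContext implements ClusterSingletonService, AutoCloseable {
    // Hypothetical group identifier; the real context derives it from the topology node.
    private static final ServiceGroupIdentifier SGI = ServiceGroupIdentifier.create("netconf-test-device");

    @Override
    public ServiceGroupIdentifier getIdentifier() {
        return SGI;
    }

    @Override
    public void instantiateServiceInstance() {
        // Expected on every node that becomes the entity owner, including after
        // an isolation/rejoin cycle: this is where the master mount point starts.
    }

    @Override
    public ListenableFuture<?> closeServiceInstance() {
        // Expected when ownership is lost: tear down the master mount point.
        return Futures.immediateFuture(null);
    }

    @Override
    public void close() {
        // Final cleanup when the topology node is deleted.
    }
}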
|
| Comment by Robert Varga [ 10/Aug/21 ] |
|
https://jira.opendaylight.org/browse/CONTROLLER-1992 tracks missing entity ownership state knobs. Note that the jolokia endpoint does not really have much correlation to EOS decisions. |
| Comment by Robert Varga [ 10/Aug/21 ] |
|
ivanhrasko, can you recreate the lack of callback invocation in https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/eos-dom-akka/src/test/java/org/opendaylight/controller/eos/akka/ThreeNodeBaseTest.java ? |
| Comment by Robert Varga [ 06/Nov/22 ] |
|
The patches are marked as work-in-progress. Please update them and this issue once they are ready. |