[NETCONF-454] Three node cluster does not recover from isolation+ rejoin of a member. Created: 16/Aug/17  Updated: 22/Jan/24

Status: Confirmed
Project: netconf
Component/s: netconf
Affects Version/s: 2.0.2
Fix Version/s: 7.0.0, 5.0.10, 6.0.7

Type: Bug Priority: High
Reporter: RichardHill Assignee: Ivan Hrasko
Resolution: Unresolved Votes: 1
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive logs-carbon-isolation-rejoin.zip, Zip Archive postman-collection-and-results-isolation-and-rejoin.zip
Issue Links:
Duplicate
is duplicated by CONTROLLER-1784 DataTreeChangeListener when registere... Resolved
Relates
relates to CONTROLLER-1992 Expose entity ownership information v... Resolved
relates to NETCONF-1039 Remove RemoteDeviceConnector class Resolved
External issue ID: 8999

 Description   
    1. Aim:
      Verify that a three node cluster can recover from isolation+rejoin of a member.
    1. Description:
      A 3 node cluster was deployed. The netconf test tool was connected and we looked behind the mount point from each member of the cluster.
      We then found the leader of the config-topology shard and the config-inventory shard.
      A member other than the leader was then isolated. In this case the leader was member-2, and member-1 was isolated using iptables.
      We looked beyond the mount point of the netconf device from each member of the cluster.
      The isolated member returned HTTP 401 when trying to view the netconf-topology using RESTCONF.
      Member-2 and member-3 returned HTTP 200 and data when trying to view the netconf-topology and look beyond the mount point.

Logs are attached. Member-1 was rejoined at approx. 04:44.
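The exact iptables rules used are not recorded in this report; a minimal sketch of how such an isolation and rejoin is typically done (run on member-1, peer addresses taken from the test setup below):

# Isolate member-1 from the other cluster members
iptables -A INPUT -s 10.10.199.88 -j DROP
iptables -A INPUT -s 10.10.199.183 -j DROP
iptables -A OUTPUT -d 10.10.199.88 -j DROP
iptables -A OUTPUT -d 10.10.199.183 -j DROP

# Rejoin by deleting the same rules again
iptables -D INPUT -s 10.10.199.88 -j DROP
iptables -D INPUT -s 10.10.199.183 -j DROP
iptables -D OUTPUT -d 10.10.199.88 -j DROP
iptables -D OUTPUT -d 10.10.199.183 -j DROP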

Results when trying to view the netconf-topology after rejoin
Try 1) Member 1: Null Member-2: 404 Member-3: 500
Try 2) Member 1: 200 Member-2: 404 Member-3: 200
Try 3) Member 1: 200 Member-2: 401 Member-3: 200

Results when looking beyond the mount point after rejoin.
Try 1) Member 1: 404 Member-2: 404 Member-3: 500
Try 2) Member 1: 404 Member-2: 404 Member-3: 500
Try 3) Member 1: 404 Member-2: 401 Member-3: 500

    1. Test artifact
      Carbon
    1. Test setup

Member-1: 10.10.199.105
Member-2: 10.10.199.88
Member-3: 10.10.199.183
Netconf device: 10.10.199.185

    1. Steps
      Configure a 3 node cluster.
      Start ODL on all members.
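A hedged sketch of one way to do this, assuming the Karaf distribution's bin/configure_cluster.sh helper is used (first argument is the member index, followed by the seed-node IPs from the test setup):

# On member-1 (use index 2 and 3 on the other members)
./bin/configure_cluster.sh 1 10.10.199.105 10.10.199.88 10.10.199.183
./bin/start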

Set the log levels for clustering.
log:set DEBUG org.opendaylight.controller.cluster
log:set INFO org.opendaylight.controller.cluster.datastore.node.utils

  1. Features loaded:

odl-jolokia odl-netconf-clustered-topology odl-netconf-connector-all odl-restconf odl-netconf-mdsal
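These are typically installed from the Karaf console, e.g.:

feature:install odl-jolokia odl-netconf-clustered-topology odl-netconf-connector-all odl-restconf odl-netconf-mdsal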

    1. REST API CALLS
  1. Find the leader of the shards on DistributedConfigDatastore.
      1. Leader of shard-topology-config on DistributedConfigDatastore on each member

curl --request GET \
--url 'http://10.10.199.XXX:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-topology-config,type=DistributedConfigDatastore' \
--header 'accept: application/json' \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/json' \
--header 'postman-token: 19ee9a88-1c09-6762-6a54-0b44c8cba583'

Result:
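A minimal sketch of extracting the shard leader from the Jolokia reply (assuming the shard MBean exposes "RaftState" and "Leader" attributes under the top-level "value" key, and that python3 is available on the client):

curl -s -u admin:admin \
'http://10.10.199.XXX:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-topology-config,type=DistributedConfigDatastore' \
| python3 -c 'import json,sys; v = json.load(sys.stdin)["value"]; print(v.get("RaftState"), v.get("Leader"))'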

      1. Leader of shard-inventory-config on DistributedConfigDatastore on each member

curl --request GET \
--url 'http://10.10.199.XXX:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-inventory-config,type=DistributedConfigDatastore' \
--header 'accept: application/json' \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/json' \
--header 'postman-token: 1dc38630-8b41-10e4-cdfa-fb53639c543a'

Result:

      1. Confirm no device added to network topology

curl --request GET \
--url http://10.10.199.105:8181/restconf/operational/network-topology:network-topology \
--header 'accept: application/xml' \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/xml' \
--header 'postman-token: 1633d048-98ee-ae9a-00dd-b9f33749cb70'

Result:

      1. Start device on testtools-vm

java -Xmx1G -XX:MaxPermSize=256M -Dorg.apache.sshd.registerBouncyCastle=false -jar netconf-testtool-1.2-Carbon-executable.jar --device-count 1 --debug true --schemas-dir ./schemas --md-sal true

      1. Add device

curl --request PUT \
--url http://10.10.199.105:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/netconf-test-device \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/xml' \
--header 'postman-token: a3868282-fbca-58e4-08ee-fc522a730e30' \
--data ' <node xmlns="urn:TBD:params:xml:ns:yang:network-topology">\r\n <node-id>netconf-test-device</node-id>\r\n <host xmlns="urn:opendaylight:netconf-node-topology">10.10.199.185</host>\r\n <port xmlns="urn:opendaylight:netconf-node-topology">17830</port>\r\n <username xmlns="urn:opendaylight:netconf-node-topology">root</username>\r\n <password xmlns="urn:opendaylight:netconf-node-topology">root</password>\r\n <tcp-only xmlns="urn:opendaylight:netconf-node-topology">false</tcp-only>\r\n <keepalive-delay xmlns="urn:opendaylight:netconf-node-topology">0</keepalive-delay>\r\n </node>\r\n \r\n '

Result:
HTTP 200

      1. Confirm device is connected in the operational topology on each of the nodes.

curl --request GET \
--url http://10.10.199.XXX:8181/restconf/operational/network-topology:network-topology \
--header 'accept: application/xml' \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/xml' \
--header 'postman-token: 1633d048-98ee-ae9a-00dd-b9f33749cb70'

Result:
HTTP 200

      1. Look beyond the mount point from each of the nodes

curl --request GET \
--url http://10.10.199.XXX:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/netconf-test-device/yang-ext:mount \
--header 'accept: application/xml' \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/xml' \
--header 'postman-token: e17517ae-f008-b96e-43ef-960804d7df87'

Result:
HTTP 200

      1. Find leader of shard-topology-config on DistributedConfigDatastore on each member after isolation

Member-XXX
curl --request GET \
--url 'http://10.10.199.XXX:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-inventory-config,type=DistributedConfigDatastore' \
--header 'accept: application/json' \
--header 'authorization: Basic YWRtaW46YWRtaW4=' \
--header 'cache-control: no-cache' \
--header 'content-type: application/json' \
--header 'postman-token: 607b0a83-e559-f1c6-dfb6-c18fde1c9364'



 Comments   
Comment by RichardHill [ 16/Aug/17 ]

The postman collection was run to investigate the state of the cluster.
The collection examines each member of the cluster and verifies whether:

  • netconf-topology is available
  • netconf device is connected
  • data available behind the mount point of netconf device
  • leader of shard-topology-config
  • leader of shard-inventory-config

The collection was run before and after isolation; after isolation we waited a few minutes and then ran it again.
After rejoin the collection was run almost immediately, and then again after waiting a few minutes.

Comment by RichardHill [ 16/Aug/17 ]

Attachment postman-collection-and-results-isolation-and-rejoin.zip has been added with description: postman results and collection used to examine cluster

Comment by RichardHill [ 16/Aug/17 ]
  1. Isolation of member-1 time:
    approx 2017-08-16 04:35
  1. Rejoin time:
    approx 2017-08-16 04:44
Comment by RichardHill [ 16/Aug/17 ]

Attachment logs-carbon-isolation-rejoin.zip has been added with description: logs from each member of the cluster

Comment by RichardHill [ 16/Aug/17 ]

Error seen on member-1 after rejoin

2017-08-16 04:44:18,623 | ERROR | Bundle Shutdown | NetconfTopologyManager | 303 - org.opendaylight.netconf.topology-singleton - 1.2.0.Carbon | Error at closing topology context. InstanceIdentifier: KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=topology-netconf]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=netconf-test-device]]]]}
java.lang.NullPointerException: RemoteDevice{netconf-test-device}: Device communicator was not created.
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:226)
at org.opendaylight.netconf.topology.singleton.impl.RemoteDeviceConnectorImpl.stopRemote

Comment by RichardHill [ 16/Aug/17 ]
    1. Error seen on member-3 after rejoin

2017-08-16 04:45:09,241 | ERROR | qtp666241474-424 | ContainerResponse | 179 - com.sun.jersey.jersey-server - 1.17.0 | The RuntimeException could not be mapped to a response, re-throwing to the HTTP container
java.lang.IllegalStateException: Can't create ProxyReadTransaction
at org.opendaylight.netconf.topology.singleton.impl.ProxyDOMDataBroker.newReadOnlyTransaction(ProxyDOMDataBroker.java:76)
at org.opendaylight.netconf.sal.restconf.impl.BrokerFacade.readOperationalData(BrokerFacade.java:207)[307:org.opendaylight.netconf.sal-rest-connector:1.5.0.Carbon]
at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.readOperationalData(RestconfImpl.java:736)[307:org.opendaylight.netconf.sal-rest-connector:1.5.0.Carbon]
at org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.readOperationalData(StatisticsRestconfServiceWrapper.java:116)[307:org.opendaylight.netconf.sal-rest

Comment by RichardHill [ 17/Aug/17 ]

java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Comment by Robert Varga [ 17/Aug/17 ]

This is a problem in netconf, not controller.

Comment by Tomas Olvecky [ 17/Aug/17 ]

I think there is also a problem in clustering as well, because node-2 did not respond to restconf requests that were querying md-sal topology:
> Results when trying to view the netconf-topology after rejoin
> Try 1) Member 1: Null Member-2: 404 Member-3: 500
> Try 2) Member 1: 200 Member-2: 404 Member-3: 200
> Try 3) Member 1: 200 Member-2: 401 Member-3: 200

Reading logs of node-2 I see TimeoutException in blueprint. This happened after the node got quarantined:
2017-08-16 04:43:58,092 | WARN | ult-dispatcher-3 | QuarantinedMonitorActor | 239 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | Got quarantined

Fix for CONTROLLER-1627 should resolve that problem.

Comment by Vratko Polak [ 21/Aug/17 ]

> Try 1) Member 1: Null Member-2: 404 Member-3: 500

What is the Jolokia response in this case?
I see similar failures in Netconf cluster tests even without prior isolation, but the Jolokia response [0] is different from a corresponding Nitrogen Bug [1].

> Fix for CONTROLLER-1627 should resolve that problem.

We will test the fix.

[0] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-only-carbon/630/log.html.gz#s1-s5-t13-k2-k2-k8-k1-k2-k1-k1-k3-k1
[1] https://bugs.opendaylight.org/show_bug.cgi?id=9006#c4

Comment by RichardHill [ 22/Aug/17 ]

Regarding

>> What is the Jolokia response in this case?

>> Member 1: Null Member-2: 404 Member-3: 500

The response is recorded in the postman results attached. It's possibly not clear from the bug report which result files map to which Try.

Netconf-testtool-reconnecton.postman_test_run_after_rejoin.json = Try 1
Netconf-testtool-reconnecton.postman_test_run_after_rejoin1.json = Try 2
Netconf-testtool-reconnecton.postman_test_run_after_rejoin2.json = Try 3

Netconf-testtool-reconnecton.postman_test_run_after_rejoin.json: Line 693 shows the results for Try 1

{
  "id": "6b6505be-db1e-ca25-28a3-9b4fb52f8473",
  "name": "View operational network-topology member-1",
  "url": "http://controller-1-ip-addr:port/restconf/operational/network-topology:network-topology",
  "totalTime": 0,
  "responseCode": { "code": 0, "name": "", "detail": "" },
  "tests": { "Response Body ": true },
  "testPassFailCounts": {
    "Response Body ": { "pass": 1, "fail": 0 }
  },
  "times": [ 1008 ],
  "allTests": [ { "Response Body ": true } ],
  "time": 1008,
  "totalRequestTime": 1008,
  "iterationResults": {}
}

Comment by Vratko Polak [ 22/Aug/17 ]

I am still not sure if there are two different bugs, or just two descriptions of the same bug, so I have opened CONTROLLER-1754.

Comment by RichardHill [ 22/Aug/17 ]

Regarding

>> I am still not sure if there are two different bugs,

What information do you need to help investigate?

Comment by Vratko Polak [ 25/Aug/17 ]

> odl-jolokia odl-netconf-clustered-topology odl-netconf-connector-all odl-restconf odl-netconf-mdsal

I think odl-netconf-connector-all is not really compatible with odl-netconf-clustered-topology. Does this happen without any odl-netconf-connector- feature?

> What information do you need to help investigate?

Ideally, we would want a Robot suite contributed to Integration/Test so that we can run it with various ODL builds on Sandbox. I think this suite:
https://github.com/opendaylight/integration-test/blob/master/csit/suites/netconf/clustering/entity.robot#L111
with ClusterManagement.Isolate_Member_From_List_Or_All (and ClusterManagement.Rejoin_Member_From_List_Or_All at line 138) would work, but I know you have to wait carefully after rejoin; it takes a while before the rejoining member is fully up to date.

Sadly, I am on PTO next week, otherwise I would attempt to create such a suite myself.

Comment by RichardHill [ 05/Sep/17 ]

I'll push something

Comment by RichardHill [ 20/Sep/17 ]

I've tried to push this, but I struggled to work out why coala was rejecting the submission.

https://git.opendaylight.org/gerrit/#/c/62765/

TOX validation failed and I was not sure why. Can you point me to some documentation about configuring "coala", please? After trial and error I eventually removed some commented-out code and added a line break in the git command, which got it to pass.

Comment by Vratko Polak [ 21/Sep/17 ]

> https://git.opendaylight.org/gerrit/#/c/62765/

I have adapted that to run on Sandbox (without other suites). The test failed as described. See run [4] and archived logs [5] (both will get deleted over the weekend).

That was run against the current Carbon snapshot build, and the errors in the logs look somewhat different. I will try to set DEBUG verbosity (on appropriate packages I have yet to narrow down) to see more details.

[4] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-all-carbon/1/
[5] https://logs.opendaylight.org/sandbox/jenkins091/netconf-csit-3node-clustering-all-carbon/1/

> TOX validation failed and was not sure why.

If you have Bash with an activated virtualenv containing Python 2.7 and tox, you can type "tox" in your test repository to run the same checks. Error messages are easier to locate that way.

Comment by Vratko Polak [ 21/Sep/17 ]

> see more details

From the INFO logs I got the impression that the device owner moved after the rejoin, so this may be hitting the same Singleton deficiency as MDSAL-291.

Comment by Vratko Polak [ 21/Sep/17 ]

> I will try to set DEBUG verbosity

Run #3: [6].
Member-1 was the old device owner that got isolated, and it was the one not responding correctly after the rejoin. Member-2 is the new device owner.

> may be hitting the same Singleton deficiency as MDSAL-291

Nope. According to the member-1 log [7], the RELEASING_OWNERSHIP state did not get stuck:
2017-09-21 15:23:00,694 | DEBUG | ult-dispatcher-2 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Service group KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=topology-netconf]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=netconf-test-device]]]]} switching from RELEASING_OWNERSHIP to STANDBY

Currently, I am suspicious about the way the new master removes [8] a slave mount point (not sure on which member):
2017-09-21 15:22:57,627 | INFO | lt-dispatcher-50 | NetconfTopologyContext | 282 - org.opendaylight.netconf.topology-singleton - 1.2.2.SNAPSHOT | Master was selected: IpAddress [_ipv4Address=Ipv4Address [_value=10.29.15.99]]
2017-09-21 15:22:57,627 | DEBUG | lt-dispatcher-50 | NetconfNodeManager | 282 - org.opendaylight.netconf.topology-singleton - 1.2.2.SNAPSHOT | RemoteDevice{netconf-test-device}: Sending message to unregister slave mountpoint on Actorakka://opendaylight-cluster-data/user/$v#-411647032
2017-09-21 15:22:57,627 | DEBUG | lt-dispatcher-50 | NetconfNodeManager | 282 - org.opendaylight.netconf.topology-singleton - 1.2.2.SNAPSHOT | RemoteDevice{netconf-test-device}: Sending poison pill to Actorakka://opendaylight-cluster-data/user/$v#-411647032

[6] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-all-carbon/3/
[7] https://logs.opendaylight.org/sandbox/jenkins091/netconf-csit-3node-clustering-all-carbon/3/odl1_karaf.log.gz
[8] https://logs.opendaylight.org/sandbox/jenkins091/netconf-csit-3node-clustering-all-carbon/3/odl2_karaf.log.gz

Comment by Vratko Polak [ 25/Sep/17 ]

> Run #3: [6].

For this week, [9] seems to be the same, although I have not looked into karaf.log yet.

[9] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-only-carbon/1/

Comment by Vratko Polak [ 26/Sep/17 ]

> https://git.opendaylight.org/gerrit/#/c/62765/

The suite is merged. I am still investigating whether the failure is in Singleton implementation.

Comment by Vratko Polak [ 18/Oct/17 ]

Finally I think I understand what is going on well enough to write a comment.

First, Netconf clustered topology is based on the Singleton service, which is based on the Entity ownership service, which uses a specific shard to store its data. When the previous leader of that shard is isolated, it enters the IsolatedLeader state and cannot make progress. Only after rejoining does the affected member learn that it has lost its ownership and that another member is already the owner. The isolation suite frequently passes because the Entity ownership service takes some time to notify the member; until that happens it still considers itself the master, keeps the SSH connection to the device open, and everything looks good from the Robot point of view.

I am less sure about the details of what happens when this bug is present:
There is a listener on config topology changes. If a device is configured, the listener (on each member) registers a singleton application and starts another (slave) listener, this time on the operational topology. The Singleton service instantiates the application on one member (called the master). The master stops the local slave listener and creates a netconf connector, updating the status in the operational topology. When the connector connects, the master creates a local mdsal "master" mount point others can use to access the device. Slaves get notified (by the "connected" status in the operational topology) and create local mdsal "slave" mount points (to forward requests to the master).
So far so good, but the catch is that the writes to the operational topology are only done by a component which handles the connection, and which assumes it is the only connection to the specified device. But in the isolation scenario we have both the old master and the new master.
If the singleton service decides the current owner should no longer be the owner, the netconf code does the following (perhaps in a different order): removes the master mount point, tears down the netconf connection, updates the operational topology and creates the slave listener again. Other members get notified (by seeing the device no longer has the "connected" status in the operational topology) and they remove their slave mount points. Only after the previous owner has done all of this should the singleton service instantiate the application on the new owner.

That would work well if it were true that there is always at most one owner across the cluster. But in the isolation scenario, the new owner updates the operational topology first, and then the old owner deletes that data without the new owner realizing. Other slaves interpret that as the device disconnecting and remove their mount points as well, resulting in this bug.

I am not sure whether the Entity ownership service could be modified to guarantee at most one owner across the cluster even in isolation scenarios.
If not, the Netconf code has to accommodate late writes from the old owner.

P.S.: This week Sandbox job with perhaps too much traces: [10].

[10] https://jenkins.opendaylight.org/sandbox/job/netconf-csit-3node-clustering-only-oxygen/

Comment by Vratko Polak [ 18/Oct/17 ]

> This week Sandbox job

Selected lines from karaf log [11]:

2017-10-18 12:12:49,195 | INFO | nsole user karaf | core | 167 - org.apache.karaf.log.core - 4.0.10 | ROBOT MESSAGE: Starting test Find_And_Isolate_Device_Entity_Owner

2017-10-18 12:12:51,052 | INFO | nsole user karaf | core | 167 - org.apache.karaf.log.core - 4.0.10 | ROBOT MESSAGE: Starting test Wait_For_New_Owner_To_Appear

2017-10-18 12:13:01,731 | INFO | nsole user karaf | core | 167 - org.apache.karaf.log.core - 4.0.10 | ROBOT MESSAGE: Starting test Rejoin_Original_Entity_Owner

2017-10-18 12:13:02,385 | DEBUG | lt-dispatcher-21 | sterSingletonServiceProviderImpl | 298 - org.opendaylight.mdsal.singleton-dom-impl - 2.4.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=topology-netconf]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=netconf-test-device]]]]}}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]

[11] https://logs.opendaylight.org/sandbox/jenkins091/netconf-csit-3node-clustering-only-oxygen/48/odl1_karaf.log.gz

Comment by Vratko Polak [ 18/Oct/17 ]

> I am not sure whether Entity ownership service could be modified to
> guarantee at most one owner across cluster even in isolation scenarios.

If I am reading this [12] comment correctly, two owners during cluster partition is an expected outcome.

[12] https://github.com/opendaylight/controller/blob/release/nitrogen/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/entityownership/EntityOwnershipShard.java#L462-L467

Comment by Tomas Cere [ 19/Oct/17 ]

I am pretty sure that the singleton service should be trying to shut down the service on the isolated node. However, if we are doing any datastore operations in the close of the singleton-based Netconf service, we won't shut down until the datastore is available, so if I'm right we will have two owners on rejoin. No idea at the moment what the appropriate solution would be; I would have to think about this.

Comment by Vratko Polak [ 26/Oct/17 ]

Workaround attempt (master branch): [13].
Not fully tested yet, but ready for code review.

[13] https://git.opendaylight.org/gerrit/#/c/64440/8

Comment by Vratko Polak [ 26/Oct/17 ]

> [13]

Testing shows two issues. One is specific to the current patch not cleaning up operational state [14] (3 occurrences); the other looks like a controller bug, CONTROLLER-1784 (1 occurrence).

[14] https://git.opendaylight.org/gerrit/#/c/64706/2/netconf/sal-netconf-connector/src/main/java/org/opendaylight/netconf/sal/connect/netconf/sal/NetconfDeviceTopologyAdapter.java@144

Comment by Ivan Hrasko [ 31/Mar/21 ]

The problem is partially caused by mdsal-clustering behavior and partially by netconf-clustering behavior.

The problem in mdsal-clustering is that after the isolation the previous leader becomes leader again, and in general there are more leader transitions than necessary.

I believe the sequence should be as follows:
we have a leader, the leader gets isolated, in the meantime a new leader is elected, and after rejoin the previous leader accepts its follower role under the new leader.

But after applying this patch https://git.opendaylight.org/gerrit/c/netconf/+/95511 we get logs like this:

21
 2021-03-24T15:52:32,600 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
21
 2021-03-24T15:52:36,819 | INFO | remote-connector-processing-executor-10 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
ISOLATION 15:52:43
23
 2021-03-24T15:52:52,229 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
23
 2021-03-24T15:52:53,496 | INFO | remote-connector-processing-executor-12 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
NEW OWNER 15:52:53.660
21
 2021-03-24T15:52:55,988 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
21
 2021-03-24T15:52:55,991 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} disconnected - unregistering master mount point
REJOIN 15:53:57
21
 2021-03-24T15:54:00,702 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
21
 2021-03-24T15:54:00,909 | INFO | remote-connector-processing-executor-1 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
21
 2021-03-24T15:54:08,477 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-23 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
RESYNC 15:54:08.844
22
 2021-03-24T15:54:09,253 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
23
 2021-03-24T15:54:09,261 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-37 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
23
 2021-03-24T15:54:09,268 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-37 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} disconnected - unregistering master mount point
22
 2021-03-24T15:54:09,478 | INFO | remote-connector-processing-executor-1 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} connected - registering master mount point
22
 2021-03-24T15:54:11,938 | INFO | opendaylight-cluster-data-notification-dispatcher-49 | NetconfTopologyContext | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Master was removed: IpAddress{_ipv4Address=Ipv4Address{_value=192.168.56.25}}
22
 2021-03-24T15:54:11,961 | INFO | opendaylight-cluster-data-notification-dispatcher-49 | MasterSalFacade | 289 - org.opendaylight.netconf.topology-singleton - 1.13.2.SNAPSHOT | Device RemoteDevice{netconf-test-device} disconnected - unregistering master mount point

 

The problem in netconf clustering is that the NetconfTopologyContext class uses the RemoteDeviceConnector class to control the leader-follower transition. Its methods work well for connecting and disconnecting the device, but they are misused for leadership changes.

When a node becomes the leader, RemoteDeviceConnector#startRemoteDeviceConnection is called. This method starts the "master" mount point and writes the status "connected" for the device to the operational datastore.
The followers listen for changes in the operational datastore, and when they see "connected" they create their "slave" mount points.

When a node becomes a follower, RemoteDeviceConnector#stopRemoteDeviceConnection is called. This method stops the "master" mount point, writes "connecting" for the device to the operational datastore, and finally deletes the node data from the operational datastore. And that's the problem: after the previous leader is isolated and a new leader is elected, the new leader writes the status "connected" and the followers establish their "slave" mount points. But after rejoin we can get a disconnect when "connecting" is received, after which all node data are deleted from the operational datastore.

I have made a test run without removing the config data for the device in the teardown, but I ended up in a state where all "slave" mount points were down and the operational data for the node were gone. The controller was not able to recover the connection. The only solution was to restart the test device!

I suggest we need becomeLeader() and becomeFollower() methods which will do the following:

becomeLeader() will write "connecting" to the operational datastore to cause a reconnect on the followers (they need to drop the "slave" mount points redirected to the previous leader), then it will establish the "master" mount point, and if that succeeds it can write "connected" to the operational datastore, thus notifying the followers to create their "slave" mount points.

becomeFollower() will just stop the "master" mount point and write nothing to the operational datastore. I think that the current implementation, by writing to the operational datastore after the node is no longer the leader, breaks the netconf clustering idea that only the leader can write the device status.

That looks simple, but the implementation of RemoteDeviceConnector using the NetconfConnectorDTO class makes this a hard problem. We will need to rework some of the classes and unit tests.

Comment by Ivan Hrasko [ 07/Apr/21 ]

After fixing the previous problems we occasionally hit the following problem:

Node 22 was the leader.
We have isolated node 22 from nodes 21 and 23 using iptables.
During isolation nodes 21 and 23 were able to provide mountpoints.

We have broken the isolation and noticed the following logs:

node 21:
no logs at all (it means that this node did not even get operational datastore notifications)

node 22:

2021-04-07T14:35:42.142869+02:00[Europe/Bratislava]: I am master
2021-04-07T14:35:42.289366+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.22:2550
2021-04-07T14:35:47.022939+02:00[Europe/Bratislava]: I am follower
2021-04-07T14:35:47.039052+02:00[Europe/Bratislava]: modification...
2021-04-07T14:35:47.049861+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.22:2550

node 23:

 2021-04-07T14:35:46.612573+02:00[Europe/Bratislava]: modification...
 2021-04-07T14:35:46.626221+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
 2021-04-07T14:35:47.040825+02:00[Europe/Bratislava]: modification...
 2021-04-07T14:35:47.048678+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.22:2550

We see that node 21 is somehow out of the game and nodes 22 and 23 were trying to form a leader-follower relationship.
But node 22 failed to hold its leadership and became a follower, creating a non-working "slave" mount point to itself.
We can only speculate that node 21 was probably supposed to become the new leader but failed to receive the notification about that.

Note:
the logs you can see are not from the application code but from an IDE debugger:

  • I am master - means that the node's instantiateServiceInstance method was called
  • I am follower - means that the node's closeServiceInstance method was called
  • Updating data: ... - means that the node is writing data to the operational DS (by writing true the leader informs its followers to create "slave" mount points)
  • modification - means that the node received a notification about a change in the operational DS
  • Disconnecting slave mount point - means the node is disconnecting its "slave" mount point
  • Connecting slave mount point to: ... - means that the node is connecting its "slave" mount point to the leader whose address is shown in the debugger message

The instantiateServiceInstance and closeServiceInstance methods are implemented in:
org.opendaylight.netconf.topology.singleton.impl.NetconfTopologyContext implements ClusterSingletonService, AutoCloseable

.. and here is the output of /rests/data/entity-owners:entity-owners:

{
  "entity-owners:entity-owners": {
    "entity-type": [
      {
        "type": "org.opendaylight.mdsal.ServiceEntityType",
        "entity": [
          {
            "id": "/odl-general-entity:entity[name='KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey{_topologyId=Uri{_value=topology-netconf}}], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey{_nodeId=Uri{_value=netconf-test-device}}]]}']",
            "candidate": [
              {
                "name": "member-1"
              },
              {
                "name": "member-3"
              },
              {
                "name": "member-2"
              }
            ],
            "owner": "member-1"
          }
        ]
      },
      {
        "type": "org.opendaylight.mdsal.AsyncServiceCloseEntityType",
        "entity": [
          {
            "id": "/odl-general-entity:entity[name='KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey{_topologyId=Uri{_value=topology-netconf}}], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey{_nodeId=Uri{_value=netconf-test-device}}]]}']",
            "candidate": [
              {
                "name": "member-1"
              }
            ],
            "owner": "member-1"
          }
        ]
      }
    ]
  }
}

 

Comment by Ivan Hrasko [ 13/Apr/21 ]

We have reproduced this problem again.

The problem is as follows:
you can see that at time 2021-04-12T14:16:33.861299+02:00
node 23, as the leader, is writing data to the operational datastore.

But the isolated leader, node 21, after isolation got a notification that
it is leader again at 2021-04-12T14:17:36.064744+02:00
and is writing to the operational DS at 2021-04-12T14:17:36.292559+02:00.
It holds the leadership only for 4 seconds and then becomes a follower.
That is not correct, because the real leader is still node 23.

The thing that makes the Netconf clustering crash is that node 21
got a notification that it is leader and then that it is follower, while node 23
has no information about this "chaos". Node 23 never got a notification
that it lost its leadership (and probably that never even happened), thus
node 23 has no chance to write the correct data to the operational
datastore again.

This results in an incorrect leader address stored in the operational DS, as you can
see at time 2021-04-12T14:17:40.151471+02:00 when node 21 connects
a "slave" mount point to itself.

21

2021-04-12T14:16:16.658098+02:00[Europe/Bratislava]: Registering node manager: null
2021-04-12T14:16:16.672608+02:00[Europe/Bratislava]: registering..
2021-04-12T14:16:16.692363+02:00[Europe/Bratislava]: I am leader
2021-04-12T14:16:16.699679+02:00[Europe/Bratislava]: Unregistering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@58d4a02
2021-04-12T14:16:18.089251+02:00[Europe/Bratislava]: connecting device.. true
2021-04-12T14:16:18.106551+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:16:34.811554+02:00[Europe/Bratislava]: I am follower
2021-04-12T14:16:34.826534+02:00[Europe/Bratislava]: Disconnecting device.. false
2021-04-12T14:16:34.841825+02:00[Europe/Bratislava]: Registering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@58d4a02
2021-04-12T14:16:34.856144+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:34.863838+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:17:36.064744+02:00[Europe/Bratislava]: I am leader
2021-04-12T14:17:36.076016+02:00[Europe/Bratislava]: Unregistering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@9603f02
2021-04-12T14:17:36.263286+02:00[Europe/Bratislava]: connecting device.. true
2021-04-12T14:17:36.292559+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:17:40.090901+02:00[Europe/Bratislava]: I am follower
2021-04-12T14:17:40.118272+02:00[Europe/Bratislava]: Disconnecting device.. false
2021-04-12T14:17:40.130646+02:00[Europe/Bratislava]: Registering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@9603f02
2021-04-12T14:17:40.144148+02:00[Europe/Bratislava]: modification...
2021-04-12T14:17:40.151471+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
Disconnected from the target VM, address: '192.168.56.21:5005', transport: 'socket'

22

2021-04-12T14:16:17.677425+02:00[Europe/Bratislava]: Registering node manager: null
2021-04-12T14:16:17.700051+02:00[Europe/Bratislava]: registering..
2021-04-12T14:16:17.703418+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:17.725131+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:16:18.140807+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:18.147951+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:16:20.613060+02:00[Europe/Bratislava]: master: Actor[akka.tcp://opendaylight-cluster-data@192.168.56.21:2550/user/akka.tcp:opendaylight-cluster-data@192.168.56.21:2550_netconf-test-device#770934166]
2021-04-12T14:16:33.702581+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:33.712800+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:16:33.885460+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:33.891728+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.23:2550
2021-04-12T14:17:31.334643+02:00[Europe/Bratislava]: master: Actor[akka.tcp://opendaylight-cluster-data@192.168.56.23:2550/user/akka.tcp:opendaylight-cluster-data@192.168.56.23:2550_netconf-test-device#-189182088]
2021-04-12T14:17:40.008555+02:00[Europe/Bratislava]: modification...
2021-04-12T14:17:40.018033+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:17:40.121272+02:00[Europe/Bratislava]: modification...
2021-04-12T14:17:40.127474+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
Disconnected from the target VM, address: '192.168.56.22:5005', transport: 'socket'

23

2021-04-12T14:16:17.676719+02:00[Europe/Bratislava]: Registering node manager: null
2021-04-12T14:16:17.693843+02:00[Europe/Bratislava]: registering..
2021-04-12T14:16:17.696230+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:17.714960+02:00[Europe/Bratislava]: Disconnecting slave mountpoint
2021-04-12T14:16:18.142020+02:00[Europe/Bratislava]: modification...
2021-04-12T14:16:18.149924+02:00[Europe/Bratislava]: Connecting slave mount point to: akka.tcp://opendaylight-cluster-data@192.168.56.21:2550
2021-04-12T14:16:33.668206+02:00[Europe/Bratislava]: I am leader
2021-04-12T14:16:33.675129+02:00[Europe/Bratislava]: Unregistering node manager: org.opendaylight.netconf.topology.singleton.impl.NetconfNodeManager@3254b5b4
2021-04-12T14:16:33.847092+02:00[Europe/Bratislava]: connecting device.. true
2021-04-12T14:16:33.861299+02:00[Europe/Bratislava]: Updating data: true from master: akka.tcp://opendaylight-cluster-data@192.168.56.23:2550
Disconnected from the target VM, address: '192.168.56.23:5005', transport: 'socket'

We have created a workaround for this in https://git.opendaylight.org/gerrit/c/netconf/+/95755,
where we added a 15 s delay for any node that is possibly an "isolated leader after rejoin", preventing
it from writing data to the operational datastore if it fails to hold its leadership for at least 15 s.

We conclude that the root of this problem is not in Netconf clustering but in mdsal clustering.

Comment by Ivan Hrasko [ 03/Aug/21 ]

We have to stop the work on this task because ClusterSingletonService is currently not working, for the following reasons:

The base logic of the Netconf clustering is located in:

 class NetconfTopologyContext implements ClusterSingletonService, AutoCloseable

The principle is based on:

ClusterSingletonService#instantiateServiceInstance

and

ClusterSingletonService#closeServiceInstance

methods.

The NetconfTopologyContext class expects the #instantiateServiceInstance method to be invoked when
the node on which it is running becomes the leader.
When the leader is isolated and a new leader is elected, it is expected that #instantiateServiceInstance is invoked on the new leader's node.
Invoking #instantiateServiceInstance starts the process of establishing a "master" mount point to the device.

The same is expected of #closeServiceInstance.
When a node is no longer the leader, this method should be invoked.
Invoking #closeServiceInstance starts the process of establishing "slave" mount points (through the "master" mount point) to the device.

Since the clustering changes made for https://jira.opendaylight.org/browse/CONTROLLER-1982,
the #instantiateServiceInstance method is invoked on the current leader node only when the topology node is created. When the leader changes, the method is not invoked on the new leader.
Thus the new leader does not establish a "master" mount point connection to the device, and the followers still hold (no longer working) "slave" mount points to the previous leader.

Similarly, #closeServiceInstance is called only when the topology node is deleted, on the original leader.

Note:

/rests/data/entity-owners:entity-owners?content=nonconfig

is no longer available, thus to find out which node is the leader we have switched to using:

/jolokia/read/org.opendaylight.controller:Category=Shards,name=\{member}-shard-topology-operational,type=DistributedOperationalDatastore
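For example (a hedged sketch, assuming default admin:admin credentials; substitute the member name and host for each node):

MEMBER=member-1
HOST=192.168.56.21
curl -s -u admin:admin \
"http://${HOST}:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=${MEMBER}-shard-topology-operational,type=DistributedOperationalDatastore"
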
Comment by Robert Varga [ 10/Aug/21 ]

https://jira.opendaylight.org/browse/CONTROLLER-1992 tracks missing entity ownership state knobs. Note that the jolokia endpoint does not really have much correlation to EOS decisions.

Comment by Robert Varga [ 10/Aug/21 ]

ivanhrasko, can you recreate the lack of callback invocation in https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/eos-dom-akka/src/test/java/org/opendaylight/controller/eos/akka/ThreeNodeBaseTest.java?

Comment by Robert Varga [ 06/Nov/22 ]

The patches are marked as work-in-progress. Please update them and this issue once they are ready.

Generated at Wed Feb 07 20:15:04 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.