[CONTROLLER-2047] ODL Clustering issues Created: 05/Aug/22  Updated: 25/Aug/23

Status: Open
Project: controller
Component/s: clustering
Affects Version/s: 4.0.10
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Rohini Ambika Assignee: Venkatrangan Govindarajan
Resolution: Unresolved Votes: 0
Labels: pick-next, pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Microsoft Word ODL_Cluster_logs_HA.docx     Text File PhosSR3_odl-0.log     Text File PhosSR3_odl-1.log     Text File PhosSR3_odl-2.log     File akka-odl-instance1.conf     File akka-odl-instance2.conf     File akka-odl-instance3.conf     File module-shards.conf    
Issue Links:
Relates
relates to CONTROLLER-2035 ODL Cluster - Akka Cluster Singleton ... Resolved
Priority: High

 Description   

Requirement: ODL clustering for high availability (HA) of data distribution

Failover/high availability is failing in the ODL cluster:

  • Expected: If any of the ODL instances goes down or restarts due to a network split or an internal error, the other instances in the cluster should remain available and functional. If the affected instance holds the master mount, the instance elected as the new master should be able to re-register the devices and resume operations. Once the affected instance comes back up, it should rejoin the cluster as a member node and register the slave mounts.
  • Observation: When the ODL instance holding the master mount restarts, an election is held among the remaining nodes in the cluster and a new leader is elected. The new leader then tries to re-register the master mount, but fails partway because the Akka Cluster Singleton actor has terminated. The cluster then sits idle and fails to assign an owner for the device's DOM entity. In this state, configuration of already-mounted devices fails, as do new mounts (see the diagnostic sketch below).
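
As a quick diagnostic for the state described above, cluster membership and shard leadership can be inspected over Jolokia. A minimal sketch, assuming the odl-jolokia feature is installed, default admin credentials, and member/shard names matching the attached module-shards.conf:

    # Akka cluster membership as seen by this node
    # (look for members stuck as "unreachable" after the restart)
    curl -u admin:admin \
      http://localhost:8181/jolokia/read/akka:type=Cluster

    # Raft state of the default operational shard on member-1; a healthy
    # cluster reports a leader, while a missing leader after the restart
    # would match the observation above
    curl -u admin:admin \
      "http://localhost:8181/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=Shards,name=member-1-shard-default-operational"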

Env Configuration:

  • 3-node ODL cluster on Kubernetes (1 master and 3 worker nodes), one ODL instance running on each worker node
  • CPU: 8 cores
  • RAM: 20 GB
  • Java heap size: min 512 MB, max 16 GB (see the setenv sketch after this list)
  • JDK version: 11
  • Kubernetes version: 1.19.1
  • Docker version: 20.10.7
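
For reference, the heap bounds above would typically be set through Karaf's bin/setenv before the instance starts; a minimal sketch using the values from the list above:

    # bin/setenv (sourced by the Karaf launcher)
    export JAVA_MIN_MEM=512M   # initial heap (-Xms)
    export JAVA_MAX_MEM=16G    # maximum heap (-Xmx)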

ODL features installed to enable clustering:

  • odl-netconf-clustered-topology
  • odl-restconf-all
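
A sketch of installing these from the Karaf console (feature names as listed above):

    opendaylight-user@root>feature:install odl-netconf-clustered-topology odl-restconf-all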

Devices configured: NETCONF devices, all having the same schema (tested with 250 devices); each was mounted roughly as in the sketch below.
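
A minimal sketch of one such mount over RESTCONF, assuming the RFC 8040 /rests endpoint and a hypothetical device name, address, and credentials:

    curl -u admin:admin -X PUT \
      -H "Content-Type: application/json" \
      http://localhost:8181/rests/data/network-topology:network-topology/topology=topology-netconf/node=device-1 \
      -d '{
        "network-topology:node": [{
          "node-id": "device-1",
          "netconf-node-topology:host": "192.0.2.10",
          "netconf-node-topology:port": 830,
          "netconf-node-topology:username": "admin",
          "netconf-node-topology:password": "admin",
          "netconf-node-topology:tcp-only": false
        }]
      }'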



 Comments   
Comment by Robert Varga [ 09/Aug/22 ]

So this is cloning CONTROLLER-2035, which was resolved in 4.0.11, but the provided logs are from 4.0.10.
If this is still an issue, I would expect logs from 4.0.12 at the very least (i.e. Phosphorus SR3), with an explicit note that that release is no longer community-supported.

What is the purpose of this issue?

Comment by Rohini Ambika [ 10/Aug/22 ]

We tested using 4.0.10 with the CONTROLLER-2035 patch applied, hence the logs represent version 4.0.10.

If needed, we can test with version 4.0.12 as mentioned and share our observations.

Comment by Rohini Ambika [ 24/Aug/22 ]

rovarga Tested with the Phosphorus SR3 release; logs are attached. We are still facing this issue. Please check.

Comment by Rohini Ambika [ 01/Sep/22 ]

Hi rovarga, did you get a chance to look into the shared logs?

Comment by Robert Varga [ 28/Nov/22 ]

Sorry, too busy elsewhere. Putting it on the backlog.

Comment by Vlad [ 25/Aug/23 ]

Hi,

Are there any updates on this issue?

We are facing the same issue with a Docker image of ODL 0.18.1.

We couldn't reach any of the actors in the cluster.
