[CONTROLLER-1901] cluster node quarantined, but the node did not auto restart when restore the network connection Created: 13/Jun/19  Updated: 02/Jun/21  Due: 11/Jun/19  Resolved: 02/Jun/21

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Oxygen SR4
Fix Version/s: Sodium SR4

Type: Bug Priority: Medium
Reporter: Bo Song Assignee: Bo Song
Resolution: Done Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

3 cluster nodes
member-01:172.20.14.162
member-02:172.20.14.163
member-03:172.20.14.164

odl-version:Oxygen-sr4(0.8.4)


Attachments: File log.rar    
Issue Links:
Relates
relates to CONTROLLER-1941 Controller does not quarantine on iso... Resolved
Priority: Normal

 Description   

In a three-node cluster, one member is manually isolated from the network, and the network is then restored. The isolated member stays quarantined and does not restart automatically.

Environment:

3 cluster nodes

member-01:172.20.14.162
member-02:172.20.14.163
member-03:172.20.14.164

odl-version: Oxygen-sr4(0.8.4)

Steps to reproduce:

  1. Configure the cluster.
  2. Start the cluster nodes.
  3. Install the feature odl-mdsal-all.
  4. Add reject routes on node01:
    route add -host 172.20.14.163 reject
    route add -host 172.20.14.164 reject
  5. A few minutes later, delete the reject routes:
    route del -host 172.20.14.163 reject
    route del -host 172.20.14.164 reject
  6. The log keeps printing "is still unreachable or has not been restarted. Keeping it quarantined."
  7. The node does not restart.

The cluster configuration uses the default settings. For example, on node01:


odl-cluster-data {
  akka {
    remote {
      artery {
        enabled = off
        canonical.hostname = "172.20.14.162"
        canonical.port = 2550
      }

      netty.tcp {
        hostname = "172.20.14.162"
        port = 2550
      }

      # when under load we might trip a false positive on the failure detector
      # transport-failure-detector {
        # heartbeat-interval = 4 s
        # acceptable-heartbeat-pause = 16s
      # }
    }

    cluster {
      # Remove ".tcp" when using artery.
      seed-nodes = ["akka.tcp://opendaylight-cluster-data@172.20.14.162:2550", "akka.tcp://opendaylight-cluster-data@172.20.14.163:2550", "akka.tcp://opendaylight-cluster-data@172.20.14.164:2550"]
      roles = ["member-1"]
    }

    persistence {
      # By default the snapshots/journal directories live in KARAF_HOME. You can choose to put it somewhere else by
      # modifying the following two properties. The directory location specified may be a relative or absolute path.
      # The relative path is always relative to KARAF_HOME.
      # snapshot-store.local.dir = "target/snapshots"
      # journal.leveldb.dir = "target/journal"
      journal {
        leveldb {
          # Set native = off to use a Java-only implementation of leveldb.
          # Note that the Java-only version is not currently considered by Akka to be production quality.
          # native = off
        }
      }
    }
  }
}

 

 



 Comments   
Comment by Bo Song [ 13/Jun/19 ]

When testing on n-sr2, a quarantined node restarts automatically as expected.
Reading the n-sr2 log, I noticed that it prints "Got quarantined by akka.tcp://opendaylight-cluster-data@xx.xx.xx.xx:2550" before the automatic restart.
So I checked the o-sr4 code and added logging, and found that the node does not receive any message, so the restart method is never called.

The method in question is "onReceive" in QuarantinedMonitorActor.

I am now trying to solve this problem and would appreciate any help. Thanks.

Comment by Bo Song [ 25/Jun/19 ]

I have tested Fluorine-SR2, and it still has this problem.
Here is what I found:
odl-controller subscribes to "ThisActorSystemQuarantinedEvent" and calls the karaf.restart method when it arrives.
Oxygen-sr4 depends on a newer Akka version, and Akka does not publish "ThisActorSystemQuarantinedEvent" in this scenario. So I changed the subscription from "ThisActorSystemQuarantinedEvent" to RemotingLifecycleEvent (its superclass) and found that Akka publishes "AssociationErrorEvent" when the network is restored, with the following log detail:

onReceive AssociationErrorEvent AssociationError [akka.tcp://opendaylight-cluster-data@172.20.14.162:2550] -> [akka.tcp://opendaylight-cluster-data@172.20.14.163:2550]: Error [Invalid address: akka.tcp://opendaylight-cluster-data@172.20.14.163:2550] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://opendaylight-cluster-data@172.20.14.163:2550
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has a UID that has been quarantined. Association aborted.
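
The log above shows the quarantine text on the nested "Caused by" exception rather than on the outer one, so a check against only the top-level message would miss it. A minimal sketch of walking the cause chain, assuming the Throwable is taken from the cause carried by the AssociationErrorEvent; the class and method names are hypothetical, and matching on message text is brittle:

```java
public final class QuarantineCauseCheck {
    // Text copied from the log above. Matching on it is an assumption of this sketch
    // and would break if Akka changed the wording.
    public static final String QUARANTINE_MARKER =
            "The remote system has a UID that has been quarantined";

    private QuarantineCauseCheck() {
    }

    /** Returns true if the error or any of its causes carries the quarantine marker. */
    public static boolean indicatesQuarantine(Throwable error) {
        // Bound the walk so that a cyclic cause chain cannot loop forever.
        int depth = 0;
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            String message = t.getMessage();
            if (message != null && message.contains(QUARANTINE_MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Plain JDK exceptions stand in for the Akka exception types seen in the log.
        Throwable inner = new IllegalStateException(QUARANTINE_MARKER + ". Association aborted.");
        Throwable outer = new RuntimeException(
                "Invalid address: akka.tcp://opendaylight-cluster-data@172.20.14.163:2550", inner);
        System.out.println(indicatesQuarantine(outer));
        System.out.println(indicatesQuarantine(new RuntimeException("connection refused")));
    }
}
```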

I read the Akka code that publishes these events. The publishing logic has changed, but I have not discovered much beyond that.
I have already made a fix in my own ODL project based on "AssociationErrorEvent", and I will write up the details of the change soon. I do not intend to submit it to the community, because I do not think this solution is good. It is just for your reference.

Comment by Bo Song [ 26/Jun/19 ]

Here is my solution (based on stable/oxygen):
When an "AssociationErrorEvent" is received and its message contains "The remote system has a UID that has been quarantined", the node prepares to call the restart method.
Before restarting, it counts the number of remote addresses so that only the single isolated node is restarted. The previous design may restart the two other nodes instead, whereas my approach keeps the service uninterrupted.
This solution works only for three-node clusters; clusters with more nodes may have problems. I have tested it several times in a three-node environment and it is stable.

I have pushed my code to my GitHub:
change-diff

It is just for your reference. I look forward to better solutions.
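
The counting step described in this comment could look roughly like the sketch below. It is not the code from change-diff. The class name, method names, and the threshold are assumptions: the comment only says the remote addresses are counted, and this sketch assumes a node treats itself as the isolated one once every other member has rejected it.

```java
import java.util.HashSet;
import java.util.Set;

public final class IsolatedNodeDetector {
    private final int clusterSize;
    // Distinct peers that refused association because they quarantined this node.
    private final Set<String> rejectingPeers = new HashSet<>();

    public IsolatedNodeDetector(int clusterSize) {
        if (clusterSize < 2) {
            throw new IllegalArgumentException("clusterSize must be at least 2");
        }
        this.clusterSize = clusterSize;
    }

    /** Records a rejecting peer and reports whether this node should now restart. */
    public boolean recordRejection(String remoteAddress) {
        rejectingPeers.add(remoteAddress); // repeated errors from one peer count once
        return shouldRestart();
    }

    /**
     * Assumed rule: restart only when all other members reject this node.
     * In a three-node cluster the isolated node sees two rejecting peers,
     * while each of the other two nodes sees at most one.
     */
    public boolean shouldRestart() {
        return rejectingPeers.size() >= clusterSize - 1;
    }

    public static void main(String[] args) {
        IsolatedNodeDetector node01 = new IsolatedNodeDetector(3);
        System.out.println(node01.recordRejection("akka.tcp://opendaylight-cluster-data@172.20.14.163:2550"));
        System.out.println(node01.recordRejection("akka.tcp://opendaylight-cluster-data@172.20.14.164:2550"));
    }
}
```

With this rule, member-02 and member-03 each see at most one rejecting address (member-01) and keep running, which matches the stated goal of restarting only the isolated node. Whether the rule is safe for clusters larger than three nodes is left open, as the comment notes.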

Comment by Bo Song [ 02/Jun/21 ]

CONTROLLER-1904 has already been resolved.

Comment by Bo Song [ 02/Jun/21 ]

https://jira.opendaylight.org/browse/CONTROLLER-1941 

Generated at Wed Feb 07 19:56:44 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.