[OPNFLWPLUG-1065] openflowplugin failed to initialize because of dependency issue Created: 04/Mar/19  Updated: 06/May/19  Resolved: 05/May/19

Status: Resolved
Project: OpenFlowPlugin
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Highest
Reporter: Yi Yang Assignee: Anil Vishnoi
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OpenDaylight Fluorine SR1 + OpenStack Rocky


Attachments: File karaf.log.error-for-1065    

 Description   

curl -u admin:admin http://192.168.0.5:8181/diagstatus shows that openflowplugin did not come up successfully; here is some of the error information:

"statusSummary": [
    {
        "serviceName": "OPENFLOW",
        "effectiveStatus": "ERROR",
        "reportedStatusDescription": "OF::PORTS:: 6653 and 6633 are not up yet",
        "statusTimestamp": "2019-03-04T01:54:44.411Z",
        "errorCause": ""
    },


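The statusSummary payload above can be checked mechanically. The sketch below (hypothetical helper name; the payload shape is taken from the sample above) extracts the services that are not yet operational:

```python
import json

def failed_services(diagstatus_payload):
    """Return (serviceName, description) pairs for services not OPERATIONAL."""
    return [
        (s.get("serviceName"), s.get("reportedStatusDescription", ""))
        for s in diagstatus_payload.get("statusSummary", [])
        if s.get("effectiveStatus") != "OPERATIONAL"
    ]

# Sample payload matching the fragment reported above.
sample = json.loads("""
{
  "statusSummary": [
    {
      "serviceName": "OPENFLOW",
      "effectiveStatus": "ERROR",
      "reportedStatusDescription": "OF::PORTS:: 6653 and 6633 are not up yet",
      "statusTimestamp": "2019-03-04T01:54:44.411Z",
      "errorCause": ""
    }
  ]
}
""")

print(failed_services(sample))
```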

 Comments   
Comment by Yi Yang [ 04/Mar/19 ]

The attachment karaf.log.error-for-1065 is the karaf log. I have encountered this many times, but restarting fixes the issue; I suspect it is an initialization-order issue among dependent features.

Comment by Anil Vishnoi [ 05/Mar/19 ]

Looks like your cluster setup is not coming up properly. Initially, when the 192.168.0.5 node comes up, it tries to reach the .6 and .7 machines and the connections are refused, so the node starts as a singleton cluster (pretty much useless):


2019-03-04T01:28:19,714 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 83 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.0.5:2550] - Started up successfully
2019-03-04T01:28:19,921 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | NettyTransport                   | 83 - com.typesafe.akka.slf4j - 2.5.11 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /192.168.0.7:2550
2019-03-04T01:28:19,927 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | NettyTransport                   | 83 - com.typesafe.akka.slf4j - 2.5.11 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /192.168.0.6:2550
2019-03-04T01:28:19,939 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | ReliableDeliverySupervisor       | 83 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@192.168.0.6:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@192.168.0.6:2550]] Caused by: [Connection refused: /192.168.0.6:2550]
2019-03-04T01:28:19,941 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | ReliableDeliverySupervisor       | 83 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@192.168.0.7:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@192.168.0.7:2550]] Caused by: [Connection refused: /192.168.0.7:2550]


After around five and a half minutes, node .6 becomes reachable:

2019-03-04T01:33:50,275 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 83 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.0.5:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@192.168.0.6:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1720025309]] to [akka.tcp://opendaylight-cluster-data@192.168.0.5:2550]
2019-03-04T01:33:50,276 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 83 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.0.5:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@192.168.0.5:2550] to [Actor[akka.tcp://opendaylight-cluster-data@192.168.0.6:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1720025309]]
2019-03-04T01:33:50,452 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-44 | Cluster(akka://opendaylight-cluster-data) | 83 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.0.5:2550] - Node [akka.tcp://opendaylight-cluster-data@192.168.0.6:2550] is JOINING, roles [member-2, dc-default]

and then initialization of mdsal, aaa, and eventually openflowplugin is triggered, but it seems some of the components are still not initialized properly:

2019-03-04T01:34:15,330 | INFO  | Blueprint Extender: 3 | OpenFlowPluginProviderFactoryImpl | 370 - org.opendaylight.openflowplugin.impl - 0.7.1 | Initializing new OFP southbound.

Looks like once diagstatus gives up after some timeout, it does not retry the check (or does not get a notification to retry when bundle initialization is delayed). But irrespective of that, the controller goes into a bad state because it did not converge initially, owing to the unavailability of the other nodes in the cluster. So it is not really an openflowplugin issue, but rather one of proper cluster initialization.
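The give-up-after-timeout behaviour described above can be worked around externally: a deployment script can keep polling /diagstatus itself until the service converges. A minimal sketch (hypothetical helper; authentication and HTTPS are omitted for brevity):

```python
import json
import time
import urllib.request

def wait_for_service(url, service, timeout=300, interval=5):
    """Poll a /diagstatus URL until `service` reports OPERATIONAL or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                payload = json.load(resp)
            for s in payload.get("statusSummary", []):
                if (s.get("serviceName") == service
                        and s.get("effectiveStatus") == "OPERATIONAL"):
                    return True
        except OSError:
            pass  # controller not answering yet; keep polling
        time.sleep(interval)
    return False
```

For example, `wait_for_service("http://192.168.0.5:8181/diagstatus", "OPENFLOW")` would block until OPENFLOW converges or five minutes elapse.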


Comment by Yi Yang [ 05/Mar/19 ]

Anil, thank you so much for your comments. Yes, it is a cluster-related issue; to whom do you think it should be reassigned?

Comment by Anil Vishnoi [ 05/Mar/19 ]

Hi Yi,


I would suggest first checking your environment to see why controllers 192.168.0.6/7 are not reachable from 192.168.0.5; that seems to be the root cause of the issue. I see that you are creating br-int on the same node, and I wonder whether your OVSDB configuration on 192.168.0.6/7 is interfering with the networking between your cluster VMs. In most cases I have seen, if CDS doesn't initialize properly, the only option you have is to restart your cluster.
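One quick way to rule out basic connectivity problems like the ones in the log is to probe the Akka remoting port (2550, as seen in the log lines above) on each member before starting the controllers. A minimal sketch, assuming plain TCP reachability is the question:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Member addresses and the default Akka remoting port from this report.
for member in ("192.168.0.5", "192.168.0.6", "192.168.0.7"):
    status = "reachable" if port_open(member, 2550, timeout=0.5) else "unreachable"
    print(member, status)
```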

Comment by Yi Yang [ 05/Mar/19 ]

Anil, this is an error that ha_l2.robot resulted in. I can make the cluster work normally by restarting every cluster node in order, but this isn't what we expect.

In reality, such a situation may happen often, so we think akka-based clustering isn't reliable. Why can't it fix this automatically?

Actually, ha_l2.robot is quite gentle; Robert said the MDSAL cluster has been through more rigorous tests than this and can still work normally. Do you know which robot script can stress the MDSAL cluster?

Comment by Anil Vishnoi [ 02/May/19 ]

yangyi01 I believe the out-of-order restart issue is a limitation of akka; to fix it, I would suggest you open a ticket with the controller project.

I don't really have any information about a script that can stress mdsal; the md-sal project or integration/test project folks can probably help you better here.

Comment by Yi Yang [ 02/May/19 ]

Thanks Anil, I'll open a new bug for controller and close this one. Do you know how I can move this to the controller project, or do I have to create a new one?

Comment by Jamo Luhrsen [ 02/May/19 ]

yangyi01 are you going to close this? What is the bug ID in the controller project?

Comment by Yi Yang [ 05/May/19 ]

https://jira.opendaylight.org/browse/CONTROLLER-1892 has been filed, so I am closing this one; this should be a controller-specific bug.

Comment by Yi Yang [ 05/May/19 ]

Jamo, I created https://jira.opendaylight.org/browse/CONTROLLER-1892 for it, so I am closing this one.

Comment by Jamo Luhrsen [ 06/May/19 ]

thanks!

Generated at Wed Feb 07 20:34:06 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.