[OPNFLWPLUG-1111] controller (openflowplugin) failed to initialize because of dependency issue Created: 05/May/19  Updated: 19/Jul/21

Status: Open
Project: OpenFlowPlugin
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: High
Reporter: Yi Yang Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Opendaylight Fluorine SR1 + Openstack Rocky


Attachments: File karaf.log.error-for-1892    
Epic Link: Clustering Stability
Priority: Highest

 Description   

curl -u admin:admin http://192.168.0.5:8181/diagstatus shows that openflowplugin did not come up successfully; here is the relevant error information.

"statusSummary": [

    {
        "serviceName": "OPENFLOW",
        "effectiveStatus": "ERROR",
        "reportedStatusDescription": "OF::PORTS:: 6653 and 6633 are not up yet",
        "statusTimestamp": "2019-03-04T01:54:44.411Z",
        "errorCause": ""
    },
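
For reference, below is a minimal probe (not part of the original report) that repeats this check programmatically. It assumes the same host, credentials and port as the curl command above, uses only the JDK 11 HttpClient, and relies on a naive substring match against the JSON shown above, so it is a sketch rather than a robust diagstatus client.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Base64;

// Sketch only: poll /diagstatus until OPENFLOW stops reporting ERROR.
// Host, credentials, poll interval and attempt count are assumptions.
public class DiagStatusProbe {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://192.168.0.5:8181/diagstatus"))
                .header("Authorization", "Basic " + auth)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpClient client = HttpClient.newHttpClient();

        for (int attempt = 0; attempt < 60; attempt++) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            // Naive check tied to the formatting of the JSON excerpt above.
            boolean openflowInError = body.contains("\"OPENFLOW\"") && body.contains("\"ERROR\"");
            System.out.println("attempt " + attempt + ": OPENFLOW in error = " + openflowInError);
            if (!openflowInError) {
                return;
            }
            Thread.sleep(5_000);
        }
        System.err.println("OPENFLOW still in ERROR; ports 6653/6633 never came up");
    }
}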



 Comments   
Comment by Yi Yang [ 05/May/19 ]

Please refer to https://jira.opendaylight.org/browse/OPNFLWPLUG-1065 for more info.

Comment by Robert Varga [ 06/May/19 ]

I am not sure what I should be looking for: the cluster seems not to have been bootstrapped correctly, i.e. the first node joined on self. How is the cluster being bootstrapped?

Comment by Yi Yang [ 07/May/19 ]

Robert, sorry, I re-assigned it to you; it is indeed a controller clustering issue, so let me recap it here.

This is a cluster setup that is fine at the beginning. The l2.robot suite in https://git.opendaylight.org/gerrit/p/integration/test.git can reproduce this issue very easily because it tortures the cluster (stop member 1, write to the data store, start member 1, stop member 2, write to the data store, start member 2, and the issue appears). I think it is a data sync issue, so yes, member 2 can't be restarted because of this.
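
For illustration only, the restart-and-write sequence described above could be scripted roughly as follows; the member host names, the /opt/opendaylight install path and the sample RESTCONF write are assumptions and are not taken from l2.robot itself.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Hypothetical reproduction sketch of the sequence in the comment above.
public class ClusterTortureSketch {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        karaf("member-1", "stop");      // stop member 1
        writeSampleConfig("member-2");  // write to the data store via a live member
        karaf("member-1", "start");     // start member 1

        karaf("member-2", "stop");      // stop member 2
        writeSampleConfig("member-3");  // write to the data store again
        karaf("member-2", "start");     // start member 2 -> issue appears here
    }

    // Hypothetical helper: run the karaf start/stop script on a member over ssh.
    private static void karaf(String host, String action) throws IOException, InterruptedException {
        new ProcessBuilder("ssh", host, "/opt/opendaylight/bin/" + action)
                .inheritIO().start().waitFor();
    }

    // Hypothetical helper: a trivial RESTCONF config write against a live member.
    private static void writeSampleConfig(String host) throws IOException, InterruptedException {
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + ":8181/restconf/config/network-topology:network-topology"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"network-topology\":{\"topology\":[{\"topology-id\":\"test-topo\"}]}}"))
                .build();
        HttpResponse<String> response = HTTP.send(put, HttpResponse.BodyHandlers.ofString());
        System.out.println(host + " write status: " + response.statusCode());
    }
}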

Can you explain why it isn't a clustering issue? Or you can reassign it to Abhijit and let the TSC chairman assign it to somebody.

Sam, Jamo and Tom have explained this issue to me. If you think it is application-related, please draft a detailed guide on how an application should use the MD-SAL APIs so that it strictly follows the clustering requirements; otherwise I don't think a single application could cause an issue like this. Can we have a tool/script that checks whether all MD-SAL applications in ODL strictly follow the clustering requirements? I believe this is a critical issue, so please don't ignore it; nobody understands clustering better than you do. I know from Tom's explanation that applications need to use certain strict APIs to make sure clustering works correctly, but please do let us know which APIs applications can use and which APIs they must not use in a cluster.
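
For context, one of the APIs being asked about here is the MD-SAL cluster singleton service: an application registers a ClusterSingletonService so that its work runs only on the current owner node and is stopped when ownership moves. The sketch below is illustrative only; the service group name is made up, and exact package names and signatures vary between OpenDaylight releases, so treat it as an outline rather than the authoritative guidance requested above.

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceRegistration;
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;

// Illustrative only: run application logic on exactly one cluster member at a time.
public class ExampleSingleton implements ClusterSingletonService {
    // Hypothetical group name; services sharing a group share one owner node.
    private static final ServiceGroupIdentifier GROUP = ServiceGroupIdentifier.create("example-app");

    private final ClusterSingletonServiceProvider provider;
    private ClusterSingletonServiceRegistration registration;

    public ExampleSingleton(ClusterSingletonServiceProvider provider) {
        this.provider = provider;
    }

    public void start() {
        // Registration causes instantiateServiceInstance() to run on the owner node only.
        registration = provider.registerClusterSingletonService(this);
    }

    public void stop() throws Exception {
        if (registration != null) {
            registration.close();
        }
    }

    @Override
    public ServiceGroupIdentifier getIdentifier() {
        return GROUP;
    }

    @Override
    public void instantiateServiceInstance() {
        // This node became the owner: start data store writes, listeners, etc. here,
        // rather than unconditionally at bundle startup.
    }

    @Override
    public ListenableFuture<? extends Object> closeServiceInstance() {
        // Ownership moved away (or shutdown): stop the work started above.
        return Futures.immediateFuture(null);
    }
}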

Comment by Robert Varga [ 20/Aug/19 ]

yangyi01, I do not see an answer to the question raised here: https://jira.opendaylight.org/browse/OPNFLWPLUG-1065?focusedCommentId=66521&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-66521

I am sorry, but I just cannot analyze the entire application stack. That said, timing out while waiting for a service and giving up after some bounded time, without ever going back to retry, is certainly a problem.

 

Comment by Robert Varga [ 20/Aug/19 ]

At the end of the day, this boils down to reproducing the issue with all three nodes, with exact knowledge of which actions were performed at which precise moments in the environment. I know next to nothing about Robot Framework or netvirt, so a pointer to a robot suite is not enough; it would just send me on a wild goose chase across the entire stack to pin down who is doing what.
