[MDSAL-291] Member isolation test fails in Carbon and Nitrogen Created: 12/Sep/17  Updated: 09/Mar/18  Resolved: 18/Oct/17

Status: Resolved
Project: mdsal
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 9145

 Description   

The member isolation scenario works in Boron but not in Carbon or Nitrogen. A new test will be added to catch this.



 Comments   
Comment by Tomas Slusny [ 12/Sep/17 ]

Isn't this related to the new cluster failures possibly introduced by this change? https://git.opendaylight.org/gerrit/#/c/62708/

Comment by Luis Gomez [ 12/Sep/17 ]

Yes, the patch date matches the regression.

Comment by Luis Gomez [ 12/Sep/17 ]

Test added to track this: https://git.opendaylight.org/gerrit/#/c/63014

Comment by Luis Gomez [ 13/Sep/17 ]

OK, the current Nitrogen branch shows this bug:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-nitrogen/280/

However after applying this revert:

https://git.opendaylight.org/gerrit/#/c/63111/

The bug is gone:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-nitrogen/279

So I think it makes sense to mark this regression as BLOCKER and move the issue to mdsal.

Comment by Luis Gomez [ 13/Sep/17 ]

To reproduce:

Run the OF cluster member isolation test:

  • Connect the switch to 3 instances
  • Isolate (e.g. using iptables) the switch master instance from the cluster, wait for the cluster to elect a new master
  • Rejoin the master instance, wait for the cluster to be in sync
  • Isolate a switch candidate instance from the cluster, wait for the cluster to be in sync
  • Rejoin the candidate instance, wait for the cluster to be in sync
  • Disconnect the switch

After this test, if we reconnect the same switch, the entity owner does not show the correct number of connected instances and no instance is given mastership of the device, so everything fails for the device. To recover we have to restart the controller.

Comment by Vratko Polak [ 14/Sep/17 ]

From these two lines:
2017-09-13 19:06:17,923 | INFO | lt-dispatcher-22 | ContextChainHolderImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Removing device from operational DS Uri [_value=openflow:1] was successful
2017-09-13 19:06:23,103 | INFO | entLoopGroup-7-1 | ConnectionAdapterImpl | 287 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.5.0.SNAPSHOT | Hello received

only the first one appears (with a different timestamp) in the karaf.log [5] of the failed run.
I started running tests with debug logs for org.opendaylight.mdsal.singleton, but Sandbox had problems launching VMs for a while. So far only one relevant run [6] has finished, and its test case passed.

[0] https://git.opendaylight.org/gerrit/#/c/62708/

One behavioral change is that service un-registration is now asynchronous, which (from the user's point of view) looks as if subsequent re-registration is delayed, so perhaps the Robot check now hits the window where the original master is not yet a candidate? I will need more time to investigate the logs.

[5] https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-gate-clustering-only-nitrogen/280/odl1_karaf.log.gz
[6] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-nitrogen/4/

Comment by Luis Gomez [ 14/Sep/17 ]

If you look at the test trend:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-nitrogen/

It is not failing 100% of the time, but it definitely fails more often than not, and this started to happen just after the patch I reverted.

Also, I can easily reproduce the issue in my environment.

Comment by Luis Gomez [ 14/Sep/17 ]

Also, to be fair, the member isolation test never worked 100% reliably, so the real regression is that it now fails more often than before. From this perspective, if the debug/fix is complicated at this time, I think we can release Nitrogen and note in the release notes that member isolation is not supported and therefore requires a cluster restart.

Comment by Robert Varga [ 14/Sep/17 ]

The theory is that the rework of MDSAL-275 is flushing out synchronization issues in OFP. I will review the logs tomorrow, but in the meantime it would be good to check whether OFP is solid.

Previously the singleton service would block the calling thread until the unregistration had completely finished. This is no longer the case, as it does not make sense and was actually unstable in the face of GCs and other timing issues.

Hence it is possible that the absence of that blocking is flushing out a lack of synchronization downstream...
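
To make the contract concrete, here is a minimal sketch of a cluster singleton service as downstream code (e.g. OFP's context chain) would register it, assuming the mdsal singleton API of that era; the class and group names are hypothetical. The point is only that closing the registration no longer blocks, and the asynchronous closeServiceInstance() future is what actually signals that shutdown has finished:

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceRegistration;
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;

// Hypothetical per-device service, keyed on the node id (e.g. "openflow:1").
final class DeviceMastershipService implements ClusterSingletonService {
    private final ServiceGroupIdentifier identifier;

    DeviceMastershipService(final String nodeId) {
        this.identifier = ServiceGroupIdentifier.create(nodeId);
    }

    @Override
    public ServiceGroupIdentifier getIdentifier() {
        return identifier;
    }

    @Override
    public void instantiateServiceInstance() {
        // Invoked when this member becomes owner of the service group:
        // take mastership of the device here.
    }

    @Override
    public ListenableFuture<Void> closeServiceInstance() {
        // Invoked when ownership is lost or the registration is closed.
        // After the rework, callers are no longer blocked until this future
        // completes, so a re-registration for the same group can observe a
        // window where the old instance is still shutting down.
        return Futures.immediateFuture(null);
    }

    // Registration as done by downstream code; closing the returned
    // registration now returns before closeServiceInstance() has finished.
    static ClusterSingletonServiceRegistration register(
            final ClusterSingletonServiceProvider provider, final String nodeId) {
        return provider.registerClusterSingletonService(new DeviceMastershipService(nodeId));
    }
}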

Comment by Vratko Polak [ 14/Sep/17 ]

Debug logs with the revert are here [8] (until Friday ends).

I have not got a clear-cut reproduction of this symptom (without the revert) yet. Can someone with Sandbox rights (Luis?) keep hitting Rebuild on run #13 until the result looks right?
(And also archive the karaf logs somewhere.)

[8] https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-nitrogen/11/odl1_karaf.log.gz

Comment by Vratko Polak [ 14/Sep/17 ]

> run #13

Of job [9].

[9] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-nitrogen/

Comment by Luis Gomez [ 15/Sep/17 ]

I triggered a few runs; this one, for instance, shows the failure:

https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-nitrogen/17/odl1_karaf.log.gz

You can see in the robot log that the first failure is already an incorrect number of instances returned by the entity owner.

Comment by Luis Gomez [ 17/Sep/17 ]

OK, I see things have cleared up now in the sandbox. Vratko, FYI I have updated your patch to disable switch ECHO requests, as otherwise you can get undesired switch reconnects due to the controller not responding to these messages.

Comment by Vratko Polak [ 18/Sep/17 ]

> Of job [9].

This week, I got the desired outcome (only 1 failed test case) twice (runs #2 and #6), but both times there were exceptions which were not reported on releng.

Cause:
java.lang.IllegalStateException: Transaction ReadWriteShardDataTreeTransaction{id=member-1-datastore-operational-fe-0-chn-4-txn-13-0, closed=false, snapshot=MutableDataTree [modification=NodeModification [identifier=(urn:ietf:params:xml:ns:netconf:base:1.0)data, modificationType=NONE, childModification={}]]} is open

Consequence:
2017-09-18 14:59:32,592 | WARN | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 service stopping reported error
java.util.concurrent.CancellationException: Task was cancelled.

Other than that, the simplified job passed once (#3), and twice (#4, #5) it had additional failures in the Cluster Sync Problems suite. I am still waiting for a releng-compatible failure.

Comment by Kit Lou [ 18/Sep/17 ]

Can we downgrade severity based on the information we have collected so far? Thanks!

Comment by Luis Gomez [ 18/Sep/17 ]

I am OK downgrading priority, what about you Vratko?

Comment by Robert Varga [ 18/Sep/17 ]

Based on the logs provided by Luis, there seems to be an issue with the state machine when placeholders are in play:

2017-09-15 00:20:41,415 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 services started
2017-09-15 00:20:41,415 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from TAKING_OWNERSHIP to OWNER
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Closing service group openflow:1
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 unregistering
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 closing
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Replaced group openflow:1 with PlaceholderGroup{id=openflow:1}
2017-09-15 00:21:04,308 | DEBUG | entLoopGroup-5-5 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Call registrationService org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@73da35a6 method for ClusterSingletonService Provider org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl@215ba1af
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state OWNER
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from OWNER to STOPPING_SERVICES
2017-09-15 00:21:14,896 | DEBUG | lt-dispatcher-24 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,896 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:21:14,896 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Group openflow:1 in state STOPPING_SERVICES ignoring cleanup OwnershipChange DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,897 | DEBUG | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 finished stopping services
2017-09-15 00:21:14,897 | DEBUG | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from STOPPING_SERVICES to RELEASING_OWNERSHIP
2017-09-15 00:23:09,057 | DEBUG | lt-dispatcher-21 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false], inJeopardy=false]
2017-09-15 00:23:09,057 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:23:09,057 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Group openflow:1 in state RELEASING_OWNERSHIP not processed entity OwnershipChange DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false], inJeopardy=false]

I will review that in detail later this week.

Comment by Vratko Polak [ 19/Sep/17 ]

> downgrading priority

Looking at [11] section 5.q, it seems we want to keep this as a blocker (so we can merge a fix) until this is the last blocking bug; then we downgrade it and build a (hopefully) final RC.

[11] https://meetings.opendaylight.org/opendaylight-meeting/2017/tsc/opendaylight-meeting-tsc.2017-09-14-17.02.html

Comment by Robert Varga [ 19/Sep/17 ]

So the issue occurs after EOS leadership election, when SingletonService forcibly loses ownership:

ber-2-shard-entity-ownership-operational, leaderPayloadVersion=5]
2017-09-15 00:21:14,892 | INFO | rd-dispatcher-27 | ShardManager | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.6.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-1-shard-entity-ownership-operational status sync done false
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state OWNER
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from OWNER to STOPPING_SERVICES

This is a slightly different codepath, which is then not recovered and gets stuck in RELEASING_OWNERSHIP. The state machine should be skipping that step: since we have already lost ownership, we should transition to STANDBY instead.
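
As a toy model (not the actual mdsal code) of the stuck codepath, using the state names from the log above: once services finish stopping, the group always proceeds to request an ownership release, and a further "ownership lost" event in RELEASING_OWNERSHIP is ignored, so nothing ever moves the group back to STANDBY:

// Simplified, illustrative state machine for a singleton service group.
enum GroupState { STANDBY, TAKING_OWNERSHIP, OWNER, STOPPING_SERVICES, RELEASING_OWNERSHIP }

final class ServiceGroupModel {
    private GroupState state = GroupState.OWNER;

    // EOS reports LOCAL_OWNERSHIP_LOST_NEW_OWNER.
    void lostOwnership() {
        if (state == GroupState.OWNER) {
            state = GroupState.STOPPING_SERVICES;   // stop services first
        }
        // In RELEASING_OWNERSHIP the event is ignored, so the group stays stuck.
    }

    // All services have finished stopping.
    void servicesStopped() {
        if (state == GroupState.STOPPING_SERVICES) {
            // Problem: ownership is already gone on this codepath, yet we still
            // move to RELEASING_OWNERSHIP, a release that can never be acknowledged.
            state = GroupState.RELEASING_OWNERSHIP;
        }
    }
}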

Comment by Vratko Polak [ 20/Sep/17 ]

> This week, I got

Run #11 [12] finally looks like the original symptom. Specifically, member-1 was owner and got hit by the owner isolation. After that, member-1 was no longer an owner, so it got hit again by the successor isolation. As there was no progress during the second isolation, and member-1 was initiating elections with a higher term, member-1 became device owner after the second rejoin.

> gets stuck in RELEASING_OWNERSHIP

While member-1 got stuck after the first rejoin, it became unstuck after becoming owner again. It was member-2, which lost ownership at the second rejoin, that got stuck indefinitely.

Looking at the karaf.log [13] of member-2, that (getting stuck after the successor rejoin) is indeed the cause of this bug.

2017-09-20 08:36:02,167 | DEBUG | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-20 08:36:02,167 | DEBUG | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state OWNER
2017-09-20 08:36:02,167 | DEBUG | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from OWNER to STOPPING_SERVICES
...
2017-09-20 08:36:02,175 | DEBUG | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from STOPPING_SERVICES to RELEASING_OWNERSHIP
...
2017-09-20 08:36:15,920 | INFO | entLoopGroup-7-2 | SystemNotificationsListenerImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | ConnectionEvent: Connection closed by device, Device:/10.29.12.136:35320, NodeId:openflow:1
...
2017-09-20 08:36:15,921 | DEBUG | entLoopGroup-7-2 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Replaced group openflow:1 with PlaceholderGroup{id=openflow:1}

...
2017-09-20 08:36:15,943 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 is closed, not taking ownership
...
2017-09-20 08:36:15,943 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state RELEASING_OWNERSHIP
2017-09-20 08:36:15,943 | INFO | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 ignoring lost ownership in state RELEASING_OWNERSHIP,
...
2017-09-20 08:36:15,950 | INFO | ult-dispatcher-5 | ContextChainHolderImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Removing device from operational DS Uri [_value=openflow:1] was successful
2017-09-20 08:36:21,485 | INFO | entLoopGroup-9-1 | ConnectionAdapterImpl | 287 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.5.0.SNAPSHOT | Hello received
...
2017-09-20 08:36:21,489 | INFO | entLoopGroup-9-1 | RoleContextImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Started timer for setting SLAVE role on device openflow:1 if no role will be set in 20s.
2017-09-20 08:36:21,489 | DEBUG | entLoopGroup-9-1 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Call registrationService org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@14379d5c method for ClusterSingletonService Provider org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl@67124196
2017-09-20 08:36:21,489 | DEBUG | entLoopGroup-9-1 | PlaceholderGroup | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | PlaceholderGroup{id=openflow:1}: added service org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@14379d5c
...
2017-09-20 08:36:21,496 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Group openflow:1 in state RELEASING_OWNERSHIP not processed entity OwnershipChange DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_CHANGED [wasOwner=false, isOwner=false, hasOwner=true], inJeopardy=false]
...
2017-09-20 08:36:41,499 | INFO | entLoopGroup-9-1 | ContextChainHolderImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Role SLAVE was granted to device openflow:1

> we should transition to STANDBY instead

Yes, that would fix this. More specifically, we need to split STOPPING_SERVICES into (something like) STOPPING_SERVICES_BEFORE_UNREGISTERING and STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP.

[12] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-nitrogen/11/
[13] https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-nitrogen/11/odl2_karaf.log.gz
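
A rough sketch of the proposed split (state and method names here are illustrative, not necessarily what the eventual fix uses): the stopping state remembers why services are being stopped, so that once they have stopped the group either releases ownership or, if ownership is already gone, goes straight to STANDBY:

// Illustrative only: two stopping states instead of one.
enum SplitGroupState {
    STANDBY,
    OWNER,
    STOPPING_SERVICES_BEFORE_UNREGISTERING,   // we still own the entity
    STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP, // EOS already moved ownership elsewhere
    RELEASING_OWNERSHIP
}

final class SplitServiceGroupModel {
    private SplitGroupState state = SplitGroupState.OWNER;

    void unregistered() {
        if (state == SplitGroupState.OWNER) {
            state = SplitGroupState.STOPPING_SERVICES_BEFORE_UNREGISTERING;
        }
    }

    void lostOwnership() {
        if (state == SplitGroupState.OWNER) {
            state = SplitGroupState.STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP;
        }
    }

    void servicesStopped() {
        switch (state) {
            case STOPPING_SERVICES_BEFORE_UNREGISTERING:
                state = SplitGroupState.RELEASING_OWNERSHIP; // we must give ownership back
                break;
            case STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP:
                state = SplitGroupState.STANDBY;             // nothing to release, skip the step
                break;
            default:
                break;
        }
    }
}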

Comment by Luis Gomez [ 20/Sep/17 ]

Given that the root cause is identified, what is the ETA for a patch? I guess depending on this ETA we can either downgrade this bug or hold it for a while.

Comment by Robert Varga [ 21/Sep/17 ]

I am still working through the state transitions; ETA is late this evening or tomorrow, depending on how it goes.

Comment by Luis Gomez [ 21/Sep/17 ]

So I think the agreement is to fix this for Carbon SR3 and therefore Nitrogen SR1 too.

Comment by Robert Varga [ 21/Sep/17 ]

Yes, slight correction: Carbon SR2, not SR3 (that would be way out).

Comment by Robert Varga [ 21/Sep/17 ]

Preliminary patch: https://git.opendaylight.org/gerrit/63426

It needs some more work to resolve the implementation/UT mismatch (7 failing tests).

Comment by Luis Gomez [ 22/Sep/17 ]

Right, SR2; I thought the next one was SR3...

Comment by Vratko Polak [ 26/Sep/17 ]

I tried to reproduce this within the controller clustering job, by extending [15] the existing "singleton partition and heal" suite, but so far no luck replicating this bug there (the one failure on this week's Sandbox [16] is CONTROLLER-1670).

> preliminary patch: https://git.opendaylight.org/gerrit/63426

That Oxygen change already makes the OpenFlow scenario more stable.
In this week's Sandbox job [17], so far only two failures have happened (#2 and #18; both times member-3 was not completely removed from the owners upon device disconnect).
Odd-numbered runs are without this change and mostly failing (although I have not checked the exact cause each time).

[15] https://git.opendaylight.org/gerrit/63492
[16] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-nitrogen/
[17] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-oxygen/

Comment by Thanh Ha (zxiiro) [ 05/Oct/17 ]

What's the status on this one?

I believe it's the only remaining bug holding up Carbon SR2.

Comment by Vratko Polak [ 05/Oct/17 ]

I believe the cherry-pick to Carbon [18] (with its parent) is good enough for SR2.
But Robert may have other plans.

[18] https://git.opendaylight.org/gerrit/#/c/63866/2

Comment by Kit Lou [ 12/Oct/17 ]

Please verify this issue is fixed in the latest Carbon SR2 build. Thanks!

Comment by Luis Gomez [ 12/Oct/17 ]

Yes this is fixed now:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-clustering-only-carbon/783/

Comment by Kit Lou [ 18/Oct/17 ]

Hi Luis,

In the latest successful nitrogen autorelease build 242, this same CSIT job is successful:

https://jenkins.opendaylight.org/releng/view/autorelease/job/openflowplugin-csit-3node-clustering-only-nitrogen/

Does this mean we are good with this bug for nitrogen SR1? Thanks!

Comment by Luis Gomez [ 18/Oct/17 ]

AFAIK the fix was merged in all branches, so yes, we are good.
