|
Isn't this related to the new cluster failures possibly introduced by this change? https://git.opendaylight.org/gerrit/#/c/62708/
|
|
Yes, the patch date matches the regression.
|
|
Test added to track this: https://git.opendaylight.org/gerrit/#/c/63014
|
|
OK, the current nitrogen branch shows this bug:
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-nitrogen/280/
However, after applying this revert:
https://git.opendaylight.org/gerrit/#/c/63111/
the bug is gone:
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-nitrogen/279
So I think it makes sense to mark this regression as a BLOCKER and move the issue to mdsal.
|
|
To reproduce:
Run the OF cluster member isolation test:
- Connect a switch to 3 instances.
- Isolate the switch's master instance from the cluster (e.g. using iptables); wait for the cluster to elect a new master.
- Rejoin the master instance; wait for the cluster to be in sync.
- Isolate the switch's candidate instance from the cluster; wait for the cluster to be in sync.
- Rejoin the candidate instance; wait for the cluster to be in sync.
- Disconnect the switch.
After this test, if we reconnect the same switch, the entity owner does not show the correct number of connected instances, and no instance is granted mastership of the device, so everything fails for that device. To recover we have to restart the controller.
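The isolation steps above are typically implemented by firewalling a member off from its peers. A minimal sketch (Python stand-in; the peer IPs and the exact rules are illustrative assumptions, not what the CSIT suite actually runs) of building the iptables commands for the isolate/rejoin steps:

```python
# Hypothetical helper: build iptables commands that isolate a cluster
# member from its peers, and the inverse commands that rejoin it.
# Peer IPs are illustrative assumptions, not values from the CSIT job.
def isolation_cmds(peer_ips, rejoin=False):
    action = "-D" if rejoin else "-I"  # -I inserts a DROP rule, -D deletes it
    cmds = []
    for peer in peer_ips:
        cmds.append(f"iptables {action} INPUT -s {peer} -j DROP")
        cmds.append(f"iptables {action} OUTPUT -d {peer} -j DROP")
    return cmds

peers = ["10.0.0.2", "10.0.0.3"]
isolate_cmds = isolation_cmds(peers)               # run on the master member
rejoin_cmds = isolation_cmds(peers, rejoin=True)   # run to heal the partition
```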
|
|
From these two lines:
2017-09-13 19:06:17,923 | INFO | lt-dispatcher-22 | ContextChainHolderImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Removing device from operational DS Uri [_value=openflow:1] was successful
2017-09-13 19:06:23,103 | INFO | entLoopGroup-7-1 | ConnectionAdapterImpl | 287 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.5.0.SNAPSHOT | Hello received
only the first one appears (with a different timestamp) in the karaf.log [5] of the failed run.
I started running tests with debug logs for org.opendaylight.mdsal.singleton, but Sandbox had problems launching VMs for a while, so only one relevant run [6] has finished so far, and its test case passed.
One behavioral change [0] is that service un-registration is now asynchronous, which (from the user's point of view) looks as if subsequent re-registration is delayed; perhaps the Robot check now hits this window where the original master is not yet a candidate? I will need more time to investigate the logs.
[0] https://git.opendaylight.org/gerrit/#/c/62708/
[5] https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-gate-clustering-only-nitrogen/280/odl1_karaf.log.gz
[6] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-nitrogen/4/
|
|
If you look at the test trend:
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-nitrogen/
it is not failing 100% of the time, but it definitely fails more often than not, and this started happening just after the patch I reverted.
I can also easily reproduce the issue in my environment.
|
|
To be fair, the member isolation test never worked 100% reliably, so the real regression is that it now fails more often than before. From this perspective, if the debugging/fix is complicated at this time, I think we can release nitrogen and note in the release notes that member isolation is not supported and therefore requires a cluster restart.
|
|
My theory is that the rework of MDSAL-275 is flushing out synchronization issues in OFP. I will review the logs tomorrow, but in the meantime it would be good to check whether OFP's synchronization is solid.
Previously the singleton service would block the calling thread until the unregistration had completely finished. This is no longer the case, as it did not make sense and was actually unstable in the face of GCs and other timing issues.
Hence it is possible that the absence of that lock is exposing a lack of synchronization downstream...
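To illustrate the race described above (a Python stand-in, not the actual Java mdsal code; class and method names are assumptions): once close() stops blocking, a caller that re-registers immediately can observe the group mid-teardown.

```python
# Sketch: asynchronous unregistration leaves a window in which the old
# group is still winding down while a caller already tries to re-register.
from concurrent.futures import Future

class SingletonGroup:
    def __init__(self):
        self.state = "OWNER"
        self._closed = Future()

    def close_async(self):
        # Unregistration is now asynchronous: the state machine winds down
        # in the background and the returned future completes later.
        self.state = "STOPPING_SERVICES"
        return self._closed

    def finish_stop(self):
        self.state = "STANDBY"
        self._closed.set_result(None)

group = SingletonGroup()
fut = group.close_async()
# A caller re-registering right away sees the group mid-teardown:
assert not fut.done() and group.state == "STOPPING_SERVICES"
group.finish_stop()
assert fut.done() and group.state == "STANDBY"
```

With the old blocking behaviour, the caller could never observe the STOPPING_SERVICES window; code that implicitly relied on that is now exposed.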
|
|
Debug logs with the revert are here [8] (until Friday ends).
I have not yet produced a clear-cut replication of this symptom (without the revert). Can someone with Sandbox rights (Luis?) keep hitting Rebuild on run #13 until the result looks right?
(And also archive karaf logs somewhere.)
[8] https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-nitrogen/11/odl1_karaf.log.gz
|
|
> run #13
Of job [9].
[9] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-nitrogen/
|
|
I triggered a few runs; this one, for instance, shows the failure:
https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-nitrogen/17/odl1_karaf.log.gz
You can see in the Robot log that the first failure is already an incorrect number of instances returned by the entity owner.
|
|
OK, I see things have cleared up in the Sandbox now. Vratko, FYI, I have updated your patch to disable switch ECHO requests, as otherwise you can get undesired switch reconnects due to the controller not responding to these messages.
|
|
> Of job [9].
This week, I got the desired outcome (only 1 failed test case) twice (runs #2 and #6), but both times there are exceptions which were not reported on releng.
Cause:
java.lang.IllegalStateException: Transaction ReadWriteShardDataTreeTransaction{id=member-1-datastore-operational-fe-0-chn-4-txn-13-0, closed=false, snapshot=MutableDataTree [modification=NodeModification [identifier=(urn:ietf:params:xml:ns:netconf:base:1.0)data, modificationType=NONE, childModification={}]]} is open
Consequence:
2017-09-18 14:59:32,592 | WARN | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 service stopping reported error
java.util.concurrent.CancellationException: Task was cancelled.
Other than that, the simplified job passed once (#3), and twice (#4, #5) it had additional failures in the Cluster Sync Problems suite. Still waiting for a releng-compatible failure.
|
|
Can we downgrade severity based on the information we have collected so far? Thanks!
|
|
I am OK with downgrading the priority; what about you, Vratko?
|
|
Based on the logs provided by Luis, there seems to be an issue with the state machine when placeholders are in play:
2017-09-15 00:20:41,415 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 services started
2017-09-15 00:20:41,415 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from TAKING_OWNERSHIP to OWNER
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Closing service group openflow:1
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 unregistering
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 closing
2017-09-15 00:21:04,286 | DEBUG | entLoopGroup-5-5 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Replaced group openflow:1 with PlaceholderGroup{id=openflow:1}
2017-09-15 00:21:04,308 | DEBUG | entLoopGroup-5-5 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Call registrationService org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@73da35a6 method for ClusterSingletonService Provider org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl@215ba1af
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state OWNER
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from OWNER to STOPPING_SERVICES
2017-09-15 00:21:14,896 | DEBUG | lt-dispatcher-24 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,896 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:21:14,896 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Group openflow:1 in state STOPPING_SERVICES ignoring cleanup OwnershipChange DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,897 | DEBUG | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 finished stopping services
2017-09-15 00:21:14,897 | DEBUG | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from STOPPING_SERVICES to RELEASING_OWNERSHIP
2017-09-15 00:23:09,057 | DEBUG | lt-dispatcher-21 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false], inJeopardy=false]
2017-09-15 00:23:09,057 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:23:09,057 | DEBUG | lt-dispatcher-21 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Group openflow:1 in state RELEASING_OWNERSHIP not processed entity OwnershipChange DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_LOST_NO_OWNER [wasOwner=false, isOwner=false, hasOwner=false], inJeopardy=false]
I will review that in detail later this week.
|
|
> downgrading priority
Looking at [11] section 5.q it seems we want to keep this as a blocker (so we can merge a fix) until this is the last blocking bug, then we downgrade this and build a (hopefully) final RC build.
[11] https://meetings.opendaylight.org/opendaylight-meeting/2017/tsc/opendaylight-meeting-tsc.2017-09-14-17.02.html
|
|
So the issue occurs after EOS leadership election, when SingletonService forcibly loses ownership:
ber-2-shard-entity-ownership-operational, leaderPayloadVersion=5]
2017-09-15 00:21:14,892 | INFO | rd-dispatcher-27 | ShardManager | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.6.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-1-shard-entity-ownership-operational status sync done false
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state OWNER
2017-09-15 00:21:14,895 | DEBUG | lt-dispatcher-24 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from OWNER to STOPPING_SERVICES
This is a slightly different codepath, which is then not recovered and gets stuck in RELEASING_OWNERSHIP. The state machine should skip that step: since we already lost ownership, we should transition to STANDBY instead.
|
|
> This week, I got
Run #11 [12] finally looks like the original symptom. Specifically, member-1 was the owner and got hit by the owner isolation. After that, member-1 was no longer an owner, so it got hit again by the successor isolation. As there was no progress during the second isolation, and member-1 was initiating elections with a higher term, member-1 became the device owner after the second rejoin.
> gets stuck in RELEASING_OWNERSHIP
While member-1 got stuck after the first rejoin, it became unstuck after becoming owner again. It was member-2, which lost ownership at the second rejoin, that got stuck indefinitely.
Looking at the karaf.log [13] of member-2, that (getting stuck after the successor rejoin) is indeed the cause of this bug.
2017-09-20 08:36:02,167 | DEBUG | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup openflow:1
2017-09-20 08:36:02,167 | DEBUG | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state OWNER
2017-09-20 08:36:02,167 | DEBUG | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from OWNER to STOPPING_SERVICES
...
2017-09-20 08:36:02,175 | DEBUG | ofppool-0 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 switching from STOPPING_SERVICES to RELEASING_OWNERSHIP
...
2017-09-20 08:36:15,920 | INFO | entLoopGroup-7-2 | SystemNotificationsListenerImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | ConnectionEvent: Connection closed by device, Device:/10.29.12.136:35320, NodeId:openflow:1
...
2017-09-20 08:36:15,921 | DEBUG | entLoopGroup-7-2 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Replaced group openflow:1 with PlaceholderGroup{id=openflow:1}
...
2017-09-20 08:36:15,943 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 is closed, not taking ownership
...
2017-09-20 08:36:15,943 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 lost ownership in state RELEASING_OWNERSHIP
2017-09-20 08:36:15,943 | INFO | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Service group openflow:1 ignoring lost ownership in state RELEASING_OWNERSHIP,
...
2017-09-20 08:36:15,950 | INFO | ult-dispatcher-5 | ContextChainHolderImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Removing device from operational DS Uri [_value=openflow:1] was successful
2017-09-20 08:36:21,485 | INFO | entLoopGroup-9-1 | ConnectionAdapterImpl | 287 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.5.0.SNAPSHOT | Hello received
...
2017-09-20 08:36:21,489 | INFO | entLoopGroup-9-1 | RoleContextImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Started timer for setting SLAVE role on device openflow:1 if no role will be set in 20s.
2017-09-20 08:36:21,489 | DEBUG | entLoopGroup-9-1 | sterSingletonServiceProviderImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Call registrationService org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@14379d5c method for ClusterSingletonService Provider org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl@67124196
2017-09-20 08:36:21,489 | DEBUG | entLoopGroup-9-1 | PlaceholderGroup | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | PlaceholderGroup{id=openflow:1}: added service org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@14379d5c
...
2017-09-20 08:36:21,496 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 262 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.0.SNAPSHOT | Group openflow:1 in state RELEASING_OWNERSHIP not processed entity OwnershipChange DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=REMOTE_OWNERSHIP_CHANGED [wasOwner=false, isOwner=false, hasOwner=true], inJeopardy=false]
...
2017-09-20 08:36:41,499 | INFO | entLoopGroup-9-1 | ContextChainHolderImpl | 281 - org.opendaylight.openflowplugin.impl - 0.5.0.SNAPSHOT | Role SLAVE was granted to device openflow:1
> we should transition to STANDBY instead
Yes, that would fix this. More specifically, we need to split STOPPING_SERVICES into (something like) STOPPING_SERVICES_BEFORE_UNREGISTERING and STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP.
[12] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-nitrogen/11/
[13] https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-nitrogen/11/odl2_karaf.log.gz
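The proposed split can be sketched as a transition table (a Python sketch; the state and event names beyond those quoted in this thread are assumptions, not the actual mdsal enum): a group that stops because it already lost ownership goes straight to STANDBY instead of entering RELEASING_OWNERSHIP.

```python
# Sketch of the proposed fix: track WHY services are stopping, so the
# forced-loss path never enters RELEASING_OWNERSHIP (there is nothing
# left to release once EOS already moved ownership elsewhere).
def next_state(state, event):
    transitions = {
        # Unregistration path: we still own the entity, so ownership
        # must be released after services stop.
        ("OWNER", "UNREGISTER"): "STOPPING_SERVICES_BEFORE_UNREGISTERING",
        ("STOPPING_SERVICES_BEFORE_UNREGISTERING", "SERVICES_STOPPED"): "RELEASING_OWNERSHIP",
        ("RELEASING_OWNERSHIP", "OWNERSHIP_RELEASED"): "STANDBY",
        # Forced-loss path: ownership is already gone, so after services
        # stop we transition directly to STANDBY.
        ("OWNER", "OWNERSHIP_LOST"): "STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP",
        ("STOPPING_SERVICES_AFTER_LOSING_OWNERSHIP", "SERVICES_STOPPED"): "STANDBY",
    }
    return transitions.get((state, event), state)

# The bug corresponds to both paths sharing a single STOPPING_SERVICES
# state and always entering RELEASING_OWNERSHIP, which never completes
# when ownership was already lost.
state = next_state("OWNER", "OWNERSHIP_LOST")
state = next_state(state, "SERVICES_STOPPED")  # lands in STANDBY, not stuck
```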
|
|
Given that the root cause is identified, what is the ETA for a patch? Depending on that ETA, we can either downgrade this bug or hold it for a while.
|
|
I am still working through the state transitions; the ETA is late this evening or tomorrow, depending on how it goes.
|
|
So I think the agreement is to fix this for Carbon SR3 and therefore Nitrogen SR1 too.
|
|
Yes, with a slight correction: Carbon SR2, not SR3 (that would be way out).
|
|
Preliminary patch: https://git.opendaylight.org/gerrit/63426
It needs some more work to resolve the implementation/UT mismatch (7 failing tests).
|
|
Right, SR2; I thought the next one was SR3...
|
|
I tried to reproduce this within the controller clustering job by extending [15] the existing "singleton partition and heal" suite, but so far no luck replicating this bug there (the one failure in this week's Sandbox [16] is CONTROLLER-1670).
> preliminary patch: https://git.opendaylight.org/gerrit/63426
That Oxygen change already makes the OpenFlow scenario more stable.
In this week's Sandbox job [17], only two failures have happened so far (#2 and #18; both times member-3 was not completely removed from the owners upon device disconnect).
Odd-numbered runs are without this change and mostly failing (although I have not checked the exact cause each time).
[15] https://git.opendaylight.org/gerrit/63492
[16] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-nitrogen/
[17] https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-oxygen/
|
|
What's the status on this one?
I believe it's the only bug remaining holding up Carbon SR2.
|
|
I believe the cherry-pick to Carbon [18] (with its parent) is good enough for SR2.
But Robert may have other plans.
[18] https://git.opendaylight.org/gerrit/#/c/63866/2
|
|
Please verify this issue is fixed in the latest Carbon SR2 build. Thanks!
|
|
Yes this is fixed now:
https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-clustering-only-carbon/783/
|
|
Hi Luis,
In the latest successful nitrogen autorelease build 242, this same CSIT job is successful:
https://jenkins.opendaylight.org/releng/view/autorelease/job/openflowplugin-csit-3node-clustering-only-nitrogen/
Does this mean we are good with this bug for nitrogen SR1? Thanks!
|
|
AFAIK the fix was merged in all branches, so yes, we are good.
|