[MDSAL-197] Switch connection bounce results in wrong ownership change in Singleton Created: 13/Sep/16  Updated: 09/Mar/18  Resolved: 14/Nov/16

Status: Resolved
Project: mdsal
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tomas Slusny Assignee: Vaclav Demcak
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: All


Attachments: Text File 0841_10.13.11.207.log     Text File 1722_10.13.11.207.log    
Issue Links:
Blocks
blocks BGPCEP-540 ClusterSingletonService/EOS race cond... Resolved
blocks OPNFLWPLUG-616 Longevity test rarely goes beyond 6 h... Resolved
blocks OPNFLWPLUG-759 l2switch addresss tracker does not be... Resolved
blocks OPNFLWPLUG-767 Switch connection bounce generates wr... Resolved
External issue ID: 6710

 Description   

In Carbon, when a switch is disconnected and reconnected with a very small delay between these actions, its role is sometimes set incorrectly (it should be MASTER, but ends up SLAVE).

Reproduction steps (from Luis, in bug https://bugs.opendaylight.org/show_bug.cgi?id=6672):
1) Start mininet with 1 switch (s1) pointing to any of the cluster instances (e.g. 10.13.11.207)
2) Run the following script in the mininet system to bounce connection quickly:

#!/bin/bash
sudo ovs-vsctl del-controller s1
sleep 0.2
sudo ovs-vsctl set-controller s1 "tcp:192.168.0.101"

In a local cluster test, I performed step 2 twice. The first time, the MASTER role was set successfully, but the second time the role was SLAVE, which is incorrect.

I attached logs from my local testing from node 10.13.11.207.



 Comments   
Comment by Tomas Slusny [ 13/Sep/16 ]

Attachment 1722_10.13.11.207.log has been added with description: Karaf log from 10.13.11.207 cluster node

Comment by Tom Pantelis [ 15/Sep/16 ]

I see this exception:

2016-09-13 17:21:40,055 | ERROR | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | Unexpected exception state for service Provider openflow:1 in LostLeadership
com.google.common.base.VerifyException
at com.google.common.base.Verify.verify(Verify.java:98)[70:com.google.guava:18.0.0]
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lostOwnership(ClusterSingletonServiceGroupImpl.java:308)[169:org.opendaylight.mdsal.singleton-dom-impl:2.2.0.SNAPSHOT]
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:213)[169:org.opendaylight.mdsal.singleton-dom-impl:2.2.0.SNAPSHOT]
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:145)[169:org.opendaylight.mdsal.singleton-dom-impl:2.2.0.SNAPSHOT]
at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23)[169:org.opendaylight.mdsal.singleton-dom-impl:2.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:46)[205:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]

It had local ownership of the ServiceEntityType and AsyncCloseEntityType. Then it got an unregister from the app:

2016-09-13 17:21:40,031 | DEBUG | on-dispatcher-52 | ClusterSingletonServiceGroupImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | UnregisterService method call for ClusterSingletonServiceGroup openflow:1

2016-09-13 17:21:40,036 | DEBUG | on-dispatcher-52 | ClusterSingletonServiceGroupImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | Close method for service Provider openflow:1

This closes the serviceEntityCandidateReg and sets it to null.

Then it got the LOCAL_OWNERSHIP_LOST_NEW_OWNER notification for the ServiceEntityType:

2016-09-13 17:21:40,046 | DEBUG | ult-dispatcher-4 | sterSingletonServiceProviderImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]

This leads to the call to lostOwnership, which fails the verify on line 308 because serviceEntityCandidateReg was already nulled out:

Verify.verify(serviceEntityCandidateReg != null);

There's either a logic flaw somewhere else, or the code is being unnecessarily paranoid in verifying serviceEntityCandidateReg != null.
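
For illustration, a tolerant variant of that check might look roughly like the sketch below. This is a hypothetical, simplified stand-in for ClusterSingletonServiceGroupImpl (the field and method names follow the log excerpts above; it is not the actual code or the eventual fix):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical, simplified sketch only -- not the real ClusterSingletonServiceGroupImpl.
// It shows the difference between the hard Verify.verify(serviceEntityCandidateReg != null)
// and a guard that tolerates a group that is already closing.
final class SingletonGroupSketch {
    private static final Logger LOG = LoggerFactory.getLogger(SingletonGroupSketch.class);

    private final String identifier;
    private AutoCloseable serviceEntityCandidateReg;

    SingletonGroupSketch(final String identifier, final AutoCloseable candidateReg) {
        this.identifier = identifier;
        this.serviceEntityCandidateReg = candidateReg;
    }

    // Corresponds to the "UnregisterService" / "Close" path in the log above:
    // the candidate registration is closed and nulled out.
    synchronized void closeGroup() throws Exception {
        if (serviceEntityCandidateReg != null) {
            serviceEntityCandidateReg.close();
            serviceEntityCandidateReg = null;
        }
    }

    // Corresponds to lostOwnership(): a LOCAL_OWNERSHIP_LOST_NEW_OWNER notification
    // arriving after closeGroup() is expected, so it is ignored instead of tripping
    // a VerifyException.
    synchronized void lostOwnership() {
        if (serviceEntityCandidateReg == null) {
            LOG.debug("Ignoring lost-ownership for {}: group is already closing", identifier);
            return;
        }
        // ... normal lost-leadership handling would go here ...
    }
}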

Comment by Vaclav Demcak [ 28/Sep/16 ]

https://git.opendaylight.org/gerrit/#/c/46175/

Comment by Tomas Slusny [ 30/Sep/16 ]

I was testing the connection bounce again on patch https://git.opendaylight.org/gerrit/#/c/46175/7 above (also with the controller on patch https://git.opendaylight.org/gerrit/#/c/45638/3 by Tom Pantelis), and everything is working almost perfectly, except that sometimes I see a VerifyException in the tryToTakeOwnership method while the device is closing. I attached the full log from the main node. So I think tryToTakeOwnership can also be called while in the closing state, so serviceEntityCandidateReg can be null there too.

Comment by Tomas Slusny [ 30/Sep/16 ]

Attachment 0841_10.13.11.207.log has been added with description: TryToTakeOwnership VerifyException in closing

Comment by Tomas Slusny [ 30/Sep/16 ]

After further testing, this VerifyException also appears in takeOwnership on Verify.verify(serviceEntityCandidateReg != null); while in the closing state.
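
The same tolerance would presumably be needed on the take-ownership path; continuing the hypothetical sketch from Tom's comment above (again, not the actual patch):

    // Hypothetical continuation of the sketch above: the take-ownership path would need
    // the same guard, declining to take ownership once the group is closing instead of
    // failing the Verify.verify(serviceEntityCandidateReg != null) check.
    synchronized void takeOwnership() {
        if (serviceEntityCandidateReg == null) {
            LOG.debug("Ignoring take-ownership for {}: group is already closing", identifier);
            return;
        }
        // ... normal take-ownership handling would go here ...
    }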

Comment by Vaclav Demcak [ 30/Sep/16 ]

It looks like you don't wait correctly for the closing process to finish, or I'm not guarding the closing process correctly. The logs show you are closing your service at the same time as the registration, so EOS responds with the first registration role while your service is still in a closing state. Probably a verification exception is not the right approach here; I have to think about it. Could you describe exactly your steps for the closing / re-registration process from your code, please?
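
For context, the application-side ordering in question might look roughly like the sketch below. It is a hedged illustration only: the class and handler names are made up, and the mdsal singleton package and type names are reproduced from memory, so they may not match the Boron-era API exactly. It is not the OpenFlow plugin code.

import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceRegistration;

// Hypothetical app-side sketch of the close / re-registration ordering being asked about.
final class DeviceSingletonLifecycle {
    private final ClusterSingletonServiceProvider provider;
    private final ClusterSingletonService service;
    private ClusterSingletonServiceRegistration registration;

    DeviceSingletonLifecycle(final ClusterSingletonServiceProvider provider,
            final ClusterSingletonService service) {
        this.provider = provider;
        this.service = service;
    }

    synchronized void onDeviceConnected() {
        // Re-registering while the previous group is still closing is exactly the race
        // this issue describes, so registration should only happen once the previous
        // close has fully completed.
        registration = provider.registerClusterSingletonService(service);
    }

    synchronized void onDeviceDisconnected() throws Exception {
        if (registration != null) {
            // Closing the registration triggers the asynchronous close of the service
            // instance; the caller should wait for that to finish before reconnecting.
            registration.close();
            registration = null;
        }
    }
}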

Comment by Vaclav Demcak [ 11/Oct/16 ]

master
https://git.opendaylight.org/gerrit/#/c/46175/14
stable boron cherry-pick
https://git.opendaylight.org/gerrit/#/c/46761/

Comment by Tomas Slusny [ 13/Oct/16 ]

I tested these problems again with the patch by Vaclav posted above, and everything is now working fine, so I am closing this issue.

Comment by Bertrand Low [ 18/Oct/16 ]

Hi Vaclav and Tomas,

the cherry-pick to Boron (https://git.opendaylight.org/gerrit/#/c/46761/) is not passing the distribution-check, and it appears to be due to this error:

2016-10-18 22:22:53,853 | ERROR | config-pusher | ConfigTransactionControllerImpl | 128 - org.opendaylight.controller.config-manager - 0.5.1.SNAPSHOT | Commit failed on ModuleIdentifier {factoryName='bgp-reachability-ipv4', instanceName='example-ipv4-topology'} in transaction TransactionIdentifier {name='ConfigTransaction-282-284'}

java.lang.RuntimeException: com.google.common.base.VerifyException
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.registerService(ClusterSingletonServiceGroupImpl.java:170)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.registerClusterSingletonService(AbstractClusterSingletonServiceProviderImpl.java:106)
at Proxyd030c118_68d9_4469_aef9_a99fb46b9175.registerClusterSingletonService(Unknown Source)
at Proxy9705fcab_39eb_4b61_ac74_0c4dab9a7c4c.registerClusterSingletonService(Unknown Source)

This seems different from the causes discussed earlier (LostLeadership and TakeOwnership) - any ideas why the patch is passing in master but not in Boron?

If I understand correctly, the fix for MDSAL-197 will have to be merged into Boron before the fix for OPNFLWPLUG-767 / 5271 (which is awaiting Luis' verification) can be merged.

Thanks

Comment by Vaclav Demcak [ 19/Oct/16 ]

(In reply to Bertrand Low from comment #8)

Yes, because the https://git.opendaylight.org/gerrit/#/c/46761/ patch indirectly depends on https://git.opendaylight.org/gerrit/#/c/47018/, but that is part of a different issue.

Comment by Vinh Nguyen [ 19/Oct/16 ]

Hi Vaclav, the https://git.opendaylight.org/gerrit/#/c/46761/ patch indirectly depends on https://git.opendaylight.org/gerrit/#/c/47018/. However, there seems to be a circular dependency between the two patches, since 47018 mentions that it indirectly depends on 46761. Is that correct?

Do you have any idea when the above patches will be merged?
Thanks

Comment by Robert Varga [ 26/Oct/16 ]

Reopening, as the fix has not been merged yet.

Comment by Robert Varga [ 26/Oct/16 ]

This is still waiting for merge. Boron is throttled for SR1 release, so unless this issue becomes a blocker for SR1, it will need to be postponed to SR2.

Raising this to blocker for SR2. If you feel it needs to be delivered in SR1, feel free to retarget and notify the release manager of a blocker.

Comment by Martin Ciglan [ 07/Nov/16 ]

Hi all

Now that Boron-SR1 has been released, can we move on with this issue? Thank you.

Martin
