[MDSAL-197] Switch connection bounce results in wrong ownership change in Singleton Created: 13/Sep/16 Updated: 09/Mar/18 Resolved: 14/Nov/16 |
|
| Status: | Resolved |
| Project: | mdsal |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Tomas Slusny | Assignee: | Vaclav Demcak |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: Linux |
||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| External issue ID: | 6710 | ||||||||||||||||||||
| Description |
|
In Carbon, sometimes, when switch is disconnected and reconnected with very small delay between these actions, it's role is incorrectly set (should be MASTER, but is SLAVE). Reproduction steps (from Luis from bug https://bugs.opendaylight.org/show_bug.cgi?id=6672): #!/bin/bash In local cluster test, I did step 2 twice. First time, role MASTER was successfully set, but second time, role was SLAVE, what is incorrect. I attached logs from my local testing from node 10.13.11.207. |
| Comments |
| Comment by Tomas Slusny [ 13/Sep/16 ] |
|
Attachment 1722_10.13.11.207.log has been added with description: Karaf log from 10.13.11.207 cluster node |
| Comment by Tom Pantelis [ 15/Sep/16 ] |
|
I see this ex: 2016-09-13 17:21:40,055 | ERROR | ult-dispatcher-4 | ClusterSingletonServiceGroupImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | Unexpected exception state for service Provider openflow:1 in LostLeadership It had local ownership of the ServiceEntityType and AsycnCloseEntityType. Then it got an unregister from the app: 2016-09-13 17:21:40,031 | DEBUG | on-dispatcher-52 | ClusterSingletonServiceGroupImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | UnregisterService method call for ClusterSingletonServiceGroup openflow:1 2016-09-13 17:21:40,036 | DEBUG | on-dispatcher-52 | ClusterSingletonServiceGroupImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | Close method for service Provider openflow:1 This closes the serviceEntityCandidateReg and sets it to null. Then it got the LOCAL_OWNERSHIP_LOST_NEW_OWNER notification for the ServiceEntityType: 2016-09-13 17:21:40,046 | DEBUG | ult-dispatcher-4 | sterSingletonServiceProviderImpl | 169 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.0.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[ {(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=openflow:1}]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] This leads to the call to lostOwnership which fails this verify on line 308 b/c the serviceEntityCandidateReg was already nulled out: Verify.verify(serviceEntityCandidateReg != null); There's either logic flaw somewhere else or the code is being unnecessarily paranoid in verifying serviceEntityCandidateReg != null. |
| Comment by Vaclav Demcak [ 28/Sep/16 ] |
| Comment by Tomas Slusny [ 30/Sep/16 ] |
|
So I was testing connection bounce again on patch https://git.opendaylight.org/gerrit/#/c/46175/7 above (also, with controller on this patch https://git.opendaylight.org/gerrit/#/c/45638/3 by Tom Pantelis) and eveything is working almost great, except that sometimes I see VerifyException when device is closing in tryToTakeOwnership method. I attached full log from main node. So I think that sometimes when in closing state, also TryToTakeOwnership is called, so serviceEntityCandidateReg can be null there too. |
| Comment by Tomas Slusny [ 30/Sep/16 ] |
|
Attachment 0841_10.13.11.207.log has been added with description: TryToTakeOwnership VerifyException in closing |
| Comment by Tomas Slusny [ 30/Sep/16 ] |
|
After further testing, this VerifyException is also appearing in TakeOwnership on Verify.verify(serviceEntityCandidateReg != null); when in closing. |
| Comment by Vaclav Demcak [ 30/Sep/16 ] |
|
It looks like you don't wait correctly for a closing process. Or I'm not guarding a closing process correctly. Logs show, you're closing your service in same time as registration. So EOS response first registration role but your service is in a closing state. Propably a verification exception is not here the right approach. I have to think about it. Could you describe here exactly your steps for closing / re-registration process from your code please? |
| Comment by Vaclav Demcak [ 11/Oct/16 ] |
|
master |
| Comment by Tomas Slusny [ 13/Oct/16 ] |
|
So I was testing these problems again with patch by Vaclav, what is posted above, and everything is now working fine, so closing this issue. |
| Comment by Bertrand Low [ 18/Oct/16 ] |
|
Hi Vaclav and Tomas, the cherry-pick to Boron (https://git.opendaylight.org/gerrit/#/c/46761/) is not passing the distribution-check and it appears to be due to this error: 2016-10-18 22:22:53,853 | ERROR | config-pusher | ConfigTransactionControllerImpl | 128 - org.opendaylight.controller.config-manager - 0.5.1.SNAPSHOT | Commit failed on ModuleIdentifier {factoryName='bgp-reachability-ipv4', instanceName='example-ipv4-topology'}in transaction TransactionIdentifier {name='ConfigTransaction-282-284'}java.lang.RuntimeException: com.google.common.base.VerifyException This seems different from the causes tracked below (lostleadership and takeownership) - any ideas why the patch is passing in master but not in Boron? If I understand correctly, the fix for Thanks |
| Comment by Vaclav Demcak [ 19/Oct/16 ] |
|
(In reply to Bertrand Low from comment #8) in transaction > java.lang.RuntimeException: com.google.common.base.VerifyException yes, because https://git.opendaylight.org/gerrit/#/c/46761/ pacht indirectly depends on https://git.opendaylight.org/gerrit/#/c/47018/ , but it is a part of diferent issue. |
| Comment by Vinh Nguyen [ 19/Oct/16 ] |
|
Hi Vaclat, the https://git.opendaylight.org/gerrit/#/c/46761/ patch indirectly depends on https://git.opendaylight.org/gerrit/#/c/47018/. However there seems to be a circular dependency between the two patches since the 47018 mentions that it indirectly depends on the 46761. Is that correct? Do you have any idea when the above patches will be merged? |
| Comment by Robert Varga [ 26/Oct/16 ] |
|
Reopen as the issue has not been merged yet. |
| Comment by Robert Varga [ 26/Oct/16 ] |
|
This is still waiting for merge. Boron is throttled for SR1 release, so unless this issue becomes a blocker for SR1, it will need to be postponed to SR2. Raising to blocker for SR2. If you feel it needs to be delivered in SR1, feel free to retarget and notify release manager of a blocker. |
| Comment by Martin Ciglan [ 07/Nov/16 ] |
|
Hi all Having Boron-SR1 released, can we move on with this issue? Thank you. Martin |