[MDSAL-275] VerifyException from flapping service singleton registration Created: 17/Jul/17  Updated: 09/Mar/18  Resolved: 14/Sep/17

Status: Resolved
Project: mdsal
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1755 RaftActor lastApplied index moves bac... Resolved
is blocked by CONTROLLER-1757 Singleton leader chasing exhausts hea... Resolved
Duplicate
is duplicated by MDSAL-234 Cluster singleton service unable to h... Resolved
External issue ID: 8858

 Description   

The "flapping service" is a simple test service implemented using singleton service. Upon creation, flapping service instance tries to de-register flapping service from the current member, and upon close, the instance tries to re-register on the current member again.

The robot test registers the flapping service on all members, waits a while, and then un-registers all members. This is also running as a longevity job.

This week, the longevity job failed [0], after 22 hours (after 175814 successful de-registrations) in the re-registration phase. Karaf log [1] shows VerifyException:

2017-07-16 21:10:57,018 | WARN | pool-30-thread-1 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | There was a problem re-registering flapping singleton service.java.lang.RuntimeException: com.google.common.base.VerifyException
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.registerService(ClusterSingletonServiceGroupImpl.java:180)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.registerClusterSingletonService(AbstractClusterSingletonServiceProviderImpl.java:107)
at Proxy057965cf_73f3_4f46_a035_5d06b8a2497b.registerClusterSingletonService(Unknown Source)
at Proxy1a1b6ecc_a451_4f57_8a21_81dd42385c83.registerClusterSingletonService(Unknown Source)
at org.opendaylight.controller.clustering.it.provider.impl.FlappingSingletonService.lambda$closeServiceInstance$1(FlappingSingletonService.java:78)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_131]
...

Garbage collection log [2] shows there was a small GC pause just before, not sure how that could be related.

2017-07-16T21:10:56.605+0000: 78867.300: [GC (Allocation Failure) [PSYoungGen: 598048K->64416K(604160K)] 1060366K->526734K(1238528K), 0.4121114 secs] [Times: user=2.06 sys=0.00, real=0.41 secs]

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/13/log.html.gz#s1-s2-t3-k3-k2-k1-k1-k2-k1-k4-k7-k1
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/13/odl1_karaf.log.gz
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/13/gclogs-1/gc_1500160588.82.log.gz



 Comments   
Comment by Robert Varga [ 24/Jul/17 ]

We need the complete exception stack. In any case, that throw/catch logic needs to be cleaned up.

Comment by Vratko Polak [ 07/Aug/17 ]

This happened again, in a longevity job after almost 20 hours.

Karaf.log [3] shows this exception:

2017-08-07 01:08:46,962 | WARN | pool-30-thread-1 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | There was a problem re-registering flapping singleton service.
java.lang.RuntimeException: com.google.common.base.VerifyException
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.registerService(ClusterSingletonServiceGroupImpl.java:180)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.registerClusterSingletonService(AbstractClusterSingletonServiceProviderImpl.java:107)
at Proxy7f8ecfdd_c3ac_4a1d_bbfd_341267680a82.registerClusterSingletonService(Unknown Source)
at Proxy27779ad8_0887_4aef_a928_433e9f48426a.registerClusterSingletonService(Unknown Source)
at org.opendaylight.controller.clustering.it.provider.impl.FlappingSingletonService.lambda$closeServiceInstance$1(FlappingSingletonService.java:75)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)[:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)[:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_131]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_131]
Caused by: com.google.common.base.VerifyException
at com.google.common.base.Verify.verify(Verify.java:98)[57:com.google.guava:18.0.0]
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.registerService(ClusterSingletonServiceGroupImpl.java:171)
... 11 more

Once again, GC log [4] shows a minor collection just before the exception:

2017-08-07T01:08:46.522+0000: 67044.318: [GC (Allocation Failure) [PSYoungGen: 610304K->50592K(626688K)] 1065438K->506662K(1191424K), 0.4388956 secs] [Times: user=2.25 sys=0.00, real=0.44 secs]

[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/16/odl1_karaf.log.gz
[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/16/gclogs-1/gc_1502001081.71.log.gz

Comment by Robert Varga [ 08/Aug/17 ]

The locking is funky and the GC pause may be what pushed it over the brink. I am investigating what to do about it.

Comment by Vratko Polak [ 08/Aug/17 ]

This seems to be a consequence of the current MDSAL-234 workaround. The implementation waits 200 milliseconds, which is not enough if GC happens.

I created a short run with the wait greatly reduced and debug logs [5].
Here is copy of the interesting part:

2017-08-08 13:34:31,301 | INFO | h for user karaf | command | 266 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Register_Candidates
2017-08-08 13:34:31,543 | DEBUG | qtp2030202857-77 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Registering flapping-singleton-service.
2017-08-08 13:34:31,543 | DEBUG | qtp2030202857-77 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Call registrationService org.opendaylight.controller.clustering.it.provider.impl.FlappingSingletonService@cd0a4c9 method for ClusterSingletonService Provider org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl
2017-08-08 13:34:31,544 | DEBUG | qtp2030202857-77 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | New Instance of ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,544 | DEBUG | qtp2030202857-77 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Initialization ClusterSingletonGroup flapping-singleton-service
2017-08-08 13:34:31,544 | DEBUG | qtp2030202857-77 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | RegisterService method call for ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,635 | DEBUG | lt-dispatcher-34 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false]
2017-08-08 13:34:31,635 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,635 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | TryToTakeLeadership method for service Provider flapping-singleton-service
2017-08-08 13:34:31,729 | DEBUG | lt-dispatcher-34 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false]
2017-08-08 13:34:31,730 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,730 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | TakeLeadership method for service Provider flapping-singleton-service
2017-08-08 13:34:31,730 | DEBUG | lt-dispatcher-34 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Instantiating flapping-singleton-service.
2017-08-08 13:34:31,732 | DEBUG | pool-30-thread-1 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | UnregisterService method call for ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,749 | DEBUG | lt-dispatcher-34 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-08-08 13:34:31,749 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.ServiceEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,749 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | LostLeadership method for service Provider flapping-singleton-service
2017-08-08 13:34:31,749 | DEBUG | lt-dispatcher-34 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Closing flapping-singleton-service, flapCount: 0
2017-08-08 13:34:31,751 | DEBUG | pool-30-thread-1 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Running re-registration
2017-08-08 13:34:31,752 | DEBUG | pool-30-thread-1 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Call registrationService org.opendaylight.controller.clustering.it.provider.impl.FlappingSingletonService@cd0a4c9 method for ClusterSingletonService Provider org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl
2017-08-08 13:34:31,752 | DEBUG | pool-30-thread-1 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | RegisterService method call for ClusterSingletonServiceGroup flapping-singleton-service
2017-08-08 13:34:31,752 | DEBUG | lt-dispatcher-34 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Destroy service Instance flapping-singleton-service is success
2017-08-08 13:34:31,778 | DEBUG | lt-dispatcher-34 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Ownership change for ClusterSingletonService Provider DOMEntityOwnershipChange [entity=DOMEntity [type=org.opendaylight.mdsal.AsyncServiceCloseEntityType, id=/(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)entity/entity[

{(urn:opendaylight:params:xml:ns:yang:mdsal:core:general-entity?revision=2015-09-30)name=flapping-singleton-service}

]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false]
2017-08-08 13:34:31,779 | DEBUG | lt-dispatcher-34 | sterSingletonServiceProviderImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | ClusterSingletonServiceGroup was not found for serviceIdentifier flapping-singleton-service
2017-08-08 13:34:31,754 | DEBUG | pool-30-thread-1 | ClusterSingletonServiceGroupImpl | 155 - org.opendaylight.mdsal.singleton-dom-impl - 2.2.2.SNAPSHOT | Unexpected error by registration service Provider flapping-singleton-service
com.google.common.base.VerifyException

[5] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/2/odl1_karaf.log.gz

Comment by Robert Varga [ 09/Aug/17 ]

WIP patch: https://git.opendaylight.org/gerrit/61420

Comment by Kit Lou [ 29/Aug/17 ]

Hi Robert, Do you have an ETA on when the patch will be ready to merge? Thanks!

Comment by Vratko Polak [ 06/Sep/17 ]

> WIP patch: https://git.opendaylight.org/gerrit/61420

Merged to all active branches.

Generated at Wed Feb 07 20:09:20 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.