[MDSAL-275] VerifyException from flapping service singleton registration Created: 17/Jul/17 Updated: 09/Mar/18 Resolved: 14/Sep/17 |
|
| Status: | Resolved |
| Project: | mdsal |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||||||||||||||
| External issue ID: | 8858 | ||||||||||||||||||||
| Description |
|
The "flapping service" is a simple test service implemented using singleton service. Upon creation, flapping service instance tries to de-register flapping service from the current member, and upon close, the instance tries to re-register on the current member again. The robot test registers the flapping service on all members, waits a while, and then un-registers all members. This is also running as a longevity job. This week, the longevity job failed [0], after 22 hours (after 175814 successful de-registrations) in the re-registration phase. Karaf log [1] shows VerifyException: 2017-07-16 21:10:57,018 | WARN | pool-30-thread-1 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | There was a problem re-registering flapping singleton service.java.lang.RuntimeException: com.google.common.base.VerifyException Garbage collection log [2] shows there was a small GC pause just before, not sure how that could be related. 2017-07-16T21:10:56.605+0000: 78867.300: [GC (Allocation Failure) [PSYoungGen: 598048K->64416K(604160K)] 1060366K->526734K(1238528K), 0.4121114 secs] [Times: user=2.06 sys=0.00, real=0.41 secs] [0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/13/log.html.gz#s1-s2-t3-k3-k2-k1-k1-k2-k1-k4-k7-k1 |
| Comments |
| Comment by Robert Varga [ 24/Jul/17 ] |
|
We need the complete exception stack. In any case, that throw/catch logic needs to be cleaned up. |
| Comment by Vratko Polak [ 07/Aug/17 ] |
|
This happened again, in a longevity job after almost 20 hours. Karaf.log [3] shows this exception: 2017-08-07 01:08:46,962 | WARN | pool-30-thread-1 | FlappingSingletonService | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | There was a problem re-registering flapping singleton service. Once again, GC log [4] shows a minor collection just before the exception: 2017-08-07T01:08:46.522+0000: 67044.318: [GC (Allocation Failure) [PSYoungGen: 610304K->50592K(626688K)] 1065438K->506662K(1191424K), 0.4388956 secs] [Times: user=2.25 sys=0.00, real=0.44 secs] [3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/16/odl1_karaf.log.gz |
| Comment by Robert Varga [ 08/Aug/17 ] |
|
The locking is funky and the GC pause may be what pushed it over the brink. I am investigating what to do about it. |
| Comment by Vratko Polak [ 08/Aug/17 ] |
|
This seems to be a consequence of the current I created a short run with the wait greatly reduced and debug logs [5]. 2017-08-08 13:34:31,301 | INFO | h for user karaf | command | 266 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Register_Candidates ]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false] ]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup flapping-singleton-service ]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false] ]], state=LOCAL_OWNERSHIP_GRANTED [wasOwner=false, isOwner=true, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup flapping-singleton-service ]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] ]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] for ClusterSingletonServiceGroup flapping-singleton-service ]], state=LOCAL_OWNERSHIP_LOST_NEW_OWNER [wasOwner=true, isOwner=false, hasOwner=true], inJeopardy=false] |
| Comment by Robert Varga [ 09/Aug/17 ] |
|
WIP patch: https://git.opendaylight.org/gerrit/61420 |
| Comment by Kit Lou [ 29/Aug/17 ] |
|
Hi Robert, Do you have an ETA on when the patch will be ready to merge? Thanks! |
| Comment by Vratko Polak [ 06/Sep/17 ] |
|
> WIP patch: https://git.opendaylight.org/gerrit/61420 Merged to all active branches. |