[INFRAUTILS-18] JobCoordinator getIncompleteTaskCount() does not reach 0 in component test after 30s Created: 27/Oct/17 Updated: 31/May/18 Resolved: 31/May/18 |
|
| Status: | Verified |
| Project: | infrautils |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | Oxygen |
| Type: | Bug | Priority: | Highest |
| Reporter: | Michael Vorburger | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
https://logs.opendaylight.org/releng/jenkins092/netvirt-verify-oxygen-mvn33-openjdk8/2245/vpnservice/aclservice/impl/target/surefire-reports/org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest-output.txt.gz for https://jenkins.opendaylight.org/releng/job/netvirt-verify-oxygen-mvn33-openjdk8/2245/console for https://git.opendaylight.org/gerrit/#/c/64381/ illustrates a case of AclServiceStatefulTest failing on Jenkins (only), but passing locally, due to some what I suspect to be a timing issue / concurrency bug in JobCoordinator... ... basically, this kind of component test failure: newInterfaceWithMultipleAcl(org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest) Time elapsed: 30.396 sec <<< ERROR! org.awaitility.core.ConditionTimeoutException: Condition with alias 'TestableJobCoordinatorEventsWaiter' didn't complete within 30 seconds because lambda expression in org.opendaylight.genius.datastoreutils.testutils.TestableJobCoordinatorEventsWaiter: expected <0L> but was <3>. is "just" telling us that the JobCoordinator still has 3 pending jobs that still did not execute, even after the test was waiting for it for 30 seconds. It's possible that Jenkins is slower than local builds, but 30s seems an awfully sufficiently long time for some thread to do do work.. more likely, somehow it got stuck? |
| Comments |
| Comment by Michael Vorburger [ 27/Oct/17 ] |
|
Doubt https://git.opendaylight.org/gerrit/#/c/64685/ will fix this? Merged it anyway as ready now, but probably unrelated. Raised https://git.opendaylight.org/gerrit/#/c/64806/ and https://git.opendaylight.org/gerrit/#/c/64805/ to see if we can shed some more light on what exact state it's in when it gets stuck like that... |
| Comment by Michael Vorburger [ 27/Oct/17 ] |
|
https://git.opendaylight.org/gerrit/64807 temporarily adds an @Ignore to AclServiceStatefulTest until we figure out what's wrong with JobCoordinator... the idea is that once c/64806/ and c/64805/ are in to give us more debug details when it gets stuck, we'll revert c/64807 and hope to see the failure again on Jenkins, with more details, and then see. |
| Comment by Tom Pantelis [ 27/Oct/17 ] |
|
I've seen these failures locally as well. |
| Comment by Tom Pantelis [ 27/Oct/17 ] |
|
I got a failure locally: FAILURE! - in org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest org.opendaylight.infrautils.testutils.LogCaptureRuleException: Expected no error log, but: Details about stuck JobCoordinator: JobCoordinatorImpl{incompleteTasks=2, pendingTasks=2, failedJobs=0, clearedTasks=228, createdTasks=230, executeAttempts=124, retriesCount=0, fjPool=java.util.concurrent.ForkJoinPool@66cd5f16[Running, parallelism = 4, size = 0, active = 0, running = 0, steals = 58, tasks = 0, submissions = 0], jobQueueMap={ACL=JobQueue {executing=null, queue=[]}, port1=JobQueue{executing=null, queue=[]}, port2=JobQueue{executing=null, queue=[]}}, jobQueueMapLock=java.util.concurrent.locks.ReentrantLock@315dd1ef[Unlocked], scheduledExecutorService=java.util.concurrent.ScheduledThreadPoolExecutor@2843960a[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], isJobAvailable=false, jobQueueMapCondition=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@55b87372} org.awaitility.core.ConditionTimeoutException: Condition with alias 'TestableJobCoordinatorEventsWaiter' didn't complete within 120 seconds because lambda expression in org.opendaylight.genius.datastoreutils.testutils.TestableJobCoordinatorEventsWaiter: expected <0L> but was <2>. The queues are empty and FJ pool has nothing - I think the incompleteTasks and pendingTasks counters are wrong. I think the problem is with the static nature of the JobCoordinatorCounters. Since they're enums, there's of course only one instance in the JVM but they're reused per JobCoordinator so a subsequent test may see the residual counts from the prior test's JobCoordinator. This wasn't an issue with the DataStoreJobCoordinator b/c there was a static instance but now each test creates a new JobCoordinator. The JobCoordinatorCounters can't be static - I'll change it and (hopefully) verify it fixes the problem. It failed after 14 runs - I say if it can get to 50 clean then I'll declare victory. |
| Comment by Stephen Kitt [ 27/Oct/17 ] |
|
Having enums for this seems quite strange — enums aren’t ever supposed to store varying state... |
| Comment by Tom Pantelis [ 27/Oct/17 ] |
|
Agree - JobCoordinatorCounters should be local to the JobCoordinator instance. I'll push a patch for that - should fix this issue. |
| Comment by Tom Pantelis [ 27/Oct/17 ] |
|
I made the changes and the AclServiceStatefulTest ran 41 times successfully. One test failed for another reason: FAILURE! - in org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest That looks like a failure in the app code and not the JC. The original failure caused by the JC appears to be fixed. |
| Comment by Tom Pantelis [ 27/Oct/17 ] |
| Comment by Michael Vorburger [ 30/Oct/17 ] |
|
Thanks Tom! Merged c/64824/ (above), and raised new https://git.opendaylight.org/gerrit/#/c/64827/ to un-revert ignored netvirt aclservice test. |
| Comment by Michael Vorburger [ 30/Oct/17 ] |
|
PS, clean up in https://git.opendaylight.org/gerrit/#/c/64869/ |