[INFRAUTILS-18] JobCoordinator getIncompleteTaskCount() does not reach 0 in component test after 30s Created: 27/Oct/17  Updated: 31/May/18  Resolved: 31/May/18

Status: Verified
Project: infrautils
Component/s: General
Affects Version/s: None
Fix Version/s: Oxygen

Type: Bug Priority: Highest
Reporter: Michael Vorburger Assignee: Michael Vorburger
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

https://logs.opendaylight.org/releng/jenkins092/netvirt-verify-oxygen-mvn33-openjdk8/2245/vpnservice/aclservice/impl/target/surefire-reports/org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest-output.txt.gz for https://jenkins.opendaylight.org/releng/job/netvirt-verify-oxygen-mvn33-openjdk8/2245/console for https://git.opendaylight.org/gerrit/#/c/64381/ illustrates a case of AclServiceStatefulTest failing on Jenkins (only), but passing locally, due to some what I suspect to be a timing issue / concurrency bug in JobCoordinator...

... basically, this kind of component test failure:

newInterfaceWithMultipleAcl(org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest)  Time elapsed: 30.396 sec  <<< ERROR!
 org.awaitility.core.ConditionTimeoutException: Condition with alias 'TestableJobCoordinatorEventsWaiter' didn't complete within 30 seconds
 because lambda expression in org.opendaylight.genius.datastoreutils.testutils.TestableJobCoordinatorEventsWaiter: expected <0L> but was <3>.

is "just" telling us that the JobCoordinator still has 3 pending jobs that still did not execute, even after the test was waiting for it for 30 seconds. It's possible that Jenkins is slower than local builds, but 30s seems an awfully sufficiently long time for some thread to do do work.. more likely, somehow it got stuck?



 Comments   
Comment by Michael Vorburger [ 27/Oct/17 ]

Doubt https://git.opendaylight.org/gerrit/#/c/64685/ will fix this? Merged it anyway as ready now, but probably unrelated.

Raised https://git.opendaylight.org/gerrit/#/c/64806/ and https://git.opendaylight.org/gerrit/#/c/64805/ to see if we can shed some more light on what exact state it's in when it gets stuck like that...

Comment by Michael Vorburger [ 27/Oct/17 ]

https://git.opendaylight.org/gerrit/64807 temporarily adds an @Ignore to AclServiceStatefulTest until we figure out what's wrong with JobCoordinator... the idea is that once c/64806/ and c/64805/ are in to give us more debug details when it gets stuck, we'll revert c/64807 and hope to see the failure again on Jenkins, with more details, and then see.

Comment by Tom Pantelis [ 27/Oct/17 ]

I've seen these failures locally as well.

Comment by Tom Pantelis [ 27/Oct/17 ]

I got a failure locally:

FAILURE! - in org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest
newInterfaceWithMultipleAcl(org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest) Time elapsed: 120.401 sec <<< ERROR!

org.opendaylight.infrautils.testutils.LogCaptureRuleException: Expected no error log, but: Details about stuck JobCoordinator: JobCoordinatorImpl{incompleteTasks=2, pendingTasks=2, failedJobs=0, clearedTasks=228, createdTasks=230, executeAttempts=124, retriesCount=0, fjPool=java.util.concurrent.ForkJoinPool@66cd5f16[Running, parallelism = 4, size = 0, active = 0, running = 0, steals = 58, tasks = 0, submissions = 0], jobQueueMap={ACL=JobQueue

{executing=null, queue=[]}, port1=JobQueue{executing=null, queue=[]}

, port2=JobQueue{executing=null, queue=[]}}, jobQueueMapLock=java.util.concurrent.locks.ReentrantLock@315dd1ef[Unlocked], scheduledExecutorService=java.util.concurrent.ScheduledThreadPoolExecutor@2843960a[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], isJobAvailable=false, jobQueueMapCondition=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@55b87372}

org.awaitility.core.ConditionTimeoutException: Condition with alias 'TestableJobCoordinatorEventsWaiter' didn't complete within 120 seconds because lambda expression in org.opendaylight.genius.datastoreutils.testutils.TestableJobCoordinatorEventsWaiter: expected <0L> but was <2>.

The queues are empty and FJ pool has nothing - I think the incompleteTasks and pendingTasks counters are wrong. I think the problem is with the static nature of the JobCoordinatorCounters. Since they're enums, there's of course only one instance in the JVM but they're reused per JobCoordinator so a subsequent test may see the residual counts from the prior test's JobCoordinator. This wasn't an issue with the DataStoreJobCoordinator b/c there was a static instance but now each test creates a new JobCoordinator.

The JobCoordinatorCounters can't be static - I'll change it and (hopefully) verify it fixes the problem. It failed after 14 runs - I say if it can get to 50 clean then I'll declare victory.

Comment by Stephen Kitt [ 27/Oct/17 ]

Having enums for this seems quite strange — enums aren’t ever supposed to store varying state...

Comment by Tom Pantelis [ 27/Oct/17 ]

Agree - JobCoordinatorCounters should be local to the JobCoordinator instance. I'll push a patch for that - should fix this issue.

Comment by Tom Pantelis [ 27/Oct/17 ]

I made the changes and the AclServiceStatefulTest ran 41 times successfully. One test failed for another reason:

FAILURE! - in org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest
newInterfaceWithIcmpAcl(org.opendaylight.netvirt.aclservice.tests.AclServiceStatefulTest) Time elapsed: 1.533 sec <<< FAILURE!
org.junit.ComparisonFailure: Expected and actual beans do not match expected:<...ilter_Egress_10.0.0.[2/32_5000"
flowName = "ACL"
hardTimeOut = 0
idleTimeOut = 0
instructionInfoList = #[
new InstructionWriteMetadata(4bi, 16777214bi),
new InstructionGotoTable(213 as short)
]
matchInfoList = #[
new MatchMetadata(83886080000bi, 1099494850560bi),
new MatchEthernetType(2048L),
new MatchIpv4Destination(new Ipv4Prefix("10.0.0.2/32"))
]
....

That looks like a failure in the app code and not the JC. The original failure caused by the JC appears to be fixed.

Comment by Tom Pantelis [ 27/Oct/17 ]

Submitted https://git.opendaylight.org/gerrit/#/c/64824/

Comment by Michael Vorburger [ 30/Oct/17 ]

Thanks Tom! Merged c/64824/ (above), and raised new https://git.opendaylight.org/gerrit/#/c/64827/ to un-revert ignored netvirt aclservice test.

Comment by Michael Vorburger [ 30/Oct/17 ]

PS, clean up in https://git.opendaylight.org/gerrit/#/c/64869/

Generated at Wed Feb 07 20:02:03 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.