[INFRAUTILS-57] JobCoordinatorTest is flaky and (very rarely) fails on the build Created: 11/Oct/18 Updated: 11/Jan/19 Resolved: 11/Jan/19 |
|
| Status: | Resolved |
| Project: | infrautils |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Neon |
| Type: | Bug | Priority: | Medium |
| Reporter: | Michael Vorburger | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
The merge job of https://git.opendaylight.org/gerrit/#/c/76810/ in https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/infrautils-merge-neon/84 just failed with: [jobcoordinator-main-task-1] ERROR org.opendaylight.infrautils.jobcoordinator.internal.JobCoordinatorImpl - Job still failed on final retry: JobEntry{key='org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest', mainWorker=org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest$TestCallable@21affc49, rollbackWorker=null, retryCount=3/3, futures=[com.google.common.util.concurrent.ImmediateFuture$ImmediateFailedFuture@671256e1[status=FAILURE, cause=[org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest$JobException: Job is failed intentionally]]]}
org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest$JobException: Job is failed intentionally
at org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest.<clinit>(JobCoordinatorTest.java:49)
This is the first time I've seen this failure. It passes locally, and I expect it will also pass on the "remerge". Still, someone should look into if this can be made more reliable. Must be concurrency issue. |
| Comments |
| Comment by Tom Pantelis [ 11/Oct/18 ] |
|
Looks like the failure is actually: Not true that executeAttempts (<0>) is equal to <1> at org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest.assertExecuteAttempts(JobCoordinatorTest.java:362) at org.opendaylight.infrautils.jobcoordinator.tests.JobCoordinatorTest.testJobNoExceptionReturnNull(JobCoordinatorTest.java:173) |
| Comment by Tom Pantelis [ 11/Oct/18 ] |
|
This is a timing issue where jobExecuteAttempts may not get marked/incremented. The JobQueueHandler does this:
JobQueue jobQueue = entry.getValue();
if (jobQueue.getExecutingEntry() != null) {
jobExecuteAttempts.mark();
continue;
}
So jobExecuteAttempts is only incremented if the JobQueue already has an executing entry. I'm unclear as to the reason for this logic and what jobExecuteAttempts is intended to signify. The test will fail if the JobQueueHandler happens to start and initially runs the above code (when getExecutingEntry() is null) and also the job task is executed and completes before the isJobAvailable flag is set. This can be reproduced every time by inserting a sleep before setting isJobAvailable in signalForNextJob. As mentioned I don't really see the purpose of jobExecuteAttempts wrt how it's implemented nor the purpose of asserting that it's 1 in the test(s). I think we should remove the call to assertExecuteAttempts(1) since it's non-deterministic whether or not it will be 1. I'll push a patch. |
| Comment by Tom Pantelis [ 11/Oct/18 ] |
| Comment by Michael Vorburger [ 12/Oct/18 ] |
|
I've merged c/76915 to prevent this from failing & blocking builds, but have a follow-up question: If you find that jobExecuteAttempts is basically pointless (becaues unreliable), how about just removing that counter all together? I mean in the Impl, not just in the test. PS: Added k.faseela as Watcher to this issue, Faseela jump in when you are back, or point this out to your internal friends with an intersted in the JobCoordinator. |
| Comment by Tom Pantelis [ 12/Oct/18 ] |
|
agreed but first I just wanted to prevent the sporadic failures. I don't know the history or intent of that metric or if it's used - let's see what Faseela et al say. |
| Comment by Michael Vorburger [ 31/Oct/18 ] |
|
k.faseela given above do you (or any of your friends at E//) have any objections to the complete removal of the ExecuteAttempts metrics as just proposed via https://git.opendaylight.org/gerrit/#/c/77378/ ? tpantelis FYI - I'm just trying to wrap up this loose end. |
| Comment by Faseela K [ 31/Oct/18 ] |
|
vorburger : I have just circulated an internal email to check this, but won't it be good to have this metrics, in case some job is stuck? |
| Comment by Michael Vorburger [ 31/Oct/18 ] |
|
My understanding, based just on reading the comments from tpantelis above, is that this particular metric is unreliable / broken, so best removed to avoid confusion. |