[INFRAUTILS-8] JobCoordinator (ex-DataStoreJobCoordinator) job failures should indicate stack trace of original caller who submitted job Created: 09/Mar/17  Updated: 24/Sep/21  Resolved: 24/Sep/21

Status: Resolved
Project: infrautils
Component/s: General
Affects Version/s: (unspecified)
Fix Version/s: None

Type: Improvement
Reporter: Michael Vorburger
Assignee: Unassigned
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by INFRAUTILS-16 JobCoordinator enqueueJob must return... Resolved

 Description   

To better understand the root causes of issues such as GENIUS-58 in the future, it would be useful if the JobCoordinator (ex-DataStoreJobCoordinator; about to be moved from genius to infrautils in https://git.opendaylight.org/gerrit/#/c/51431/) indicated, for job failures, the stack trace of the original caller who submitted the job.

NB: https://bugs.opendaylight.org/show_bug.cgi?id=7917#c3

"Adding appropriate info in the caller's mainWorker toString would help to identify the originator. However I think capturing the caller's stack trace would be too expensive in production although it could be done in a debug mode."

I'm wondering how other Java frameworks avoid losing the stack of the original caller when working with async lambdas. There must be some prior art in this domain. It is probably worth learning a bit more about this through online research before jumping into an implementation.
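One common pattern for the debug-mode idea in the quoted comment is to create a Throwable when the job is submitted and attach it to whatever exception the job throws later. The following is only a minimal sketch under assumptions: CallerTracingRunnable and the jobs.captureCaller system property are hypothetical names, not part of JobCoordinator, and the code uses only the JDK.

```java
/** Hypothetical helper (not part of JobCoordinator): remembers where a job was submitted. */
final class CallerTracingRunnable implements Runnable {
    // Assumed switch: creating a Throwable per job is expensive, so capture is off by default.
    static final boolean DEBUG = Boolean.getBoolean("jobs.captureCaller");

    private final Runnable delegate;
    private final Throwable submitSite; // null when capture is disabled

    CallerTracingRunnable(Runnable delegate, boolean capture) {
        this.delegate = delegate;
        // The stack trace is filled in here, i.e. on the submitting thread.
        this.submitSite = capture ? new Throwable("Job was submitted here") : null;
    }

    /** Wraps a job, capturing the submit site only when the debug switch is set. */
    static Runnable wrap(Runnable job) {
        return new CallerTracingRunnable(job, DEBUG);
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (RuntimeException | Error e) {
            if (submitSite != null) {
                // The submit-site trace is printed under "Suppressed:" when e is logged.
                e.addSuppressed(submitSite);
            }
            throw e;
        }
    }
}
```

With capture enabled, logging the failure prints the worker-side stack trace followed by a "Suppressed:" section that points at the code which submitted the job. With capture disabled, the wrapper adds only a try/catch around the job.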



 Comments   
Comment by Michael Vorburger [ 09/Mar/17 ]

https://bugs.opendaylight.org/show_bug.cgi?id=7917#c5 :

> The only way to capture caller identity is to capture it via a Throwable.
> That is going to hurt performance a lot.

> Clean way of achieving this is to route the failure back to the requestor –
> which can then identify itself and provide any useful context.

> I mean, at the end of the day, the requestor needs to know about the failure, right?

Right... so the REAL problem here is that all the enqueueJob methods in the JobCoordinator (ex-DataStoreJobCoordinator) should, instead of void, return a ListenableFuture to which you can attach some sort of LoggingFutureCallback (via Futures.addCallback), right?

But even with this, you still wouldn't get a nice stack trace in a log, would you? You would just get an ERROR log from the lambda you passed as the onFailure to the FutureCallback, so this alone still wouldn't solve the real problem I was after above, I believe. Is there any solution to that?
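To make the shape of that idea concrete, here is a minimal sketch that uses the JDK's CompletableFuture as a stand-in for Guava's ListenableFuture, so that it stays self-contained. FutureReturningCoordinator and its enqueueJob signature are hypothetical and are not the real JobCoordinator API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical coordinator whose enqueueJob reports the outcome instead of returning void. */
final class FutureReturningCoordinator {
    private final Executor executor;

    FutureReturningCoordinator(Executor executor) {
        this.executor = executor;
    }

    <T> CompletableFuture<T> enqueueJob(String key, Callable<T> job) {
        CompletableFuture<T> result = new CompletableFuture<>();
        executor.execute(() -> {
            try {
                result.complete(job.call());
            } catch (Throwable t) {
                // Route the failure back to the requestor, tagged with the job key.
                result.completeExceptionally(
                        new IllegalStateException("Job " + key + " failed", t));
            }
        });
        return result;
    }
}
```

A caller could then attach something like `whenComplete((value, failure) -> ...)` and log the failure together with whatever context it knows about itself. As noted above, the callback's own stack trace still would not contain the submitting code, so the requestor has to supply that context explicitly.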

But returning a Future from JobCoordinator enqueueJob would also be very interesting for testability (it's something I've been battling with for the component tests).

Once https://git.opendaylight.org/gerrit/#/c/51431/ is in (I don't want to hold it back further), we probably should be changing & adding that then...

Comment by Michael Vorburger [ 22/Sep/17 ]

Opened new INFRAUTILS-16 for the part about returning a Future; this issue remains for the stack trace.

Comment by Michael Vorburger [ 04/Oct/17 ]

On further thought, I think capturing the stack trace (optionally, due to the perf impact) is a low priority, because e.g. in GENIUS-58 one DOES actually get a clue of the origin via the JobEntry mainWorker (it was InterfaceStateRemoveWorker, in that case).

One thing we ARE missing badly is context in case of thread death due to "Thread terminated due to uncaught exception" - such as in NETVIRT-937... and I've added that in https://git.opendaylight.org/gerrit/#/c/63965/.
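As a generic illustration of adding context on thread death (not a reproduction of the Gerrit change above), a ThreadFactory can install an uncaught exception handler that reports which thread died and why. ContextLoggingThreadFactory and its logger callback are hypothetical names, and the sketch uses only the JDK.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

/** Hypothetical factory whose threads report their name when they die unexpectedly. */
final class ContextLoggingThreadFactory implements ThreadFactory {
    private final String poolName;
    private final BiConsumer<String, Throwable> logger; // stand-in for a real logging API
    private final AtomicInteger counter = new AtomicInteger();

    ContextLoggingThreadFactory(String poolName, BiConsumer<String, Throwable> logger) {
        this.poolName = poolName;
        this.logger = logger;
    }

    @Override
    public Thread newThread(Runnable task) {
        Thread thread = new Thread(task, poolName + "-" + counter.incrementAndGet());
        // Called by the JVM on the dying thread when an exception escapes run().
        thread.setUncaughtExceptionHandler((dying, failure) ->
                logger.accept("Thread " + dying.getName()
                        + " terminated due to uncaught exception", failure));
        return thread;
    }
}
```

Passing such a factory to an executor gives every worker thread a descriptive name, so the log line at least identifies the pool the failing work ran on.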

Comment by Robert Varga [ 24/Sep/21 ]

INFRAUTILS-75 removed JobCoordinator, hence this is a non-issue.
