NetVirt CSIT Improvements (NETVIRT-1086)

[NETVIRT-1089] Add trace:transactions to suite teardowns Created: 13/Feb/18  Updated: 21/Dec/18

Status: In Progress
Project: netvirt
Component/s: None
Affects Version/s: None
Fix Version/s: Fluorine-SR2, Neon

Type: Sub-task Priority: Medium
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
blocks NETVIRT-878 CSIT should help to detect possible m... In Progress
blocks MDSAL-213 Serializing DataObject to JSON causes... Resolved
is blocked by CONTROLLER-1765 trace:transactions should suppress fa... Resolved
is blocked by NETVIRT-1296 Transaction leak in VpnOpStatusListener Resolved
is blocked by CONTROLLER-1832 Transaction Trace tool wiring creates... Resolved
is blocked by CONTROLLER-1834 Transaction Trace tool wiring for pin... Resolved
is blocked by NETVIRT-1295 Transaction leak in ExternalRoutersLi... Resolved
is blocked by OPNFLWPLUG-1043 Transaction leak in ForwardingRulesMa... Resolved

 Description   

initially, just add the call. we can save the results to a file and log them like we do with other
things like packet traces, exceptions, etc.

but hopefully, at some point, we can fail if the output is non-empty, which probably
indicates some leftover unfinished (aka leaking) transactions



 Comments   
Comment by Jamo Luhrsen [ 18/May/18 ]

I was remembering a jira ticket about this task that should have comments or updates. this one is
blank. Oh, I see it. It's NETVIRT-878 which is mentioned on this ticket to be blocked by this.

Anyway, smalleni mentioned he might want to look at this as a task to get familiar with CSIT,
jenkins, robot, sandbox, etc etc. So, I added him as a watcher.

However, we were able to quickly run this patch in the sandbox again and I can confirm again that it
introduces some troubles/failures that we don't expect. we ran the same job without the trace:transactions
happening on every test case teardown and it's 100% passing . I have not dug in to the what/why/how, but
clearly doing this trace:transactions is having some kind of negative effect on our jobs. Maybe there is
some simple explanation and fix, but we have to figure this out before we can get this in to our tests.

smalleni, feel like digging around?

Comment by Michael Vorburger [ 22/May/18 ]

I'm interested in getting this done, so motivated to help. Had a look at the "troubles/failures" linked above - 13 CSIT failures? What could perhaps be useful as a next step is if one of you could look through the karaf.log (is it here?), and somehow compare a failing one with trace:transaction with one without where CSIT passes 100% - if you spot any errors / exceptions there which only happen with trace:transactions, I may be able to understand the cause. (In theory, it should not cause any functional difference, as the trace feature just installs a kind of wrapper.)

Comment by Jamo Luhrsen [ 23/May/18 ]

ok, so first of all MY BAD in that the job I linked to with failures did not actually even install the odl-mdsal-trace feature.
it was running "trace:transactions" on each test case though resulting in a command not found. And just doing that was
enough to cause failures.

I did look at the karaf.log (yes, vorburger, you have the right place for those logs) but nothing stood out.

The first few failures are end-user kind of failures where openstack instances have connectivity problems. We could
try to dig in on the debugs there to see why that is. But, the later failures, like this one looks like a test system
issue where we lost the ssh connection to one of our test VMs. That seems like something to chase. I wonder if our
step to do trace:transactions, which is done via ssh, is building up more connection cruft than we can handle eventually.
Or we are hitting some ssh bug in robot. Btw, we do close the ssh connection each time we run trace:transactions.

Anyway, just thinking out loud for now.

I am running this again here so lets wait on that before we dig any further. yes, I am running it with odl-mdsal-trace.

Comment by Jamo Luhrsen [ 24/May/18 ]

the new job finished, had failures and really did do the mdsal-trace feature install.

All the logs are here

I have not dug very deep, but I did notice a few things:

  • lots of exceptions and to compare this is the same job without this trace stuff to look at.
    • note that one is a "frozen class" which I recall seeing some email thread about. I think we have a red hat bugzilla around that (if it's the same)
    • a pretty serious netvirt functional failure where all instances in one network did not receive their IPs but another network was ok.
    • looks like the same problem with ssh connections that I mentioned in the previous comment is also happening here as well
    • in the final call to trace:transactions we can see a LOT (I think it's a lot) of unclosed transactions

let me know what you think vorburger and we can think of ideas to keep digging.

Comment by Michael Vorburger [ 04/Jun/18 ]

jluhrsen as per internal / private email: What I'm most interested in first here (we'll get back to transaction leaks after) is that you've managed to hit "frozen class" MDSAL-213 here! Could your re-run this and turn but up the logging for class org.opendaylight.mdsal.binding.dom.codec.gen.impl.AbstractStreamWriterGenerator all the way up to TRACE (hoping that doesn't lead to too much, we'll see; otherwise just do DEBUG), because I recently added additional new logging re. Frozen (not the movie but the bug), and provide that log on MDSAL-213 ?

Just for the record here and as previously stated elsewhere: It COULD be that frozen ONLY happens with features-mdsal-trace... If you are able to try it once completely without and once with odl-mdsal-trace, we (MDSAL-213) would be most curious whether that makes any difference. See also CONTROLLER-1832 and CONTROLLER-1834 which are vaguely related.

Comment by Michael Vorburger [ 04/Jun/18 ]

> in the final call to trace:transactions we can see a LOT (I think it's a lot) of unclosed transactions

I've opened NETVIRT-1295 and NETVIRT-1296 shague will you drive someone to "plug" those in netvirt weekly meetings?

Comment by Jamo Luhrsen [ 04/Jun/18 ]

working on it with this sandbox job

Comment by Jamo Luhrsen [ 04/Jun/18 ]

I ran one job with the mdsal-trace feature installed and one job without that feature. It looks like the failures are the same, so I no longer suspect this feature is causing any more problems. I'm guessing the failures (which aren't expected) are coming because of all the extra trace:transactions commands via ssh to karaf console. We'll have to dig in to that idea next.

Comment by Jamo Luhrsen [ 05/Jun/18 ]

vorburger pinged me on IRC if this jira was about fluorine and/or oxygen. this is an int/test patch and by
default will apply to both. If we don't want it in one or the other, that's possible. what's the thinking, vorburger?

Comment by Michael Vorburger [ 06/Jun/18 ]

Running this on top Fluorine AND Oxygen is great IMHO!

Comment by Michael Vorburger [ 30/Jul/18 ]

CONTROLLER-1765 is needed to finish up being able to fully automate this.

Comment by Jamo Luhrsen [ 31/Oct/18 ]

I finally had some time to look at this one again.

vorburger, over in CONTROLLER-1765 you guessed at using some number 7 as a threshold for which we might want to flag a
problem with a failure in csit. in our standard netvirt csit job, by the 2nd suite we are already seeing 14 open transactions. By the final suite, that number has grown to 114. I'm guessing that indicates a minor leak. what do you think?

Looks like there are other open transactions from the get go, but they do not increase. So possibly those are leak-able, but just not with the normal netvirt csit flow.

Unfortunately, something about this work (installing odl-mdsal-trace and issuing that karaf cli on each test case teardown) is causing a problem
with something unrelated. I have to figure that out before we can get this really working.

but, in the meantime we probably want to figure out if we have a leak to fix, and if not, how to deal with it (growing from
7-114) in our csit job.

Comment by Michael Vorburger [ 31/Oct/18 ]

I've opened OPNFLWPLUG-1043 but FYI it's newReadOnlyTransaction - less critical, AFAIK.

Comment by Jamo Luhrsen [ 31/Oct/18 ]

great thanks.

if it's less critical (as in, wont get fixed) then we'll need to ignore it when checking because we don't want to have
these things failing our csit which is our gates. maybe that makes it slightly more critical

Comment by Jamo Luhrsen [ 21/Dec/18 ]

~2m later, taking another look. Trying to put this in the apex job first.

step one complete
step two looks good so far
step three needs some iterations

Generated at Wed Feb 07 20:23:12 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.