[NETCONF-716] regression in netconf-scale CSIT between Mg SR1 and SR2 Created: 30/Jul/20  Updated: 04/Sep/20  Resolved: 04/Sep/20

Status: Verified
Project: netconf
Component/s: None
Affects Version/s: Magnesium SR2, Sodium SR4
Fix Version/s: Aluminium, 1.13.0, Magnesium SR3

Type: Bug Priority: Highest
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to INTTEST-123 netconf-csit-1node-scale-only-magnesi... Resolved

 Description   

This job appears to have a regression between SR1 and SR2. The job passes from
time to time, but for the most part it fails in the Getmulti and Getsingle suites, which
mount 500 instances of the netconf-testtool and then attempt to "issue requests" on
each device in order (essentially a GET to yang-ext:mount for each device). When the
Python tool issuing those requests runs into trouble, the test case fails.
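
For context, a minimal sketch (not the actual CSIT code) of what that "issue requests" step effectively does, assuming the pre-RFC8040 RESTCONF mount path, default admin credentials, and a hypothetical netconf-scaling-device-N naming scheme:

import requests

BASE = ("http://127.0.0.1:8181/restconf/config/"
        "network-topology:network-topology/topology/topology-netconf")
AUTH = ("admin", "admin")  # assumed default credentials

def issue_requests(device_count=500):
    """GET yang-ext:mount on each mounted testtool device, in order."""
    failures = []
    for i in range(device_count):
        node_id = f"netconf-scaling-device-{i}"  # hypothetical node naming
        url = f"{BASE}/node/{node_id}/yang-ext:mount"
        resp = requests.get(url, auth=AUTH, timeout=60)
        if resp.status_code != 200:
            failures.append((node_id, resp.status_code))
    return failures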

In the sandbox, I ran a job against the SR1 release, which saw two failures in 35 iterations,
whereas with a recently built distribution (i.e. SR2 bits) the results were 8 passes in 36 iterations.

Jenkins job history in the web UI only goes back about 30 builds, but I pulled all the
console logs from job #225 (May 1st, just before SR1 was released) through the most
recent build, #327. Some of those jobs were infra aborts, but it's approximately 90 data points.
Here is a hacky spreadsheet to illustrate the results.

There seem to be two points of interest. The job was mostly passing for quite some
time, then had 10 straight failures before becoming stable again; that happened at
job #265 (June 6th). Then job #301 failed (July 11th), and the job has been unstable
ever since.

Nothing in the netconf project stands out to me in the June 6th timeframe, but there
was an MRI bump on July 10th.

Here is a karaf.log from the most recent failed job. There is some ugliness that
may be a place to start looking: search for "Getmulti.Issue_Requests_On_Devices" and
scroll down. It seems that one mount session went down, followed by a lot of
"NETCONF operation failed" exceptions.
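
For offline triage, a hedged helper (not part of the CSIT suite) that tallies the relevant karaf.log messages from the "Getmulti.Issue_Requests_On_Devices" marker onwards; the message substrings are taken from the log excerpts discussed here:

import sys

def summarize(log_path, marker="Getmulti.Issue_Requests_On_Devices"):
    """Count notable messages appearing after the Getmulti marker in karaf.log."""
    counts = {"Session went down": 0, "NETCONF operation failed": 0}
    seen_marker = False
    with open(log_path, errors="replace") as log:
        for line in log:
            if marker in line:
                seen_marker = True
            if not seen_marker:
                continue
            for needle in counts:
                if needle in line:
                    counts[needle] += 1
    return counts

if __name__ == "__main__":
    for path in sys.argv[1:]:  # e.g. a passing and a failing karaf.log
        print(path, summarize(path))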



 Comments   
Comment by Robert Varga [ 31/Jul/20 ]

So I took a look at

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-scale-only-magnesium/326/ (failed) versus https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-scale-only-magnesium/299/ (successful)

and one thing that is immediately noticeable is that the logs from the testtool are different: testtool--netconf-scale-txt-GetSingle in the failed case is a lot smaller and reports an inability to bind ports 17830 and up, which seems to indicate a failed cleanup beforehand.

Also, the Getmulti log from the failed run is roughly 39% the size of the successful run's, where the successful run covers ~38 minutes and the failed one ~19 minutes. I strongly suspect there is something fishy in the test suite.

Comment by Robert Varga [ 31/Jul/20 ]

The other thing is that we have a bunch of InterruptedByTimeoutExceptions in SessionTimeoutListener. I think these require client/server heartbeats to be enabled ... which we tracked somewhere else.
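
If the heartbeats in question are the connector-level NETCONF keepalives, those can be enabled per mount via the netconf-node-topology:keepalive-delay leaf. A hedged sketch of such a mount request (host, port, credentials and node name are assumptions for illustration):

import requests

NODE_ID = "netconf-scaling-device-0"  # hypothetical device name
URL = ("http://127.0.0.1:8181/restconf/config/"
       "network-topology:network-topology/topology/topology-netconf/node/" + NODE_ID)

payload = {
    "node": [{
        "node-id": NODE_ID,
        "netconf-node-topology:host": "127.0.0.1",
        "netconf-node-topology:port": 17830,
        "netconf-node-topology:username": "admin",
        "netconf-node-topology:password": "admin",
        "netconf-node-topology:tcp-only": False,
        # A non-zero keepalive-delay (seconds) makes the connector send periodic keepalives.
        "netconf-node-topology:keepalive-delay": 120,
    }]
}

requests.put(URL, json=payload, auth=("admin", "admin")).raise_for_status()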

Comment by Jamo Luhrsen [ 04/Aug/20 ]

rovarga noticed there are a lot of device disconnects in the karaf.log, even in the passing cases. For reference, compare two logs (1, 2), where
the first is something recent with this regression and the second is from early May (the SR1 timeframe). The first has 193 instances of "Session went down"
in the karaf.log, whereas the second has none.

In mid-May we got the work in to shade sshd, in order to disassociate from the Karaf-provided sshd (to fix a bug with the Karaf shell's SSH, IIRC). That timeframe
doesn't match up with the description above, where the failures became prevalent around July 11th. However, there was a mina-sshd upgrade on July 7th
that might be a clue.

I'm experimenting in that area with something like this to see if it gets rid of the regression.

Comment by Robert Varga [ 26/Aug/20 ]

Alright, so downgrading mina-sshd seems to have stabilized the test suite, but we are still seeing timeouts. I'll try longevity locally and hopefully will be able to make some sense of this.

Comment by Robert Varga [ 26/Aug/20 ]

Okay, the two additional patches seem to stabilize sessions in the use cases exercised by the CSIT suite. Let's bring them in and see if they restore sanity.

Comment by Jamo Luhrsen [ 26/Aug/20 ]

I have run two jobs in the sandbox since yesterday on Aluminium. One is running with the distribution created by
the distro-check job for the master (Silicon) patch that reverted the sshd upgrade,
while the other is running the latest Aluminium distribution, which doesn't have that revert. The first job failed twice in 37 runs and the
second failed 7 times in 37. I was hoping for something more concrete, with 100% passing in the first job, but it's still a big difference.

Just FYI.

Comment by Jamo Luhrsen [ 26/Aug/20 ]

Can you point me to the two patches?

Comment by Robert Varga [ 26/Aug/20 ]

https://git.opendaylight.org/gerrit/q/I98ef51b4346c2aee0c9c87d58035bc08ef2c77d1
https://git.opendaylight.org/gerrit/q/I2c6ed07b5d3b29f6524b0ed99067895c4dfdc020

Both will be part of the next autoreleases.

Comment by Robert Varga [ 26/Aug/20 ]

To throw some variance across branches and get some results overnight, https://git.opendaylight.org/gerrit/c/netconf/+/92299 will reinstate sshd-2.5.1 on Silicon, and we can compare the results.

Comment by Robert Varga [ 27/Aug/20 ]

https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-1node-scale-only-silicon/23/ succeeded with sshd-2.5.1; the only timeouts are on Karaf SSH. I have given it another try to see where it stands.

https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-1node-scale-only-aluminium/181/ succeeded with sshd-2.3.0, same as above.

https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-1node-scale-only-magnesium/355/ seems to contain some failures, but we'll need to let it finish before we know more.

I have triggered a Sodium AR build to get a read on that branch, too.

 

Comment by Robert Varga [ 04/Sep/20 ]

Everything except Mg is stabilized. I will open a separate issue to track the Mg failure.
