[NETCONF-716] regression in netconf-scale CSIT between Mg SR1 and SR2 Created: 30/Jul/20 Updated: 04/Sep/20 Resolved: 04/Sep/20 |
|
| Status: | Verified |
| Project: | netconf |
| Component/s: | None |
| Affects Version/s: | Magnesium SR2, Sodium SR4 |
| Fix Version/s: | Aluminium, 1.13.0, Magnesium SR3 |
| Type: | Bug | Priority: | Highest |
| Reporter: | Jamo Luhrsen | Assignee: | Jamo Luhrsen |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
This job appears to have a regression between SR1 and SR2. The job will pass from …

In the sandbox, I ran a job with the SR1 release, which saw two failures in 35 iterations, …

Jenkins job history in the web UI only goes back 30 builds or so, but I pulled all the …

There seem to be two points of interest. The job was mostly passing for quite some …

Nothing in the netconf project stands out to me in the June 6th timeframe, but there …

Here is a karaf.log from the most recent failed job. There is some ugliness that … |
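The description mentions that the Jenkins web UI only keeps the last ~30 builds while the full history was pulled another way. As a hedged illustration only (not necessarily the method used for this ticket), the Jenkins JSON API can return the complete build list; the job name and the allBuilds tree selector below are assumptions about the public OpenDaylight Jenkins instance:

```java
// Hypothetical sketch: fetch the complete build history (number, result, timestamp)
// of the scale job from the Jenkins JSON API, beyond the ~30 builds shown in the UI.
// The job name and the "allBuilds" selector are assumptions, not taken from the ticket.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PullBuildHistory {
    public static void main(String[] args) throws Exception {
        String job = "netconf-csit-1node-scale-only-magnesium";
        // "allBuilds" asks Jenkins for the full history instead of only the recent page.
        String url = "https://jenkins.opendaylight.org/releng/job/" + job
                + "/api/json?tree=allBuilds[number,result,timestamp]";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // One {number, result, timestamp} entry per build; enough to chart pass/fail
        // over time and look for the points where the failure rate changed.
        System.out.println(response.body());
    }
}
```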
| Comments |
| Comment by Robert Varga [ 31/Jul/20 ] |
|
So I took a look at https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-scale-only-magnesium/326/ (failed) versus https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-scale-only-magnesium/299/ (successful), and one thing that is immediately noticeable is that the logs from the testtool are different: testtool--netconf-scale-txt-GetSingle in the failed case is a lot smaller and reports an inability to bind ports 17830 and up – which seems to indicate a failed cleanup beforehand. Also, the Getmulti log is roughly 39% of the successful run – where success covers ~38 minutes and failure covers ~19 minutes. I strongly suspect there is something fishy in the test suite.
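Since the failed run reports an inability to bind ports 17830 and up, a quick pre-flight probe of that port range can confirm whether a previous testtool instance was left behind. This is a minimal sketch, assuming the testtool base port of 17830 mentioned above and a port count passed on the command line; it is not part of the existing suite:

```java
// Hypothetical pre-flight check: try to bind each port in the testtool range and
// report any port that is still held by a leftover process from a previous run.
import java.io.IOException;
import java.net.ServerSocket;

public class PortRangeCheck {
    public static void main(String[] args) {
        int base = 17830;
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        for (int port = base; port < base + count; port++) {
            try (ServerSocket socket = new ServerSocket(port)) {
                // Bound and immediately released: the port is free for the next run.
            } catch (IOException e) {
                System.err.println("Port " + port + " is still in use: " + e.getMessage());
            }
        }
    }
}
```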
|
| Comment by Robert Varga [ 31/Jul/20 ] |
|
The other thing is we have a bunch of InterruptedByTimeoutExceptions in SessionTimeoutListener. I think these require client/server heartbeats to be enabled ... which we tracked somewhere else. |
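For context on the heartbeat remark, here is a minimal sketch of enabling client-side keepalives on a standalone Apache MINA SSHD client, so an otherwise idle session keeps traffic flowing and is less likely to hit a read timeout. It is not netconf project code, and the "heartbeat-interval" property name is an assumption about the mina-sshd 2.x configuration key; verify it against the sshd version actually shipped:

```java
// Hedged sketch, not code from this project: configure an SSH keepalive interval on a
// mina-sshd client. The property name "heartbeat-interval" is assumed; check it against
// the mina-sshd version in use before relying on it.
import java.util.concurrent.TimeUnit;

import org.apache.sshd.client.SshClient;
import org.apache.sshd.common.PropertyResolverUtils;

public class HeartbeatClient {
    public static void main(String[] args) {
        SshClient client = SshClient.setUpDefaultClient();
        // Send a keepalive every 30 seconds while the session is otherwise idle.
        PropertyResolverUtils.updateProperty(client, "heartbeat-interval",
                TimeUnit.SECONDS.toMillis(30));
        client.start();
        // ... open sessions / channels as usual, then:
        client.stop();
    }
}
```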
| Comment by Jamo Luhrsen [ 04/Aug/20 ] |
|
rovarga noticed there are a lot of device disconnects in the karaf.log, even in the passing cases. For reference, comparing two logs 1 2 where …

In mid-May, we got the work to shade sshd in order to disassociate from the karaf-provided sshd (to fix a bug with the karaf shell ssh, iirc). That timeframe …

I'm experimenting in that area with something like this to see if it gets rid of the regression. |
| Comment by Robert Varga [ 26/Aug/20 ] |
|
Alright, so downgrading mina-sshd seems to have stabilized the test suite, but we are still seeing timeouts. I'll try longevity locally and hopefully will be able to make some sense of this.
| Comment by Robert Varga [ 26/Aug/20 ] |
|
Okay, the two additional patches seem to stabilize sessions on the use cases tested by the CSIT suite. Let's bring them in and see if they restore sanity.
| Comment by Jamo Luhrsen [ 26/Aug/20 ] |
|
I ran two jobs in the sandbox since yesterday on Aluminium. One is running with the distribution created from …

Just fyi. |
| Comment by Jamo Luhrsen [ 26/Aug/20 ] |
|
Can you point me to the two patches?
| Comment by Robert Varga [ 26/Aug/20 ] |
|
https://git.opendaylight.org/gerrit/q/I98ef51b4346c2aee0c9c87d58035bc08ef2c77d1 Both will be part of the next autoreleases.
| Comment by Robert Varga [ 26/Aug/20 ] |
|
To throw some variance across branches and get some results overnight, https://git.opendaylight.org/gerrit/c/netconf/+/92299 will reinstate sshd-2.5.1 on Silicon and we can compare the results.
| Comment by Robert Varga [ 27/Aug/20 ] |
|
https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-1node-scale-only-silicon/23/ succeeded with sshd-2.5.1; the only timeouts are on Karaf SSH. I have given it another try to see where it is.

https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-1node-scale-only-aluminium/181/ succeeded with sshd-2.3.0, same as above.

https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-1node-scale-only-magnesium/355/ seems to contain some failure – but we'll need to have it finish before we know more.

I have triggered a Sodium AR build to get a read on that branch, too.
|
| Comment by Robert Varga [ 04/Sep/20 ] |
|
Everything except Mg is stabilized. I will open a separate issue to track the Mg failure. |