[CONTROLLER-1928] Regression detected in CSIT Created: 30/Jan/20  Updated: 05/Feb/20  Resolved: 05/Feb/20

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Magnesium, Sodium SR2
Fix Version/s: Magnesium, Sodium SR2

Type: Bug Priority: Highest
Reporter: Luis Gomez Assignee: Tomas Cere
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to CONTROLLER-1927 Transaction can become stuck in COMMI... Resolved

 Description   

This recent fix patch:

https://git.opendaylight.org/gerrit/#/c/controller/+/87043/

Produced regressions in these 2 cluster tests:

1) https://jenkins.opendaylight.org/releng/job/controller-csit-3node-rest-clust-cars-perf-ask-only-sodium/
Cars performance test: 10K cars are configured in a cluster and the leader is then isolated from the cluster. Once a new leader is elected, the cars are verified on the new leader. This test fails because not all cars are seen on the new leader.

2) https://jenkins.opendaylight.org/releng/job/controller-csit-1node-akka1-all-sodium/
Upgrade test: A single controller is loaded with old SW (Carbon in the test) and some configuration. After that, the node is shut down and the data (snapshots folder) is transferred to a new SW (Sodium SR2) controller. The new controller fails to boot up and there are GC and OOM errors in the karaf.log.



 Comments   
Comment by Luis Gomez [ 02/Feb/20 ]

The candidate fix seems to work fine and addresses the first regression; however, I still see problems in the second (upgrade) test. Even when I try to upgrade from Sodium SR1, and even when I run the "only" test, the controller fails to boot up:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-1node-akka1-all-sodium/478/odl_1/odl1_karaf.log.gz

Any idea why it is failing?

Comment by Tomas Cere [ 04/Feb/20 ]

Is it possible to grab the Carbon snapshot somewhere so I can try it out locally?

Comment by Jamo Luhrsen [ 04/Feb/20 ]

As reported here, I saw an OOM/GC crash when doing a simple
netconf test to mount 10 devices. Each device has a relatively large yang schema. The test was done using the revert patch
(87277), but now, re-reading the comments here, I realize it probably was not intended to fix the 2nd issue described (OOM).

Comment by Luis Gomez [ 05/Feb/20 ]

tcere, I tried the CSIT test with several distributions. In theory the test should be using the previous ODL release (Sodium SR1), but I figured out it is broken and uses old Carbon instead. Anyway, this is the link if you want to download ODL distros: https://nexus.opendaylight.org/content/repositories/opendaylight.release/org/opendaylight/integration/karaf/

Comment by Luis Gomez [ 05/Feb/20 ]

That seems probably unrelated to this problem, so let's:

1) Wait for the next RC CSIT results to confirm there are no issues with the revert.
2) Dig more into your issue, e.g. is it present in Sodium SR1 or master?

Comment by Jamo Luhrsen [ 05/Feb/20 ]

For the OOM/GC crash I am seeing with my one-off netconf mount test, I also see it in SR1. So, although likely a very
serious bug, it's not a regression. I have filed a new Jira for this issue and we can discuss there whether we
want to block SR2 on it or let it go since it's not exactly a regression. Not exactly, because it was not there in the
original/first Sodium release.
