[CONTROLLER-1747] Leader movement longevity job is unstable Created: 09/Aug/17  Updated: 25/Jul/23  Resolved: 11/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8959

 Description   

This weekend, job controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon failed to finish correctly on RelEng. Looking at the console output [0], the Robot execution passed, but while post-processing the Robot data, the connection to the Robot VM was lost.

An attempt to reproduce this behavior on Sandbox failed in a different way: a ReadTimeout happened after 11 hours, and from the console output [1] it is clear that the connections to the ODL VMs were lost.

The job has never shown these types of behavior before, so this is a regression. Both failure types can be explained by an exception generating long restconf outputs and a large karaf log: the ODL VM fails when its disk is full, and the Robot VM fails when output.xml is too large.

Unfortunately, attempts to reproduce with shorter runtimes on Sandbox are failing so far (meaning the test passes without issues), so the real cause might be something different, including just bad luck with the infrastructure on two occasions.

[0] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/18/console
[1] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/1/console
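
To make the hypothesis above concrete: a minimal sketch of a check one could run periodically on the VMs to confirm it. The paths and the use of the root filesystem are assumptions, not taken from the actual job configuration.

    import os
    import shutil

    # Hypothetical locations; adjust to the actual VM layout.
    KARAF_LOG = "/tmp/karaf/data/log/karaf.log"  # assumed ODL log path
    OUTPUT_XML = "output.xml"                    # Robot result file in the workspace

    def check_disk(path="/"):
        # Report free space on the filesystem holding 'path'.
        usage = shutil.disk_usage(path)
        free_gib = usage.free / 2**30
        print(f"free space on {path}: {free_gib:.1f} GiB")
        return free_gib

    def check_size(path):
        # Report the size of a potentially runaway file, if it exists.
        if os.path.exists(path):
            size_mib = os.path.getsize(path) / 2**20
            print(f"{path}: {size_mib:.1f} MiB")
            return size_mib
        print(f"{path}: missing")
        return 0.0

    if __name__ == "__main__":
        check_disk("/")
        check_size(KARAF_LOG)
        check_size(OUTPUT_XML)

If the hypothesis is right, such a probe would show karaf.log (or the disk overall) growing steadily on the ODL VM, and output.xml growing on the Robot VM, well before either VM becomes unreachable.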



 Comments   
Comment by Vratko Polak [ 22/Aug/17 ]

> bad luck with the infrastructure on two occasions.

This seems increasingly likely; marking as WORKSFORME.

Comment by Vratko Polak [ 04/Sep/17 ]

Another failure of this type happened [2]; specifically, the Robot VM crashed when processing output.xml, probably due to running out of free RAM.

There is still not enough information to draw a conclusion.
The controller might be guilty of generating large log files, but that has yet to be confirmed.

[2] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/22/console
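
The RAM exhaustion is plausible because Robot's result post-processing parses all of output.xml into an in-memory model. As a hypothetical way to inspect a suspect file without loading it all at once, one could stream it with iterparse; the element tags counted below are the ones Robot output.xml uses, which is an assumption about the Robot version on the VM.

    import os
    import xml.etree.ElementTree as ET

    OUTPUT_XML = "output.xml"  # assumed location in the Jenkins workspace

    size_mib = os.path.getsize(OUTPUT_XML) / 2**20
    print(f"output.xml size: {size_mib:.1f} MiB")

    # Stream through the file element by element, clearing each subtree
    # after counting it, so memory use stays low even for huge results.
    counts = {}
    for event, elem in ET.iterparse(OUTPUT_XML, events=("end",)):
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
        elem.clear()  # free the children of the element just counted

    for tag in ("suite", "test", "kw", "msg"):
        print(f"{tag}: {counts.get(tag, 0)}")

A disproportionately large msg count, or a file size far above previous runs, would support the theory that an exception is flooding the log rather than the infrastructure simply being flaky.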

Comment by Vratko Polak [ 04/Sep/17 ]

This can also affect other longevity jobs, such as [3].

[3] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-drb-precedence-longevity-only-carbon/22/console

Comment by Vratko Polak [ 11/Sep/17 ]

Another failure happened [4]; this time the Robot VM crash was perhaps preceded by an ODL VM crash.

[4] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-notifications-longevity-only-carbon/28/console
