Troubleshooting Controller CSIT (NETVIRT-1315)

[NETVIRT-1349] Verify ODL up/down method Created: 26/Jun/18  Updated: 26/Jun/18

Status: Open
Project: netvirt
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Sub-task Priority: Medium
Reporter: Sam Hague Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

This subtask will verify whether kill -9 is acceptable or whether something more graceful should be used.

Most of the test suites use kill -9 to stop ODL. A comment was made that Akka and clustering behave differently in that scenario than they do for a graceful shutdown.
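
To make the two stop methods concrete, here is a minimal Python sketch of "graceful first, kill -9 as a fallback". It is not taken from the CSIT suites: the helper name stop_process and the grace_seconds parameter are hypothetical, and SIGTERM is used only as a stand-in for whatever graceful mechanism the suites end up using for ODL. It assumes a POSIX host.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=10.0):
    """Stop a child process, preferring a graceful shutdown.

    Sends SIGTERM and waits up to grace_seconds for the process to exit.
    If it is still running after that, falls back to SIGKILL, which is
    what `kill -9` sends. Returns "graceful" or "killed" so a test can
    record which path was taken.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught, no shutdown hooks run
        proc.wait()  # reap the process so returncode is populated
        return "killed"


if __name__ == "__main__":
    # Hypothetical stand-in for an ODL process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, grace_seconds=5.0))
```

Returning the path taken lets a suite log or assert on it, so a run that silently fell back to kill -9 can be told apart from one that shut down cleanly.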



 Comments   
Comment by Sam Hague [ 26/Jun/18 ]

It seems like kill -9, or any other method of stopping ODL, should be handled the same way as far as the cluster is concerned: a missing ODL is a missing ODL regardless of how it went missing. But if there are differences, and if using a method other than kill -9 uncovers bugs, then I think we should change the application tests to use the better method so we can focus on the applications. We should create separate tests that use kill -9 so that behavior can be examined on its own.

Comment by Tom Pantelis [ 26/Jun/18 ]

HA should kick in as an end result no matter how a node is stopped. However, the path to that result differs a bit. If a node is killed, then one of the other nodes has to time out and start a new election. If a node is shut down gracefully, then the current leader will (try to) transfer leadership to another node immediately. The former may result in hiccups/disruption during the timeout period, but the latter should not.

In the end both need to be tested along with isolating a node (which is similar to killing a node but has different dynamics on rejoin). But I would agree that apps really don't need to test the differences - leave that up to controller suites. So I'd suggest graceful shutdown for apps to hopefully avoid intermittent test failures.
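
The difference between the two failover paths can be illustrated with a toy model. This is only a sketch of the reasoning above, not the controller's actual election logic: the names fail_over and FailoverResult, the 10-second election timeout, and the "first survivor wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverResult:
    new_leader: str
    leaderless_ms: int  # time the cluster spends without a leader


def fail_over(members, leader, graceful, election_timeout_ms=10_000):
    """Toy model of what happens when the current leader goes away.

    graceful=True:  the leader hands leadership over before exiting, so
                    the model charges no leaderless time.
    graceful=False: the leader is killed; a follower must first time out
                    before starting an election, so the model charges one
                    full election timeout (an assumed, hypothetical value).
    """
    survivors = [m for m in members if m != leader]
    if not survivors:
        raise ValueError("no remaining members to take over leadership")
    # Simplification: the first survivor becomes leader. A real election
    # depends on votes and log state, which this model ignores.
    new_leader = survivors[0]
    leaderless_ms = 0 if graceful else election_timeout_ms
    return FailoverResult(new_leader, leaderless_ms)


if __name__ == "__main__":
    cluster = ["odl1", "odl2", "odl3"]  # hypothetical member names
    print(fail_over(cluster, "odl1", graceful=False))
    print(fail_over(cluster, "odl1", graceful=True))
```

In this model both paths end with a new leader, matching the point that HA kicks in either way; only the leaderless window differs, which is where an application test could see intermittent failures.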

Comment by Jamo Luhrsen [ 26/Jun/18 ]

And unfortunately, none of the scenarios (isolation, sudden failure, graceful restart) is unrealistic.

Generated at Wed Feb 07 20:23:51 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.