Troubleshooting Controller CSIT
(NETVIRT-1315)
| Status: | Open |
| Project: | netvirt |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Sub-task | Priority: | Medium |
| Reporter: | Sam Hague | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Description |
|
This subtask will verify whether kill -9 is fine or whether something more graceful should be used. Most of the test suites use kill -9 to stop ODL. A comment was made that Akka and clustering behave differently in that scenario than they do for a graceful shutdown. |
| Comments |
| Comment by Sam Hague [ 26/Jun/18 ] |
|
Seems like kill -9 or any other method of stopping ODL should be handled the same as far as the cluster is concerned, meaning a missing ODL is a missing ODL regardless of how it went missing. But if there are differences, and if using a different method than kill -9 uncovers bugs, then I think we should change the application tests to use the better method so we can focus on the applications. We should create separate tests that use kill -9 so that behavior gets its own focused coverage. |
| Comment by Tom Pantelis [ 26/Jun/18 ] |
|
HA should kick in as an end result no matter how a node is stopped. However, the path to that result differs a bit. If a node is killed, then one of the other nodes has to time out and start a new election. If a node is shut down gracefully, then the current leader will (try to) transfer leadership to another node immediately. The former may result in hiccups/disruption during the timeout period, but the latter should not. In the end both need to be tested, along with isolating a node (which is similar to killing a node but has different dynamics on rejoin). But I would agree that apps really don't need to test the differences; leave that up to the controller suites. So I'd suggest graceful shutdown for apps to hopefully avoid intermittent test failures. |
| Comment by Jamo Luhrsen [ 26/Jun/18 ] |
|
And unfortunately, all of these scenarios are realistic (isolation, sudden failure, graceful restart). |
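
For illustration only, the "graceful first, kill -9 as a fallback" idea discussed in this thread can be sketched as below. This is a minimal sketch and not how the CSIT suites actually stop ODL: it assumes a locally spawned process on a POSIX host, uses SIGTERM as a stand-in for whatever graceful shutdown mechanism the suites would use, and the name `stop_process` and the `grace_seconds` parameter are hypothetical.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 30.0) -> str:
    """Try a graceful stop first; fall back to a hard kill.

    Returns "graceful" if the process exited within grace_seconds,
    otherwise "killed". (Hypothetical helper, for illustration only.)
    """
    # SIGTERM on POSIX: the process gets a chance to shut down cleanly.
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        # SIGKILL on POSIX: the equivalent of kill -9, no cleanup is possible.
        proc.kill()
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Hypothetical stand-in for a controller process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_process(child, grace_seconds=5.0))
```

Reporting which path was taken would let an application suite record when it had to fall back to the hard kill, while a dedicated suite could call the kill path directly to exercise the timeout and re-election case.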