[AAA-130] AAA problems during partitioning and healing cluster Created: 12/May/17 Updated: 21/Mar/19 Resolved: 07/Feb/18 |
|
| Status: | Resolved |
| Project: | aaa |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jakub Morvay | Assignee: | Ryan Goulding |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 8432 |
| Description |
|
We are currently testing ODL clustering thoroughly. The common scenario is that we isolate one node and verify the state and behavior we expect. Then we rejoin the isolated node to the cluster and verify the state again. The problem is that after partitioning or healing the cluster, AAA does not seem to work correctly. Sometimes we get a 401 response to our HTTP requests; sometimes we don't get any answer at all. This happens on the isolated node but also on non-isolated nodes. We don't have debug logging for AAA or RESTCONF turned on during our tests. I will try to replicate this locally and update the bug with relevant logs. |
| Comments |
| Comment by Vratko Polak [ 12/May/17 ] |
|
> we don't get any answer at all

Not only then. I see one case [0] where a timeout happens when isolating one member and querying one of the other two, and one case [1] where a timeout happens after graceful leader movement. The only suspicious thing is that this happens when a shard leader moves, and the failure is always on the third member (not the old or the new leader).

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/673/archives/log.html.gz#s1-s44-t3-k2-k5-k1-k2-k1-k2-k1-k6-k2-k1-k2-k1-k1-k3-k3-k1 |
| Comment by Vratko Polak [ 12/May/17 ] |
|
Note that the URI which fails is /restconf/modules. I have prepared a test change to skip this check; we will see whether this repeats on the jolokia URI (which we need to access in order to detect the new leader). |
| Comment by Vratko Polak [ 18/Sep/17 ] |
|
No failures are seen in CSIT anymore. Some were fixed, others are avoided by the suite not performing specific checks. Lowering severity to Minor; we would need a specialized suite to determine which requests are not working right during an isolation scenario. |
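For illustration only, a specialized check of that kind could start from something like the sketch below. It is not part of the CSIT suite. It sends a GET to each member for each path and records whether the answer was a success, a 401, or no answer at all, which are the outcomes reported in this bug. The member addresses and credentials passed by a caller, and the helper names `classify`, `fetch_status` and `survey`, are all assumptions made for the example; only the /restconf/modules path comes from this report.

```python
import base64
import socket
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional, Tuple


def classify(status: Optional[int]) -> str:
    """Map an HTTP status (None means no answer arrived) to a short label."""
    if status is None:
        return "no-answer"
    if status == 401:
        return "unauthorized"
    if 200 <= status < 300:
        return "ok"
    return f"other-{status}"


def fetch_status(url: str, user: str, password: str,
                 timeout: float = 5.0) -> Optional[int]:
    """GET the URL with HTTP Basic auth; return the status, or None on no answer."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 401 are raised as HTTPError by urllib.
        return err.code
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # Connection refused, unreachable host, or timeout: treat as no answer.
        return None


def survey(members: Iterable[str], paths: Iterable[str],
           fetch: Callable[[str], Optional[int]]) -> Dict[Tuple[str, str], str]:
    """Probe every (member, path) pair and label each outcome."""
    path_list = list(paths)
    return {(member, path): classify(fetch(member + path))
            for member in members
            for path in path_list}
```

A caller would wrap `fetch_status` with its own credentials, for example `survey(members, ["/restconf/modules"], lambda url: fetch_status(url, user, password))`, and run it before isolation, during isolation, and after rejoin to compare the three result tables.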
| Comment by Ryan Goulding [ 07/Feb/18 ] |
|
This is expected; Authorization requires access to MD-SAL. During isolation, that is not possible. Closing since it functions as designed. |
| Comment by Ryan Goulding [ 07/Feb/18 ] |
|
Functions as designed. |