[AAA-130] AAA problems during partitioning and healing cluster Created: 12/May/17  Updated: 21/Mar/19  Resolved: 07/Feb/18

Status: Resolved
Project: aaa
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Jakub Morvay Assignee: Ryan Goulding
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8432

 Description   

Currently we are thoroughly testing ODL clustering. The common scenario is that we isolate one node, verify the state we expect, verify the behavior we expect, etc. Then we join the isolated node back to the cluster and again verify the state, etc.

The problem is that after partitioning or healing the cluster, AAA does not seem to work correctly. Sometimes we get a 401 response to our HTTP requests, sometimes we don't get any answer at all. This happens on the isolated node but also on the non-isolated nodes.

We don't have debug logging turned on for AAA or for RESTCONF during our tests. I will try to replicate this locally and update the bug with relevant logs.
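
For anyone trying to replicate this locally, the two symptoms above (a 401 versus no answer at all) can be told apart with a small probe. The following is only a minimal sketch and not part of the CSIT suite: the function name `probe`, the `admin`/`admin` credentials, the 5-second timeout, and the injectable `opener` parameter are all assumptions made for illustration.

```python
import base64
import socket
import urllib.error
import urllib.request


def probe(url, user="admin", password="admin", timeout=5.0,
          opener=urllib.request.urlopen):
    """Send one authenticated GET and classify the outcome.

    Returns 'ok', 'unauthorized', 'timeout', 'unreachable' or 'http-<code>'.
    The credentials and timeout are assumed defaults, not values from the bug.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"})
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401.
        status = err.code
    except OSError as err:
        # URLError wraps the underlying cause in .reason; a read timeout
        # may also surface directly as socket.timeout.
        reason = getattr(err, "reason", err)
        if isinstance(err, socket.timeout) or isinstance(reason, socket.timeout):
            return "timeout"
        return "unreachable"
    if status == 200:
        return "ok"
    if status == 401:
        return "unauthorized"
    return f"http-{status}"
```

Calling `probe` with each member's RESTCONF URL before isolation, during isolation, and after rejoin would show which nodes return 401 and which do not answer within the timeout.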



 Comments   
Comment by Vratko Polak [ 12/May/17 ]

> we don't get any answer at all
> join the isolated node back to the cluster

Not only then. I see one case [0] where the timeout happens while isolating one member and querying one of the other two, and one case [1] where the timeout happens after graceful leader movement.
Both cases use the tell-based protocol (so the failure is unlikely to be caused by AskTimeoutException). One affects a module-based shard, the other affects a prefix-based shard, so the shard type is unlikely to matter.

The only suspicious thing is that this happens when a shard leader moves, and the failure is always on the third member (not the old or the new leader).

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/673/archives/log.html.gz#s1-s44-t3-k2-k5-k1-k2-k1-k2-k1-k6-k2-k1-k2-k1-k1-k3-k3-k1
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/676/archives/log.html.gz#s1-s60-t1-k2-k12-k1-k1-k1-k6-k2-k1-k2-k1-k1-k3-k3-k1

Comment by Vratko Polak [ 12/May/17 ]

Note that the URI which fails is /restconf/modules. I have prepared a test change to skip this check; we will see whether this repeats on the Jolokia URI (which we need to access in order to detect the new leader).

Comment by Vratko Polak [ 18/Sep/17 ]

No failures are seen in CSIT anymore. Some were fixed, others are avoided by the suite not performing specific checks. Lowering severity to Minor; we would need a specialized suite to determine which requests are not working correctly during the isolation scenario.
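
As a rough illustration of what such a specialized suite might report, the sketch below tallies request outcomes per URI. It is not an existing CSIT utility: the function names `summarize` and `failing_uris` and the `(node, uri, outcome)` record shape are assumptions, and the outcome strings are whatever labels the suite chooses (here `"ok"` is assumed to mean success).

```python
from collections import Counter, defaultdict


def summarize(results):
    """Count outcomes per URI.

    `results` is an iterable of (node, uri, outcome) tuples; this record
    shape is a hypothetical convention, not something CSIT defines.
    """
    per_uri = defaultdict(Counter)
    for _node, uri, outcome in results:
        per_uri[uri][outcome] += 1
    return {uri: dict(counts) for uri, counts in per_uri.items()}


def failing_uris(results):
    """Return the sorted URIs that had at least one outcome other than 'ok'."""
    return sorted({uri for _node, uri, outcome in results if outcome != "ok"})
```

Running both over the records collected during an isolation window would show whether failures are confined to particular URIs or spread across all of them.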

Comment by Ryan Goulding [ 07/Feb/18 ]

This is expected; authorization requires access to MD-SAL. During isolation, that is not possible. Closing since it functions as designed.

Comment by Ryan Goulding [ 07/Feb/18 ]

Functions as designed.
