[AAA-28] Race condition between AAA and netconf connector Created: 10/Mar/15 Updated: 21/Mar/19 Resolved: 11/Sep/15 |
|
| Status: | Resolved |
| Project: | aaa |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Wojciech Dec | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 2808 |
| Description |
|
From Devin Avery: Hi Folks, Following up on this. So the problem appears to be an internal race condition inside of MD-SAL / AAA / config system combined with logic flaws about retrying connections. Basically, the way I get this to reproduce is to list all of the features that I want to install in the featuresBoot list and then start a clean controller (i.e. delete the data directory, delete /etc/opendaylight). When the controller starts up there is a race between the config subsystem injecting the self-referencing netconf connection and the AAA database being locked. When the problem occurs the AAA database file is locked, and thus when netconf tries to connect and, specifically, authenticate, AAA fails because of an exception. Then, because the config subsystem, or more specifically the netconf mounter, is try-once-and-done (no retries), it fails to connect the self-referencing controller. Potential fixes from what I can tell: (1) enhance AAA logic to handle issues when the database might be locked, and (2) consider a wait/retry mechanism (we would need to answer the question of “when would we expect this error” to implement something like this). I think ultimately #1 is the longer-term fix. If you are trying to reproduce this, note that I see this only every once in a while: on 10 machines, I see it consistently on 1 of them every time I run a suite of tests. For now I have worked around this by installing the features more slowly / independently, but this is clearly not an ideal solution. I’ll see if I can’t come up with a more consistent reproduction scenario but figured I would share what I know in the meantime. |
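The wait/retry idea from the description could be sketched roughly as follows. This is only an illustration, not the AAA code: `LockedDbRetry`, `DatabaseLockedException`, and `withRetry` are hypothetical names, and the exception actually thrown while the database file is locked is not given in this report.

```java
import java.util.concurrent.Callable;

public final class LockedDbRetry {

    /** Hypothetical stand-in for whatever the store throws while its file is locked. */
    public static final class DatabaseLockedException extends RuntimeException {
        public DatabaseLockedException(String message) {
            super(message);
        }
    }

    private LockedDbRetry() {
    }

    /**
     * Runs the operation, retrying only when the database is reported as locked.
     * The delay doubles after each failed attempt; any other exception, or a
     * lock that persists through the final attempt, is propagated to the caller.
     */
    public static <T> T withRetry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (DatabaseLockedException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the lock did not clear in time
                }
                Thread.sleep(delay);
                delay *= 2; // simple exponential backoff
            }
        }
    }
}
```

Bounding the number of attempts keeps a permanently locked file from stalling startup, which is one possible answer to the "when would we expect this error" question raised above.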
| Comments |
| Comment by Maros Marsalek [ 08/Apr/15 ] |
|
Hi, I could not reproduce this directly, so I tried it manually by removing the odl-aaa-netconf-plugin feature (to make authentication for loopback always fail) and installing odl-netconf-connector-all. Configs were pushed successfully, but the connection could not be established due to failed authentication. The attempts to establish the connection continued with failures until I installed the aaa feature manually. After that the loopback was successfully mounted. This is the behavior on current master (Lithium release). In Helium and the first SR release the connection was not reattempted when the authentication failed, but this is fixed now. The Netconf SSH endpoint relies on the aaa service to perform authentication, and it retrieves the service from the OSGi registry. It looks like the service is exposed to the OSGi registry but it's not fully functional (the database lock exception). This looks like an AAA issue, and AAA should check the database. However, this will not be an issue after AAA is migrated to use MD-SAL instead of the SQLite database. |
| Comment by Flavio Fernandes [ 10/Apr/15 ] |
|
ref: https://lists.opendaylight.org/pipermail/controller-dev/2015-March/008458.html From: Devin Avery <davery@brocade.com> Hi Folks, Following up on this. So the problem appears to be an internal race condition inside of MD-SAL / AAA / config system combined with logic flaws about retrying connections. Basically, the way I get this to reproduce is to list all of the features that I want to install in the featuresBoot list and then start a clean controller (i.e. delete the data directory, delete /etc/opendaylight). When the controller starts up there is a race between the config subsystem injecting the self-referencing netconf connection and the AAA database being locked. When the problem occurs the AAA database file is locked, and thus when netconf tries to connect and, specifically, authenticate, AAA fails because of an exception. Then, because the config subsystem, or more specifically the netconf mounter, is try-once-and-done (no retries), it fails to connect the self-referencing controller. Potential fixes from what I can tell: (1) enhance AAA logic to handle issues when the database might be locked, and (2) consider a wait/retry mechanism (we would need to answer the question of “when would we expect this error” to implement something like this). I think ultimately #1 is the longer-term fix. If you are trying to reproduce this, note that I see this only every once in a while: on 10 machines, I see it consistently on 1 of them every time I run a suite of tests. For now I have worked around this by installing the features more slowly / independently, but this is clearly not an ideal solution. I’ll see if I can’t come up with a more consistent reproduction scenario but figured I would share what I know in the meantime. One other comment: in the AAA code it would be really helpful if the entire exception was printed to the log, not just the message. 
In general the AAA code catches exceptions and logs e.getMessage(), or uses an overloaded toString on the exception to print only a simple message. It is better practice to print the stack trace too, though, to make it easier to debug. Thank you, Devin Avery From: Wojciech Dec <wdec.ietf@gmail.com> Hi, I must admit I didn't see this particular problem so far. It obviously looks like something thrown up by SQLite on reading/writing. We're going to replace SQLite with another db backend, so hopefully that may redress that. In the meantime, if you have more info on how to reproduce this please let us know. Cheers, On 5 March 2015 at 16:11, Devin Avery <davery@brocade.com> wrote: |
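The logging point above can be illustrated with a small sketch. It uses `java.util.logging` only to stay self-contained; which logging framework the AAA code uses is not stated in this report, and the class and method names here are hypothetical. The difference is whether the `Throwable` itself is handed to the logger.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public final class ExceptionLogging {

    private static final Logger LOG = Logger.getLogger(ExceptionLogging.class.getName());

    private ExceptionLogging() {
    }

    /** Pattern criticised above: only the message reaches the log, the stack trace is lost. */
    public static void logMessageOnly(Exception e) {
        LOG.log(Level.SEVERE, "Authentication failed: " + e.getMessage());
    }

    /** Passing the exception itself lets the handler print the full stack trace and causes. */
    public static void logWithStackTrace(Exception e) {
        LOG.log(Level.SEVERE, "Authentication failed", e);
    }
}
```

With the second form the log record carries the original exception, so a failure such as the database lock described in this issue can be traced back to the code that raised it.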
| Comment by Wojciech Dec [ 11/Sep/15 ] |
|
This appears to have been resolved, or at least not seen since the fix for To be reopened if it is seen again. |