[NETCONF-150] Netconf connector dies after another feature with schemas is installed Created: 18/Feb/16 Updated: 15/Mar/19 Resolved: 13/Oct/17 |
|
| Status: | Resolved |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jozef Behran | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 5383 |
| Description |
|
feature:install odl-netconf-connector-all Tested on latest Beryllium. In the karaf.log (see attachment) the following suspicious exception happens after the schema change events (there were 2 found). 2016-02-18 15:15:58,470 | WARN | r - ConfigPusher | FeatureConfigPusher | 122 - config-persister-feature-adapter - 0.4.0.SNAPSHOT | Karaf featuresService.listInstalledFeatures() has thrown an exception, retry 0 The |
| Comments |
| Comment by Jozef Behran [ 18/Feb/16 ] |
|
Attachment karaf.log.xz has been added with description: INFO logs from the session described in the report |
| Comment by Jozef Behran [ 26/Feb/16 ] |
|
I created a test suite that tries to catch this bug [1] but it will PASS on Sandbox but FAIL on my local system. As I have no way to compare the Sandbox test environment to my local system test environment, I cannot continue working on the suite as I can't figure out what in the suite is broken and how to fix it. |
| Comment by Jozef Behran [ 26/Feb/16 ] |
|
Also the test case in my last comment needs to be run in its own job. Here is a definition of a job that runs the suite and then runs a Netconf CRUD and TCPMD5 suites to make sure the two features work correctly: |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Found suspicious "Optimistic Lock Failed" in the logs at the time of the BGPCEP installation failure, followed by "will retry after timeout". From that I suspect that Netconf tries to reconnect in time when the operation is not yet done and silently crashes (no exception logged) because of that. Attached the offending log so a deeper analysis can be done. This is an INFO level log. I will try to generate a DEBUG log that fits here and upload it as well. |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Attachment karaf.log.xz has been added with description: INFO logs with a suspicious OptimisticLockFailed |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
This time the Optimistic Lock Failed happened during the ODL boot but the suite was started only a couple of minutes after the boot. These logs are "partial DEBUG" which means the boot happened at the INFO level and at the suite initialization the logging was switched to DEBUG. Note to myself: Need to run a test where ODL boot is started before lunch and the suite is started after the lunch. This should give ODL 30-45 minutes to do the retry of the failed operation. |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Attachment karaf.log.xz has been added with description: Partial DEBUG logs with suspicious OptimisticLockFailed |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Here is yet another test run with the suspicious exception but everything running under DEBUG logging (it fits but only barely). As always the suite was started only a couple of minutes after the ODL finished booting. |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Attachment karaf.log.xz has been added with description: Full DEBUG logs with suspicious OptimisticLockFailed |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
It turns out the OptimisticLockFailed most likely has nothing to do with the problem. Here is a partial DEBUG (boot at INFO, test at DEBUG) log with no OptimisticLockFailed but the bug was still there. |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Attachment karaf.log.xz has been added with description: Partial DEBUG log with no OptimisticLockFailed but bug still present |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
This is the last of the log files. In this one there is a 1-hour (actually slightly more than 1 hour) pause between the ODL boot end and the beginning of the suite execution. The logging was configured to DEBUG during the whole time. |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Attachment karaf.log.xz has been added with description: Full DEBUG logs with 1 hour between boot and test run |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
In this test run the suite failed to catch the bug (the bug did not happen). Analysis and comparison with the other log files is needed to figure out, why. |
| Comment by Jozef Behran [ 03/Mar/16 ] |
|
Attachment karaf.log.xz has been added with description: Full DEBUG log from a Sandbox test run |
| Comment by Jozef Behran [ 07/Mar/16 ] |
|
Reporting results from the analysis. First, according to the logs the problem is happening when the connection maintained by the Netconf connector is disconnected while the Netconf connector is still initializing. The Netconf connector then fails to reconnect. Based on discussion about "who is responsible for detection of the schema changes", it appears that if a device disconnects from ODL with Netconf connector, it can crash the Netconf connector by just disconnecting too fast. Unfortunately creating a test suite that would catch the problem on Sandbox is a lot of work (definitely larger to fit into the time space between today and the Beryllium SR1) and I have higher priority stuff on my agenda at this time so most likely this needs to be postponed. |
| Comment by Tomas Cere [ 13/Oct/17 ] |
|
cfg subsystem related, wont fix since its deprecated and everyone should move towards the datastore |