[NETCONF-150] Netconf connector dies after another feature with schemas is installed Created: 18/Feb/16  Updated: 15/Mar/19  Resolved: 13/Oct/17

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Jozef Behran Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf.log.xz     File karaf.log.xz     File karaf.log.xz     File karaf.log.xz     File karaf.log.xz     File karaf.log.xz     File karaf.log.xz    
External issue ID: 5383

 Description   

feature:install odl-netconf-connector-all
then feature:install odl-bgpcep-bgp-all
and after the second feature:install finishes, odl-netconf-connector-all stops working.

Tested on latest Beryllium.
This problem might occur for any feature that deploys additional schemas upon installation in place of "odl-bgpcep-bgp-all" but that was not tested.

In the karaf.log (see attachment) the following suspicious exception happens after the schema change events (there were 2 found).

2016-02-18 15:15:58,470 | WARN | r - ConfigPusher | FeatureConfigPusher | 122 - config-persister-feature-adapter - 0.4.0.SNAPSHOT | Karaf featuresService.listInstalledFeatures() has thrown an exception, retry 0
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(Unknown Source)[:1.7.0_67]
at java.util.HashMap$KeyIterator.next(Unknown Source)[:1.7.0_67]
at java.util.AbstractCollection.toArray(Unknown Source)[:1.7.0_67]
at org.apache.karaf.features.internal.FeaturesServiceImpl.listInstalledFeatures(FeaturesServiceImpl.java:788)[20:org.apache.karaf.features.core:3.0.3]
at Proxyadff952b_fc56_472f_92f2_a3993414b469.listInstalledFeatures(Unknown Source)[:]
at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.isInstalled(FeatureConfigPusher.java:96)[122:config-persister-feature-adapter:0.4.0.SNAPSHOT]
at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.pushConfig(FeatureConfigPusher.java:77)[122:config-persister-feature-adapter:0.4.0.SNAPSHOT]
at org.opendaylight.controller.configpusherfeature.internal.FeatureConfigPusher.pushConfigs(FeatureConfigPusher.java:67)[122:config-persister-feature-adapter:0.4.0.SNAPSHOT]
at org.opendaylight.controller.configpusherfeature.internal.ConfigPushingRunnable.processFeatureEvent(ConfigPushingRunnable.java:65)[122:config-persister-feature-adapter:0.4.0.SNAPSHOT]
at org.opendaylight.controller.configpusherfeature.internal.ConfigPushingRunnable.run(ConfigPushingRunnable.java:47)[122:config-persister-feature-adapter:0.4.0.SNAPSHOT]
at java.lang.Thread.run(Unknown Source)[:1.7.0_67]

The NETCONF-144 is related but this is not the case. I added "sleep 60" between the two feature:install commands and also between the ODL startup and the first feature:install command and the behavior persists. The log confirms there are significant time gaps between the feature:install commands and the rest of ODL activity so this is not caused by the second feature:install being executed concurrently with the odl-netconf-connector reconnect. But for odl-bgpcep-bgp-all the odl-netconf-connector reconnect happens twice so I am not entirely sure about this concurrency related argument.



 Comments   
Comment by Jozef Behran [ 18/Feb/16 ]

Attachment karaf.log.xz has been added with description: INFO logs from the session described in the report

Comment by Jozef Behran [ 26/Feb/16 ]

I created a test suite that tries to catch this bug [1] but it will PASS on Sandbox but FAIL on my local system. As I have no way to compare the Sandbox test environment to my local system test environment, I cannot continue working on the suite as I can't figure out what in the suite is broken and how to fix it.

[1] https://git.opendaylight.org/gerrit/35416

Comment by Jozef Behran [ 26/Feb/16 ]

Also the test case in my last comment needs to be run in its own job. Here is a definition of a job that runs the suite and then runs a Netconf CRUD and TCPMD5 suites to make sure the two features work correctly:

https://git.opendaylight.org/gerrit/35448

Comment by Jozef Behran [ 03/Mar/16 ]

Found suspicious "Optimistic Lock Failed" in the logs at the time of the BGPCEP installation failure, followed by "will retry after timeout". From that I suspect that Netconf tries to reconnect in time when the operation is not yet done and silently crashes (no exception logged) because of that.

Attached the offending log so a deeper analysis can be done. This is an INFO level log. I will try to generate a DEBUG log that fits here and upload it as well.

Comment by Jozef Behran [ 03/Mar/16 ]

Attachment karaf.log.xz has been added with description: INFO logs with a suspicious OptimisticLockFailed

Comment by Jozef Behran [ 03/Mar/16 ]

This time the Optimistic Lock Failed happened during the ODL boot but the suite was started only a couple of minutes after the boot. These logs are "partial DEBUG" which means the boot happened at the INFO level and at the suite initialization the logging was switched to DEBUG.

Note to myself: Need to run a test where ODL boot is started before lunch and the suite is started after the lunch. This should give ODL 30-45 minutes to do the retry of the failed operation.

Comment by Jozef Behran [ 03/Mar/16 ]

Attachment karaf.log.xz has been added with description: Partial DEBUG logs with suspicious OptimisticLockFailed

Comment by Jozef Behran [ 03/Mar/16 ]

Here is yet another test run with the suspicious exception but everything running under DEBUG logging (it fits but only barely). As always the suite was started only a couple of minutes after the ODL finished booting.

Comment by Jozef Behran [ 03/Mar/16 ]

Attachment karaf.log.xz has been added with description: Full DEBUG logs with suspicious OptimisticLockFailed

Comment by Jozef Behran [ 03/Mar/16 ]

It turns out the OptimisticLockFailed most likely has nothing to do with the problem. Here is a partial DEBUG (boot at INFO, test at DEBUG) log with no OptimisticLockFailed but the bug was still there.

Comment by Jozef Behran [ 03/Mar/16 ]

Attachment karaf.log.xz has been added with description: Partial DEBUG log with no OptimisticLockFailed but bug still present

Comment by Jozef Behran [ 03/Mar/16 ]

This is the last of the log files. In this one there is a 1-hour (actually slightly more than 1 hour) pause between the ODL boot end and the beginning of the suite execution. The logging was configured to DEBUG during the whole time.

Comment by Jozef Behran [ 03/Mar/16 ]

Attachment karaf.log.xz has been added with description: Full DEBUG logs with 1 hour between boot and test run

Comment by Jozef Behran [ 03/Mar/16 ]

In this test run the suite failed to catch the bug (the bug did not happen). Analysis and comparison with the other log files is needed to figure out, why.

Comment by Jozef Behran [ 03/Mar/16 ]

Attachment karaf.log.xz has been added with description: Full DEBUG log from a Sandbox test run

Comment by Jozef Behran [ 07/Mar/16 ]

Reporting results from the analysis. First, according to the logs the problem is happening when the connection maintained by the Netconf connector is disconnected while the Netconf connector is still initializing. The Netconf connector then fails to reconnect.

Based on discussion about "who is responsible for detection of the schema changes", it appears that if a device disconnects from ODL with Netconf connector, it can crash the Netconf connector by just disconnecting too fast. Unfortunately creating a test suite that would catch the problem on Sandbox is a lot of work (definitely larger to fit into the time space between today and the Beryllium SR1) and I have higher priority stuff on my agenda at this time so most likely this needs to be postponed.

Comment by Tomas Cere [ 13/Oct/17 ]

cfg subsystem related, wont fix since its deprecated and everyone should move towards the datastore

Generated at Wed Feb 07 20:14:18 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.