[CONTROLLER-893] clustering.services-implementation failed to register listner. Created: 25/Sep/14 Updated: 19/Oct/17 Resolved: 29/Sep/14 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Helium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Hideyuki Tai | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 2074 |
| Description |
|
System tests for VTN frequently fails. https://jenkins.opendaylight.org/integration/job/integration-master-csit-karaf-vtn-only/ We've found out that the root cause of this failure is in clustering. We sometimes observe that clustering.services-implementation fails to register listener. We observed that when the clustering.services-implementation fails to register topologymanager.edgesDB, system tests for VTN failed. 2014-09-24 14:29:00,198 | TRACE | Local user karaf | \ 2014-09-24 14:29:00,199 | TRACE | Local user karaf | \ 2014-09-24 14:29:00,199 | TRACE | Local user karaf | \ 2014-09-24 14:29:00,199 | DEBUG | Local user karaf | \ Because if the clustering.services-implementation fails to register topologymanager.edgesDB, dijkstra_impelmentation fails to create route information, and IRouting.getRoute(Node, Node) returns null, VTN Manager thinks there are no route between hosts, and eventually VTN Manager does not try to forward packets and install flow entries. That is why System tests for VTN frequently fails. |
| Comments |
| Comment by Hideyuki Tai [ 25/Sep/14 ] |
|
The easiest way to reproduce the issue 0. Clean caches. $ rm -rf data/ 1. Run Karaf distribution. $ bin/karaf 2. Change the log level of clustering.services_implementation to TRACE log:set TRACE org.opendaylight.controller.clustering.services_implementation 3. Install odl-nsf-managers feature:install odl-nsf-managers NOTE: Which listener the clustering.services-implementation fails to register changes each time. My observation The following is my observation during the above procedure. {nodeconnections] It was failed to register listener for nodeconnections. 2014-09-24 20:41:51,777 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache connectionmanager.CONTAINER_BASED.nodeconnections didn't exist when org.opendaylight.controller.connectionmanager.internal.ConnectionManager@34455f88 tried to register to its updates 2014-09-24 20:41:51,777 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache connectionmanager.LOAD_BALANCED.nodeconnections didn't exist when org.opendaylight.controller.connectionmanager.internal.ConnectionManager@34455f88 tried to register to its updates 2014-09-24 20:41:51,778 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache connectionmanager.ROUND_ROBIN.nodeconnections didn't exist when org.opendaylight.controller.connectionmanager.internal.ConnectionManager@34455f88 tried to register to its u pdates [config.container.event.save] 2014-09-24 20:41:51,926 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache config.container.event.save didn't exist when org.opendaylight.controller.configuration.internal.ContainerConfigurationService@408c3e7b tried to register to its updates [statisticsmanager] 2014-09-24 20:41:51,945 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache statisticsmanager.triggers didn't exist when org.opendaylight.controller.statisticsmanager.internal.StatisticsManager@53f21293 tried to register to its updates 2014-09-24 20:41:51,945 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache statisticsmanager.flowStatistics didn't exist when org.opendaylight.controller.statisticsmanager.internal.StatisticsManager@53f21293 tried to register to its updates [frm.workStatus] 2014-09-24 20:41:52,029 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache frm.workOrder didn't exist when org.opendaylight.controller.forwardingrulesmanager.internal.ForwardingRulesManager@3ce3ed23 tried to register to its updates 2014-09-24 20:41:52,029 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache frm.workStatus didn't exist when org.opendaylight.controller.forwardingrulesmanager.internal.ForwardingRulesManager@3ce3ed23 tried to register to its updates 2014-09-24 20:41:52,030 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache frm.installedSwView didn't exist when org.opendaylight.controller.forwardingrulesmanager.internal.ForwardingRulesManager@3ce3ed23 tried to register to its updates {topologymanager.edgesDB}2014-09-24 20:41:52,041 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache topologymanager.edgesDB didn't exist when org.opendaylight.controller.topologymanager.internal.TopologyManagerImpl@7f6d2f18 tried to register to its updates |
| Comment by Colin Dixon [ 25/Sep/14 ] |
|
Adding Moiz to the cc list in the hope that he can take this or find somebody who can. |
| Comment by Ed Warnicke [ 26/Sep/14 ] |
|
Hideyuki... Please try to collect the logs corresponding to this: private void retrieveCaches() { this.edgesDB = (ConcurrentMap<Edge, Set<Property>>) this.clusterContainerService.getCache(TOPOEDGESDB); this.hostsDB = this.nodeConnectorsDB = this.userLinksDB = } From TopologyManagerImpl. The caches being reported missing are created and retrieved in TopologyManager.init(). I think we should see: So I am hoping we can see from the logs whether we are seeing errors registering the caches. |
| Comment by Colin Dixon [ 26/Sep/14 ] |
|
Somebody might try a retry with a sleep in setClusterAware in ClusterServicesCommon. Then make sure it doesn't slow down boot unacceptably. Other dirty solutions would be: 1.) Telling users how to tell if VTN came up good or not, so they can reboot if not. |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
I tried reproducing this issue. Couple of things, 1. I do not see this message logged at all logger.trace("cachename:{} on container:{} has " + 2. Where the exception related message is logged I see the comment // Do nothing, the important is that So it seems to me that (a) caches never exist and (b) For some reason it is ok to ignore the adding of the listener |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
Ed, None of the errors that you asked Hideyuki to look for are reported. There was one though 2014-09-26 09:21:19,868 | ERROR | Local user karaf | ConfigurationService | 195 - org.opendaylight.controller.configuration.implementation - 0.4.3.SNAPSHOT | Failed to create config directory: configuration/startup/default I don't know if this has any bearing on the issue. |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
I tried adding 100 retires when adding a listener with a sleep of upto 100 millisecond between retries and it still failed when adding many listeners. |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
I added some additional logging in createCache to see if the cache was ever created. It looks like when the add listener fails the cache is never created. |
| Comment by Giovanni Meo [ 26/Sep/14 ] |
|
(In reply to Moiz Raja from comment #8) The cache is created afterward, so the idea is actually to store those caches that has been got listener before the creation and then to add the listener later on. |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
Giovanni, I added logging like so, @Override logger.info("Creating cache {} success!!!!", cacheName); and I see neither of the log statements in the log. |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
Attachment karaf.log has been added with description: karaf.log showing that create cache was not called for caches where adding listener failed |
| Comment by Giovanni Meo [ 26/Sep/14 ] |
|
(In reply to Moiz Raja from comment #11) Thanks Moiz, but this logs show no caches got the message "didn't exist when", so not sure if the log is telling it all. |
| Comment by Moiz Raja [ 26/Sep/14 ] |
|
Here are the other changes I did. I did comment out the "didn't exist" statement. void setCacheUpdateAware(Map props, ICacheUpdateAware s) { if (this.cacheUpdateAware != null) { } private void addListener(ICacheUpdateAware s, String cache, GetUpdatesContainer<?, ?> up) { int tries; int MAX_RETRIES = 5; for(tries=0;tries<MAX_RETRIES;tries++) { break; try { Thread.sleep(100); } catch (InterruptedException e) { if(tries == MAX_RETRIES){ |
| Comment by Hideyuki Tai [ 27/Sep/14 ] |
|
Moiz Raja has written a patch for this bug. |
| Comment by Colin Dixon [ 29/Sep/14 ] |
|
Hiedyuki, if you have confirmed that this is now fixed, can you set the state to fixed, resolved? Thanks! |