[CONTROLLER-893] clustering.services-implementation failed to register listner. Created: 25/Sep/14  Updated: 19/Oct/17  Resolved: 29/Sep/14

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Hideyuki Tai Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File karaf.log    
External issue ID: 2074

 Description   

System tests for VTN frequently fails.

https://jenkins.opendaylight.org/integration/job/integration-master-csit-karaf-vtn-only/
https://jenkins.opendaylight.org/integration/job/integration-master-csit-karaf-vtn-all/

We've found out that the root cause of this failure is in clustering.

We sometimes observe that clustering.services-implementation fails to register listener.
Which listener the clustering.services-implementation fails to register changes each time.

We observed that when the clustering.services-implementation fails to register topologymanager.edgesDB, system tests for VTN failed.

2014-09-24 14:29:00,198 | TRACE | Local user karaf | \
ClusterGlobalManager | \
98 - org.opendaylight.controller.clustering.services-implementation - \
0.4.3.Helium | setCacheUpdateAware: \
org.opendaylight.controller.topologymanager.internal.\
TopologyManagerImpl@ceb60d

2014-09-24 14:29:00,199 | TRACE | Local user karaf | \
ClusterManagerCommon | \
98 - org.opendaylight.controller.clustering.services-implementation - \
0.4.3.Helium | CacheUpdateAware being set on container:default

2014-09-24 14:29:00,199 | TRACE | Local user karaf | \
ClusterManagerCommon | \
98 - org.opendaylight.controller.clustering.services-implementation - \
0.4.3.Helium | cachenames provided below:

2014-09-24 14:29:00,199 | DEBUG | Local user karaf | \
ClusterManagerCommon | \
98 - org.opendaylight.controller.clustering.services-implementation - \
0.4.3.Helium | Cache topologymanager.edgesDB didn't exist when \
org.opendaylight.controller.topologymanager.internal.\
TopologyManagerImpl@ceb60d tried to register to its updates

Because if the clustering.services-implementation fails to register topologymanager.edgesDB, dijkstra_impelmentation fails to create route information, and IRouting.getRoute(Node, Node) returns null, VTN Manager thinks there are no route between hosts, and eventually VTN Manager does not try to forward packets and install flow entries.

That is why System tests for VTN frequently fails.



 Comments   
Comment by Hideyuki Tai [ 25/Sep/14 ]

The easiest way to reproduce the issue
======================================

0. Clean caches.

$ rm -rf data/

1. Run Karaf distribution.

$ bin/karaf

2. Change the log level of clustering.services_implementation to TRACE

log:set TRACE org.opendaylight.controller.clustering.services_implementation

3. Install odl-nsf-managers

feature:install odl-nsf-managers

NOTE: Which listener the clustering.services-implementation fails to register changes each time.

My observation
==============

The following is my observation during the above procedure.

{nodeconnections] It was failed to register listener for nodeconnections. 2014-09-24 20:41:51,777 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache connectionmanager.CONTAINER_BASED.nodeconnections didn't exist when org.opendaylight.controller.connectionmanager.internal.ConnectionManager@34455f88 tried to register to its updates 2014-09-24 20:41:51,777 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache connectionmanager.LOAD_BALANCED.nodeconnections didn't exist when org.opendaylight.controller.connectionmanager.internal.ConnectionManager@34455f88 tried to register to its updates 2014-09-24 20:41:51,778 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache connectionmanager.ROUND_ROBIN.nodeconnections didn't exist when org.opendaylight.controller.connectionmanager.internal.ConnectionManager@34455f88 tried to register to its u pdates [config.container.event.save] 2014-09-24 20:41:51,926 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache config.container.event.save didn't exist when org.opendaylight.controller.configuration.internal.ContainerConfigurationService@408c3e7b tried to register to its updates [statisticsmanager] 2014-09-24 20:41:51,945 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache statisticsmanager.triggers didn't exist when org.opendaylight.controller.statisticsmanager.internal.StatisticsManager@53f21293 tried to register to its updates 2014-09-24 20:41:51,945 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache statisticsmanager.flowStatistics didn't exist when org.opendaylight.controller.statisticsmanager.internal.StatisticsManager@53f21293 tried to register to its updates [frm.workStatus] 2014-09-24 20:41:52,029 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache frm.workOrder didn't exist when org.opendaylight.controller.forwardingrulesmanager.internal.ForwardingRulesManager@3ce3ed23 tried to register to its updates 2014-09-24 20:41:52,029 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache frm.workStatus didn't exist when org.opendaylight.controller.forwardingrulesmanager.internal.ForwardingRulesManager@3ce3ed23 tried to register to its updates 2014-09-24 20:41:52,030 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache frm.installedSwView didn't exist when org.opendaylight.controller.forwardingrulesmanager.internal.ForwardingRulesManager@3ce3ed23 tried to register to its updates {topologymanager.edgesDB}

2014-09-24 20:41:52,041 | DEBUG | Local user karaf | ClusterManagerCommon | 189 - org.opendaylight.controller.clustering.services-implementation - 0.4.3.Helium-RC2-test-v201409232153 | Cache topologymanager.edgesDB didn't exist when org.opendaylight.controller.topologymanager.internal.TopologyManagerImpl@7f6d2f18 tried to register to its updates

Comment by Colin Dixon [ 25/Sep/14 ]

Adding Moiz to the cc list in the hope that he can take this or find somebody who can.

Comment by Ed Warnicke [ 26/Sep/14 ]

Hideyuki...

Please try to collect the logs corresponding to this:

private void retrieveCaches() {
if (this.clusterContainerService == null)

{ log.error("Cluster Services is null, can't retrieve caches."); return; }

this.edgesDB = (ConcurrentMap<Edge, Set<Property>>) this.clusterContainerService.getCache(TOPOEDGESDB);
if (edgesDB == null)

{ log.error("Failed to get cache for " + TOPOEDGESDB); }

this.hostsDB =
(ConcurrentMap<NodeConnector, Set<ImmutablePair<Host, Set<Property>>>>) this.clusterContainerService.getCache(TOPOHOSTSDB);
if (hostsDB == null)

{ log.error("Failed to get cache for " + TOPOHOSTSDB); }

this.nodeConnectorsDB =
(ConcurrentMap<NodeConnector, Set<Property>>) this.clusterContainerService.getCache(TOPONODECONNECTORDB);
if (nodeConnectorsDB == null)

{ log.error("Failed to get cache for " + TOPONODECONNECTORDB); }

this.userLinksDB =
(ConcurrentMap<String, TopologyUserLinkConfig>) this.clusterContainerService.getCache(TOPOUSERLINKSDB);
if (userLinksDB == null)

{ log.error("Failed to get cache for " + TOPOUSERLINKSDB); }

}

From TopologyManagerImpl.

The caches being reported missing are created and retrieved in TopologyManager.init().

I think we should see:
ClusterServicesCommons.init()
TopologyManagerImpl.init()
and then the injection of the TopologyManagerImpl as IClusterAware into the ClusterServicesCommons

So I am hoping we can see from the logs whether we are seeing errors registering the caches.

Comment by Colin Dixon [ 26/Sep/14 ]

Somebody might try a retry with a sleep in setClusterAware in ClusterServicesCommon. Then make sure it doesn't slow down boot unacceptably.

Other dirty solutions would be:

1.) Telling users how to tell if VTN came up good or not, so they can reboot if not.
2.) Writing a script to automate 1.

Comment by Moiz Raja [ 26/Sep/14 ]

I tried reproducing this issue.

Couple of things,

1. I do not see this message logged at all

logger.trace("cachename:{} on container:{} has " +
"been registered", cache,
this.containerName);

2. Where the exception related message is logged I see the comment

// Do nothing, the important is that
// we don't register the listener in
// the shadow, and we are not doing
// that.

So it seems to me that (a) caches never exist and (b) For some reason it is ok to ignore the adding of the listener

Comment by Moiz Raja [ 26/Sep/14 ]

Ed, None of the errors that you asked Hideyuki to look for are reported. There was one though

2014-09-26 09:21:19,868 | ERROR | Local user karaf | ConfigurationService | 195 - org.opendaylight.controller.configuration.implementation - 0.4.3.SNAPSHOT | Failed to create config directory: configuration/startup/default

I don't know if this has any bearing on the issue.

Comment by Moiz Raja [ 26/Sep/14 ]

I tried adding 100 retires when adding a listener with a sleep of upto 100 millisecond between retries and it still failed when adding many listeners.

Comment by Moiz Raja [ 26/Sep/14 ]

I added some additional logging in createCache to see if the cache was ever created. It looks like when the add listener fails the cache is never created.

Comment by Giovanni Meo [ 26/Sep/14 ]

(In reply to Moiz Raja from comment #8)
> I added some additional logging in createCache to see if the cache was ever
> created. It looks like when the add listener fails the cache is never
> created.

The cache is created afterward, so the idea is actually to store those caches that has been got listener before the creation and then to add the listener later on.

Comment by Moiz Raja [ 26/Sep/14 ]

Giovanni,

I added logging like so,

@Override
public ConcurrentMap<?, ?> createCache(String cacheName,
Set<IClusterServices.cacheMode> cMode) throws CacheExistException,
CacheConfigException {
if (this.clusterService != null) {

logger.info("Creating cache {} success!!!!", cacheName);
return this.clusterService.createCache(this.containerName,
cacheName, cMode);
} else {
logger.error("Could not create cache {} because clusterService is null", cacheName);
return null;
}
}

and I see neither of the log statements in the log.

Comment by Moiz Raja [ 26/Sep/14 ]

Attachment karaf.log has been added with description: karaf.log showing that create cache was not called for caches where adding listener failed

Comment by Giovanni Meo [ 26/Sep/14 ]

(In reply to Moiz Raja from comment #11)
> Created attachment 266 [details]
> karaf.log showing that create cache was not called for caches where adding
> listener failed

Thanks Moiz, but this logs show no caches got the message "didn't exist when", so not sure if the log is telling it all.

Comment by Moiz Raja [ 26/Sep/14 ]

Here are the other changes I did. I did comment out the "didn't exist" statement.

void setCacheUpdateAware(Map props, ICacheUpdateAware s) {
logger.info("MOIZ: CacheUpdateAware being set on container:{}",
this.containerName);

if (this.cacheUpdateAware != null) {
Set<String> caches = (Set<String>)props.get("cachenames");
if (caches != null) {
logger.trace("cachenames provided below:");
for (String cache : caches) {
if (this.cacheUpdateAware.get(cache) != null) {
logger.error("cachename:{} on container:{} has already a listener", cache, this.containerName);
} else {
GetUpdatesContainer<?, ?> up = new GetUpdatesContainer(s, this.containerName, cache);
if (up != null)

{ addListener(s, cache, up); }

}
}
}
}
}

private void addListener(ICacheUpdateAware s, String cache, GetUpdatesContainer<?, ?> up) {

int tries;

int MAX_RETRIES = 5;

for(tries=0;tries<MAX_RETRIES;tries++) {
try {
this.clusterService.addListener(this.containerName,
cache, up);
this.cacheUpdateAware.put(cache, up);
logger.info("cachename:{} on container:{} has " +
"been registered", cache,
this.containerName
);

break;
} catch (CacheListenerAddException exc) {
// logger
// .debug("Cache {} didn't exist when {} tried to register to its updates, tries = {}", cache,
// s, tries);
// Do nothing, the important is that
// we don't register the listener in
// the shadow, and we are not doing
// that.
}

try

{ Thread.sleep(100); }

catch (InterruptedException e) {
}
}

if(tries == MAX_RETRIES){
logger.error("Failed to add listener for cache {} after {} tries", cache, MAX_RETRIES);
}
}

Comment by Hideyuki Tai [ 27/Sep/14 ]

Moiz Raja has written a patch for this bug.

https://git.opendaylight.org/gerrit/#/c/11621/

Comment by Colin Dixon [ 29/Sep/14 ]

Hiedyuki, if you have confirmed that this is now fixed, can you set the state to fixed, resolved?

Thanks!

Generated at Wed Feb 07 19:54:09 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.