[OPNFLWPLUG-859] Internal SalService queue gets full and ODL does not perform any more OpenFlow actions on the switch and does not release the mastership Created: 23/Feb/17  Updated: 27/Sep/21  Resolved: 06/Mar/18

Status: Resolved
Project: OpenFlowPlugin
Component/s: clustering
Affects Version/s: None
Fix Version/s: Boron

Type: Bug
Reporter: Jon Castro Assignee: Jalpa Modasiya
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf.log.1.gz     File karaf.log.2.gz    
External issue ID: 7846

 Description   

After running ODL for a few days, it seems some internal queue gets full and the SalService does not allow any more actions to be performed on the switch. The connection with the switch remains established and the hello messages from the switch are returned properly.

Restarting the switch does not solve the problem: the controller that holds the mastership does not free the cluster singleton service. The rest of the controllers, which hold "candidate" status on the cluster singleton (slave mode on the switch), do release the singleton service.

The problem was solved only after restarting the controller that held the mastership of the switch.
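For context, "releasing the mastership" here means letting the MD-SAL cluster singleton service close the local service instance so another node can take ownership. The sketch below is illustrative only, assuming the Boron-era org.opendaylight.mdsal.singleton.common.api interfaces; the class name and the "openflow:1" group identifier are hypothetical, not the actual OpenFlowPlugin code.

// Illustrative sketch (not OpenFlowPlugin code), assuming the Boron-era MD-SAL
// cluster singleton API. The node that wins ownership gets
// instantiateServiceInstance(); releasing mastership requires
// closeServiceInstance() to complete, which is the step that appears stuck here.
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonService;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceProvider;
import org.opendaylight.mdsal.singleton.common.api.ClusterSingletonServiceRegistration;
import org.opendaylight.mdsal.singleton.common.api.ServiceGroupIdentifier;

public class SwitchMastershipService implements ClusterSingletonService {
    // Hypothetical service group for one switch; "openflow:1" is just an example.
    private static final ServiceGroupIdentifier IDENTIFIER =
            ServiceGroupIdentifier.create("openflow:1");

    @Override
    public ServiceGroupIdentifier getIdentifier() {
        return IDENTIFIER;
    }

    @Override
    public void instantiateServiceInstance() {
        // Invoked on the node that becomes master for this switch.
    }

    @Override
    public ListenableFuture<Void> closeServiceInstance() {
        // Invoked when mastership should be released; if this future never
        // completes, the node keeps ownership of the switch indefinitely.
        return Futures.<Void>immediateFuture(null);
    }

    public ClusterSingletonServiceRegistration register(ClusterSingletonServiceProvider provider) {
        return provider.registerClusterSingletonService(this);
    }
}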

I noticed exceptions that may point to the source of the problem. Basically, the following code never returns a requestContext (requestContext is always null).

final RequestContext<O> requestContext = requestContextStack.createRequestContext();
if (requestContext == null) {
    LOG.trace("Request context refused.");
    getMessageSpy().spyMessage(AbstractService.class, MessageSpy.STATISTIC_GROUP.TO_SWITCH_DISREGARDED);
    return failedFuture();
}
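The quota that createRequestContext() draws from is only replenished when each issued RequestContext is later closed. The sketch below is illustrative only (hypothetical class name and quota, not the actual OpenFlowPlugin implementation); it shows how the reported behaviour would arise if contexts are never closed and the permit pool drains.

import java.util.concurrent.Semaphore;

// Illustrative sketch, not the actual OpenFlowPlugin code: a context stack that
// hands out at most QUOTA in-flight requests. Each context must be closed to
// return its permit; leaked (never-closed) contexts eventually make every
// createRequestContext() call return null, which is the symptom reported above.
public class SimpleRequestContextStack {
    private static final int QUOTA = 500;                 // hypothetical queue depth
    private final Semaphore tracker = new Semaphore(QUOTA);

    public AutoCloseable createRequestContext() {
        if (!tracker.tryAcquire()) {
            return null;                                  // "Request context refused."
        }
        // The permit comes back only when the caller closes the context.
        return tracker::release;
    }
}

If a request's response is lost and its context is never closed, the permit is lost with it; after enough such leaks, every subsequent call hits the null branch above.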

And the following logs indicate that the queue is full.

2017-02-24 09:47:34,239 | TRACE | Thread-101 | AbstractService | 287 - org.opendaylight.openflowplugin.impl - 0.3.1.Boron-SR1 | Handling general service call
2017-02-24 09:47:34,239 | TRACE | Thread-101 | RpcContextImpl | 287 - org.opendaylight.openflowplugin.impl - 0.3.1.Boron-SR1 | Device queue org.opendaylight.openflowplugin.impl.rpc.RpcContextImpl@3a502d68 at capacity
2017-02-24 09:47:34,239 | TRACE | Thread-101 | AbstractService | 287 - org.opendaylight.openflowplugin.impl - 0.3.1.Boron-SR1 | Request context refused.

The log "Device queue org.opendaylight.openflowplugin.i mpl.rpc.RpcContextImpl@3a502d68 at capacity" is returned because following code cannot acquire the lock which is a semaphore.

public <T> RequestContext<T> createRequestContext() {
    if (!tracker.tryAcquire()) {
        LOG.trace("Device queue {} at capacity", this);
        return null;
    } else {
        LOG.trace("Acquired semaphore for {}, available permits:{} ",
                nodeInstanceIdentifier.getKey().getId().getValue(), tracker.availablePermits());
    }
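The exhaustion can be demonstrated in isolation with a plain java.util.concurrent.Semaphore: once every permit has been acquired and none released, tryAcquire() keeps returning false, which is exactly the branch that logs "Device queue ... at capacity". The permit count below is hypothetical and kept small for the demonstration.

import java.util.concurrent.Semaphore;

// Stand-alone demonstration of the exhaustion pattern: acquire all permits
// without ever releasing them, and tryAcquire() fails from then on.
public class PermitLeakDemo {
    public static void main(String[] args) {
        final int quota = 4;                      // hypothetical, small for the demo
        final Semaphore tracker = new Semaphore(quota);

        for (int i = 0; i < quota; i++) {
            // Simulates requests whose contexts are never closed/released.
            System.out.println("acquire " + i + ": " + tracker.tryAcquire());   // true
        }
        // All further attempts fail until someone calls tracker.release().
        System.out.println("next acquire: " + tracker.tryAcquire());            // false
        System.out.println("available permits: " + tracker.availablePermits()); // 0
    }
}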



 Comments   
Comment by Jon Castro [ 28/Feb/17 ]

Attachment karaf.log.1.gz has been added with description: karaf file

Comment by Jon Castro [ 28/Feb/17 ]

Attachment karaf.log.2.gz has been added with description: karaf file 2

Comment by Jon Castro [ 28/Feb/17 ]

This issue was produced after running an OpenDaylight cluster of 3 nodes for a few days. After some unknown amount of time (around one or two days) the controllers lose the capability to perform OpenFlow actions because the queue gets full.

The controllers were not performing any particular action when this issue happened; they were just running, with OpenFlow switches connected to them.

This issue should be reproducible after running the controllers in clustered mode for a few days with real hardware switches.

Comment by Jon Castro [ 28/Feb/17 ]

If more logs are required, feel free to let me know what is needed in order to gather that information.

Comment by Anil Vishnoi [ 27/Feb/18 ]

Boron : https://git.opendaylight.org/gerrit/#/c/59131/

Comment by Anil Vishnoi [ 06/Mar/18 ]

This issue is not present in the stable/oxygen, stable/nitrogen, and master branches.
