[OPNFLWPLUG-859] Internal SalService queue gets full and ODL does not perform any more OpenFlow actions on the switch and does not release mastership Created: 23/Feb/17 Updated: 27/Sep/21 Resolved: 06/Mar/18 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | Boron |
| Type: | Bug | ||
| Reporter: | Jon Castro | Assignee: | Jalpa Modasiya |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | karaf.log.1.gz, karaf.log.2.gz |
| External issue ID: | 7846 |
| Description |
|
After running ODL for a few days, it seems some internal queue gets full and SalService no longer allows any action to be performed on the switch. Connectivity with the switch is established and the hello messages from the switch are returned properly. Restarting the switch does not solve the problem: the controller that holds the mastership does not free the cluster singleton service. The other controllers that hold "candidate" status on the cluster singleton (slave mode on the switch) do release the singleton service. The problem was solved only after restarting the controller that holds mastership of the switch. I noticed exceptions that may point to the source of the problem. Basically, the following code does not return a requestContext (requestContext is always null): final RequestContext<O> requestContext = requestContextStack.createRequestContext();
And the following log indicates that the queue is full: 2017-02-24 09:47:34,239 | TRACE | Thread-101 | AbstractService | 287 - org.opendaylight.openflowplugin.impl - 0.3.1.Boron-SR1 | Handling general service call The log "Device queue org.opendaylight.openflowplugin.impl.rpc.RpcContextImpl@3a502d68 at capacity" is emitted because the following code cannot acquire the lock, which is a semaphore: public <T> RequestContext<T> createRequestContext() { |
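For reference, below is a minimal sketch of the mechanism described above, assuming a semaphore-guarded per-device request queue. This is not the actual OpenFlowPlugin source; the class and member names (RpcContextSketch, tracker, QUEUE_DEPTH, handleServiceCall) are illustrative, and the truncated createRequestContext() body is filled in as an assumption based on the behaviour reported in the description:

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of a semaphore-guarded request queue. Once all permits
// are taken (and never released back), createRequestContext() keeps returning
// null and every subsequent RPC fails with the "at capacity" message.
public class RpcContextSketch {

    private static final int QUEUE_DEPTH = 500;                 // assumed queue size
    private final Semaphore tracker = new Semaphore(QUEUE_DEPTH, true);

    /** Simplified stand-in for the plugin's RequestContext type. */
    public static final class RequestContext<T> {
        // In the real plugin the associated permit should be returned when the
        // request completes; if completions are lost, permits leak and the
        // queue stays "full" forever, matching the reported symptom.
    }

    public <T> RequestContext<T> createRequestContext() {
        // No permit available -> the device queue is considered at capacity and
        // the caller receives null (the always-null requestContext seen above).
        if (!tracker.tryAcquire()) {
            System.err.println("Device queue " + this + " at capacity");
            return null;
        }
        return new RequestContext<>();
    }

    /** Caller-side pattern: a null context means the RPC cannot be queued. */
    public <O> boolean handleServiceCall() {
        final RequestContext<O> requestContext = createRequestContext();
        if (requestContext == null) {
            return false;   // fail the RPC; no permit was acquired
        }
        try {
            // ... build and send the OpenFlow request ...
            return true;
        } finally {
            // Release the permit once the request is finished (done immediately
            // after sending here for simplicity; the real plugin would release
            // when the response arrives or the request times out).
            tracker.release();
        }
    }
}
```

If the release step is ever skipped, for example because a response is never matched to its request, the semaphore drains and the controller ends up in exactly the state described: connectivity is fine, but every new service call is rejected at the queue.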
| Comments |
| Comment by Jon Castro [ 28/Feb/17 ] |
|
Attachment karaf.log.1.gz has been added with description: karaf file |
| Comment by Jon Castro [ 28/Feb/17 ] |
|
Attachment karaf.log.2.gz has been added with description: karaf file 2 |
| Comment by Jon Castro [ 28/Feb/17 ] |
|
This issue was produced after running an OpenDaylight cluster of 3 nodes for a few days. After some unknown time (around one or two days) the controllers lose the capability to perform OpenFlow actions because the queue gets full. The controllers were not performing any particular action when this happened; they were simply running, with OpenFlow switches connected to them. This issue should be reproducible after running the controllers in clustered mode for a few days with real hardware switches. |
| Comment by Jon Castro [ 28/Feb/17 ] |
|
If more logs are required, let me know what is needed and I will gather that information. |
| Comment by Anil Vishnoi [ 27/Feb/18 ] |
| Comment by Anil Vishnoi [ 06/Mar/18 ] |
|
This issue is not present in the stable/oxygen, stable/nitrogen, and master branches. |