[NETVIRT-596] getLock() invoked from createGroupId is caught up in an endless loop - can't connect to new L3 VMs Created: 05/Apr/17  Updated: 19/Oct/17  Resolved: 25/Apr/17

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Koby Aizer Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive getLock.zip    
External issue ID: 8160

 Description   

I ran a test in a loop, that basically creates VMs with FIPs, test connectivity, and deletes them.
After 70 iterations, new VMs connectvity stopped working.

These errors started showing up endlessly:
2017-04-05 03:20:46,877 | INFO | eChangeHandler-0 | LockManager | 351 - org.opendaylight.genius.lockmanager-impl - 0.2.0.SNAPSHOT | Already locked after waiting 1000ms, try 9819
2017-04-05 03:20:47,878 | INFO | eChangeHandler-0 | LockManager | 351 - org.opendaylight.genius.lockmanager-impl - 0.2.0.SNAPSHOT | Already locked after waiting 1000ms, try 9820
2017-04-05 03:20:48,878 | INFO | eChangeHandler-0 | LockManager | 351 - org.opendaylight.genius.lockmanager-impl - 0.2.0.SNAPSHOT | Already locked after waiting 1000ms, try 9821

This is the stack call for this lock:
"AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0" #4442 daemon prio=5 os_prio=0 tid=0x00007fd1ec08d800 nid=0x5266 waiting on condition [0x00007fd166405000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.opendaylight.lockmanager.LockManager.getLock(LockManager.java:148)
at org.opendaylight.lockmanager.LockManager.lock(LockManager.java:67)
at java.lang.invoke.LambdaForm$DMH/23968864.invokeInterface_LL_L(LambdaForm$DMH)
at java.lang.invoke.LambdaForm$BMH/931340740.reinvoke(LambdaForm$BMH)
at java.lang.invoke.LambdaForm$MH/1202781631.invokeExact_MT(LambdaForm$MH)
at org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)
at org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:52)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:83)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:70)
at org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:39)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:177)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102)
at Proxy83c929da_f599_4e8e_ab68_e2200f2efd8f.invokeRpc(Unknown Source)
at Proxy5c5c056d_30f3_4662_800b_9699ba76b251.invokeRpc(Unknown Source)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke0(RpcServiceAdapter.java:65)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.access$000(RpcServiceAdapter.java:43)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter$RpcInvocationStrategy.invoke(RpcServiceAdapter.java:159)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke(RpcServiceAdapter.java:96)
at com.sun.proxy.$Proxy124.lock(Unknown Source)
at org.opendaylight.genius.idmanager.IdUtils.lock(IdUtils.java:230)
at org.opendaylight.genius.idmanager.IdManager.allocateId(IdManager.java:213)
at java.lang.invoke.LambdaForm$DMH/23968864.invokeInterface_LL_L(LambdaForm$DMH)
at java.lang.invoke.LambdaForm$BMH/968378519.reinvoke(LambdaForm$BMH)
at java.lang.invoke.LambdaForm$MH/1202781631.invokeExact_MT(LambdaForm$MH)
at org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)
at org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:52)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:83)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:70)
at org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:39)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:177)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102)
at Proxy83c929da_f599_4e8e_ab68_e2200f2efd8f.invokeRpc(Unknown Source)
at Proxy5c5c056d_30f3_4662_800b_9699ba76b251.invokeRpc(Unknown Source)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke0(RpcServiceAdapter.java:65)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.access$000(RpcServiceAdapter.java:43)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter$RpcInvocationStrategy.invoke(RpcServiceAdapter.java:159)
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke(RpcServiceAdapter.java:96)
at com.sun.proxy.$Proxy127.allocateId(Unknown Source)
at org.opendaylight.netvirt.natservice.internal.NatUtil.createGroupId(NatUtil.java:773)
at org.opendaylight.netvirt.natservice.internal.RouterDpnChangeListener.removeSNATFromDPN(RouterDpnChangeListener.java:327)
at org.opendaylight.netvirt.natservice.internal.RouterDpnChangeListener.remove(RouterDpnChangeListener.java:193)
at org.opendaylight.netvirt.natservice.internal.RouterDpnChangeListener.remove(RouterDpnChangeListener.java:37)
at org.opendaylight.genius.datastoreutils.AsyncDataTreeChangeListenerBase$DataTreeChangeHandler.run(AsyncDataTreeChangeListenerBase.java:148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



 Comments   
Comment by Koby Aizer [ 05/Apr/17 ]

Attachment getLock.zip has been added with description: logs

Comment by Periyasamy Palanisamy [ 06/Apr/17 ]

Kency would raise a review shortly with a potential fix (done through code study) and more info logs to figure out exact root cause.

Comment by Koby Aizer [ 06/Apr/17 ]

Thanks!

Another thought here - even if we fix the root cause, I think we must remove this busy loop. This totally breaks the system when occurring

Comment by Periyasamy Palanisamy [ 06/Apr/17 ]

Looks like this issue occurred when an idKey is not unlocked accidentally in IdManager which leads to LockManager#lock goes into busy spin. But this is a corner case due to a bug.

We could avoid it by thread wait/notify by Async Clustered DTCN on lock (since lock is stored in MD-SAL), upon remove DCN, waiting threads can be notified for the particular lock. Not sure its worth to do it now and should be done as an enhancement since it needs more testing effort.

Comment by Kency Kurian [ 06/Apr/17 ]

Raised the review for this:
https://git.opendaylight.org/gerrit/#/c/54414/

Comment by Koby Aizer [ 06/Apr/17 ]

Thanks guys for handling this.. Could've taken me a long time to find it myself

Should we open another ticket on removing the busy loop future fix?

Generated at Wed Feb 07 20:21:58 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.