[BGPCEP-785] OptimisticLockFailedException when closing multiple sessions Created: 06/Apr/18  Updated: 14/Jun/18  Resolved: 14/Jun/18

Status: Verified
Project: bgpcep
Component/s: BGP
Affects Version/s: Fluorine
Fix Version/s: Fluorine

Type: Bug Priority: Medium
Reporter: Tomas Markovic Assignee: Claudio David Gasparini
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 10_conf.py     File karaf-mdsal-trace.log     File karaf.log     File manypeers_10.sh     File odl1_karaf.log.gz     File play.py    

 Description   

org.opendaylight.mdsal.common.api.OptimisticLockFailedException: Optimistic lock failed for path /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.11}]/adj-rib-in/tables

 

This appears to be an internal race condition between DS modification submissions.
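
For context on the failure mode: the datastore commits optimistically, so a transaction prepared against a snapshot is rejected at commit time if another transaction has modified or deleted the same node in the meantime. Below is a minimal, self-contained simulation of that behaviour in plain Java; it is not the MD-SAL API, and every name in it is illustrative.

// Minimal simulation of the optimistic-locking failure mode (plain Java;
// NOT the MD-SAL API, all names are illustrative). A transaction records the
// node version when it starts; at commit time the store rejects it if the
// node changed, or was deleted, in the meantime.
import java.util.concurrent.atomic.AtomicReference;

public class OptimisticLockDemo {

    // The stored node: data plus a version counter. null means "deleted".
    static final class Node {
        final long version;
        final String data;
        Node(long version, String data) { this.version = version; this.data = data; }
    }

    static final AtomicReference<Node> STORE = new AtomicReference<>(new Node(1, "tables"));

    // Commit succeeds only if the node still carries the version we snapshotted.
    static boolean commit(long expectedVersion, String newData) {
        Node current = STORE.get();
        if (current == null || current.version != expectedVersion) {
            return false; // analogous to OptimisticLockFailedException
        }
        return STORE.compareAndSet(current, new Node(expectedVersion + 1, newData));
    }

    public static void main(String[] args) {
        long snapshot = STORE.get().version;           // tx1 snapshots version 1
        STORE.set(null);                               // tx2 deletes the node and commits first
        System.out.println(commit(snapshot, "route")); // tx1 fails: prints false
    }
}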

 

 

Steps to replicate the bug

  • cherry-pick https://git.opendaylight.org/gerrit/#/c/72568/
  • cherry-pick https://git.opendaylight.org/gerrit/#/c/72566/ (BGP with some changes adding logs and removing others to make log investigation easier)
  • run ODL
  • feature:install odl-restconf odl-bgpcep-bgp odl-bgpcep-bgp-config-example
  • ./configureall.py
  • ./manypeersibgp.sh (run it for a few seconds; once you see "INFO BGP-Dummy-4: Iteration: 2000 - total remaining prefixes: 98000" it has run long enough and can be killed with Ctrl+C)
  • check the error in the logs

 

Observation summary

  1. The peer writes routes via the DOM chain
  2. The peer session is closed
  3. The DOM chain is closed
  4. The peer removes its tables from the DS via the binding chain
  5. The binding chain is closed

The changes coming from step 1 [0] clash with the changes coming from step 4.

The interesting part is that even though step 1 is done and submitted before step 4 (the main reason for this conclusion is that step 4 does not fail to close), step 4 somehow appears to be applied before step 1 has fully taken effect; the sketch below illustrates how this can happen.
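
A minimal, self-contained sketch of that ordering hazard (plain Java; the two single-thread executors are hypothetical stand-ins for the two transaction chains): the chain that submits first does not necessarily commit first.

// Two independent "chains" backed by their own executors. Chain 1 submits its
// write before chain 2 submits its delete, but nothing orders the two commits,
// so the delete can land first and the write then fails against a node that
// no longer exists.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChainRaceDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService chain1 = Executors.newSingleThreadExecutor(); // the peer's adj-rib-in writes
        ExecutorService chain2 = Executors.newSingleThreadExecutor(); // the peer removal on close

        // Submitted first, but deliberately slow: the "write routes" commit (step 1).
        CompletableFuture<String> write = CompletableFuture.supplyAsync(() -> {
            sleep(100);
            return "write committed";
        }, chain1);

        // Submitted second, but fast: the "delete peer subtree" commit (step 4).
        CompletableFuture<String> delete = CompletableFuture.supplyAsync(() -> "delete committed", chain2);

        System.out.println(delete.get()); // finishes first
        System.out.println(write.get());  // finishes last -> conflicts in the real datastore

        chain1.shutdown();
        chain2.shutdown();
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}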

 Logs attached.

 

[0] 2018-06-01T15:28:26,038 | ERROR | CommitFutures-4 | AdjRibInWriter | 224 - org.opendaylight.bgpcep.bgp-rib-impl - 0.10.0.SNAPSHOT | Write routes failed
org.opendaylight.controller.md.sal.common.api.data.OptimisticLockFailedException: Optimistic lock failed for path /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.11}]/adj-rib-in/tables
at org.opendaylight.controller.cluster.databroker.compat.LegacyDOMDataBrokerAdapter$1.newWithCause(LegacyDOMDataBrokerAdapter.java:66) [270:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.compat.LegacyDOMDataBrokerAdapter$1.newWithCause(LegacyDOMDataBrokerAdapter.java:62) [270:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:106) [343:org.opendaylight.yangtools.util:2.0.3]
at org.opendaylight.controller.cluster.databroker.compat.LegacyDOMDataBrokerAdapter$DOMDataTransactionAdapter$1.onFailure(LegacyDOMDataBrokerAdapter.java:301) [270:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387) [75:com.google.guava:23.6.0.jre]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) [75:com.google.guava:23.6.0.jre]
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) [75:com.google.guava:23.6.0.jre]
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) [75:com.google.guava:23.6.0.jre]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713) [75:com.google.guava:23.6.0.jre]
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:94) [75:com.google.guava:23.6.0.jre]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
at java.lang.Thread.run(Thread.java:748) [?:?]
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.

 Comments   
Comment by Claudio David Gasparini [ 04/Jun/18 ]

Replicated using the mdsal trace tool. We can confirm that there are no changes coming from anywhere other than the identified places (steps 1 and 4).

The conclusion points to a race condition between chain1 (the update of Peer/Adj-rib-in) and chain2 (the delete of the peer).

 

Logs added

Comment by Claudio David Gasparini [ 13/Jun/18 ]

Multiple chains can be submitted at different times, but the chain submitted first will not necessarily finish first. So what is happening here is simply that the ribout removal done by LocRibWriter is sometimes applied after the peer has already been closed and removed from the Data Store, ending in an OptimisticLockFailedException.

To solve this, we need to make the Peer the only one in charge of handling and submitting changes under the Peer ribs, and vice versa: the Rib applies through its chains only changes related to the Rib ribs (loc-rib-in, ...).

This way, when a Peer session is closed and its cleanup is done, it won't conflict with changes made by another Rib thread handling changes from different peers. A sketch of this single-writer arrangement follows.
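
A minimal sketch of that single-writer idea, assuming each peer funnels every modification of its subtree through one serialized executor (all names here are hypothetical, not the actual bgpcep classes): cleanup on session close is then queued behind any in-flight route updates and cannot race with them.

// Hypothetical single-writer sketch: one single-threaded executor per owner
// serializes every write to its subtree, so the delete issued on close can
// never overtake an earlier in-flight write.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SubtreeOwner {
    private final ExecutorService serializer = Executors.newSingleThreadExecutor();

    // All modifications of this owner's subtree go through here, in submission order.
    public void submit(Runnable modification) {
        serializer.execute(modification);
    }

    // Cleanup runs only after every previously submitted modification has finished.
    public void close() throws InterruptedException {
        serializer.execute(() -> System.out.println("remove peer subtree from the DS"));
        serializer.shutdown();
        serializer.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        SubtreeOwner peer = new SubtreeOwner();
        peer.submit(() -> System.out.println("write adj-rib-in routes"));
        peer.close(); // always prints the write before the removal
    }
}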
