[BGPCEP-237] PingPongTransaction race during BGP stress testing (RIB only) Created: 10/Jun/15 Updated: 03/Mar/19 Due: 17/Jun/15 Resolved: 16/Jun/15 |
|
| Status: | Resolved |
| Project: | bgpcep |
| Component/s: | BGP |
| Affects Version/s: | Bugzilla Migration |
| Fix Version/s: | Bugzilla Migration |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Claudio David Gasparini |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 3664 |
| Description |
|
Affects build distribution-karaf-0.3.0-20150610.152928-2445.tar.gz.
Previous builds were able to ingest a large number of fake routes (with CDS and topology). But this build, when connected to a real Internet feed, showed an empty topology (while bgp-rib looked OK at first glance). karaf.log is full of these exceptions (it has grown to over 200 MB):
2015-06-10 17:22:50,914 | ERROR | lt-dispatcher-14 | DataTreeChangeListenerActor | 179 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Error notifying listener org.opendaylight.protocol.bgp.rib.impl.LocRibWriter@2623e71 raced with transacion PingPongTransaction {delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@11c1f060} at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.slowAllocateTransaction(PingPongTransactionChain.java:132)[154:org.opendaylight.controller.sal-broker-impl:1.2.0.SNAPSHOT]
When run with topology disabled, karaf.log no longer grows to hundreds of megabytes, but there are still a few dozen of these exceptions. |
| Comments |
| Comment by Robert Varga [ 10/Jun/15 ] |
|
This looks like an attempt to allocate two transactions at the same time from the broker. This would indicate that the LocRibWriter is invoked from multiple threads at the same time, or that it shares a transaction chain with some other BGP component and they do not synchronize properly on its use. |
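To make the failure mode concrete, here is a deliberately simplified model of a transaction chain that allows only one outstanding (unsubmitted) transaction at a time. It is a hypothetical sketch, not the OpenDaylight PingPongTransactionChain: the class `SingleOutstandingChain`, its `Tx` type and its methods are illustrative names. It shows why a second allocation fails while the first transaction is still open, whether the second caller is another thread or the same code path that never submitted.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical model of a chain with a single outstanding-transaction slot.
 * Not the real PingPongTransactionChain; it only illustrates the race.
 */
final class SingleOutstandingChain {
    /** Minimal stand-in for a read-write transaction. */
    static final class Tx {
        final int id;

        Tx(int id) {
            this.id = id;
        }
    }

    private final AtomicInteger nextId = new AtomicInteger();
    // The currently open (not yet submitted) transaction, or null when idle.
    private final AtomicReference<Tx> outstanding = new AtomicReference<>();

    /** Allocates a transaction; fails if another one is still outstanding. */
    Tx newTransaction() {
        Tx tx = new Tx(nextId.getAndIncrement());
        // Atomically claim the single slot. Losing the CAS means a concurrent
        // caller, or an earlier transaction that was never submitted, holds it.
        if (!outstanding.compareAndSet(null, tx)) {
            throw new IllegalStateException(
                "New transaction " + tx.id + " raced with an outstanding transaction");
        }
        return tx;
    }

    /** Submits the outstanding transaction and frees the slot. */
    void submit(Tx tx) {
        if (!outstanding.compareAndSet(tx, null)) {
            throw new IllegalStateException("Transaction " + tx.id + " is not the outstanding one");
        }
    }
}
```

Allocating, then allocating again without submitting, throws `IllegalStateException`; after `submit(first)` the next allocation succeeds.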
| Comment by Robert Varga [ 10/Jun/15 ] |
|
So the BGP code does not seem to be sharing the chain, which leaves the following possibilities:
1) onDataTreeChanged() is called concurrently from CDS
I think 1) is precluded from happening by the fact that each registration is backed by a dedicated Actor and there is only a single registration. I think 3) is not likely, which leaves us with 2), which requires examining the logs leading up to the first occurrence of this exception. |
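The argument against possibility 1) rests on "one dedicated Actor per registration" implying serialized callbacks. The sketch below illustrates that property with a single-threaded executor acting as a mailbox. It is an assumption-laden analogy, not the CDS DataTreeChangeListenerActor: `SerialDispatcher` and `post` are hypothetical names. However many threads post notifications, the listener runs on one thread, one call at a time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/**
 * Hypothetical per-registration dispatcher: a single-threaded executor plays
 * the role of an actor mailbox, so listener invocations never overlap.
 */
final class SerialDispatcher<T> implements AutoCloseable {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private final Consumer<T> listener;

    SerialDispatcher(Consumer<T> listener) {
        this.listener = listener;
    }

    /** May be called from any thread; delivery happens one at a time, in enqueue order. */
    void post(T change) {
        mailbox.execute(() -> listener.accept(change));
    }

    /** Stops accepting posts and waits for already-queued deliveries to finish. */
    @Override
    public void close() throws InterruptedException {
        mailbox.shutdown();
        mailbox.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

If the listener is only ever reached through such a dispatcher, a concurrent allocation from the chain has to come from somewhere else, which is why the comment turns to the logs.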
| Comment by Vratko Polak [ 11/Jun/15 ] |
|
Here is a segment of the log, from the last INFO message to the first ERROR. Build distribution-karaf-0.3.0-20150611.074453-2454.tar.gz:
2015-06-11 08:12:35,586 | WARN | ult-dispatcher-4 | AbstractTopologyBuilder | 238 - org.opendaylight.bgpcep.bgp-topology-provider - 0.4.0.SNAPSHOT | Data change org.opendaylight.controller.md.sal.binding.impl.LazyDataTreeModification@17a425e9 was not completely propagated to listener org.opendaylight.bgpcep.bgp.topology.provider.Ipv4ReachabilityTopologyBuilder@2e937711, aborting |
| Comment by Vratko Polak [ 11/Jun/15 ] |
|
For topology, this warning already appears:
2015-06-11 08:27:34,595 | WARN | lt-dispatcher-19 | OperationLimiter | 179 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Failed to acquire operation permit for transaction member-1-chn-3-txn-3773 |
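The OperationLimiter warning suggests that a transaction had too many operations in flight and could not get a permit in time. A common way to build such a limiter is a counting semaphore with a timed acquire; the sketch below shows that idea under that assumption. It is not the controller's OperationLimiter class: `OperationLimiterSketch`, its constructor parameters and its timeout handling are hypothetical. The printed message simply mirrors the WARN line above.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical per-transaction operation limiter: each in-flight operation
 * holds a permit; a caller that cannot get one within the timeout gives up.
 */
final class OperationLimiterSketch {
    private final String transactionId;
    private final Semaphore permits;
    private final long timeoutMillis;

    OperationLimiterSketch(String transactionId, int maxOutstanding, long timeoutMillis) {
        this.transactionId = transactionId;
        this.permits = new Semaphore(maxOutstanding);
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true if a permit was obtained, false on timeout or interruption. */
    boolean acquire() {
        try {
            if (permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
        System.err.println("Failed to acquire operation permit for transaction " + transactionId);
        return false;
    }

    /** Called when an operation completes, freeing its permit. */
    void release() {
        permits.release();
    }
}
```

With a limit of 2, a third `acquire()` times out until one of the earlier operations calls `release()`. A writer that floods a single transaction with operations (for example, a full Internet feed) would hit this limit far sooner than a small synthetic test.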
| Comment by Dana Kutenicsova [ 11/Jun/15 ] |
|
Looks related to bug 3300 (the ToExternalImportPolicy bug). |
| Comment by Dana Kutenicsova [ 11/Jun/15 ] |
|
If the last exception is seen only after synchronization, that suggests that LocRib puts a subtree-modified DCN on route { uptodate } and tries to push it to AdjRibsOut, which treats it as a classic route with attributes; since the attributes are not present (as they shouldn't be), it throws the NPE. |
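Assuming Dana's diagnosis is right, the NPE comes from code that treats every modified child as a route and dereferences its attributes unconditionally. The sketch below contrasts that with a guarded version. All names (`AdjRibOutSketch`, `RibEntry`, `exportable`, `attributeCountUnsafe`) are hypothetical, and attributes are modeled as a plain map; this is not the bgpcep data model.

```java
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical model of Loc-RIB entries as seen by an Adj-RIB-Out writer.
 * An entry without attributes (such as the uptodate marker) is not a route.
 */
final class AdjRibOutSketch {
    static final class RibEntry {
        final String key;
        // Null when the entry carries no route attributes, e.g. the uptodate marker.
        final Map<String, String> attributes;

        RibEntry(String key, Map<String, String> attributes) {
            this.key = key;
            this.attributes = attributes;
        }
    }

    /** Guarded export: entries without attributes are skipped instead of exported. */
    static Optional<Map<String, String>> exportable(RibEntry entry) {
        if (entry.attributes == null) {
            return Optional.empty();
        }
        return Optional.of(entry.attributes);
    }

    /** Unguarded variant, for contrast: throws NullPointerException for the marker. */
    static int attributeCountUnsafe(RibEntry entry) {
        return entry.attributes.size();
    }
}
```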
| Comment by Robert Varga [ 11/Jun/15 ] |
|
Needs to be fixed before Lithium goes out. |
| Comment by Claudio David Gasparini [ 15/Jun/15 ] |
| Comment by Robert Varga [ 15/Jun/15 ] |
|
Patch to ensure the transaction is committed no matter what: https://git.opendaylight.org/gerrit/22642 |
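The idea behind "committed no matter what" can be shown with a try/finally around the per-notification processing. If processing throws and the transaction is never submitted, it stays outstanding, and the next notification's allocation fails with exactly the kind of race reported here. The sketch is hypothetical and does not reproduce the change under review: `AlwaysSubmitListener`, its nested `Chain` and the `processor` callback are illustrative names, and a transaction is modeled as a list of pending writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/**
 * Hypothetical listener that always submits its transaction, even when
 * processing a notification fails part-way through.
 */
final class AlwaysSubmitListener {
    /** Minimal chain allowing one open transaction (a list of pending writes). */
    static final class Chain {
        private List<String> open; // pending writes of the open transaction, null if none
        final List<String> committed = new ArrayList<>();

        List<String> newTransaction() {
            if (open != null) {
                throw new IllegalStateException("Previous transaction is still outstanding");
            }
            open = new ArrayList<>();
            return open;
        }

        void submit(List<String> tx) {
            if (tx != open) {
                throw new IllegalStateException("Not the outstanding transaction");
            }
            committed.addAll(tx);
            open = null;
        }
    }

    private final Chain chain;
    private final BiConsumer<String, List<String>> processor; // may throw at any point

    AlwaysSubmitListener(Chain chain, BiConsumer<String, List<String>> processor) {
        this.chain = chain;
        this.processor = processor;
    }

    void onDataTreeChanged(String change) {
        List<String> tx = chain.newTransaction();
        try {
            processor.accept(change, tx);
        } finally {
            // Submit even if processing threw, so the chain is free for the next notification.
            chain.submit(tx);
        }
    }
}
```

Note the trade-off: writes made before the failure are committed too, and the exception still propagates to the caller for logging. Without the `finally`, the second notification would fail in `newTransaction()`.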
| Comment by Dana Kutenicsova [ 16/Jun/15 ] |