[CONTROLLER-1070] Clustering: Robot integration tests failing Created: 15/Dec/14  Updated: 24/Jan/15  Resolved: 24/Jan/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Tom Pantelis
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2516
Priority: Normal

 Description   

The clustering integration tests have been failing for a while now. The "010 Credential Authentication" AAA test always fails - I assume this is an issue with the test setup.

Of more concern are the sporadic failures, specifically "Inventory Scalability OF10". An example can be seen at
https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/198/robot/report/log.html#s1-s6-s1-t4.

The "Get Stats for a node" test case looks for "flow-capable-node-connector-statistics" in the REST output. The query is repeated for up to 2 minutes, waiting for it to succeed.
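
The polling behaviour described above can be sketched roughly as follows. This is only an illustration, not the Robot suite's actual code: the function name `wait_for_text`, the `fetch` callable, and the 2-second retry interval are assumptions; only the 2-minute limit and the search string come from the test description.

```python
import time
from typing import Callable


def wait_for_text(fetch: Callable[[], str], needle: str,
                  timeout_s: float = 120.0, interval_s: float = 2.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll fetch() until needle shows up in its output or timeout_s elapses.

    fetch stands in for the REST query (hypothetical); clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if needle in fetch():
            return True          # the text appeared: the test step passes
        if clock() >= deadline:
            return False         # gave up: this is the "timed out" failure
        sleep(interval_s)
```

With a loop of this shape, a test fails only if the string never appears within the whole window, so a single failed statistics write should not matter unless the data stays missing for the full 2 minutes.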

Looking at karaf.log, the following OptimisticLockFailedException appears to correspond to the test timeout failure:

2014-12-12 15:30:56,993 | WARN | lt-dispatcher-30 | InMemoryDOMDataStore | 145 - org.opendaylight.controller.sal-inmemory-datastore - 1.2.0.SNAPSHOT | Store Tx: member-1-shard-inventory-operational-441 Conflicting modification for /(urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node[{(urn:opendaylight:inventory?revision=2013-08-19)id=openflow:2}].
2014-12-12 15:30:56,994 | WARN | lt-dispatcher-70 | ConcurrentDOMDataBroker | 139 - org.opendaylight.controller.sal-broker-impl - 1.2.0.SNAPSHOT | Tx: DOM-CHAIN-0-217 Error during phase CAN_COMMIT, starting Abort
java.util.concurrent.ExecutionException: OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.]]}

at com.google.common.util.concurrent.Futures$ImmediateFailedFuture.get(Futures.java:183)[52:com.google.guava:14.0.1]
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCanCommit(ShardCommitCoordinator.java:138)[259:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:128)[259:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleCanCommitTransaction(Shard.java:389)[259:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]

...

2014-12-12 15:30:57,003 | WARN | ds-oper-thread-0 | StatisticsManagerImpl | 153 - org.opendaylight.controller.md.statistics-manager - 1.2.0.SNAPSHOT | Unhandled exception during processing statistics. Restarting transaction chain.
...
2014-12-12 15:30:57,003 | WARN | CommitFutures-2 | StatisticsManagerImpl | 153 - org.opendaylight.controller.md.statistics-manager - 1.2.0.SNAPSHOT | Failed to export Flow Capable Statistics, Transaction DOM-CHAIN-0-217 failed.
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.]]}
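
To check whether log entries like the ones above line up with a test failure, a small script can pull the WARN/ERROR entries that fall near a given time. This is a rough sketch that assumes the pipe-separated karaf.log layout seen in the excerpt (timestamp | level | thread | logger | ...); the function name `entries_near` and the 5-second window are hypothetical, and the failure time has to be taken from the Robot report.

```python
from datetime import datetime


def entries_near(lines, when, window_s=5.0, levels=("WARN", "ERROR")):
    """Return (timestamp, level, logger) for log entries within window_s of when.

    Assumes each entry starts with 'YYYY-MM-DD HH:MM:SS,mmm | LEVEL | thread | logger | ...'.
    Lines that do not match (stack traces, wrapped messages) are skipped.
    """
    hits = []
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # continuation line, not the start of an entry
        try:
            ts = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # first field is not a timestamp
        if fields[1] in levels and abs((ts - when).total_seconds()) <= window_s:
            hits.append((ts, fields[1], fields[3]))
    return hits
```

For example, `entries_near(open("karaf.log"), failure_time)` would list the loggers that emitted warnings around `failure_time`, where `failure_time` is a `datetime` supplied by the caller.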

Note the "Node was deleted by other transaction." error message. This appears to indicate some code (StatisticsManager?) is supposed to put stats under an OF node (id=openflow:2) but some parent node doesn't exist (probably the "openflow:2" Node itself), either because:
1) it was deleted
2) it hadn't been inserted yet
3) a previous insert was attempted but failed

It doesn't appear to be #3 because I don't see any previous commit failures in the log.
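
Case 1 can be illustrated with a toy model of optimistic concurrency. This is not the InMemoryDOMDataStore implementation; `ToyStore`, `ToyTx` and `ConflictingModification` are made-up names, and the sketch only shows how a write under a parent that another transaction deleted in the meantime gets rejected at commit time.

```python
class ConflictingModification(Exception):
    """Raised when a commit conflicts with a change made since the snapshot."""


class ToyStore:
    def __init__(self):
        self.data = {}  # path tuple -> value

    def new_tx(self):
        return ToyTx(self)


class ToyTx:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.data)  # view of the tree when the tx started
        self.ops = []

    def put(self, path, value):
        self.ops.append(("put", path, value))

    def delete(self, path):
        self.ops.append(("delete", path, None))

    def commit(self):
        live = self.store.data
        # "can commit" phase: the parent seen in the snapshot must still exist
        for op, path, _ in self.ops:
            parent = path[:-1]
            if op == "put" and parent in self.snapshot and parent not in live:
                raise ConflictingModification("Node was deleted by other transaction.")
        # apply phase
        for op, path, value in self.ops:
            if op == "put":
                live[path] = value
            else:
                # a delete removes the node and everything below it
                for key in [k for k in live if k[:len(path)] == path]:
                    del live[key]


store = ToyStore()
store.data[("nodes",)] = {}
store.data[("nodes", "openflow:2")] = {}

stats_tx = store.new_tx()   # e.g. a statistics writer
stats_tx.put(("nodes", "openflow:2", "stats"), {"packets": 1})

other_tx = store.new_tx()   # e.g. some code removing the node
other_tx.delete(("nodes", "openflow:2"))
other_tx.commit()

try:
    stats_tx.commit()
    conflict = None
except ConflictingModification as exc:
    conflict = str(exc)
```

In this toy model the statistics writer did nothing wrong by itself; it lost a race with the delete, which is the kind of timing dependence that would make the failure sporadic.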

There are also sporadic failures in "Compatible.AD SAL NSF OF10" - https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/205/robot/report/log.html - where "FlowProgrammer.Check flow in flow stats" and "StatisticsManager.get port stats" fail.

These same tests have not been failing without clustering.



 Comments   
Comment by Tom Pantelis [ 05/Jan/15 ]

The OptimisticLockFailedExceptions occur w/o clustering as well and aren't related to the test failures.

Anil identified an issue in the StatisticsManager that is fixed by https://bugs.opendaylight.org/show_bug.cgi?id=2551. This appears to be the cause of the intermittent test failures. The issue is actually not related to clustering, although the failures only seemed to occur with clustering enabled; this may just be coincidence, or clustering may change the timing of things enough to increase the chance of test failures.

I'll leave this open for a bit to verify the integration tests succeed.

Comment by Tom Pantelis [ 24/Jan/15 ]

There are still intermittent test failures, but I've also seen similar failures in the integration-master-csit-compatible-min tests w/o clustering.

Closing this bug...
