[CONTROLLER-1330] Clustering errors under concurrent load Created: 21/May/15 Updated: 19/Oct/17 Resolved: 21/May/15 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Gary Wu | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 3329 |
| Description |
|
I set up a simple 3 node cluster and ran the flow_config_blaster.py script in the integration repo against it. This script allows you to create flows using multiple threads. I found that whenever I use multiple threads (e.g. 10 threads), a substantial majority of the add-flow requests would fail. It seems that only the first request succeeds, while other concurrent requests get stuck. This is with the latest integration build from master.

The leader node produces log entries like these:

2015-05-20 13:07:18,443 | WARN | lt-dispatcher-17 | ConcurrentDOMDataBroker | 212 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Tx: DOM-336 Error during phase CAN_COMMIT, starting Abort

While the follower nodes produce log entries like this:

2015-05-20 13:07:14,093 | WARN | lt-dispatcher-17 | Shard | 205 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | ApplyState took more time than expected. Elapsed Time = 72 ms ApplyState = ApplyState {identifier='null', replicatedLogEntry.index =1771, startTime=1828178749622}

Sequential add-flow (e.g. one thread only) works fine. Multithreaded add-flow works fine when the cluster has only a single node. |
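For readers who want to reproduce the sequential-versus-concurrent comparison, the following is a minimal Python sketch of such a harness. It is not the flow_config_blaster.py script: `run_add_flow_load` and `fake_add_flow` are hypothetical names, and the stub stands in for whatever request actually adds a flow on the controller.

```python
from concurrent.futures import ThreadPoolExecutor


def run_add_flow_load(add_flow, total_flows, num_threads):
    """Call add_flow(flow_id) for ids 0..total_flows-1 on num_threads workers.

    add_flow is any callable that returns a truthy value on success.
    Exceptions are counted as failures. Returns (succeeded, failed).
    """

    def attempt(flow_id):
        try:
            return bool(add_flow(flow_id))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(attempt, range(total_flows)))

    succeeded = sum(results)
    return succeeded, len(results) - succeeded


def fake_add_flow(flow_id):
    # Hypothetical stand-in: a real test would send the add-flow
    # request to a cluster node here and check the response status.
    return True


if __name__ == "__main__":
    # One thread corresponds to the sequential case, ten to the concurrent one.
    for threads in (1, 10):
        ok, bad = run_add_flow_load(fake_add_flow, 100, threads)
        print(f"threads={threads} succeeded={ok} failed={bad}")
```

Running the same workload with 1 and then 10 threads and comparing the failure counts mirrors the comparison reported here, assuming `fake_add_flow` is replaced by a real request function.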
| Comments |
| Comment by Gary Wu [ 21/May/15 ] |
|
Sequential test case that works fine:
Parallel test case that fails:
Three node cluster with every shard replicated onto every node. |
| Comment by Gary Wu [ 21/May/15 ] |
|
The problem seems to have gone away with today's build: |
| Comment by Moiz Raja [ 21/May/15 ] |
|
Most likely due to this fix https://git.opendaylight.org/gerrit/#/c/20890/ |