[CONTROLLER-1724] Robot test for data change listener reports data inconsistency Created: 23/Jun/17 Updated: 25/Jul/23 |
|
| Status: | Confirmed |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Vratko Polak |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 8733 |
| Description |
|
This is the robot symptom part of I am opening this because we have changed the test suite to not remove shard replicas; instead, cluster-admin:make-leader-local is used to force leader movement, so the recent inconsistency [1] is not caused by
[0] https://git.opendaylight.org/gerrit/58952 |
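For readers unfamiliar with the RPC named above, here is a minimal sketch of how a test helper might build the request that asks a member to become shard leader. It only constructs the URL and JSON body and sends nothing. The helper name, the host, port 8181, the RESTCONF operations path and the input leaf names (`shard-name`, `data-store-type`) are assumptions for illustration and should be checked against the cluster-admin model deployed on the controller.

```python
import json


def build_make_leader_local_request(host, shard_name, datastore_type="config", port=8181):
    """Return (url, body) for a make-leader-local RPC call.

    Hypothetical helper: the URL layout and the input leaf names are
    assumptions, not taken from the issue text.
    """
    url = "http://{}:{}/restconf/operations/cluster-admin:make-leader-local".format(host, port)
    payload = {
        "input": {
            "shard-name": shard_name,            # assumed leaf name
            "data-store-type": datastore_type,   # assumed leaf name
        }
    }
    # sort_keys keeps the serialized body stable, which helps when comparing logs.
    return url, json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    url, body = build_make_leader_local_request("127.0.0.1", "default")
    print(url)
    print(body)
```

A suite would POST this body to the member that should become leader, then wait for the shard role to change before continuing with the listener checks.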
| Comments |
| Comment by Robert Varga [ 12/Jul/17 ] |
|
Reopening for investigation. |
| Comment by Vratko Polak [ 12/Jul/17 ] |
|
A patch set which fixes other bugs is reintroducing this bug. See Sandbox [3].
[2] https://git.opendaylight.org/gerrit/#/c/60137/4 |
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
There is work in progress to improve the listener unsubscribe and verify logic. The current patch set [4] fails to respond within 125 seconds [5], but when the timeout is increased, the request results in "Connection aborted" [6]. The huge karaf.log is here [7].
[4] https://git.opendaylight.org/gerrit/#/c/60270/10 |
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
> "Connection aborted"
Oh, I forgot that trace verbosity leads to memory exhaustion. |
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
> Current patch set [4] fails to respond within ... 600 seconds [8].
> The huge karaf.log is ... here [9].
[8] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/4/log.html.gz#s1-s2-t1-k2-k14-k2-k1-k4-k6-k1 |
| Comment by Robert Varga [ 20/Jul/17 ] |
|
I think the problem comes from org.opendaylight.mdsal.dom.broker.ShardedDOMDataTree and how ShardedDOMDataTreeListenerContext processes notifications. |
| Comment by Robert Varga [ 20/Jul/17 ] |
|
That is to say, the receive path is too slow to keep up with incoming notifications, creating a long tail. The implementation needs to be revisited and optimized. |
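The "long tail" effect described above can be illustrated with a minimal fluid-queue sketch: if notifications arrive faster than the receive path handles them, a backlog builds up and keeps draining after the writers stop, so the listener still sees old data when the test compares results. The function name and all rates below are made-up illustration values, not measurements from the controller.

```python
def simulate_backlog(produce_rate, consume_rate, produce_seconds):
    """Return (peak_backlog, drain_seconds) for a simple fluid queue model.

    produce_rate and consume_rate are notifications per second;
    produce_seconds is how long the producers keep writing.
    """
    if produce_rate <= 0 or consume_rate <= 0 or produce_seconds < 0:
        raise ValueError("rates must be positive and duration non-negative")
    # The backlog only grows when the producer outpaces the consumer.
    growth_per_second = max(produce_rate - consume_rate, 0)
    peak_backlog = growth_per_second * produce_seconds
    # After production stops, the consumer drains the backlog at its own rate.
    drain_seconds = peak_backlog / consume_rate
    return peak_backlog, drain_seconds


if __name__ == "__main__":
    peak, tail = simulate_backlog(produce_rate=1000, consume_rate=800, produce_seconds=60)
    print("peak backlog:", peak, "notifications; tail:", tail, "seconds")
```

With these illustrative numbers, a consumer that is 20% slower than the producers needs another 15 seconds after a 60-second write phase before the listener has seen the final state.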
| Comment by Vratko Polak [ 03/Aug/17 ] |
|
A Sandbox run with this code [10] produced a huge log [11] containing the following segment, which may point to why the test fails.
2017-08-03 09:01:28,837 | DEBUG | pool-31-thread-1 | AbstractTransactionHandler | 257 - org.opendaylight.control
[10] https://git.opendaylight.org/gerrit/#/c/60270/26 |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
vrpolak, is this still reproducible? |
| Comment by Vratko Polak [ 14/Nov/18 ] |
|
Yes, still reproducible. Not every run contains the Robot failure, but failures are frequent enough. |