[CONTROLLER-1724] Robot test for data change listener reports data inconsistency Created: 23/Jun/17  Updated: 25/Jul/23

Status: Confirmed
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Vratko Polak
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8733

 Description   

This is the robot symptom part of CONTROLLER-1711. As discussed there, one possible reason is the test listener not waiting for quiescence, proposed fix is here [0].

I am opening this because we have changed the test suite to not remove shard replicas, instead cluster-admin:make-leader-local is used to force leader movement, so the recent inconsistency [1] is not caused by CONTROLLER-1711.

[0] https://git.opendaylight.org/gerrit/58952
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/759/log.html.gz#s1-s36-t1-k2-k14-k2-k1-k4-k7-k1



 Comments   
Comment by Robert Varga [ 12/Jul/17 ]

Reopening for investigation.

Comment by Vratko Polak [ 12/Jul/17 ]

A patch set which fixes other bugs is reintroducing this bug. See Sandbox [3].

[2] https://git.opendaylight.org/gerrit/#/c/60137/4
[3] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/4/log.html.gz#s1-s4-t1-k2-k14-k2-k1-k4-k7-k1

Comment by Vratko Polak [ 17/Jul/17 ]

There is a work in progress to improve the listener unsubscribe and verify logic. Current patch set [4] fails to respond within 125 seconds [5], but when the timeout is increased, it results in "Connection aborted" [6]. The huge karaf.log is here [7].

[4] https://git.opendaylight.org/gerrit/#/c/60270/10
[5] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/1/log.html.gz#s1-s2-t1-k2-k14-k2-k1-k4-k6-k1
[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/2/log.html.gz#s1-s2-t1-k2-k14-k2-k1-k4-k6-k1
[7] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/2/odl2_karaf.log.gz

Comment by Vratko Polak [ 17/Jul/17 ]

> "Connection aborted"

Oh, I forgot trace verbosity leads to memory exhaustion.
Will repeat without trace.

Comment by Vratko Polak [ 17/Jul/17 ]

> Current patch set [4] fails to respond within

... 600 seconds [8].

> The huge karaf.log is

... here [9].

[8] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/4/log.html.gz#s1-s2-t1-k2-k14-k2-k1-k4-k6-k1
[9] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/4/odl2_karaf.log.gz

Comment by Robert Varga [ 20/Jul/17 ]

I think the problem comes from org.opendaylight.mdsal.dom.broker.ShardedDOMDataTree and how ShardedDOMDataTreeListenerContext processes notifications.

Comment by Robert Varga [ 20/Jul/17 ]

That is to say the receive path is too slow to keep up, creating a long tail. The implementation needs to be revisited and optimized.

Comment by Vratko Polak [ 03/Aug/17 ]

Sandbox with this [10] code, huge log [11] contains this segment, which may be pointing to a cause why the test fail.

2017-08-03 09:01:28,837 | DEBUG | pool-31-thread-1 | AbstractTransactionHandler | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | New future #39998 allocated
2017-08-03 09:01:28,838 | DEBUG | pool-31-thread-1 | AbstractTransactionHandler | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Reached maximum run time with 2 outstanding futures
2017-08-03 09:01:28,840 | DEBUG | pool-31-thread-1 | AbstractTransactionHandler | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Reached maximum run time with 2 outstanding futures
2017-08-03 09:01:28,840 | DEBUG | pool-31-thread-1 | AbstractTransactionHandler | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Reached maximum run time with 2 outstanding futures
2017-08-03 09:01:28,847 | DEBUG | on-dispatcher-59 | AbstractDataListener | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Received 1 data tree changed events
2017-08-03 09:01:28,855 | DEBUG | on-dispatcher-68 | AbstractDataListener | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Received 1 data tree changed events
2017-08-03 09:01:28,856 | DEBUG | on-dispatcher-59 | AbstractDataListener | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Received 1 data tree changed events
2017-08-03 09:01:28,890 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,902 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,908 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,911 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,914 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,916 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,919 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,922 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,924 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,927 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,930 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,933 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,935 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,937 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:28,940 | WARN | lt-dispatcher-41 | ShardedDataTreeActor | 199 - org.opendaylight.control
ler.sal-distributed-datastore - 1.5.2.SNAPSHOT | The notification contained a path on which no producer is registered,
throwing away
2017-08-03 09:01:29,849 | DEBUG | on-dispatcher-62 | AbstractDataListener | 257 - org.opendaylight.control
ler.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Received 1 data tree changed events

[10] https://git.opendaylight.org/gerrit/#/c/60270/26
[11] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/1/odl2_karaf.log.gz

Comment by Robert Varga [ 14/Nov/18 ]

vrpolak is this still reproducible?

Comment by Vratko Polak [ 14/Nov/18 ]

Yes, still reproducible. Not every run contains the Robot failure, but they are frequent enough.
Recent case of odl-mdsal-lowlevel-control:unsubscribe-ddtl response with copy-matches being false is here: [12].

[12] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-tell-all-neon/67/robot-plugin/log.html.gz#s1-s24-t3-k2-k16

Generated at Wed Feb 07 19:56:17 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.