[CONTROLLER-1755] RaftActor lastApplied index moves backwards Created: 22/Aug/17  Updated: 06/Sep/17  Resolved: 06/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Robert Varga Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File csit.txt    
Issue Links:
Blocks
blocks CONTROLLER-1756 OOM due to huge Map in ShardDataTree Resolved
blocks CONTROLLER-1762 ODL is up and ports are listening but... Resolved
blocks CONTROLLER-1763 On restarting ODL on one node, ODL on... Resolved
blocks MDSAL-275 VerifyException from flapping service... Resolved
Duplicate
is duplicated by CONTROLLER-1761 Uncaught error from thread ... shutti... Resolved
External issue ID: 9028

 Description   

The fix for CONTROLLER-1746 exposes a problem with RaftActor: the sequencing of applied state is violated and the same journal entry is applied more than once, leading to a violation of transaction ordering in the DataTree.

This is readily reproduced in https://jenkins.opendaylight.org/releng/view/netvirt/job/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/. We have also seen it with the Cluster Singleton Service: at high transition rates, EOS blows up on the same problem.
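
For illustration, here is a minimal sketch of the invariant being violated (hypothetical class and method names, not the actual RaftActorContext code): lastApplied must only move forward, and each journal index must be applied exactly once, in order. The log excerpt in the comments below shows both violations.

import java.util.function.LongConsumer;

// Hypothetical illustration only; not the actual RaftActorContext API.
final class AppliedStateTracker {
    private long lastApplied = -1;

    void applyToIndex(final long newIndex, final LongConsumer applyEntry) {
        if (newIndex <= lastApplied) {
            // The condition seen in the logs, e.g. "Moving last applied index from 1940 to 1939"
            throw new IllegalStateException(
                "lastApplied moving backwards or entry re-applied: " + lastApplied + " to " + newIndex);
        }
        for (long index = lastApplied + 1; index <= newIndex; index++) {
            applyEntry.accept(index);   // apply each entry exactly once, in order
        }
        lastApplied = newIndex;
    }
}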



 Comments   
Comment by Robert Varga [ 22/Aug/17 ]

Attachment csit.txt has been added with description: CSIT failure log

Comment by Robert Varga [ 22/Aug/17 ]

Prototype fix: https://git.opendaylight.org/gerrit/#/c/62107/

Comment by Robert Varga [ 22/Aug/17 ]

Blocks MDSAL-275 as the scale test for it is triggering this bug, as seen in https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-chasing-leader-longevity-only-carbon/7/odl2_karaf.log.gz.

Comment by Robert Varga [ 22/Aug/17 ]

The odl3_karaf.log from https://jenkins.opendaylight.org/releng/view/netvirt/job/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/7 shows the same failure at around 2017-08-21 22:50:10,355, along with the lastApplied time travel:

2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Replicate message: identifier: member-3-datastore-operational-fe-1-txn-1133-0, logIndex: 1943, payload: class org.opendaylight.controller.cluster.datastore.persisted.PurgeTransactionPayload, isSendImmediate: false
2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Applying to log - commitIndex: 1940, lastAppliedIndex: 1939
2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Setting last applied to 1940
2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Moving last applied index from 1939 to 1940
2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Applying state for log index 1940 data org.opendaylight.controller.cluster.datastore.persisted.PurgeTransactionPayload@74db3f62
2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | ShardDataTree | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | member-3-shard-default-operational: replication of member-3-datastore-operational-fe-1-txn-1131-0 has no callback
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Setting last applied to 1940
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Moving last applied index from 1940 to 1940
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Applying state for log index 1940 data org.opendaylight.controller.cluster.datastore.persisted.PurgeTransactionPayload@74db3f62
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Setting last applied to 1939
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Moving last applied index from 1940 to 1939
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Applying state for log index 1939 data org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@1cb3e6f2
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | ShardDataTree | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Applying foreign transaction member-3-datastore-operational-fe-1-txn-1133-0
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Setting last applied to 1940
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Moving last applied index from 1939 to 1940
2017-08-21 22:50:10,362 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational: Applying state for log index 1940 data org.opendaylight.controller.cluster.datastore.persisted.PurgeTransactionPayload@74db3f62

Comment by Robert Varga [ 22/Aug/17 ]

Alternative fix: https://git.opendaylight.org/gerrit/62169

Comment by Robert Varga [ 22/Aug/17 ]

The alternative fix should go in anyway, because the API contract should follow what Akka Persistence does.

In the case of a sync call, AP juggles the actor mailbox, stashing any non-persistence-related messages, and invokes the callback once persistence completes.

In the case of an async call, AP continues normal operation, scheduling a message to the actor to invoke the callback.

NonPersistentDataProvider, by contrast, invokes the callback directly in the caller in both cases.

The patch teaches NonPersistentDataProvider to send an internal message within the actor (an actor capability expressed as an interface), which it does in the case of an async call.

Since the configuration data store does not seem to be exhibiting this (or we are not hammering it hard enough), aligning the behavior should make the bug go away.

If this is not the call site causing trouble, https://git.opendaylight.org/gerrit/62170 adds stack traces to lastApplied transitions, which should pinpoint the problem.
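
A minimal sketch of the deferral pattern described above, assuming a hypothetical actor capability interface (the names below are approximations, not necessarily the real controller classes): instead of invoking the async callback directly on the caller's stack, the provider hands it back to the actor as a self-message, so the callback runs in mailbox order, mirroring what Akka Persistence does for async persistence.

import java.util.function.Consumer;

// Sketch only; illustrates the pattern, not the actual API.
interface ExecuteInSelfActor {
    // Actor capability: run the given task later, from the actor's own mailbox.
    void executeInSelf(Runnable runnable);
}

final class NonPersistentProviderSketch {
    private final ExecuteInSelfActor actor;

    NonPersistentProviderSketch(final ExecuteInSelfActor actor) {
        this.actor = actor;
    }

    <T> void persistSync(final T entry, final Consumer<T> callback) {
        // Sync contract: nothing is actually persisted, so the callback
        // can be invoked immediately on the caller's thread.
        callback.accept(entry);
    }

    <T> void persistAsync(final T entry, final Consumer<T> callback) {
        // Async contract: defer the callback through the actor's mailbox so it
        // cannot run ahead of state the actor is still applying.
        actor.executeInSelf(() -> callback.accept(entry));
    }
}

The key point is that the async path no longer completes on the caller's stack, which, per the description above, is what allowed applied state to be processed out of order.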

Comment by A H [ 23/Aug/17 ]

A patch was submitted to revert the changes and fix this bug in Carbon SR2:

https://git.opendaylight.org/gerrit/#/c/62169/

To better assess the impact of this bug and fix, could someone from your team please help us identify the following:
Regression: Is this bug a regression of functionality/performance/feature compared to Carbon?
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Carbon SR2 without it?
Workaround: Is there a workaround such that we can write a release note instead?
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?

Comment by Sam Hague [ 25/Aug/17 ]

I think the NetVirt side is good with this change. Jobs [2] and [3] below look good.

Sometimes there are random failures where a node does not come back properly, such as in job [4]. We try to bring ODL1 back into the cluster, but it fails to come back within 5 minutes; the subsequent tests then fail. That ODL1 instance is hitting the issue below. Is there anything we can do to get past that? Is this bug-9006?

2017-08-25 02:02:38,430 | WARN | saction-32-34'}} | DeadlockMonitor | 126 - org.opendaylight.controller.config-manager - 0.6.2.SNAPSHOT | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 284864 ms

[2] https://jenkins.opendaylight.org/releng/user/shague/my-views/view/3node/job/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/25/

[3] https://jenkins.opendaylight.org/releng/user/shague/my-views/view/3node/job/netvirt-csit-3node-openstack-ocata-upstream-stateful-carbon/

[4] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/24/log.html.gz#s1-s1-t13-k2-k2-k8

Comment by Michael Vorburger [ 28/Aug/17 ]

Robert, Sam, should this BZ issue be closed now, or is there anything left here?

Comment by Robert Varga [ 28/Aug/17 ]

Regression: Is this bug a regression of functionality/performance/feature compared to Carbon?

Yes. This is a pre-existing issue that was exposed by NETVIRT-673, which is itself a regression.

Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Carbon SR2 without it?

It is, as the operational datastore can easily encounter errors which will resurface even after application-level recovery. Data corruption is also possible.

Workaround: Is there a workaround such that we can write a release note instead?

There is no feasible workaround.

Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?

Tested with netvirt CSIT.

Impact: Does this fix impact any dependent projects?

Aside from fixing issues, there is no impact as far as we know.
