[CONTROLLER-1755] RaftActor lastApplied index moves backwards Created: 22/Aug/17 Updated: 06/Sep/17 Resolved: 06/Sep/17
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Robert Varga | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Attachments: | csit.txt |
| Issue Links: |
| External issue ID: | 9028 |
| Description |
| The fix for … This is readily reproduced in https://jenkins.opendaylight.org/releng/view/netvirt/job/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/ and we saw this happening with Cluster Singleton Service: at high transition rates, EOS blows up on the same problem. |
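To make the failure mode concrete, here is a minimal sketch of the invariant being violated; the class and method names are hypothetical illustrations, not the actual RaftActorContext code. lastApplied may only move forward, and a backwards transition indicates a stale update being applied out of order.

```java
// Hypothetical sketch (not the actual OpenDaylight code): lastApplied tracks
// the highest log index applied to the state machine, so it must be monotonic.
final class RaftIndexTracker {
    private long lastApplied = -1;

    void setLastApplied(final long newIndex) {
        if (newIndex < lastApplied) {
            // This is the "time travel" seen in the logs: a stale callback
            // re-applying an old index after the actor has already moved on.
            throw new IllegalStateException("lastApplied moving backwards: "
                + lastApplied + " -> " + newIndex);
        }
        lastApplied = newIndex;
    }

    long getLastApplied() {
        return lastApplied;
    }
}
```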
| Comments |
| Comment by Robert Varga [ 22/Aug/17 ] |
| Attachment csit.txt has been added with description: CSIT failure log |
| Comment by Robert Varga [ 22/Aug/17 ] |
| Prototype fix: https://git.opendaylight.org/gerrit/#/c/62107/ |
| Comment by Robert Varga [ 22/Aug/17 ] |
| Blocks |
| Comment by Robert Varga [ 22/Aug/17 ] |
| https://jenkins.opendaylight.org/releng/view/netvirt/job/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/7's odl3_karaf.log shows the same failure at around 2017-08-21 22:50:10,355, along with the time travel:
2017-08-21 22:50:10,355 | DEBUG | rd-dispatcher-34 | Shard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-default-operational (Leader): Replicate message: identifier: member-3-datastore-operational-fe-1-txn-1133-0, logIndex: 1943, payload: class org.opendaylight.controller.cluster.datastore.persisted.PurgeTransactionPayload, isSendImmediate: false |
| Comment by Robert Varga [ 22/Aug/17 ] |
| Alternative fix: https://git.opendaylight.org/gerrit/62169 |
| Comment by Robert Varga [ 22/Aug/17 ] |
| The alternative fix should go in anyway, because the API contract should follow what Akka Persistence does. In the case of a sync call, AP juggles the actor mailbox, stashing any non-persistence-related messages, and once it is done persisting it invokes the callback. In the case of an async call, AP continues normal operation, scheduling a message in the actor to invoke the callback. NonPersistentDataProvider invokes the callback directly in the caller in both cases. The patch teaches NonPersistentDataProvider to send an internal message within the actor (an actor capability expressed as an interface), which it does in the case of an async call. Since the configuration data store does not seem to be exhibiting this (or we are not hammering it hard enough), aligning the behavior should make the bug go away. If this is not the call site causing trouble, https://git.opendaylight.org/gerrit/62170 adds stack traces to lastApplied transitions, which should pinpoint the problem. |
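A minimal sketch of the behavioral alignment described above, assuming hypothetical ExecuteInSelfActor and Procedure interfaces modeled on this description (the real patch's API may differ): the sync path invokes the callback inline, while the async path defers it through a self-message so callback ordering matches Akka Persistence.

```java
// Assumed actor capability: schedule a Runnable as a self-message, to be run
// later from within the actor's own message-processing context.
interface ExecuteInSelfActor {
    void executeInSelf(Runnable runnable);
}

// Assumed callback shape, mirroring a persist-completion callback.
interface Procedure<T> {
    void apply(T param) throws Exception;
}

final class NonPersistentProviderSketch {
    private final ExecuteInSelfActor actor;

    NonPersistentProviderSketch(final ExecuteInSelfActor actor) {
        this.actor = actor;
    }

    // Sync persist: invoke the callback immediately, matching Akka
    // Persistence, which blocks out other messages until persist completes.
    <T> void persist(final T entry, final Procedure<T> callback) {
        invoke(entry, callback);
    }

    // Async persist: defer the callback via a self-message instead of
    // invoking it directly in the caller, matching Akka Persistence ordering.
    <T> void persistAsync(final T entry, final Procedure<T> callback) {
        actor.executeInSelf(() -> invoke(entry, callback));
    }

    private static <T> void invoke(final T entry, final Procedure<T> callback) {
        try {
            callback.apply(entry);
        } catch (Exception e) {
            throw new IllegalStateException("Callback failed for " + entry, e);
        }
    }
}
```

Deferring via executeInSelf keeps the callback inside the actor's normal message-processing loop, so it can no longer run at an earlier logical point in time than messages the actor has already processed.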
| Comment by A H [ 23/Aug/17 ] |
| A patch was submitted to revert the changes and fix this bug in Carbon SR2: https://git.opendaylight.org/gerrit/#/c/62169/ To better assess the impact of this bug and fix, could someone from your team please help us identify the following: regression status, severity, availability of a workaround, testing coverage, and impact on dependent projects. |
| Comment by Sam Hague [ 25/Aug/17 ] |
| I think the NetVirt side is good with this change. Jobs [2] and [3] below look good. Sometimes there are random failures where a node does not come back properly, such as in job [4]. We try to bring ODL1 back into the cluster but it fails to come back within 5 minutes. Then we move to the next tests and they fail. That ODL1 is hitting the issue below. Is there anything we can do to get past that? Is this bug-9006?
2017-08-25 02:02:38,430 | WARN | saction-32-34'}} | DeadlockMonitor | 126 - org.opendaylight.controller.config-manager - 0.6.2.SNAPSHOT | ModuleIdentifier {factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 284864 ms |
| Comment by Michael Vorburger [ 28/Aug/17 ] |
| Robert, Sam, should this BZ issue be closed now, or is there anything left here? |
| Comment by Robert Varga [ 28/Aug/17 ] |
| Regression: Is this bug a regression of functionality/performance/feature compared to Carbon? Yes. This is a pre-existing issue, which was exposed by …
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Carbon SR2 without it? It is, as the operational datastore can easily encounter errors, which will resurface even after application-level recovery. Data corruption is possible, too.
Workaround: Is there a workaround such that we can write a release note instead? There is no feasible workaround.
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests? Tested with netvirt CSIT.
Impact: Does this fix impact any dependent projects? Aside from fixing issues, there is no impact as far as we know. |