[CONTROLLER-1686] Shards fail to settle after brief isolation Created: 17/May/17  Updated: 25/Jul/23  Resolved: 19/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak
Assignee: Unassigned
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1664 C: OutOfOrderRequestException: Expect... Resolved
External issue ID: 8492

Description

This differs from other isolation bugs in that no major timeout is hit before failures start to appear. It is similar to CONTROLLER-1684 in that OutOfOrderRequestException is visible from Robot.

The scenario uses module-based shards with the tell-based protocol, a transaction producer on each member, and a short isolation of the original leader (member-1).
The first suspicious messages in karaf.log [0] are the following, which repeat:
2017-05-17 13:28:51,032 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | member-1-shard-default-config (Follower): The prevLogIndex 33 was found in the log but the term -1 is not equal to the append entriesprevLogTerm 2 - lastIndex: 36, snapshotIndex: 34
2017-05-17 13:28:51,032 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | member-1-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=4, success=false, followerId=member-1-shard-default-config, logLastIndex=36, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
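
For readers less familiar with Raft, the two messages above correspond to the AppendEntries consistency check: a follower rejects the request unless its log holds an entry at prevLogIndex whose term equals prevLogTerm. The following is a minimal Java sketch of that check, using a hypothetical FollowerLogSketch class (it is not the controller's Shard or RaftActor code). It assumes, as one possible reading of the numbers in the log, that entries below snapshotIndex are no longer held in the journal, so the term lookup for index 33 yields -1. The snapshot term of 2 and the terms of entries 35 and 36 are assumptions made for illustration only.

```java
import java.util.TreeMap;

/** Hypothetical, simplified model of a Raft follower's log. Not the ODL implementation. */
final class FollowerLogSketch {
    /** Sentinel returned when the term for an index cannot be determined. */
    static final long UNKNOWN_TERM = -1;

    /** In-memory journal: log index -> term of the entry at that index. */
    private final TreeMap<Long, Long> termByIndex = new TreeMap<>();
    private final long snapshotIndex;
    private final long snapshotTerm;

    FollowerLogSketch(long snapshotIndex, long snapshotTerm) {
        this.snapshotIndex = snapshotIndex;
        this.snapshotTerm = snapshotTerm;
    }

    void append(long index, long term) {
        termByIndex.put(index, term);
    }

    long lastIndex() {
        return termByIndex.isEmpty() ? snapshotIndex : termByIndex.lastKey();
    }

    /** Term at the given index, or UNKNOWN_TERM if the entry is neither in the journal nor the snapshot boundary. */
    long termAt(long index) {
        Long term = termByIndex.get(index);
        if (term != null) {
            return term;
        }
        // Assumption: entries older than the snapshot have been trimmed and their term is unknown.
        return index == snapshotIndex ? snapshotTerm : UNKNOWN_TERM;
    }

    /** Raft AppendEntries consistency check: the entry at prevLogIndex must exist and carry prevLogTerm. */
    boolean isConsistentWith(long prevLogIndex, long prevLogTerm) {
        if (prevLogIndex > lastIndex()) {
            return false; // follower is missing entries
        }
        return termAt(prevLogIndex) == prevLogTerm;
    }

    public static void main(String[] args) {
        // Values taken from the log line: snapshotIndex 34, lastIndex 36, prevLogIndex 33, prevLogTerm 2.
        FollowerLogSketch log = new FollowerLogSketch(34, 2); // snapshot term 2 is an assumption
        log.append(35, 2); // terms of entries 35 and 36 are assumptions
        log.append(36, 2);
        System.out.println("termAt(33) = " + log.termAt(33));
        System.out.println("consistent = " + log.isConsistentWith(33, 2));
    }
}
```

With the values from the log (prevLogIndex 33, prevLogTerm 2, lastIndex 36, snapshotIndex 34), the sketch reports a term of -1 for index 33 and rejects the request, which matches the success=false reply. Whether this is the actual code path taken in the controller is not established by this report.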

Robot gathered failure responses [1] from each member. According to karaf.log, member-1 responded with a NullPointerException:
2017-05-17 13:29:42,277 | WARN | lt-dispatcher-25 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | Tx: DOM-CHAIN-1-2 Error during phase CAN_COMMIT, starting Abort
java.lang.NullPointerException
at org.opendaylight.controller.cluster.datastore.FrontendReadWriteTransaction.ensureReady(FrontendReadWriteTransaction.java:336)
at org.opendaylight.controller.cluster.datastore.FrontendReadWriteTransaction.handleModifyTransaction(FrontendReadWriteTransaction.java:319)
at org.opendaylight.controller.cluster.datastore.FrontendReadWriteTransaction.handleRequest(FrontendReadWriteTransaction.java:90)
at org.opendaylight.controller.cluster.datastore.AbstractFrontendHistory.handleTransactionRequest(AbstractFrontendHistory.java:154)
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:198)
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:461)
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:292)
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/693/archives/odl1_karaf.log.gz
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/693/archives/log.html.gz#s1-s28-t1-k2-k23-k1-k1

