[CONTROLLER-1615] prevLogIndex X was found in the log but the term -1 is not equal to the append Created: 20/Mar/17  Updated: 25/Jul/23  Resolved: 23/Mar/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Guy Sela Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive odl1.zip     Zip Archive odl2.zip     Zip Archive odl3.zip    
External issue ID: 8022

 Description   

Running with Carbon code of 20/03/2017.
Installing a setup with 3-node cluster as part of an openstack solution, and connecting 200 computes(OVSs) to the ODLs.
Many switches failed to connect and log filled with these errors.



 Comments   
Comment by Guy Sela [ 20/Mar/17 ]

Attachment odl1.zip has been added with description: Logs of ODL1

Comment by Guy Sela [ 20/Mar/17 ]

Attachment odl2.zip has been added with description: Logs of ODL2

Comment by Guy Sela [ 20/Mar/17 ]

Attachment odl3.zip has been added with description: Logs of ODL3

Comment by Guy Sela [ 20/Mar/17 ]

http://c6-bl-2:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-topology-operational,type=DistributedOperationalDatastore:

{"request":

{"mbean":"org.opendaylight.controller:Category=Shards,name=member-1-shard-topology-operational,type=DistributedOperationalDatastore","type":"read"}

,"value":

{"ReadWriteTransactionCount":0,"SnapshotIndex":59,"InMemoryJournalLogSize":1,"ReplicatedToAllIndex":59,"Leader":"member-3-shard-topology-operational","LastIndex":60,"RaftState":"Follower","LastApplied":60,"LastCommittedTransactionTime":"1970-01-01 02:00:00.000","LastLeadershipChangeTime":"2017-03-20 15:33:17.204","LastLogIndex":60,"PeerAddresses":"member-2-shard-topology-operational: , member-3-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.153:2550\/user\/shardmanager-operational\/member-3-shard-topology-operational","FollowerInitialSyncStatus":true,"WriteOnlyTransactionCount":0,"FollowerInfo":[],"FailedReadTransactionsCount":0,"Voting":true,"StatRetrievalTime":"351.2 μs","CurrentTerm":1,"LastTerm":1,"FailedTransactionsCount":0,"PendingTxCommitQueueSize":0,"VotedFor":"member-3-shard-topology-operational","SnapshotCaptureInitiated":false,"CommittedTransactionsCount":0,"TxCohortCacheSize":0,"PeerVotingStates":"member-2-shard-topology-operational: true, member-3-shard-topology-operational: true","LastLogTerm":1,"StatRetrievalError":null,"CommitIndex":60,"SnapshotTerm":1,"AbortTransactionsCount":0,"ReadOnlyTransactionCount":0,"ShardName":"member-1-shard-topology-operational","LeadershipChangeCount":1,"InMemoryJournalDataSize":41}

,"timestamp":1490018472,"status":200}

http://c6-bl-3:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-2-shard-topology-operational,type=DistributedOperationalDatastore:
{"request":

{"mbean":"org.opendaylight.controller:Category=Shards,name=member-2-shard-topology-operational,type=DistributedOperationalDatastore","type":"read"}

,"value":

{"ReadWriteTransactionCount":0,"SnapshotIndex":59,"InMemoryJournalLogSize":1,"ReplicatedToAllIndex":59,"Leader":"member-3-shard-topology-operational","LastIndex":60,"RaftState":"Follower","LastApplied":60,"LastCommittedTransactionTime":"1970-01-01 02:00:00.000","LastLogIndex":60,"LastLeadershipChangeTime":"2017-03-20 15:33:17.203","PeerAddresses":"member-1-shard-topology-operational: , member-3-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.153:2550\/user\/shardmanager-operational\/member-3-shard-topology-operational","WriteOnlyTransactionCount":0,"FollowerInitialSyncStatus":true,"FollowerInfo":[],"FailedReadTransactionsCount":0,"Voting":true,"StatRetrievalTime":"2.887 ms","CurrentTerm":1,"LastTerm":1,"FailedTransactionsCount":0,"PendingTxCommitQueueSize":0,"VotedFor":"member-3-shard-topology-operational","SnapshotCaptureInitiated":false,"CommittedTransactionsCount":0,"TxCohortCacheSize":0,"PeerVotingStates":"member-1-shard-topology-operational: true, member-3-shard-topology-operational: true","LastLogTerm":1,"StatRetrievalError":null,"CommitIndex":60,"SnapshotTerm":1,"AbortTransactionsCount":0,"ReadOnlyTransactionCount":0,"ShardName":"member-2-shard-topology-operational","LeadershipChangeCount":1,"InMemoryJournalDataSize":41}

,"timestamp":1490018505,"status":200}

http://c6-bl-4:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-3-shard-topology-operational,type=DistributedOperationalDatastore:

{"request":

{"mbean":"org.opendaylight.controller:Category=Shards,name=member-3-shard-topology-operational,type=DistributedOperationalDatastore","type":"read"}

,"value":{"ReadWriteTransactionCount":2,"SnapshotIndex":59,"InMemoryJournalLogSize":1347,"ReplicatedToAllIndex":59,"Leader":"member-3-shard-topology-operational","LastIndex":1397,"RaftState":"Leader","LastApplied":60,"LastCommittedTransactionTime":"2017-03-20 15:33:27.493","LastLogIndex":1397,"LastLeadershipChangeTime":"2017-03-20 15:33:17.163","PeerAddresses":"member-1-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.142:2550\/user\/shardmanager-operational\/member-1-shard-topology-operational, member-2-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.136:2550\/user\/shardmanager-operational\/member-2-shard-topology-operational","WriteOnlyTransactionCount":0,"FollowerInitialSyncStatus":false,"FollowerInfo":[

{"timeSinceLastActivity":"00:00:00.030","active":true,"matchIndex":60,"voting":true,"id":"member-1-shard-topology-operational","nextIndex":61}

,

{"timeSinceLastActivity":"00:00:00.036","active":true,"matchIndex":60,"voting":true,"id":"member-2-shard-topology-operational","nextIndex":61}

],"FailedReadTransactionsCount":0,"Voting":true,"StatRetrievalTime":"9.821 ms","CurrentTerm":1,"LastTerm":1,"FailedTransactionsCount":0,"PendingTxCommitQueueSize":1,"VotedFor":"member-3-shard-topology-operational","SnapshotCaptureInitiated":false,"CommittedTransactionsCount":3,"TxCohortCacheSize":1,"PeerVotingStates":"member-1-shard-topology-operational: true, member-2-shard-topology-operational: true","LastLogTerm":1,"StatRetrievalError":null,"CommitIndex":60,"SnapshotTerm":1,"AbortTransactionsCount":0,"ReadOnlyTransactionCount":2219,"ShardName":"member-3-shard-topology-operational","LeadershipChangeCount":1,"InMemoryJournalDataSize":56425},"timestamp":1490018533,"status":200}

Comment by Guy Sela [ 20/Mar/17 ]

There is another error in the logs, that is also found in the Jenkins runs, and might be related:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-learn-carbon/170/archives/odl3_karaf.log.gz

/10.29.12.25:2550]
2017-03-19 06:26:24,579 | WARN | lt-dispatcher-20 | ConcurrentDOMDataBroker | 237 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Tx: DOM-CHAIN-4-0 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.29.13.118:2550/), Path(/user/shardmanager-operational/member-1-shard-default-operational)]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.BatchedModifications".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[204:com.typesafe.akka.actor:2.4.17]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[204:com.typesafe.akka.actor:2.4.17]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[200:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[200:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[200:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[204:com.typesafe.akka.actor:2.4.17]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[204:com.typesafe.akka.actor:2.4.17]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[204:com.typesafe.akka.actor:2.4.17]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[204:com.typesafe.akka.actor:2.4.17]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_121]
2017-03-19 06:26:24,588 | WARN | CommitFutures-0 | RestConnectorProvider | 385 - org.opendaylight.netconf.sal-rest-connector - 1.5.0.SNAPSHOT | TransactionChain(org.opendaylight.controller.cluster.databroker.DOMBrokerTransactionChain@2aac104c) DOM-CHAIN-4-0 FAILED!
TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.29.13.118:2550/), Path(/user/shardmanager-operational/member-1-shard-default-operational)]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.BatchedModifications".]]}

at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)[190:org.opendaylight.controller.sal-broker-impl:1.5.0.SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)[190:org.opendaylight.controller.sal-broker-impl:1.5.0.SNAPSHOT]
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:102)[88:org.opendaylight.yangtools.util:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:222)[237:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:46)[237:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:126)[237:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)[84:com.google.guava:18.0.0]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[84:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[84:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[84:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)[84:com.google.guava:18.0.0]
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)[84:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.datastore.SingleCommitCohortProxy$1.onComplete(SingleCommitCohortProxy.java:58)[237:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:258)[204:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.OnComplete.internal(Future.scala:256)[204:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)[204:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)[204:com.typesafe.akka.actor:2.4.17]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)[200:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)[204:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)[204:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[204:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[204:com.typesafe.akka.actor:2.4.17]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)[200:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]

Comment by Tom Pantelis [ 20/Mar/17 ]

More info from email:

2017-03-20 13:48:36,101 | INFO | lt-dispatcher-29 | Shard | 207 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-topology-operational (Follower): The prevLogIndex 3 was found in the log but the term -1 is not equal to the append entriesprevLogTerm 1 - lastIndex: 5, snapshotIndex: 4
2017-03-20 13:48:36,101 | INFO | lt-dispatcher-29 | Shard | 207 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-topology-operational (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=1, success=false, followerId=member-3-shard-topology-operational, logLastIndex=5, logLastTerm=1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]

At that time member-3 was a follower but from the jolokia output member-3 is now the leader so something changed and we don't have the shard state from when the above messages were emitted.

It would also help if you could provide the specific steps that led up the above messages at 2017-03-20 13:48:36. Specifically were nodes restarted after they were originally started and/or was any isolation testing done.

Comment by Guy Sela [ 20/Mar/17 ]

Nothing was restarted - not the ODLs and not the OVSs.

Comment by Vratko Polak [ 21/Mar/17 ]

In some tests, this symptom stops appearing with this [0] patch, but other symptoms appear (in general there are less csit failures).
Can anyone verify when there is distribution build ready?

[0] https://git.opendaylight.org/gerrit/49263

Comment by Guy Sela [ 22/Mar/17 ]

BFD was on, and probably caused a lot of "noise" on the system.

Comment by Vratko Polak [ 23/Mar/17 ]

Marking as fixed, as I have not seen this in recent tests.

Generated at Wed Feb 07 19:55:59 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.