[CONTROLLER-1632] IOException in initiateCaptureSnapshot Created: 11/Apr/17  Updated: 25/Jul/23  Resolved: 21/Apr/17

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: karaf_log_1.tar.gz, karaf_log_2.tar.gz, karaf_log_3.tar.gz, log.html
External issue ID: 8206

 Description   

odl: distribution-karaf-0.6.0-20170411.050836-4837.zip

The test scenario follows the steps of DOMDataBroker testing: Leader Isolation: partition heals within transaction timeout
1) Start a test-transaction producer running at 1K tps, non-overlapping, from all nodes to a single shard
2) Isolate the leader (see the iptables sketch after these steps)
3) Wait for the followers to initiate an election
4) Un-isolate the leader
5) Wait for the partition to heal - this was done within the transaction timeout (30s)
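
The isolation in steps 2 and 4 is done by blocking traffic with iptables (the description below mentions the node being unblocked via iptables). A minimal Java sketch of that idea, assuming the rules are added on the leader node and that the other members' IP addresses are known; the member IPs and the ProcessBuilder-based invocation are illustrative assumptions, the real suite drives this through Robot Framework keywords over SSH:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: isolate/heal a cluster member by dropping its traffic with iptables.
// The IP addresses and the rule shape are assumptions for illustration only.
public class LeaderIsolation {

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("command failed: " + Arrays.toString(cmd));
        }
    }

    // Step 2: drop all packets to and from the other members, run on the leader node.
    static void isolate(List<String> otherMemberIps) throws Exception {
        for (String ip : otherMemberIps) {
            run("iptables", "-A", "INPUT", "-s", ip, "-j", "DROP");
            run("iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP");
        }
    }

    // Step 4: remove the DROP rules again, i.e. "un-isolate the leader".
    static void heal(List<String> otherMemberIps) throws Exception {
        for (String ip : otherMemberIps) {
            run("iptables", "-D", "INPUT", "-s", ip, "-j", "DROP");
            run("iptables", "-D", "OUTPUT", "-d", ip, "-j", "DROP");
        }
    }
}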

After the followers elected a new leader, the isolated node was rejoined (unblocked via iptables).
The test then waited for all 3 nodes to be reported correctly. A few seconds after the rejoin the node still reported the IsolatedLeader raft state, and then it
started to respond with HTTP status 401 (Unauthorized).
This took



 Comments   
Comment by Peter Gubka [ 11/Apr/17 ]

Attachment log.html has been added with description: robot log

Comment by Peter Gubka [ 11/Apr/17 ]

Attachment karaf_log_1.tar.gz has been added with description: karaf log

Comment by Peter Gubka [ 11/Apr/17 ]

Attachment karaf_log_2.tar.gz has been added with description: karaf log 2

Comment by Peter Gubka [ 11/Apr/17 ]

Attachment karaf_log_3.tar.gz has been added with description: karaf log 3

Comment by Tom Pantelis [ 11/Apr/17 ]

Which shard are you looking at? In looking at the default-config shard, I see that node1 was the leader and was isolated:

2017-04-11 08:47:33,069 | INFO | ult-dispatcher-4 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (Leader) :- Switching from behavior Leader to IsolatedLeader, election term: 2

Meanwhile node3 became leader on the other side of the partition with term 3:

2017-04-11 08:47:28,644 | INFO | lt-dispatcher-42 | ShardManager | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-default-config from PreLeader to Leader

node1 switched to follower at:

2017-04-11 08:47:38,854 | INFO | lt-dispatcher-29 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (IsolatedLeader): Term 3 in "AppendEntriesReply [term=3, success=false, followerId=member-2-shard-default-config, logLastIndex=20, logLastTerm=3, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]" message is greater than leader's term 2 - switching to Follower

2017-04-11 08:47:38,855 | INFO | lt-dispatcher-29 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (IsolatedLeader) :- Switching from behavior IsolatedLeader to Follower, election term: 3
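
The two log lines above follow the standard raft rule that a stale leader steps down as soon as it sees a higher term in a reply. A simplified sketch of that check, which assumes nothing about the actual AbstractLeader implementation:

// Simplified illustration of the raft step-down rule seen in the log above;
// this is not the actual AbstractLeader code.
final class StepDownExample {
    long currentTerm = 2;              // the isolated leader's term
    String behavior = "IsolatedLeader";

    void handleAppendEntriesReply(long replyTerm) {
        if (replyTerm > currentTerm) {
            // A reply carrying term 3 beats our term 2: adopt the higher term and
            // switch behavior to Follower, as member-1 does at 08:47:38.
            currentTerm = replyTerm;
            behavior = "Follower";
        }
    }
}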

New leader node3 then installed a snapshot on node1 to catch it up:

2017-04-11 08:47:40,693 | INFO | lt-dispatcher-34 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-default-config (Leader): handleAppendEntriesReply: follower member-1-shard-default-config lastIndex 24 is ahead of our lastIndex 20 (snapshotIndex 17) - forcing install snaphot

2017-04-11 08:47:40,758 | INFO | lt-dispatcher-83 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config: Applying snapshot on follower: Snapshot [lastIndex=20, lastTerm=3, lastAppliedIndex=20, lastAppliedTerm=3, unAppliedEntries size=0, state=org.opendaylight.controller.cluster.datastore.persisted.ShardSnapshotState@51af049b, electionTerm=3, electionVotedFor=null, ServerConfigPayload=null]
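
The forced install in the first of those two log lines happens because the rejoined follower's lastIndex (24, written under the old term) is ahead of the new leader's lastIndex (20), so the follower's extra entries cannot be matched against the leader's log and its state is replaced wholesale. Roughly, as a sketch rather than the actual AbstractLeader code:

// Rough sketch of the decision behind the "forcing install snapshot" log line above.
// The index values in the comment come from the log; the method itself is illustrative only.
final class InstallSnapshotDecision {
    static boolean shouldForceInstallSnapshot(long followerLastIndex, long leaderLastIndex) {
        // member-1 reports lastIndex 24 (entries replicated under the old term 2) while
        // the new leader only has lastIndex 20, so the follower is caught up via snapshot.
        return followerLastIndex > leaderLastIndex;
    }
}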

I don't see a problem here.

Comment by Peter Gubka [ 11/Apr/17 ]

To fix my problem description:
after the rejoin the test again iterates over all nodes and verifies the cluster status.
For some time after the rejoin it reported the state of the node as IsolatedLeader. After that a Follower status should have been reported. Instead, the node started to respond with 401 (probably to anything). In the attached test case it responded with 401 to a GET for /restconf/modules.

I changed the bug description too.

Comment by Tom Pantelis [ 11/Apr/17 ]

(In reply to Peter Gubka from comment #5)
> To fix my problem description:
> after the rejoin the test again iterates over all nodes and verifies the
> cluster status.
> For some time after the rejoin it reported the state of the node as
> IsolatedLeader. After that a Follower status should have been reported.
> Instead, the node started to respond with 401 (probably to anything). In the
> attached test case it responded with 401 to a GET for /restconf/modules.
>
> I changed the bug description too.

As I mentioned, from the logs the default-config shard did transition from IsolatedLeader to Follower as expected. A 401 response means "Unauthorized", i.e. the credentials were rejected, so this would not be a clustering issue. You need to look at the robot logs to see what it's sending and to which node.
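
For anyone re-running this by hand, the failing request can be reproduced outside robot with a plain authenticated GET. A minimal sketch; the node address, port 8181 and the admin/admin credentials are assumptions about the test bed, not something taken from the logs:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Minimal sketch for reproducing the failing request outside robot.
// Adjust the address and credentials to the actual test bed.
public class RestconfProbe {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://127.0.0.1:8181/restconf/modules");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        // A 401 here means AAA rejected the credentials before the request ever
        // reached the datastore, which is why it is tracked separately from the raft issue.
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}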

Later on in the log for node3, I see this error repeated a few times:

2017-04-11 08:47:56,268 | INFO | lt-dispatcher-33 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-default-config (Leader): handleAppendEntriesReply: follower member-1-shard-default-config lastIndex 24 is ahead of our lastIndex 21 (snapshotIndex 17) - forcing install snaphot
2017-04-11 08:47:56,419 | WARN | lt-dispatcher-80 | ShardManager | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Supervisor Strategy caught unexpected exception - resuming
java.lang.RuntimeException: java.io.IOException: The # of bytes read from the imput stream, -1, does not match the expected # 3075
at com.google.common.base.Throwables.propagate(Throwables.java:160)[84:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.raft.behaviors.AbstractLeader.sendSnapshotChunk(AbstractLeader.java:831)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.AbstractLeader.initiateCaptureSnapshot(AbstractLeader.java:745)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.AbstractLeader.handleAppendEntriesReply(AbstractLeader.java:238)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.Leader.handleAppendEntriesReply(Leader.java:101)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.AbstractRaftActorBehavior.handleMessage(AbstractRaftActorBehavior.java:411)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.AbstractLeader.handleMessage(AbstractLeader.java:469)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.Leader.handleMessage(Leader.java:80)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.possiblyHandleBehaviorMessage(RaftActor.java:322)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:271)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
...
Caused by: java.io.IOException: The # of bytes read from the imput stream, -1, does not match the expected # 3075
at org.opendaylight.controller.cluster.raft.behaviors.LeaderInstallSnapshotState.getNextChunk(LeaderInstallSnapshotState.java:129)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.behaviors.AbstractLeader.sendSnapshotChunk(AbstractLeader.java:804)[227:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]

I think what happened is that, while node1 was isolated, node3 kept sending AppendEntries at the heartbeat interval, so multiple messages were queued up and eventually delivered to node1. That resulted in multiple replies arriving in quick succession in which the follower's last index is higher, each of which tries to force a snapshot install. The first reply sends the snapshot chunk and drains the stream, which causes the error when the second reply tries to send the same chunk.
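
That failure mode is easy to reproduce in isolation: once the first chunk read has drained the snapshot input stream, a second read for the same chunk returns -1, which is exactly the mismatch the exception reports. A standalone sketch, not the LeaderInstallSnapshotState code itself:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Standalone illustration of the drained-stream failure described above; this only
// mirrors the read-versus-expected check, not the real chunking code.
public class DrainedStreamExample {
    static byte[] readChunk(InputStream in, int expectedSize) throws IOException {
        byte[] chunk = new byte[expectedSize];
        int read = in.read(chunk);
        if (read != expectedSize) {
            // The second attempt to send the same chunk lands here: the stream is
            // already at EOF, so read() returns -1 instead of the expected size.
            throw new IOException("The # of bytes read from the input stream, " + read
                    + ", does not match the expected # " + expectedSize);
        }
        return chunk;
    }

    public static void main(String[] args) throws IOException {
        InputStream snapshot = new ByteArrayInputStream(new byte[3075]);
        readChunk(snapshot, 3075);   // first reply: succeeds and drains the stream
        readChunk(snapshot, 3075);   // second reply in quick succession: throws IOException
    }
}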

This error causes the shard actor to restart, but I think it is orthogonal and wouldn't result in a 401 on the HTTP front end, unless AAA was trying to read the credentials from the datastore, and I don't think that info is stored there.

Comment by Tom Pantelis [ 11/Apr/17 ]

I also noticed multiple messages like:

2017-04-11 08:48:07,057 | INFO | lt-dispatcher-16 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-default-config (Leader): follower member-1-shard-default-config last log term 2 conflicts with the leader's 3 - dec next index to -2

I think this is benign. I pushed https://git.opendaylight.org/gerrit/#/c/54780 to prevent decrementing next index below -1.
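
A minimal sketch of that kind of clamp, simplified and not the actual follower-tracking code touched by the change:

// Simplified sketch of clamping the follower's nextIndex so repeated conflicts
// cannot drive it below -1; not the actual implementation.
final class FollowerIndexTracking {
    private long nextIndex;

    FollowerIndexTracking(long initialNextIndex) {
        this.nextIndex = initialNextIndex;
    }

    // Returns true if the index was actually decremented.
    boolean decrNextIndex() {
        if (nextIndex < 0) {
            // Already at the floor of -1: stop decrementing, otherwise the
            // "dec next index to -2" messages in the log keep going further negative.
            return false;
        }
        nextIndex--;
        return true;
    }

    long getNextIndex() {
        return nextIndex;
    }
}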

Comment by Tom Pantelis [ 12/Apr/17 ]

It seems the 401 issue is more general and occurs with other tests as well. NETCONF-398 was opened for that. I'll use this bug to address the IOException during install snapshot.

Comment by Tom Pantelis [ 12/Apr/17 ]

Submitted https://git.opendaylight.org/gerrit/#/c/54898/
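
The change itself is not reproduced here, but the general shape of such a fix is to avoid re-initiating a chunk send for a follower that already has a snapshot install in flight. A purely hypothetical sketch of that idea, not a copy of the gerrit change:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the kind of guard discussed above: do not start another
// snapshot install for a follower that already has one in flight.
final class InstallSnapshotGuard {
    private final Set<String> installInProgress = new HashSet<>();

    boolean tryStartInstall(String followerId) {
        // A second AppendEntriesReply arriving in quick succession returns false here
        // instead of draining the shared snapshot stream a second time.
        return installInProgress.add(followerId);
    }

    void installFinished(String followerId) {
        installInProgress.remove(followerId);
    }
}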
