[CONTROLLER-1632] IOException in initiateCaptureSnapshot Created: 11/Apr/17 Updated: 25/Jul/23 Resolved: 21/Apr/17 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Peter Gubka | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | log.html, karaf_log_1.tar.gz, karaf_log_2.tar.gz, karaf_log_3.tar.gz |
| External issue ID: | 8206 |
| Description |
|
odl: distribution-karaf-0.6.0-20170411.050836-4837.zip

The test scenario follows the steps of the DOMDataBroker testing suite "Leader Isolation: partition heals within transaction timeout". After the followers elected a new leader, the isolated node was rejoined (the node was unblocked via iptables). |
| Comments |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment log.html has been added with description: robot log |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment karaf_log_1.tar.gz has been added with description: karaf log |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment karaf_log_2.tar.gz has been added with description: karaf log 2 |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment karaf_log_3.tar.gz has been added with description: karaf log 3 |
| Comment by Tom Pantelis [ 11/Apr/17 ] |
|
Which shard are you looking at? In looking at the default-config shard, I see that node1 was the leader and was isolated:

2017-04-11 08:47:33,069 | INFO | ult-dispatcher-4 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (Leader) :- Switching from behavior Leader to IsolatedLeader, election term: 2

Meanwhile node3 became leader on the other side of the partition with term 3:

2017-04-11 08:47:28,644 | INFO | lt-dispatcher-42 | ShardManager | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-default-config from PreLeader to Leader

node1 switched to follower at:

2017-04-11 08:47:38,854 | INFO | lt-dispatcher-29 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (IsolatedLeader): Term 3 in "AppendEntriesReply [term=3, success=false, followerId=member-2-shard-default-config, logLastIndex=20, logLastTerm=3, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]" message is greater than leader's term 2 - switching to Follower

2017-04-11 08:47:38,855 | INFO | lt-dispatcher-29 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (IsolatedLeader) :- Switching from behavior IsolatedLeader to Follower, election term: 3

The new leader, node3, then installed a snapshot on node1 to catch it up:

2017-04-11 08:47:40,693 | INFO | lt-dispatcher-34 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-default-config (Leader): handleAppendEntriesReply: follower member-1-shard-default-config lastIndex 24 is ahead of our lastIndex 20 (snapshotIndex 17) - forcing install snaphot

2017-04-11 08:47:40,758 | INFO | lt-dispatcher-83 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config: Applying snapshot on follower: Snapshot [lastIndex=20, lastTerm=3, lastAppliedIndex=20, lastAppliedTerm=3, unAppliedEntries size=0, state=org.opendaylight.controller.cluster.datastore.persisted.ShardSnapshotState@51af049b, electionTerm=3, electionVotedFor=null, ServerConfigPayload=null]

I don't see a problem here. |
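For readers unfamiliar with the Raft behavior described above, here is a minimal, hypothetical Java sketch of the step-down rule that explains the IsolatedLeader-to-Follower transition. The class and method names are illustrative assumptions, not the actual org.opendaylight.controller code:

```java
// Hypothetical, simplified sketch of the Raft term check described above.
// Names (RaftTermCheck, shouldStepDown) are illustrative only.
final class RaftTermCheck {

    static boolean shouldStepDown(long currentTerm, long replyTerm) {
        // Raft rule: any message carrying a term greater than our own means
        // another leader has been elected, so we must become a follower.
        return replyTerm > currentTerm;
    }

    public static void main(String[] args) {
        long leaderTerm = 2;  // member-1 was (Isolated)Leader in term 2
        long replyTerm = 3;   // the AppendEntriesReply from member-2 carried term 3
        if (shouldStepDown(leaderTerm, replyTerm)) {
            System.out.println("Switching from IsolatedLeader to Follower, election term: " + replyTerm);
        }
    }
}
```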
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
To fix my problem description: I have changed the bug description too. |
| Comment by Tom Pantelis [ 11/Apr/17 ] |
|
(In reply to Peter Gubka from comment #5)

As I mentioned, from the logs the default-config shard did transition from IsolatedLeader to Follower as expected.

A 401 response means "Unauthorized", which means the credentials were rejected. This would not be an issue with clustering. You need to look at the robot logs to see what it's sending and to which node.

Later on in the log for node3, I see this error repeated a few times:

2017-04-11 08:47:56,268 | INFO | lt-dispatcher-33 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-default-config (Leader): handleAppendEntriesReply: follower member-1-shard-default-config lastIndex 24 is ahead of our lastIndex 21 (snapshotIndex 17) - forcing install snaphot

I think what happened is that while node1 was isolated, node3 kept trying to send AppendEntries at the heartbeat interval, so multiple messages were queued up and eventually got delivered to node1. That produced multiple replies in quick succession in which the follower's last index was higher than the leader's, and each reply tried to force a snapshot install. The first reply sent the snapshot chunk and drained the stream, which caused the error when the second reply tried to send the same chunk.

This error causes the shard actor to restart, but I think it is orthogonal and wouldn't result in a 401 on the HTTP front-end, unless AAA was trying to access the credentials from the datastore - but I don't think that info is stored there. |
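As a small, hypothetical Java sketch of the suspected race: two replies arriving in quick succession both try to send the same snapshot chunk, but the backing stream can only be read once. All names here are illustrative assumptions, not the controller's actual classes:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch (not the actual controller code) of the suspected failure:
// two back-to-back AppendEntriesReply handlers both try to send the same snapshot
// chunk, but the underlying stream has already been drained by the first send.
final class SnapshotChunkSender {
    private final InputStream snapshotStream;
    private boolean chunkSent;  // guard: the chunk can only be produced once

    SnapshotChunkSender(InputStream snapshotStream) {
        this.snapshotStream = snapshotStream;
    }

    byte[] sendNextChunk() throws IOException {
        if (chunkSent) {
            // Without some guard like this, the second reply re-reads a drained
            // stream, which is the kind of IOException reported in this bug.
            throw new IOException("snapshot stream already drained");
        }
        byte[] chunk = snapshotStream.readAllBytes();
        chunkSent = true;
        return chunk;
    }

    public static void main(String[] args) throws IOException {
        SnapshotChunkSender sender =
            new SnapshotChunkSender(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        sender.sendNextChunk();      // first reply: succeeds
        try {
            sender.sendNextChunk();  // second reply in quick succession: fails
        } catch (IOException e) {
            System.out.println("second send rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that a one-shot stream plus two rapid install-snapshot triggers is enough to reproduce the shape of the failure; how the real code should deduplicate the triggers is a separate design question.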
| Comment by Tom Pantelis [ 11/Apr/17 ] |
|
I also noticed multiple messages like:

2017-04-11 08:48:07,057 | INFO | lt-dispatcher-16 | Shard | 226 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-default-config (Leader): follower member-1-shard-default-config last log term 2 conflicts with the leader's 3 - dec next index to -2

I think this is benign. I pushed https://git.opendaylight.org/gerrit/#/c/54780 to prevent decrementing the next index below -1. |
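A minimal sketch, assuming the fix is simply a floor on the decrement as described in the comment; the class and method names are hypothetical and not taken from the actual Gerrit patch:

```java
// Hypothetical sketch of the guard described above: when a follower's last log
// term conflicts with the leader's, the leader backs up the follower's next
// index, but it should never drop below -1.
final class FollowerProgress {
    private long nextIndex;

    FollowerProgress(long initialNextIndex) {
        this.nextIndex = initialNextIndex;
    }

    boolean decrementNextIndex() {
        if (nextIndex < 0) {
            // Already at the floor (-1); refusing here avoids values like the
            // "-2" seen in the log message.
            return false;
        }
        nextIndex--;
        return true;
    }

    long getNextIndex() {
        return nextIndex;
    }

    public static void main(String[] args) {
        FollowerProgress follower = new FollowerProgress(0);
        System.out.println(follower.decrementNextIndex() + " -> " + follower.getNextIndex()); // true -> -1
        System.out.println(follower.decrementNextIndex() + " -> " + follower.getNextIndex()); // false -> -1
    }
}
```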
| Comment by Tom Pantelis [ 12/Apr/17 ] |
|
It seems the 401 issue is more general and occurs with other tests as well. |
| Comment by Tom Pantelis [ 12/Apr/17 ] |