[CONTROLLER-2076] Two-Node Cluster Fails to Install Snapshot on Clean Follower Node Created: 17/Apr/23  Updated: 22/Jan/24

Status: In Progress
Project: controller
Component/s: clustering
Affects Version/s: Sodium SR4
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Peter Suna Assignee: Ivan Hrasko
Resolution: Unresolved Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File karafFollowerMember-2.log     Text File karafLeaderMember-1.log     File member-1-akka.conf     File member-2-akka.conf     File module-shards.conf     File modules.conf    

 Description   

The two-node cluster fails to start when the leader has stored snapshot data and the follower is started as a clean instance without snapshot data. The leader node starts correctly, but the follower node repeatedly logs the following warning:

2023-04-14T13:40:29,326 | WARN  | Thread-35        | AbstractShardBackendResolver     | 219 - org.opendaylight.controller.sal-distributed-datastore - 1.10.4 | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:151) ~[219:org.opendaylight.controller.sal-distributed-datastore:1.10.4]
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.onConnectResponse(AbstractShardBackendResolver.java:168) ~[219:org.opendaylight.controller.sal-distributed-datastore:1.10.4]
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$4(AbstractShardBackendResolver.java:161) ~[219:org.opendaylight.controller.sal-distributed-datastore:1.10.4]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation 0 was superseded by 2 

This issue was observed in the Sodium SR4 release; whether it is also present on current master still needs to be verified.

To build the environment where I observed this issue, I used the integration-distribution repository with Java 11:
https://github.com/opendaylight/integration-distribution/tree/release/sodium-sr4
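
A minimal build sketch, assuming a standard git/Maven/JDK 11 toolchain (the exact tarball name under karaf/target/ depends on the distribution version):

    # clone the Sodium SR4 branch of the distribution
    git clone -b release/sodium-sr4 https://github.com/opendaylight/integration-distribution.git
    cd integration-distribution

    # build the Karaf distribution with Java 11; skipping tests keeps the build short
    mvn clean install -DskipTests

    # the assembled distribution is produced under karaf/target/
    ls karaf/target/*.tar.gz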

Steps to reproduce the issue are as follows:

(Prepare environment)
1) Start the two-node cluster and verify that it is working correctly. The initial configuration files (member-1-akka.conf, member-2-akka.conf, module-shards.conf, modules.conf) are attached; an illustrative shard layout is shown below.
`feature:install odl-netconf-clustered-topology odl-restconf-nb-rfc8040 odl-clustering-test-app`
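
For orientation only (the attached files are authoritative), a two-node module-shards.conf typically lists both members as replicas of each configured shard, along the lines of:

    module-shards = [
        {
            name = "default"
            shards = [
                {
                    name = "default"
                    replicas = ["member-1", "member-2"]
                }
            ]
        }
    ]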
2) Add enough data for ODL to create snapshots; my snapshot size is around 500 MB. The request below is issued repeatedly with an incrementing $id.

    # Issue the request repeatedly with an incrementing $id so the shard journal
    # grows large enough for ODL to create snapshots (the iteration count is illustrative).
    for id in $(seq 1 100000); do
        curl --request POST 'http://192.168.56.101:8181/rests/data/car:cars' \
            --header 'Authorization: Basic YWRtaW46YWRtaW4=' \
            --header 'Content-Type: application/json' \
            --data '{
                        "car-entry": [
                            {
                                "id": "id-'"$id"'-model",
                                "model": "Lorem ipsum dolor....",
                                "manufacturer": "Lorem ipsum dorem ....",
                                "year": 198454,
                                "category": "Lorem ipsum dolor ..."
                            }
                        ]
                    }'
    done

3) Verify that the cluster still works correctly, including after both nodes are restarted with snapshots in place; a shard-state check is sketched below.
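
One way to check shard state, assuming the odl-jolokia feature is installed and using the default config shard as an example (adjust the shard name to match the attached module-shards.conf):

    # query shard state on member-1; on member-2 use its IP and name=member-2-shard-default-config
    curl -u admin:admin \
        'http://192.168.56.101:8181/jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=Shards,name=member-1-shard-default-config'
    # the reply should report a RaftState of Leader/Follower and a non-empty Leader field on both nodes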

(Testing the issue)
1) Replace the ODL folder on the follower node with a clean ODL distribution.
2) Start the ODL leader, then start the follower node and install the required Karaf features (a shell sketch of both steps follows).
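
A rough sketch of these two steps on the follower (member-2), assuming the distribution is unpacked into ~/odl and the attached member-2 configuration files are copied back into configuration/initial/ of the clean install; the tarball name and paths are illustrative:

    # stop karaf on the follower and move the old installation (with snapshots) aside
    ~/odl/bin/stop
    mv ~/odl ~/odl-with-snapshots

    # unpack a clean distribution and restore only the static cluster configuration
    tar -xzf opendaylight-sodium-sr4.tar.gz -C ~ && mv ~/opendaylight-* ~/odl
    cp member-2-akka.conf ~/odl/configuration/initial/akka.conf
    cp module-shards.conf modules.conf ~/odl/configuration/initial/

    # once the leader (member-1) is up, start the clean follower and install the features from step 1
    ~/odl/bin/start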

 



 Comments   
Comment by Robert Varga [ 11/May/23 ]

Note that having only two members configured is not something we support – RAFT requires an odd number of members to prevent split-brain problems.
