[CONTROLLER-1766] The RaftState of one cluster node changed to 'IsolatedLeader' after a large number of Virtual NWs were created. Created: 04/Sep/17  Updated: 01/Oct/19

Status: Confirmed
Project: controller
Component/s: clustering
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Ran Xiao Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf_odl1_20170921085857.tar.xz.aa     File karaf_odl1_20170921085857.tar.xz.ab     File karaf_odl1_20170921085857.tar.xz.ac     File karaf_odl2_20170921090305.tar.xz.aa     File karaf_odl2_20170921090305.tar.xz.ab     File karaf_odl3_20170921091358.tar.xz.aa     File karaf_odl3_20170921091358.tar.xz.ab     File other.tar.xz     HTML File xaa     HTML File xab     HTML File xac     HTML File xad     HTML File xae     HTML File xaf    
External issue ID: 9090

 Description   

After a large number of Virtual NWs were created, the RaftState of one cluster node became abnormal.
It changed from Leader to IsolatedLeader, Follower to Candidate.
Then RaftState not changed even after 16 hours.
And the RaftState of other 2 nodes did not changed at all.

Environment Details:
OpenStack Version: stable/ocata
ODL Version: carbon-FR + L2GW related patches
ODLのMAX heap: JAVA_MAX_MEM="6144m"

Reproduction steps:
1. Initialization and set up of Controller/Compute/L2GW Node
2. Creation of 1000 virtual NWs
3. Creation of 1 L2GW GATEWAY
4. Creation of 1000 L2GW CONNECTIONs

  • This phenomenon does not always occur (occurrence ratio: 1/3)

Karaf logs:

  • Split Karaf logs are attached.
    The time when RaftState became abnormal is around 17: 20-17: 23.

odl1#
karaf.log.3:2017-08-08 13:28:41,123 | INFO | lt-dispatcher-22 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier:akka.tcp://opendaylight-cluster-data@192.168.101.8:2550/user/shardmanager-operational/member-1-shard-topology-operational/member-1-shard-topology-operational-notifier#1974074480 created and ready for shard:member-1-shard-topology-operational
karaf.log.3:2017-08-08 13:28:41,131 | INFO | lt-dispatcher-23 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-topology-operational , received role change from null to Follower
karaf.log.3:2017-08-08 13:28:41,132 | INFO | lt-dispatcher-23 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-topology-operational , registered listener akka://opendaylight-cluster-data/user/shardmanager-operational
karaf.log.3:2017-08-08 13:28:51,162 | INFO | lt-dispatcher-36 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-topology-operational , received role change from Follower to Candidate
karaf.log.3:2017-08-08 13:28:51,165 | INFO | lt-dispatcher-41 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-topology-operational , received role change from Candidate to Leader
karaf.log:2017-08-08 17:20:35,381 | INFO | lt-dispatcher-45 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-topology-operational , received role change from Leader to IsolatedLeader

odl2#
karaf.log.1:2017-08-08 13:28:41,124 | INFO | lt-dispatcher-23 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier:akka.tcp://opendaylight-cluster-data@192.168.101.13:2550/user/shardmanager-operational/member-2-shard-topology-operational/member-2-shard-topology-operational-notifier#-1501266548 created and ready for shard:member-2-shard-topology-operational
karaf.log.1:2017-08-08 13:28:41,130 | INFO | ult-dispatcher-3 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-2-shard-topology-operational , received role change from null to Follower
karaf.log.1:2017-08-08 13:28:41,138 | INFO | ult-dispatcher-3 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-2-shard-topology-operational , registered listener akka://opendaylight-cluster-data/user/shardmanager-operational
karaf.log:Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@192.168.101.8:2550/), Path(/user/shardmanager-operational/member-1-shard-topology-operational/shard-topology-member-2:datastore-operational@0:89656_200508#2123488974)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".
karaf.log:ReadFailedException{message=Error checking ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[

{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=hwvtep:1}

]/node/node[

{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=hwvtep://uuid/44307ca2-4ebf-4470-8d06-7e377959a41c}

]/AugmentationIdentifier

{childNames=[(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)managers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)switches, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-routers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-binding-stats, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)connection-info, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)acls, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches]}

/(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches/logical-switches[

{(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)hwvtep-node-name=a40b416f-18c8-43a8-9466-7759d62e4ce4}

], errorList=[RpcError [message=Error checking ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[

{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=hwvtep:1}

]/node/node[

{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=hwvtep://uuid/44307ca2-4ebf-4470-8d06-7e377959a41c}

]/AugmentationIdentifier

{childNames=[(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)managers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)switches, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-routers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-binding-stats, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)connection-info, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)acls, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches]}

/(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches/logical-switches[

{(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)hwvtep-node-name=a40b416f-18c8-43a8-9466-7759d62e4ce4}

], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@192.168.101.8:2550/), Path(/user/shardmanager-operational/member-1-shard-topology-operational/shard-topology-member-2:datastore-operational@0:89656_200508#2123488974)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".]]}

odl3#
2017-08-08 13:28:41,130 | INFO | lt-dispatcher-20 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier:akka.tcp://opendaylight-cluster-data@192.168.101.15:2550/user/shardmanager-operational/member-3-shard-topology-operational/member-3-shard-topology-operational-notifier#1779121060 created and ready for shard:member-3-shard-topology-operational
2017-08-08 13:28:41,142 | INFO | lt-dispatcher-22 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-3-shard-topology-operational , received role change from null to Follower
2017-08-08 13:28:41,142 | INFO | lt-dispatcher-48 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-3-shard-topology-operational , registered listener akka://opendaylight-cluster-data/user/shardmanager-operational



 Comments   
Comment by Ran Xiao [ 04/Sep/17 ]

Attachment xaa has been added with description: Karaf log 1

Comment by Ran Xiao [ 04/Sep/17 ]

Attachment xab has been added with description: Karaf log 2

Comment by Ran Xiao [ 04/Sep/17 ]

Attachment xac has been added with description: Karaf log 3

Comment by Ran Xiao [ 04/Sep/17 ]

Attachment xad has been added with description: Karaf log 4

Comment by Ran Xiao [ 04/Sep/17 ]

Attachment xae has been added with description: Karaf log 5

Comment by Ran Xiao [ 04/Sep/17 ]

Attachment xaf has been added with description: Karaf log 6

Comment by Tom Pantelis [ 08/Sep/17 ]

The attached logs are in some binary format. Are they zipped? If so what format - the first one attached was just named "xaa".

There are 6 attachments - does that mean it's a 6-node cluster?

The change from Leader to IsolatedLeader indicates the leader lost connection to a majority of the followers, or at least did not receive heartbeat replies for a period of time.

Comment by Ran Xiao [ 11/Sep/17 ]

(In reply to Tom Pantelis from comment #7)

Hi Tom,

Sorry for not having discribed it clearly.
Log files are zipped and split(6 parts), as the size of attachment is limited.
For checking logs, pls download all these 6 files, and run the command below:
cat xaa .. xaf > odl.tar.xz
tar Jxvf odl.tar.xz

This is a 3-node cluster.

>>The change from Leader to IsolatedLeader indicates the leader lost connection to a majority of the followers,or at least did not receive heartbeat replies for a period of time.
How long is this "a period of time”?
How is it set? Is it set by magic number or config parameter?

After the state changed into IsolatedLeader does the situation that the cluster cannot return to normal states happen?
For example, does it cannot return after waited a period of time?

Comment by Tom Pantelis [ 12/Sep/17 ]

(In reply to Ran Xiao from comment #8)
> (In reply to Tom Pantelis from comment #7)
>
> Hi Tom,
>
> Sorry for not having discribed it clearly.
> Log files are zipped and split(6 parts), as the size of attachment is
> limited.
> For checking logs, pls download all these 6 files, and run the command below:
> cat xaa .. xaf > odl.tar.xz
> tar Jxvf odl.tar.xz
>
> This is a 3-node cluster.
>
> >>The change from Leader to IsolatedLeader indicates the leader lost connection to a majority of the followers,or at least did not receive heartbeat replies for a period of time.
> How long is this "a period of time”?
> How is it set? Is it set by magic number or config parameter?
>

10s by default. It is configurable.

> After the state changed into IsolatedLeader does the situation that the
> cluster cannot return to normal states happen?
> For example, does it cannot return after waited a period of time?

Once it regains connectivity to a majority of the followers it will transition back to Leader.

In odl1 karaf.log, reachability with 192.168.101.8 flapped for a bit:

2017-08-08 19:17:46,725 | WARN | lt-dispatcher-29 | ClusterCoreDaemon | 186 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.101.8:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@192.168.101.13:2550, status = Up)]. Node roles [member-1]
2017-08-08 19:17:56,723 | INFO | lt-dispatcher-40 | kka://opendaylight-cluster-data) | 186 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.101.8:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@192.168.101.13:2550, status = Up)]. Node roles [member-1]

It seems that once it went to IsolatedLeader, it never again received a heartbeat response from a follower. It could be due to load - it would be beneficial to monitor the CPU/memory usage and GC activity. Also thread dumps on the followers - perhaps there's a stuck thread.

Re: the logging, this is the only message I see on odl1 w.r.t IsolatedLeader:

karaf.log:2017-08-08 17:20:35,312 | INFO | lt-dispatcher-60 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-inventory-config , received role change from Leader to IsolatedLeader

This is odd b/c there should be additional logging (INFO/WARN), ie

"Switching from behavior Leader to IsolatedLeader ..."
"At least {} followers need to be active, Switching {} from Leader to IsolatedLeader"

It seems there's missing logging or the upstream clustering code has been forked...

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl1_20170921085857.tar.xz.aa has been added with description: karaf_odl1_20170921085857.tar.xz.aa odl1 karaf.log part 1

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl1_20170921085857.tar.xz.ab has been added with description: karaf_odl1_20170921085857.tar.xz.ab odl1 karaf.log part 2

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl1_20170921085857.tar.xz.ac has been added with description: karaf_odl1_20170921085857.tar.xz.ac odl1 karaf.log part 3

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl2_20170921090305.tar.xz.aa has been added with description: karaf_odl2_20170921090305.tar.xz.aa odl2 karaf.log part 1

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl2_20170921090305.tar.xz.ab has been added with description: karaf_odl2_20170921090305.tar.xz.ab odl2 karaf.log part 2

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl3_20170921091358.tar.xz.aa has been added with description: karaf_odl3_20170921091358.tar.xz.aa odl3 karaf.log part 1

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment karaf_odl3_20170921091358.tar.xz.ab has been added with description: karaf_odl3_20170921091358.tar.xz.ab odl3 karaf.log part 2

Comment by Ran Xiao [ 22/Sep/17 ]

Attachment other.tar.xz has been added with description: other.tar.xz Various information

Comment by Ran Xiao [ 22/Sep/17 ]

(In reply to Tom Pantelis from comment #9)

We reproduced this issue and collected the information mentioned above.

The following seems to have occurred:
1. ODL process of ODL2 was stopped
2. ODL1 port: 2550 was closed
We think these caused the IsolatedLeader issue, right?

We checked karaf log but cann't find out why the above two phenomenon occurred.
Do you have any idea?

Attachment:
karaf_odl1_20170921085857.tar.xz.aa odl1 karaf.log part 1
karaf_odl1_20170921085857.tar.xz.ab odl1 karaf.log part 2
karaf_odl1_20170921085857.tar.xz.ac odl1 karaf.log part 3
karaf_odl2_20170921090305.tar.xz.aa odl2 karaf.log part 1
karaf_odl2_20170921090305.tar.xz.ab odl2 karaf.log part 2
karaf_odl3_20170921091358.tar.xz.aa odl3 karaf.log part 1
karaf_odl3_20170921091358.tar.xz.ab odl3 karaf.log part 2
other.tar.xz Various information
lsof -p

{ODL PID}

netstat -an
ps -C java -L -o pcpu,cpu,nice,state,cputime,pid,tid | sort
ps -ef(ps_odl[123].txt)
jstack

{ODL_PID}

(threaddump_odl[13].txt)
top

Generated at Wed Feb 07 19:56:24 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.