[CONTROLLER-1766] The RaftState of one cluster node changed to 'IsolatedLeader' after a large number of Virtual NWs were created. Created: 04/Sep/17 Updated: 01/Oct/19 |
|
| Status: | Confirmed |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Ran Xiao | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 9090 |
| Description |
|
After a large number of Virtual NWs were created, the RaftState of one cluster node became abnormal. Environment Details: Reproduction steps:
Karaf logs:
odl1# odl2# ]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=hwvtep://uuid/44307ca2-4ebf-4470-8d06-7e377959a41c}]/AugmentationIdentifier {childNames=[(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)managers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)switches, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-routers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-binding-stats, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)connection-info, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)acls, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches]}/(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches/logical-switches[ {(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)hwvtep-node-name=a40b416f-18c8-43a8-9466-7759d62e4ce4}], errorList=[RpcError [message=Error checking ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=hwvtep:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=hwvtep://uuid/44307ca2-4ebf-4470-8d06-7e377959a41c}]/AugmentationIdentifier {childNames=[(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)managers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)switches, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-routers, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-ucast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-binding-stats, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)connection-info, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)acls, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)local-arp-sources, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)remote-mcast-macs, (urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches]}/(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)logical-switches/logical-switches[ {(urn:opendaylight:params:xml:ns:yang:ovsdb:hwvtep?revision=2015-09-01)hwvtep-node-name=a40b416f-18c8-43a8-9466-7759d62e4ce4}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@192.168.101.8:2550/), Path(/user/shardmanager-operational/member-1-shard-topology-operational/shard-topology-member-2:datastore-operational@0:89656_200508#2123488974)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".]]} odl3# |
| Comments |
| Comment by Ran Xiao [ 04/Sep/17 ] |
|
Attachment xaa has been added with description: Karaf log 1 |
| Comment by Ran Xiao [ 04/Sep/17 ] |
|
Attachment xab has been added with description: Karaf log 2 |
| Comment by Ran Xiao [ 04/Sep/17 ] |
|
Attachment xac has been added with description: Karaf log 3 |
| Comment by Ran Xiao [ 04/Sep/17 ] |
|
Attachment xad has been added with description: Karaf log 4 |
| Comment by Ran Xiao [ 04/Sep/17 ] |
|
Attachment xae has been added with description: Karaf log 5 |
| Comment by Ran Xiao [ 04/Sep/17 ] |
|
Attachment xaf has been added with description: Karaf log 6 |
| Comment by Tom Pantelis [ 08/Sep/17 ] |
|
The attached logs are in some binary format. Are they zipped? If so what format - the first one attached was just named "xaa". There are 6 attachments - does that mean it's a 6-node cluster? The change from Leader to IsolatedLeader indicates the leader lost connection to a majority of the followers, or at least did not receive heartbeat replies for a period of time. |
| Comment by Ran Xiao [ 11/Sep/17 ] |
|
(In reply to Tom Pantelis from comment #7) Hi Tom, Sorry for not having discribed it clearly. This is a 3-node cluster. >>The change from Leader to IsolatedLeader indicates the leader lost connection to a majority of the followers,or at least did not receive heartbeat replies for a period of time. After the state changed into IsolatedLeader does the situation that the cluster cannot return to normal states happen? |
| Comment by Tom Pantelis [ 12/Sep/17 ] |
|
(In reply to Ran Xiao from comment #8) 10s by default. It is configurable. > After the state changed into IsolatedLeader does the situation that the Once it regains connectivity to a majority of the followers it will transition back to Leader. In odl1 karaf.log, reachability with 192.168.101.8 flapped for a bit: 2017-08-08 19:17:46,725 | WARN | lt-dispatcher-29 | ClusterCoreDaemon | 186 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.101.8:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@192.168.101.13:2550, status = Up)]. Node roles [member-1] It seems that once it went to IsolatedLeader, it never again received a heartbeat response from a follower. It could be due to load - it would be beneficial to monitor the CPU/memory usage and GC activity. Also thread dumps on the followers - perhaps there's a stuck thread. Re: the logging, this is the only message I see on odl1 w.r.t IsolatedLeader: karaf.log:2017-08-08 17:20:35,312 | INFO | lt-dispatcher-60 | RoleChangeNotifier | 204 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | RoleChangeNotifier for member-1-shard-inventory-config , received role change from Leader to IsolatedLeader This is odd b/c there should be additional logging (INFO/WARN), ie "Switching from behavior Leader to IsolatedLeader ..." It seems there's missing logging or the upstream clustering code has been forked... |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl1_20170921085857.tar.xz.aa has been added with description: karaf_odl1_20170921085857.tar.xz.aa odl1 karaf.log part 1 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl1_20170921085857.tar.xz.ab has been added with description: karaf_odl1_20170921085857.tar.xz.ab odl1 karaf.log part 2 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl1_20170921085857.tar.xz.ac has been added with description: karaf_odl1_20170921085857.tar.xz.ac odl1 karaf.log part 3 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl2_20170921090305.tar.xz.aa has been added with description: karaf_odl2_20170921090305.tar.xz.aa odl2 karaf.log part 1 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl2_20170921090305.tar.xz.ab has been added with description: karaf_odl2_20170921090305.tar.xz.ab odl2 karaf.log part 2 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl3_20170921091358.tar.xz.aa has been added with description: karaf_odl3_20170921091358.tar.xz.aa odl3 karaf.log part 1 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment karaf_odl3_20170921091358.tar.xz.ab has been added with description: karaf_odl3_20170921091358.tar.xz.ab odl3 karaf.log part 2 |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
Attachment other.tar.xz has been added with description: other.tar.xz Various information |
| Comment by Ran Xiao [ 22/Sep/17 ] |
|
(In reply to Tom Pantelis from comment #9) We reproduced this issue and collected the information mentioned above. The following seems to have occurred: We checked karaf log but cann't find out why the above two phenomenon occurred. Attachment: netstat -an (threaddump_odl[13].txt) |