[CONTROLLER-1645] shard moved during 1M bgp prefix advertizing (with tell-based=true) Created: 27/Apr/17  Updated: 25/Jul/23

Status: Confirmed
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1667 Failure in singleton isolation longev... Confirmed
Duplicate
is duplicated by CONTROLLER-1698 node unreachable if odl under load Resolved
External issue ID: 8318

 Description   

https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/8/

After the BGP peer is connected, 1M prefixes are advertised. From log.html we can see that updates for all 1M prefixes were sent between these timestamps:

2017-04-27 07:36:16,151 INFO BGP-Dummy-1: Common single UPDATE will be generated for both NLRI & WITHDRAWN lists
2017-04-27 07:36:16,151 INFO BGP-Dummy-1: Let's go ...
2017-04-27 07:36:16,160 INFO BGP-Dummy-1: Iteration: 0 - total remaining prefixes: 1000000
...
2017-04-27 07:37:18,545 INFO BGP-Dummy-1: Iteration: 999000 - total remaining prefixes: 1000
2017-04-27 07:37:18,603 INFO BGP-Dummy-1: All update messages generated.
2017-04-27 07:37:18,603 INFO BGP-Dummy-1: Storing performance results.
2017-04-27 07:37:18,603 INFO BGP-Dummy-1: ########## Final results ##########
2017-04-27 07:37:18,603 INFO BGP-Dummy-1: Number of iterations: 1000000

During this time there was a shard leader movement:

https://logs.opendaylight.org/sandbox/jenkins091/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/8/archives/log.html.gz#s1-t7-k2-k3-k7-k3-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k2-k2-k1-k6-k2-k2-k1

In the end, only 26k prefixes are present in ODL.

https://logs.opendaylight.org/sandbox/jenkins091/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/8/archives/log.html.gz#s1-t10-k2-k2
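The 26k figure comes from counting prefix entries in the example-ipv4-topology read back over RESTCONF. A minimal sketch of that counting step; the JSON shape is an assumption based on the l3-unicast-igp-topology augmentation these suites use, not taken from this job's archives:

```python
# Count Ipv4 prefixes in a network-topology RESTCONF response.
# The real payload is fetched with a GET on (assumed path)
# /restconf/operational/network-topology:network-topology/topology/example-ipv4-topology
import json

def count_prefixes(topology_json: str) -> int:
    """Count 'prefix' leaves across all nodes of all topologies in the response."""
    data = json.loads(topology_json)
    total = 0
    for topo in data.get("topology", []):
        for node in topo.get("node", []):
            # The l3-unicast-igp-topology augmentation carries the prefix list
            # (key name assumed from the model).
            attrs = node.get("l3-unicast-igp-topology:igp-node-attributes", {})
            total += len(attrs.get("prefix", []))
    return total

# Tiny synthetic payload with two prefixes, just to exercise the counter.
sample = json.dumps({
    "topology": [{
        "topology-id": "example-ipv4-topology",
        "node": [{
            "node-id": "bgp://192.0.2.1",
            "l3-unicast-igp-topology:igp-node-attributes": {
                "prefix": [{"prefix": "1.1.1.1/32"}, {"prefix": "1.1.1.2/32"}],
            },
        }],
    }]
})

print(count_prefixes(sample))  # 2
```

A count like this, far below 1000000 after the updates finished, is what the linked robot log step reports.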



 Comments   
Comment by Vratko Polak [ 27/Apr/17 ]

> https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/8/

Copied a segment of member-1's karaf.log, up to the point where it marked member-3 unreachable. Note there are a few troubling INFOs and WARNs before the BGP peer has even connected.

2017-04-27 07:34:34,620 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Pushing configuration snapshot 10-rest-connector.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,718 | INFO | er [_value=8185] | WebSocketServer | 253 - org.opendaylight.netconf.sal-rest-connector - 1.5.0.SNAPSHOT | Web socket server started at port 8185.
2017-04-27 07:34:34,733 | INFO | lt-dispatcher-40 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: resolved shard 0 to ShardBackendInfo

{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-default-operational#2033146177], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}

2017-04-27 07:34:34,733 | INFO | lt-dispatcher-40 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: resolving connection ConnectingClientConnection{client=ClientIdentifier

{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=0} to ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}

, cookie=0, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-default-operational#2033146177, sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}}
2017-04-27 07:34:34,734 | INFO | lt-dispatcher-40 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: replaced connection ConnectingClientConnection{client=ClientIdentifier

{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=0} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}

, cookie=0, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-default-operational#2033146177, sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}}
2017-04-27 07:34:34,794 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Successfully pushed configuration snapshot 10-rest-connector.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,794 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Pushing configuration snapshot 10-restconf-service.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,900 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Successfully pushed configuration snapshot 10-restconf-service.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,996 | WARN | lt-dispatcher-18 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for purged transaction member-3-datastore-config-fe-0-txn-11-1, ignoring
2017-04-27 07:34:34,997 | WARN | lt-dispatcher-18 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for commited transaction member-1-datastore-config-fe-0-txn-12-1, ignoring
2017-04-27 07:34:35,514 | WARN | lt-dispatcher-40 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for purged transaction member-1-datastore-config-fe-0-txn-12-1, ignoring
2017-04-27 07:35:44,649 | INFO | 0]-nio2-thread-1 | ServerSession | 30 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:35616
2017-04-27 07:35:44,653 | INFO | 0]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 30 - org.apache.sshd.core - 0.14.0 | Generating host key...
2017-04-27 07:35:47,722 | INFO | 0]-nio2-thread-5 | ServerUserAuthService | 30 - org.apache.sshd.core - 0.14.0 | Session karaf@/10.29.4.11:35616 authenticated
2017-04-27 07:36:09,856 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting suite /w/workspace/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/test/csit/suites/bgpcep/bgpclustering/singlepeer_pc_shm_1Mroutes.robot
2017-04-27 07:36:13,970 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Get Example Bgp Rib Owner
2017-04-27 07:36:14,218 | INFO | lt-dispatcher-21 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: resolved shard 2 to ShardBackendInfo

{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-entity-ownership-operational#-171190699], sessionId=2, version=BORON, maxMessages=1000, cookie=2, shard=entity-ownership}

2017-04-27 07:36:14,219 | INFO | lt-dispatcher-21 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: resolving connection ConnectingClientConnection{client=ClientIdentifier

{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=2} to ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}

, cookie=2, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-entity-ownership-operational#-171190699, sessionId=2, version=BORON, maxMessages=1000, cookie=2, shard=entity-ownership}}
2017-04-27 07:36:14,219 | INFO | lt-dispatcher-21 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-operational: replaced connection ConnectingClientConnection{client=ClientIdentifier

{frontend=member-1-frontend-datastore-operational, generation=0}, cookie=2} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-operational, generation=0}

, cookie=2, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-entity-ownership-operational#-171190699, sessionId=2, version=BORON, maxMessages=1000, cookie=2, shard=entity-ownership}}
2017-04-27 07:36:14,374 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Check_For_Empty_Ipv4_Topology_Before_Talking_1
2017-04-27 07:36:14,725 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Check_For_Empty_Ipv4_Topology_Before_Talking_2
2017-04-27 07:36:14,920 | WARN | lt-dispatcher-16 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-2-datastore-operational-fe-0-txn-1-1, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.136:2550/user/$b#870108630], operations=[], protocol=ABORT}

2017-04-27 07:36:15,078 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Check_For_Empty_Ipv4_Topology_Before_Talking_3
2017-04-27 07:36:15,297 | WARN | lt-dispatcher-16 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-operational-fe-0-txn-1-1, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$b#1585488765], operations=[], protocol=ABORT}

2017-04-27 07:36:15,426 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Reconfigure_ODL_To_Accept_Connection
2017-04-27 07:36:15,825 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Start_Talking_BGP_Speaker
2017-04-27 07:36:16,071 | WARN | lt-dispatcher-21 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-2-datastore-operational-fe-0-txn-2-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.136:2550/user/$b#870108630], operations=[], protocol=ABORT}

2017-04-27 07:36:16,422 | INFO | on-dispatcher-50 | BgpPeer | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Peer Singleton Service example-bgp-rib-service-group registered
2017-04-27 07:36:22,821 | WARN | lt-dispatcher-29 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-2-datastore-operational-fe-0-txn-3-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.136:2550/user/$b#870108630], operations=[], protocol=ABORT}

2017-04-27 07:36:23,929 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Wait_For_Stable_Talking_Ipv4_Topology_1
2017-04-27 07:36:54,837 | WARN | lt-dispatcher-72 | ClusterCoreDaemon | 174 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.13.130:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.29.14.105:2550, status = Up)]. Node roles [member-1]

Comment by Vratko Polak [ 27/Apr/17 ]

Member-3's log is basically the same.
Member-2 was the owner of the peer device, so it has more errors (once again, even before the BGP peer connected):

2017-04-27 07:34:34,121 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Pushing configuration snapshot 10-rest-connector.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,210 | WARN | lt-dispatcher-17 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for purged transaction member-2-datastore-operational-fe-0-txn-0-1, ignoring
2017-04-27 07:34:34,211 | WARN | lt-dispatcher-17 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for commited transaction member-3-datastore-operational-fe-0-txn-0-1, ignoring
2017-04-27 07:34:34,213 | WARN | lt-dispatcher-20 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for purged transaction member-3-datastore-operational-fe-0-txn-0-1, ignoring
2017-04-27 07:34:34,213 | WARN | lt-dispatcher-20 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for commited transaction member-1-datastore-operational-fe-0-txn-0-1, ignoring
2017-04-27 07:34:34,354 | INFO | lt-dispatcher-37 | RIBImpl | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | RIB Singleton Service example-bgp-rib-service-group instantiated, RIB example-bgp-rib
2017-04-27 07:34:34,362 | INFO | lt-dispatcher-19 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-2-frontend-datastore-operational: resolved shard 0 to ShardBackendInfo

{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/shardmanager-operational/member-1-shard-default-operational#2033146177], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}

2017-04-27 07:34:34,363 | INFO | lt-dispatcher-19 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-2-frontend-datastore-operational: resolving connection ConnectingClientConnection{client=ClientIdentifier

{frontend=member-2-frontend-datastore-operational, generation=0}, cookie=0} to ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}

, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/shardmanager-operational/member-1-shard-default-operational#2033146177], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}}
2017-04-27 07:34:34,364 | INFO | lt-dispatcher-19 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-2-frontend-datastore-operational: replaced connection ConnectingClientConnection{client=ClientIdentifier

{frontend=member-2-frontend-datastore-operational, generation=0}, cookie=0} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}

, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/shardmanager-operational/member-1-shard-default-operational#2033146177], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}}
2017-04-27 07:34:34,559 | INFO | er [_value=8185] | WebSocketServer | 253 - org.opendaylight.netconf.sal-rest-connector - 1.5.0.SNAPSHOT | Web socket server started at port 8185.
2017-04-27 07:34:34,669 | WARN | lt-dispatcher-16 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-13-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:34:34,680 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Successfully pushed configuration snapshot 10-rest-connector.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,681 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Pushing configuration snapshot 10-restconf-service.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,725 | WARN | lt-dispatcher-42 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for purged transaction member-1-datastore-operational-fe-0-txn-0-1, ignoring
2017-04-27 07:34:34,781 | INFO | config-pusher | ConfigPusherImpl | 126 - org.opendaylight.controller.config-persister-impl - 0.6.0.SNAPSHOT | Successfully pushed configuration snapshot 10-restconf-service.xml(odl-restconf,odl-restconf)
2017-04-27 07:34:34,784 | INFO | lt-dispatcher-37 | BgpPeer | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Peer Singleton Service example-bgp-rib-service-group instantiated, Peer IpAddress [_ipv4Address=Ipv4Address [_value=192.0.2.1]]
2017-04-27 07:34:34,792 | INFO | lt-dispatcher-37 | AppPeer | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Application Peer Singleton Service example-bgp-rib-service-group instantiated, Application peer 192.0.2.6
2017-04-27 07:34:36,305 | INFO | lt-dispatcher-21 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-2-shard-default-operational status sync done false
2017-04-27 07:34:36,307 | INFO | lt-dispatcher-19 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-2-shard-default-operational status sync done true
2017-04-27 07:34:39,744 | WARN | lt-dispatcher-21 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-14-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:34:41,039 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-12-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:35:47,914 | INFO | 4]-nio2-thread-1 | ServerSession | 30 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:35174
2017-04-27 07:35:47,919 | INFO | 4]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 30 - org.apache.sshd.core - 0.14.0 | Generating host key...
2017-04-27 07:35:57,903 | INFO | 4]-nio2-thread-4 | ServerUserAuthService | 30 - org.apache.sshd.core - 0.14.0 | Session karaf@/10.29.4.11:35174 authenticated
2017-04-27 07:36:09,984 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting suite /w/workspace/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/test/csit/suites/bgpcep/bgpclustering/singlepeer_pc_shm_1Mroutes.robot
2017-04-27 07:36:11,181 | WARN | lt-dispatcher-37 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-15-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:11,518 | WARN | lt-dispatcher-34 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-13-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:11,675 | WARN | lt-dispatcher-34 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-16-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:11,974 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-14-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:12,141 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-17-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:12,435 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-15-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:12,595 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-18-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:12,929 | WARN | lt-dispatcher-16 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-16-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:13,090 | WARN | lt-dispatcher-34 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-19-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:13,380 | WARN | lt-dispatcher-14 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-17-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:13,552 | WARN | lt-dispatcher-14 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-20-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:13,818 | WARN | lt-dispatcher-14 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-18-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:14,053 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Get Example Bgp Rib Owner
2017-04-27 07:36:14,216 | WARN | lt-dispatcher-14 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-21-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:14,455 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Check_For_Empty_Ipv4_Topology_Before_Talking_1
2017-04-27 07:36:14,565 | WARN | lt-dispatcher-21 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-22-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:14,810 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Check_For_Empty_Ipv4_Topology_Before_Talking_2
2017-04-27 07:36:14,824 | WARN | ult-dispatcher-2 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for aborted transaction member-1-datastore-operational-fe-0-txn-1-2, ignoring
2017-04-27 07:36:15,157 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Check_For_Empty_Ipv4_Topology_Before_Talking_3
2017-04-27 07:36:15,287 | WARN | lt-dispatcher-34 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-19-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:15,507 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Reconfigure_ODL_To_Accept_Connection
2017-04-27 07:36:15,604 | WARN | lt-dispatcher-34 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for aborted transaction member-1-datastore-operational-fe-0-txn-2-1, ignoring
2017-04-27 07:36:15,751 | INFO | on-dispatcher-52 | BgpPeer | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Peer Singleton Service example-bgp-rib-service-group registered
2017-04-27 07:36:15,752 | INFO | on-dispatcher-52 | BgpPeer | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Peer Singleton Service example-bgp-rib-service-group instantiated, Peer IpAddress [_ipv4Address=Ipv4Address [_value=10.29.14.35]]
2017-04-27 07:36:15,905 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Start_Talking_BGP_Speaker
2017-04-27 07:36:16,142 | INFO | entLoopGroup-6-1 | StringValueObjectFactory | 66 - org.opendaylight.mdsal.yang-binding - 0.10.0.SNAPSHOT | Instantiated factory for class org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.inet.types.rev130715.Ipv4Address
2017-04-27 07:36:16,142 | INFO | entLoopGroup-6-1 | StringValueObjectFactory | 66 - org.opendaylight.mdsal.yang-binding - 0.10.0.SNAPSHOT | Instantiated factory for class org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.inet.types.rev130715.Ipv4Prefix
2017-04-27 07:36:16,143 | INFO | entLoopGroup-6-1 | StringValueObjectFactory | 66 - org.opendaylight.mdsal.yang-binding - 0.10.0.SNAPSHOT | Instantiated factory for class org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.inet.types.rev130715.Ipv6Address
2017-04-27 07:36:16,143 | INFO | entLoopGroup-6-1 | StringValueObjectFactory | 66 - org.opendaylight.mdsal.yang-binding - 0.10.0.SNAPSHOT | Instantiated factory for class org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.inet.types.rev130715.Ipv6Prefix
2017-04-27 07:36:16,171 | INFO | entLoopGroup-6-1 | StrictBGPPeerRegistry | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | BGP Open message session parameters differ, session still accepted.
2017-04-27 07:36:16,175 | INFO | entLoopGroup-6-1 | BGPSessionImpl | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | BGP HoldTimer new value: 180
2017-04-27 07:36:16,186 | INFO | entLoopGroup-6-1 | BGPPeer | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Session with peer 10.29.14.35 went up with tables [BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv4AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.UnicastSubsequentAddressFamily]] and Add Path tables []
2017-04-27 07:36:16,201 | INFO | entLoopGroup-6-1 | AbstractBGPSessionNegotiator | 271 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | BGP Session with peer [id: 0x5ce299ab, L:/10.29.14.136:1790 - R:/10.29.14.35:17900] established successfully.
2017-04-27 07:36:19,401 | WARN | lt-dispatcher-20 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 349 from log. Cannot apply state. Ignoring 349 to 351
2017-04-27 07:36:21,981 | WARN | lt-dispatcher-19 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 355 from log. Cannot apply state. Ignoring 355 to 357
2017-04-27 07:36:23,538 | WARN | lt-dispatcher-18 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 356 from log. Cannot apply state. Ignoring 356 to 358
2017-04-27 07:36:23,995 | INFO | h for user karaf | command | 320 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Wait_For_Stable_Talking_Ipv4_Topology_1
2017-04-27 07:36:24,413 | WARN | lt-dispatcher-18 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-1-datastore-config-fe-0-txn-23-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}

2017-04-27 07:36:24,714 | WARN | lt-dispatcher-19 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest

{target=member-3-datastore-config-fe-0-txn-20-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}

2017-04-27 07:36:24,876 | WARN | lt-dispatcher-17 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 357 from log. Cannot apply state. Ignoring 357 to 360
2017-04-27 07:36:24,922 | WARN | lt-dispatcher-17 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 360 from log. Cannot apply state. Ignoring 360 to 362
2017-04-27 07:36:25,159 | WARN | ult-dispatcher-2 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-1-datastore-config-fe-0-txn-24-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}
2017-04-27 07:36:25,450 | WARN | lt-dispatcher-42 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-3-datastore-config-fe-0-txn-21-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}
2017-04-27 07:36:30,335 | WARN | lt-dispatcher-42 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 364 from log. Cannot apply state. Ignoring 364 to 366
2017-04-27 07:36:30,544 | WARN | lt-dispatcher-42 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-1-datastore-config-fe-0-txn-25-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}
2017-04-27 07:36:30,618 | WARN | lt-dispatcher-44 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 367 from log. Cannot apply state. Ignoring 367 to 372
2017-04-27 07:36:31,277 | WARN | lt-dispatcher-14 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-3-datastore-config-fe-0-txn-22-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}
2017-04-27 07:36:36,172 | WARN | lt-dispatcher-43 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 368 from log. Cannot apply state. Ignoring 368 to 376
2017-04-27 07:36:37,419 | INFO | lt-dispatcher-14 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-2-shard-default-operational status sync done false
2017-04-27 07:36:38,835 | WARN | lt-dispatcher-16 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-1-datastore-config-fe-0-txn-26-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}
2017-04-27 07:36:39,197 | WARN | lt-dispatcher-16 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-3-datastore-config-fe-0-txn-23-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}
2017-04-27 07:36:42,572 | WARN | lt-dispatcher-35 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 369 from log. Cannot apply state. Ignoring 369 to 380
2017-04-27 07:36:46,020 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-1-datastore-config-fe-0-txn-27-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.130:2550/user/$a#699258237], operations=[], protocol=ABORT}
2017-04-27 07:36:46,345 | WARN | lt-dispatcher-17 | FrontendReadOnlyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request ModifyTransactionRequest{target=member-3-datastore-config-fe-0-txn-24-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.14.105:2550/user/$a#-1001685412], operations=[], protocol=ABORT}
2017-04-27 07:36:54,182 | INFO | lt-dispatcher-37 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-1}, address: akka.tcp://opendaylight-cluster-data@10.29.13.130:2550

Comment by Vratko Polak [ 27/Apr/17 ]

> bgpcep-csit-3node-periodic-bgpclustering-only-carbon/8/

The build tested:
https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/distribution-karaf/0.6.0-SNAPSHOT/distribution-karaf-0.6.0-20170427.071632-5050.zip

Comment by Vratko Polak [ 11/May/17 ]

This is still happening, also with the scale reduced to 300k prefixes.

See logs at [2] a few seconds after 17:13:00. Member-1 and member-3 mark member-2 as unreachable without any apparent reason; subsequent leader movements even result in a "relevant data model content does not exist" [3] response.

[2] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/264/archives/
[3] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/264/archives/log.html.gz#s1-s7-t8-k2-k3-k7-k2-k1-k6-k1-k1-k1-k1-k1-k2-k1-k3-k2-k1

Comment by Tom Pantelis [ 11/May/17 ]

Akka cluster will mark nodes as unreachable because:

1) actual network connectivity has been lost;
2) long GC pauses are causing false positives;
3) too many messages flowing through the remoting pipeline are resulting in late heartbeats.

For #2, you should enable GC logging and check for long pauses and then possibly tweak JVM memory/GC settings if needed.
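As a hypothetical sketch of #2 (the flag names are standard Java 8 HotSpot options and the log path is invented, neither comes from this ticket), GC logging for the Karaf JVM could be turned on roughly like this:

```shell
# Illustrative only: enable HotSpot GC logging (Java 8 syntax) so that
# long stop-the-world pauses become visible in /tmp/odl-gc.log.
export EXTRA_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/odl-gc.log"
```

Karaf's launcher script honors EXTRA_JAVA_OPTS (typically via bin/setenv); pauses of a second or more in the resulting log would be prime suspects for false-positive unreachability.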

For #3, well... not sure there's much we can really do about that right now, except maybe more throttling of transactions, until we can enable artery which promises to alleviate such problems.


Comment by Robert Varga [ 12/May/17 ]

Given the symptoms, I think we are seeing #3, except it is not a lot of messages but rather a few large ones.

With an unthrottled BGP peer, the tell-based protocol can pick up 1M prefixes in about 12 seconds (as opposed to ~30 seconds) and it aggressively reuses transactions, i.e. the entire work is done in ~60 transactions rather than ~40K transactions. This is caused by a much straighter path in the local transaction case; hence the balance between the time to commit a transaction and the time to jam a modification into it is shifted.

This results in huge AppendEntries messages which end up monopolizing the channel, and that is what causes akka to go crazy.

The long-term solution is to go for artery with fragmentation. For short term we have essentially two options:

  • come up with a reasonable strategy of throttling local modifications (i.e. before commit)
  • desensitize akka so it can tolerate these bursts of activity
Comment by Robert Varga [ 12/May/17 ]

http://doc.akka.io/docs/akka/2.4/java/cluster-usage.html#Failure_Detector is something we should be able to tune for the BGP case.
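For context, the knobs involved live under akka.cluster.failure-detector in the controller's akka configuration. A hypothetical tuning sketch (the values below are illustrative, not recommendations from this ticket) might look like:

```hocon
akka.cluster.failure-detector {
  # Phi threshold above which a member is declared unreachable;
  # raising it from the default (8.0) makes the detector less sensitive.
  threshold = 12.0

  # How late a heartbeat may be before phi spikes; raising this from the
  # default (3 s) tolerates bursts such as large AppendEntries messages.
  acceptable-heartbeat-pause = 8 s
}
```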

Comment by Tom Pantelis [ 12/May/17 ]

I'm working on the re-usable fragmentation now so we can switch to artery.

When sending AppendEntries, we already limit the total payload entries size to the snapshot chunk size of 2M. However, if a single payload exceeds the chunk size then we send it as is, because we don't currently have a mechanism to slice/chunk an AppendEntries. I assume we're seeing large single CommitTransactionPayloads. The maximum-frame-size we currently set for akka messages is ~400M, so we're not exceeding that (otherwise we'd see serialization failures).
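The batching rule Tom describes can be sketched with a small, hypothetical example (the class and method names are invented for illustration; the real logic lives in the controller's Raft implementation): entries accumulate into one batch until the chunk-size limit would be exceeded, and a single entry larger than the limit still goes out on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the AppendEntries batching rule described above:
// accumulate entries until the chunk-size limit would be exceeded; a single
// oversized entry still forms its own batch because there is no slicing here.
public class AppendEntriesBatcher {
    static List<List<Integer>> batchBySize(List<Integer> entrySizes, int maxBatchBytes) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int currentBytes = 0;
        for (int size : entrySizes) {
            // Flush the current batch if adding this entry would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size); // an oversized lone entry is sent as-is
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Under a 2M limit, a 3M CommitTransactionPayload still travels as one huge message, which matches the channel-monopolizing behaviour Robert describes.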

Comment by Vratko Polak [ 18/May/17 ]

Just writing one more symptom visible to Robot [5] when this bug happens:

Get on example-ipv4-topology returned status code 404 with message: {"errors":{"error":[{"error-type":"application","error-tag":"data-missing","error-message":"Request could not be completed because the relevant data model content does not exist "}]}}

This is expected if the example-ipv4-topology is governed by a Singleton service which is currently moving between members (having deleted the whole topology for the time being).

[5] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/280/archives/log.html.gz#s1-s2-t8-k2-k3-k7-k3-k1-k6-k1-k1-k1-k1-k1-k2-k1-k3-k1

Comment by Vratko Polak [ 23/May/17 ]

Just adding a symptom visible from Robot [6]:
Restconf read of topology takes too long.
I guess it is the same as CONTROLLER-1672 but that one is concerned with /restconf/modules or jolokia.

[6] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/290/archives/log.html.gz#s1-s2-t8-k2-k3-k7-k4-k1-k6-k1-k1-k1-k1-k1-k2-k1-k3-k1

Comment by Ajay L [ 15/Sep/17 ]

(In reply to Tom Pantelis from comment #8)

Tom, is this fix available to test yet? Thx

Comment by Tom Pantelis [ 15/Sep/17 ]

(In reply to Ajay L from comment #12)
> Tom, is this fix available to test yet? Thx

yes - the AppendEntries chunking has been implemented.

Comment by Ajay L [ 19/Sep/17 ]

(In reply to Tom Pantelis from comment #13)
> yes - the AppendEntries chunking has been implemented.

Thx Tom. We ran the BGP 300k prefix test on master and it still fails for the scenario where the shard leader and the RIB entity owner are on different nodes:
https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-all-oxygen/2/log.html.gz#s1-s4

Is the AppendEntries chunking expected to directly help with the late heartbeats, or is it something that was done as a prerequisite for artery support, with the benefits supposed to kick in only when artery is used? (Artery seems to be disabled currently because of https://bugs.opendaylight.org/show_bug.cgi?id=7518.)

Increasing the akka.cluster.failure-detector.acceptable-heartbeat-pause value did produce better results for the above test. Do you think this is a reasonable adjustment for this scenario? I guess the downside is that it will adversely impact detection of genuine node-down cases.

Comment by Tom Pantelis [ 19/Sep/17 ]

You need to enable the tell-based protocol (the use-tell-based-protocol setting) and see what the results are. Then try with artery enabled. You'll have to fiddle with the akka maximum-frame-size and buffer-pool-size settings and our maximum-message-slice-size setting. The latter will have to be set some cushion (maybe ~10K) below maximum-frame-size.
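The relationship between those settings could look roughly like this (a sketch with invented values; only the setting names come from this thread): with a 512 KiB frame, the slice size sits about 10K lower.

```properties
# akka side, illustrative values:
#   akka.remote.artery.advanced.maximum-frame-size = 524288   # 512 KiB
#   akka.remote.artery.advanced.buffer-pool-size   = 128
#
# controller side (datastore configuration):
# keep the slice size a ~10K cushion below maximum-frame-size,
# e.g. 524288 - 10240:
maximum-message-slice-size=514048
```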

Comment by Ajay L [ 19/Sep/17 ]

The above test already uses tell-based mode. Without it, the numbers are much worse, so tell-based definitely helps. We will tinker with the other settings and share the findings.

Generated at Wed Feb 07 19:56:05 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.