[CONTROLLER-1665] C: write-transactions does not return if local leader moved Created: 09/May/17 Updated: 25/Jul/23 Resolved: 07/Jun/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Peter Gubka | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: |
| External issue ID: | 8403 |
| Description |
|
The write-transactions RPC was invoked on the leader of the default module shard. While it was running, the leader was moved to a different node. The write-transactions RPC did not return after 30s, and after 90s it was aborted from the Robot side. The bug is similar to https://bugs.opendaylight.org/show_bug.cgi?id=8372, but that time there was no leader movement, just write-transactions being invoked. Logs of the transaction producer node: |
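For illustration only (this is not code from the ODL or test codebase; all names and timeouts here are assumptions): the symptom is a bounded wait on the RPC result that times out because the result future is never completed, rather than the RPC returning within the expected ~30 seconds.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the caller side of the write-transactions RPC.
final class WriteTransactionsCaller {

    // Simulates the hang: the future is never completed, as if the response was lost.
    static CompletableFuture<String> invokeWriteTransactions() {
        return new CompletableFuture<>();
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> rpcResult = invokeWriteTransactions();
        try {
            // The test expects a reply within ~30 s of run time; the Robot side
            // gives up and aborts the request after 90 s.
            String result = rpcResult.get(90, TimeUnit.SECONDS);
            System.out.println("write-transactions returned: " + result);
        } catch (TimeoutException e) {
            System.out.println("write-transactions did not return, aborting");
        }
    }
}
```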
| Comments |
| Comment by Peter Gubka [ 09/May/17 ] |
|
The same behavior happened when write-transactions was run on the default shard leader node and that node was isolated (iptables). Logs: |
| Comment by Robert Varga [ 15/May/17 ] |
|
For the tell-based protocol, after BUG-8422 the deadline for requests is 120 seconds, not 30 seconds (which is the reconnect timer). |
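A rough sketch of the distinction being drawn here (hypothetical names and structure, not the actual cds-access-client implementation): the 30-second timer only decides when the client should attempt a reconnect to a new backend, while a request is only failed once the longer 120-second deadline expires.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: two independent timers applied to a queued request.
final class RequestDeadlines {
    static final Duration RECONNECT_TIMEOUT = Duration.ofSeconds(30);   // triggers a backend reconnect
    static final Duration REQUEST_DEADLINE  = Duration.ofSeconds(120);  // after this the request is failed

    static boolean shouldReconnect(Instant enqueued, Instant now) {
        return Duration.between(enqueued, now).compareTo(RECONNECT_TIMEOUT) > 0;
    }

    static boolean isExpired(Instant enqueued, Instant now) {
        return Duration.between(enqueued, now).compareTo(REQUEST_DEADLINE) > 0;
    }
}
```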
| Comment by Vratko Polak [ 18/May/17 ] |
|
> the deadline for requests is 120 seconds
Re-opening. |
| Comment by Robert Varga [ 19/May/17 ] |
|
One troublesome thing is: java.util.ConcurrentModificationException. I'll try to see what the heck is going on. |
| Comment by Robert Varga [ 19/May/17 ] |
| Comment by Vratko Polak [ 22/May/17 ] |
|
> https://git.opendaylight.org/gerrit/57438
Even after that was merged, this Bug still occurs [0] (module-based shard, tell-based protocol, graceful leader move).
[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s24-t1-k2-k10 |
| Comment by Peter Gubka [ 22/May/17 ] |
|
There is a sandbox run for patches 57598+99 |
| Comment by Peter Gubka [ 23/May/17 ] |
|
Another sandbox run (odl includes patches 57598+99) with debug logging for org.opendaylight.controller.cluster.databroker.actors.dds and org.opendaylight.controller.cluster.access.client |
| Comment by Vratko Polak [ 23/May/17 ] |
|
Still occasionally happens [9] when a (module-based, tell protocol) shard replica is removed from the leader member. Possibly just because the fix [10] has not been merged yet.
[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/719/archives/log.html.gz#s1-s22-t1-k2-k9 |
| Comment by Vratko Polak [ 23/May/17 ] |
|
> shard replica is removed from the leader member.
In controller-csit-3node-clustering-only-carbon/720 this happened reliably, both for module-based and prefix-based shards, but only when the writer/producer was located on the leader member. |
| Comment by Peter Gubka [ 23/May/17 ] |
|
New run with debug logs here. The REST response from the write-transactions RPC was not returned: https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/8/archives/log.html.gz#s1-s2-t1-k2-k11
ODL zip from: https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/ |
| Comment by Peter Gubka [ 25/May/17 ] |
|
Debug logs including Akka. ODL built from controller=63/57763/2:99/57699/3. Not-returned RPC at: |
| Comment by Peter Gubka [ 25/May/17 ] |
|
New debug logs (with Akka). ODL built from controller=70/57770/4:99/57699/3. Not-returned RPC at: |
| Comment by Robert Varga [ 25/May/17 ] |
|
It seems we are hitting TransmitQueue throttling when sending the purge request after the transaction completes. Since there are a ton of requests in the queue at that time, we naturally get throttled. https://git.opendaylight.org/gerrit/57822 eliminates that delay and adds debug logging when we get blocked for more than 100 msec. |
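A minimal sketch of the mechanism described above (assumed names, not the actual TransmitQueue): a bounded send queue whose enqueue blocks once the window towards the backend is full, together with the kind of debug output the patch adds when the caller was blocked for more than 100 ms.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative throttling queue: put() blocks when the queue is full.
final class ThrottlingQueue<T> {
    private static final long WARN_NANOS = TimeUnit.MILLISECONDS.toNanos(100);
    private final BlockingQueue<T> queue = new ArrayBlockingQueue<>(1000);

    void enqueue(T request) throws InterruptedException {
        long start = System.nanoTime();
        queue.put(request);                       // blocks while the queue is full (throttling)
        long blockedNanos = System.nanoTime() - start;
        if (blockedNanos > WARN_NANOS) {
            System.out.printf("enqueue of %s blocked for %d ms%n",
                request, TimeUnit.NANOSECONDS.toMillis(blockedNanos));
        }
    }

    T dequeue() throws InterruptedException {
        return queue.take();
    }
}
```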
| Comment by Vratko Polak [ 26/May/17 ] |
|
This still happens [0], also on the Carbon post-release codebase [1].
[0] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/15/archives/log.html.gz#s1-s24-t1-k2-k10 |
| Comment by Robert Varga [ 29/May/17 ] |
|
In general this is a manifestation of the client actor dying and being restarted: the restarted instance has no state, so there is no one to complete the futures of the old instance, as it has no knowledge of them. The latest failure is an ISE splat during replay of a transaction which is waiting for purge to complete, as the successor txchain is not transitioned to account for the transaction being closed:
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, history=2, cookie=0} creating successor transaction proxy for RemoteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=successor}
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Proxy org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory$Local@3d6366fb open transaction LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open}
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Allocated proxy member-1-datastore-config-fe-0-chn-2-txn-2-0 for transaction member-1-datastore-config-fe-0-chn-2-txn-2-0
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, history=2, cookie=0} created successor transaction proxy LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open}
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Replaying queued request TransactionPurgeRequest{target=member-1-datastore-config-fe-0-chn-2-txn-2-0, sequence=2, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#1472136728]} to successor LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open}
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | LocalReadWriteProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Applying replayed request TransactionPurgeRequest{target=member-1-datastore-config-fe-0-chn-2-txn-2-0, sequence=2, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#1472136728]}
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open} allocated sequence 0
2017-05-29 03:22:27,394 | DEBUG | lt-dispatcher-22 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Transaction proxy LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open} enqueing request TransactionPurgeRequest{target=member-1-datastore-config-fe-0-chn-2-txn-2-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#1472136728]} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$135/333503310@5ed81725, enqueuedTicks=1989409436188}
transmitting entry {}
2017-05-29 03:22:27,399 | DEBUG | lt-dispatcher-22 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Client history org.opendaylight.controller.cluster.databroker.actors.dds.SingleClientHistory@125db7b7 finishing reconnect to ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-773763026], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}}
2017-05-29 03:22:27,399 | DEBUG | lt-dispatcher-22 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Finishing reconnect of proxy RemoteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-3-0, state=successor}
[...]
2017-05-29 03:22:27,499 | ERROR | lt-dispatcher-30 | OneForOneStrategy | 174 - com.typesafe.akka.slf4j - 2.4.17 | Proxy org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory$Local@3d6366fb has LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open} currently open
java.lang.IllegalStateException: Proxy org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory$Local@3d6366fb has LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-2-0, state=open} currently open
after which the actor restarts and receives:
2017-05-29 03:22:27,512 | DEBUG | lt-dispatcher-22 | AbstractClientActor | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-config: switched from org.opendaylight.controller.cluster.access.client.SavingClientActorBehavior@140e3fe1 to org.opendaylight.controller.cluster.databroker.actors.dds.DistributedDataStoreClientBehavior@329ed459 |
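To make the failure mode concrete, here is a minimal sketch (hypothetical names and states, not the actual ProxyHistory code) of the invariant being violated: after the queued requests are replayed, the successor history still sees the transaction in the open state when reconnect finishes, so it throws the IllegalStateException seen above, the client actor is restarted with empty state, and the outstanding futures are never completed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative state tracking for transactions owned by a proxy history.
final class ProxyHistorySketch {
    enum State { OPEN, SEALED, PURGED }

    private final Map<String, State> transactions = new HashMap<>();

    // Creating a successor proxy for a replayed transaction leaves it OPEN here;
    // the bug is that the replayed purge never moves it to PURGED.
    void createSuccessor(String txId) {
        transactions.put(txId, State.OPEN);
    }

    void purge(String txId) {
        transactions.put(txId, State.PURGED);
    }

    // Mirrors the check that produced the "currently open" IllegalStateException above.
    void finishReconnect() {
        transactions.forEach((txId, state) -> {
            if (state == State.OPEN) {
                throw new IllegalStateException("Proxy history has " + txId + " currently open");
            }
        });
    }
}
```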
| Comment by Robert Varga [ 31/May/17 ] |
| Comment by Vratko Polak [ 05/Jun/17 ] |
|
> https://git.opendaylight.org/gerrit/57994
After that got merged, this Bug stopped appearing in most scenarios, except one. That scenario uses a single writer located at a follower, and then the leader is moved to that member. The scenario fails both for module-based and prefix-based shards (tell-based protocol), and in both cases UnreachableMember happens shortly after the movement is started. Here is a karaf.log [24] segment from a recent module-based RelEng test:
2017-06-04 14:13:16,733 | INFO | qtp897580718-78 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.1.SNAPSHOT | Moving leader to local node member-2 for shard default, datastoreType Config
2017-06-04 14:13:19,530 | INFO | lt-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-1}, address: akka.tcp://opendaylight-cluster-data@10.29.15.16:2550 |
| Comment by Vratko Polak [ 05/Jun/17 ] |
|
Link to a failing Sandbox longevity job [25] with code [26] not yet merged and verbose logs.
[25] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/2/ |
| Comment by Robert Varga [ 05/Jun/17 ] |
|
The last patch had a thinko, https://git.opendaylight.org/gerrit/58274 fixes that. |