[CONTROLLER-1654] C: write-transactions rpc does not return Created: 04/May/17 Updated: 25/Jul/23 Resolved: 23/May/17 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Peter Gubka | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| External issue ID: | 8372 | ||||||||
| Description |
|
(my job #597) is built from controller=33/56433/5 write-transactions is called on the first node of 3 node cluster with params but is not returned after 5 seconds. It timeout out on robot side after 65 seconds. The test started around 2017-05-04 10:43:09,965 | INFO | h for user karaf | command | 46 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Write_Transactions_1_Node The particular rest call was around :43:10.216 and expected to finish after 5 seconds. It did not happen. Instead there is 2017-05-04 10:43:31,838 | WARN | lt-dispatcher-33 | ClusterCoreDaemon | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.5:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.6:2550, status = Up)]. Node roles [member-1] The system is not under the stress, so i do not really understand why there are akka problems. 2017-05-04 10:43:12,217 | DEBUG | lt-dispatcher-32 | ShardDataTree | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-default-config: Preparing transaction member-1-datastore-config-fe-0-txn-10-0 2017-05-04 10:43:31,823 | DEBUG | lt-dispatcher-32 | CompositeDataTreeCohort | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-datastore-config-fe-0-txn-10-0: canCommit - candidate: [FAILED toString()] |
| Comments |
| Comment by Peter Gubka [ 04/May/17 ] |
|
Attachment write_transactions.tar has been added with description: karaf and robot log |
| Comment by Tom Pantelis [ 04/May/17 ] |
|
I see a couple of these in the member 1 log: safe.akka.slf4j - 2.4.17 | Failed to serialize remote message [class org.opendaylight.controller.cluster.access.concepts.RequestEnvelope] using serializer [class akka.serialization.JavaSerializer]. Transient association error (association remains live) should never be serialized |
| Comment by Tom Pantelis [ 04/May/17 ] |
|
member switched from Leader to Follower shortly before the error: 2017-05-04 10:43:33,364 | INFO | lt-dispatcher-55 | Shard | 219 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 4 Then the front-end reconnected to the new leader: 2017-05-04 10:43:42,271 | INFO | lt-dispatcher-69 | ClientActorBehavior | 224 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier {frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#1279384036, sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@172.17.0.5:2550/user/shardmanager-config/member-2-shard-default-config#1531829272], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}} It looks like the CommitLocalTransactionRequest was initially targeted to go to the local leader and was retried to go to the new remote leader. It seems like we need to somehow change the local request to a remote one in this case. |
| Comment by Peter Gubka [ 04/May/17 ] |
|
(In reply to Tom Pantelis from comment #2) , cookie=0, , cookie=0, Just want to mention again that write-transaction was called via restconf around 10:43:10.2XX and should have finished within 5-6 seconds time. Time 10:43:09,965 looks too late. |
| Comment by Robert Varga [ 04/May/17 ] |
|
That seems to be a duplicate of |
| Comment by Tom Pantelis [ 04/May/17 ] |
|
(In reply to Peter Gubka from comment #3) Not sure what you mean. I didn't mention time 10:43:09,965. The leader change occurred at 10:43:33 and the error at 10:43:42 which is later than 10:43:10. The serialization error would've timed out the request and failed the transaction. |
| Comment by Robert Varga [ 04/May/17 ] |
|
So this is a standalone transaction facing a local->remote connect. We definitely should be translating the message from local to remote equivalent. Since we do not have debugs, it is not clear if the message was already in the queue or we are facing a late forward (i.e. the transaction was allocated while reconnect was happening). I would assume the request was in the queue and it timed out. This is where ProxyHistory$ReconnectCohort.replayRequest() should kick in and route the request to AbstractProxyTransaction.replayRequest(). I will audit the codepaths to see what is going on. |
| Comment by Robert Varga [ 04/May/17 ] |
|
Sorry, that path would be taken by a latecomer request. It seems the trouble lies in RemoteProxyTransaction.handleForwardedRemoteRequest() as it is called from AbstractProxyTransaction.replayMessages(). I will dissect the call paths and untangle the API contract. |
| Comment by Robert Varga [ 04/May/17 ] |
|
There is also another issue in the logs: 2017-05-04 10:42:19,513 | WARN | ult-dispatcher-3 | FrontendReadOnlyTransaction | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request TransactionAbortRequest {target=member-3-datastore-config-fe-0-txn-0-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@172.17.0.6:2550/user/$a#-531325700]}Which looks like a remote request to close a snapshot. |
| Comment by Peter Gubka [ 05/May/17 ] |
|
(In reply to Tom Pantelis from comment #5) My primary problem which i have is that rpc write-transactions did not return (rest). It started 10:43:10.2XX and should have finished 5-6 seconds later. The leader change (i believe) is just a resulting activity which happened. Just want logs from 10:43:10.2XX not to be overlooked. |
| Comment by Peter Gubka [ 05/May/17 ] |
|
(In reply to Robert Varga from comment #9) > What debug logs do you want to be on? |
| Comment by Peter Gubka [ 05/May/17 ] |
|
Same behavior is in 1node setup. Should be reproducible simply. No matter is simple or chained transactions used. |
| Comment by Peter Gubka [ 05/May/17 ] |
|
Attachment karaf_log_1.tar.gz.601_simple has been added with description: logs 1node simple |
| Comment by Peter Gubka [ 05/May/17 ] |
|
Attachment karaf_log_1.tar.gz.603_chained has been added with description: logs 1node chained |
| Comment by Peter Gubka [ 05/May/17 ] |
|
All this above was for module shard test. On prefix shard test i run similar test, 1 node odl, rpc produce-transactions it finished ok, but insted of 5-6 seconds it took 60s to respond correct results to robot. And again there is a timeframe when nothing happened for 20s. But finally it responded http status 200 with data {"output":{"insert-tx":28,"delete-tx":21,"all-tx":49}} around 10:24:56 |
| Comment by Peter Gubka [ 05/May/17 ] |
|
Attachment karaf_log_1.tar.gz has been added with description: prefix shard produce transactions |
| Comment by Robert Varga [ 05/May/17 ] |
|
The fix for TransactionAbortRequest: https://git.opendaylight.org/gerrit/56596 |
| Comment by Vratko Polak [ 23/May/17 ] |
|
Is this a duplicate of |