[CONTROLLER-1654] C: write-transactions rpc does not return Created: 04/May/17  Updated: 25/Jul/23  Resolved: 23/May/17

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf_log_1.tar.gz     File karaf_log_1.tar.gz.601_simple     File karaf_log_1.tar.gz.603_chained     File write_transactions.tar    
Issue Links:
Duplicate
is duplicated by CONTROLLER-1644 tell-based-protocol: attempt to seria... Resolved
External issue ID: 8372

 Description   

(my job #597)
ODL used: https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/integration/distribution/distribution-karaf/0.6.0-SNAPSHOT/distribution-karaf-0.6.0-20170503.152424-68.zip

It was built from controller=33/56433/5 by
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/11/

write-transactions is called on the first node of a 3-node cluster with these parameters:
url: http://172.17.0.4:8181/restconf/operations/odl-mdsal-lowlevel-control:write-transactions, data: <input xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control">
<id>prefix-1</id>
<seconds>5</seconds>
<transactions-per-second>10</transactions-per-second>
<chained-transactions>False</chained-transactions>
</input>

but it does not return after 5 seconds. It timed out on the Robot side after 65 seconds.

The test started around 2017-05-04 10:43:09,965 | INFO | h for user karaf | command | 46 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Write_Transactions_1_Node

The particular REST call was made around 10:43:10.216 and was expected to finish after 5 seconds. That did not happen.

Instead, the log shows:

2017-05-04 10:43:31,838 | WARN | lt-dispatcher-33 | ClusterCoreDaemon | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.5:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.6:2550, status = Up)]. Node roles [member-1]

The system is not under stress, so I do not really understand why there are Akka problems.

2017-05-04 10:43:12,217 | DEBUG | lt-dispatcher-32 | ShardDataTree | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-default-config: Preparing transaction member-1-datastore-config-fe-0-txn-10-0

2017-05-04 10:43:31,823 | DEBUG | lt-dispatcher-32 | CompositeDataTreeCohort | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-datastore-config-fe-0-txn-10-0: canCommit - candidate: [FAILED toString()]



 Comments   
Comment by Peter Gubka [ 04/May/17 ]

Attachment write_transactions.tar has been added with description: karaf and robot log

Comment by Tom Pantelis [ 04/May/17 ]

I see a couple of these in the member 1 log:

safe.akka.slf4j - 2.4.17 | Failed to serialize remote message [class org.opendaylight.controller.cluster.access.concepts.RequestEnvelope] using serializer [class akka.serialization.JavaSerializer]. Transient association error (association remains live)
akka.remote.MessageSerializer$SerializationException: Failed to serialize remote message [class org.opendaylight.controller.cluster.access.concepts.RequestEnvelope] using serializer [class akka.serialization.JavaSerializer].
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:61)[210:com.typesafe.akka.remote:2.4.17]
...
Caused by: java.lang.UnsupportedOperationException: Local transaction request CommitLocalTransactionRequest{target=member-1-datastore-config-fe-0-txn-10-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-188317460], coordinated=false} should never be serialized
at org.opendaylight.controller.cluster.access.commands.AbstractLocalTransactionRequest.externalizableProxy(AbstractLocalTransactionRequest.java:34)[223:org.opendaylight.controller.cds-access-api:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.commands.AbstractLocalTransactionRequest.externalizableProxy(AbstractLocalTransactionRequest.java:23)[223:org.opendaylight.controller.cds-access-api:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.concepts.Request.externalizableProxy(Request.java:27)[223:org.opendaylight.controller.cds-access-api:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.concepts.Message.writeReplace(Message.java:170)[223:org.opendaylight.controller.cds-access-api:1.1.0.SNAPSHOT]
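
For context on the mechanics: Java serialization invokes writeReplace(), which here delegates to externalizableProxy(), and the local-only request class rejects that call because local requests are only ever valid inside one JVM. A minimal, self-contained sketch of the pattern, with simplified and partly hypothetical names (the real classes in cds-access-api carry generics and more state):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Simplified sketch of the guard seen in the stack trace above.
abstract class Message implements Serializable {
    abstract Object externalizableProxy();

    // Java serialization hook: invoked whenever an instance is written out.
    final Object writeReplace() {
        return externalizableProxy();
    }
}

final class CommitLocalTransactionRequest extends Message {
    @Override
    Object externalizableProxy() {
        throw new UnsupportedOperationException(
            "Local transaction request " + this + " should never be serialized");
    }
}

public final class SerializationGuardDemo {
    public static void main(String[] args) throws Exception {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(new CommitLocalTransactionRequest());
        } catch (UnsupportedOperationException e) {
            // Reproduces the failure mode from the log.
            System.out.println(e.getMessage());
        }
    }
}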

Comment by Tom Pantelis [ 04/May/17 ]

member switched from Leader to Follower shortly before the error:

2017-05-04 10:43:33,364 | INFO | lt-dispatcher-55 | Shard | 219 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 4

Then the front-end reconnected to the new leader:

2017-05-04 10:43:42,271 | INFO | lt-dispatcher-69 | ClientActorBehavior | 224 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#1279384036], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@172.17.0.5:2550/user/shardmanager-config/member-2-shard-default-config#1531829272], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default}}

It looks like the CommitLocalTransactionRequest was initially targeted to go to the local leader and was retried to go to the new remote leader. It seems like we need to somehow change the local request to a remote one in this case.
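
To make that direction concrete, here is a purely hypothetical sketch of such a translation. Every name in it (LocalCommitRequest, RemoteCommitRequest, toRemote) is illustrative and is not the actual cds-access-api/cds-access-client API; it only shows the idea of re-encoding JVM-local state into a wire-safe request before retrying against a remote leader:

// Hypothetical sketch only; these are not the real controller classes.
interface TransactionRequest { }

// A local request holds a shared-heap modification and is deliberately
// not Serializable (cf. the guard shown earlier in this ticket).
final class LocalCommitRequest implements TransactionRequest {
    final Object dataTreeModification;
    final boolean coordinated;

    LocalCommitRequest(Object dataTreeModification, boolean coordinated) {
        this.dataTreeModification = dataTreeModification;
        this.coordinated = coordinated;
    }

    // On reconnect to a remote leader, re-encode the modification into a
    // wire-safe form instead of replaying the local request verbatim.
    RemoteCommitRequest toRemote() {
        return new RemoteCommitRequest(encode(dataTreeModification), coordinated);
    }

    private static byte[] encode(Object modification) {
        // Placeholder: the real code would stream the modification in its
        // normalized form, not via Java serialization.
        return modification.toString().getBytes();
    }
}

// The remote equivalent carries only serializable state.
final class RemoteCommitRequest implements TransactionRequest, java.io.Serializable {
    final byte[] encodedModification;
    final boolean coordinated;

    RemoteCommitRequest(byte[] encodedModification, boolean coordinated) {
        this.encodedModification = encodedModification;
        this.coordinated = coordinated;
    }
}

The real reconnect path, discussed by Robert below, would have to perform an equivalent translation when it replays queued requests to the new backend.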

Comment by Peter Gubka [ 04/May/17 ]

(In reply to Tom Pantelis from comment #2)
> member switched from Leader to Follower shortly before the error:
>
> 2017-05-04 10:43:33,364 | INFO | lt-dispatcher-55 | Shard | 219 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 4
>
> Then the front-end reconnected to the new leader:
>
> 2017-05-04 10:43:42,271 | INFO | lt-dispatcher-69 | ClientActorBehavior | 224 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{...} with ConnectedClientConnection{...}
>
> It looks like the CommitLocalTransactionRequest was initially targeted to go
> to the local leader and was retried to go to the new remote leader. It seems
> like we need to somehow change the local request to a remote one in this
> case.

I just want to mention again that write-transactions was called via RESTCONF around 10:43:10.2XX and should have finished within 5-6 seconds. Time 10:43:09,965 looks too late.

Comment by Robert Varga [ 04/May/17 ]

That seems to be a duplicate of CONTROLLER-1644. I'll take a closer look.

Comment by Tom Pantelis [ 04/May/17 ]

(In reply to Peter Gubka from comment #3)
> Just want to mention again that write-transaction was called via restconf
> around 10:43:10.2XX and should have finished within 5-6 seconds time. Time
> 10:43:09,965 looks too late.

Not sure what you mean. I didn't mention time 10:43:09,965. The leader change occurred at 10:43:33 and the error at 10:43:42, both of which are later than 10:43:10. The serialization error would've timed out the request and failed the transaction.

Comment by Robert Varga [ 04/May/17 ]

So this is a standalone transaction facing a local->remote connect. We definitely should be translating the message from the local to the remote equivalent.

Since we do not have debug logs, it is not clear whether the message was already in the queue or whether we are facing a late forward (i.e. the transaction was allocated while the reconnect was happening). I would assume the request was in the queue and it timed out.

This is where ProxyHistory$ReconnectCohort.replayRequest() should kick in and route the request to AbstractProxyTransaction.replayRequest(). I will audit the codepaths to see what is going on.

Comment by Robert Varga [ 04/May/17 ]

Sorry, that path would be taken by a latecomer request. It seems the trouble lies in RemoteProxyTransaction.handleForwardedRemoteRequest() as it is called from AbstractProxyTransaction.replayMessages(). I will dissect the call paths and untangle the API contract.

Comment by Robert Varga [ 04/May/17 ]

There is also another issue in the logs:

2017-05-04 10:42:19,513 | WARN | ult-dispatcher-3 | FrontendReadOnlyTransaction | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request TransactionAbortRequest{target=member-3-datastore-config-fe-0-txn-0-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@172.17.0.6:2550/user/$a#-531325700]}

Which looks like a remote request to close a snapshot.
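
For illustration, the general shape of dispatch that produces such a warning. All names here are hypothetical and simplified, not the actual FrontendReadOnlyTransaction code; the point is that a read-only transaction services only read-style requests, so an abort request falls through to the rejection branch:

import java.util.logging.Logger;

// Hypothetical sketch of the rejection seen in the WARN line above.
final class ReadOnlyTransactionHandler {
    private static final Logger LOG = Logger.getLogger("ReadOnlyTransactionHandler");

    interface Request { }
    static final class ReadRequest implements Request {
        final String path;
        ReadRequest(String path) { this.path = path; }
    }
    static final class TransactionAbortRequest implements Request {
        final String target;
        TransactionAbortRequest(String target) { this.target = target; }
    }

    Object handle(Request request) {
        if (request instanceof ReadRequest) {
            return "data at " + ((ReadRequest) request).path;
        }
        // A read-only transaction has nothing to abort; the remote side
        // apparently just wants the snapshot closed, which the gerrit change
        // linked later in this ticket addresses.
        LOG.warning("Rejecting unsupported request " + request);
        throw new UnsupportedOperationException("Unhandled request " + request);
    }
}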

Comment by Peter Gubka [ 05/May/17 ]

(In reply to Tom Pantelis from comment #5)
> (In reply to Peter Gubka from comment #3)
> > Just want to mention again that write-transaction was called via restconf
> > around 10:43:10.2XX and should have finished within 5-6 seconds time. Time
> > 10:43:09,965 looks too late.
>
> Not sure what you mean. I didn't mention time 10:43:09,965. The leader
> change occurred at 10:43:33 and the error at 10:43:42 which is later than
> 10:43:10. The serialization error would've timed out the request and failed
> the transaction.

My primary problem is that the write-transactions RPC did not return over REST. It started at 10:43:10.2XX and should have finished 5-6 seconds later. The leader change, I believe, is just a resulting activity. I just want the logs from 10:43:10.2XX not to be overlooked.

Comment by Peter Gubka [ 05/May/17 ]

(In reply to Robert Varga from comment #9)
> There is also another issue in the logs:
>
> 2017-05-04 10:42:19,513 | WARN | ult-dispatcher-3 | FrontendReadOnlyTransaction | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Rejecting unsupported request TransactionAbortRequest{target=member-3-datastore-config-fe-0-txn-0-0, sequence=1, replyTo=Actor[akka.tcp://opendaylight-cluster-data@172.17.0.6:2550/user/$a#-531325700]}
>
> Which looks like a remote request to close a snapshot.

Which debug logs do you want turned on?

Comment by Peter Gubka [ 05/May/17 ]

The same behavior occurs in a 1-node setup.

It should be simple to reproduce (see the sketch after these steps):
1) Start ODL with tell-based=true
2) Make a POST
http://<ip>:8181/restconf/operations/odl-mdsal-lowlevel-control:write-transactions,
<input xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control">
<id>prefix-1</id>
<seconds>5</seconds>
<transactions-per-second>10</transactions-per-second>
<chained-transactions>false</chained-transactions>
</input>

It does not matter whether simple or chained transactions are used.
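
For reference, a minimal reproduction sketch of step 2 using Java's built-in HTTP client. The 127.0.0.1 host and the default admin/admin credentials are assumptions about a local setup; adjust as needed:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Base64;

// Posts the write-transactions input from this ticket to a local ODL node.
public final class WriteTransactionsRepro {
    public static void main(String[] args) throws Exception {
        String body = """
            <input xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control">
              <id>prefix-1</id>
              <seconds>5</seconds>
              <transactions-per-second>10</transactions-per-second>
              <chained-transactions>false</chained-transactions>
            </input>""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://127.0.0.1:8181/restconf/operations/"
                + "odl-mdsal-lowlevel-control:write-transactions"))
            .header("Content-Type", "application/xml")
            .header("Authorization", "Basic "
                + Base64.getEncoder().encodeToString("admin:admin".getBytes()))
            .timeout(Duration.ofSeconds(65)) // mirrors the Robot-side timeout
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

A healthy run should return within roughly 5-6 seconds; with the bug, the call hangs until the client-side timeout fires.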

Comment by Peter Gubka [ 05/May/17 ]

Attachment karaf_log_1.tar.gz.601_simple has been added with description: logs 1node simple

Comment by Peter Gubka [ 05/May/17 ]

Attachment karaf_log_1.tar.gz.603_chained has been added with description: logs 1node chained

Comment by Peter Gubka [ 05/May/17 ]

All of the above was for the module-shard test.

For the prefix-shard test I ran a similar test: 1-node ODL, RPC produce-transactions
url: http://172.17.0.4:8181/restconf/operations/odl-mdsal-lowlevel-control:produce-transactions, data: <input xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control">
<id>prefix-1</id>
<seconds>5</seconds>
<transactions-per-second>10</transactions-per-second>
<isolated-transactions>True</isolated-transactions>
</input>, timeout: 65

It finished OK, but instead of 5-6 seconds it took 60 s to return the correct results to Robot.

And again there is a timeframe where nothing happened for 20 s:
2017-05-05 10:23:58,632 | DEBUG | lt-dispatcher-29 | ShardDataTree | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: Preparing transaction member-1-datastore-Shard-id-ints!-fe-0-chn-5-txn-0-0
2017-05-05 10:24:17,929 | DEBUG | lt-dispatcher-29 | CompositeDataTreeCohort | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-datastore-Shard-id-ints!-fe-0-chn-5-txn-0-0: canCommit - candidate: [FAILED toString()]

But it finally responded with HTTP status 200 and data {"output":{"insert-tx":28,"delete-tx":21,"all-tx":49}} around 10:24:56.

Comment by Peter Gubka [ 05/May/17 ]

Attachment karaf_log_1.tar.gz has been added with description: prefix shard produce transactions

Comment by Robert Varga [ 05/May/17 ]

The fix for TransactionAbortRequest: https://git.opendaylight.org/gerrit/56596

Comment by Vratko Polak [ 23/May/17 ]

Is this a duplicate of CONTROLLER-1665, or was this a different bug with a similar symptom?
