Loading...

Type: Bug
Resolution: Cannot Reproduce
Fix Version/s: 4.0.0
Affects Version/s: None
Component/s: clustering
Labels:
- pt
Environment:

Operating System: All
Platform: All

External issue ID:
8792

The Robot symptom is basically the same as in ~~CONTROLLER-1707~~.
There is test which started failing [1] quite reliably, the difference from original ~~CONTROLLER-1707~~ description is that there is a transaction writer on each member and there is a data tree change listener on the old leader member. Once again, the only failing writer is on the old leader member:

{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Submit failed","error-info":"TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds]]}

\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat
...

Contrary to ~~CONTROLLER-1707~~, "No transactions enqueued while attempting to start canCommit" is not seen on the new leader.
Instead, the following is seen in karaf.log [2] of the writer (member-1, old leader and listener). Not sure which log message is the critical one (if any)

Very many "rejecting request" messages right after "Received LeaderStateChanged":

2017-07-02 15:57:40,058 | INFO | rd-dispatcher-32 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-1-shard-default-config: not currently leader, rejecting request Envelope{sessionId=0, txSequence=3115, message=ModifyTransactionRequest{target=member-2-datastore-config-fe-0-chn-2-txn-4416-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.2:2550/user/$a#-603757263], modifications=1, protocol=READY}}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.

The NotLeaderException has been received at 15:57:41,474. The new connection has been established at 15:57:46,839. The next reconnect (without a visible reason) has been initiated at 15:59:16,939 and finished at 15:59:16,945. Another one initiated at 15:59:46,958 and finished at 15:59:46,962. Finally request timeout happened:

2017-07-02 15:59:51,878 | WARN | lt-dispatcher-43 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Tx: DOM-CHAIN-1-5382 Error during phase CAN_COMMIT, starting Abort
org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds
at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:372)[197:org.opendaylight.controller.cds-access-client:1.1.1.SNAPSHOT]
...

In the whole segment, there were multiple (arguably harmless) snapshot persist events.

On the new leader [3], there is only one suspicious message:

2017-07-02 15:58:06,430 | WARN | rd-dispatcher-31 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-5382-0 has timed out after 15000 ms in state READY

Update:
While trying to replicate on Sandbox, two [4] tests [5] have failed. In the latter test, the failing writer was on the old leader (without listener listener), but in the former one, the failing writer was actually on the member which stayed follower (also without listener).
In both cases, the leader was member-2. Karaf log [6] is huge, but I have confirmed in both tests there was a "Current transaction ... has timed out after 15000 ms in state READY" message during the time the test was waiting for writer response:

2017-07-03 11:29:50,834 | WARN | rd-dispatcher-44 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-11757-0 has timed out after 15000 ms in state READY

[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/log.html.gz#s1-s36-t1-k2-k12
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/odl1_karaf.log.gz
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/odl2_karaf.log.gz
[4] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/log.html.gz#s1-s2-t1-k2-k12
[5] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/log.html.gz#s1-s2-t3-k2-k12
[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/odl2_karaf.log.gz

is blocked by

CONTROLLER-1707 Client not reconnecting successfully after leader movement

Resolved

Details

Description

Attachments

Issue Links

Activity

People

Dates