[CONTROLLER-1633] Singleton global rpc timed out with akka.pattern.AskTimeoutException for 15000ms after leader kill Created: 11/Apr/17  Updated: 25/Jul/23  Resolved: 02/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File odl1_karaf.log.gz     File odl2_karaf.log.gz     File odl3_karaf.log.gz     Zip Archive odl_logs.zip     File with-remote-rpc-debug.tar    
External issue ID: 8207

 Description   

Job: https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-carbon/627/

Suite: Global Rpc Kill

Steps:
-rpc /restconf/operations/basic-rpc-test:basic-global works from every node
-the node with singleton device leader is killed (node1)
-new leader is elected (node3)
-rpc does not work from node2 and returns http status 500

{"errors":{"error":[

{"error-type":"application","error-tag":"operation-failed","error-message":"The operation encountered an unexpected error while executing.","error-info":"akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://opendaylight-cluster-data@10.29.12.23:2550/user/rpc/broker#526563179]] after [15000 ms]. Sender[null] sent message of type "org.opendaylight.controller.remote.rpc.messages.ExecuteRpc". at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126) at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329) at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280) at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284) at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236) at java.lang.Thread.run(Thread.java:745) "}

]}}



 Comments   
Comment by Robert Varga [ 12/Apr/17 ]

Unfortunately node2 logs are not showing much. Two questions:

  • does this occur repeatedly for more than ~2 seconds after the new leader is elected?
  • can you switch logging to DEBUG for org.opendaylight.controller.remote.rpc?
Comment by Peter Gubka [ 12/Apr/17 ]

(In reply to Robert Varga from comment #1)
> Unfortunately node2 logs are not showing much. Two questions:
>
> - does this occur repeatedly for more than ~2 seconds after the new leader
> is elected?
> - can you switch logging to DEBUG for org.opendaylight.controller.remote.rpc?

karaf.log will be attached, link to the job is https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon/3/

Comment by Peter Gubka [ 12/Apr/17 ]

Attachment odl1_karaf.log.gz has been added with description: karaf log 1

Comment by Peter Gubka [ 12/Apr/17 ]

Attachment odl2_karaf.log.gz has been added with description: karaf log 2

Comment by Peter Gubka [ 12/Apr/17 ]

Attachment odl3_karaf.log.gz has been added with description: karaf log 3

Comment by Peter Gubka [ 20/Apr/17 ]

I repeated the test again. I tried rpc again after the first attempt failed (with timeout). The second attempt was done at least 15s seconds after the new leader elected due to the first attempt duration.

> - does this occur repeatedly for more than ~2 seconds after the new leader
> is elected?

The response was the same and the second attempt failed with the same exceptions 15000ms timeout exception. Logs from all 3 nodes attached. Robot's log.html is too big.

Comment by Peter Gubka [ 20/Apr/17 ]

Attachment odl_logs.zip has been added with description: logs with retied rpc

Comment by Robert Varga [ 27/Apr/17 ]

Those logs do not contain remote-rpc debugs, we need those to understand what is going on.

Comment by Peter Gubka [ 27/Apr/17 ]

(In reply to Robert Varga from comment #8)
> Those logs do not contain remote-rpc debugs, we need those to understand
> what is going on.

https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon/19/

Should i attach the logs or the link is enough?

Comment by Peter Gubka [ 28/Apr/17 ]

Adding files from https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/19/

Comment by Peter Gubka [ 28/Apr/17 ]

Attachment with-remote-rpc-debug.tar has been added with description: log with debug level

Comment by Peter Gubka [ 02/May/17 ]

https://git.opendaylight.org/gerrit/56393

This is an update on test suite side. Waiting for new owner is not enough. There must be akka notification for node to update the rpc reference to a new node. It is the reason why it always failed on follower nodes. WUKS added, to wait for akka.

Generated at Wed Feb 07 19:56:03 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.