[CONTROLLER-1858] DOMDataBroker Explicit Leader Movement test fails Created: 07/Aug/18  Updated: 17/Aug/18

Status: Open
Project: controller
Component/s: clustering
Affects Version/s: Oxygen SR3
Fix Version/s: None

Type: Bug Priority: High
Reporter: Vishal Thapar Assignee: Vishal Thapar
Resolution: Unresolved Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

DOMDataBroker Explicit Leader Movement test, controller-ddb-expl-lead-movement-longevity, has been failing fairly consistently [1]. The purpose of the test, from the test documentation:

The goal is to ensure that applications do not observe disruption when a shard leader is moved as the result of explicit application request. This is performed by having a steady-stream producer execute operations against the shard and then initiate shard leader shutdown, then the producer is shut down cleanly.

Though the job shows 3/4 test cases passing, those 3 are for restarting ODL with tell; the actual test case has been failing consistently, with only one pass on 28th July. The steps mentioned above are repeated in a loop, and it failed 1097 times during the entire ~14-hour run with error 500 [2].

Update:
-----------
Test steps are detailed in [3].
ODL logs are at [4].

As per [3], searching the ODL logs for 'starting leader movement from node' gives us:
2018-08-04T23:40:26,930 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-34 | ClientActorBehavior | 191 - org.opendaylight.controller.cds-access-client - 1.3.3.SNAPSHOT | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#542053028], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} indicated sequencing mismatch on member-1-datastore-config-fe-0-txn-13-0 sequence 0 (26), reconnecting it
org.opendaylight.controller.cluster.access.commands.OutOfSequenceEnvelopeException: Expecting envelope 0
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.checkRequestSequence(LeaderFrontendState.java:98) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:191) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:573) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequestEnvelope(Shard.java:398) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]

Also:

2018-08-05T01:17:16,637 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | ClientActorBehavior | 191 - org.opendaylight.controller.cds-access-client - 1.3.3.SNAPSHOT | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.30.170.171:2550/user/shardmanager-config/member-2-shard-default-config#-1686300808], sessionId=71, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}} indicated no leadership, reconnecting it
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-1686300808] is not the current leader
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:566) ~[220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequestEnvelope(Shard.java:398) ~[220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:324) ~[220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]

[1] https://jenkins.opendaylight.org/releng/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/
[2] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k1
[3] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k1
[4] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/



 Comments   
Comment by Tom Pantelis [ 07/Aug/18 ]

What are the exact steps this test performs and what failed? I see "FAIL [ 200 | 201 | 204 ] does not contain value '500'." but I don't see from the robot log which URL failed or the error message.

Comment by Tom Pantelis [ 07/Aug/18 ]

I think it would really help to outline the test steps in sentences for those not all that familiar with the keywords and the author's intent. I can infer some meaning by looking at the log output (perhaps this is already documented somewhere?), but I don't know what all the keywords do/mean. The first red output I see is:

 
KEYWORD BuiltIn . Fail ${output}
Documentation: Fails the test with the given message and optionally alters its tags.
Start / End / Elapsed: 20180805 09:36:16.606 / 20180805 09:36:16.607 / 00:00:00.001
09:36:16.607 FAIL None

But I can't tell what actually failed - the prior URL, /restconf/operations/cluster-admin:make-leader-local, succeeded.

 

Comment by Tom Pantelis [ 07/Aug/18 ]

Also is this Oxygen or Fluorine or both? Can you please update the Affected Version? If this is from Oxygen, does it also fail on Fluorine?

Comment by Vishal Thapar [ 07/Aug/18 ]

Agreed. I have also been trying to decipher the controller tests and thought you might be able to understand them better than me. One has to really dig into the robot keywords to get to useful information.

My understanding is that the test does the following:

1. Move leader from nodeA to a randomly chosen nodeB.

2. Write transaction using RPCs - https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k2-k5

3. Wait for 5 seconds.

4. Check whether all transactions went through. It expects the HTTP code to be [200 | 201 | 204], but finds a 500.

5. So, basically, a write transaction that is in flight while the leader movement happens gets HTTP error 500 (a rough sketch of these calls follows below).
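To make the flow concrete, here is a minimal Python sketch of steps 1, 2 and 4 as I read them. The hosts and the admin/admin credentials are assumptions (typical CSIT defaults), the make-leader-local input fields (shard-name, data-store-type) are my reading of cluster-admin.yang rather than something taken from the suite, and the write-transactions payload mirrors the one logged by the robot keyword quoted below.

import random
import requests

ODL_NODES = ["10.30.170.171", "10.30.170.146"]  # hypothetical member addresses
AUTH = ("admin", "admin")                       # assumed default ODL credentials

def move_leader(node, shard="default"):
    # Step 1: ask `node` to pull the shard leader to itself.
    url = "http://%s:8181/restconf/operations/cluster-admin:make-leader-local" % node
    body = {"input": {"shard-name": shard, "data-store-type": "config"}}
    return requests.post(url, json=body, auth=AUTH)

def write_transactions(node, prefix="prefix-2", seconds=40, rate=1000):
    # Step 2: start the in-controller producer; the RPC blocks for `seconds`,
    # hence the generous request timeout (the suite used 425s).
    url = ("http://%s:8181/restconf/operations/"
           "odl-mdsal-lowlevel-control:write-transactions" % node)
    xml = ('<input xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control">'
           "<id>%s</id><seconds>%d</seconds>"
           "<transactions-per-second>%d</transactions-per-second>"
           "<chained-transactions>True</chained-transactions></input>"
           % (prefix, seconds, rate))
    return requests.post(url, data=xml, auth=AUTH,
                         headers={"Content-Type": "application/xml"}, timeout=425)

move_leader(random.choice(ODL_NODES))
resp = write_transactions(ODL_NODES[0])
# Step 4: the suite treats anything outside 200/201/204 (e.g. 500) as a failure.
assert resp.status_code in (200, 201, 204), resp.text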

This step in robot:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k2-k2

logs the corresponding message from karaf.

This one is Oxygen; I believe we don't have a Fluorine job for it yet.

Comment by Tom Pantelis [ 07/Aug/18 ]

So the question is what URL failed with 500 and what was the error info? 

Comment by Vishal Thapar [ 07/Aug/18 ]

I thought I pasted this:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k2-k5

Documentation:
Invoke write-transactions rpc on given nodes.

Start / End / Elapsed: 20180805 19:05:36.555 / 20180805 19:05:36.558 / 00:00:00.003
19:05:36.556 INFO Input parameters: host_list:[u'10.30.170.171'], index_list:[2], id_prefix:prefix-, duration:40, rate:1000, chained_flag:True
19:05:36.556 INFO url: http://10.30.170.171:8181/restconf/operations/odl-mdsal-lowlevel-control:write-transactions, data: <input xmlns="tag:opendaylight.org,2017:controller:yang:lowlevel:control">
<id>prefix-2</id>
<seconds>40</seconds>
<transactions-per-second>1000</transactions-per-second>
<chained-transactions>True</chained-transactions>
</input>, timeout: 425

From the URL, it looks like the test app is running in ODL and the RPC just triggers it. Correct? That is why I couldn't find logs for any python script making REST calls. I suspect some test app within the controller code generates transactions when it gets this RPC call.

Comment by Tom Pantelis [ 07/Aug/18 ]

Yeah - it looks like restconf/operations/odl-mdsal-lowlevel-control:write-transactions. However, AFAICT the response from that was 200:

 

KEYWORD BuiltIn . Log ${response.text}
Documentation: Logs the given message with the given level.
Start / End / Elapsed: 20180805 09:36:16.600 / 20180805 09:36:16.600 / 00:00:00.000
09:36:16.600 INFO {"output":{"delete-tx":731,"all-tx":40000,"insert-tx":39269}}

KEYWORD BuiltIn . Log ${response.status_code}
Documentation: Logs the given message with the given level.
Start / End / Elapsed: 20180805 09:36:16.600 / 20180805 09:36:16.601 / 00:00:00.001
09:36:16.601 INFO 200

The encapsulating keyword seems to be OK as well, AFAICT:

09:36:16.604 INFO ${status} = PASS
09:36:16.604 INFO ${output} = None

The code for that RPC would be somewhere in the clustering-it-provider in controller.

So I don't see where the 500 is coming from or the nature of this failure:

09:36:16.607	FAIL	None
Comment by Vishal Thapar [ 07/Aug/18 ]

I think it is this one.

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Tom Pantelis [ 07/Aug/18 ]

OK - so the underlying error that caused the 500 occurred on odl2:

2018-08-05T13:36:22,354 | WARN  | opendaylight-cluster-data-shard-dispatcher-42 | AbstractFrontendHistory          | 220 - org.opendaylight.controller.sal-distributed-datastore - 1.7.3.SNAPSHOT | member-2-shard-default-config: Request ModifyTransactionRequest{target=member-1-datastore-config-fe-0-txn-1114-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.30.170.146:2550/user/$a#-1538190485], modifications=1, protocol=SIMPLE} is contained purged transactions [[0..1114]]

TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, 
cause=org.opendaylight.controller.cluster.access.commands.DeadTransactionException: Transactions [[0..1114]] have been purged]]}
org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:36)
org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)
org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:106)
org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:227)
org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:50)
org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:124)
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.lambda$directCommit$4(AbstractProxyTransaction.java:507)
org.opendaylight.controller.cluster.access.client.ConnectionEntry.complete(ConnectionEntry.java:50)
org.opendaylight.controller.cluster.access.client.AbstractClientConnection.receiveResponse(AbstractClientConnection.java:469)
org.opendaylight.controller.cluster.access.client.AbstractReceivingClientConnection.receiveResponse(AbstractReceivingClientConnection.java:68)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onResponse(ClientActorBehavior.java:174)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onRequestFailure(ClientActorBehavior.java:186)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.internalOnRequestFailure(ClientActorBehavior.java:238)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:145)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:49)
org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:69)
akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275)
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
akka.actor.Actor.aroundReceive(Actor.scala:517)
akka.actor.Actor.aroundReceive$(Actor.scala:515)
akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:273)
akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:691)
akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:192)
akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:191)
akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:273)
akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
akka.actor.ActorCell.invoke(ActorCell.scala:559)
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
akka.dispatch.Mailbox.run(Mailbox.scala:224)
akka.dispatch.Mailbox.exec(Mailbox.scala:234)
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.cluster.access.commands.DeadTransactionException: Transactions [[0..1114]] have been purged
org.opendaylight.controller.cluster.datastore.AbstractFrontendHistory.handleTransactionRequest(AbstractFrontendHistory.java:88)
org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:212)
org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:573)
org.opendaylight.controller.cluster.datastore.Shard.handleRequestEnvelope(Shard.java:398)
org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:324)
org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44)
akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275)
org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)

rovarga any ideas on this one?

Comment by Tom Pantelis [ 09/Aug/18 ]

I think we're going to have to enable some debug. Set org.opendaylight.controller.cluster.datastore and org.opendaylight.controller.cluster.databroker to DEBUG. This will also include Shard, which emits a lot of output, so to minimize it set org.opendaylight.controller.cluster.datastore.Shard to INFO. This should do it:

log4j2.logger.datastore.name=org.opendaylight.controller.cluster.datastore
log4j2.logger.datastore.level=DEBUG
log4j2.logger.databroker.name=org.opendaylight.controller.cluster.databroker
log4j2.logger.databroker.level=DEBUG
log4j2.logger.shard.name=org.opendaylight.controller.cluster.datastore.Shard
log4j2.logger.shard.level=INFO

Note: the log files may roll over, i.e. there may also be karaf.log.1, karaf.log.2, etc.

Comment by Jamo Luhrsen [ 09/Aug/18 ]

This job will be running with the above debug levels.

Comment by Vishal Thapar [ 14/Aug/18 ]

https://jenkins.opendaylight.org/releng/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/52/

^Passed with debug.

https://jenkins.opendaylight.org/releng/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/53/

^Had failures, but debug was not enabled; probably a scheduled run.

Comment by Jamo Luhrsen [ 15/Aug/18 ]

trying again:
https://jenkins.opendaylight.org/releng/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/54/

Comment by Vishal Thapar [ 17/Aug/18 ]

This one got failure with debug logs:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/54/

Warning: the log files are ~10/19/19 MB compressed. I would recommend doing a wget to fetch them and then untarring them.

Comment by Tom Pantelis [ 17/Aug/18 ]
KEYWORD BuiltIn . Fail ${output}
Documentation: Fails the test with the given message and optionally alters its tags.
Start / End / Elapsed: 20180816 08:36:31.438 / 20180816 08:36:31.439 / 00:00:00.001
08:36:31.439 FAIL Dictionary does not contain key 'value'.

This is a different error. A jolokia request failed:

To JSON using : content={
 "error": "javax.management.InstanceNotFoundException : org.opendaylight.controller:Category=Shards,name=member-2-shard-default-config,type=DistributedConfigDatastore",
 "error_type": "javax.management.InstanceNotFoundException",
 "request": {
  "mbean": "org.opendaylight.controller:Category=Shards,name=member-2-shard-default-config,type=DistributedConfigDatastore",
  "type": "read"
 },
 "stacktrace": "javax.management.InstanceNotFoundException: org.opendaylight.controller:Category=Shards,name=member-2-shard-default-config,type=DistributedConfigDatastore\n\tat com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095)\n\tat com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBeanInfo(DefaultMBeanServerInterceptor.java:1375)\n\tat com.sun.jmx.mbeanserver.JmxMBeanServer.getMBeanInfo(JmxMBeanServer.java:920)\n\tat org.jolokia.handler.ReadHandler$1.execute(ReadHandler.java:46)\n\tat org.jolokia.handler.ReadHandler$1.execute(ReadHandler.java:41)\n\tat org.jolokia.backend.executor.AbstractMBeanServerExecutor.call(AbstractMBeanServerExecutor.java:90)\n\tat org.jolokia.handler.ReadHandler.getMBeanInfo(ReadHandler.java:229)\n\tat org.jolokia.handler.ReadHandler.getAllAttributesNames(ReadHandler.java:241)\n\tat org.jolokia.handler.ReadHandler.resolveAttributes(ReadHandler.java:217)\n\tat org.jolokia.handler.ReadHandler.fetchAttributes(ReadHandler.java:183)\n\tat org.jolokia.handler.ReadHandler.doHandleRequest(ReadHandler.java:118)\n\tat org.jolokia.handler.ReadHandler.doHandleRequest(ReadHandler.java:37)\n\tat org.jolokia.handler.JsonRequestHandler.handleRequest(JsonRequestHandler.java:161)\n\tat org.jolokia.backend.MBeanServerHandler.dispatchRequest(MBeanServerHandler.java:156)\n\tat org.jolokia.backend.LocalRequestDispatcher.dispatchRequest(LocalRequestDispatcher.java:99)\n\tat org.jolokia.backend.BackendManager.callRequestDispatcher(BackendManager.java:428)\n\tat org.jolokia.backend.BackendManager.handleRequest(BackendManager.java:158)\n\tat org.jolokia.http.HttpRequestHandler.executeRequest(HttpRequestHandler.java:197)\n\tat org.jolokia.http.HttpRequestHandler.handleGetRequest(HttpRequestHandler.java:86)\n\tat org.jolokia.http.AgentServlet$4.handleRequest(AgentServlet.java:470)\n\tat org.jolokia.http.AgentServlet.handleSecurely(AgentServlet.java:350)\n\tat org.jolokia.http.AgentServlet.handle(AgentServlet.java:321)\n\tat org.jolokia.http.AgentServlet.doGet(AgentServlet.java:277)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:687)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)\n\tat org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:205)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:71)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:284)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:80)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tat java.lang.Thread.run(Thread.java:748)\n",
 "status": 404
}
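For reference, a minimal sketch of the kind of Jolokia read that appears to have failed here; the host and admin/admin credentials are assumptions, and the /jolokia/read/<mbean> URL form is the standard Jolokia read pattern rather than something copied from the suite. When the MBean is not registered, the reply carries "error" and "status": 404 but no "value" key, which matches the robot failure above.

import requests

MBEAN = ("org.opendaylight.controller:Category=Shards,"
         "name=member-2-shard-default-config,type=DistributedConfigDatastore")
url = "http://10.30.170.171:8181/jolokia/read/" + MBEAN  # hypothetical host
reply = requests.get(url, auth=("admin", "admin")).json()

if "value" not in reply:
    # Mirrors the robot failure: "Dictionary does not contain key 'value'."
    raise KeyError("Shard MBean not registered: %s" % reply.get("error"))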
Comment by Tom Pantelis [ 17/Aug/18 ]
2018-08-16T12:36:11,723 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Shard                            | 213 - org.opendaylight.controller.sal-clustering-commons - 1.7.4.SNAPSHOT | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [31300240] for persistenceId [member-2-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.

Akka timed out trying to write to disk. This indicates the disk was busy - probably due to all the logging. We'll have to increase the timeout - I'll need to look that up and report back later.

Comment by Jamo Luhrsen [ 17/Aug/18 ]

Thanks tpantelis. I'm assuming this timeout increase has to come in code, and not some config file?
In the meantime, I removed the cluster.datastore DEBUG logging and just left these:

controller.cluster.databroker:DEBUG
controller.cluster.datastore.Shard:INFO

running in https://jenkins.opendaylight.org/releng/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/55

Comment by Tom Pantelis [ 17/Aug/18 ]

It's an akka setting:

In controller/configuration/initial/akka.conf, set the timeout to something like 5 minutes:

akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        call-timeout = 300s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 300s
      }
    }
  }
}
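(For context: assuming the stock akka persistence circuit-breaker default of call-timeout = 10s, the 300s above gives the journal and snapshot-store writes 30x more headroom before the breaker trips.)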

We need cluster.datastore DEBUG.
