[CONTROLLER-1858] DOMDataBroker Explicit Leader Movement test fails Created: 07/Aug/18 Updated: 17/Aug/18 |
|
| Status: | Open |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Oxygen SR3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | High |
| Reporter: | Vishal Thapar | Assignee: | Vishal Thapar |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
DOMDataBroker Explicit Leader Movement test, controller-ddb-expl-lead-movement-longevity, has been failing fairly consistently [1].

The purpose of the test, from the test documentation: the goal is to ensure that applications do not observe disruption when a shard leader is moved as the result of an explicit application request. This is performed by having a steady-stream producer execute operations against the shard, then initiating shard leader shutdown, and then shutting the producer down cleanly.

Although the job shows 3/4 test cases passing, those three are for restarting ODL with tell; the actual test case has been failing consistently, with only one pass on 28th July. The steps above are repeated in a loop, and the test fails 1097 times with error 500 during the entire ~14-hour run [2].

Update: As per [3], searching the ODL logs for 'starting leader movement from node' gives us:

org.opendaylight.controller.cluster.access.commands.OutOfSequenceEnvelopeException: Expecting envelope 0
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.checkRequestSequence(LeaderFrontendState.java:98) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:191) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:573) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequestEnvelope(Shard.java:398) [220:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT]

Also:

2018-08-05T01:17:16,637 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | ClientActorBehavior | 191 - org.opendaylight.controller.cds-access-client - 1.3.3.SNAPSHOT | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.30.170.171:2550/user/shardmanager-config/member-2-shard-default-config#-1686300808], sessionId=71, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}} indicated no leadership, reconnecting it

[1] https://jenkins.opendaylight.org/releng/job/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/ |
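Since the longevity run produces a large karaf.log that can roll over (karaf.log.1, karaf.log.2, ...), a small helper along these lines can pull out the lines relevant to this failure. This is only a triage sketch: the marker strings come from the description above and the comments below, while the log path is an assumption.

import glob

# Failure signatures quoted in this issue (description and comments below).
MARKERS = (
    "starting leader movement from node",
    "OutOfSequenceEnvelopeException",
    "DeadTransactionException",
    "indicated no leadership",
)

def scan(log_glob="data/log/karaf.log*"):  # assumed karaf log location
    """Print every karaf log line containing one of the known markers."""
    for path in sorted(glob.glob(log_glob)):
        with open(path, errors="replace") as log:
            for lineno, line in enumerate(log, 1):
                if any(marker in line for marker in MARKERS):
                    print(f"{path}:{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    scan()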
| Comments |
| Comment by Tom Pantelis [ 07/Aug/18 ] | |||||||
|
What are the exact steps this test performs and what failed? I see "FAIL [ 200 | 201 | 204 ] does not contain value '500'." but I don't see from the robot log what URL failed and the error message. | |||||||
| Comment by Tom Pantelis [ 07/Aug/18 ] | |||||||
|
I think it would really help to outline the test steps in sentences for those not all that familiar with all the keywords and the author's intent - I can ascertain/infer some meaning by looking at the log output (perhaps this is already documented somewhere?) but I don't know what all the keywords do/mean. The first red output I see is:
But I can't tell what actually failed - the prior URL, /restconf/operations/cluster-admin:make-leader-local, succeeded.
| |||||||
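For reference, the URL mentioned above is the cluster-admin RPC that triggers the explicit leader move. A minimal sketch of such a request follows; the controller address and credentials are assumptions, and the input leafs (shard-name, data-store-type) follow my reading of the cluster-admin model, so treat the payload as illustrative rather than a copy of what the suite sends.

import requests

ODL = "http://10.30.170.171:8181"  # hypothetical member address
AUTH = ("admin", "admin")          # assumed default credentials

# Ask cluster-admin to make the receiving node the leader of the default
# config shard (payload shape is an assumption, not taken from the suite).
resp = requests.post(
    f"{ODL}/restconf/operations/cluster-admin:make-leader-local",
    json={"input": {"shard-name": "default", "data-store-type": "config"}},
    auth=AUTH,
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)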
| Comment by Tom Pantelis [ 07/Aug/18 ] | |||||||
|
Also is this Oxygen or Fluorine or both? Can you please update the Affected Version? If this is from Oxygen, does it also fail on Fluorine? | |||||||
| Comment by Vishal Thapar [ 07/Aug/18 ] | |||||||
|
Agree - I have also been trying to decipher the controller tests and thought you might understand them better than me; you really have to dig into the robot keywords to get to useful information. My understanding is that the test does the following:

1. Move the leader from nodeA to a randomly chosen nodeB.
2. Write transactions using RPCs - https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-ddb-expl-lead-movement-longevity-only-oxygen/51/robot-plugin/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k2-k5
3. Wait for 5 seconds.
4. Check whether all transactions went through fine. It expects an HTTP code of [200 | 201 | 204], but finds a 500.
5. So, basically, while a write transaction is being done and the leader movement happens, it gets HTTP error 500 and logs the message in karaf.

This one is Oxygen; I believe we don't have a Fluorine job for it yet. | |||||||
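To make steps 2 and 4 concrete, here is a rough sketch of the producer RPC call and the status check, assuming the same RESTCONF access as in the make-leader-local sketch above. The write-transactions input fields used here (id, seconds, trans-per-second) are illustrative guesses; the real payload lives in the robot keyword linked in step 2.

import requests

ODL = "http://10.30.170.146:8181"  # hypothetical member address
AUTH = ("admin", "admin")          # assumed default credentials

# Step 2: trigger the in-ODL producer (input field names are guesses;
# see the linked robot keyword for the real payload).
resp = requests.post(
    f"{ODL}/restconf/operations/odl-mdsal-lowlevel-control:write-transactions",
    json={"input": {"id": "member-1", "seconds": 10, "trans-per-second": 1000}},
    auth=AUTH,
    headers={"Content-Type": "application/json"},
)

# Step 4: the test only accepts 200/201/204; the failing runs see a 500 here,
# and the error body is what carries the underlying cause.
if resp.status_code not in (200, 201, 204):
    raise AssertionError(f"write-transactions failed: {resp.status_code} {resp.text}")
print(resp.json())  # e.g. {"output": {"all-tx": ..., "insert-tx": ..., "delete-tx": ...}}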
| Comment by Tom Pantelis [ 07/Aug/18 ] | |||||||
|
So the question is what URL failed with 500 and what was the error info? | |||||||
| Comment by Vishal Thapar [ 07/Aug/18 ] | |||||||
|
I thought I pasted this:

Documentation:
Start / End / Elapsed: 20180805 19:05:36.555 / 20180805 19:05:36.558 / 00:00:00.003

From the URL, it looks like the test app is running in ODL and the RPC just triggers it. Correct? That is why I couldn't find logs for any python script making REST calls. I suspect some test app within the controller code generates transactions when it gets this RPC call. | |||||||
| Comment by Tom Pantelis [ 07/Aug/18 ] | |||||||
|
yeah - it looks like it's restconf/operations/odl-mdsal-lowlevel-control:write-transactions. However, AFAICT the response from that was 200:

KEYWORD BuiltIn . Log ${response.text}
Documentation: Logs the given message with the given level.
Start / End / Elapsed: 20180805 09:36:16.600 / 20180805 09:36:16.600 / 00:00:00.000
09:36:16.600 INFO {"output":{"delete-tx":731,"all-tx":40000,"insert-tx":39269}}

KEYWORD BuiltIn . Log ${response.status_code}
Documentation: Logs the given message with the given level.
Start / End / Elapsed: 20180805 09:36:16.600 / 20180805 09:36:16.601 / 00:00:00.001
09:36:16.601 INFO 200
The encapsulating keyword seems to be OK as well, AFAICT:
09:36:16.604 INFO ${status} = PASS
09:36:16.604 INFO ${output} = None
The code for that RPC would be somewhere in the clustering-it-provider in controller. So I don't see where the 500 is coming from or the nature of this failure:

09:36:16.607 FAIL None | |||||||
| Comment by Vishal Thapar [ 07/Aug/18 ] | |||||||
|
I think it is this one. | |||||||
| Comment by Tom Pantelis [ 07/Aug/18 ] | |||||||
|
OK - so the underlying error that caused the 500 occurred on odl2:

2018-08-05T13:36:22,354 | WARN | opendaylight-cluster-data-shard-dispatcher-42 | AbstractFrontendHistory | 220 - org.opendaylight.controller.sal-distributed-datastore - 1.7.3.SNAPSHOT | member-2-shard-default-config: Request ModifyTransactionRequest{target=member-1-datastore-config-fe-0-txn-1114-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.30.170.146:2550/user/$a#-1538190485], modifications=1, protocol=SIMPLE} is contained purged transactions [[0..1114]]
TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null,
cause=org.opendaylight.controller.cluster.access.commands.DeadTransactionException: Transactions [[0..1114]] have been purged]]}
org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:36)
org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)
org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:106)
org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:227)
org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:50)
org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:124)
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.lambda$directCommit$4(AbstractProxyTransaction.java:507)
org.opendaylight.controller.cluster.access.client.ConnectionEntry.complete(ConnectionEntry.java:50)
org.opendaylight.controller.cluster.access.client.AbstractClientConnection.receiveResponse(AbstractClientConnection.java:469)
org.opendaylight.controller.cluster.access.client.AbstractReceivingClientConnection.receiveResponse(AbstractReceivingClientConnection.java:68)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onResponse(ClientActorBehavior.java:174)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onRequestFailure(ClientActorBehavior.java:186)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.internalOnRequestFailure(ClientActorBehavior.java:238)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:145)
org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:49)
org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:69)
akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275)
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
akka.actor.Actor.aroundReceive(Actor.scala:517)
akka.actor.Actor.aroundReceive$(Actor.scala:515)
akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:273)
akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:691)
akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:192)
akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:191)
akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:273)
akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
akka.actor.ActorCell.invoke(ActorCell.scala:559)
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
akka.dispatch.Mailbox.run(Mailbox.scala:224)
akka.dispatch.Mailbox.exec(Mailbox.scala:234)
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.cluster.access.commands.DeadTransactionException: Transactions [[0..1114]] have been purged
org.opendaylight.controller.cluster.datastore.AbstractFrontendHistory.handleTransactionRequest(AbstractFrontendHistory.java:88)
org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:212)
org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:573)
org.opendaylight.controller.cluster.datastore.Shard.handleRequestEnvelope(Shard.java:398)
org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:324)
org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44)
akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275)
org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
rovarga any ideas on this one? | |||||||
| Comment by Tom Pantelis [ 09/Aug/18 ] | |||||||
|
I think we're going to have to enable some debug. Set org.opendaylight.controller.cluster.datastore and org.opendaylight.controller.cluster.databroker to DEBUG. This will also include Shard, which emits a lot of output, so to minimize that set org.opendaylight.controller.cluster.datastore.Shard to INFO. This should do it:

log4j2.logger.datastore.name=org.opendaylight.controller.cluster.datastore
log4j2.logger.datastore.level=DEBUG
log4j2.logger.databroker.name=org.opendaylight.controller.cluster.databroker
log4j2.logger.databroker.level=DEBUG
log4j2.logger.shard.name=org.opendaylight.controller.cluster.datastore.Shard
log4j2.logger.shard.level=INFO

Note: the log files may roll over, i.e. there may also be karaf.log.1, karaf.log.2, etc. | |||||||
| Comment by Jamo Luhrsen [ 09/Aug/18 ] | |||||||
|
this job will be running with the above debug levels | |||||||
| Comment by Vishal Thapar [ 14/Aug/18 ] | |||||||
|
^ Passed with debug. ^ Had failures, but debug was not enabled - probably a scheduled run. | |||||||
| Comment by Jamo Luhrsen [ 15/Aug/18 ] | |||||||
|
trying again: | |||||||
| Comment by Vishal Thapar [ 17/Aug/18 ] | |||||||
|
This one got a failure with debug logs: Warning: the log files are ~10/19/19 MB compressed. I would recommend doing a wget to fetch them and then untarring them. | |||||||
| Comment by Tom Pantelis [ 17/Aug/18 ] | |||||||
. Fail ${output}
Documentation: Fails the test with the given message and optionally alters its tags.
Start / End / Elapsed: 20180816 08:36:31.438 / 20180816 08:36:31.439 / 00:00:00.001
08:36:31.439 FAIL Dictionary does not contain key 'value'.

This is a different error. A jolokia request failed:

To JSON using : content={
"error": "javax.management.InstanceNotFoundException : org.opendaylight.controller:Category=Shards,name=member-2-shard-default-config,type=DistributedConfigDatastore",
"error_type": "javax.management.InstanceNotFoundException",
"request": {
"mbean": "org.opendaylight.controller:Category=Shards,name=member-2-shard-default-config,type=DistributedConfigDatastore",
"type": "read"
},
"stacktrace": "javax.management.InstanceNotFoundException: org.opendaylight.controller:Category=Shards,name=member-2-shard-default-config,type=DistributedConfigDatastore\n\tat com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095)\n\tat com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBeanInfo(DefaultMBeanServerInterceptor.java:1375)\n\tat com.sun.jmx.mbeanserver.JmxMBeanServer.getMBeanInfo(JmxMBeanServer.java:920)\n\tat org.jolokia.handler.ReadHandler$1.execute(ReadHandler.java:46)\n\tat org.jolokia.handler.ReadHandler$1.execute(ReadHandler.java:41)\n\tat org.jolokia.backend.executor.AbstractMBeanServerExecutor.call(AbstractMBeanServerExecutor.java:90)\n\tat org.jolokia.handler.ReadHandler.getMBeanInfo(ReadHandler.java:229)\n\tat org.jolokia.handler.ReadHandler.getAllAttributesNames(ReadHandler.java:241)\n\tat org.jolokia.handler.ReadHandler.resolveAttributes(ReadHandler.java:217)\n\tat org.jolokia.handler.ReadHandler.fetchAttributes(ReadHandler.java:183)\n\tat org.jolokia.handler.ReadHandler.doHandleRequest(ReadHandler.java:118)\n\tat org.jolokia.handler.ReadHandler.doHandleRequest(ReadHandler.java:37)\n\tat org.jolokia.handler.JsonRequestHandler.handleRequest(JsonRequestHandler.java:161)\n\tat org.jolokia.backend.MBeanServerHandler.dispatchRequest(MBeanServerHandler.java:156)\n\tat org.jolokia.backend.LocalRequestDispatcher.dispatchRequest(LocalRequestDispatcher.java:99)\n\tat org.jolokia.backend.BackendManager.callRequestDispatcher(BackendManager.java:428)\n\tat org.jolokia.backend.BackendManager.handleRequest(BackendManager.java:158)\n\tat org.jolokia.http.HttpRequestHandler.executeRequest(HttpRequestHandler.java:197)\n\tat org.jolokia.http.HttpRequestHandler.handleGetRequest(HttpRequestHandler.java:86)\n\tat org.jolokia.http.AgentServlet$4.handleRequest(AgentServlet.java:470)\n\tat org.jolokia.http.AgentServlet.handleSecurely(AgentServlet.java:350)\n\tat org.jolokia.http.AgentServlet.handle(AgentServlet.java:321)\n\tat org.jolokia.http.AgentServlet.doGet(AgentServlet.java:277)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:687)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)\n\tat org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:205)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:71)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:284)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:80)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tat java.lang.Thread.run(Thread.java:748)\n",
"status": 404
}
| |||||||
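For context, the failing step appears to be the keyword that reads the shard's JMX bean over jolokia and expects the attributes under a 'value' key; the InstanceNotFoundException/404 above means the member-2-shard-default-config MBean was not registered at that moment. A rough sketch of that kind of read follows - the MBean name is taken from the error above, while the host and credentials are assumptions.

import requests

ODL = "http://10.30.170.171:8181"  # hypothetical member address
AUTH = ("admin", "admin")          # assumed default credentials

# Jolokia exposes JMX attribute reads as GET /jolokia/read/<mbean>.
MBEAN = ("org.opendaylight.controller:Category=Shards,"
         "name=member-2-shard-default-config,type=DistributedConfigDatastore")

data = requests.get(f"{ODL}/jolokia/read/{MBEAN}", auth=AUTH).json()
# A healthy response carries the shard attributes under 'value'; the failure
# above returned only 'error' and 'status': 404, hence the missing-key error.
print(data.get("status"), data.get("value", data.get("error")))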
| Comment by Tom Pantelis [ 17/Aug/18 ] | |||||||
2018-08-16T12:36:11,723 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Shard | 213 - org.opendaylight.controller.sal-clustering-commons - 1.7.4.SNAPSHOT | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [31300240] for persistenceId [member-2-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.

Akka timed out trying to write to disk. This indicates the disk was busy - probably due to all the logging. We'll have to increase the timeout - I'll need to look that up and report back later. | |||||||
| Comment by Jamo Luhrsen [ 17/Aug/18 ] | |||||||
|
thanks tpantelis, I'm assuming this timeout increase has to come in code, and not some config file?

controller.cluster.databroker:DEBUG | |||||||
| Comment by Tom Pantelis [ 17/Aug/18 ] | |||||||
|
It's an akka setting. In controller/configuration/initial/akka.conf, set the timeout to something like 5 minutes:

akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        call-timeout = 300s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 300s
      }
    }
  }
}
We need org.opendaylight.controller.cluster.datastore at DEBUG as well. |