CONTROLLER-1838

follower reports 401 (Unauthorized) and 500 (Internal Server Error) when leader is isolated.


      This is to track the sporadic (but very frequent) failures we see in this job.

      The general flow of the job seems to be as follows:

      1. set up a 3-node cluster
      2. run a python tool to add 10k cars, pointing it at one of the follower nodes. Leadership is determined by the "car" shard on the config side. (A rough sketch of the kind of request the tool sends is right after this list.)
      3. make sure cars are being added, by checking that the count of cars at /restconf/config/car:cars is increasing
      4. isolate the leader node by blocking all outbound traffic from the leader node's IP to the two follower IPs
      5. verify that a new leader gets elected
      6. wait for the tool to finish adding all 10k cars
      7. check that the new leader reports all 10k cars were added. THIS FAILS
      8. add the isolated node back
      9. delete the 10k cars
      10. stop the test tool with a CTRL-C
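
      For reference, here is a rough sketch of the kind of request the tool sends in step 2, pieced
      together from the URL, headers and body that show up in the tool output further down. This is
      not the actual test tool; the endpoint, credentials and payload fields come from the logs,
      while the function name and loop are just an illustration.

          # Hypothetical reconstruction of the add-car loop (not the real test tool).
          # URL, credentials and payload fields are taken from the logged requests below.
          import requests

          FOLLOWER = "10.30.170.28:8181"   # the follower node the tool is pointed at
          URL = "http://%s/restconf/config/car:cars" % FOLLOWER
          AUTH = ("admin", "admin")        # matches the Basic auth header in the logs

          def add_car(car_id):
              body = {"car-entry": [{"id": car_id,
                                     "category": "my_category",
                                     "model": "model%d" % car_id,
                                     "manufacturer": "my_manufacturer",
                                     "year": "2015"}]}
              resp = requests.post(URL, json=body, auth=AUTH)
              # During the isolation window this is where the 401/500 responses show up.
              resp.raise_for_status()

          for i in range(1, 10001):        # 10k cars, 1 per request
              add_car(i)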

      I can't make total sense of it yet, but the robot logs in step 6 show this output
      from the test tool:

      2018-06-18 02:18:38,781 INFO: Add 10000 car(s) to 10.30.170.28:8181 (1 per request)
      2018-06-18 02:18:47,348 INFO: Request started at Mon Jun 18 02:18:42 2018 finished with following detais
      2018-06-18 02:18:47,349 INFO: Request started at Mon Jun 18 02:18:42 2018 finished with following detais
      2018-06-18 02:18:47,349 INFO: <PreparedRequest [POST]>
      

      but in the last step, which stops the tool, the test tool's stdout is flushed and
      we can see this:

      http://10.30.170.28:8181/restconf/config/car:cars
      2018-06-18 02:18:47,349 INFO: <PreparedRequest [POST]> http://10.30.170.28:8181/restconf/config/car:cars
      2018-06-18 02:18:47,350 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}:
      2018-06-18 02:18:47,350 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}:
      2018-06-18 02:18:47,350 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model361", "manufacturer": "my_manufacturer", "id": 361, "year": "2015"}]}
      2018-06-18 02:18:47,350 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model353", "manufacturer": "my_manufacturer", "id": 353, "year": "2015"}]}
      2018-06-18 02:18:47,351 INFO: Response: <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> <title>Error 401 Unauthorized</title> </head> <body><h2>HTTP ERROR 401</h2> <p>Problem accessing /restconf/config/car:cars. Reason: <pre> Unauthorized</pre></p> </body> </html>
      2018-06-18 02:18:47,352 INFO: Response: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"read execution failed","error-info":"ReadFailedException{message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=353}], errorList=[RpcError [message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=353}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:11946_12283#-1964161993)]] after [5000 ms]. Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".]]}\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:242)\n\tat akka.dispatch.OnComplete.internal(Future.scala:260)\n\tat akka.dispatch.OnComplete.internal(Future.scala:258)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)\n\tat akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:11946_12283#-1964161993)]] after [5000 ms]. Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}]}}
      2018-06-18 02:18:47,352 INFO: <Response [401]> Unauthorized
      2018-06-18 02:18:47,353 INFO: <Response [500]> Internal Server Error
      

      That is where I think we need to dive in. It seems there may be some hiccup on the
      follower node while the leader is being isolated, during which typical REST interactions
      have trouble. It looks like we got a 401 Unauthorized, whose timestamp matches up
      with this from the karaf log:

      2018-06-18T02:18:47,330 | WARN  | qtp414495725-104 | MDSALDynamicAuthorizationFilter  | 195 - org.opendaylight.aaa.shiro - 0.8.0.SNAPSHOT | MDSAL attempt to read Http Authz Container failed, disallowing access
      org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: read execution failed
      	at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:40) ~[?:?]
      	at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:36) ~[?:?]
      	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:91) ~[308:org.opendaylight.yangtools.util:2.0.5]
      	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:40) ~[308:org.opendaylight.yangtools.util:2.0.5]
      	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.mapException(MappingCheckedFuture.java:62) ~[?:?]
      	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) ~[?:?]
      	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) ~[?:?]
      	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:168) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.Futures.getDone(Futures.java:1436) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:85) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713) ~[32:com.google.guava:23.6.0.jre]
      	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54) ~[32:com.google.guava:23.6.0.jre]
      	at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:241) ~[?:?]
      	at akka.dispatch.OnComplete.internal(Future.scala:260) ~[?:?]
      	at akka.dispatch.OnComplete.internal(Future.scala:258) ~[?:?]
      	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[?:?]
      	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185) ~[?:?]
      	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
      	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[?:?]
      	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[?:?]
      	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) [337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
      	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) [337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
      	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) [35:com.typesafe.akka.actor:2.5.11]
      	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [35:com.typesafe.akka.actor:2.5.11]
      	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43) [35:com.typesafe.akka.actor:2.5.11]
      	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [35:com.typesafe.akka.actor:2.5.11]
      	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [35:com.typesafe.akka.actor:2.5.11]
      	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [35:com.typesafe.akka.actor:2.5.11]
      	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [35:com.typesafe.akka.actor:2.5.11]
      Caused by: org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: Error checking ReadData for path /(urn:opendaylight:params:xml:ns:yang:aaa?revision=2016-12-14)http-authorization
      	at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:242) ~[?:?]
      	... 16 more
      Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-default-config/shard-default-member-2:datastore-config@0:11947_12284#-916780901)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".
      	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595) ~[?:?]
      	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605) ~[?:?]
      	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140) ~[?:?]
      	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870) ~[?:?]
      	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109) ~[?:?]
      	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103) ~[?:?]
      	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868) ~[?:?]
      	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) ~[?:?]
      	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) ~[?:?]
      	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) ~[?:?]
      	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) ~[?:?]
      	at java.lang.Thread.run(Thread.java:748) ~[?:?]
      

      I can't find where that 500 Internal Server Error comes from, though.

      I would not expect a healthy node that is still in the majority, when a single node dies,
      to start giving me 500s or 401s; I would expect it to just take my requests and handle them.

      In this specific example, only 487 of the 10k cars were configured, so it seems there
      was some major trouble.
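
      For completeness, the count check behind steps 3 and 7 (and behind the 487 number above) is
      just a GET on the same URL the tool posts to. A rough sketch, not the actual robot keyword:
      the response is assumed to come back as a "cars" container holding a "car-entry" list, and
      the helper name is made up.

          # Hypothetical count check for /restconf/config/car:cars (illustration only).
          import requests

          def count_cars(node_ip):
              url = "http://%s:8181/restconf/config/car:cars" % node_ip
              resp = requests.get(url, auth=("admin", "admin"),
                                  headers={"Accept": "application/json"})
              resp.raise_for_status()
              # Assumed response shape: {"cars": {"car-entry": [ ... ]}}
              return len(resp.json().get("cars", {}).get("car-entry", []))

          print(count_cars("10.30.170.28"))   # expected 10000, this run only showed 487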

      all three ODL karaf logs:

      initial leader
      follower that was pointed at by the test tool
      other follower that was just minding its own business

            Assignee: Tom Pantelis (tpantelis)
            Reporter: Jamo Luhrsen (jluhrsen)