[CONTROLLER-1838] follower reports 401 (unauthorized) and 500 (Internal Error) when leader is isolated. Created: 19/Jun/18  Updated: 09/Aug/18  Resolved: 12/Jul/18

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: Fluorine
Fix Version/s: Fluorine

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: csit:3node

Attachments: File karaf.log.1.gz    
Issue Links:
Relates
relates to NETVIRT-1315 Troubleshooting Controller CSIT In Progress
Sub-Tasks:
Key | Summary | Type | Status | Assignee
CONTROLLER-1839 | leader vs follower data | Sub-task | Verified | Jamo Luhrsen
CONTROLLER-1840 | check if python script fails in tell ... | Sub-task | Open |
CONTROLLER-1841 | skip checking /restconf/modules | Sub-task | Verified | Jamo Luhrsen
CONTROLLER-1842 | don't exclude the polling cycles of W... | Sub-task | Resolved | Jamo Luhrsen
Epic Link: Clustering Stability

 Description   

This is to track the sporadic (but very frequent) failures we see in this job

The general flow of the job seems to be as follows:

  1. set up a 3 node cluster
  2. run a python tool to add 10k cars via one of the follower nodes. Leadership is determined from the "car" shard on the config side (see the request sketch after this list).
  3. make sure that cars are being added, by ensuring the count of cars at /restconf/config/car:cars is increasing
  4. isolate the leader node by blocking all traffic outbound from the leader node's ip to the two follower ips.
  5. verify a new leader gets elected
  6. wait for the tool to finish adding all 10k cars.
  7. check that the new leader will report that 10k cars are added. THIS FAILS
  8. add the isolated node back
  9. delete the 10k cars
  10. stop the test tool with a CTRL-C
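For orientation, here is a minimal sketch of what steps 2-3 amount to on the wire. This is not the actual cluster_rest_script.py; the IP, payload and admin/admin credentials are taken from the log excerpts further down, and the response shape assumed in car_count() should be verified:

import requests

FOLLOWER_IP = "10.30.170.28"   # example value taken from the log excerpts below
CARS_URL = "http://%s:8181/restconf/config/car:cars" % FOLLOWER_IP
AUTH = ("admin", "admin")      # matches the Basic YWRtaW46YWRtaW4= header in the logs

def add_car(car_id):
    # one car per request, exactly as in the "Body:" lines captured below
    body = {"car-entry": [{"id": car_id, "category": "my_category",
                           "model": "model%d" % car_id,
                           "manufacturer": "my_manufacturer", "year": "2015"}]}
    return requests.post(CARS_URL, json=body, auth=AUTH, timeout=5.0)

def car_count():
    # step 3: poll the same container and check that the number of entries keeps growing
    resp = requests.get(CARS_URL, auth=AUTH, timeout=5.0)
    if not resp.ok:
        return 0
    return len(resp.json().get("cars", {}).get("car-entry", []))  # response shape assumed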

I can't make total sense of it yet, but the robot logs in step 6 show this output
from the test tool:

2018-06-18 02:18:38,781 INFO: Add 10000 car(s) to 10.30.170.28:8181 (1 per request) 2018-06-18 02:18:47,348 INFO: Request started at Mon Jun 18 02:18:42 2018 finished with following detais 2018-06-18 02:18:47,349 INFO: Request started at Mon Jun 18 02:18:42 2018 finished with following detais 2018-06-18 02:18:47,349 INFO: <PreparedRequest [POST]>

but in the last step, which stops the tool, the test tool's stdout is flushed and
we can see this:

http://10.30.170.28:8181/restconf/config/car:cars 2018-06-18 02:18:47,349 INFO: <PreparedRequest [POST]> http://10.30.170.28:8181/restconf/config/car:cars 2018-06-18 02:18:47,350 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}: 2018-06-18 02:18:47,350 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}: 2018-06-18 02:18:47,350 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model361", "manufacturer": "my_manufacturer", "id": 361, "year": "2015"}]} 2018-06-18 02:18:47,350 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model353", "manufacturer": "my_manufacturer", "id": 353, "year": "2015"}]} 2018-06-18 02:18:47,351 INFO: Response: <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> <title>Error 401 Unauthorized</title> </head> <body><h2>HTTP ERROR 401</h2> <p>Problem accessing /restconf/config/car:cars. Reason: <pre> Unauthorized</pre></p> </body> </html> 2018-06-18 02:18:47,352 INFO: Response: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"read execution failed","error-info":"ReadFailedException{message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=353}], errorList=[RpcError [message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=353}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:11946_12283#-1964161993)]] after [5000 ms]. 
Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".]]}\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:242)\n\tat akka.dispatch.OnComplete.internal(Future.scala:260)\n\tat akka.dispatch.OnComplete.internal(Future.scala:258)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)\n\tat akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:11946_12283#-1964161993)]] after [5000 ms]. Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}]}} 2018-06-18 02:18:47,352 INFO: <Response [401]> Unauthorized 2018-06-18 02:18:47,353 INFO: <Response [500]> Internal Server Error

That is where I think we need to dive in. It seems that there may be some hiccup
on the follower node while the leader is being isolated, during which typical REST
interactions have trouble. It looks like we got a 401 Unauthorized, whose timestamp
matches up with this from the karaf log:

2018-06-18T02:18:47,330 | WARN  | qtp414495725-104 | MDSALDynamicAuthorizationFilter  | 195 - org.opendaylight.aaa.shiro - 0.8.0.SNAPSHOT | MDSAL attempt to read Http Authz Container failed, disallowing access
org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: read execution failed
	at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:40) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:36) ~[?:?]
	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:91) ~[308:org.opendaylight.yangtools.util:2.0.5]
	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:40) ~[308:org.opendaylight.yangtools.util:2.0.5]
	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.mapException(MappingCheckedFuture.java:62) ~[?:?]
	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) ~[?:?]
	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) ~[?:?]
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:168) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.Futures.getDone(Futures.java:1436) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:85) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713) ~[32:com.google.guava:23.6.0.jre]
	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54) ~[32:com.google.guava:23.6.0.jre]
	at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:241) ~[?:?]
	at akka.dispatch.OnComplete.internal(Future.scala:260) ~[?:?]
	at akka.dispatch.OnComplete.internal(Future.scala:258) ~[?:?]
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[?:?]
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185) ~[?:?]
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[?:?]
	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[?:?]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) [337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) [337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) [35:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [35:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43) [35:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [35:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [35:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [35:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [35:com.typesafe.akka.actor:2.5.11]
Caused by: org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: Error checking ReadData for path /(urn:opendaylight:params:xml:ns:yang:aaa?revision=2016-12-14)http-authorization
	at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:242) ~[?:?]
	... 16 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-default-config/shard-default-member-2:datastore-config@0:11947_12284#-916780901)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595) ~[?:?]
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605) ~[?:?]
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140) ~[?:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870) ~[?:?]
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109) ~[?:?]
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103) ~[?:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868) ~[?:?]
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) ~[?:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) ~[?:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) ~[?:?]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]

I can't find where that internal error 500 comes from though.

I would expect a healthy node in the majority, when a single node dies, to just take
my requests and handle them rather than start giving me 500s or 401s.

In this specific example, only 487 of the 10k cars were configured, so it seems there
was some major trouble.

all three odl karaf logs:

  • initial leader
  • the follower that the test tool was pointed at
  • the other follower that was just minding its own business



 Comments   
Comment by Tom Pantelis [ 19/Jun/18 ]

Notice the AskTimeoutException. This can happen for transaction(s) in flight when the leader is isolated, if the timing is right. So this is expected with the ask-based protocol; however, the new tell-based protocol promises to alleviate in-flight failures (via retries), as was discussed on the call.

The 401 is b/c of a read failure in aaa - apparently it relies on the mdsal DS for authentication (I didn't know that).

Comment by Jamo Luhrsen [ 19/Jun/18 ]

tpantelis, just to be clear, if we run this job with the tell-based protocol we will expect to never hit this problem? I think I can
quickly try that in the sandbox and assuming it passes, we can run in a frequent loop to get the confidence we want.

Comment by Tom Pantelis [ 19/Jun/18 ]

From my understanding, it shouldn't occur.

Comment by Ryan Goulding [ 19/Jun/18 ]

Not for authentication, just authorization.  We derive the authorization rules for MdsalDynamicAuthorizationFilter from data stored in MD-SAL, the same place everyone stores their data.  If it is isolated, then we cannot get that info, and we fail-closed.  So yes, this is correct behavior.

Comment by Sam Hague [ 19/Jun/18 ]

tpantelis what would be the outcome from this issue - that all operations using aaa will fail?

Assuming we can't switch to tell-based, should something be fixed here in aaa or somewhere else? I am guessing we would hit this same error in the downstream jobs also.

Comment by Tom Pantelis [ 19/Jun/18 ]

aaa is simply doing a read and it fails. Whether it should fail-fast and disallow access is another story. That's the safest approach as it doesn't know if it should allow access. However it could cache the last known password and configuration - other systems work that way or provide the option to behave that way.

Either way it probably doesn't matter since if aaa can't access the DS then the user request likely won't be able to either.

Comment by Ryan Goulding [ 19/Jun/18 ]

That isn't entirely true.  First off, it isn't authentication;  that is handled through H2 still.  It is Authorization.  Authorization data shouldn't be cached IMHO, and I wouldn't merge a patch that built a cache for it.  There needs to be a single source of truth;  it isn't like other data, which can be stale.

Comment by Luis Gomez [ 19/Jun/18 ]

Jamo, can you try with tell-based protocol? here is an example:

https://git.opendaylight.org/gerrit/#/c/63592/

Comment by Ryan Goulding [ 19/Jun/18 ]

It does beg the question if we should pre-empt AAA 401 with a check in RESTCONF which returns 5xx in the case of isolation.  I would be behind that, but don't have cycles to contribute.

Comment by Tom Pantelis [ 19/Jun/18 ]

I know what you're saying but I don't completely agree with that. It could use the last known good config if the latest config can't be obtained. We did similar in a previous product I worked on years back to allow that based on user-enablement - it was specifically requested (I recall Windows does that too wrt LDAP authentication). Also I think the initial "admin" user should have "super" privileges, ie it should be authorized for everything so there's always a way in (otherwise users can shoot themselves in the foot with a mis-configuration and not have any way to back it out other than to completely re-install - seen that happen before).  

Anyway I'm digressing.... it is what it is now - CSIT can expect 401 if HA is lost.

Comment by Ryan Goulding [ 19/Jun/18 ]

I envision a situation in which someone gets canned when a node is in isolation but still has access via RESTCONF until the node rejoins the cluster.  This is a bit silly I guess, but man do I not want my name on that CVE .

 

We could probably also just make a super user account at some point too.  I'd say admin, but at this point most people unfortunately only use that account!  At least from non-ONAP context.

Comment by Tom Pantelis [ 19/Jun/18 ]

In my experience, the initial user created on installation is automatically a super-user to bootstrap things. It can't be deleted nor have its privileges downgraded - again a safety net to avoid shooting oneself in the foot. 

Comment by Sam Hague [ 19/Jun/18 ]

rgoulding tpantelis What exactly would fail during this time when the authorization is failing because aaa can't read from ds? All restconf calls? Does that mean all restconf calls should not be relied upon? Meaning in the context of our csit tests, there are scenarios where the tests use restconf to verify states and if we don't see a 200 for instance, we would fail a test.

In this case, the test itself is not written properly and should be changed to not rely on the restconf calls.

Comment by Tom Pantelis [ 19/Jun/18 ]

If the DS isn't available then it appears currently all restconf calls will fail with 401 b/c it can't authenticate.

Comment by Ryan Goulding [ 19/Jun/18 ]

So the MDSALDynamicAuthorizationFilter is a Shiro AuthorizationFilter that returns boolean whether "isAccessAllowed(...)".  It is executed on ingress of REST requests to anything protected by AAA.  Part of isAccessAllowed(...) is getting the existing authorization rules [0].  If it fails w/ a ReadFailedException (i.e., an indication of Node isolation), then we return false [1], which results in a 401.

Is this correct behavior?  Well it functions as designed, but it ain't pretty. Really, we ought to put a check in RESTCONF to check if a Node is isolated prior to executing any AAA Filter code.  In other words, a Filter in front of AAA which will return a 5XX if the node is isolated, preferably with a nice message (though one could argue how much disclosure we should give the client!).

If you are mapping 200 to failure, well that means any time a REST call is made against an isolated node, you will have a test failure.  Since isolation cases can occur and are "expected" in our target environment, I doubt that you want to map directly to failure.

 

[0] https://github.com/opendaylight/aaa/blob/master/aaa-shiro/impl/src/main/java/org/opendaylight/aaa/shiro/realm/MDSALDynamicAuthorizationFilter.java#L53

[1] https://github.com/opendaylight/aaa/blob/master/aaa-shiro/impl/src/main/java/org/opendaylight/aaa/shiro/realm/MDSALDynamicAuthorizationFilter.java#L91
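Conceptually, the control flow described above looks like the following. ODL's real filter is the Java/Shiro MDSALDynamicAuthorizationFilter linked in [0]/[1]; this Python sketch is only an illustration, and all names in it are made up, including the proposed pre-AAA isolation check:

class ReadFailedException(Exception):
    """Stand-in for the datastore read failure seen during isolation."""

def is_access_allowed(read_authz_rules, request):
    try:
        rules = read_authz_rules()          # reads the http-authorization container from the config DS
    except ReadFailedException:
        return False                        # fail closed -> surfaces to the client as 401
    return rules.permits(request)           # illustrative; not a real API

def proposed_pre_aaa_filter(node_is_isolated, handle_request, request):
    # the change proposed above: short-circuit with a 5xx before AAA ever runs,
    # so clients can tell "cluster unavailable" apart from "not authorized"
    if node_is_isolated():
        return 503, "cluster majority unavailable"
    return handle_request(request)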

Comment by Ryan Goulding [ 19/Jun/18 ]

We could do that.  I know some stakeholders have expressed desire for this in the past, but it hasn't been prioritized by anyone thus far.  It really wouldn't take that much.  But as you stated before, we still have that issue of RESTCONF calls probably failing anyway since that code is attempting reads against the DS on an isolated node still.

Comment by Sam Hague [ 19/Jun/18 ]

Not mapping a 200 to failure, but the opposite - if we don't get a 200 when querying for something we consider that a failure. Take a case where a test starts, some events happen that should write something to datastore. To validate that happened the test would read from datastore via restconf expecting to see a 200 for the resource requested. But if during this isolated condition all restconf calls are going to fail, then that test is always going to fail - regardless of whether the ds was updated correctly or not. That itself might be another badly written part of this test.

Comment by Jamo Luhrsen [ 19/Jun/18 ]

Thanks ecelgp. I see you have a similar patch to something I tried (with bad results) like 10 months ago.

 

anyway, here's the cars/people version and I'm trying it in the sandbox

Comment by Jamo Luhrsen [ 19/Jun/18 ]

At the risk of sounding stupid here, can I double check that I understand this properly?

this specific issue in this JIRA has some external REST client talking to a single happy node. That node
is never isolated. So, in that sense, this external client is never attempting to talk to an isolated node.

But, I think the idea here is that this node has to do its sideways stuff to make reads/writes to the leader
node. And there is a chance that when we block that leader to isolate it, this happy node is doing some
transactions to that node and those will fail. So those errors and responses (401 and 500) kind of bubble
up from the side (leader isolated node) and back down through the happy node to the rest client.

correct?

Comment by Sam Hague [ 19/Jun/18 ]

In the netvirt 3node jobs we have hundreds of those "MDSAL attempt to read Http Authz Container failed, disallowing access" errors. No idea if they cause problems, but they are there in the jobs.

Comment by Tom Pantelis [ 20/Jun/18 ]

yes. All transactions have to go to the leader. During isolation there is a period of time where there is no leader that is able to make progress until it stabilizes with a new leader in the majority partition (assuming it's able to do so). Transactions initiated during that window in the majority partition may still timeout and fail. No transactions initiated in the isolated partition will succeed until the partition heals. So if the external tool was adding cars on the isolated node, with the "ask"-based protocol, the transactions initiated during isolation will all fail. If it was adding cars on one of the other nodes they could all succeed if none were initiated during the ~5-6 sec period before akka detects the leader node is unreachable.

The new "tell"-based protocol is intended to retry transactions in the hope that they will eventually succeed. Of course even that has its upper limit deadline. If 2 nodes were down for many minutes/hours you would expect it to eventually give up and fail. 

Comment by Tom Pantelis [ 20/Jun/18 ]

yeah DS access may still fail even if it got thru aaa. But other NB endpoints could still be accessed, eg RPCs, netconf mount point, jolokia etc. An admin might want to access JMX data to check the akka cluster state etc. But with no access via the NB, an admin has no visibility into the system (unless thru CLI).

Comment by Jamo Luhrsen [ 20/Jun/18 ]

thanks for all the explanations, tpantelis.

I'm wondering what the negative side effects are of dialing the timeout down from 5s to 1s? That won't get rid of this
timing window, but it would prevent some of these issues in a chatty REST environment. But any latency or
other network blip would start causing leadership changes.

Anyway, let's see how the tell-based stuff behaves with our simple car/people job. still waiting on results.

Comment by Tom Pantelis [ 20/Jun/18 ]

yup - reducing the time out will open the door for false positives. Garbage collection is the biggest culprit and accounts for most of that ~5s default cushion. You can't avoid a "blackout" period altogether - it happens with networking protocols like spanning tree etc.  The best we can do I think is to retry to hopefully overcome relatively short transient failures. 

Comment by Robert Varga [ 20/Jun/18 ]

Note that tell-based protocol will abort (i.e. render frontend unusable) if it cannot make forward progress in 10 minutes (or so). Instance restart is required to recover from that.

Comment by Jamo Luhrsen [ 20/Jun/18 ]

looking for help.

this tell-based config is having trouble in the job. Example robot logs

basically, the cars suite runs first and passes (like it will do every so often), then a suite runs that
will change to tell-based protocol which passes. Then we run the cars suite again. What happens
there:

  • get cluster state so it knows leader and two followers
  • start adding cars
  • isolate leader
  • try to talk to a follower

that last step fails, because there is a read timeout (restconf read). The karaf.log for that controller
has a lot of repeating logs that seem interesting.

like a whole mess of these:

2018-06-20T16:49:08,981 | WARN  | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-2-shard-car-config: Unknown history for aborted transaction member-1-datastore-config-fe-0-txn-19-1, ignoring

followed by a bunch of:

2018-06-20T16:51:07,052 | INFO  | opendaylight-cluster-data-shard-dispatcher-27 | Shard                            | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-topology-config (Leader): Cannot append entries because sender's term 2 is less than 3

then some of this:

2018-06-20T16:51:11,046 | INFO  | opendaylight-cluster-data-shard-dispatcher-26 | Shard                            | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-car-config (Leader): handleAppendEntriesReply - received unsuccessful reply: AppendEntriesReply [term=3, success=false, followerId=member-1-shard-car-config, logLastIndex=1595, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], leader snapshotIndex: 1595

can I get some ideas of what to try, where to look?

Comment by Tom Pantelis [ 20/Jun/18 ]

So the attached karaf log was from member-2. After the 2nd startup at 2018-06-20T16:47:35,134, looking at the cars-config shard, member-1 was initially the leader:

2018-06-20T16:47:48,178 | INFO  | opendaylight-cluster-data-shard-dispatcher-27 | ShardManager                     | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-car-config, leaderId=member-1-shard-car-config, leaderPayloadVersion=5]

Then 10.30.170.22, which I assume is member-1, became unreachable:

2018-06-20T16:49:23,879 | INFO  | opendaylight-cluster-data-shard-dispatcher-30 | Shard                            | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-car-config (Follower): Leader akka.tcp://opendaylight-cluster-data@10.30.170.22:2550 is unreachable
2018-06-20T16:49:23,896 | INFO  | opendaylight-cluster-data-shard-dispatcher-30 | ShardManager                     | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-car-config, leaderId=member-2-shard-car-config, leaderPayloadVersion=5]

Then there's a bunch of these:

2018-06-20T16:51:07,079 | INFO  | opendaylight-cluster-data-shard-dispatcher-45 | Shard                            | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-car-config (Leader): Cannot append entries because sender's term 2 is less than 3

I assume this is during the time of isolation however it appears these messages are coming from member-1. Is traffic just blocked one way? Unless this was after the isolation was lifted but not sure if the test got that far.

I don't see any significant ERRORs in the log (just 2 related to the karaf shell). And no WARNs except for those listed here. So it's unclear from this log why the rest request timed out. The test step Verify_New_Car_Leader_Elected that failed indicates it's attempting to access jolokia to get the leader status but it appears the actual request that failed was attempting /restconf/modules. Either way, it may be that aaa auth was trying to read from the DS but spun. "tell"-based has different behavior with retries so I think it has a much longer timeout period - rovarga knows that code much better than me. I assume aaa would be hitting the default-config shard - the member-2 log indicates member-3 was the leader but I don't see any indication in that log either.

The log from member-1 shows a lot of "Follower is out-of-sync..." messages which corresponds to the "handleAppendEntriesReply - received unsuccessful reply..." on member-2. Something strange going on there... Also these should not have occurred during the isolation period if traffic to/from member-1 was blocked....

It would be ideal to manually run the isolation test steps - one can control the steps and verify and debug the live system at various points.

The isolation scenario is the hardest - perhaps we should get the other test scenarios working reliably first (unless they already are).

Comment by Tom Pantelis [ 20/Jun/18 ]

If the read timeouts are due to aaa trying to access the DS, perhaps CSIT can install odl-restconf-noauth to take that out of the equation. At least then Verify_New_Car_Leader_Elected could succeed and convey the results.

Comment by Jamo Luhrsen [ 20/Jun/18 ]

If the read timeouts are due to aaa trying to access the DS, perhaps CSIT can install odl-restconf-noauth to take that out of the equation. At least then Verify_New_Car_Leader_Elected could succeed and convey the results.

I tried this, and the first run was good.

Comment by Jamo Luhrsen [ 20/Jun/18 ]

well, another run failed with the same symptoms even with the odl-restconf-noauth

Comment by Tom Pantelis [ 20/Jun/18 ]

I think we're going to have to try to run the isolation steps manually. I think we really need live systems. As noted above, after isolation (or what I think from the log is when it occurred), I see another member became car leader as expected and it snapshotted at 19999 log entries which indicates thousands of cars were created after isolation. Also the logs indicate the partition was healed after about 2 min and member-1 joined back and was receiving and replying to messages by the new leader. I see no reason in the logs for the rest timeout which wasn't even hitting the DS anyway (appears it was hitting restconf/modules).

How is the rest of the suite? Is the isolation test the only one failing?

Comment by Jamo Luhrsen [ 20/Jun/18 ]

some notes:

the isolation is done by blocking all traffic OUTPUT from the leader. So, the other two nodes will stop hearing from it, but it will still be hearing from them

I don't have a quick local setup to reproduce this manually, but can work on it. Still, it's very easy to try things out with CSIT in the sandbox and just
throw mud against the wall. Keep the ideas coming.

the /restconf/modules endpoint is just the first endpoint in the list of requests that will be made on the way for the test to determine the leader. That
just happens to be first. I'm guessing all of the rest endpoints are going to fail if that one does.

remember, the tests aren't dealing with the node that is marked in isolation. Once those iptables rules are in place on that initial leader,
all the interactions are done with the two other nodes, which used to be followers and now should be one new leader and one follower.

Comment by Jamo Luhrsen [ 20/Jun/18 ]

this is the whole suite. see the 10 steps in the description for the high level.

do you think a totally stopped node test, instead of isolation is going to be better for us to look at?

Comment by Jamo Luhrsen [ 20/Jun/18 ]

btw, if you read below, I think we can say that tell-based protocol was not any kind of silver bullet here.

Comment by Tom Pantelis [ 20/Jun/18 ]

Our comments are getting crossed. From what I see in logs, things appear to be working as expected. The downside with automated tests is that you have to try to analyze things post-mortem - sometimes it really helps to have live systems to poke around in.  

Comment by Jamo Luhrsen [ 20/Jun/18 ]

of course. I don't disagree with having a live setup to debug being better (assuming we can reproduce). I don't have one right now. do you?

Comment by Tom Pantelis [ 20/Jun/18 ]

I don't have access to a 3 node env right now. The best I can do is run 3 ODL instances on the same VM on my MAC. Not sure what the system commands are to block traffic - in my case they would each run on a different akka port.

Comment by Tom Pantelis [ 20/Jun/18 ]

Doesn't it do that too? From what I recall 3 years ago when I did some work on the tests, it would stop a node and verify continuity, and also stop 2 nodes to verify expected failures. It also tested/verified recovery from persistence. Assuming all those tests still exist, are they running reliably?

Comment by Jamo Luhrsen [ 20/Jun/18 ]

I can try to get a setup going as well, but might take me a day or so.

here's the commands the CSIT is using to isolate the leader:

sudo /sbin/iptables -I OUTPUT -p all --source $LEADER_IP --destination $LEADER_IP -j DROP
sudo /sbin/iptables -I OUTPUT -p all --source $LEADER_IP --destination $FOLLOWER_1_IP -j DROP
sudo /sbin/iptables -I OUTPUT -p all --source $LEADER_IP --destination $FOLLOWER_2_IP -j DROP

note that the first line is odd, as it blocks the leader's traffic to its own ip. I think that's probably just a relic
of keeping the test code simple. Not sure. I don't think it matters if you have it or not.

Comment by Tom Pantelis [ 20/Jun/18 ]

Back then the isolation test didn't exist.

Comment by Tom Pantelis [ 20/Jun/18 ]

I assume it's possible to just block a port.

Comment by Jamo Luhrsen [ 20/Jun/18 ]

yes, we can do that too. are you asking to try that in CSIT? or how to do it for yourself locally?

Comment by Tom Pantelis [ 20/Jun/18 ]

That would be for running 3 instances on a single VM on my MAC - each instance has IP 127.0.0.1 but different ports.

Comment by Jamo Luhrsen [ 20/Jun/18 ]

try this:

sudo /sbin/iptables -I OUTPUT -p tcp --dport 2550 -j DROP
Comment by Tom Pantelis [ 21/Jun/18 ]

jluhrsen ecelgp There's been a lot of comments flying around - I didn't get an answer here. However I cloned integration/test and I do see the original tests or at least most of them under csit/suites/controller/Clustering_Datastore. carpeople_crud.robot tests basic functionality. car_failover_crud.robot is one that stops a node to test HA. There's car_failover_crud_isolation.robot that uses isolation. The one you're running is under ThreeNodes_Datastore (not sure how it differs from car_failover_crud_isolation). There's also a bunch of other tests spread out in different directories - dom_data_broker, cluster_singleton, singleton_service - I think all those were Vratko's so I assume no one's maintained those or really knows what they do and what state they're in.

So do the ones under Clustering_Datastore (maybe sans isolation) run reliably? I think that's a good first step - start simple/basic and work our way up.

Comment by Jamo Luhrsen [ 21/Jun/18 ]

tpantelis that suite you are referring to is doing fine, it seems:

So, that simpler case is working, which is good. It's definitely a little more gentle with the tests,
where it's doing some CRUD stuff with cars/people, then stopping and waiting patiently for a leader to
be downed and a new leader to take over, then some more CRUD.

The job we are looking at, is proactively trying to use the cluster while a leader is being isolated.

Also, I ran some different combos (tell based and not) in the sandbox, and made a new suite (to kill the leader,
instead of isolating). All combos (I want to double check) seem to exhibiting the same fundamental symptoms
that all of the "cars" are not added to the cluster when downing or isolating a node, and it will come with the
tell based and non-tell based protocols. I THINK it's a little less frequent or easy to hit when it's the tell-based
config using the kill suite, if that gives any clues to anything.

Additionally, the more I think about this problem the more it's reminding me of some issue we think is
happening in netvirt 3node jobs. Essentially, there is a point when some openstack services think the
networking services are down (ODL not responding) so they mark compute nodes as invalid and doing
openstack instance creates end up with instances in ERROR state. My hunch is that every time we see
that issue, we'll be able to trace it back to a test that downed/isolated a cluster node and the openstack
services noticed they couldn't talk (401s or 500s) to the ODL cluster.

Comment by Robert Varga [ 21/Jun/18 ]

tell-based protocol has these timeouts:

  • dead leader detection: 30 seconds of no activity from leader, we will start looking for new leader
  • request timeout: 2 minutes of no response to request, the request times out
  • frontend abort: 15 minutes of no forward progress, the frontend will abort, client node needs to be restarted

We can eventually get rid of the request timeout, but we need it now to make UT work.

Comment by Tom Pantelis [ 21/Jun/18 ]

I tested locally with "ask" protocol running the car stress-test RPC creating 500 cars at 10/sec. I tested with graceful leader shutdown and killing the leader, both 5 times. I also manually did GET on car:cars quickly over and over for a bit after stopping the leader - these succeeded.   With graceful leader shutdown, all the cars were successfully created in each run. With leader kill, I saw some car failures during transition which is expected.

Comment by Tom Pantelis [ 21/Jun/18 ]

jluhrsen Re: the suppression of the first 4 failed loops and only displaying the failure from the last one... the downside is that we lose the details of the first 4. It would be interesting to see if the first follower that succeeded in the last try also had failures previously and if the second one had the same 5 sec timeout failures each time. Maybe it succeeded in a prior try but the first one failed etc.  So I think it would be good to show all the failures at least for now.

Comment by Tom Pantelis [ 22/Jun/18 ]

Also I asked Ryan about the HTTP request logging but that is not upstream. I suppose it could be re-invented upstream or maybe ecelgp could help to facilitate a quicker path  

Comment by Tom Pantelis [ 22/Jun/18 ]

One possible theory for the read timeout is that the internal HTTP server (jersey/jetty whatever) gets overloaded. I assume it has a thread pool and queue to handle the incoming requests. During the leader transition, add-car requests from the python script will be delayed and thus block their threads. Over time all threads in the pool are in use and blocked, the queue builds up, and the client may time out.
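One low-effort way to poke at that theory (my suggestion, not something agreed in this thread) would be to poll the JVM's standard Threading MBean through the Jolokia endpoint that is already installed and watch whether the thread count climbs during the isolation window. java.lang:type=Threading/ThreadCount is standard JMX; the host and credentials below are just example values:

import time
import requests

NODE = "10.30.170.28"  # example follower IP
URL = "http://%s:8181/jolokia/read/java.lang:type=Threading/ThreadCount" % NODE

def sample_thread_count(samples=60, interval=1.0):
    for _ in range(samples):
        try:
            value = requests.get(URL, auth=("admin", "admin"), timeout=2.0).json().get("value")
        except requests.RequestException as exc:
            value = "error: %s" % exc
        print(time.strftime("%H:%M:%S"), value)
        time.sleep(interval)

if __name__ == "__main__":
    sample_thread_count()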

Comment by Ariel Adam [ 24/Jun/18 ]

tpantelis can you think of a way, aside from logs, to check the status of the threads and queues on the HTTP server? 

For the purpose of testing can we compile an HTTP server with logs and manually push it in?

Comment by Tom Pantelis [ 24/Jun/18 ]

 I haven't looked into that yet. But if you or a colleague could help on that front that would be great.

Comment by Jamo Luhrsen [ 26/Jun/18 ]

This would explain it, but now we gotta prove it.

I have a good local setup I can use for this (not my laptop), but it does not have the cluster-test-app features,
so I can't do the add-cars trick.

Trying to figure out some other quick/easy "create" I can script to simulate this scenario.

BTW, like tpantelis and thapar I cannot reproduce the troubles yet. I will additionally try
on a different setup to use a distribution with the cluster test app features so I can also
run the add-cars script.

Comment by Tom Pantelis [ 26/Jun/18 ]

I tried it with the cars stress-test RPC but that's not using restconf like the python script. The theory would hold water if the script sends restconf requests  simultaneously. But if it sends requests synchronously one at a time then there would only be one outstanding request at a time which shouldn't overload.

Comment by Jamo Luhrsen [ 26/Jun/18 ]

it does, from what I can tell. It's using 10 threads in the job, so I assume that is up to 10 rest calls in parallel. I will be trying to get all of this running locally today and see what happens.
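Roughly, "10 threads, 1 car per request" amounts to something like the sketch below. This is not the real cluster_rest_script.py (which lives in integration/test); it just illustrates why up to 10 requests can be outstanding at once and pile up on the server-side pool during a leader transition:

from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://10.30.170.28:8181/restconf/config/car:cars"   # example follower IP
AUTH = ("admin", "admin")

def post_car(car_id):
    body = {"car-entry": [{"id": car_id, "category": "my_category",
                           "model": "model%d" % car_id,
                           "manufacturer": "my_manufacturer", "year": "2015"}]}
    try:
        return requests.post(URL, json=body, auth=AUTH, timeout=5.0).status_code
    except requests.ReadTimeout:
        return "read-timeout"   # client gave up after 5 s, as discussed further down

with ThreadPoolExecutor(max_workers=10) as pool:
    # up to 10 requests in flight at any moment
    results = list(pool.map(post_car, range(1, 101)))   # 10k in the real job
print({code: results.count(code) for code in set(results)})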

Comment by Jamo Luhrsen [ 26/Jun/18 ]

I was able to run this tool in our sandbox with only a single thread (which takes longer, as expected), but that too resulted in
restconf dying on us. back to the drawing board for ideas.

for reference:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/143/jamo-controller-csit-3node-rest-clust-cars-perf-only-fluorine/6/robot-plugin/log.html.gz

Comment by Vishal Thapar [ 27/Jun/18 ]

I was able to run the tool locally and get the 500 error, kind of. All the car additions went through in the end, but bringing the leader down, up, and then the new leader down and up again caused some 500 errors. Not sure if it's the same issue as this one, but at least it shows the exception that shows up in the 500.

<title>Error 500 </title>
</head>
<body>
<h2>HTTP ERROR: 500</h2>
<p>Problem accessing /jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=Shards,name=member-3-shard-default-config. Reason:
<pre> org.apache.shiro.session.UnknownSessionException: There is no session with id [6cabdaaf-1727-4144-a075-07940bfe15c5]</pre></p>
<hr />
I've attached the relevant karaf log.

I was tracking status through Tim's monitor script in odltools and I saw 401 pop up briefly when the follower became leader. I think it has something to do with whether the follower we're tracking becomes leader or not. Maybe we should log/track leadership status in the jobs that are running the test?
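A hypothetical sketch of what such tracking could look like: poll the car-config shard MBean on each member via Jolokia. The MBean name pattern is the one visible in the 500 error above; the member IPs are example values from earlier logs, and the "RaftState"/"Leader" attribute names are my recollection of the shard MBean and should be verified:

import time
import requests

MEMBERS = {"member-1": "10.30.170.22", "member-2": "10.30.170.28", "member-3": "10.30.170.29"}
MBEAN = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
         "Category=Shards,name=%s-shard-car-config")

def poll_leadership(interval=2.0, rounds=30):
    for _ in range(rounds):
        for member, ip in MEMBERS.items():
            url = "http://%s:8181/jolokia/read/%s" % (ip, MBEAN % member)
            try:
                value = requests.get(url, auth=("admin", "admin"), timeout=2.0).json().get("value", {})
                state = "%s leader=%s" % (value.get("RaftState"), value.get("Leader"))
            except requests.RequestException as exc:
                state = "unreachable (%s)" % exc.__class__.__name__
            print(time.strftime("%H:%M:%S"), member, state)
        time.sleep(interval)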

Comment by Robert Varga [ 27/Jun/18 ]
ReadTimeout: HTTPConnectionPool(host='10.30.170.108', port=8181): Read timed out. (read timeout=5.0)

this looks like a client-side timeout, i.e. RESTCONF has not responded within 5 seconds.
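For reference, assuming the tooling uses the requests library as the traceback suggests, this is the situation that produces that exception: the connection is fine, but no HTTP response arrives within the configured read timeout, so the client gives up. It is different from a 401/500, where the server did answer:

import requests

try:
    requests.get("http://10.30.170.108:8181/restconf/modules",
                 auth=("admin", "admin"), timeout=5.0)  # 5.0 s matches "read timeout=5.0"
except requests.ReadTimeout:
    # the server never answered in time; the status-code cases are a separate failure mode
    print("RESTCONF did not respond within 5 seconds")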

Comment by Tom Pantelis [ 27/Jun/18 ]

yes it is - the next question is why is restconf/jetty not responding...

Comment by Robert Varga [ 27/Jun/18 ]

I suggest taking a look at https://git.opendaylight.org/gerrit/#/c/73455/5 and https://git.opendaylight.org/gerrit/#/c/73475/1 to see if they improve the situation (in which case this boils down to AAA). Otherwise we require debug logs to see what is going on in the restconf code.

Comment by Tom Pantelis [ 27/Jun/18 ]

yeah - the aaa DS access was discussed earlier in this Jira (I know it's huge and hard to follow).  To take that out of the equation, I suggested installing restconf noauth feature - I believe all subsequent testing has been done with that. 

Comment by Robert Varga [ 27/Jun/18 ]

Right, but doesn't that just disable authentication, with authorization being still in the picture?

Comment by Tom Pantelis [ 27/Jun/18 ]

hmm - I assumed it would disable both - have to check on that...

Comment by Jamo Luhrsen [ 28/Jun/18 ]

I've taken the distro created from c/73475 which is on top of c/73455 and am running it against this
cars job in the sandbox and have seen 14 consecutive passes. This is with the tell-based protocol.

this is promising.

there is something failing now, however, that I wasn't expecting. The Crud mdsal perf suite
is having a test case (purchasing cars) that is coming up short of its 10k number. Haven't
looked any closer, but there is no node isolation/downing in this suite, and it seems to pass
fine in the releng (ask based, and without these aaa patches) every time.

Anyway, let me keep collecting data with the aaa patches. I'll run some jobs without the
tell based protocol for starters.

Comment by Jamo Luhrsen [ 28/Jun/18 ]

right, so ask-based with the c/73475 distro is giving the same results as what we initially
reported in this jira (500 error "read execution failed", seen on the first try).

Wondering what to do now?

Did we have an explanation for why a script talking to /restconf (to add cars) would get
a 500 internal error when the leader is isolated?

Also, this info is coming from the script's stdout and there is no hint of it in any of the
karaf.log files (that I can see). Any ideas why it's not coming as an ERROR in the
logs?

Comment by Tom Pantelis [ 28/Jun/18 ]

500 is internal server error - could come from anywhere. If it's emanating from restconf code then there should be additional error info in the output corresponding to the RFC (I know you've seen it before). Either way we need to see the full HTTP response output.

Comment by Jamo Luhrsen [ 28/Jun/18 ]

do we need more than this:

2018-06-28 04:32:23,034 INFO: Request started at Thu Jun 28 04:32:17 2018 finished with following detais
2018-06-28 04:32:23,034 INFO: <PreparedRequest [POST]> http://10.30.170.131:8181/restconf/config/car:cars
2018-06-28 04:32:23,034 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}:
2018-06-28 04:32:23,035 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model701", "manufacturer": "my_manufacturer", "id": 701, "year": "2015"}]}
2018-06-28 04:32:23,035 INFO: Response: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"read execution failed","error-info":"ReadFailedException{message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=701}], errorList=[RpcError [message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=701}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.170:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:7884_728#-1820175598)]] after [5000 ms]. Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".]]}\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:247)\n\tat akka.dispatch.OnComplete.internal(Future.scala:260)\n\tat akka.dispatch.OnComplete.internal(Future.scala:258)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)\n\tat akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.170:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:7884_728#-1820175598)]] after [5000 ms]. 
Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}]}}
2018-06-28 04:32:23,035 INFO: <Response [500]> Internal Server Error
Comment by Tom Pantelis [ 28/Jun/18 ]

yeah that's the error info. The usual AskTimeoutException with ask-based.

Comment by Jamo Luhrsen [ 28/Jun/18 ]

ok.

what next then?

why is this not showing up in the karaf.log?

how can we avoid this failure in CSIT?

Comment by Tom Pantelis [ 28/Jun/18 ]

There should be a WARN logged by CDS. restconf only logs request failures to DEBUG I believe.

You can't reliably avoid this with ask-based as discussed before. If a tx hits at the right time during leader transition it may time out.

Comment by Tom Pantelis [ 28/Jun/18 ]

A recourse is to retry the request on the client side.  
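A minimal sketch of that client-side retry, with an overall deadline so a genuinely broken cluster still fails the test. The numbers are placeholders, not agreed values, and whether 401 should also be treated as transient (per the aaa discussion above) is a judgment call:

import time
import requests

def post_with_retry(url, body, deadline_s=120, backoff_s=2.0):
    start = time.monotonic()
    while True:
        try:
            resp = requests.post(url, json=body, auth=("admin", "admin"), timeout=5.0)
            if resp.status_code < 500 and resp.status_code != 401:
                return resp            # success, or a 4xx we should not blindly retry
        except requests.RequestException:
            pass                       # connection error / read timeout: retry as well
        if time.monotonic() - start > deadline_s:
            raise RuntimeError("request kept failing past the deadline: " + url)
        time.sleep(backoff_s)          # simple fixed backoff; exponential also works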

Comment by Jamo Luhrsen [ 28/Jun/18 ]

s/avoid/workaround

We don't want any failures in these jobs. We want them to pass 100% of the time so we
can use them as benchmarks without having to dig all the way to the bottom of each failure only to realize it was a random ask-timeout.

Do we have to add retries to the python tool? (looks like it already claims to do that)

How long is the window expected to be here (MAX)?

Can we tweak some timers in the clustering config to reduce the chance?

Comment by Tom Pantelis [ 28/Jun/18 ]

The window depends mostly on akka internals and how it detects failed nodes. Time out settings can be tweaked but there will always be a period of time of leader transition and re-convergence, no matter how small, where a tx can fail if inflight.  The recourse is to retry the entire transaction, whether internally or on the client-side. ask-based does not retry inflight transactions - tell-based was introduced to do that.  The test you're running can never be guaranteed 100% success with ask-based.

Comment by Jamo Luhrsen [ 28/Jun/18 ]

no...

please help me figure this out.

let's say you are a real user and you want to add 10k cars with an external client. So you start
adding them (X per request). Somewhere in the middle you get a 500 response to your request,
but you know that might happen because of a cluster hiccup.

what are you going to do next?

Is it just a simple retry of that exact same request? Assuming that, how many times would you
retry, and how often? When do you give up and start worrying that something major is broken?

Comment by Robert Varga [ 28/Jun/18 ]

This is the usual state reconciliation. What you do depends on what you are doing, but a fair recovery strategy is:

  • read what's there
  • figure out what needs to go there
  • continue pushing

Retry/backoff/escalation is an application-specific thing – in telco's there's like 20% of ONAP to deal with that.
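A sketch of the read/reconcile/continue-pushing strategy outlined above, specialized to the car test. The payload shape is taken from the log excerpts, the response shape is an assumption, and error handling is trimmed for brevity:

import requests

BASE = "http://10.30.170.28:8181/restconf/config/car:cars"   # example follower IP
AUTH = ("admin", "admin")

def existing_car_ids():
    resp = requests.get(BASE, auth=AUTH, timeout=5.0)
    if resp.status_code == 404:
        return set()                                            # nothing written yet
    entries = resp.json().get("cars", {}).get("car-entry", [])  # response shape assumed
    return {entry["id"] for entry in entries}

def reconcile(target_ids):
    # read what's there, figure out what still needs to go there, keep pushing that
    for car_id in sorted(target_ids - existing_car_ids()):
        body = {"car-entry": [{"id": car_id, "category": "my_category",
                               "model": "model%d" % car_id,
                               "manufacturer": "my_manufacturer", "year": "2015"}]}
        requests.post(BASE, json=body, auth=AUTH, timeout=5.0)

if __name__ == "__main__":
    reconcile(set(range(1, 10001)))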

Comment by Jamo Luhrsen [ 28/Jun/18 ]

sure. I'll see what I can figure out from the client side here.

still wondering things like:

how long should I plan to tolerate errored responses before marking a failure that we will all agree is legitimately showing a problem?

btw, for this very specific case when we are trying to create X cars in a single request, and we get that 500 ask-timeout
response, does that mean the entire request failed, or is it possible that "some" of the cars would have been created?

Comment by Robert Varga [ 28/Jun/18 ]

AskTimeout means the transaction is doomed, other transactions are not affected (unless they report an error).

As for granular error reporting ... yeah, someone has to go and define the error contracts on the NB.

Comment by Jamo Luhrsen [ 28/Jun/18 ]

it might be this simple:
https://git.opendaylight.org/gerrit/#/c/73209/13/tools/odl-mdsal-clustering-tests/scripts/cluster_rest_script.py

I have one example of this working, but it looks like there may be more than 2m of retries (see timestamps
below from the script output (first two lines)):

2018-06-28 15:43:20,657 INFO: Add 10000 car(s) to 10.30.170.52:8181 (1 per request)
2018-06-28 15:45:25,211 INFO: Response code(s) got per number of requests: {204: 1003}
2018-06-28 15:45:25,212 INFO: Response code(s) got per number of requests: {204: 996}
2018-06-28 15:45:25,233 INFO: Response code(s) got per number of requests: {204: 993}
2018-06-28 15:45:25,234 INFO: Response code(s) got per number of requests: {204: 998}
2018-06-28 15:45:25,234 INFO: Response code(s) got per number of requests: {204: 996}
2018-06-28 15:45:25,235 INFO: Response code(s) got per number of requests: {204: 1004}
2018-06-28 15:45:25,260 INFO: Response code(s) got per number of requests: {204: 1013}
2018-06-28 15:45:25,261 INFO: Response code(s) got per number of requests: {204: 1005}
2018-06-28 15:45:25,261 INFO: Response code(s) got per number of requests: {204: 1001}
2018-06-28 15:45:25,261 INFO: Response code(s) got per number of requests: {204: 991}

I still want to know proper timing bounds to use, so we can still have failures when real problems are under
the hood.

Comment by Jamo Luhrsen [ 05/Jul/18 ]

To update this one, with this AAA patch's distribution, and these small changes in the CSIT code we are
no longer hitting the major problem of RESTCONF being unavailable (read timeouts). We do still have some periods of inaccessibility where we
will get some 500 errors in ask-based, but those are expected (askTimeouts) and our python test tool is now made to retry on those. If/when
we merge the AAA and CSIT patches, we can close this bug.

while that is good news, it should be noted that a new kind of failure is now coming with the AAA patch in a different suite in
the same job. So, if we merge the AAA patch, we will need a new jira to track it and find a way to fix it. the TL;DR for that is
that we are doing "add cars", then "add people", then "purchase cars" with validation between each. The validation for the
"purchase" step is failing where we are usually near, but not all the way to the full 10k purchases. This failure is there most
of the time, but not always. There is no leader up/down or isolation in this suite. It happens in both ask and tell based
protocols. That's all I know at this point.

Comment by Jamo Luhrsen [ 12/Jul/18 ]

two aaa patches:
https://git.opendaylight.org/gerrit/#/c/73475/
https://git.opendaylight.org/gerrit/#/c/73209/

and one int/test patch:
https://git.opendaylight.org/gerrit/#/c/73209/

Comment by Jamo Luhrsen [ 12/Jul/18 ]

The jobs are now passing 100% for this problem.

A new, less severe problem is now happening in a different suite of
this same job. We will open a new bug for this.

Comment by Luis Gomez [ 12/Jul/18 ]

Awesome
