[CONTROLLER-1838] follower reports 401 (unauthorized) and 500 (Internal Error) when leader is isolated. Created: 19/Jun/18 Updated: 09/Aug/18 Resolved: 12/Jul/18 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Fluorine |
| Fix Version/s: | Fluorine |
| Type: | Bug | Priority: | High |
| Reporter: | Jamo Luhrsen | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:3node | ||
| Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
| Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
| Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Sub-Tasks: |
|
| Epic Link: | Clustering Stability |
| Description |
|
This is to track the sporadic (but very frequent) failures we see in this job. The general flow of the job seems to be as follows:
I can't make total sense of it yet, but the robot logs in step 6 show this output:
2018-06-18 02:18:38,781 INFO: Add 10000 car(s) to 10.30.170.28:8181 (1 per request)
2018-06-18 02:18:47,348 INFO: Request started at Mon Jun 18 02:18:42 2018 finished with following detais
2018-06-18 02:18:47,349 INFO: Request started at Mon Jun 18 02:18:42 2018 finished with following detais
2018-06-18 02:18:47,349 INFO: <PreparedRequest [POST]>
but in the last step that stops the tool, there is a flushing of the test tool's stdout and
http://10.30.170.28:8181/restconf/config/car:cars 2018-06-18 02:18:47,349 INFO: <PreparedRequest [POST]> http://10.30.170.28:8181/restconf/config/car:cars 2018-06-18 02:18:47,350 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}: 2018-06-18 02:18:47,350 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}: 2018-06-18 02:18:47,350 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model361", "manufacturer": "my_manufacturer", "id": 361, "year": "2015"}]} 2018-06-18 02:18:47,350 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model353", "manufacturer": "my_manufacturer", "id": 353, "year": "2015"}]} 2018-06-18 02:18:47,351 INFO: Response: <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> <title>Error 401 Unauthorized</title> </head> <body><h2>HTTP ERROR 401</h2> <p>Problem accessing /restconf/config/car:cars. Reason: <pre> Unauthorized</pre></p> </body> </html> 2018-06-18 02:18:47,352 INFO: Response: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"read execution failed","error-info":"ReadFailedException{message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=353}], errorList=[RpcError [message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=353}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:11946_12283#-1964161993)]] after [5000 ms]. 
Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".]]}\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:242)\n\tat akka.dispatch.OnComplete.internal(Future.scala:260)\n\tat akka.dispatch.OnComplete.internal(Future.scala:258)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)\n\tat akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:11946_12283#-1964161993)]] after [5000 ms]. Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}]}} 2018-06-18 02:18:47,352 INFO: <Response [401]> Unauthorized 2018-06-18 02:18:47,353 INFO: <Response [500]> Internal Server Error
That is where I think we need to dive in. It seems that there may be some hiccup 2018-06-18T02:18:47,330 | WARN | qtp414495725-104 | MDSALDynamicAuthorizationFilter | 195 - org.opendaylight.aaa.shiro - 0.8.0.SNAPSHOT | MDSAL attempt to read Http Authz Container failed, disallowing access org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: read execution failed at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:40) ~[?:?] at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:36) ~[?:?] at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:91) ~[308:org.opendaylight.yangtools.util:2.0.5] at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:40) ~[308:org.opendaylight.yangtools.util:2.0.5] at org.opendaylight.mdsal.common.api.MappingCheckedFuture.mapException(MappingCheckedFuture.java:62) ~[?:?] at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) ~[?:?] at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) ~[?:?] at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:168) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.Futures.getDone(Futures.java:1436) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:85) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713) ~[32:com.google.guava:23.6.0.jre] at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54) ~[32:com.google.guava:23.6.0.jre] at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:241) ~[?:?] at akka.dispatch.OnComplete.internal(Future.scala:260) ~[?:?] at akka.dispatch.OnComplete.internal(Future.scala:258) ~[?:?] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[?:?] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185) ~[?:?] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428] at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[?:?] at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[?:?] 
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) [337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428] at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) [337:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428] at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) [35:com.typesafe.akka.actor:2.5.11] at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [35:com.typesafe.akka.actor:2.5.11] at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43) [35:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [35:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [35:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [35:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [35:com.typesafe.akka.actor:2.5.11] Caused by: org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: Error checking ReadData for path /(urn:opendaylight:params:xml:ns:yang:aaa?revision=2016-12-14)http-authorization at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:242) ~[?:?] ... 16 more Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.29:2550/), Path(/user/shardmanager-config/member-1-shard-default-config/shard-default-member-2:datastore-config@0:11947_12284#-916780901)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData". at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595) ~[?:?] at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605) ~[?:?] at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140) ~[?:?] at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870) ~[?:?] at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109) ~[?:?] at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103) ~[?:?] at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868) ~[?:?] at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) ~[?:?] at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) ~[?:?] at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) ~[?:?] at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) ~[?:?] at java.lang.Thread.run(Thread.java:748) ~[?:?] I can't find where that internal error 500 comes from though. I would not expect one healthy node in the majority, when a single node dies, to start In this specific example, only 487 of the 10k cars were configured, so it seems there all three odl karaf logs: initial leader |
| Comments |
| Comment by Tom Pantelis [ 19/Jun/18 ] |
|
Notice the AskTimeoutException. This can happen for transaction(s) in flight when the leader is isolated, if the timing is right. So this is expected with the ask-based protocol; however, the new tell-based protocol promises to alleviate in-flight failures (via retries), as was discussed on the call. The 401 is b/c of a read failure in aaa - apparently it relies on the mdsal DS for authentication (I didn't know that). |
| Comment by Jamo Luhrsen [ 19/Jun/18 ] |
|
tpantelis, just to be clear: if we run this job with the tell-based protocol, should we expect to never hit this problem? I think I can |
| Comment by Tom Pantelis [ 19/Jun/18 ] |
|
From my understanding, it shouldn't occur. |
| Comment by Ryan Goulding [ 19/Jun/18 ] |
|
Not for authentication, just authorization. We derive the authorization rules for MDSALDynamicAuthorizationFilter from data stored in MD-SAL, the same place everyone stores their data. If it is isolated, then we cannot get that info, and we fail closed. So yes, this is correct behavior. |
| Comment by Sam Hague [ 19/Jun/18 ] |
|
tpantelis what would be the outcome from this issue - that all operations using aaa will fail? Assuming we can't switch to tell-based, should something be fixed here in aaa or somewhere else? I am guessing we would hit this same error in the downstream jobs also. |
| Comment by Tom Pantelis [ 19/Jun/18 ] |
|
aaa is simply doing a read and it fails. Whether it should fail-fast and disallow access is another story. That's the safest approach as it doesn't know if it should allow access. However it could cache the last known password and configuration - other systems work that way or provide the option to behave that way. Either way it probably doesn't matter since if aaa can't access the DS then the user request likely won't be able to either. |
| Comment by Ryan Goulding [ 19/Jun/18 ] |
|
That isn't entirely true. First off, it isn't authentication; that is handled through H2 still. It is Authorization. Authorization data shouldn't be cached IMHO, and I wouldn't merge a patch that built a cache for it. There needs to be a single source of truth; it isn't like other data, which can be stale. |
| Comment by Luis Gomez [ 19/Jun/18 ] |
|
Jamo, can you try with tell-based protocol? here is an example: |
| Comment by Ryan Goulding [ 19/Jun/18 ] |
|
It does beg the question if we should pre-empt AAA 401 with a check in RESTCONF which returns 5xx in the case of isolation. I would be behind that, but don't have cycles to contribute. |
| Comment by Tom Pantelis [ 19/Jun/18 ] |
|
I know what you're saying but I don't completely agree with that. It could use the last known good config if the latest config can't be obtained. We did similar in a previous product I worked on years back to allow that based on user-enablement - it was specifically requested (I recall Windows does that too wrt LDAP authentication). Also I think the initial "admin" user should have "super" privileges, i.e. it should be authorized for everything so there's always a way in (otherwise users can shoot themselves in the foot with a misconfiguration and not have any way to back it out other than to completely re-install - seen that happen before). Anyway I'm digressing.... it is what it is now - CSIT can expect 401 if HA is lost. |
| Comment by Ryan Goulding [ 19/Jun/18 ] |
|
I envision a situation in which someone gets canned when a node is in isolation but still has access via RESTCONF until the node rejoins the cluster. This is a bit silly I guess, but man do I not want my name on that CVE
We could probably also just make a super user account at some point too. I'd say admin, but at this point most people unfortunately only use that account! At least from non-ONAP context. |
| Comment by Tom Pantelis [ 19/Jun/18 ] |
|
In my experience, the initial user created on installation is automatically a super-user to bootstrap things. It can't be deleted nor have its privileges downgraded - again a safety net to avoid shooting oneself in the foot. |
| Comment by Sam Hague [ 19/Jun/18 ] |
|
rgoulding tpantelis What exactly would fail during this time when the authorization is failing because aaa can't read from ds? All restconf calls? Does that mean all restconf calls should not be relied upon? Meaning in the context of our csit tests, there are scenarios where the tests use restconf to verify states and if we don't see a 200 for instance, we would fail a test. In this case, the test itself is not written properly and should be changed to not rely on the restconf calls. |
| Comment by Tom Pantelis [ 19/Jun/18 ] |
|
If the DS isn't available then it appears currently all restconf calls will fail with 401 b/c it can't authenticate. |
| Comment by Ryan Goulding [ 19/Jun/18 ] |
|
So the MDSALDynamicAuthorizationFilter is a Shiro AuthorizationFilter whose "isAccessAllowed(...)" returns a boolean. It is executed on ingress of REST requests to anything protected by AAA. Part of isAccessAllowed(...) is getting the existing authorization rules [0]. If it fails w/ a ReadFailedException (i.e., an indication of Node isolation), then we return false [1], which results in a 401. Is this correct behavior? Well it functions as designed, but it ain't pretty. Really, we ought to put a check in RESTCONF to check if a Node is isolated prior to executing any AAA Filter code. In other words, a Filter in front of AAA which will return a 5XX if the node is isolated, preferably with a nice message (though one could argue how much disclosure we should give the client!). If you are mapping 200 to failure, well that means any time a REST call is made against an isolated node, you will have a test failure. Since isolation cases can occur and are "expected" in our target environment, I doubt that you want to map directly to failure.
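To make that last point concrete, a CSIT-side helper could classify non-200 responses during a deliberate isolation window instead of mapping them straight to test failure. A minimal sketch only, assuming the 401/500 bodies look like the ones quoted in the description; the heuristics and the node IP are illustrative, not an agreed contract:

import requests

# Rough classifier for restconf responses seen while a member is isolated.
# Assumption (not a contract): a 401 from the AAA filter, or a 500 whose body
# mentions AskTimeoutException, is "expected during isolation"; anything else
# is treated as a real failure.
def classify_response(resp):
    if resp.status_code in (200, 201, 204):
        return "ok"
    if resp.status_code == 401:
        return "isolation-suspected"   # AAA failed closed because the authz read failed
    if resp.status_code == 500 and "AskTimeoutException" in resp.text:
        return "isolation-suspected"   # in-flight transaction lost during leader change
    return "failure"

resp = requests.get("http://10.30.170.28:8181/restconf/config/car:cars",
                    auth=("admin", "admin"), timeout=5.0)
print(classify_response(resp))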
|
| Comment by Ryan Goulding [ 19/Jun/18 ] |
|
We could do that. I know some stakeholders have expressed desire for this in the past, but it hasn't been prioritized by anyone thus far. It really wouldn't take that much. But as you stated before, we still have that issue of RESTCONF calls probably failing anyway since that code is attempting reads against the DS on an isolated node still. |
| Comment by Sam Hague [ 19/Jun/18 ] |
|
Not mapping a 200 to failure, but the opposite - if we don't get a 200 when querying for something, we consider that a failure. Take a case where a test starts, and some events happen that should write something to the datastore. To validate that happened, the test would read from the datastore via restconf, expecting to see a 200 for the resource requested. But if during this isolated condition all restconf calls are going to fail, then that test is always going to fail - regardless of whether the ds was updated correctly or not. That itself might be another badly written part of this test. |
| Comment by Jamo Luhrsen [ 19/Jun/18 ] |
|
Thanks ecelgp. I see you have a similar patch to something I tried (with bad results) like 10 months ago.
anyway, here's the cars/people version and I'm trying it in the sandbox |
| Comment by Jamo Luhrsen [ 19/Jun/18 ] |
|
At the risk of sounding stupid here, can I double check that I understand this properly? This specific issue in this JIRA has some external REST client talking to a single happy node. That node But, I think the idea here is that this node has to do its sideways stuff to make reads/writes to the leader, correct? |
| Comment by Sam Hague [ 19/Jun/18 ] |
|
In the netvirt 3node jobs we have hundreds of those "MDSAL attempt to read Http Authz Container failed, disallowing access" errors. No idea if they cause problems, but they are there in the jobs. |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
yes. All transactions have to go to the leader. During isolation there is a period of time where there is no leader that is able to make progress until it stabilizes with a new leader in the majority partition (assuming it's able to do so). Transactions initiated during that window in the majority partition may still timeout and fail. No transactions initiated in the isolated partition will succeed until the partition heals. So if the external tool was adding cars on the isolated node, with the "ask"-based protocol, the transactions initiated during isolation will all fail. If it was adding cars on one of the other nodes they could all succeed if none were initiated during the ~5-6 sec period before akka detects the leader node is unreachable. The new "tell"-based protocol is intended to retry transactions in the hope that they will eventually succeed. Of course even that has its upper limit deadline. If 2 nodes were down for many minutes/hours you would expect it to eventually give up and fail. |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
yeah DS access may still fail even if it got thru aaa. But other NB endpoints could still be accessed, eg RPCs, netconf mount point, jolokia etc. An admin might want to access JMX data to check the akka cluster state etc. But with no access via the NB, an admin has no visibility into the system (unless thru CLI). |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
thanks for all the explanations, tpantelis. I'm wondering what the negative side effects are of dialing the timeout down from 5s to 1s? that won't get rid of this. Anyway, let's see how the tell-based stuff behaves with our simple car/people job. still waiting on results. |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
yup - reducing the timeout will open the door for false positives. Garbage collection is the biggest culprit and accounts for most of that ~5s default cushion. You can't avoid a "blackout" period altogether - it happens with networking protocols like spanning tree etc. The best we can do I think is to retry to hopefully overcome relatively short transient failures. |
| Comment by Robert Varga [ 20/Jun/18 ] |
|
Note that tell-based protocol will abort (i.e. render frontend unusable) if it cannot make forward progress in 10 minutes (or so). Instance restart is required to recover from that. |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
looking for help. this tell-based config is having trouble in the job. Example robot logs. Basically, the cars suite runs first and passes (like it will do every so often), then a suite runs that
that last step fails, because there is a read timeout (restconf read). The karaf.log for that controller has what looks like a whole mess of these:
2018-06-20T16:49:08,981 | WARN | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-2-shard-car-config: Unknown history for aborted transaction member-1-datastore-config-fe-0-txn-19-1, ignoring
followed by a bunch of:

2018-06-20T16:51:07,052 | INFO | opendaylight-cluster-data-shard-dispatcher-27 | Shard | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-topology-config (Leader): Cannot append entries because sender's term 2 is less than 3

then some of this:

2018-06-20T16:51:11,046 | INFO | opendaylight-cluster-data-shard-dispatcher-26 | Shard | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-car-config (Leader): handleAppendEntriesReply - received unsuccessful reply: AppendEntriesReply [term=3, success=false, followerId=member-1-shard-car-config, logLastIndex=1595, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], leader snapshotIndex: 1595

can I get some ideas of what to try, where to look? |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
So the attached karaf log was from member-2. After the 2nd startup at 2018-06-20T16:47:35,134, looking at the cars-config shard, member-1 was initially the leader:

2018-06-20T16:47:48,178 | INFO | opendaylight-cluster-data-shard-dispatcher-27 | ShardManager | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-car-config, leaderId=member-1-shard-car-config, leaderPayloadVersion=5]

Then 10.30.170.22, which I assume is member-1, became unreachable:

2018-06-20T16:49:23,879 | INFO | opendaylight-cluster-data-shard-dispatcher-30 | Shard | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-car-config (Follower): Leader akka.tcp://opendaylight-cluster-data@10.30.170.22:2550 is unreachable
2018-06-20T16:49:23,896 | INFO | opendaylight-cluster-data-shard-dispatcher-30 | ShardManager | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-car-config, leaderId=member-2-shard-car-config, leaderPayloadVersion=5]

Then there's a bunch of these:

2018-06-20T16:51:07,079 | INFO | opendaylight-cluster-data-shard-dispatcher-45 | Shard | 209 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-2-shard-car-config (Leader): Cannot append entries because sender's term 2 is less than 3

I assume this is during the time of isolation; however, it appears these messages are coming from member-1. Is traffic just blocked one way? Unless this was after the isolation was lifted, but not sure if the test got that far. I don't see any significant ERRORs in the log (just 2 related to the karaf shell), and no WARNs except for those listed here. So it's unclear from this log why the rest request timed out. The test step Verify_New_Car_Leader_Elected that failed indicates it's attempting to access jolokia to get the leader status, but it appears the actual request that failed was attempting /restconf/modules. Either way, it may be that aaa auth was trying to read from the DS but spun. "tell"-based has different behavior with retries so I think it has a much longer timeout period - rovarga knows that code much better than me. I assume aaa would be hitting the default-config shard - the member-2 log indicates member-3 was the leader but I don't see any indication in that log either. The log from member-1 shows a lot of "Follower is out-of-sync..." messages which corresponds to the "handleAppendEntriesReply - received unsuccessful reply..." on member-2. Something strange going on there... Also these should not have occurred during the isolation period if traffic to/from member-1 was blocked.... It would be ideal to manually run the isolation test steps - one can control the steps and verify and debug the live system at various points. The isolation scenario is the hardest - perhaps we should get the other test scenarios working reliably first (unless they already are |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
If the read timeouts are due to aaa trying to access the DS, perhaps CSIT can install odl-restconf-noauth to take that out of the equation. At least then Verify_New_Car_Leader_Elected could succeed and convey the results. |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
I tried this, and the first run was good. |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
well, another run failed with the same symptoms even with the odl-restconf-noauth |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
I think we're going to have to try to run the isolation steps manually. I think we really need live systems. As noted above, after isolation (or what I think from the log is when it occurred), I see another member became car leader as expected and it snapshotted at 19999 log entries which indicates thousands of cars were created after isolation. Also the logs indicate the partition was healed after about 2 min and member-1 joined back and was receiving and replying to messages by the new leader. I see no reason in the logs for the rest timeout which wasn't even hitting the DS anyway (appears it was hitting restconf/modules). How is the rest of the suite? Is the isolation test the only one failing? |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
some notes: the isolation is done by blocking all OUTPUT traffic from the leader. So the other two nodes will stop hearing from it, but it will still be hearing from them. I don't have a quick local setup to reproduce this manually, but can work on it. Still, it's very easy to try things out with CSIT in the sandbox. The /restconf/modules endpoint is just the first endpoint in the list of requests that will be made on the way for the test to determine the leader. Remember, the tests aren't dealing with the node that is marked in isolation. Once those iptables rules are in place on that initial leader, |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
this is the whole suite. see the 10 steps in the description for the high level. do you think a totally stopped node test, instead of isolation is going to be better for us to look at? |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
btw, if you read below, I think we can say that tell-based protocol was not any kind of silver bullet here. |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
Our comments are getting crossed. From what I see in logs, things appear to be working as expected. The downside with automated tests is that you have to try to analyze things post-mortem - sometimes it really helps to have live systems to poke around in. |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
of course. I don't disagree with having a live setup to debug being better (assuming we can reproduce). I don't have one right now. do you? |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
I don't have access to a 3 node env right now. The best I can do is run 3 ODL instances on the same VM on my MAC. Not sure what the system commands are to block traffic - in my case they would each run on a different akka port. |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
Doesn't it do that too? From what I recall 3 years ago when I did some work on the tests, it would stop a node and verify continuity, and also stop 2 to verify expected failures. It also tested/verified recovery from persistence. Assuming all those tests still exist, are they running reliably? |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
I can try to get a setup going as well, but might take me a day or so. Here are the commands the CSIT is using to isolate the leader:

sudo /sbin/iptables -I OUTPUT -p all --source $LEADER_IP --destination $LEADER_IP -j DROP
sudo /sbin/iptables -I OUTPUT -p all --source $LEADER_IP --destination $FOLLOWER_1_IP -j DROP
sudo /sbin/iptables -I OUTPUT -p all --source $LEADER_IP --destination $FOLLOWER_2_IP -j DROP

note that first line is odd, as it blocks traffic to and from its own IP. I think that's probably just a relic |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
Back then the isolation test didn't exist. |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
I assume it's possible to just block a port. |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
yes, we can do that too. are you asking to try that in CSIT? or how to do it for yourself locally? |
| Comment by Tom Pantelis [ 20/Jun/18 ] |
|
That would be for running 3 instances on a single VM on my MAC - each instance has IP 127.0.0.1 but different ports. |
| Comment by Jamo Luhrsen [ 20/Jun/18 ] |
|
try this:

sudo /sbin/iptables -I OUTPUT -p tcp --dport 2550 -j DROP |
| Comment by Tom Pantelis [ 21/Jun/18 ] |
|
jluhrsen ecelgp There's been a lot of comments flying around - I didn't get an answer here. However I cloned integration/test and I do see the original tests, or at least most of them, under csit/suites/controller/Clustering_Datastore. carpeople_crud.robot tests basic functionality. car_failover_crud.robot is one that stops a node to test HA. There's car_failover_crud_isolation.robot that uses isolation. The one you're running is under ThreeNodes_Datastore (not sure how it differs from car_failover_crud_isolation). There's also a bunch of other tests spread out in different directories - dom_data_broker, cluster_singleton, singleton_service - I think all those were Vratko's so I assume no one's maintained those or really knows what they do and what state they're in. So do the ones under Clustering_Datastore (maybe sans isolation) run reliably? I think that's a good first step - start simple/basic and work our way up. |
| Comment by Jamo Luhrsen [ 21/Jun/18 ] |
|
tpantelis that suite you are referring to is doing fine, it seems. So, that more simple case is working, which is good. It's definitely being a little more gentle with the tests. The job we are looking at is proactively trying to use the cluster while a leader is being isolated. Also, I ran some different combos (tell-based and not) in the sandbox, and made a new suite (to kill the leader, Additionally, the more I think about this problem the more it's reminding me of some issue we think is |
| Comment by Robert Varga [ 21/Jun/18 ] |
|
tell-based protocol has these timeouts:
We can eventually get rid of the request timeout, but we need it now to make UT work. |
| Comment by Tom Pantelis [ 21/Jun/18 ] |
|
I tested locally with "ask" protocol running the car stress-test RPC creating 500 cars at 10/sec. I tested with graceful leader shutdown and killing the leader, both 5 times. I also manually did GET on car:cars quickly over and over for a bit after stopping the leader - these succeeded. With graceful leader shutdown, all the cars were successfully created in each run. With leader kill, I saw some car failures during transition which is expected. |
| Comment by Tom Pantelis [ 21/Jun/18 ] |
|
jluhrsen Re: the suppression of the first 4 failed loops and only displaying the failure from the last one... the downside is that we lose the details of the first 4. It would be interesting to see if the first follower that succeeded in the last try also had failures previously, and if the second one had the same 5 sec timeout failures each time. Maybe it succeeded in a prior try but the first one failed, etc. So I think it would be good to show all the failures, at least for now. |
| Comment by Tom Pantelis [ 22/Jun/18 ] |
|
Also I asked Ryan about the HTTP request logging but that is not upstream. I suppose it could be re-invented upstream, or maybe ecelgp could help to facilitate a quicker path |
| Comment by Tom Pantelis [ 22/Jun/18 ] |
|
One possible theory for the read timeout is that the internal HTTP server (jersey/jetty whatever) gets overloaded. I assume it has a thread pool and queue to handle the incoming requests. During the leader transition, add car requests from the python script will be delayed and thus block the thread. Over time all threads in the pool are in use and blocked and the queue builds up which may result in the client timing out. |
| Comment by Ariel Adam [ 24/Jun/18 ] |
|
tpantelis can you think of a way, aside from logs, to check the status of the threads and queues on the HTTP server? For the purpose of testing, can we compile an HTTP server with logs and manually push it in? |
| Comment by Tom Pantelis [ 24/Jun/18 ] |
|
I haven't looked into that yet. But if you or a colleague could help on that front that would be great. |
| Comment by Jamo Luhrsen [ 26/Jun/18 ] |
|
This would explain it, but now we gotta prove it. I have a good local setup I can use for this (not my laptop), but it does not have the cluster-test-app features. Trying to figure out some other quick/easy "create" I can script to simulate this scenario. BTW, like tpantelis and thapar I cannot reproduce the troubles yet. I will additionally try |
| Comment by Tom Pantelis [ 26/Jun/18 ] |
|
I tried it with the cars stress-test RPC but that's not using restconf like the python script. The theory would hold water if the script sends restconf requests simultaneously. But if it sends requests synchronously one at a time then there would only be one outstanding request at a time which shouldn't overload. |
| Comment by Jamo Luhrsen [ 26/Jun/18 ] |
|
it does, from what I can tell. It's using 10 threads in the job, so I assume that is up to 10 rest calls in parallel. I will be trying to get all of this running locally today and see what happens. |
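For reference, the parallelism described here presumably amounts to something like the following. A sketch, not the actual script: the URL, payload shape and 10-worker count are taken from the logs and comments above, and a real tool would also catch per-request exceptions rather than let them propagate:

import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "http://10.30.170.28:8181/restconf/config/car:cars"
AUTH = ("admin", "admin")

def add_car(car_id):
    body = {"car-entry": [{"id": car_id, "category": "my_category",
                           "model": "model%d" % car_id,
                           "manufacturer": "my_manufacturer", "year": "2015"}]}
    # One blocking POST per car; with 10 workers, up to 10 requests can be in
    # flight at once, which is what could pile up on the Jetty side if the
    # datastore stalls during a leader transition.
    return requests.post(BASE, json=body, auth=AUTH, timeout=5.0).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    codes = list(pool.map(add_car, range(1, 10001)))
print({c: codes.count(c) for c in set(codes)})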
| Comment by Jamo Luhrsen [ 26/Jun/18 ] |
|
I was able to run this tool in our sandbox with only a single thread (which takes longer, as expected), but that too resulted in |
| Comment by Vishal Thapar [ 27/Jun/18 ] |
|
I was able to run the tool locally and get the 500 error, kind of. All the car additions went through in the end, but bringing the leader down, up, and then the new leader down and up again caused some 500 errors. Not sure if it's the same issue as this one, but at least it shows the exception that shows up in the 500: <title>Error 500 </title> I was tracking status through Tim's monitor script in odltools and I saw a 401 pop up briefly when the follower became leader. I think it has something to do with whether the follower we're tracking becomes leader or not. Maybe we should log/track leadership status in the jobs that are running the test? |
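Logging leadership alongside the job could be as simple as polling each member's shard MBean through Jolokia. A sketch only: the member IPs, the MBean path and the RaftState/Leader attribute names below are assumptions from memory and should be verified against a live node before relying on them:

import time
import requests

ODL_NODES = {1: "10.30.170.28", 2: "10.30.170.29", 3: "10.30.170.30"}  # hypothetical member IPs
# Assumed MBean path for each member's car config shard; verify the exact naming.
MBEAN = ("/jolokia/read/org.opendaylight.controller:"
         "type=DistributedConfigDatastore,Category=Shards,name=member-%d-shard-car-config")

def log_leadership(interval=5.0):
    while True:
        for member, ip in ODL_NODES.items():
            try:
                reply = requests.get("http://%s:8181%s" % (ip, MBEAN % member),
                                     auth=("admin", "admin"), timeout=5.0).json()
                value = reply.get("value", {})
                print(time.strftime("%H:%M:%S"), ip, value.get("RaftState"), value.get("Leader"))
            except requests.RequestException as exc:
                print(time.strftime("%H:%M:%S"), ip, "unreachable:", exc)
        time.sleep(interval)

log_leadership()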
| Comment by Robert Varga [ 27/Jun/18 ] |
ReadTimeout: HTTPConnectionPool(host='10.30.170.108', port=8181): Read timed out. (read timeout=5.0)
this looks like a client-side timeout, i.e. RESTCONF has not responded within 5 seconds. |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
yes it is - the next question is why is restconf/jetty not responding... |
| Comment by Robert Varga [ 27/Jun/18 ] |
|
I suggest taking a look at https://git.opendaylight.org/gerrit/#/c/73455/5 and https://git.opendaylight.org/gerrit/#/c/73475/1 to see if they improve the situation (in which case this boils down to AAA). Otherwise we require debug logs to see what is going on in the restconf code. |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
yeah - the aaa DS access was discussed earlier in this Jira (I know it's huge and hard to follow). To take that out of the equation, I suggested installing restconf noauth feature - I believe all subsequent testing has been done with that. |
| Comment by Robert Varga [ 27/Jun/18 ] |
|
Right, but doesn't that just disable authentication, with authorization being still in the picture? |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
hmm - I assumed it would disable both - have to check on that... |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
I've taken the distro created from c/73475 (which is on top of c/73455) and am running it against this. This is promising. There is something failing now, however, that I wasn't expecting: the Crud mdsal perf suite Anyway, let me keep collecting data with the aaa patches. I'll run some jobs without the |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
right, so ask-based with the c/73475 distro is giving the same results as what we initially saw. Wondering what to do now? Did we have an explanation for why a script talking to /restconf (to add cars) would get a 500? Also, this info is coming from the script's stdout and there is no hint of it in any of the karaf logs. |
| Comment by Tom Pantelis [ 28/Jun/18 ] |
|
500 is internal server error - could come from anywhere. If it's emanating from restconf code then there should be additional error info in the output corresponding to the RFC (I know you've seen it before). Either way we need to see the full HTTP response output. |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
do we need more than this: 2018-06-28 04:32:23,034 INFO: Request started at Thu Jun 28 04:32:17 2018 finished with following detais 2018-06-28 04:32:23,034 INFO: <PreparedRequest [POST]> http://10.30.170.131:8181/restconf/config/car:cars 2018-06-28 04:32:23,034 INFO: Headers {'Content-Length': '127', 'Content-Type': 'application/json', 'Authorization': 'Basic YWRtaW46YWRtaW4='}: 2018-06-28 04:32:23,035 INFO: Body: {"car-entry": [{"category": "my_category", "model": "model701", "manufacturer": "my_manufacturer", "id": 701, "year": "2015"}]} 2018-06-28 04:32:23,035 INFO: Response: {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"read execution failed","error-info":"ReadFailedException{message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=701}], errorList=[RpcError [message=Error checking DataExists for path /(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)cars/car-entry/car-entry[{(urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car?revision=2014-08-18)id=701}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.170:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:7884_728#-1820175598)]] after [5000 ms]. Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".]]}\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:247)\n\tat akka.dispatch.OnComplete.internal(Future.scala:260)\n\tat akka.dispatch.OnComplete.internal(Future.scala:258)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)\n\tat akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.170.170:2550/), Path(/user/shardmanager-config/member-1-shard-car-config/shard-car-member-2:datastore-config@0:7884_728#-1820175598)]] after [5000 ms]. 
Sender[null] sent message of type \"org.opendaylight.controller.cluster.datastore.messages.DataExists\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}]}} 2018-06-28 04:32:23,035 INFO: <Response [500]> Internal Server Error |
| Comment by Tom Pantelis [ 28/Jun/18 ] |
|
yeah that's the error info. The usual AskTimeoutException with ask-based. |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
ok. what next then? why is this not showing up in the karaf.log? how can we avoid this failure in CSIT? |
| Comment by Tom Pantelis [ 28/Jun/18 ] |
|
There should be a WARN logged by CDS. restconf only logs request failures to DEBUG I believe. You can't reliably avoid this with ask-based as discussed before. If a tx hits at the right time during leader transition it may time out. |
| Comment by Tom Pantelis [ 28/Jun/18 ] |
|
A recourse is to retry the request on the client side. |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
s/avoid/workaround
We don't want to have any failures in these jobs; we want them to pass 100% every time so we
Do we have to add retries to the python tool? (looks like it already claims to do that)
How long is the window expected to be here (MAX)?
Can we tweak some timers in the clustering config to reduce the chance? |
| Comment by Tom Pantelis [ 28/Jun/18 ] |
|
The window depends mostly on akka internals and how it detects failed nodes. Time out settings can be tweaked but there will always be a period of time of leader transition and re-convergence, no matter how small, where a tx can fail if inflight. The recourse is to retry the entire transaction, whether internally or on the client-side. ask-based does not retry inflight transactions - tell-based was introduced to do that. The test you're running can never be guaranteed 100% success with ask-based. |
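On the client side, that retry could look roughly like this. A sketch with assumed bounds: the 60-second budget is a guess intended to cover failure detection plus re-election, not a number anyone here has agreed on:

import time
import requests

def add_car_with_retry(url, body, budget=60.0, backoff=2.0):
    # Retry the whole request until it succeeds or the time budget runs out.
    deadline = time.monotonic() + budget
    while True:
        try:
            resp = requests.post(url, json=body, auth=("admin", "admin"), timeout=5.0)
            if resp.status_code in (200, 201, 204):
                return resp
            # NOTE: if a failed attempt actually committed before the error was
            # reported, a retry could come back as a conflict; a real test would
            # treat that as success or fall back to reconciliation.
        except requests.RequestException:
            pass   # client-side read timeout, connection reset, ...
        if time.monotonic() >= deadline:
            raise RuntimeError("still failing after %.0fs - likely a real bug" % budget)
        time.sleep(backoff)   # crude fixed backoff; exponential also works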
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
no... please help me figure this out. let's say you are a real user and you want to add 10k cars with an external client. So you start - what are you going to do next? Is it just a simple retry of that exact same request? Assuming that, how many times would you retry? |
| Comment by Robert Varga [ 28/Jun/18 ] |
|
This is the usual state reconciliation. What you do depends on what you are doing, but a fair recovery strategy is:
Retry/backoff/escalation is an application-specific thing – in telcos there's like 20% of ONAP to deal with that. |
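In the car test's terms, that reconciliation step might look roughly like the following. A sketch only: the GET response shape and the idea that comparing ids is enough to decide what is missing are assumptions, and the IP is just the one from the logs above:

import requests

BASE = "http://10.30.170.52:8181/restconf/config/car:cars"
AUTH = ("admin", "admin")

def reconcile_cars(wanted_ids):
    # Read back car:cars and re-add whatever did not make it.
    existing = set()
    resp = requests.get(BASE, auth=AUTH, timeout=5.0)
    if resp.status_code == 200:
        for entry in resp.json().get("cars", {}).get("car-entry", []):
            existing.add(int(entry["id"]))
    for car_id in sorted(set(wanted_ids) - existing):
        body = {"car-entry": [{"id": car_id, "category": "my_category",
                               "model": "model%d" % car_id,
                               "manufacturer": "my_manufacturer", "year": "2015"}]}
        requests.post(BASE, json=body, auth=AUTH, timeout=5.0)

reconcile_cars(range(1, 10001))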
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
sure. I'll see what I can figure out from the client side here. Still wondering things like: how long should I plan to deal with errored responses before marking a failure that we will all know is legitimately showing a problem? btw, for this very specific case when we are trying to create X cars in a single request, and we get that 500 ask-timeout |
| Comment by Robert Varga [ 28/Jun/18 ] |
|
AskTimeout means the transaction is doomed, other transactions are not affected (unless they report an error). As for granular error reporting ... yeah, someone has to go and define the error contracts on the NB. |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
it might be this simple: I have one example of this working, but it looks like there may be more than 2m of retries (see timestamps):
2018-06-28 15:43:20,657 INFO: Add 10000 car(s) to 10.30.170.52:8181 (1 per request)
2018-06-28 15:45:25,211 INFO: Response code(s) got per number of requests: {204: 1003}
2018-06-28 15:45:25,212 INFO: Response code(s) got per number of requests: {204: 996}
2018-06-28 15:45:25,233 INFO: Response code(s) got per number of requests: {204: 993}
2018-06-28 15:45:25,234 INFO: Response code(s) got per number of requests: {204: 998}
2018-06-28 15:45:25,234 INFO: Response code(s) got per number of requests: {204: 996}
2018-06-28 15:45:25,235 INFO: Response code(s) got per number of requests: {204: 1004}
2018-06-28 15:45:25,260 INFO: Response code(s) got per number of requests: {204: 1013}
2018-06-28 15:45:25,261 INFO: Response code(s) got per number of requests: {204: 1005}
2018-06-28 15:45:25,261 INFO: Response code(s) got per number of requests: {204: 1001}
2018-06-28 15:45:25,261 INFO: Response code(s) got per number of requests: {204: 991}
I still want to know proper timing bounds to use, so we can still have failures when real problems are under |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
To update this one: with this AAA patch's distribution, and these small changes in the CSIT code, we are now passing. While that is good news, it should be noted that a new kind of failure is now coming with the AAA patch in a different suite in |
| Comment by Jamo Luhrsen [ 12/Jul/18 ] |
|
two aaa patches: and one int/test patch: |
| Comment by Jamo Luhrsen [ 12/Jul/18 ] |
|
The jobs are now passing 100% for this problem. A new, less severe problem is now happening in a different suite of |
| Comment by Luis Gomez [ 12/Jul/18 ] |
|
Awesome |