[CONTROLLER-1851] Sporadic failures in controller-csit-3node-rest-clust-cars-perf Created: 12/Jul/18 Updated: 14/Aug/18 Resolved: 14/Aug/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Fluorine |
| Fix Version/s: | Fluorine |
| Type: | Bug | Priority: | High |
| Reporter: | Vishal Thapar | Assignee: | Tom Pantelis |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Verify Purchases test case of Crud MDSAL Perf [1] has started failing sporadically after fix for |
| Comments |
| Comment by Tom Pantelis [ 15/Jul/18 ] |
|
The first robot failure I see is in Purchase Cars -> Read Until Prompt -> No match found for '>' in 6 seconds Output: I'm not clear on what this step is doing.
|
| Comment by Vishal Thapar [ 15/Jul/18 ] |
|
That is not an error. It is waiting for test script to finish and it detects it from the prompt, hence waiting for '>'. If it doesn't get it in 6 seconds it waits for it again. It is just the way most tests are written, wait for n seconds with 5 retries. If you notice very next step it goes through. That is why it is not marked a failure. Actual failure occurs in Verify Purchases. |
| Comment by Tom Pantelis [ 15/Jul/18 ] |
|
And that fails due to "Keyword 'Purchase Is Completed' failed after retrying for 5 minutes. The last error was: 9946.0 != 10000.0". That doesn't tell me much. If some car purchase RPCs are failing then it would be helpful to see the error reasons. Can you please point me to that? Or if the test isn't outputting such failures then we need it to. Also it would be helpful to know how the script issues RPCs, eg does it round-robin to the nodes, or send all to 1 node and if so which one... So is it that the failure was triggered b/c the script didn't finish by the expected deadline time? |
| Comment by Tom Pantelis [ 16/Jul/18 ] |
|
A car purchase is done via the buyCar RPC (implemented by PurchaseCarProvider). This publishes a yang notification that is received by the PeopleCarListener which writes a car-person item to the data store. The PurchaseCarProvider logs the message "Routed RPC buyCar : generating notification for buying car [{}]" and PeopleCarListener logs "Successfully added car-person entry: [{}]" on successful transaction. So searching the odl1_karaf.log.gz (using grep "<message>" odl1_karaf.log.gz |wc -l) we see: 2461 "Routed RPC buyCar" messages 2461 "Successfully added car-person entry" messages In odl2_karaf.log.gz: 5586 "Routed RPC buyCar" messages 5586 "Successfully added car-person entry" messages In odl3_karaf.log.gz: 11899 "Routed RPC buyCar" messages 11899 "Successfully added car-person entry" messages So in each the number of buyCar RPCs matches the number of car-person items successfully created in the DS. There were no failures. In fact there were no ERRORs logged at all except a few for "Karaf ssh console user karaf | ShellUtil". The total number of car-person items created is 19946 - I assume 9946 were created during the first run and 10000 after the tell-based restart. From the log output, the first run only received 9946 buyCar RPCs. So either there's an issue with the script where it didn't actually send 10K or the missing RPCs never reached the PurchaseCarProvider. I assume cluster_rest_script_purchase_cars.log.gz is the correct output file. According to that 10K requests completed and returned 200 (no failures): 2018-07-11 04:24:45,916 INFO: Add 10000 purchase(s) to 10.30.170.153,10.30.170.134,10.30.170.111:8181 (1 per request)
2018-07-11 04:25:22,715 INFO: Response code(s) got per number of requests: {200: 1990}
2018-07-11 04:25:22,717 INFO: Response code(s) got per number of requests: {200: 2051}
2018-07-11 04:25:22,718 INFO: Response code(s) got per number of requests: {200: 1413}
2018-07-11 04:25:22,721 INFO: Response code(s) got per number of requests: {200: 1574}
2018-07-11 04:25:22,723 INFO: Response code(s) got per number of requests: {200: 1414}
2018-07-11 04:25:22,724 INFO: Response code(s) got per number of requests: {200: 1558}
However since there's only one such log file, I assume this is from the second run which succeeded with 10k created which means the one from the first run got overwritten. So we don't know the request/response data from the first run. We need to retain both. On a side note, I have a suggestion for the tests to make post-mortem analysis easier which I've mentioned before. On node restart, save off the karaf log file to a different name, eg karaf.log.run1, and delete the karaf.log before restarting so it starts with a clean log. This will make it much easier to tell which log messages go with each run. In fact, I would suggest to separate the ask-based and tell-based into different tests instead of running them back to back in the same test. I think this would make tracking and analysis easier and avoids the issue of log files being overwritten and having to explicitly retain the ones from the prior run. Also we had decided we're only focusing on ask-based right now. |
| Comment by Vishal Thapar [ 16/Jul/18 ] |
|
Thanks a ton Tom. Yes, does look like log files get overwritten. Should retain logs or have them as separate tests. In netvirt we normally log test case/suite name into karaf logs too, which makes it easier to map logs to individual test cases, better than separate file for each run. I'm still getting familiar with controller CSIT, will do the changes once figure out how/where to do them. |
| Comment by Vishal Thapar [ 18/Jul/18 ] |
|
Most recent run has failures in second run [3], so we got the right logs for verify purchases [4]. There are total of 78 errors, all 501, which matches with result - 9922.0 != 10000.0 2018-07-18 02:29:30,430 INFO: <PreparedRequest [POST]> http://10.30.170.157:8181/restconf/operations/car-purchase:buy-car available"}]}} : [3] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-rest-clust-cars-perf-only-fluorine/149/robot-plugin/log.html.gz#s1-s4-t6 |
| Comment by Vishal Thapar [ 18/Jul/18 ] |
|
Logs copied to [5] avoid purge during weekend. [5] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/238 |
| Comment by Tom Pantelis [ 18/Jul/18 ] |
|
So the routed RPCs for buyCar are registered via callback of successful person commit (PeopleProvider). Searching for the log message "RPC addPerson : routed rpc registered", we see: odl1: 4947 This adds up to 15053 which doesn't seem to make sense. There were 2 runs so I'd expect 20k. On odl1, I see the RPC registered for person 9687. Unfortunately the script doesn't output the node IP on failure so we don't know which node it sent the buyCar RPC to for person 9687. We need it to print the entire URL, ie including host IP. My suspicion is that the RPC reg had not yet been distributed to the "purchasing" node yet. If so then the script would need to retry. thapar So next steps:
|
| Comment by Vishal Thapar [ 18/Jul/18 ] |
|
Isn't this line showing the URL to which request is made: I am having some issues pushing to sandbox, have raised helpdesk ticket and will do the changes once they're resolved. |
| Comment by Tom Pantelis [ 18/Jul/18 ] |
|
yeah - sorry - I don't know why I didn't see that. |
| Comment by Vishal Thapar [ 18/Jul/18 ] |
|
No problem. I'll push the change for tell only. |
| Comment by Tom Pantelis [ 18/Jul/18 ] |
|
ask-based only, ie take out the tell-based restart. |
| Comment by Vishal Thapar [ 18/Jul/18 ] |
|
Ah yes, sorry for poorly worded response. |
| Comment by Vishal Thapar [ 24/Jul/18 ] |
|
tpantelis Can you look at https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/thapar-controller-csit-3node-rest-clust-cars-perf-only-fluorine/1/ and see if you get all the logs? This change is to retain logs from both runs and sandbox was testing the patch for it. |
| Comment by Tom Pantelis [ 24/Jul/18 ] |
|
Sorry I'm not clear on what you're asking... "see if you get all the logs" for what? |
| Comment by Vishal Thapar [ 24/Jul/18 ] |
|
I meant if it was still overwriting the logs, I was having some network issues and couldn't open it, but now I see there is still only one file for car purchase log. |
| Comment by Vishal Thapar [ 02/Aug/18 ] |
|
After separating out, not seeing the failures in ask based for over 30 runs now. |
| Comment by Jamo Luhrsen [ 14/Aug/18 ] |
|
thapar please close this if the jobs are not hitting this failure any more |