
[CONTROLLER-1840] Check whether the Python script fails in tell-based mode Created: 22/Jun/18  Updated: 22/Jan/24

Status: Open
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: 7.0.12, 8.0.5, 9.0.1

Type: Sub-task Priority: Medium
Reporter: Jamo Luhrsen Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: pick-next, pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Sometimes the Python script that adds the cars has trouble communicating
and fails with Python exceptions. Other times the failures show up in the
Robot REST clients while the Python script runs to a clean completion.

We want to know whether the Python script ever shows a failure when using
the tell-based protocol.
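For context, a minimal sketch of the kind of request the script makes: one PUT per car against /restconf, with a timeout so a silent endpoint surfaces as an error instead of hanging. This is illustrative only, not the actual cluster_rest_script.py; the URL path, payload shape, and port are assumptions, and authentication is omitted for brevity.

```python
import json
import logging
import urllib.error
import urllib.request

log = logging.getLogger(__name__)

def put_car(host, i, port=8181, timeout=15):
    """PUT one car entry; return the HTTP status, or None on no response."""
    url = f"http://{host}:{port}/restconf/config/car:cars/car-entry/{i}"
    body = json.dumps({"car-entry": [{"id": str(i)}]}).encode()
    req = urllib.request.Request(url, data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code                      # got a response, just not 2xx
    except (urllib.error.URLError, OSError):
        log.error("No response from %s", host)
        return None
```

When the endpoint stops answering, each attempt logs a "No response" error and returns None rather than raising, which matches the burst-of-errors pattern seen in the script output below.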



 Comments   
Comment by Jamo Luhrsen [ 26/Jun/18 ]

Yes, the Python script does fail in the tell-based config. The failure occurs because /restconf (which is
being used to add the cars) stops responding for some period of time.


The script output looks like:

2018-06-25 23:47:00,983 INFO: Add 10000 car(s) to 10.30.170.10:8181 (1 per request)
2018-06-25 23:47:22,807 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,810 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,813 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,823 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,831 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,836 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,837 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,843 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,845 ERROR: No response from 10.30.170.10
2018-06-25 23:47:22,847 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,828 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,828 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,829 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,837 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,847 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,857 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,863 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,864 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,864 ERROR: No response from 10.30.170.10
2018-06-25 23:47:37,865 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,853 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,858 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,859 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,863 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,866 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,875 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,888 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,889 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,891 ERROR: No response from 10.30.170.10
2018-06-25 23:47:52,891 ERROR: No response from 10.30.170.10
2018-06-25 23:48:49,774 INFO: Response code(s) got per number of requests: {99: 3, 204: 998}
2018-06-25 23:48:49,783 INFO: Response code(s) got per number of requests: {99: 3, 204: 998}
2018-06-25 23:48:49,793 INFO: Response code(s) got per number of requests: {99: 3, 204: 994}
2018-06-25 23:48:49,816 INFO: Response code(s) got per number of requests: {99: 3, 204: 998}
2018-06-25 23:48:49,817 INFO: Response code(s) got per number of requests: {99: 3, 204: 998}
2018-06-25 23:48:49,818 INFO: Response code(s) got per number of requests: {99: 3, 204: 998}
2018-06-25 23:48:49,821 INFO: Response code(s) got per number of requests: {99: 3, 204: 996}
2018-06-25 23:48:49,821 INFO: Response code(s) got per number of requests: {99: 3, 204: 1012}
2018-06-25 23:48:49,822 INFO: Response code(s) got per number of requests: {99: 3, 204: 982}
2018-06-25 23:48:49,822 INFO: Response code(s) got per number of requests: {99: 3, 204: 996}
2018-06-25 23:48:49,826 ERROR: Problems during cars' configuration appeared: {99: 30, 204: 9970}
Traceback (most recent call last):
  File "cluster_rest_script.py", line 750, in <module>
    args.itemcount, auth, args.ipr)
  File "cluster_rest_script.py", line 601, in add_car_with_retries
    raise Exception("Problems during cars' configuration appeared: " + repr(res))
Exception: Problems during cars' configuration appeared: {99: 30, 204: 9970}

So it seems there was a ~30 s window without a response: the errors cluster in three bursts at 23:47:22, 23:47:37, and 23:47:52, 15 s apart, before requests succeed again.
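The traceback above suggests the script tallies the response code of every request and raises once any non-success codes were seen. A minimal sketch of that accounting pattern (illustrative, not the real add_car_with_retries; the 99 key is assumed to be the script's sentinel for "no response"):

```python
from collections import Counter

NO_RESPONSE = 99  # assumed sentinel for a request that got no reply at all

def add_with_retries(send_one, item_count):
    """Send one request per item, tally status codes, and raise if any
    non-success code (including the no-response sentinel) was recorded."""
    codes = Counter()
    for i in range(item_count):
        try:
            codes[send_one(i)] += 1      # HTTP status, e.g. 204
        except OSError:                  # connection refused / timed out
            codes[NO_RESPONSE] += 1
    res = dict(codes)
    if set(res) - {200, 201, 204}:       # any non-success code seen
        raise Exception("Problems during cars' configuration appeared: "
                        + repr(res))
    return res
```

Under this pattern a tally like {99: 30, 204: 9970} means 30 of the 10000 requests never got a reply, which is exactly what the final Exception in the output reports.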

BTW, here are the few log lines that appear on the node being used as the endpoint
during this period:

2018-06-25T23:47:20,057 | INFO  | pipe-log:log "ROBOT MESSAGE: Starting test controller-rest-clust-cars-perf.txt.Puts During Isolation.Isolate_Current_Car_Leader" | core                             | 112 - org.apache.karaf.log.core - 4.1.5 | ROBOT MESSAGE: Starting test controller-rest-clust-cars-perf.txt.Puts During Isolation.Verify_Cars_Configured
2018-06-25T23:47:32,146 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.10:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.10:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.5:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.17:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.5:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.10:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.17:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.5:2550 Up seen=false]
2018-06-25T23:48:00,207 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-38 | ClientActorBehavior              | 196 - org.opendaylight.controller.cds-access-client - 1.4.0.SNAPSHOT | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.30.170.5:2550/user/shardmanager-config/member-1-shard-default-config#14171797], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}} reconnecting as 
Comment by Jamo Luhrsen [ 26/Jun/18 ]

example job

Generated at Wed Feb 07 19:56:35 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.