[CONTROLLER-1862] Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.99:2550 Created: 17/Sep/18  Updated: 19/Sep/18

Status: Open
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Sam Hague Assignee: Sam Hague
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

During 3-node testing using graceful start and stop, the following exception is seen. Graceful start and stop means using the bin/stop and bin/start commands to stop and start ODL rather than kill -9, so the bundles are stopped in an orderly fashion, with each bundle destroyed.

The flow: all three ODLs are up. Shut down ODL1 via bin/stop; no exceptions like those below appear. Bring ODL1 back via bin/start and wait for the cluster to sync. Then take down ODL2 via bin/stop. The exception below repeats until well after ODL2 is restarted.

It makes sense that the connection refused appears while ODL2 is down. What doesn't make sense is why the same thing didn't happen when ODL1 was taken down.

The second issue: after ODL2 was brought back, why didn't the exceptions stop as soon as the sync completed? They continued for a while after the sync was finished.
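For reference, the "wait for cluster to sync" step here is the usual CSIT-style check: poll the datastore shard manager's SyncStatus attribute over Jolokia until it reports true. A minimal Python sketch, assuming the default member MBean name and Jolokia on port 8181 (host and credentials are placeholders):

{code:python}
import time
import requests

# Config datastore shard manager MBean; the operational datastore
# exposes an analogous one (DistributedOperationalDatastore).
SYNC_MBEAN = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
              "Category=ShardManager,name=shard-manager-config")

def wait_for_sync(host, timeout=300, user="admin", password="admin"):
    """Poll SyncStatus until the member reports it is in sync."""
    url = f"http://{host}:8181/jolokia/read/{SYNC_MBEAN}"
    deadline = time.time() + timeout
    while time.time() < deadline:
        reply = requests.get(url, auth=(user, password)).json()
        if reply.get("value", {}).get("SyncStatus") is True:
            return True
        time.sleep(5)
    return False

# e.g. wait_for_sync("10.30.170.99")
{code}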

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/408/shague-haproxy-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-neon/8/odl_1/odl1_karaf.log.gz

2018-09-17T16:56:18,081 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-34 | ClusterCoreDaemon                | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.121:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.99:2550, status = Up)]. Node roles [member-1, dc-default]
2018-09-17T16:56:19,195 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | NettyTransport                   | 41 - com.typesafe.akka.slf4j - 2.5.11 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.99:2550
2018-09-17T16:56:19,198 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | ReliableDeliverySupervisor       | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.99:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.99:2550]] Caused by: [Connection refused: /10.30.170.99:2550]
2018-09-17T16:56:25,191 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | NettyTransport                   | 41 - com.typesafe.akka.slf4j - 2.5.11 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.99:2550
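The [5000] ms gate in the ReliableDeliverySupervisor message is Akka classic remoting's retry-gate-closed-for interval, which defaults to 5 s in Akka 2.5. If the retry cadence matters for this test, it can be tuned per member; a sketch, assuming the default ODL layout (configuration/initial/akka.conf):

{code}
akka {
  remote {
    # How long a failed association stays gated before new
    # connection attempts are made; 5 s is the Akka 2.5 default
    # and matches the "gated for [5000] ms" in the log above.
    retry-gate-closed-for = 5 s
  }
}
{code}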


 Comments   
Comment by Tom Pantelis [ 17/Sep/18 ]

> The flow: all three ODLs are up. Shut down ODL1 via bin/stop; no exceptions like those below appear. Bring ODL1 back via bin/start and wait for the cluster to sync. Then take down ODL2 via bin/stop. The exception below repeats until well after ODL2 is restarted.

> It makes sense that the connection refused appears while ODL2 is down. What doesn't make sense is why the same thing didn't happen when ODL1 was taken down.

Which node's log were you looking at? When ODL1 is taken down, you won't see such messages in its log - you would see them in the other nodes' logs.

> The second issue: after ODL2 was brought back, why didn't the exceptions stop as soon as the sync completed? They continued for a while after the sync was finished.

What sync are you referring to? The "Connection refused" messages stop as soon as akka establishes a connection to the node. If they continued, then the connection hadn't been established yet.
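One way to pin this down is to timestamp the first and last "Connection refused" warnings in the karaf log and compare the last one against the time sync was declared. A rough helper (hypothetical; the timestamp pattern matches the WARN format in the excerpt above):

{code:python}
import re

# Karaf log timestamps look like 2018-09-17T16:56:19,195
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})")

def refused_window(log_path, needle="Connection refused"):
    """Return (first, last) timestamps of lines containing needle."""
    stamps = []
    with open(log_path, errors="replace") as log:
        for line in log:
            if needle in line:
                match = TS.match(line)
                if match:
                    stamps.append(match.group(1))
    return (stamps[0], stamps[-1]) if stamps else None

# e.g. refused_window("odl1_karaf.log")
{code}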

Was there an actual failure scenario here that needs to be investigated, or are you just wondering about the messages?

 
