[CONTROLLER-1751] Sporadic cluster failure when member is restarted in OF cluster test Created: 17/Aug/17 Updated: 25/Jul/23 Resolved: 24/May/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Highest |
| Reporter: | Luis Gomez | Assignee: | Luis Gomez |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | karaf_member_restart.txt, karaf-1.log, karaf-2.log, karaf-3.log |
| Issue Links: | |
| External issue ID: | 9006 |
| Description |
|
Please see attached karaf log. We see this behavior sporadically at different places in the OpenFlow cluster test after we kill an instance and bring it back. Please help identify the root cause. |
| Comments |
| Comment by Luis Gomez [ 17/Aug/17 ] |
|
Attachment karaf_member_restart.txt has been added with description: karaf log for restarted member |
| Comment by Luis Gomez [ 17/Aug/17 ] |
|
WARN messages in attached karaf log:
2017-08-15 14:29:10,004 | WARN | saction-28-30'}} | DeadlockMonitor | 119 - org.opendaylight.controller.config-manager - 0.6.2.Carbon | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 274970 ms
2017-08-15 14:29:15,004 | WARN | saction-28-30'}} | DeadlockMonitor | 119 - org.opendaylight.controller.config-manager - 0.6.2.Carbon | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 279970 ms |
| Comment by Vratko Polak [ 21/Aug/17 ] |
|
A place in Robot test showing member-2 not syncing within 5 minutes: https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-carbon/733/log.html.gz#s1-s1-t34-k2-k2-k8-k1-k2-k1-k1-k6-k1 |
| Comment by Vratko Polak [ 21/Aug/17 ] |
|
This is also happening in Netconf [1], and a similar symptom is occasionally seen in Carbon [2].
[1] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-only-nitrogen/116/log.html.gz#s1-s7-t19-k2-k2-k8-k1-k2-k1-k1-k6-k1 |
| Comment by Sam Hague [ 24/Aug/17 ] |
|
Issue also seen in Netvirt csit: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/12/log.html.gz#s1-s2-s2-t7 |
| Comment by Tom Pantelis [ 24/Aug/17 ] |
|
All the shards timed out and went to Candidate and stayed there, repeatedly starting new elections. This means it could not connect to any of the other nodes. Since this is intermittent, it seems like an issue in the jenkins environment. I think the CSS timeout is a consequence of this, as we block CDS bundle startup while waiting for shards to obtain leaders. |
| Comment by Tom Pantelis [ 24/Aug/17 ] |
|
It looks like it couldn't contact any other seed node in time, so it joined itself and made itself leader. Since I assume ODL1 is the first seed node, it can do that. This wouldn't occur with the other 2 nodes.
2017-08-15 14:24:55,472 | INFO | ult-dispatcher-2 | kka://opendaylight-cluster-data) | 171 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.15.103:2550] - Node [akka.tcp://opendaylight-cluster-data@10.29.15.103:2550] is JOINING, roles [member-1] |
| Comment by Luis Gomez [ 24/Aug/17 ] |
|
Tom, this is weird because all we do in these tests is kill the java process running the karaf container and then start karaf again. So if there were a network issue, it should be present all the time, not just at the moment of the test. |
| Comment by Tom Pantelis [ 24/Aug/17 ] |
|
Yeah, it seems weird. I'm just noting what I saw in the log - ODL1 joined itself, which means it didn't connect to another node within a period of time. I don't know why. We don't have the logs from the other nodes. Perhaps something in those logs might yield something, e.g. maybe they refused to let ODL1 join for some reason. Or maybe the seed-node-timeout needs to be increased. |
| Comment by Luis Gomez [ 24/Aug/17 ] |
|
Looking at these recent logs: https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-nitrogen/118/odl1_karaf.log.gz
The restarted member 2 starts listening on 2550 at 14:27:58:
2017-08-23 14:27:58,670 | INFO | ult-dispatcher-3 | Remoting | 41 - com.typesafe.akka.slf4j - 2.4.18 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@10.29.14.87:2550]
The other member 1 sends connection attempts every 5 secs:
2017-08-23 14:27:53,898 | WARN | lt-dispatcher-48 | ReliableDeliverySupervisor | 41 - com.typesafe.akka.slf4j - 2.4.18 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.14.87:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.29.14.87:2550]] Caused by: [Connection refused: /10.29.14.87:2550]
But at almost the same time member 2 starts listening on 2550, I see these errors in member 1 and no more connection attempts are being sent:
2017-08-23 14:27:57,593 | INFO | lt-dispatcher-48 | kka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550 -> akka.tcp://opendaylight-cluster-data@10.29.14.87:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.29.15.229:2550 -> akka.tcp://opendaylight-cluster-data@10.29.14.87:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.29.14.87:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.29.15.229:2550 Up seen=true]
2017-08-23 14:28:57,606 | INFO | ult-dispatcher-4 | kka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550 -> akka.tcp://opendaylight-cluster-data@10.29.14.87:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.29.15.229:2550 -> akka.tcp://opendaylight-cluster-data@10.29.14.87:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.29.14.87:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.29.15.229:2550 Up seen=true] |
| Comment by Tom Pantelis [ 24/Aug/17 ] |
|
"Leader can currently not perform its duties" is normal when a node becomes unreachable. By duties, it means it can't let any new nodes join the cluster until previously unreachable re-connect/re-join or are "downed" and removed from the cluster.. On member-2, it restarted around 2017-08-23 14:27:45: 2017-08-23 14:28:22,849 | WARN | lt-dispatcher-23 | JoinSeedNodeProcess | 41 - com.typesafe.akka.slf4j - 2.4.18 | Couldn't join seed nodes after [2] attmpts, will try again. seed-nodes=[akka.tcp://opendaylight-cluster-data@10.29.12.157:2550, akka.tcp://opendaylight-cluster-data@10.29.15.229:2550] This continued for 87 attempts until it apparently gave up: 2017-08-23 14:45:27,598 | WARN | lt-dispatcher-21 | ReliableDeliverySupervisor | 41 - com.typesafe.akka.slf4j - 2.4.18 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 2017-08-23 14:45:46,975 | WARN | lt-dispatcher-26 | NettyTransport | 41 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to null failed with java.net.ConnectException: Connection refused: /10.29.12.157:2550 2017-08-23 14:45:46,983 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 41 - com.typesafe.akka.slf4j - 2.4.18 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550]] Caused by: [Connection refused: /10.29.12.157:2550] Then it finally connects: 2017-08-23 14:45:58,855 | INFO | ult-dispatcher-6 | kka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.14.87:2550] - Welcome from [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] , address: akka.tcp://opendaylight-cluster-data@10.29.12.157:2550 On member-1, it looks like 10.29.15.229 (ODL3 I assume) was also stopped: 2017-08-23 14:37:51,592 | WARN | lt-dispatcher-20 | ClusterCoreDaemon | 41 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.229:2550, status = Up)]. Node roles [member-1] In that case, 2 nodes were down so both nodes have to re-connect or be "downed" before it allows any to join. The latter happened for ODL3: 2017-08-23 14:40:18,419 | INFO | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.12.157:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.29.15.229:2550] as [Down] Did the test manually down it? So it looks like the test stops 2 of the nodes. I'm not sure what the expectations are. |
| Comment by Luis Gomez [ 24/Aug/17 ] |
|
Hi Tom, you are right, we take down member-3, but that happens 10 mins (14:37:42) after we start member-2 (14:27:45). In our test we give a member at most 5 mins to boot up, join the cluster and load all features. Do you think this is not enough? Because when the test passes, it only takes 30 secs to go through the above. |
| Comment by Tom Pantelis [ 24/Aug/17 ] |
|
You can try giving it more time, as it did eventually connect, although it took over 15 min. Maybe there's something in the CSIT env (it gets busy?) where network communications get bogged down sometimes. member-2 was continuously trying to connect but was getting "Connection refused" for some reason until it finally connected. |
| Comment by Luis Gomez [ 24/Aug/17 ] |
|
OK, I think as next step I can try to see if this reproduces outside CI. |
| Comment by Robert Varga [ 24/Aug/17 ] |
|
Preliminary target set at Carbon SR3, pending netvirt CSIT ( |
| Comment by Sam Hague [ 25/Aug/17 ] |
|
NetVirt is hitting this exact same issue. NetVirt tests copied the openflowplugin test pattern to take a node down and bring it back, then wait 5 minutes. What I don't understand is why taking 1 node out of the three down leads to instability. We have three nodes in the cluster: take 1 down, leave the other 2 alone, attempt to bring back the 1 node, wait 5 minutes; that fails and now the cluster is in a bad state, causing the further tests to fail. Sometimes there are random failures where a node does not come back properly, such as in job [4]. We try to bring ODL1 back into the cluster but it fails to come back within 5 minutes. Then we move to the next tests and they fail. That ODL1 is hitting the below issue. Is there anything we can do to get past that? We can increase the timeout, but why is the cluster in a bad shape? I don't think the infra is loaded, since everything else is moving along properly - the robot vm is driving the other two nodes. We can also see odl1 restarting but taking its time in the failing case.
2017-08-25 02:02:38,430 | WARN | saction-32-34'}} | DeadlockMonitor | 126 - org.opendaylight.controller.config-manager - 0.6.2.SNAPSHOT | ModuleIdentifier {factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 284864 ms |
| Comment by Tom Pantelis [ 25/Aug/17 ] |
|
I assume this is intermittent and not every run? Based on analysis of the controller test run logs that Luis provided, it's due to connectivity issues between the nodes, for whatever reason. I would suggest manually running the tests outside CI and seeing if it reproduces, as Luis noted. Or manually going through the steps yourself. |
| Comment by Luis Gomez [ 27/Aug/17 ] |
|
Attachment karaf-1.log has been added with description: Member 1 restarting |
| Comment by Luis Gomez [ 27/Aug/17 ] |
|
Attachment karaf-2.log has been added with description: Member 2 cannot connect to member 1 |
| Comment by Luis Gomez [ 27/Aug/17 ] |
|
Attachment karaf-3.log has been added with description: Member 3 cannot connect to member 1 |
| Comment by Luis Gomez [ 27/Aug/17 ] |
|
I could reproduce the original issue in my local setup (see the last 3 attached logs) after 9 tries at killing/restarting the owner instance. From the logs it seems like the restarting instance (member-1) can connect to the other 2, but these cannot connect back to member-1. I could also verify the restarting instance did not open the akka port:
telnet 127.0.0.1 2550
Even though you see in the log:
2017-08-27 21:32:00,780 | INFO | ult-dispatcher-2 | Remoting | 179 - com.typesafe.akka.slf4j - 2.4.18 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@192.168.0.101:2550]
So far this seems like a cluster-related issue rather than an infra issue. |
| Comment by Luis Gomez [ 27/Aug/17 ] |
|
This is the same issue that originated this bug, captured in this run: same behavior, member-2 restarts at 2:25:07 and is never reachable over akka from the other members. |
| Comment by Kit Lou [ 08/Sep/17 ] |
|
Tom, Can you look at the additional data Luis added and provide input as to what we can do about this problem? As it stands, this issue will block the Nitrogen release. Thanks! |
| Comment by Tom Pantelis [ 08/Sep/17 ] |
|
I'll take a look, but I've already looked at the other logs and can't ascertain why it can't connect - I don't know if it's an actual network issue or something fluky in akka. I don't anticipate this run being any different. Perhaps enabling akka debug may help. In one run I looked at, it did eventually connect but not in time for the test. Luis mentioned he would increase the timeout. In another run, ODL1 couldn't connect to the other nodes, so it joined itself. As I mentioned, we can try increasing the seed-node-timeout in the akka.conf for the tests. Either way I don't think this needs to block Nitrogen. There's nothing that could've been introduced in Nitrogen that could've caused this. As I've mentioned, it may be the environment. |
| Comment by Tom Pantelis [ 08/Sep/17 ] |
|
(In reply to Luis Gomez from comment #21) In this case, member-1 apparently couldn't connect to another node in time, so it joined itself about 23s later and formed a single-node island:
2017-08-27 21:32:23,944 | INFO | ult-dispatcher-5 | kka://opendaylight-cluster-data) | 179 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.0.101:2550] - Node [akka.tcp://opendaylight-cluster-data@192.168.0.101:2550] is JOINING, roles [member-1]
Since member-1 is the first seed node, which has special semantics, it is allowed to do that. Looking at member-2, member-1 did initially connect and try to join, but member-2 has to first "down" it and remove the old incarnation from the cluster before it allows the new incarnation to join, which it did:
17-08-27 21:32:01,291 | INFO | ult-dispatcher-4 | kka://openaylight-cluster-data) | 179 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@192.168.0.102:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@192.168.0.101:2550, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
There's no other log activity for 192.168.0.101 until about 5 min later, when akka reports "Connection refused" after member-1 was stopped again.
2017-08-27 21:36:49,429 | INFO | lt-dispatcher-22 | Remoting | 179 - com.typesafe.akka.slf4j - 2.4.18 | Quarantined address [akka.tcp://opendaylight-cluster-data@192.168.0.101:2550] is still unreachable or has not been restarted. Keeping it quarantined.
From my understanding, member-1 should retry the connection every second up to the seed-node-timeout (which we set to 12s by default), and member-2 should eventually let it join if it is able to connect. For some reason it wasn't able to reconnect/join within that period of time. Gary Wu saw this behavior a while back and opened an issue with akka: https://github.com/akka/akka/issues/18757/. It turned out the default akka seed-node-timeout of 5s was sometimes too low in his docker environment; increasing it to 12s alleviated the problem, and we made that our default. As Gary mentioned, in his environment "dnode0's startup creates a heavy load, so it's possible that dnode0 might be affecting the responsiveness of dnode1 and dnode2.". It may be that even 12s is sometimes too low in the jenkins test environment. I would suggest increasing it even more for the test, try 30s or even 60s. |
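For reference, raising that timeout is an akka.conf change along these lines - a minimal sketch only: the odl-cluster-data wrapper reflects ODL's stock akka.conf layout, and 30s is just the experimental value suggested above, not a verified fix:
odl-cluster-data {
  akka {
    cluster {
      # ODL ships 12s by default per the discussion above; raising it is the suggested experiment
      seed-node-timeout = 30s
    }
  }
}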
| Comment by Tom Pantelis [ 08/Sep/17 ] |
|
(In reply to Luis Gomez from comment #22) In this case member-2 is not the first seed node, so it has to join with another node and cannot join itself. We see that it tried to connect/join with another node 23 times over 5 min until apparently robot shut it down:
2017-08-14 02:30:08,093 | WARN | ult-dispatcher-5 | JoinSeedNodeProcess | 171 - com.typesafe.akka.slf4j - 2.4.18 | Couldn't join seed nodes after [23] attmpts, will try again. seed-nodes=[akka.tcp://opendaylight-cluster-data@10.29.15.94:2550, akka.tcp://opendaylight-cluster-data@10.29.15.142:2550]
On member-1, during that time we see it try to connect to member-2 every 5s, but it fails with "Connection refused":
2017-08-14 02:25:15,880 | WARN | lt-dispatcher-23 | ReliableDeliverySupervisor | 171 - com.typesafe.akka.slf4j - 2.4.18 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.12.200:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.29.12.200:2550]] Caused by: [Connection refused: /10.29.12.200:2550]
So it appears neither side was able to connect. It seems to me the environment was overloaded, unless there's some bug in akka, which is always possible but just doesn't seem likely, as establishing TCP connections is pretty basic and "Connection refused" emanates from the lower-level TCP/socket layer. In a previous run we saw a node take about 17 min before it finally connected, as I noted earlier. |
| Comment by Luis Gomez [ 08/Sep/17 ] |
|
Yes, I believe it is a TCP socket problem, as when I reproduced it locally the TCP port 2550 never came up for the restarting instance. It could be an AKKA issue, as I guess this library is ultimately opening/using the port. Tom, is there any DEBUG to troubleshoot this further? |
| Comment by Tom Pantelis [ 08/Sep/17 ] |
|
(In reply to Luis Gomez from comment #27) Akka reports it successfully opened the port and I doubt there could be a bug there - that's pretty basic and I've never seen an issue with that. Maybe the networking layer in the VM environment is overloaded, or threads/timers get delayed... The fact that akka keeps trying to periodically connect over minutes makes an akka issue seem less likely, at least to me. I could see maybe a timing bug or something causing one attempt to fail, but many...? You can set akka.loglevel=DEBUG in the akka.conf and also |
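The rest of the suggestion is cut off above; as an assumption, the standard akka.conf knobs for this kind of debugging look roughly like the sketch below (these are stock akka settings, not confirmed as the exact ones Tom meant):
odl-cluster-data {
  akka {
    loglevel = "DEBUG"
    remote {
      # log association (connection) lifecycle events from akka-remote
      log-remote-lifecycle-events = on
    }
  }
}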
| Comment by Luis Gomez [ 11/Sep/17 ] |
|
OK, I tried to reproduce this one in my local setup, and this time it did not fail over the many attempts I made. So this puts the issue back on the CI, where it happens sporadically but still only after a few tries. Because of this I think we can downgrade this to CRITICAL for now. |
| Comment by Luis Gomez [ 21/Sep/17 ] |
|
FYI this issue is also present in Boron, so in all branches:
Unfortunately it is difficult to reproduce outside ODL CI, so we cannot rule out a test environment problem. |
| Comment by Vratko Polak [ 21/Sep/17 ] |
|
Some time ago my favorite explanation was that there is a subtle bug in Blueprint definitions somewhere in the clustering features, which needs a specific timing to result in this Bug. If that is the case (and we still have trouble reproducing the failure reliably in CSIT), perhaps it would be a good idea to fix other Bugs, hoping this one would get fixed together with the other bug. SingleFeatureTest should be verifying blueprint, but some features are blacklisted, some of them related to clustering. Specifically, Odlparent 2.0.4 skips [8]: odl-mdsal-broker-local, odl-mdsal-clustering-commons, odl-mdsal-distributed-datastore, and odl-mdsal-remoterpc-connector. (This list is strangely similar to the features affected by [9], no idea if there is a causal relation.) There is already
[8] https://github.com/opendaylight/odlparent/blob/v2.0.4/features-test/src/main/java/org/opendaylight/odlparent/featuretest/SingleFeatureTest.java#L404-L409 |
| Comment by Tom Pantelis [ 21/Sep/17 ] |
|
This is an issue with akka sometimes not being able to connect to other nodes or, at least, taking longer than expected. From all the logs I've looked at, it appears to be at the TCP level. It is not related to blueprint wiring or anything in ODL code. As Luis mentioned, it is also present in Boron, which indicates it is not tied to any akka version or the karaf 4 upgrade. From everything I can tell, it seems related to the CSIT env, i.e. intermittent load. I would suggest adjusting timeouts to take that into account. |
| Comment by Luis Gomez [ 27/Sep/17 ] |
|
OK, I am trying to reproduce the issue in the sandbox so I can apply some of the suggested cluster configurations, but so far I am also having trouble reproducing it there. |
| Comment by Vratko Polak [ 28/Sep/17 ] |
|
> akka sometimes not being able to connect to other nodes or,
Is there a logging setting which would enable us to compare such connect timings (without spamming karaf.log too much)? |
| Comment by Luis Gomez [ 12/Oct/17 ] |
|
After adding some network connection printouts (netstat) to the test, I found a pattern for the failure: every time a restarting member fails to join, I can see there are 2 connections (one inbound and one outbound) established between the restarting member and the same remote member (see the first 2 lines below):
tcp6 0 0 10.29.15.190:2550 10.29.15.107:58300 ESTABLISHED 12020/java
tcp6 0 0 10.29.15.190:47210 10.29.15.113:2550 ESTABLISHED 12020/java
This is weird, as normally 1 connection (inbound or outbound) is enough to set up the akka association. I am still investigating and adding more printouts to understand how this double connection gets generated. |
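The exact netstat invocation added to the test is not shown in the ticket; a plausible equivalent that produces lines like the ones above would be:
netstat -tnp | grep 2550
i.e. list established TCP connections (with the owning process) and filter on the akka remoting port on each member.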
| Comment by Luis Gomez [ 18/Oct/17 ] |
|
Hi Tom, from the netstat printouts for a recent failure, it seems clear there is a relation between duplicated akka connections and the issue:
Failing restarting instance (member-1):
Other instance (member-2):
Other instance (member-3):
Also looking at the logs, it seems member-2 and member-3 try to connect to the restarting instance at the same time the tcp socket becomes available on member-1:
2017-10-17 09:45:26,352 | INFO | ult-dispatcher-2 | Remoting | 41 - com.typesafe.akka.slf4j - 2.4.18 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@10.29.15.220:2550]
So is there any possibility of a race condition in the restarting instance, so that it accepts both connections from member-2 & member-3 and at the same time creates different connections to member-2 and member-3 to join the cluster? In such a scenario, is it possible member-1 sends and expects messages on one connection while member-2 and member-3 send and expect messages on the other connection? |
| Comment by Tom Pantelis [ 18/Oct/17 ] |
|
I don't know. I haven't studied akka code to that detail. I would suggest opening a case with akka to answer your questions. |
| Comment by Luis Gomez [ 18/Oct/17 ] |
|
Well, it is not yet clear whether the double connection is the origin or a consequence of the issue. I think the next step will be to enable some akka debug in CSIT. In my local setup I changed the akka.conf as instructed:
loglevel = "DEBUG" }
But I do not see any DEBUG messages in karaf.log, if this is the place where the akka log is dumped; maybe I need to enable some DEBUG in the karaf logger itself. Any idea what is missing? |
| Comment by Vratko Polak [ 18/Oct/17 ] |
|
> I need to enable some DEBUG in the karaf logger itself Yes. See [12]. |
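For completeness, a sketch of the Karaf-side step being referred to ([12] is the authoritative reference; the "akka" category name is an assumption based on akka's slf4j logger names): from the Karaf console run
log:set DEBUG akka
in addition to keeping loglevel = "DEBUG" inside the akka section of akka.conf, since both akka (the emitter) and the pax-logging category need to allow DEBUG before the messages reach karaf.log.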
| Comment by Luis Gomez [ 19/Oct/17 ] |
|
Thanks Vratko, it works but unfortunately the akka messaging debug is very chatty so I will start with some other debug and see what happens. |
| Comment by Luis Gomez [ 01/Nov/17 ] |
|
After adding the DEBUGs, I can confirm the issue happens when the restarting member and the cluster leader initiate the AKKA connection at the same time. The restarting member initiates the connection as part of the boot, while the cluster leader is retrying every 5 secs. Can we do something in our code to handle this conflict? If not, I guess I will open an issue in akka. Tom, which version of akka are we using in ODL? |
| Comment by Luis Gomez [ 01/Nov/17 ] |
|
Node 3 (restarting member) log in [1]:
2017-10-31 22:47:55,553 | DEBUG | lt-dispatcher-25 | AkkaProtocolManager | 41 - com.typesafe.akka.slf4j - 2.4.18 | now supervising Actor[akka://opendaylight-cluster-data/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fopendaylight-cluster-data%4010.29.14.65%3A2550-1#175430158]
2017-10-31 22:47:55,554 | DEBUG | lt-dispatcher-25 | AkkaProtocolManager | 41 - com.typesafe.akka.slf4j - 2.4.18 | now supervising Actor[akka://opendaylight-cluster-data/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fopendaylight-cluster-data%4010.29.13.208%3A2550-2#-2037739644]
2017-10-31 22:47:55,590 | INFO | rint Extender: 3 | AbstractDataStore | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.6.1.SNAPSHOT | Creating ShardManager : shardmanager-config
2017-10-31 22:47:55,596 | DEBUG | ult-dispatcher-2 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | started (akka.remote.transport.ProtocolStateActor@35fbf1c1)
2017-10-31 22:47:55,597 | DEBUG | ult-dispatcher-2 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | started (akka.remote.transport.ProtocolStateActor@519a5d1b)
2017-10-31 22:47:55,597 | DEBUG | ult-dispatcher-2 | AkkaProtocolManager | 41 - com.typesafe.akka.slf4j - 2.4.18 | now supervising Actor[akka://opendaylight-cluster-data/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fopendaylight-cluster-data%4010.29.14.65%3A59194-3#1453714731]
2017-10-31 22:47:55,597 | DEBUG | ult-dispatcher-2 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | started (akka.remote.transport.ProtocolStateActor@76642d3b)
2017-10-31 22:47:55,601 | DEBUG | lt-dispatcher-25 | AkkaProtocolManager | 41 - com.typesafe.akka.slf4j - 2.4.18 | now supervising Actor[akka://opendaylight-cluster-data/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fopendaylight-cluster-data%4010.29.13.208%3A42796-4#562821358]
2017-10-31 22:47:55,601 | DEBUG | lt-dispatcher-25 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | started (akka.remote.transport.ProtocolStateActor@79b35ebe)
2017-10-31 22:47:55,675 | DEBUG | lt-dispatcher-25 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | stopped
Node 2 (cluster leader) log in [2]:
2017-10-31 22:47:55,587 | DEBUG | ult-dispatcher-6 | AkkaProtocolManager | 41 - com.typesafe.akka.slf4j - 2.4.18 | now supervising Actor[akka://opendaylight-cluster-data/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fopendaylight-cluster-data%4010.29.15.87%3A2550-6#-1931382701]
2017-10-31 22:47:55,587 | DEBUG | ult-dispatcher-6 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | started (akka.remote.transport.ProtocolStateActor@6a4fb40c)
2017-10-31 22:47:55,595 | DEBUG | ult-dispatcher-5 | AkkaProtocolManager | 41 - com.typesafe.akka.slf4j - 2.4.18 | now supervising Actor[akka://opendaylight-cluster-data/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fopendaylight-cluster-data%4010.29.15.87%3A53374-7#691018071]
2017-10-31 22:47:55,596 | DEBUG | ult-dispatcher-5 | ProtocolStateActor | 41 - com.typesafe.akka.slf4j - 2.4.18 | started (akka.remote.transport.ProtocolStateActor@5146512)
[1] https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-3node-clustering-only-nitrogen/208/odl3_karaf.log.gz |
| Comment by Luis Gomez [ 10/Nov/17 ] |
|
FYI I opened this issue in AKKA: |
| Comment by OpenDaylight Release [ 03/May/18 ] |
|
Since the bug is unassigned I'm currently assigning it to you. Please assign to the relevant person. |
| Comment by Robert Varga [ 03/May/18 ] |
|
Oxygen SR1 and Fluorine are using akka_2.12-2.5.11, scala 2.12.5. |
| Comment by Luis Gomez [ 03/May/18 ] |
|
This was identified as an AKKA issue; I will recheck on the mentioned branches. |
| Comment by Luis Gomez [ 24/May/18 ] |
|
It looks like the issue is fixed after the last AKKA update: |