[CONTROLLER-1546] Operations Failed after Failover with exceptions and Errors Created: 09/Sep/16 Updated: 25/Jul/23 Resolved: 11/Oct/16 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Venkatrangan Govindarajan | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 6686 |
| Description |
|
Operations:
1. Set up a 3-node ODL cluster (feature: odl-ovsdb-openstack, aka Legacy Netvirt).
2. Attempted failover in this sequence:
   ODL1 down, ODL2 up, ODL3 up
   ODL1 up, ODL2 down, ODL3 up --> operations started to fail here
   ODL1 up, ODL2 up, ODL3 up --> no recovery

At this point ODL3 was listed as "owner" for all entities, and all operations were failing. Checking the logs on ODL2 and ODL3 indicated a lot of blueprint errors and some WARN entries indicating problems with remoterpc:

2016-09-09 21:48:21,009 | WARN | ult-dispatcher-2 | RpcRegistry | 168 - org.opendaylight.controller.sal-remoterpc-connector - 1.4.0.Boron | Timed out finding routers for RouteIdentifierImpl{context=null, type=(urn:opendaylight:packet:service?revision=2013-07-09)transmit-packet, route=/(urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node[{(urn:opendaylight:inventory?revision=2013-08-19)id=openflow:92945849353687}]} |
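For context on the RpcRegistry warning above: transmit-packet is a routed RPC, so an implementation must be registered against the specific inventory node path before remote invocations can be routed to it. Below is a minimal sketch, assuming the Boron-era controller binding API, of how a provider registers such a path; the class name PacketServiceRegistrar and the rpcRegistry/impl parameters are illustrative, and only the node id is taken from the log line above.

    // Hypothetical sketch: registering a routed RPC implementation for a
    // specific inventory node, so RpcRegistry can find a "router" for it.
    import org.opendaylight.controller.sal.binding.api.BindingAwareBroker.RoutedRpcRegistration;
    import org.opendaylight.controller.sal.binding.api.RpcProviderRegistry;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.NodeContext;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.NodeId;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.NodeKey;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.packet.service.rev130709.PacketProcessingService;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    public class PacketServiceRegistrar {
        public RoutedRpcRegistration<PacketProcessingService> register(
                RpcProviderRegistry rpcRegistry, PacketProcessingService impl) {
            RoutedRpcRegistration<PacketProcessingService> reg =
                    rpcRegistry.addRoutedRpcImplementation(PacketProcessingService.class, impl);
            // Path matching the route in the warning; the id is copied from the log.
            InstanceIdentifier<Node> path = InstanceIdentifier.builder(Nodes.class)
                    .child(Node.class, new NodeKey(new NodeId("openflow:92945849353687")))
                    .build();
            // If no member has done this, or the registration has not yet
            // propagated to the calling node, callers eventually see
            // "Timed out finding routers for ...transmit-packet".
            reg.registerPath(NodeContext.class, path);
            return reg;
        }
    }

If a registration like this was dropped during the ODL2 shutdown and never re-established, the timeout in the description would be the expected symptom.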
| Comments |
| Comment by Venkatrangan Govindarajan [ 09/Sep/16 ] |
|
Attachment odl1_log.tgz has been added with description: ODL1 Logs |
| Comment by Venkatrangan Govindarajan [ 09/Sep/16 ] |
|
Attachment odl2_log.tgz has been added with description: ODL2 logs |
| Comment by Venkatrangan Govindarajan [ 09/Sep/16 ] |
|
Attachment odl3_log.tgz has been added with description: ODL3 logs |
| Comment by Venkatrangan Govindarajan [ 09/Sep/16 ] |
|
Logs uploaded. Please check the blueprint errors in ODL2 and the remoterpc errors in ODL3. |
| Comment by Venkatrangan Govindarajan [ 09/Sep/16 ] |
|
The ODL1 log has some entries with "NullPointerException" during startup, mostly when invoking the entity ownership API. |
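For reference, entity ownership calls of the kind mentioned would look roughly like the sketch below, assuming the Boron-era controller EOS API; the class name OwnershipCheck and the entity type/name are illustrative, not taken from the logs. During startup the ownership state may not be available yet, so callers that do not guard the result can fail in exactly that window.

    // Hypothetical sketch of Boron-era entity ownership usage.
    import com.google.common.base.Optional;
    import org.opendaylight.controller.md.sal.common.api.clustering.CandidateAlreadyRegisteredException;
    import org.opendaylight.controller.md.sal.common.api.clustering.Entity;
    import org.opendaylight.controller.md.sal.common.api.clustering.EntityOwnershipCandidateRegistration;
    import org.opendaylight.controller.md.sal.common.api.clustering.EntityOwnershipService;
    import org.opendaylight.controller.md.sal.common.api.clustering.EntityOwnershipState;

    public class OwnershipCheck {
        public boolean registerAndCheck(EntityOwnershipService eos) {
            Entity entity = new Entity("ovsdb", "ovsdb://example"); // illustrative values
            try {
                EntityOwnershipCandidateRegistration reg = eos.registerCandidate(entity);
                // Keep 'reg' and close it on shutdown to withdraw the candidate.
            } catch (CandidateAlreadyRegisteredException e) {
                // This node already registered a candidate; safe to ignore here.
            }
            // At startup the state may be absent; guard it instead of
            // dereferencing the result unconditionally.
            Optional<EntityOwnershipState> state = eos.getOwnershipState(entity);
            return state.isPresent() && state.get().isOwner();
        }
    }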
| Comment by Venkatrangan Govindarajan [ 12/Sep/16 ] |
|
Could not reproduce the failure, but the ERRORs and exceptions are still seen in the logs. |
| Comment by Tom Pantelis [ 16/Sep/16 ] |
|
There are many NPEs in ForwardingRulesManagerImpl:

2016-09-09 20:39:10,864 | ERROR | on-dispatcher-31 | DataTreeChangeListenerActor | 170 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.Boron | Error notifying listener org.opendaylight.controller.md.sal.binding.impl.BindingClusteredDOMDataTreeChangeListenerAdapter@2a4f203b

Also a few in NeutronNetworkChangeListener:

2016-09-09 20:39:12,906 | ERROR | on-dispatcher-38 | DataChangeListener | 170 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.Boron | Error notifying listener org.opendaylight.netvirt.openstack.netvirt.translator.iaware.impl.NeutronNetworkChangeListener

These should be looked at by someone familiar with that code; I don't know the impact.

On ODL3, I see a couple of transaction failures and a read failure ("Failure to delete ovsdbNode") around 2016-09-09 18:15:15 with the message "Metadata not available", which indicates it tried to delete/read a node that doesn't exist. I don't see any blueprint errors on ODL2, but any blueprint issues would occur on startup.

The RpcRegistry warning means a client tried to send a routed RPC but there's no implementation registered for the routed node path. That could be because there was a prior registration that was unregistered, there never was a registration, or there is a registration on a remote controller node that hasn't propagated to the calling controller node yet. There is a 5-second wait for convergence, hence the "Timed out finding routers" message.

Wrt the EOS, member-2 became the shard leader originally and transferred leadership to member-3 when it was shut down at 2016-09-09 20:47:54. Since both member-1 and member-2 had been shut down and restarted, I would expect member-3 to be the owner of the entities. Interestingly, 2016-09-09 20:13:01,009 is the last timestamp in the logs for ODL3.

I'm unclear as to the exact issue observed as I'm not familiar with ovsdb, and I'm also unclear as to whether there's any issue with clustering here. As I mentioned, the EOS behavior looks correct, but without EOS debug enabled I can't tell for sure. The NPEs mentioned above could be significant. I would suggest having the ovsdb folks take a look. |
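Regarding the listener NPEs: a hedged sketch of a defensively written clustered data tree change listener, assuming the Boron-era controller binding API, is below. The class name FlowNodeListener and the chosen path are illustrative. One common cause of "Error notifying listener" NPEs is a listener that dereferences getDataAfter() without checking for null, which is what a deletion produces; whether that is the cause here would need the actual stack traces.

    // Hypothetical sketch of a clustered DTCL that guards against null
    // before/after data; class name and path are illustrative.
    import java.util.Collection;
    import javax.annotation.Nonnull;
    import org.opendaylight.controller.md.sal.binding.api.ClusteredDataTreeChangeListener;
    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.DataTreeIdentifier;
    import org.opendaylight.controller.md.sal.binding.api.DataTreeModification;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes;
    import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    public class FlowNodeListener implements ClusteredDataTreeChangeListener<Node> {
        public void register(DataBroker broker) {
            // Wildcarded path: listen on every inventory node.
            InstanceIdentifier<Node> path =
                    InstanceIdentifier.builder(Nodes.class).child(Node.class).build();
            broker.registerDataTreeChangeListener(
                    new DataTreeIdentifier<>(LogicalDatastoreType.OPERATIONAL, path), this);
        }

        @Override
        public void onDataTreeChanged(@Nonnull Collection<DataTreeModification<Node>> changes) {
            for (DataTreeModification<Node> change : changes) {
                Node after = change.getRootNode().getDataAfter();
                if (after == null) {
                    // Node was deleted (e.g. around a controller restart);
                    // dereferencing it here is the kind of bug that surfaces as
                    // "Error notifying listener" in DataTreeChangeListenerActor.
                    continue;
                }
                // ... handle the added/updated node ...
            }
        }
    }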
| Comment by Tom Pantelis [ 16/Sep/16 ] |
|
From the logs, the sequence of restarts is: 2016-09-09 18:13 - all nodes up |
| Comment by Ananthi Palaniswamy [ 19/Sep/16 ] |
|
Followed the steps mentioned above. |
| Comment by Tom Pantelis [ 11/Oct/16 ] |
|
Closing this as it's not reproducible, and whatever issue there was doesn't seem to be related to clustering based on analysis of the logs. As mentioned before, the NPEs emanating from ovsdb look ominous and may have been the cause, so I would suggest creating a bug in that project. |