[CONTROLLER-1546] Operations Failed after Failover with exceptions and Errors Created: 09/Sep/16  Updated: 25/Jul/23  Resolved: 11/Oct/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Venkatrangan Govindarajan Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File odl1_log.tgz     File odl2_log.tgz     File odl3_log.tgz    
External issue ID: 6686

 Description   

Image used (RC3.3): https://nexus.opendaylight.org/content/repositories/autorelease-1484/org/opendaylight/integration/distribution-karaf/0.5.0-Boron/distribution-karaf-0.5.0-Boron.zip

Operations:

1. Set up a 3-node ODL cluster (feature: odl-ovsdb-openstack, aka legacy NetVirt)

2. Attempted failover in this sequence:

ODL1 down, ODL2 up, ODL3 up

ODL1 up, ODL2 down, ODL3 up

--> Operations started to fail here

ODL1 up, ODL2 up, ODL3 up

--> no recovery

At this point ODL3 was listed as "owner" for all entities, and all operations were failing. Checking the logs on ODL2 and ODL3 indicated a lot of blueprint errors and some WARN messages indicating problems with remoterpc:

2016-09-09 21:48:21,009 | WARN | ult-dispatcher-2 | RpcRegistry | 168 - org.opendaylight.controller.sal-remoterpc-connector - 1.4.0.Boron | Timed out finding routers for RouteIdentifierImpl{context=null, type=(urn:opendaylight:packet:service?revision=2013-07-09)transmit-packet, route=/(urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node[{(urn:opendaylight:inventory?revision=2013-08-19)id=openflow:92945849353687}]}

(The same WARN message is repeated many times in the log.)



 Comments   
Comment by Venkatrangan Govindarajan [ 09/Sep/16 ]

Attachment odl1_log.tgz has been added with description: ODL1 Logs

Comment by Venkatrangan Govindarajan [ 09/Sep/16 ]

Attachment odl2_log.tgz has been added with description: ODL2 logs

Comment by Venkatrangan Govindarajan [ 09/Sep/16 ]

Attachment odl3_log.tgz has been added with description: ODL3 logs

Comment by Venkatrangan Govindarajan [ 09/Sep/16 ]

Logs uploaded. Please check the blueprint errors in ODL2 and the remoterpc errors in ODL3.

Comment by Venkatrangan Govindarajan [ 09/Sep/16 ]

The ODL1 log has some entries with NullPointerException during startup, mostly when invoking the entity ownership API.

Comment by Venkatrangan Govindarajan [ 12/Sep/16 ]

Could not reproduce the failure, but the ERRORs and exceptions are still seen in the logs.

Comment by Tom Pantelis [ 16/Sep/16 ]

There are many NPEs in ForwardingRulesManagerImpl:

2016-09-09 20:39:10,864 | ERROR | on-dispatcher-31 | DataTreeChangeListenerActor | 170 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.Boron | Error notifying listener org.opendaylight.controller.md.sal.binding.impl.BindingClusteredDOMDataTreeChangeListenerAdapter@2a4f203b
java.lang.NullPointerException
at org.opendaylight.openflowplugin.applications.frm.impl.ForwardingRulesManagerImpl.isNodeOwner(ForwardingRulesManagerImpl.java:261)[281:org.opendaylight.openflowplugin.applications.forwardingrules-manager:0.3.0.Boron]
at org.opendaylight.openflowplugin.applications.frm.impl.AbstractListeningCommiter.preConfigurationCheck(AbstractListeningCommiter.java:111)[281:org.opendaylight.openflowplugin.applications.forwardingrules-manager:0.3.0.Boron]
at org.opendaylight.openflowplugin.applications.frm.impl.AbstractListeningCommiter.onDataTreeChanged(AbstractListeningCommiter.java:53)[281:org.opendaylight.openflowplugin.applications.forwardingrules-manager:0.3.0.Boron]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41)[141:org.opendaylight.controller.sal-binding-broker-impl:1.4.0.Boron]
at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:55)[170:org.opendaylight.controller.sal-distributed-datastore:1.4.0.Boron]
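
The NPE at isNodeOwner() suggests the per-node ownership state in FRM hadn't been initialized yet (or the node id was null) when the data tree change fired. As a sketch only, with hypothetical names rather than the actual FRM code, a defensive guard of this shape would avoid the crash:

import java.util.concurrent.ConcurrentHashMap;

public class OwnershipCache {
    // Hypothetical ownership cache; the real FRM keeps equivalent
    // per-node ownership state that appears to be unset during startup.
    private final ConcurrentHashMap<String, Boolean> ownerByNodeId = new ConcurrentHashMap<>();

    public boolean isNodeOwner(final String nodeId) {
        if (nodeId == null) {
            return false; // never dereference a null node id
        }
        final Boolean owner = ownerByNodeId.get(nodeId);
        return owner != null && owner; // unknown node => treat as not owner
    }
}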

Also a few in NeutronNetworkChangeListener:

2016-09-09 20:39:12,906 | ERROR | on-dispatcher-38 | DataChangeListener | 170 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.Boron | Error notifying listener org.opendaylight.netvirt.openstack.netvirt.translator.iaware.impl.NeutronNetworkChangeListener
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:210)[38:com.google.guava:18.0.0]
at org.opendaylight.netvirt.openstack.netvirt.AbstractHandler.enqueueEvent(AbstractHandler.java:69)[290:org.opendaylight.netvirt.openstack.net-virt:1.3.0.Boron]
at org.opendaylight.netvirt.openstack.netvirt.NetworkHandler.neutronNetworkCreated(NetworkHandler.java:68)[290:org.opendaylight.netvirt.openstack.net-virt:1.3.0.Boron]
at org.opendaylight.netvirt.openstack.netvirt.translator.iaware.impl.NeutronNetworkChangeListener.createNetwork(NeutronNetworkChangeListener.java:87)[290:org.opendaylight.netvirt.openstack.net-virt:1.3.0.Boron]
at org.opendaylight.netvirt.openstack.netvirt.translator.iaware.impl.NeutronNetworkChangeListener.onDataChanged(NeutronNetworkChangeListener.java:74)[290:org.opendaylight.netvirt.openstack.net-virt:1.3.0.Boron]
at org.opendaylight.controller.md.sal.binding.impl.AbstractForwardedDataBroker$TranslatingDataChangeInvoker.onDataChanged(AbstractForwardedDataBroker.java:143)[141:org.opendaylight.controller.sal-binding-broker-impl:1.4.0.Boron]
at org.opendaylight.controller.cluster.datastore.DataChangeListener.dataChanged(DataChangeListener.java:71)[170:org.opendaylight.controller.sal-distributed-datastore:1.4.0.Boron]
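
The top frame here is Guava's Preconditions.checkNotNull, which throws a NullPointerException when its argument is null, so enqueueEvent was handed a null. A minimal illustration of that contract (the method body is made up, not netvirt's actual code):

import com.google.common.base.Preconditions;

public class EventQueue {
    public void enqueueEvent(final Object event) {
        // checkNotNull returns its argument when non-null and otherwise
        // throws the NPE seen in the trace above.
        Preconditions.checkNotNull(event, "event must not be null");
        // ... enqueue for dispatch ...
    }
}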

These should be looked at by someone familiar with that code. I don't know the impact.

On ODL3, I see a couple of transaction failures and a read failure ("Failure to delete ovsdbNode") around 2016-09-09 18:15:15 with the message "Metadata not available", which indicates it tried to delete/read a node that doesn't exist.

I don't see any blueprint errors on ODL2, but any blueprint issues would occur on startup.

The RpcRegistry warning means a client tried to send a routed RPC but there's no implementation registered for the routed node path. That could be because a prior registration was since unregistered, because there never was a registration, or because a registration exists on a remote controller node but hasn't propagated to the calling controller node yet. There is a 5-second wait for convergence, hence the "Timed out finding routers" message.
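
For context, the routes the RpcRegistry searches for come from routed-RPC path registrations. A sketch of how a provider binds an implementation to a node path with the Boron-era binding API (the rpcRegistry and the service implementation are assumed to be injected, e.g. via blueprint; the node id is the one from the warning):

import org.opendaylight.controller.sal.binding.api.BindingAwareBroker.RoutedRpcRegistration;
import org.opendaylight.controller.sal.binding.api.RpcProviderRegistry;
import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.NodeContext;
import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.NodeId;
import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes;
import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node;
import org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.NodeKey;
import org.opendaylight.yang.gen.v1.urn.opendaylight.packet.service.rev130709.PacketProcessingService;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

public class TransmitPacketProvider {
    public RoutedRpcRegistration<PacketProcessingService> register(
            final RpcProviderRegistry rpcRegistry, final PacketProcessingService impl) {
        // Register the implementation, then bind it to the node path that
        // routed invocations (like transmit-packet above) are keyed on.
        final RoutedRpcRegistration<PacketProcessingService> reg =
                rpcRegistry.addRoutedRpcImplementation(PacketProcessingService.class, impl);
        final InstanceIdentifier<Node> nodePath = InstanceIdentifier.builder(Nodes.class)
                .child(Node.class, new NodeKey(new NodeId("openflow:92945849353687")))
                .build();
        reg.registerPath(NodeContext.class, nodePath);
        return reg; // closing the registration removes the route again
    }
}

If the registering node goes down and no other member re-registers the path, remote callers hit exactly this timeout once the 5-second convergence window expires.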

Wrt the EOS, member-2 became the shard leader originally and transferred leadership to member-3 when it was shut down at 2016-09-09 20:47:54. Since both member-1 and member-2 had been shut down and restarted, I would expect member-3 to be the owner for the entities. Interestingly, 2016-09-09 20:13:01,009 is the last timestamp in the logs for ODL3.
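
For reference, applications participate in that ownership election through the EntityOwnershipService. A minimal sketch of registering a candidate and watching ownership changes with the Boron-era API (the entity type and name are placeholders):

import org.opendaylight.controller.md.sal.common.api.clustering.CandidateAlreadyRegisteredException;
import org.opendaylight.controller.md.sal.common.api.clustering.Entity;
import org.opendaylight.controller.md.sal.common.api.clustering.EntityOwnershipService;

public class OwnershipTracker {
    public void track(final EntityOwnershipService eos) throws CandidateAlreadyRegisteredException {
        // Watch ownership changes for all entities of this type; isOwner()
        // turns true on whichever member is elected owner (member-3 here).
        eos.registerListener("ovsdb", change ->
                System.out.println(change.getEntity() + " isOwner=" + change.isOwner()));
        // Volunteer this member as a candidate; the EOS picks one owner
        // among the surviving registered candidates after a failover.
        eos.registerCandidate(new Entity("ovsdb", "example-entity"));
    }
}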

I'm unclear as to the exact issue observed, as I'm not familiar with ovsdb. I'm also unclear as to whether there's any issue with clustering here. As I mentioned, the EOS behavior looks correct, but without EOS debug logging enabled I can't tell for sure.

The NPEs mentioned above could be significant. I would suggest having the ovsdb folks take a look.

Comment by Tom Pantelis [ 16/Sep/16 ]

From the logs, the sequence of restarts is:

2016-09-09 18:13 - all nodes up
2016-09-09 18:41 - ODL1 stopped
2016-09-09 20:38 - ODL1 restarted
2016-09-09 20:47 - ODL2 stopped
2016-09-09 21:15 - ODL2 restarted

Comment by Ananthi Palaniswamy [ 19/Sep/16 ]

Followed the steps mentioned above and observed the karaf logs, but could not find the WARN messages above. Thus, the bug was not reproduced.

Comment by Tom Pantelis [ 11/Oct/16 ]

Closing this as it's not reproducible, and whatever issue there was doesn't seem to be related to clustering based on analysis of the logs. As mentioned before, the NPEs emanating from ovsdb look ominous and may have been the cause, so I would suggest creating a bug in that project.
