[OVSDB-434] br-int not created after failing over one controller in 3 node cluster Created: 13/Apr/16  Updated: 02/May/19  Resolved: 02/May/19

Status: Resolved
Project: ovsdb
Component/s: Southbound.Open_vSwitch
Affects Version/s: Carbon-SR3
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File 5720_karaf_logs.tar.gz     File 5720_karaf_logs.tar.xz     File websocket-logs.tar.gz    
Issue Links:
Relates
relates to CONTROLLER-1786 Jolokia lookup says leader exists but... Resolved
relates to CONTROLLER-1856 CSIT test Local_Leader_Shutdown fails... Resolved
relates to OVSDB-438 operational node goes missing upon ov... Resolved
Epic Link: Clustering Stability
External issue ID: 5720

 Description   

3 node ODL cluster with these features:
odl-ovsdb-openstack
odl-mdsal-clustering
odl-jolokia
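
A minimal sketch of the corresponding karaf-shell installation (the install order is an assumption; odl-jolokia first so shard state can be queried while the other features come up):

feature:install odl-jolokia
feature:install odl-mdsal-clustering
feature:install odl-ovsdb-openstack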

Two OVS nodes were set to connect to each of the
3 controllers. The OVS nodes were disconnected and reconnected
many (~20) times in a row, and it was verified that br-int
was being created each time.

The controller reported as leader for the default config shard
was then stopped (via the logout command on the karaf shell) and
started again. Once it came back, reconnecting the OVS nodes to
the controllers showed that the OVSDB manager was set, but
br-int was never created.
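
For reference, a sketch of how the failover and re-check can be driven from the shell (IP addresses, credentials, and the member name in the MBean are placeholders; the leader lookup assumes the usual Jolokia shard MBean exposed by odl-jolokia):

# ask Jolokia which member currently leads the default config shard
curl -s -u admin:admin \
  "http://10.0.0.1:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore" \
  | grep -o '"Leader":"[^"]*"'

# after restarting that member, point the OVS nodes at all three controllers again
ovs-vsctl set-manager tcp:10.0.0.1:6640 tcp:10.0.0.2:6640 tcp:10.0.0.3:6640

# the manager shows as connected, but br-int never appears
ovs-vsctl show
ovs-vsctl br-exists br-int ; echo $?   # non-zero exit: br-int is missing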

This exception appeared on one of the controllers (not the
restarted one) when the OVS nodes were set to connect:

2016-04-13 21:19:58,658 | WARN | n-invoker-impl-0 | SouthboundUtil | 251 - org.opendaylight.ovsdb.southbound-impl - 1.2.3.SNAPSHOT | Read Operational/DS for Node failed! KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=ovsdb:1]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94]]]]}
ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]}
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:71)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:92)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:132)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:97)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:35)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:123)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:117)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:247)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.OnComplete.internal(Future.scala:245)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)[150:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[150:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)[150:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:67)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
... 23 more
Caused by: org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-2-shard-topology-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.ShardManager.createNoShardLeaderException(ShardManager.java:744)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardManager.onShardNotInitializedTimeout(ShardManager.java:551)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardManager.handleCommand(ShardManager.java:222)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:36)[161:org.opendaylight.controller.sal-clustering-commons:1.3.2.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)[161:org.opendaylight.controller.sal-clustering-commons:1.3.2.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)[150:com.typesafe.akka.actor:2.3.14]
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)[150:com.typesafe.akka.actor:2.3.14]
at akka.actor.ActorCell.invoke(ActorCell.scala:487)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.run(Mailbox.scala:220)[150:com.typesafe.akka.actor:2.3.14]
... 5 more
2016-04-13 21:19:58,665 | WARN | n-invoker-impl-0 | SouthboundUtil | 251 - org.opendaylight.ovsdb.southbound-impl - 1.2.3.SNAPSHOT | Read Operational/DS for Node failed! KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=ovsdb:1]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94]]]]}
ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]}
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:71)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:92)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.maybeExecuteTransactionOperation(TransactionContextWrapper.java:92)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionProxy.executeRead(TransactionProxy.java:89)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionProxy.singleShardRead(TransactionProxy.java:114)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.TransactionProxy.read(TransactionProxy.java:108)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction.read(DOMBrokerReadWriteTransaction.java:37)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.md.sal.binding.impl.AbstractForwardedTransaction.doRead(AbstractForwardedTransaction.java:63)[140:org.opendaylight.controller.sal-binding-broker-impl:1.3.2.SNAPSHOT]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMReadWriteTransactionAdapter.read(BindingDOMReadWriteTransactionAdapter.java:31)[140:org.opendaylight.controller.sal-binding-broker-impl:1.3.2.SNAPSHOT]
at org.opendaylight.ovsdb.southbound.SouthboundUtil.readNode(SouthboundUtil.java:112)[251:org.opendaylight.ovsdb.southbound-impl:1.2.3.SNAPSHOT]
at org.opendaylight.ovsdb.southbound.transactions.md.OvsdbQosRemovedCommand.execute(OvsdbQosRemovedCommand.java:54)[251:org.opendaylight.ovsdb.southbound-impl:1.2.3.SNAPSHOT]
at org.opendaylight.ovsdb.southbound.transactions.md.OvsdbOperationalCommandAggregator.execute(OvsdbOperationalCommandAggregator.java:46)[251:org.opendaylight.ovsdb.southbound-impl:1.2.3.SNAPSHOT]
at org.opendaylight.ovsdb.southbound.transactions.md.TransactionInvokerImpl.run(TransactionInvokerImpl.java:88)[251:org.opendaylight.ovsdb.southbound-impl:1.2.3.SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_77]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_77]
Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:67)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
... 17 more
Caused by: org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-2-shard-topology-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.ShardManager.createNoShardLeaderException(ShardManager.java:744)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardManager.onShardNotInitializedTimeout(ShardManager.java:551)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardManager.handleCommand(ShardManager.java:222)[165:org.opendaylight.controller.sal-distributed-datastore:1.3.2.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:36)[161:org.opendaylight.controller.sal-clustering-commons:1.3.2.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)[161:org.opendaylight.controller.sal-clustering-commons:1.3.2.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)[150:com.typesafe.akka.actor:2.3.14]
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)[155:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)[150:com.typesafe.akka.actor:2.3.14]
at akka.actor.ActorCell.invoke(ActorCell.scala:487)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.run(Mailbox.scala:220)[150:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)[150:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[147:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]



 Comments   
Comment by Jamo Luhrsen [ 13/Apr/16 ]

It should be noted that scrubbing the OVS system (stopping the OVS service, removing conf.db, and restarting) did not work around this issue. The nodes could still
connect, but br-int would not be created.
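
For reference, the scrub that was attempted amounts to roughly the following sketch (the ovs-ctl path and conf.db location vary by distribution):

ovs-vsctl del-manager                        # drop the connection to the controllers
/usr/share/openvswitch/scripts/ovs-ctl stop  # stop ovsdb-server and ovs-vswitchd
rm /etc/openvswitch/conf.db                  # wipe the OVSDB database
/usr/share/openvswitch/scripts/ovs-ctl start
# reconnect to the controllers as before; br-int is still not created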

Comment by Jamo Luhrsen [ 13/Apr/16 ]

ODL_2 was the controller seeing the exceptions.

The logs were trimmed to contain only entries from when
I made the OVS connection (set-manager) until I killed (-9) the
controller process.

Comment by Jamo Luhrsen [ 13/Apr/16 ]

Attachment 5720_karaf_logs.tar.gz has been added with description: controller logs

Comment by Vinh Nguyen [ 03/Jun/16 ]

I could not reproduce the problem according to the description in the bug report.
The logs attached to the bug are trimmed, so I could not trace the context of the system before the problem occurred.
I did run across a variation of the problem:

  • shut down two ODL nodes instead of one.
  • the remaining ODL node is stuck with the ownership problem
    and can never recover, even after the other nodes come back up.

Comment by Anil Vishnoi [ 07/Jun/16 ]

Vinh,

I believe the new problem that you discovered is probably not related to net-virt clustering but rather to the base clustering service. Can you open a bug against the controller project (clustering component) and provide them the details?

Comment by Jamo Luhrsen [ 21/Jun/16 ]

Sorry it's taken me so long to get back to this. I have recreated this
issue with a recent distro taken from master (from 6/16/2016). I'll find
a way to post all the karaf logs, but here are the basic steps:

1) configure clustering with the config script packaged with the distro
2) start karaf and install odl-jolokia, followed by odl-ovsdb-openstack
3) connect ovs to all three controllers and verify that br-int is created
4) wipe ovs config (del-manager, del-br, ovs-ctl stop, rm conf.db, ovs-ctl start)
5) kill default shard Leader (CTRL-Z, kill -9 %1)
6) connect ovs to all three controllers.

After 6) I see that br-int is not created.
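
For reference, steps 5) and 6) amount to roughly the following sketch (IP addresses are placeholders; step 4) is the in-line ovs-vsctl/ovs-ctl sequence already listed above):

# 5) on the member currently leading the default shard, stop the karaf process
#    hard: suspend it with CTRL-Z from its console, then
kill -9 %1

# 6) immediately point OVS back at all three controllers
ovs-vsctl set-manager tcp:10.0.0.1:6640 tcp:10.0.0.2:6640 tcp:10.0.0.3:6640
ovs-vsctl br-exists br-int ; echo $?   # stays non-zero: br-int is not created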

This happened on my first try, so maybe these updated steps can help you,
Vinh.

I will get karaf logs posted.

Comment by Jamo Luhrsen [ 21/Jun/16 ]

Controller 01 is the controller that was killed.

Comment by Jamo Luhrsen [ 21/Jun/16 ]

Attachment 5720_karaf_logs.tar.xz has been added with description: three controller logs

Comment by Bertrand Low [ 06/Jul/16 ]

Hi JamO,

Do the logs in attachment 1046 capture all the steps listed in comment 5? The logs do not appear to contain any errors or exceptions such as those captured in this bug's description.

thanks

Comment by Vinh Nguyen [ 22/Jul/16 ]

Add email thread discussing this issue:

---------------------

Good info and debugging Bertrand.

Is it such that any time the cluster is reorganizing (or even just organizing when coming up), if OVS connects then we won't get a br-int created?

If that's the case, then I think we have something to worry about. We can't really ever know when/if clustering would have trouble and have to recalculate things, and that means we have these unknown windows in which we can't connect ovs to ODL.

JamO

On 07/11/2016 06:48 PM, Bertrand Low wrote:
> Hi Jamo, Anil, and Sam
>
>
>
> Regarding NETVIRT-13 this bug is reproducible with Jamo’s steps (thanks
> Jamo) if the time between “step 5) default shard leader killed”, and
> “step 6) connect ovs to controllers” is very short. What appears to be
> occurring is that when the shard leader is killed, the cluster is then needing to reorganize by electing a new shard leader. If the ovs switch initiates a connection while the cluster is still re-organizing, then even though the ovsdb manager is set, the br-int is not created.
> However, this seems to be a corner case and it appears that the
> workaround of disconnecting the ovs switch and reconnecting it again
> (after the cluster has finished re-organizing) will successfully push the br-int bridge to the ovs switch. The question I have is: is this is a corner case with an acceptable workaround?
>
>
>
> Note: any existing connections to the cluster when a shard leader is
> killed appear to get their ownership transferred correctly. So, this
> bug appears to be only for new connections to the cluster /before/ the cluster has finished reorganizing itself. New connections to the cluster after reorganization is complete will get br-int pushed successfully.
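
For reference, the workaround described above amounts to roughly this sketch (placeholder IPs; the wait just has to outlast the shard leader election):

ovs-vsctl del-manager   # drop the connection that was made while the cluster was re-electing
sleep 30                # or poll the Jolokia shard MBean until a leader is reported again
ovs-vsctl set-manager tcp:10.0.0.1:6640 tcp:10.0.0.2:6640 tcp:10.0.0.3:6640
# br-int is now pushed to the switch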

Comment by Vinh Nguyen [ 15/Sep/16 ]

Problem: br-int is not created if the OVS switch connects to the 3-node cluster immediately after the ODL node that is the netvirt-provider master instance goes down.

In this case the cluster is still converging. When the OVS switch connects, netvirt calls the “ovsdb-netvirt-provider” owner to create br-int. Since the current “ovsdb-netvirt-provider” owner just went down and the remaining two ODL nodes are still electing a new owner for “ovsdb-netvirt-provider”, there is no “ovsdb-netvirt-provider” owner to create br-int.

No exception is thrown in this scenario: the remaining ODL nodes simply ignore the call to create the bridge because they are not the “ovsdb-netvirt-provider” owner.

After a couple of seconds, once the ODL cluster finishes converging, br-int can be added manually and netvirt creates the initial flows successfully as well.

In summary, the problem occurs when:

  • the netvirt-provider master instance goes down
  • an OVS switch connects to the cluster while the cluster is electing the new “ovsdb-netvirt-provider” owner
  • the br-int bridge is not created because there is no “ovsdb-netvirt-provider” owner
  • the problem happens intermittently, since the OVS switch has to connect within the small time window while the cluster is converging (ownership can be inspected as sketched below).
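
A sketch of how the ownership state can be inspected while this window is open (credentials and IP are placeholders; the entity-owners RESTCONF path is assumed to be the usual one for that release):

# dump the entity-ownership state; while re-election is in progress the entity
# for the netvirt provider lists its candidates but no current owner
curl -s -u admin:admin \
  "http://10.0.0.2:8181/restconf/operational/entity-owners:entity-owners" | python -m json.tool
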
Comment by Anil Vishnoi [ 02/Feb/17 ]

Hi Vinh,

I believe this issue is related to the old netvirt code, and per the last update it looks like it is a netvirt-specific issue.

Moving it to the netvirt project to take the final decision on this bug.

Thanks
Anil

Comment by Sam Hague [ 27/Nov/17 ]

Still happens in Carbon and looks to be OVSDB-specific, so moving it back to ovsdb.

This is the relevant log entry from the newly attached websocket logs archive, overcloud-controller-0-karaf.log:

2017-11-20 13:43:02,999 | WARN  | n-invoker-impl-0 | SouthboundUtil                   | 289 - org.opendaylight.ovsdb.southbound-impl - 1.4.3.Carbon | Read Operational/DS for Node failed!

Comment by Vishal Thapar [ 23/May/18 ]

Is this still an issue, and could you rephrase the steps for the new netvirt? With the new diagstatus code coming in netvirt https://git.opendaylight.org/gerrit/#/c/64000/ , Elan shouldn't be trying to create the bridge when the cluster is not ready. Is that sufficient to address this issue?

Comment by Jamo Luhrsen [ 23/May/18 ]

The steps seem easy enough. I don't have a clustered setup at this exact moment, or I'd try them:

 

  1. connect two OVS to all three controllers
  2. disconnect the OVS from all three controllers
  3. repeat 1. and 2. a bunch (~20) times
  4. verify that br-int is ok after this

 

  1. disconnect all OVS from the controllers
  2. restart the shard leader
  3. connect the OVS to the controllers
  4. verify that br-int is ok after this

 

vinh.nguyen@hcl.com can verify my accuracy here.

 

If I get access to a clustered setup locally, I'll try to remember to check this.

 

Having said that, I noticed that this is probably an OVSDB bug and maybe not all the way up the food chain
to netvirt, even though I know those are the features listed as installed. I'm just looking at the Exception.

We have OVSDB Cluster Jobs and they pass most of the time, but they do fail periodically. I scanned some
karaf logs in the 5 most recent failed jobs and one of them looks to have a similar exception:

2018-05-09T17:25:31,776 | WARN  | transaction-invoker-impl-0 | SouthboundUtil                   | 312 - org.opendaylight.ovsdb.southbound-impl - 1.7.0.SNAPSHOT | Read Operational/DS for Node failed! KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey{_topologyId=Uri{_value=ovsdb:1}}], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey{_nodeId=Uri{_value=ovsdb://uuid/8fba589d-fb24-44ab-8efa-82ba6fd92e3a}}]]}
org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/8fba589d-fb24-44ab-8efa-82ba6fd92e3a}]
	at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:43) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:39) ~[?:?]
	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:91) ~[?:?]
	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:40) ~[?:?]
	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.mapException(MappingCheckedFuture.java:62) ~[?:?]
	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) ~[?:?]
	at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) ~[?:?]
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:168) ~[?:?]
	at com.google.common.util.concurrent.Futures.getDone(Futures.java:1436) ~[?:?]
	at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:85) ~[?:?]
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[?:?]
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[?:?]
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[?:?]
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713) ~[?:?]
	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:67) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:96) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:192) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:109) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:37) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:136) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:130) ~[?:?]
	at akka.dispatch.OnComplete.internal(Future.scala:260) ~[?:?]
	at akka.dispatch.OnComplete.internal(Future.scala:258) ~[?:?]
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[?:?]
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185) ~[?:?]
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[356:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[?:?]
	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[?:?]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) [356:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) [356:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) [40:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [40:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43) [40:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [40:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [40:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [40:com.typesafe.akka.actor:2.5.11]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [40:com.typesafe.akka.actor:2.5.11]
Caused by: org.opendaylight.mdsal.common.api.DataStoreUnavailableException: Shard member-1-shard-topology-operational currently has no leader. Try again later.
	at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:63) ~[?:?]
	... 22 more
Caused by: org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-1-shard-topology-operational currently has no leader. Try again later.
	at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.createNoShardLeaderException(ShardManager.java:955) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.onShardNotInitializedTimeout(ShardManager.java:786) ~[?:?]
	at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.handleCommand(ShardManager.java:253) ~[?:?]
	at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[?:?]
	at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275) ~[?:?]
	at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[?:?]
	at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:608) ~[?:?]
	at akka.actor.Actor.aroundReceive(Actor.scala:517) ~[?:?]
	at akka.actor.Actor.aroundReceive$(Actor.scala:515) ~[?:?]
	at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:273) ~[?:?]
	at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:691) ~[?:?]
	at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:192) ~[?:?]
	at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:191) ~[?:?]
	at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:273) ~[?:?]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) ~[?:?]
	at akka.actor.ActorCell.invoke(ActorCell.scala:559) ~[?:?]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[?:?]
	at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[?:?]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[?:?]
	... 4 more

That job had the exception in two of the logs, but not the third.

I didn't see this exception in any of the other karaf logs in the other 4 failing jobs I looked at.

Comment by Vinh Nguyen [ 23/May/18 ]

The problem is not reproducible on Oxygen/master using Jamo's reproduction steps. In Carbon, netvirt failed to create br-int when the cluster was in the middle of reorganizing. Current netvirt doesn't have this problem because br-int is created only once the cluster is ready.

 

I think we can close this issue because it is not reproducible with the described scenario. We can open a new issue to track the intermittent exception seen in the OVSDB Cluster Jobs.
