[OVSDB-434] br-int not created after failing over one controller in 3 node cluster Created: 13/Apr/16 Updated: 02/May/19 Resolved: 02/May/19 |
|
| Status: | Resolved |
| Project: | ovsdb |
| Component/s: | Southbound.Open_vSwitch |
| Affects Version/s: | Carbon-SR3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Jamo Luhrsen | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: |
| Issue Links: |
| Epic Link: | Clustering Stability |
| External issue ID: | 5720 |
| Description |
|
3 node ODL cluster with these features:
2 OVS nodes where they are being set to connect to each of the
The controller reporting leader for default config shard was
This exception was coming in a controller (not the restarted controller):

2016-04-13 21:19:58,658 | WARN | n-invoker-impl-0 | SouthboundUtil | 251 - org.opendaylight.ovsdb.southbound-impl - 1.2.3.SNAPSHOT | Read Operational/DS for Node failed! KeyedInstanceIdentifier {targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=ovsdb:1]]], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94]]]]}
ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]}
ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]} |
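The warning above reflects a transient condition: the topology-operational shard has no leader while the cluster is failing over, so the southbound read is rejected with "Try again later." As a minimal illustrative sketch only (the helper and exception names below are hypothetical, not the actual SouthboundUtil code), a caller that hits this condition could treat it as retryable and re-attempt the read after a short delay:

import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Minimal sketch of retrying a transiently failing datastore read, e.g. when
// the shard leader is temporarily unavailable during a failover. Hypothetical
// helper; not the actual SouthboundUtil implementation.
public class RetryingReadSketch {

    // Marker for "no shard leader yet, try again later" style failures.
    static class TransientReadException extends Exception {
        TransientReadException(String message) { super(message); }
    }

    static <T> T readWithRetry(Callable<T> read, int maxAttempts, long delayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (TransientReadException e) {
                last = e; // leader election may still be in progress; wait and retry
                TimeUnit.MILLISECONDS.sleep(delayMillis);
            }
        }
        throw last; // exhausted retries; surface the failure as before
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulate a read that fails twice while the shard has no leader,
        // then succeeds once the cluster has converged.
        String node = readWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientReadException(
                        "Shard member-2-shard-topology-operational currently has no leader. Try again later.");
            }
            return "ovsdb://uuid/f3354257-9201-4e55-bf6e-98320d6c5f94";
        }, 5, 200);
        System.out.println("read succeeded after " + calls[0] + " attempts: " + node);
    }
}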
| Comments |
| Comment by Jamo Luhrsen [ 13/Apr/16 ] |
|
It should be noted that scrubbing the OVS system (stopping the ovs service, removing conf.db and restarting) did not work around this issue. The nodes could still |
| Comment by Jamo Luhrsen [ 13/Apr/16 ] |
|
ODL_2 was the controller seeing the exceptions. The logs were trimmed to only have entries that happened from when |
| Comment by Jamo Luhrsen [ 13/Apr/16 ] |
|
Attachment 5720_karaf_logs.tar.gz has been added with description: controller logs |
| Comment by Vinh Nguyen [ 03/Jun/16 ] |
|
I could not reproduce the problem according to the description in the bug report.
|
| Comment by Anil Vishnoi [ 07/Jun/16 ] |
|
Vinh, I believe the new problem that you discovered is probably not related to net-virt clustering but rather to the base clustering service. Can you open a bug against the controller project (clustering component) and provide them the details? |
| Comment by Jamo Luhrsen [ 21/Jun/16 ] |
|
Sorry it's taken me so long to get back to this. I have recreated this:

1) configure clustering with the config script packaged with the distro

after 6) I see that br-int is not created.

This happened on my first try, so maybe these updated steps can help you. I will get karaf logs posted. |
| Comment by Jamo Luhrsen [ 21/Jun/16 ] |
|
controller 01 is the controller that was killed. |
| Comment by Jamo Luhrsen [ 21/Jun/16 ] |
|
Attachment 5720_karaf_logs.tar.xz has been added with description: three controller logs |
| Comment by Bertrand Low [ 06/Jul/16 ] |
|
Hi JamO, do the logs in attachment 1046 capture all the steps listed in comment 5? The logs do not appear to contain any errors or exceptions such as those captured in this bug's description. Thanks. |
| Comment by Vinh Nguyen [ 22/Jul/16 ] |
|
Adding the email thread discussing this issue:

---------------------

Good info and debugging, Bertrand.

Is it such that any time the cluster is reorganizing (or say even just organizing when coming up), if OVS connects then we won't get a br-int created? If that's the case, then I think we have something to worry about. We can't really ever know when/if clustering would have trouble and have to recalculate things, and that means we have these unknown windows in which we can't connect OVS to ODL.

JamO

On 07/11/2016 06:48 PM, Bertrand Low wrote: |
| Comment by Vinh Nguyen [ 15/Sep/16 ] |
|
Problem: br-int is not created if the OVS switch connects to the 3-node cluster immediately after the ODL node that is the netvirt-provider master instance goes down. In this case the cluster is still converging. When the OVS is connected, netvirt calls the "ovsdb-netvirt-provider" owner to create the br-int. Since the current "ovsdb-netvirt-provider" owner just went down and the remaining 2 ODL nodes are still selecting a new owner for "ovsdb-netvirt-provider", there is no "ovsdb-netvirt-provider" owner to create the br-int. No exception is thrown in this scenario because the remaining ODL nodes simply ignore the call to create the bridge, as they are not the "ovsdb-netvirt-provider" owner. After a couple of seconds, when the ODL cluster completes converging, we can manually add br-int successfully, and netvirt creates the initial flows successfully as well. In summary, the problem occurs when OVS connects during the window in which there is no "ovsdb-netvirt-provider" owner.
|
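The ownership gap described above can be sketched as follows. This is a minimal illustrative model, not the actual netvirt/ovsdb code, and the interface and class names are hypothetical: a node only acts on the bridge-creation request when it currently owns the "ovsdb-netvirt-provider" entity, so a request that arrives while no owner has been elected is silently dropped.

import java.util.Optional;

// Hypothetical, simplified model of the ownership check described above;
// not the real netvirt/ovsdb code.
public class BridgeCreationSketch {

    // Ownership states a node can observe for the "ovsdb-netvirt-provider" entity.
    enum OwnershipState { IS_OWNER, OWNED_BY_OTHER, NO_OWNER }

    interface OwnershipView {
        // Empty while the cluster is still converging after the old owner went down.
        Optional<OwnershipState> getState(String entityName);
    }

    static void onOvsConnected(OwnershipView view, Runnable createBrInt) {
        Optional<OwnershipState> state = view.getState("ovsdb-netvirt-provider");

        // Only the elected owner creates br-int. While the remaining nodes are
        // still electing a new owner, every node falls through here and the
        // request is silently ignored: no exception, and no br-int.
        if (state.isPresent() && state.get() == OwnershipState.IS_OWNER) {
            createBrInt.run();
        }
    }

    public static void main(String[] args) {
        // Simulate the failure window: the previous owner just went down and
        // no new owner has been elected yet.
        OwnershipView converging = entity -> Optional.empty();
        onOvsConnected(converging, () -> System.out.println("creating br-int"));
        System.out.println("cluster converging: br-int was not created");

        // After convergence this node becomes the owner and br-int is created.
        OwnershipView owner = entity -> Optional.of(OwnershipState.IS_OWNER);
        onOvsConnected(owner, () -> System.out.println("creating br-int"));
    }
}

Deferring or retrying the request until an owner exists (or until the cluster reports itself ready, as discussed later in this thread) would close that window.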
| Comment by Anil Vishnoi [ 02/Feb/17 ] |
|
Hi Vinh, I believe this issue is related to the old netvirt code, and as per the last update it looks like it is more of a netvirt-specific issue. Moving it to the netvirt project to take the final decision on this bug. Thanks |
| Comment by Sam Hague [ 27/Nov/17 ] |
|
Still happens in Carbon and looks to be ovsdb-specific, so moving back to ovsdb. This is the log in the new attached web-socket zip, overcloud-controller-0-karaf.log:

2017-11-20 13:43:02,999 | WARN | n-invoker-impl-0 | SouthboundUtil | 289 - org.opendaylight.ovsdb.southbound-impl - 1.4.3.Carbon | Read Operational/DS for Node failed! |
| Comment by Vishal Thapar [ 23/May/18 ] |
|
Is this still an issue, and could you rephrase the steps with the new netvirt? With the new diagstatus code coming in netvirt (https://git.opendaylight.org/gerrit/#/c/64000/), Elan shouldn't be trying to create the bridge when the cluster is not ready. Is that sufficient to address this issue? |
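The approach described above, waiting until the cluster reports itself ready before creating the bridge, can be sketched roughly as below. This is an assumption-laden illustration only, not the actual diagstatus or netvirt code; all names here are hypothetical.

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative-only sketch of the "don't act until the cluster reports ready"
// idea behind the diagstatus change; hypothetical names, not the real
// infrautils/diagstatus or netvirt API.
public class ReadinessGateSketch {

    // Poll a readiness check until it passes (or a timeout expires) before
    // running an action such as creating br-int.
    static boolean runWhenReady(BooleanSupplier clusterReady, Runnable action,
                                long timeoutSeconds) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
        while (System.nanoTime() < deadline) {
            if (clusterReady.getAsBoolean()) {
                action.run();
                return true;
            }
            TimeUnit.MILLISECONDS.sleep(500); // back off while shards converge
        }
        return false; // cluster never became ready; caller can log and retry later
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in readiness check; a real one would ask the datastore/EOS
        // whether shard leaders and entity owners are in place.
        long start = System.currentTimeMillis();
        BooleanSupplier ready = () -> System.currentTimeMillis() - start > 2_000;

        boolean created = runWhenReady(ready, () -> System.out.println("creating br-int"), 10);
        System.out.println(created ? "bridge created after cluster became ready"
                                   : "timed out waiting for cluster readiness");
    }
}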
| Comment by Jamo Luhrsen [ 23/May/18 ] |
|
steps seem easy enough. I don't have a clustered setup at this exact moment or I'd try them:
vinh.nguyen@hcl.com can verify my accuracy here.
If I get access to a clustered setup locally, I'll try to remember to check this.
Having said that, I noticed that this is probably an OVSDB bug and maybe not all the way up the food chain. We have OVSDB Cluster Jobs and they pass most of the time, but they do fail periodically. I scanned some of them:

2018-05-09T17:25:31,776 | WARN | transaction-invoker-impl-0 | SouthboundUtil | 312 - org.opendaylight.ovsdb.southbound-impl - 1.7.0.SNAPSHOT | Read Operational/DS for Node failed! KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey{_topologyId=Uri{_value=ovsdb:1}}], org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey{_nodeId=Uri{_value=ovsdb://uuid/8fba589d-fb24-44ab-8efa-82ba6fd92e3a}}]]}
org.opendaylight.controller.md.sal.common.api.data.ReadFailedException: Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/8fba589d-fb24-44ab-8efa-82ba6fd92e3a}]
  at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:43) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.compat.LegacyDOMStoreAdapter$1.newWithCause(LegacyDOMStoreAdapter.java:39) ~[?:?]
  at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:91) ~[?:?]
  at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:40) ~[?:?]
  at org.opendaylight.mdsal.common.api.MappingCheckedFuture.mapException(MappingCheckedFuture.java:62) ~[?:?]
  at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) ~[?:?]
  at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) ~[?:?]
  at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:168) ~[?:?]
  at com.google.common.util.concurrent.Futures.getDone(Futures.java:1436) ~[?:?]
  at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:85) ~[?:?]
  at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[?:?]
  at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[?:?]
  at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[?:?]
  at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713) ~[?:?]
  at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:67) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:96) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:192) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:109) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:37) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:136) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:130) ~[?:?]
  at akka.dispatch.OnComplete.internal(Future.scala:260) ~[?:?]
  at akka.dispatch.OnComplete.internal(Future.scala:258) ~[?:?]
  at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[?:?]
  at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185) ~[?:?]
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[356:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
  at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[?:?]
  at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[?:?]
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) [356:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
  at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) [356:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
  at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) [40:com.typesafe.akka.actor:2.5.11]
  at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [40:com.typesafe.akka.actor:2.5.11]
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43) [40:com.typesafe.akka.actor:2.5.11]
  at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [40:com.typesafe.akka.actor:2.5.11]
  at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [40:com.typesafe.akka.actor:2.5.11]
  at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [40:com.typesafe.akka.actor:2.5.11]
  at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [40:com.typesafe.akka.actor:2.5.11]
Caused by: org.opendaylight.mdsal.common.api.DataStoreUnavailableException: Shard member-1-shard-topology-operational currently has no leader. Try again later.
  at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:63) ~[?:?]
  ... 22 more
Caused by: org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-1-shard-topology-operational currently has no leader. Try again later.
  at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.createNoShardLeaderException(ShardManager.java:955) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.onShardNotInitializedTimeout(ShardManager.java:786) ~[?:?]
  at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.handleCommand(ShardManager.java:253) ~[?:?]
  at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[?:?]
  at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275) ~[?:?]
  at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[?:?]
  at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:608) ~[?:?]
  at akka.actor.Actor.aroundReceive(Actor.scala:517) ~[?:?]
  at akka.actor.Actor.aroundReceive$(Actor.scala:515) ~[?:?]
  at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:273) ~[?:?]
  at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:691) ~[?:?]
  at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:192) ~[?:?]
  at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:191) ~[?:?]
  at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:273) ~[?:?]
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) ~[?:?]
  at akka.actor.ActorCell.invoke(ActorCell.scala:559) ~[?:?]
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[?:?]
  at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[?:?]
  at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[?:?]
  ... 4 more

That job had the exception in two of the logs, but not the third. I didn't see this exception in any of the other karaf logs in the other 4 failing jobs I looked at. |
| Comment by Vinh Nguyen [ 23/May/18 ] |
|
The problem is not reproducible on Oxygen/master using Jamo's reproduction steps. In Carbon, netvirt failed to create the br-int when the cluster was in the middle of reorganizing. Current netvirt doesn't have this problem because the br-int is created only when the cluster is ready.
I think we can close this issue because it is not reproducible with the described scenario. We can open a new issue to track the intermittent exception in the OVSDB Cluster Jobs. |