[OVSDB-221] Ownership changed repeatedly without bringing down any node Created: 02/Nov/15 Updated: 30/Oct/17 Resolved: 10/Feb/16 |
|
| Status: | Resolved |
| Project: | ovsdb |
| Component/s: | Southbound.Open_vSwitch |
| Affects Version/s: | unspecified |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Hari Prasidh | Assignee: | Anil Vishnoi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: Windows |
||
| Attachments: |
|
| External issue ID: | 4569 |
| Priority: | High |
| Description |
|
I am seeing inconsistent ownership-election behaviour in a 3-node OVSDB southbound cluster. Once an owner node has been elected, ownership should not change unless that node is brought down; however, in my observations ownership keeps moving from one instance to another without any node being brought down. Below are my observations on node 107:

2015-11-02 07:23:23,428 | INFO | lt-dispatcher-21 | OvsdbConnectionManager | 174 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | handleOwnershipChanged: EntityOwnershipChanged [entity=Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb:/10.106.138.116:6640}]}, wasOwner=false, isOwner=true, hasOwner=true] event received for device ConnectionInfo [Remote-address=10.106.138.116, Remote-port=40335, Local-address10.106.138.107, Local-port=6640, type=PASSIVE]

2015-11-02 08:16:48,965 | INFO | lt-dispatcher-21 | OvsdbConnectionManager | 174 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | handleOwnershipChanged: EntityOwnershipChanged [entity=Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb:/10.106.138.116:6640}]}, wasOwner=true, isOwner=false, hasOwner=true] event received for device ConnectionInfo [Remote-address=10.106.138.116, Remote-port=40335, Local-address10.106.138.107, Local-port=6640, type=PASSIVE]

2015-11-02 08:17:42,892 | INFO | lt-dispatcher-16 | OvsdbConnectionManager | 174 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | handleOwnershipChanged: EntityOwnershipChanged [entity=Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb:/10.106.138.116:6640}]}, wasOwner=false, isOwner=true, hasOwner=true] event received for device ConnectionInfo [Remote-address=10.106.138.116, Remote-port=43804, Local-address10.106.138.107, Local-port=6640, type=PASSIVE] |
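Editor's note: for readers unfamiliar with the wasOwner/isOwner/hasOwner flags printed in the events above, the following is a minimal, self-contained Java sketch of how such an ownership-change event is typically interpreted. The OwnershipChange record and handler are simplified stand-ins, not the actual OvsdbConnectionManager or EOS API.

```java
// Simplified stand-in for the EntityOwnershipChanged events seen in the karaf log above.
// This is an illustrative sketch, not the OVSDB southbound implementation.
public class OwnershipChangeExample {

    // Mirrors the three flags printed in the log: wasOwner, isOwner, hasOwner.
    record OwnershipChange(String deviceId, boolean wasOwner, boolean isOwner, boolean hasOwner) {}

    static void handleOwnershipChanged(OwnershipChange change) {
        if (!change.wasOwner() && change.isOwner()) {
            // This instance just won ownership: it should start managing the device,
            // e.g. own the southbound session and write its operational data.
            System.out.println("Became owner of " + change.deviceId());
        } else if (change.wasOwner() && !change.isOwner()) {
            // This instance just lost ownership without the device being disconnected;
            // this is the transition the bug report considers unexpected.
            System.out.println("Lost ownership of " + change.deviceId()
                + (change.hasOwner() ? " (another instance is now owner)" : " (no owner at the moment)"));
        }
    }

    public static void main(String[] args) {
        String device = "ovsdb:/10.106.138.116:6640";
        // Sequence corresponding to the three log lines in the description.
        handleOwnershipChanged(new OwnershipChange(device, false, true, true));  // 07:23:23 became owner
        handleOwnershipChanged(new OwnershipChange(device, true, false, true));  // 08:16:48 lost ownership
        handleOwnershipChanged(new OwnershipChange(device, false, true, true));  // 08:17:42 became owner again
    }
}
```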
| Comments |
| Comment by Hari Prasidh [ 02/Nov/15 ] |
|
Attachment karaf_logs.zip has been added with description: karaf logs for your reference |
| Comment by Anil Vishnoi [ 02/Nov/15 ] |
|
Hari, looking at the logs:

2015-11-02 07:23:23,428 wasOwner=false, isOwner=true, hasOwner=true] event received for device ConnectionInfo [Remote-address=10.106.138.116, Remote-port=40335, Local-address10.106.138.107, Local-port=6640, type=PASSIVE]
2015-11-02 08:16:48,965 wasOwner=true, isOwner=false, hasOwner=true] event received for device ConnectionInfo [Remote-address=10.106.138.116, Remote-port=40335, Local-address10.106.138.107, Local-port=6640, type=PASSIVE]

What actions did you perform during this one hour? Do you suspect any connection flap?

2015-11-02 08:17:42,892 wasOwner=false, isOwner=true, hasOwner=true] event received for device ConnectionInfo [Remote-address=10.106.138.116, Remote-port=43804, Local-address10.106.138.107, Local-port=6640, type=PASSIVE]

If you look at the remote port in the log above, it is 43804, whereas it is 40335 in the first two logs, so it looks like the device disconnected from the controller, reconnected, and the 138.107 controller became master again. So the second switch is expected, but I am interested in the first ownership change; can you please describe everything that happened during that one hour? |
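Editor's note: the diagnosis above, that a changed remote port implies a device reconnect while an unchanged port implies an in-place ownership move, can be captured in a small check. A hedged sketch follows; the ConnectionInfo record here is a simplified stand-in for the class whose toString appears in the logs, not the actual OVSDB southbound type.

```java
// Illustrative classification of ownership events based on the connection details
// seen in the logs: same remote port = in-place ownership change (the suspicious case),
// different remote port = the device reconnected and ownership was re-elected.
public class OwnershipFlapClassifier {

    record ConnectionInfo(String remoteAddress, int remotePort) {}

    static String classify(ConnectionInfo before, ConnectionInfo after) {
        if (!before.remoteAddress().equals(after.remoteAddress())
                || before.remotePort() != after.remotePort()) {
            return "reconnect: the device dropped and re-established its session";
        }
        return "in-place ownership change: session unchanged, EOS re-assigned the owner";
    }

    public static void main(String[] args) {
        ConnectionInfo first  = new ConnectionInfo("10.106.138.116", 40335); // 07:23 and 08:16 events
        ConnectionInfo second = new ConnectionInfo("10.106.138.116", 43804); // 08:17 event

        System.out.println(classify(first, first));   // the unexplained 08:16 ownership loss
        System.out.println(classify(first, second));  // the expected change after reconnect
    }
}
```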
| Comment by Hari Prasidh [ 03/Nov/15 ] |
|
Attachment karaf_logs(03-11-15).zip has been added with description: karaf logs(03-11-15) |
| Comment by Hari Prasidh [ 03/Nov/15 ] |
|
Anil, thanks for your comments. During that one-hour span I created two bridges and then deleted one of them; those were the only actions I performed in that window. I repeated the test today and saw the same behaviour, and I observed no connection flap while testing. I will attach today's log for clarity. In today's test I created 6 bridges and deleted 4 of them. What I observe is that 107 initially acts as the owner, loses ownership after a few minutes, and 111 then takes over. In between I performed some switch operations, but they did not take effect immediately because there was no owner; only after an owner was elected did the operations take effect on all instances. Meanwhile, while there is no owner, reading the operational data for that node returns the exception below. Once an owner is elected, the operational data for that node can be read again.

application operation-failed: Problem to get data from transaction.
ReadFailedException {message=Error reading data for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology, errorList=[RpcError [message=Error reading data for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-1-shard-topology-operational currently has no leader. Try again later.]]}
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.readData(NoOpTransactionContext.java:80)
at org.opendaylight.controller.cluster.datastore.TransactionProxy$2.invoke(TransactionProxy.java:106)
at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:132)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:87)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:35)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:110)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:104)
at akka.dispatch.OnComplete.internal(Future.scala:247)
at akka.dispatch.OnComplete.internal(Future.scala:245)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:90)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:88)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:88)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:88)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-1-shard-topology-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.readData(NoOpTransactionContext.java:76)
... 24 more
Caused by: org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-1-shard-topology-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.ShardManager.createNoShardLeaderException(ShardManager.java:463)
at org.opendaylight.controller.cluster.datastore.ShardManager.onShardNotInitializedTimeout(ShardManager.java:293)
at org.opendaylight.controller.cluster.datastore.ShardManager.handleCommand(ShardManager.java:194)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:36)
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
... 5 more |
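Editor's note: the NoShardLeaderException above is explicitly worded "Try again later", so callers normally retry the read with a short backoff until a shard leader is elected. Below is a minimal, self-contained sketch of that retry pattern; readOperationalTopology() is a hypothetical placeholder for whatever MD-SAL read the application performs, and it simulates the transient failure reported here.

```java
import java.util.concurrent.TimeUnit;

// Generic retry-with-backoff around a datastore read that can fail transiently
// while a shard leader election is in progress. The read itself is a placeholder.
public class RetryingRead {

    private static int calls = 0;

    static String readOperationalTopology() throws Exception {
        // Placeholder for the real MD-SAL read transaction: it fails for the first
        // two attempts (no shard leader yet), then succeeds once a leader "exists".
        if (++calls < 3) {
            throw new IllegalStateException(
                "Shard member-1-shard-topology-operational currently has no leader. Try again later.");
        }
        return "network-topology data";
    }

    static String readWithRetry(int maxAttempts, long backoffMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return readOperationalTopology();
            } catch (Exception e) {
                last = e;  // transient "no leader" failure: wait and retry
                TimeUnit.MILLISECONDS.sleep(backoffMillis * attempt);
            }
        }
        throw last;  // leader never appeared within the retry budget
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readWithRetry(5, 500));
    }
}
```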
| Comment by Tom Pantelis [ 03/Nov/15 ] |
|
There are a lot of these exceptions:

ModifiedNodeDoesNotExistException: Node /(urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node[ {(urn:opendaylight:inventory?revision=2013-08-19)id=openflow:125920978130761}] does not exist. Cannot apply modification to its children.

It looks like something is trying to write to nodes that don't exist. I suspect these are causing delays and timeouts, resulting in leadership changes. I would suggest trying to find where the ModifiedNodeDoesNotExistExceptions are originating from. Could it be the StatisticsManager? Does that still poll every 3 seconds and write to the data store? |
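Editor's note: the ModifiedNodeDoesNotExistException mentioned above occurs when a transaction writes a child under a parent path that has not been created. The toy model below (a plain nested map, not the MD-SAL data tree API) illustrates the failure mode and the usual remedy of creating or merging the parent structure before writing the child.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a hierarchical store: writing a child under a missing parent fails,
// which is the shape of the ModifiedNodeDoesNotExistException discussed above.
// This is not the MD-SAL API, only an illustration of the failure mode.
public class MissingParentExample {

    private final Map<String, Map<String, String>> store = new HashMap<>();

    void writeChild(String parent, String child, String value) {
        Map<String, String> children = store.get(parent);
        if (children == null) {
            // Analogous to "Node ... does not exist. Cannot apply modification to its children."
            throw new IllegalStateException("Node " + parent + " does not exist."
                + " Cannot apply modification to its children.");
        }
        children.put(child, value);
    }

    void mergeParentThenWrite(String parent, String child, String value) {
        // Remedy: ensure the parent exists (a merge/ensure-parents step) before
        // writing the child, instead of assuming another component created it.
        store.computeIfAbsent(parent, p -> new HashMap<>()).put(child, value);
    }

    public static void main(String[] args) {
        MissingParentExample ex = new MissingParentExample();
        ex.mergeParentThenWrite("nodes/node/openflow:125920978130761", "table:0", "ok"); // succeeds
        try {
            ex.writeChild("nodes/node/openflow:99", "table:0", "fails"); // parent never created
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```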
| Comment by Anil Vishnoi [ 04/Nov/15 ] |
|
Tom, by leadership changes, do you mean shard leadership changes or device leadership changes? I believe a shard leadership change should not affect the functioning of the EOS, right? |
| Comment by Tom Pantelis [ 04/Nov/15 ] |
|
The EOS is backed by a shard, so I mean EOS shard leadership changes. It seems that node connectivity was flapping, as evidenced by these messages:

2015-11-03 06:43:29,050 | WARN | lt-dispatcher-19 | ClusterCoreDaemon | 150 - com.typesafe.akka.slf4j - 2.3.10 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.106.138.111:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.106.138.107:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.106.138.110:2550, status = Up)]
2015-11-03 06:43:33,294 | INFO | lt-dispatcher-16 | kka://opendaylight-cluster-data) | 150 - com.typesafe.akka.slf4j - 2.3.10 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.106.138.111:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.106.138.107:2550, status = Up)]
2015-11-03 07:07:52,287 | WARN | lt-dispatcher-21 | ClusterCoreDaemon | 150 - com.typesafe.akka.slf4j - 2.3.10 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.106.138.111:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.106.138.107:2550, status = Up)]

This means that akka's connectivity was flapping, which causes the EOS shard leader to re-assign ownership. I don't know what was happening in the environment; it could be a flaky network or long GC pauses causing timeouts (was enough memory allocated to the JVM?). So the entity ownership re-assignment flapping was a symptom of other environmental issues. There were also a lot of ModifiedNodeDoesNotExistException errors, as I mentioned, which seem to be orthogonal to the connection flapping but aren't healthy. |
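Editor's note: one quick way to address the "was enough memory allocated to the JVM?" question above is to check the heap actually granted to each Karaf JVM and its cumulative GC time at runtime. A small sketch using only standard JDK calls follows; the 8 GiB threshold mirrors the recommendation later in this thread and is an assumption, not a hard requirement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the maximum heap the JVM was started with and cumulative GC time,
// to help judge whether long GC pauses could explain akka unreachability.
public class JvmHealthCheck {

    public static void main(String[] args) {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.1f GiB%n", maxHeapBytes / (1024.0 * 1024 * 1024));
        if (maxHeapBytes < 8L * 1024 * 1024 * 1024) {
            // Assumed threshold, matching the 8 GB suggested later in this thread.
            System.out.println("Heap smaller than 8 GiB: consider raising -Xmx for the controller.");
        }

        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            totalGcMillis += Math.max(0, gc.getCollectionTime());
        }
        System.out.println("Cumulative GC time so far: " + totalGcMillis + " ms");
    }
}
```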
| Comment by Anil Vishnoi [ 02/Feb/16 ] |
|
I believe the ModifiedNodeDoesNotExistException errors are fixed on the latest stable/beryllium branch. The DataStoreUnavailableException is a valid exception: at the time you fired the request, no leader had been elected for the topology shard. Can you test this again with the latest stable/beryllium, and also make sure that the VMs running the controller instances have enough memory (8 G) and that the connection between these VMs is not flapping? |
| Comment by Anil Vishnoi [ 09/Feb/16 ] |
|
Hi Hari, most of the related issues are fixed in stable/beryllium, and I did not see this issue in my testing. I am closing this bug; if you see this issue in your environment, please re-open the bug. |
| Comment by Hari Prasidh [ 10/Feb/16 ] |
|
Anil, thanks for your response. |