[CONTROLLER-1479] Cluster HA test: 500 Server Error when failed owner instance is recovered Created: 30/Jan/16  Updated: 19/Oct/17  Resolved: 15/Feb/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Beryllium
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
is duplicated by OVSDB-270 Creating Bridge in Cluster Mode retur... Resolved
is duplicated by OVSDB-299 unable to read topology after recover... Resolved
External issue ID: 5153

 Description   

https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-verify-3node-clustering-helium-design/12/robot/report/log.html

I see this ERROR very often when I recover a failed owner instance and then try a PUT/GET operation against the datastore:

{"errors":{"error":[

{"error-type":"application","error-tag":"operation-failed","error-message":"canCommit encountered an unexpected failure","error-info":"java.lang.Exception: Error creating READ_WRITE transaction on shard inventory\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport.createTransactionContext(RemoteTransactionContextSupport.java:222)\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport.onCreateTransactionComplete(RemoteTransactionContextSupport.java:200)\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport.access$000(RemoteTransactionContextSupport.java:40)\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport$1.onComplete(RemoteTransactionContextSupport.java:135)\n\tat akka.dispatch.OnComplete.internal(Future.scala:247)\n\tat akka.dispatch.OnComplete.internal(Future.scala:245)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)\n\tat scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.actor.InvalidActorNameException: actor name [shard-member-1-txn-21] is not unique!\n\tat akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)\n\tat akka.actor.dungeon.Children$class.reserveChild(Children.scala:76)\n\tat akka.actor.ActorCell.reserveChild(ActorCell.scala:369)\n\tat akka.actor.dungeon.Children$class.makeChild(Children.scala:201)\n\tat akka.actor.dungeon.Children$class.actorOf(Children.scala:37)\n\tat akka.actor.ActorCell.actorOf(ActorCell.scala:369)\n\tat org.opendaylight.controller.cluster.datastore.ShardTransactionActorFactory.newShardTransaction(ShardTransactionFactory.java:60)\n\tat org.opendaylight.controller.cluster.datastore.Shard.createTypedTransactionActor(Shard.java:538)\n\tat org.opendaylight.controller.cluster.datastore.Shard.createTransaction(Shard.java:570)\n\tat org.opendaylight.controller.cluster.datastore.Shard.createTransaction(Shard.java:549)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleCreateTransaction(Shard.java:521)\n\tat org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:230)\n\tat akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)\n\tat org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)\n\tat akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)\n\tat akka.persistence.Recovery$State$class.process(Recovery.scala:30)\n\tat akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)\n\tat akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)\n\tat akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)\n\tat akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)\n\tat 
akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)\n\tat akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)\n\tat akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)\n\tat akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)\n\tat akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)\n\tat akka.actor.ActorCell.invoke(ActorCell.scala:487)\n\tat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)\n\tat akka.dispatch.Mailbox.run(Mailbox.scala:220)\n\t... 5 more\n"}

]}}



 Comments   
Comment by Tom Pantelis [ 02/Feb/16 ]

The error is "akka.actor.InvalidActorNameException: actor name [shard-member-1-txn-21] is not unique!". I haven't seen this before. The id #21 is generated via an atomic long counter, which is thread-safe, so the id value is unique within a process instance. So node member-1 must have been restarted and counted up to id #21 again while the old transaction with the same id was still active, or at least its actor was. Transaction actors self-destruct after 10 minutes of inactivity. If the old transaction was already "readied", it's possible the shard was backed up with enough prior transactions that member-1 restarted and re-used id #21. Kind of a strange scenario. It seems we need to use a UUID instead of a counter to be safe.
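A minimal sketch of the collision scenario described above (hypothetical helper names, not the actual controller code): a per-process counter restarts from zero after a node restart, so the same actor name can be generated while the old actor is still alive, whereas appending a per-run value such as a boot timestamp keeps names unique across restarts.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: counterId/timestampedId are illustrative, not ODL APIs.
public class TxnIdDemo {

    // Name built from a per-process counter, as in the failing scenario.
    static String counterId(AtomicLong counter, String member) {
        return "shard-" + member + "-txn-" + counter.incrementAndGet();
    }

    // Name with an extra per-run suffix (e.g. a boot timestamp),
    // the idea behind the proposed fix.
    static String timestampedId(AtomicLong counter, String member, long bootTime) {
        return "shard-" + member + "-txn-" + counter.incrementAndGet() + "-" + bootTime;
    }

    public static void main(String[] args) {
        // First run of the process reaches transaction #21.
        String before = counterId(new AtomicLong(20), "member-1");

        // After a restart the counter starts over and reaches #21 again,
        // colliding with the still-alive old actor's name.
        String after = counterId(new AtomicLong(20), "member-1");
        System.out.println(before.equals(after)); // same name -> InvalidActorNameException risk

        // With a per-run suffix the two runs produce distinct names.
        String ts1 = timestampedId(new AtomicLong(20), "member-1", 1000L);
        String ts2 = timestampedId(new AtomicLong(20), "member-1", 2000L);
        System.out.println(ts1.equals(ts2));
    }
}
```

This is why the patch mentioned below adds a timestamp to the tx id: the counter alone only guarantees uniqueness within one process lifetime.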

Comment by Luis Gomez [ 03/Feb/16 ]

Adding the karaf log for the failing instance:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-clustering-helium-design-only-beryllium/lastSuccessfulBuild/artifact/odl3_karaf.log

I will try this test manually to see if it is easy to reproduce.

Comment by Tom Pantelis [ 03/Feb/16 ]

Submitted https://git.opendaylight.org/gerrit/#/c/34019/ to add a timestamp to the tx id.

Comment by Luis Gomez [ 04/Feb/16 ]

I cannot actually reproduce this issue outside ODL Jenkins, so I am reducing the priority to normal. I guess the patch is still good to have, right?

Comment by Tom Pantelis [ 04/Feb/16 ]

The patch doesn't hurt and would be good to avoid intermittent test failures.

Generated at Wed Feb 07 19:55:39 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.