[CONTROLLER-1479] Cluster HA test: 500 Server Error when failed owner instance is recovered Created: 30/Jan/16 Updated: 19/Oct/17 Resolved: 15/Feb/16 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Beryllium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Luis Gomez | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||||||
| External issue ID: | 5153 | ||||||||||||
| Description |
|
I see this ERROR very often when I recover a failed owner instance and try to do a PUT/GET operation in the datastore: {"errors":{"error":[ {"error-type":"application","error-tag":"operation-failed","error-message":"canCommit encountered an unexpected failure","error-info":"java.lang.Exception: Error creating READ_WRITE transaction on shard inventory\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport.createTransactionContext(RemoteTransactionContextSupport.java:222)\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport.onCreateTransactionComplete(RemoteTransactionContextSupport.java:200)\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport.access$000(RemoteTransactionContextSupport.java:40)\n\tat org.opendaylight.controller.cluster.datastore.RemoteTransactionContextSupport$1.onComplete(RemoteTransactionContextSupport.java:135)\n\tat akka.dispatch.OnComplete.internal(Future.scala:247)\n\tat akka.dispatch.OnComplete.internal(Future.scala:245)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)\n\tat akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)\n\tat scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)\n\tat akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)\n\tat scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)\n\tat akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)\n\tat scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: akka.actor.InvalidActorNameException: actor name [shard-member-1-txn-21] is not unique!\n\tat akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)\n\tat akka.actor.dungeon.Children$class.reserveChild(Children.scala:76)\n\tat akka.actor.ActorCell.reserveChild(ActorCell.scala:369)\n\tat akka.actor.dungeon.Children$class.makeChild(Children.scala:201)\n\tat akka.actor.dungeon.Children$class.actorOf(Children.scala:37)\n\tat akka.actor.ActorCell.actorOf(ActorCell.scala:369)\n\tat org.opendaylight.controller.cluster.datastore.ShardTransactionActorFactory.newShardTransaction(ShardTransactionFactory.java:60)\n\tat org.opendaylight.controller.cluster.datastore.Shard.createTypedTransactionActor(Shard.java:538)\n\tat org.opendaylight.controller.cluster.datastore.Shard.createTransaction(Shard.java:570)\n\tat org.opendaylight.controller.cluster.datastore.Shard.createTransaction(Shard.java:549)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleCreateTransaction(Shard.java:521)\n\tat org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:230)\n\tat akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)\n\tat org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)\n\tat akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)\n\tat akka.persistence.Recovery$State$class.process(Recovery.scala:30)\n\tat akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)\n\tat akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)\n\tat akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)\n\tat akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)\n\tat akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)\n\tat akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)\n\tat akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)\n\tat akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)\n\tat akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)\n\tat akka.actor.ActorCell.invoke(ActorCell.scala:487)\n\tat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)\n\tat akka.dispatch.Mailbox.run(Mailbox.scala:220)\n\t... 5 more\n"}]}} |
| Comments |
| Comment by Tom Pantelis [ 02/Feb/16 ] |
|
The error is "akka.actor.InvalidActorNameException: actor name [shard-member-1-txn-21] is not unique!". Haven't seen this before. The id #21 is generated via an atomic long counter which is thread-safe so the id value is unique to the process instance. So node member-1 must have been restarted and counted up to id #21 again but the old transaction with the same id was still active or at least its actor was. Transaction actors will self-destruct after 10 min of inactivity. If the old transaction was already "readied" then it's possible the shard was backed up with enough prior transactions such that member-1 restarted and re-used id #21. Kind of a strange scenario. It seems we need to use a UUID instead of counter to be safe. |
| Comment by Luis Gomez [ 03/Feb/16 ] |
|
Adding karaf log for the failing instance: I will try this test manually to see if it is easy to reproduce. |
| Comment by Tom Pantelis [ 03/Feb/16 ] |
|
Submitted https://git.opendaylight.org/gerrit/#/c/34019/ to add a timestamp to the tx id. |
| Comment by Luis Gomez [ 04/Feb/16 ] |
|
I cannot actually reproduce this issue outside ODL jenkins, so reducing the priority to normal. I guess the patch is anyway something good to have right? |
| Comment by Tom Pantelis [ 04/Feb/16 ] |
|
The patch doesn't hurt and would be good to avoid intermittent test failures. |