Details
Bug
Status: Resolved
Resolution: Done
Helium
None
None
Operating System: All
Platform: All
3154
High
Description
The following error was seen after a controller restart (Helium SR2):
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Metadata not available for modification [NodeModification [identifier=(com:brocade:neutron:odl?revision=2014-10-02)subnet, modificationType=SUBTREE_MODIFIED, childModification={(com:brocade:neutron:odl?revision=2014-10-02)subnet[{(com:brocade:neutron:odl?revision=2014-10-02)id=ace19864-b874-47a9-9cef-b02afd52f37b}]=NodeModification [identifier=(com:brocade:neutron:odl?revision=2014-10-02)subnet[{(com:brocade:neutron:odl?revision=2014-10-02)id=ace19864-b874-47a9-9cef-b02afd52f37b}], modificationType=DELETE, childModification={}]}]]
at java.util.concurrent.FutureTask.report(FutureTask.java:122)[:1.7.0_76]
at java.util.concurrent.FutureTask.get(FutureTask.java:188)[:1.7.0_76]
at org.opendaylight.controller.cluster.datastore.Shard.syncCommitTransaction(Shard.java:586)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
at org.opendaylight.controller.cluster.datastore.Shard.onRecoveryComplete(Shard.java:729)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
at org.opendaylight.controller.cluster.raft.RaftActor.onRecoveryCompletedMessage(RaftActor.java:257)[294:org.opendaylight.controller.sal-akka-raft:1.1.2.Helium-SR2]
at org.opendaylight.controller.cluster.raft.RaftActor.handleRecover(RaftActor.java:160)[294:org.opendaylight.controller.sal-akka-raft:1.1.2.Helium-SR2]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveRecover(AbstractUntypedPersistentActor.java:52)[293:org.opendaylight.controller.sal-clustering-commons:1.1.2.Helium-SR2]
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveRecover(Shard.java:237)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
The failing modification is a node delete, and "Metadata not available ..." appears to indicate that the node being deleted doesn't exist. If that's true, how did this modification entry get into the persisted journal? Transaction modifications should only be written to the journal if the transaction succeeds.
The consequence of this failure is that the rest of the data failed to recover as well. This is because we batch journal entries, 5000 at a time, into a single transaction. Batching performs better, but the side effect is that one failed modification fails the whole transaction and everything in it.
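To make the batching trade-off concrete, here is a minimal sketch of one possible mitigation: try the whole batch first, and only if it fails, replay the entries one at a time so that a single bad modification is skipped rather than failing everything. This is not the actual Shard recovery code. The `RecoverySketch` class, the `Entry` type, the map-backed store, and all method names are hypothetical stand-ins invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for journal recovery; not the real Shard implementation. */
public class RecoverySketch {

    /** One journal entry: a write when value is non-null, a delete when it is null. */
    static final class Entry {
        final String key;
        final String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Applies one entry; deleting a missing key fails, mimicking the error in this report. */
    static void apply(Map<String, String> store, Entry e) {
        if (e.value != null) {
            store.put(e.key, e.value);
        } else if (store.remove(e.key) == null) {
            throw new IllegalArgumentException("Metadata not available for modification: " + e.key);
        }
    }

    /** Applies a batch all-or-nothing by working on a copy of the store. */
    static Map<String, String> applyBatch(Map<String, String> store, List<Entry> batch) {
        Map<String, String> copy = new HashMap<>(store);
        for (Entry e : batch) {
            apply(copy, e);
        }
        return copy;
    }

    /**
     * Tries the fast batched path first. If the batch fails, replays the entries
     * individually and records the ones that cannot be applied in {@code skipped}.
     */
    static Map<String, String> recover(Map<String, String> store, List<Entry> batch,
                                       List<Entry> skipped) {
        try {
            return applyBatch(store, batch);
        } catch (IllegalArgumentException batchFailure) {
            Map<String, String> result = new HashMap<>(store);
            for (Entry e : batch) {
                try {
                    apply(result, e);
                } catch (IllegalArgumentException entryFailure) {
                    skipped.add(e); // keep going instead of failing the whole batch
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        List<Entry> journal = new ArrayList<>();
        journal.add(new Entry("subnet-a", "10.0.0.0/24"));
        journal.add(new Entry("subnet-missing", null)); // delete of a node that does not exist
        journal.add(new Entry("subnet-b", "10.0.1.0/24"));

        List<Entry> skipped = new ArrayList<>();
        Map<String, String> recovered = recover(new HashMap<>(), journal, skipped);
        // prints "2 entries recovered, 1 skipped"
        System.out.println(recovered.size() + " entries recovered, " + skipped.size() + " skipped");
    }
}
```

The sketch keeps the batched path for the common case and pays the per-entry cost only after a failure. Whether it is safe to skip a failed entry, rather than stop and flag the node as unhealthy, is a separate design question that this sketch does not settle.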
In addition, the failed entry remains in the RaftActor's in-memory journal, so in a 3-node cluster, if this node becomes the leader, it wipes out the other nodes' data too. We need to prevent a corrupted journal (or a recovery failure) on one node from corrupting the whole cluster.