CONTROLLER-1297

Clustering: Journal recovery error on restart

Details

    • Bug
    • Status: Resolved
    • Resolution: Done
    • Helium
    • None
    • clustering
    • None
    • Operating System: All
      Platform: All

    • 3154
    • High

    Description

      The following error was seen after a controller restart (Helium SR2):

      java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Metadata not available for modification [NodeModification [identifier=(com:brocade:neutron:odl?revision=2014-10-02)subnet, modificationType=SUBTREE_MODIFIED, childModification={(com:brocade:neutron:odl?revision=2014-10-02)subnet[{(com:brocade:neutron:odl?revision=2014-10-02)id=ace19864-b874-47a9-9cef-b02afd52f37b}]=NodeModification [identifier=(com:brocade:neutron:odl?revision=2014-10-02)subnet[{(com:brocade:neutron:odl?revision=2014-10-02)id=ace19864-b874-47a9-9cef-b02afd52f37b}], modificationType=DELETE, childModification={}]}]]
      at java.util.concurrent.FutureTask.report(FutureTask.java:122)[:1.7.0_76]
      at java.util.concurrent.FutureTask.get(FutureTask.java:188)[:1.7.0_76]
      at org.opendaylight.controller.cluster.datastore.Shard.syncCommitTransaction(Shard.java:586)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
      at org.opendaylight.controller.cluster.datastore.Shard.onRecoveryComplete(Shard.java:729)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
      at org.opendaylight.controller.cluster.raft.RaftActor.onRecoveryCompletedMessage(RaftActor.java:257)[294:org.opendaylight.controller.sal-akka-raft:1.1.2.Helium-SR2]
      at org.opendaylight.controller.cluster.raft.RaftActor.handleRecover(RaftActor.java:160)[294:org.opendaylight.controller.sal-akka-raft:1.1.2.Helium-SR2]
      at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveRecover(AbstractUntypedPersistentActor.java:52)[293:org.opendaylight.controller.sal-clustering-commons:1.1.2.Helium-SR2]
      at org.opendaylight.controller.cluster.datastore.Shard.onReceiveRecover(Shard.java:237)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]

      The failing modification is a node delete, and "Metadata not available ..." appears to indicate that the node does not exist. If so, how did this modification get into the persisted journal? A transaction's modifications should be written to the journal only if the transaction succeeds.

      As a consequence of this failure, the rest of the data failed to recover as well. This is because we batch journal entries, 5000 at a time, into a single transaction. Batching is faster, but the side effect is that one failed modification fails everything.
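
The batching behaviour described above, and one possible way to limit its blast radius, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Shard implementation or the fix that was applied: the class `BatchRecoverySketch`, the `Entry` type, and the in-memory `Map` standing in for the data tree are all hypothetical. The idea is to try the whole batch as one all-or-nothing transaction and, only if it fails, replay that batch entry by entry so that a single bad modification is skipped instead of taking the rest of the data with it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in the controller code base.
final class BatchRecoverySketch {

    /** Hypothetical journal entry; a null value means "delete the key". */
    static final class Entry {
        final String key;
        final String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Applies one entry. Deleting a missing key fails, mimicking the reported error. */
    static void apply(Map<String, String> store, Entry e) {
        if (e.value == null) {
            if (!store.containsKey(e.key)) {
                throw new IllegalArgumentException("Metadata not available for " + e.key);
            }
            store.remove(e.key);
        } else {
            store.put(e.key, e.value);
        }
    }

    /** All-or-nothing: works on a copy and publishes it only if every entry applied. */
    static boolean applyBatch(Map<String, String> store, List<Entry> batch) {
        Map<String, String> working = new HashMap<>(store);
        try {
            for (Entry e : batch) {
                apply(working, e);
            }
        } catch (IllegalArgumentException ex) {
            return false; // the store is left untouched
        }
        store.clear();
        store.putAll(working);
        return true;
    }

    /** Recovers in batches; a failed batch is replayed one entry at a time. Returns skipped entries. */
    static List<Entry> recover(Map<String, String> store, List<Entry> journal, int batchSize) {
        List<Entry> skipped = new ArrayList<>();
        for (int i = 0; i < journal.size(); i += batchSize) {
            List<Entry> batch = journal.subList(i, Math.min(i + batchSize, journal.size()));
            if (applyBatch(store, batch)) {
                continue;
            }
            for (Entry e : batch) {
                try {
                    apply(store, e);
                } catch (IllegalArgumentException ex) {
                    skipped.add(e); // record the bad entry instead of failing the rest
                }
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        List<Entry> journal = Arrays.asList(
                new Entry("subnet-1", "a"),
                new Entry("subnet-2", null), // delete of a node that does not exist
                new Entry("subnet-3", "b"));
        List<Entry> skipped = recover(store, journal, 5000);
        // prints "skipped=1 recovered=2"
        System.out.println("skipped=" + skipped.size() + " recovered=" + store.size());
    }
}
```

The fast path still costs one transaction per batch; the slower per-entry replay only runs for a batch that has already failed. Whether skipping a failed entry is acceptable at all is a separate question, since it means the recovered state no longer matches the journal.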

      In addition, the failed entry remains in the RaftActor's in-memory journal, so in a 3-node cluster, if this node becomes the leader, it wipes out the data on the other nodes too. We need to prevent a corrupted journal (or a recovery failure) on one node from corrupting the whole cluster.
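
One conceivable guard against the scenario above is for a node to remember that its recovery failed and to decline to stand for election while in that state, so that it cannot become leader and replicate a suspect journal. The sketch below only illustrates that idea; `RecoveryGuardSketch` and its methods are hypothetical and are not part of RaftActor, and this is not necessarily how the issue was resolved.

```java
// Hypothetical sketch: not part of sal-akka-raft.
final class RecoveryGuardSketch {

    enum RecoveryState { HEALTHY, FAILED }

    private RecoveryState state = RecoveryState.HEALTHY;

    /** Called when replaying the journal throws, as in the stack trace above. */
    void onRecoveryFailure() {
        state = RecoveryState.FAILED;
    }

    /** Called once the node's state has been restored, e.g. from a healthy leader. */
    void onStateRestored() {
        state = RecoveryState.HEALTHY;
    }

    /** A node with a failed recovery stays a follower instead of becoming a candidate. */
    boolean mayStartElection() {
        return state == RecoveryState.HEALTHY;
    }
}
```

With such a check, a node whose recovery failed would wait for a healthy leader to bring it up to date before taking part in elections again.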

          People

            tpantelis Tom Pantelis
            tpantelis Tom Pantelis