[CONTROLLER-1328] Clustering: Recovery misses flows installed Created: 18/May/15  Updated: 02/Jun/15  Resolved: 02/Jun/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Kamal Rameshan Assignee: Kamal Rameshan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File recovery-issue-1.log    
External issue ID: 3260
Priority: Highest

 Description   

Steps to reproduce
1. Create an OFP distribution on master
2. Start the controller install odl-openflowplugin-all
3. Connect mininet - sudo mn --controller=remote,ip=<host> --topo tree,3
4. Run config flow blaster a couple of times
./flow_config_blaster.py --auth --threads 5 --flows 5000 --no-delete
5. Check the number of flows in config and note the number
5. Stop controller gracefully.
6. Stop mininet
7. Start the controller
8. Connect mininet
9. Check the number of flows in config. Around 400 are missing.

The log is filled with messages :

2015-05-18 14:07:20,796 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | Recovery snapshot applied for member-1-shard-inventory-config in 2.119 s: snapshotIndex=39999, snapshotTerm=1, journal-size=0
2015-05-18 14:07:20,805 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received ReplicatedLogEntry for recovery: index: 40322, size: 1445
2015-05-18 14:07:20,806 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received ReplicatedLogEntry for recovery: index: 40323, size: 1441
2015-05-18 14:07:20,806 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received ReplicatedLogEntry for recovery: index: 40324, size: 1441
2015-05-18 14:07:20,806 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received ReplicatedLogEntry for recovery: index: 40325, size: 1445
2015-05-18 14:07:20,807 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received apply journal entries for recovery, applying to state: 40000 to 40322
2015-05-18 14:07:20,813 | ERROR | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | Log entry not found for index 40004
2015-05-18 14:07:20,813 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received apply journal entries for recovery, applying to state: 40004 to 40323
2015-05-18 14:07:20,814 | ERROR | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | Log entry not found for index 40004
2015-05-18 14:07:20,814 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received apply journal entries for recovery, applying to state: 40004 to 40324
2015-05-18 14:07:20,814 | ERROR | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | Log entry not found for index 40004
2015-05-18 14:07:20,814 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received apply journal entries for recovery, applying to state: 40004 to 40325
2015-05-18 14:07:20,814 | ERROR | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | Log entry not found for index 40004
2015-05-18 14:07:20,814 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received ReplicatedLogEntry for recovery: index: 40326, size: 1441
2015-05-18 14:07:20,814 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received apply journal entries for recovery, applying to state: 40004 to 40326
2015-05-18 14:07:20,814 | ERROR | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | Log entry not found for index 40005
2015-05-18 14:07:20,814 | INFO | lt-dispatcher-25 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Received ReplicatedLogEntry for recovery: index: 40327, size: 1445

The snapshot was taken at index=39999 with 0 unapplied entries.
The persistent journal starts with the index=40322.
Where did the 322 logs go?



 Comments   
Comment by Kamal Rameshan [ 18/May/15 ]

Attachment recovery-issue-1.log has been added with description: recovery log

Comment by Kamal Rameshan [ 18/May/15 ]

Sanpshots are taken with 0 unapplied entries at batches of 20000

2015-05-18 14:02:42,024 | INFO | lt-dispatcher-33 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Persisting of snapshot done:Snapshot=

{lastTerm:1, lastIndex:39999, LastAppliedIndex:39999, LastAppliedTerm:1, UnAppliedEntries size:0}

2015-05-18 14:02:42,025 | INFO | lt-dispatcher-33 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: Removed in-memory snapshotted entries, adjusted snaphsotIndex:39999 and term:1
2015-05-18 14:02:42,132 | INFO | lt-dispatcher-21 | Shard | 175 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-inventory-config: SaveSnapshotSuccess received for snapshot

Comment by Tom Pantelis [ 18/May/15 ]

The "Received ReplicatedLogEntry for recovery" output is logged to DEBUG. I assume you changed it to INFO in your build.

I suspect when we trimmed the persistent journal after the snapshot we blew away the 322 journal entries that occurred in between the time the snapshot was started and it was committed. So something's wrong there. Either the lastSequenceNumber to delete was incorrect or the 322 entries should've been in the unapplied list.

Comment by Kamal Rameshan [ 18/May/15 ]

Yes i did change the log levels to info (somehow logback.xml change was not working for me)

Am trying to dig in...

Comment by Tom Pantelis [ 18/May/15 ]

I've never used logback.xml - I think that's legacy. With karaf, you turn on debug in etc/org.pos4.pax.logging.cfg. It's standard log4j, e.g.

log4j.logger.org.opendaylight...Shard=DEBUG

I'll take a look at the code as well.

Comment by Kamal Rameshan [ 28/May/15 ]

https://git.opendaylight.org/gerrit/#/c/21372/

Generated at Wed Feb 07 19:55:15 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.