[CONTROLLER-2043] Circuit breaker timeout with BGP and tell-based protocol Created: 24/Apr/22  Updated: 16/Jan/24

Status: Confirmed
Project: controller
Component/s: clustering
Affects Version/s: 5.0.0, 5.0.2, 5.0.3, 5.0.5
Fix Version/s: 10.0.0, 9.0.1

Type: Bug Priority: Highest
Reporter: Robert Varga Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File data-1.log.xz     File segmented-journal.tar.xz     PNG File vim-data-1-log.png    
Issue Links:
Blocks
is blocked by CONTROLLER-2089 Do not store writerUuid in sal-akka-s... In Progress
Relates
relates to CONTROLLER-2056 Minimize serialization proxy names in... Resolved
relates to CONTROLLER-2064 Switch to using RaftVersion.ARGON Resolved
relates to CONTROLLER-2071 Switch to our fork of atomix-storage Resolved
relates to CONTROLLER-2037 Fail to serialize oversized message Resolved
relates to CONTROLLER-2026 Circuit breaker timeout Confirmed
relates to CONTROLLER-2044 Improve sal-akka-raft serialization p... Confirmed

 Description   

We are hitting this problem:

2022-04-24T06:12:59,752 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | Shard                            | 284 - org.opendaylight.controller.repackaged-akka - 5.0.2 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [262218] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$13: Circuit Breaker Timed out.
2022-04-24T06:12:59,767 | INFO  | opendaylight-cluster-data-shard-dispatcher-31 | Shard                            | 289 - org.opendaylight.controller.sal-clustering-commons - 5.0.2 | Stopping Shard member-1-shard-default-config

in the BGP test suite:
https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/bgpcep-csit-1node-bgp-ingest-all-sulfur/216/odl_1/odl1_karaf.log.gz

Previous versions work out fine, except they end up saving a snapshot only at around the 160K mark, because the ask-based protocol they use generates fewer persistence entries.



 Comments   
Comment by Robert Varga [ 24/Apr/22 ]

This looks like a reproducible failure, as the same entry fails here:
https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/bgpcep-csit-1node-bgp-ingest-all-sulfur/212/odl_1/odl1_karaf.log.gz
https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/bgpcep-csit-1node-bgp-ingest-all-sulfur/213/odl_1/odl1_karaf.log.gz

Comment by Robert Varga [ 24/Apr/22 ]

Debug logs are not turning up anything, really:

2022-04-24T12:09:50,035 | INFO  | opendaylight-cluster-data-shard-dispatcher-23 | Shard                            | 289 - org.opendaylight.controller.sal-clustering-commons - 5.0.2 | member-1-shard-default-config: SaveSnapshotSuccess received for snapshot, sequenceNr: 240000
2022-04-24T12:09:50,041 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | SegmentedJournalActor            | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | member-1-shard-default-config: delete messages DeleteMessagesTo{toSequenceNr=240000}
2022-04-24T12:09:50,041 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | SegmentedJournalActor            | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | member-1-shard-default-config: adjusted delete to 240000
2022-04-24T12:09:50,041 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | SegmentedJournalActor            | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | member-1-shard-default-config: deleting entries up to 240000
2022-04-24T12:09:50,041 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | SegmentedJournalActor            | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | member-1-shard-default-config: compaction started
2022-04-24T12:09:50,041 | DEBUG | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | SegmentedJournalActor            | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | member-1-shard-default-config: compaction finished
2022-04-24T12:10:30,351 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | Shard                            | 284 - org.opendaylight.controller.repackaged-akka - 5.0.2 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [262218] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$13: Circuit Breaker Timed out.
Comment by Robert Varga [ 24/Apr/22 ]

Trace logs are not much better:

2022-04-24T18:16:44,045 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262213,5ca0c0d4-05cd-427a-83a0-66b9a281224b,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$o#-1890472742]
2022-04-24T18:16:44,046 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262214,5ca0c0d4-05cd-427a-83a0-66b9a281224b,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$o#-1890472742]
2022-04-24T18:16:44,047 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262215,5ca0c0d4-05cd-427a-83a0-66b9a281224b,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$o#-1890472742]
2022-04-24T18:16:44,048 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262216,5ca0c0d4-05cd-427a-83a0-66b9a281224b,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$o#-1890472742]
2022-04-24T18:16:44,048 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262217,5ca0c0d4-05cd-427a-83a0-66b9a281224b,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$o#-1890472742]
2022-04-24T18:16:44,050 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 286 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.2 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262218,5ca0c0d4-05cd-427a-83a0-66b9a281224b,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$o#-1890472742]
2022-04-24T18:16:54,074 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Shard                            | 284 - org.opendaylight.controller.repackaged-akka - 5.0.2 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [262218] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$13: Circuit Breaker Timed out.

It looks as though the appender just seizes up after happily handling 260K+ entries at a rate of 1000+ appends/s. I have added more trace points, added toString() to the payload implementations and upgraded Atomix. We'll retry this with 5.0.3.

Comment by Robert Varga [ 26/Apr/22 ]

Okay, it seems we are narrowing this down to Atomix:

2022-04-26T11:51:17,379 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262217,7dfbbdb1-71ba-467d-868f-73ab4924eb12,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$t#1258880154]
2022-04-26T11:51:17,380 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: append 0/1: 262216 items at mark 1
2022-04-26T11:51:17,380 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: starting append of ApplyJournalEntries [toIndex=131107]
2022-04-26T11:51:17,380 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: finished append of ApplyJournalEntries [toIndex=131107] with 147 bytes at 262217
2022-04-26T11:51:17,382 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262218,7dfbbdb1-71ba-467d-868f-73ab4924eb12,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$t#1258880154]
2022-04-26T11:51:17,382 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: append 0/1: 262217 items at mark 1
2022-04-26T11:51:17,383 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: starting append of SimpleReplicatedLogEntry [index=131108, term=1, payload=Simple{size=823}]
2022-04-26T11:51:27,400 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Shard                            | 283 - org.opendaylight.controller.repackaged-akka - 5.0.3 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [262218] for persistenceId [member-1-shard-default-config].

This corresponds to a CommitTransactionPayload, nothing special. We have persisted a similar payload just before:

2022-04-26T11:51:17,375 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262213,7dfbbdb1-71ba-467d-868f-73ab4924eb12,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$t#1258880154]
2022-04-26T11:51:17,376 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: append 0/1: 262212 items at mark 1
2022-04-26T11:51:17,376 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: starting append of SimpleReplicatedLogEntry [index=131106, term=1, payload=Simple{size=823}]
2022-04-26T11:51:17,376 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: finished append of SimpleReplicatedLogEntry [index=131106, term=1, payload=Simple{size=823}] with 1096 bytes at 262213
Comment by Robert Varga [ 26/Apr/22 ]

Alright, so some more evidence is here:

2022-04-26T11:51:17,382 | TRACE | opendaylight-cluster-data-akka.persistence.dispatchers.default-plugin-dispatcher-52 | SegmentedFileJournal             | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | Sending WriteMessages{requests=[AtomicWrite(List(PersistentRepr(member-1-shard-default-config,262218,7dfbbdb1-71ba-467d-868f-73ab4924eb12,0,None)))]} to Actor[akka://opendaylight-cluster-data/system/akka.persistence.journal.segmented-file/$t#1258880154]
2022-04-26T11:51:27,400 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Shard                            | 283 - org.opendaylight.controller.repackaged-akka - 5.0.3 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [262218] for persistenceId [member-1-shard-default-config].
2022-04-26T11:52:22,301 | TRACE | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | DataJournalV0                    | 285 - org.opendaylight.controller.sal-akka-segmented-journal - 5.0.3 | member-1-shard-default-config: finished append of SimpleReplicatedLogEntry [index=131108, term=1, payload=Simple{size=823}] with 1096 bytes at 262218

So the append took a good 65 seconds to complete, but eventually it did complete. Unfortunately we do not store node performance data, so I do not have a smoking gun. What I suspect is happening is that we are facing a segment file rotation.
Looking at the numbers:

  • we have a 128MiB journal (134 217 728 bytes)
  • we have written 262217 entries totalling 132 119 382 bytes of data
  • each entry carries an additional 8 bytes of bookkeeping (length + crc32), for a total of 2 097 736 bytes
  • hence we have 134 217 694 bytes stored

So indeed we are now flushing the current segment and allocating a new one. We might be looking at flushing ~128MiB down to storage, which has unknown performance characteristics, but the driver node peaks at 9827 blocks written per second – which amounts to a rather underwhelming 4.8MiB/s.
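
A quick back-of-the-envelope check of the above (a sketch using the figures quoted in this comment; the 512-byte block size behind the throughput estimate is an assumption):

public final class SegmentMath {
    public static void main(final String[] args) {
        final long segmentSize = 128L * 1024 * 1024;  // 134 217 728 bytes
        final long entries = 262_217;                 // entries written so far
        final long dataBytes = 132_119_382L;          // payload bytes quoted above
        final long overhead = entries * 8;            // length + crc32 per entry = 2 097 736 bytes
        // The running total lands within a few hundred bytes of the segment size,
        // i.e. right where a new segment has to be allocated.
        System.out.println(segmentSize - (dataBytes + overhead));
        // Peak of 9827 blocks written per second, assuming 512-byte blocks: ~4.8 MiB/s.
        System.out.println(9827L * 512 / (1024.0 * 1024.0));
    }
}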

Comment by Robert Varga [ 26/Apr/22 ]

We are already performing a flush() after each write, so I am at a loss as to what is taking 65 seconds: allocation of the new file? Its mapping into memory?
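
For illustration, a minimal sketch (assumed, not the actual Atomix code; file name and size are made up) of the segment allocation and memory mapping that happens at rotation time:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class SegmentAllocation {
    public static void main(final String[] args) throws IOException {
        final long segmentSize = 128L * 1024 * 1024;
        // Creating the new 128MiB segment file and mapping it into memory are the
        // two operations performed on rotation, on top of flushing the previous segment.
        try (FileChannel channel = FileChannel.open(Path.of("segment-2.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            final MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, segmentSize);
            buffer.force();   // analogous to the flush() we already do after each write
        }
    }
}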

Comment by Robert Varga [ 26/Apr/22 ]

So the next step here is to create a unit test for SegmentedFileJournal which performs something akin to the failure scenario.
Take SegmentedFileJournalTest as a how-to for the new test. The test should (a rough sketch of the timing loop is included below):

  • set up a new journal with default parameters (128MiB segments, 16MiB message size)
  • write 1 million 1KiB messages into it, timing each write with a Stopwatch
  • assert that each write takes less than 5 seconds

1M x 1KiB == 1GiB + overhead, which means we should hit segment changeover 8 times.

Once we have that, we can see whether we can make it fail during local runs and, if so, what exactly the operation is that takes so horribly long. If not, we will have something to throw at CSIT worker nodes to see how they react to it.
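
A rough sketch of the timing loop such a test could use (the writeMessage callback is a stand-in for however the real test, modelled on SegmentedFileJournalTest, persists a single entry – this is not the actual test code):

import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public final class WriteLatencyHarness {
    private static final int MESSAGE_COUNT = 1_000_000;  // 1M messages, i.e. ~1GiB of payload
    private static final int MESSAGE_SIZE = 1024;        // 1KiB payload each
    private static final long MAX_WRITE_MILLIS = 5_000;  // per-write budget from the plan above

    public static void run(final Consumer<byte[]> writeMessage) {
        final byte[] payload = new byte[MESSAGE_SIZE];
        final Stopwatch sw = Stopwatch.createUnstarted();

        for (int i = 0; i < MESSAGE_COUNT; i++) {
            sw.reset().start();
            writeMessage.accept(payload);
            final long elapsed = sw.elapsed(TimeUnit.MILLISECONDS);
            if (elapsed >= MAX_WRITE_MILLIS) {
                throw new AssertionError("write " + i + " took " + elapsed + " ms");
            }
        }
    }
}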

Comment by Robert Varga [ 04/Dec/22 ]

CONTROLLER-2056 lowers payload serialization overheads. CONTROLLER-2064 lowers baseline journal entry sizes. Both of these should improve the number of entries we can write out before having to rotate.

Comment by Robert Varga [ 26/Jan/23 ]

So we get a bit more:

https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/bgpcep-csit-1node-bgp-ingest-all-argon/119/odl_1/odl1_karaf.log.gz

now fails at:

2023-01-26T12:46:58,006 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | Shard                            | 265 - org.opendaylight.controller.repackaged-akka - 7.0.2 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.SimpleReplicatedLogEntry] with sequence number [313062] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$13: Circuit Breaker Timed out.

i.e. we get to entry 313062 instead of 262218 for the same journal (a ~19% improvement) and process 178203 routes instead of 165485 (~8% more), coming very close to the target of 180000.

Comment by Robert Varga [ 26/Jan/23 ]

The good news is that openflowplugin-csit-3node-clustering-bulkomatic-only no longer fails, i.e. it now just fits into the log before rotation.

Comment by Robert Varga [ 30/Jan/23 ]

I think there is another part which might be worth optimizing: here we are writing two Strings, both of which seem to be good candidates for a simpler representation.

What we really need to do first, though, is get hold of a segmented-file directory contents from CSIT.

So the next step is to run either BGP or OFP-bulkomatic tests locally.

Comment by Robert Varga [ 07/Feb/23 ]

Looking at a fresh log (from feature:install odl-netconf-mdsal), it seems the UUID part is there: we have "4ceb3db5-5148-414a-a2c7-cdeae7c8d56" repeating 33 times, which in turn matches the number of "org.opendaylight.controller.cluster.raft.persisted.*" objects persisted – i.e. the number of entries in that journal. That is 33x35=1155 bytes we could perhaps squash to a definition + a WritableObject.writeLong() – potentially bringing that down to 35+2+32*2=103 bytes.
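
A minimal sketch of that interning idea (illustrative only, not the actual sal-akka-raft serialization code): write the full string the first time, then a 2-byte back-reference for each repeat:

import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public final class InterningStringWriter {
    private final Map<String, Integer> seen = new HashMap<>();

    public void writeString(final DataOutput out, final String str) throws IOException {
        final Integer ref = seen.get(str);
        if (ref != null) {
            out.writeByte(1);       // marker: back-reference to an already-written string
            out.writeShort(ref);    // 2-byte reference instead of the full string
        } else {
            out.writeByte(0);       // marker: definition
            out.writeUTF(str);      // full string, written exactly once
            seen.put(str, seen.size());
        }
    }
}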

The other part is that we keep on repeating those cluster.raft.persisted object names – I suspect we could take control of those as well IFF we ditch Kryo and create a custom ObjectOutputStream, similar in some regards to io.netty.handler.codec.serialization.CompactObjectOutputStream (see the sketch at the end of this comment).

Yet another part is that the persisted payload class names repeat – "org.opendaylight.controller.cluster.datastore.persisted.CT" (i.e. CommitTransaction) repeats a number of times. This is coming from CDS itself, hence the custom OOS should be pluggable even without ditching Kryo.
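
As a rough illustration (an assumed sketch loosely modelled on Netty's CompactObjectOutputStream, not an existing controller class), the descriptor override would look roughly like this; the reading side would need a matching ObjectInputStream override to resolve the name:

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.OutputStream;

public final class CompactNameObjectOutputStream extends ObjectOutputStream {
    public CompactNameObjectOutputStream(final OutputStream out) throws IOException {
        super(out);
    }

    @Override
    protected void writeClassDescriptor(final ObjectStreamClass desc) throws IOException {
        // Write only the class name instead of the full descriptor; repeated names
        // could further be interned, as sketched for the UUID strings above.
        writeUTF(desc.getName());
    }
}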

Comment by Samuel Schneider [ 15/Feb/23 ]

I ran these tests locally:
integration/test/csit/suites/bgpcep/bgpclustering/singlepeer_pc_shm_300kroutes.robot
integration/test/csit/suites/bgpcep/bgpingest/singlepeer_prefixcount.robot
integration/test/csit/suites/bgpcep/bgpingest/singlepeer_changecount.robot

with
feature:install odl-restconf odl-jolokia odl-bgpcep-data-change-counter odl-bgpcep-bgp odl-bgpcep-bgp-config-example

Will attach the segmented-journal directory from karaf as segmented-journal.tar.xz.
