[CONTROLLER-2037] Fail to serialize oversized message Created: 05/Apr/22  Updated: 26/May/22  Resolved: 26/May/22

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: 5.0.1, 5.0.2, 5.0.3
Fix Version/s: 6.0.0, 5.0.5

Type: Bug Priority: Medium
Reporter: Sangwook Ha Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to CONTROLLER-2031 Add specific config knob for the size... In Review
relates to CONTROLLER-2043 Circuit breaker timeout with BGP and ... Confirmed
relates to CONTROLLER-2044 Improve sal-akka-raft serialization p... Confirmed

 Description   

Found in openflowplugin-csit-3node-clustering-bulkomatic-only-sulfur/190.

Serialization of messages larger than the maximum allowed (524288 bytes) fails and later causes circuit breaker to time out:
https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-bulkomatic-only-sulfur/190/odl_2/odl2_karaf.log.gz

2022-04-04T15:18:46,456 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:46,979 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-41 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:47,314 | INFO  | ForkJoinPool-9-worker-3 | FlowReader                       | 287 - org.opendaylight.openflowplugin.applications.bulk-o-matic - 0.14.0 | Total Flows read: 10000
2022-04-04T15:18:47,500 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-42 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:48,023 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-40 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:48,233 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-10 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:48,539 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-10 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:49,063 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:49,579 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-11 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:50,099 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-10 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.171.35:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-04T15:18:50,257 | INFO  | opendaylight-cluster-data-shard-dispatcher-38 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: Initiating snapshot capture CaptureSnapshot [lastAppliedIndex=279998, lastAppliedTerm=5, lastIndex=279999, lastTerm=5, installSnapshotInitiated=, replicatedToAllIndex=-1, replicatedToAllTerm=-1, unAppliedEntries size=319, mandatoryTrim=false]
2022-04-04T15:18:50,259 | INFO  | opendaylight-cluster-data-shard-dispatcher-33 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: Persisting of snapshot done: Snapshot [lastIndex=279999, lastTerm=5, lastAppliedIndex=279998, lastAppliedTerm=5, unAppliedEntries size=319, state=org.opendaylight.controller.cluster.datastore.persisted.ShardSnapshotState@9b0c2c5, electionTerm=5, electionVotedFor=member-2-shard-inventory-config, ServerConfigPayload=null]
2022-04-04T15:18:50,259 | INFO  | opendaylight-cluster-data-shard-dispatcher-33 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: Removed in-memory snapshotted entries, adjusted snaphsotIndex: 279998 and term: 5
2022-04-04T15:18:50,311 | INFO  | opendaylight-cluster-data-shard-dispatcher-27 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: SaveSnapshotSuccess received for snapshot, sequenceNr: 389120
2022-04-04T15:18:50,607 | INFO  | opendaylight-cluster-data-shard-dispatcher-35 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config (Leader): Initiating install snapshot to follower member-1-shard-inventory-config: follower nextIndex: 240676, leader snapshotIndex: 279998, leader lastIndex: 281505, leader log size: 1507
2022-04-04T15:18:50,607 | INFO  | opendaylight-cluster-data-shard-dispatcher-35 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: Initiating snapshot capture CaptureSnapshot [lastAppliedIndex=281504, lastAppliedTerm=5, lastIndex=281505, lastTerm=5, installSnapshotInitiated=, replicatedToAllIndex=-1, replicatedToAllTerm=-1, unAppliedEntries size=1, mandatoryTrim=false] to install on member-1-shard-inventory-config
2022-04-04T15:18:50,643 | INFO  | opendaylight-cluster-data-shard-dispatcher-35 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: Persisting of snapshot done: Snapshot [lastIndex=281505, lastTerm=5, lastAppliedIndex=281504, lastAppliedTerm=5, unAppliedEntries size=1, state=org.opendaylight.controller.cluster.datastore.persisted.ShardSnapshotState@4add6906, electionTerm=5, electionVotedFor=member-2-shard-inventory-config, ServerConfigPayload=null]
2022-04-04T15:18:50,644 | INFO  | opendaylight-cluster-data-shard-dispatcher-35 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: Removed in-memory snapshotted entries, adjusted snaphsotIndex: 279998 and term: 5
2022-04-04T15:18:50,680 | INFO  | opendaylight-cluster-data-shard-dispatcher-35 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config: SaveSnapshotSuccess received for snapshot, sequenceNr: 392649
2022-04-04T15:18:50,824 | INFO  | opendaylight-cluster-data-shard-dispatcher-29 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | member-2-shard-inventory-config (Leader): Snapshot successfully installed on follower member-1-shard-inventory-config (last chunk 2) - matchIndex set to 281504, nextIndex set to 281505
2022-04-04T15:19:00,327 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-40 | Shard                            | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to deleteMessages toSequenceNr [389120] for persistenceId [member-2-shard-inventory-config] due to [null: Circuit Breaker Timed out.].
2022-04-04T15:19:00,329 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | Shard                            | 206 - org.opendaylight.controller.repackaged-akka - 5.0.1 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.ApplyJournalEntries] with sequence number [392650] for persistenceId [member-2-shard-inventory-config].
akka.pattern.CircuitBreaker$$anon$13: Circuit Breaker Timed out.
2022-04-04T15:19:00,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-36 | Shard                            | 211 - org.opendaylight.controller.sal-clustering-commons - 5.0.1 | Stopping Shard member-2-shard-inventory-config
2022-04-04T15:19:33,600 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-39 | ClientActorBehavior              | 199 - org.opendaylight.controller.cds-access-client - 5.0.1 | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=1}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-inventory-config#190559700], sessionId=2, version=SODIUM_SR1, maxMessages=1000, cookie=1, shard=inventory, dataTree=present}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=1}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-inventory-config#190559700], sessionId=2, version=SODIUM_SR1, maxMessages=1000, cookie=1, shard=inventory, dataTree=present}}
2022-04-04T15:19:33,600 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-39 | ClientActorBehavior              | 199 - org.opendaylight.controller.cds-access-client - 5.0.1 | member-2-frontend-datastore-config: refreshing backend for shard 1
2022-04-04T15:19:33,854 | INFO  | ForkJoinPool-9-worker-3 | AbstractClientConnection         | 199 - org.opendaylight.controller.cds-access-client - 5.0.1 | Capping ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=1}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-inventory-config#190559700], sessionId=2, version=SODIUM_SR1, maxMessages=1000, cookie=1, shard=inventory, dataTree=present}} throttle delay from 47 to 5 seconds
java.lang.Throwable: null
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendEntry(AbstractClientConnection.java:262) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.SimpleReconnectForwarder.forwardEntry(SimpleReconnectForwarder.java:18) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.TransmitQueue.enqueueOrForward(TransmitQueue.java:266) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.enqueueOrForward(AbstractClientConnection.java:183) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendEntry(AbstractClientConnection.java:257) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendRequest(AbstractClientConnection.java:159) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory.sendRequest(ProxyHistory.java:570) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.sendRequest(AbstractProxyTransaction.java:328) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.abort(AbstractProxyTransaction.java:446) ~[?:?]
	at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) [?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractClientHandle.commonAbort(AbstractClientHandle.java:84) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractClientHandle.abort(AbstractClientHandle.java:70) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.ClientBackedTransaction.close(ClientBackedTransaction.java:62) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.ClientBackedReadTransaction.close(ClientBackedReadTransaction.java:52) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.AbstractDOMBrokerTransaction.closeSubtransactions(AbstractDOMBrokerTransaction.java:87) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.DOMBrokerReadOnlyTransaction.close(DOMBrokerReadOnlyTransaction.java:47) [bundleFile:?]
	at org.opendaylight.mdsal.binding.dom.adapter.BindingDOMReadTransactionAdapter.close(BindingDOMReadTransactionAdapter.java:46) [bundleFile:?]
	at org.opendaylight.openflowplugin.applications.bulk.o.matic.FlowReader.readFlowsX(FlowReader.java:101) [bundleFile:?]
	at org.opendaylight.openflowplugin.applications.bulk.o.matic.FlowReader.run(FlowReader.java:67) [bundleFile:?]
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426) [?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
	at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
2022-04-04T15:19:38,617 | WARN  | ForkJoinPool.commonPool-worker-3 | AbstractShardBackendResolver     | 213 - org.opendaylight.controller.sal-distributed-datastore - 5.0.1 | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:151) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.onConnectResponse(AbstractShardBackendResolver.java:168) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$4(AbstractShardBackendResolver.java:161) ~[?:?]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
	at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479) [?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
	at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-2-shard-inventory-config#190559700)]] after [5000 ms]. Message of type [org.opendaylight.controller.cluster.access.commands.ConnectClientRequest]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
2022-04-04T15:19:38,857 | INFO  | ForkJoinPool-9-worker-3 | AbstractClientConnection         | 199 - org.opendaylight.controller.cds-access-client - 5.0.1 | Capping ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=1}, cookie=1, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-inventory-config#190559700], sessionId=2, version=SODIUM_SR1, maxMessages=1000, cookie=1, shard=inventory, dataTree=present}} throttle delay from 43 to 5 seconds
java.lang.Throwable: null
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendEntry(AbstractClientConnection.java:262) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.SimpleReconnectForwarder.forwardEntry(SimpleReconnectForwarder.java:18) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.TransmitQueue.enqueueOrForward(TransmitQueue.java:266) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.enqueueOrForward(AbstractClientConnection.java:183) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendEntry(AbstractClientConnection.java:257) ~[?:?]
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendRequest(AbstractClientConnection.java:159) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory.sendRequest(ProxyHistory.java:570) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.sendRequest(AbstractProxyTransaction.java:328) ~[?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.abort(AbstractProxyTransaction.java:446) ~[?:?]
	at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) [?:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractClientHandle.commonAbort(AbstractClientHandle.java:84) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractClientHandle.abort(AbstractClientHandle.java:70) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.ClientBackedTransaction.close(ClientBackedTransaction.java:62) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.ClientBackedReadTransaction.close(ClientBackedReadTransaction.java:52) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.AbstractDOMBrokerTransaction.closeSubtransactions(AbstractDOMBrokerTransaction.java:87) [bundleFile:?]
	at org.opendaylight.controller.cluster.databroker.DOMBrokerReadOnlyTransaction.close(DOMBrokerReadOnlyTransaction.java:47) [bundleFile:?]
	at org.opendaylight.mdsal.binding.dom.adapter.BindingDOMReadTransactionAdapter.close(BindingDOMReadTransactionAdapter.java:46) [bundleFile:?]
	at org.opendaylight.openflowplugin.applications.bulk.o.matic.FlowReader.readFlowsX(FlowReader.java:101) [bundleFile:?]
	at org.opendaylight.openflowplugin.applications.bulk.o.matic.FlowReader.run(FlowReader.java:67) [bundleFile:?]
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426) [?:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
	at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]


 Comments   
Comment by Sangwook Ha [ 27/Apr/22 ]

This is still found with v5.0.3:

https://s3-logs.opendaylight.org/logs/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-bulkomatic-only-sulfur/222/odl_1/odl1_karaf.log.gz

2022-04-27T05:54:07,395 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Encoder                          | 206 - org.opendaylight.controller.repackaged-akka - 5.0.2 | Failed to serialize oversized message [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
akka.remote.OversizedPayloadException: Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.170.139:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].
2022-04-27T05:54:07,411 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | RemoteActorRef                   | 206 - org.opendaylight.controller.repackaged-akka - 5.0.2 | Message [org.opendaylight.controller.cluster.raft.messages.AppendEntries] wrapped in [akka.actor.ActorSelectionMessage] from Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-inventory-config#878229052] to Actor[akka://opendaylight-cluster-data@10.30.170.139:2550/] was dropped. Discarding oversized payload sent to Some(Actor[akka://opendaylight-cluster-data@10.30.170.139:2550/]): max allowed size 524288 bytes. Message type [ActorSelectionMessage(org.opendaylight.controller.cluster.raft.messages.AppendEntries)].. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
Comment by Robert Varga [ 11/May/22 ]

So this is just AppendEntries, other messages seem to be okay.

Comment by Robert Varga [ 11/May/22 ]

The circuit breaker seems to be an environment problem, as tracked in CONTROLLER-2043

Comment by Robert Varga [ 12/May/22 ]

This seems to be problem with accounting for the serialized size. The current tuning is that we have 512KiB for buffers and select up to 480KiB in payloads – i.e. leave out 32KiB for wrapping and other serialization overheads. Now that might not be enough as the format is generally this:

            out.writeShort(appendEntries.leaderRaftVersion);
            out.writeLong(appendEntries.getTerm());
            out.writeObject(appendEntries.leaderId);
            out.writeLong(appendEntries.prevLogTerm);
            out.writeLong(appendEntries.prevLogIndex);
            out.writeLong(appendEntries.leaderCommit);
            out.writeLong(appendEntries.replicatedToAllIndex);
            out.writeShort(appendEntries.payloadVersion);

            out.writeInt(appendEntries.entries.size());
            for (ReplicatedLogEntry e: appendEntries.entries) {
                out.writeLong(e.getIndex());
                out.writeLong(e.getTerm());
                out.writeObject(e.getData());
            }

            out.writeObject(appendEntries.leaderAddress);

Now the number of entries we select is dominated by

        final int maxDataSize = context.getConfigParams().getSnapshotChunkSize();

and AbstractReplicatedLogImpl uses ReplicatedLogEntry.size() to count against that limit. That does not include the 16 bytes spent on each entry for term/index, nor does it include the encapsulation overhead of writeObject() – hence we can bite off a larger number of entries than we can actually serialize into the 512KiB buffer.

The second part is that Payload.size() does not say anything about its semantics. Looking at implementations, there is actually a major disconnect between them:

  • ServerConfigurationPayload performs mock serialization and reports accurate serialized size
  • NoopPayload says '0' and uses a singleton Proxy, so the size does not reflect actual serialized size
  • AbstractIdentifiablePayload reports the size of serialized byte[], without related proxy overhead (just as NoopPayload)
  • CommitTransactionPayload does the same thing – ignoring size of the proxy overhead and also discounting the 4 bytes used to hold the length
Comment by Robert Varga [ 13/May/22 ]

So investigating this a bit more, the size() here is meant to be an estimate of heap residency, not the persistent size – which does not have any expression at all.
I think this needs a thorough refactor to reign in the interfaces here, so that the two concerns can readily distinguished.

Generated at Wed Feb 07 19:57:03 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.