[CONTROLLER-1572] ReadDataReply Message was too large can result in "Received UnreachableMember" in cluster
Created: 27/Dec/16 Updated: 25/Jul/23 Resolved: 15/Jul/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | HeYunBo | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: |
|
| External issue ID: | 7449 |
| Description |
|
Serializing a very large ReadDataReply message is very time-consuming, and this can result in "Received UnreachableMember" in the cluster. The default akka cluster failure detector triggers if no heartbeats arrive within 5.5s, so "Received UnreachableMember" will certainly occur if the serialization time exceeds 5.5s:

2016-12-27 20:06:30,999 | WARN | t-dispatcher-199 | ClusterCoreDaemon | 179 - com.typesafe.akka.slf4j - 2.4.12 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.46.60.132:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.46.60.139:2550, status = Up)]. Node roles [member-1]
2016-12-27 20:06:31,002 | INFO | t-dispatcher-296 | ShardManager | 214 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2} , address: akka.tcp://opendaylight-cluster-data@10.46.60.139:2550

------------------------------------------------------------------------------------

In addition, the akka EndpointWriter throws OversizedPayloadException if the ReadDataReply size exceeds maximumPayloadBytes:

2016-12-27 20:06:41,571 | ERROR | lt-dispatcher-18 | EndpointWriter | 179 - com.typesafe.akka.slf4j - 2.4.12 | Transient association error (association remains live) |
| Comments |
| Comment by Tom Pantelis [ 27/Dec/16 ] |
|
This will be alleviated when we switch to use akka Artery (https://git.opendaylight.org/gerrit/#/c/49466), which fragments large messages into smaller chunks (http://blog.akka.io/artery/2016/12/05/aeron-in-artery). Artery also has a dedicated sub channel for large messages that we can utilize for ReadDataReply messages. |
| Comment by HeYunBo [ 28/Dec/16 ] |
|
I have switched to akka Artery, but the problem persists:

2016-12-28 19:33:05,820 | ERROR | t-dispatcher-152 | Encoder | 179 - com.typesafe.akka.slf4j - 2.4.12 | Failed to serialize oversized message [org.opendaylight.controller.cluster.datastore.messages.ReadDataReply]. |
| Comment by Tom Pantelis [ 28/Dec/16 ] |
|
The aeron layer is capable of fragmenting large messages, but it seems akka still imposes an upper limit. For Artery this appears to be maximum-frame-size. I haven't seen any way to get around this upper limit other than setting it really high. Perhaps you could engage the akka folks on this subject (mailing list or open an issue)? |
| Comment by HeYunBo [ 04/Jan/17 ] |
|
I have consulted the akka developers about this question. They replied as follows:
-----------------------------------------------------------------------------
We recommend against sending large messages. Try to split them into smaller messages or send them via a side channel that is not using Akka remoting. Note that Artery has some better support for large messages, but the recommendation is still valid.
-----------------------------------------------------------------------------
Has ODL discussed a plan to split large messages into smaller messages? |
| Comment by Tom Pantelis [ 04/Jan/17 ] |
|
I don't agree with their view that messages should be split up/chunked at the app layer - this should be handled at the transport layer. In any event, we can use the large message channel for FE <-> BE messages as discussed in

Ideally we would chunk large ReadDataReply messages or any other message containing NormalizedNodes that could be large - similar to the raft install snapshot chunking but generalized. |
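For illustration only, the generalized chunking idea in the comment above could be sketched roughly as follows. This is a Python sketch under stated assumptions, not the ODL implementation: the names `Slice`, `slice_message`, `Assembler` and the `max_slice_size` parameter are hypothetical, and the payload is assumed to be an already-serialized byte string.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Slice:
    """One fragment of a larger serialized message (hypothetical wire format)."""
    message_id: int  # identifies which message the slice belongs to
    index: int       # 0-based position of this slice
    total: int       # total number of slices for the message
    data: bytes


def slice_message(message_id: int, payload: bytes, max_slice_size: int) -> List[Slice]:
    """Split payload into slices of at most max_slice_size bytes each."""
    if max_slice_size <= 0:
        raise ValueError("max_slice_size must be positive")
    # Ceiling division; an empty payload still yields one (empty) slice.
    total = max(1, -(-len(payload) // max_slice_size))
    return [
        Slice(message_id, i, total, payload[i * max_slice_size:(i + 1) * max_slice_size])
        for i in range(total)
    ]


class Assembler:
    """Collects slices per message id and returns the payload once complete."""

    def __init__(self) -> None:
        self._pending: Dict[int, Dict[int, bytes]] = {}

    def add(self, s: Slice) -> Optional[bytes]:
        if not 0 <= s.index < s.total:
            raise ValueError("slice index out of range")
        parts = self._pending.setdefault(s.message_id, {})
        parts[s.index] = s.data  # a duplicate slice overwrites the earlier copy
        if len(parts) < s.total:
            return None  # still waiting for more slices
        del self._pending[s.message_id]
        return b"".join(parts[i] for i in range(s.total))
```

Each slice stays below the transport's frame limit, and the receiver tolerates out-of-order arrival because slices are keyed by index. A real implementation would also need timeouts for incomplete messages and acknowledgement or retry handling, which this sketch omits.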
| Comment by HeYunBo [ 20/Jan/17 ] |
|
Following your advice in https://bugs.opendaylight.org/show_bug.cgi?id=2890, we are considering implementing a subchannel component that can be used for fragmentation and defragmentation. Please refer to the attachment. |
| Comment by HeYunBo [ 20/Jan/17 ] |
|
Attachment Subchannel-component.doc has been added with description: subchannel component for fragmentation and defragmentation |
| Comment by Tom Pantelis [ 22/Jun/17 ] |
|
Message slicing/re-assembly patch: https://git.opendaylight.org/gerrit/#/c/55767/
Read reply slicing patch: https://git.opendaylight.org/gerrit/#/q/topic:bug/7449 |