---
title: Message Segmentation and Reconstruction
name: Message Segmentation and Reconstruction
tags: [waku-application, segmentation]
version: 0.1
status: draft
---

## Abstract

This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport or delivery service with a size limit, for use when the original payload exceeds that limit.
Applications partition the payload into multiple wire-message envelopes and reconstruct the original on receipt, even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
The protocol uses **Reed–Solomon** erasure coding for fault tolerance.

All messages are wrapped in a `SegmentMessageProto`, including those that fit in a single segment.
Implementations **MAY** opt into a [backwards-compatible mode](#backwards-compatibility) that exempts small payloads from wrapping.

## Motivation

Waku Relay deployments typically propagate envelopes up to **150 KB** as per [64/WAKU2-NETWORK - Message](https://rfc.vac.dev/waku/standards/core/64/network#message-size).
To support larger application payloads, a segmentation layer is required.
This specification enables larger messages by partitioning them into multiple envelopes and reconstructing them at the receiver.
Erasure-coded parity segments provide resilience against partial loss, and segment indices allow reassembly regardless of arrival order.

## Terminology

- **original payload**: the full application payload before segmentation.
- **data segment**: one of the partitioned chunks of the original payload.
- **parity segment**: an erasure-coded segment derived from the set of data segments.
- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
- **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).


The key words **"MUST"**, **"MUST NOT"**, **"REQUIRED"**, **"SHALL"**, **"SHALL NOT"**, **"SHOULD"**, **"SHOULD NOT"**, **"RECOMMENDED"**, **"NOT RECOMMENDED"**, **"MAY"**, and **"OPTIONAL"** in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Wire Format

Each segmented message is encoded as a `SegmentMessageProto` protobuf message:

```protobuf
syntax = "proto3";

message SegmentMessageProto {
  // Keccak256(original payload), 32 bytes
  bytes entire_message_hash = 1;

  // Data segment indexing
  uint32 index = 2;           // zero-based sequence number; valid only if segments_count > 0
  uint32 segments_count = 3;  // number of data segments

  // Segment payload (data or parity shard)
  bytes payload = 4;

  // Parity segment indexing (used if segments_count == 0)
  uint32 parity_segment_index = 5;   // zero-based sequence number for parity segments
  uint32 parity_segments_count = 6;  // number of parity segments
}
```

**Field descriptions:**

- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
- `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
- `segments_count`: Total number of data segments the original message was split into.
- `payload`: The actual chunk of data or parity information for this segment.
- `parity_segment_index`: Zero-based sequence number for parity segments.
- `parity_segments_count`: Total number of parity segments generated.

A message is either a **data segment** (when `segments_count > 0`) or a **parity segment** (when `segments_count == 0`).

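The data/parity distinction can be sketched in Python as follows. This is only an illustration: the `SegmentMessage` dataclass and the `segment_kind` helper are hypothetical names that mirror the protobuf fields. They are not part of this specification or of the reference implementation, and a real implementation would presumably use generated protobuf bindings instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMessage:
    """Hypothetical in-memory mirror of SegmentMessageProto (illustration only).

    Numeric fields default to 0, matching proto3 defaults for unset uint32 fields.
    """
    entire_message_hash: bytes
    payload: bytes
    index: int = 0
    segments_count: int = 0
    parity_segment_index: int = 0
    parity_segments_count: int = 0


def segment_kind(msg: SegmentMessage) -> str:
    """Classify a message: segments_count > 0 means data, otherwise parity."""
    return "data" if msg.segments_count > 0 else "parity"


# A data segment: the second of four chunks.
data = SegmentMessage(b"\x00" * 32, b"chunk", index=1, segments_count=4)

# A parity segment: segments_count stays at its default of 0.
parity = SegmentMessage(
    b"\x00" * 32, b"shard", parity_segment_index=0, parity_segments_count=1
)

print(segment_kind(data), segment_kind(parity))
```

Because `segments_count` alone decides the kind, a receiver can classify a message before looking at any index field.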
### Validation

Receivers **MUST** enforce:

- `entire_message_hash.length == 32`
- **Data segments:** `segments_count >= 1` **AND** `index < segments_count`
- **Parity segments:** `segments_count == 0` **AND** `parity_segments_count > 0` **AND** `parity_segment_index < parity_segments_count`

No other combinations are permitted.

A `SegmentMessageProto` with `segments_count == 1` and `index == 0` is a valid single-segment data message: the `payload` field carries the entire original payload (see [Sending](#sending)).

## Segmentation

### Sending

To transmit a payload, the sender:

- **MUST** compute a 32-byte `entire_message_hash = Keccak256(original_payload)`.
- **MUST** split the payload into one or more **data segments**, each of size up to `segmentSize` bytes. A payload of size ≤ `segmentSize` produces a single data segment (`segments_count == 1`).
- **MAY** use Reed–Solomon erasure coding at the predefined parity rate to derive parity segments from the data segments.
- Encodes each segment as a `SegmentMessageProto` with:
  - The `entire_message_hash`
  - Either data-segment indices (`segments_count`, `index`) or parity-segment indices (`parity_segments_count`, `parity_segment_index`)
  - The raw payload data
- Sends all segments as individual Waku envelopes, preserving application-level metadata (e.g., content topic).

This yields a deterministic wire format: every transmitted payload is a `SegmentMessageProto`.
Implementations introducing segmentation into a deployment with peers that predate this specification **MAY** instead operate in [backwards-compatible mode](#backwards-compatibility).

### Receiving

Upon receiving a segment message, the receiver:

- **MUST** validate each segment according to [Wire Format → Validation](#validation).
- **MUST** cache received segments.
- **MUST** attempt reconstruction when the number of available (data + parity) segments equals or exceeds the data segment count, by:
  - Concatenating data segments if all are present, or
  - Applying Reed–Solomon decoding if parity segments are available.
- **MUST** verify that `Keccak256(reconstructed_payload)` matches `entire_message_hash`. On mismatch, the message **MUST** be discarded and logged as invalid.
- **SHALL** deliver the reconstructed payload to the application once it is verified.
- **SHOULD** garbage-collect incomplete reconstructions after a timeout.

---

## Backwards Compatibility

Implementations **MAY** support a **backwards-compatible mode**, intended for deployments where this specification is being introduced incrementally and some peers do not yet implement segmentation.
The mode is controlled by the `backwardsCompatible` configuration option, which defaults to `false`.

When `backwardsCompatible = true`, the [Sending](#sending) procedure is amended as follows:

- Payloads with size **≤ `segmentSize`** **SHALL** be transmitted unmodified, i.e., not wrapped in `SegmentMessageProto`.
- Payloads exceeding `segmentSize` are segmented and wrapped exactly as described in [Sending](#sending).

A receiver that interoperates with senders operating in this mode **MUST** accept both wrapped and unwrapped payloads on the same channel.
A payload that does not parse as a valid `SegmentMessageProto` is treated as an unsegmented original payload and delivered directly to the application.

**Trade-off.** This mode preserves on-the-wire compatibility with peers that cannot decode `SegmentMessageProto`, at the cost of the deterministic wire format described in [Sending](#sending).
Once all peers in a deployment implement this specification, `backwardsCompatible` **SHOULD** be set to `false`.

---

## Implementation Suggestions

### Reed–Solomon

Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
The last data chunk **MUST** be padded to `segmentSize` for encoding.
The reference implementation uses **nim-leopard** (Leopard-RS) with a maximum of **256 total shards**.

### Storage / Persistence

Segments **MAY** be persisted (e.g., in SQLite) and indexed by `entire_message_hash` and by sender.
The sender **MAY** be authenticated; how this is done is out of scope of this specification.

Implementations **SHOULD** support:

- Duplicate detection and idempotent saves
- Completion flags to prevent duplicate processing
- Timeout-based cleanup of incomplete reconstructions
- Per-sender quotas for stored bytes and concurrent reconstructions

### Configuration

- `segmentSize`: maximum size in bytes of each data segment's payload chunk (before protobuf serialization). **REQUIRED** parameter, configurable by the client.
- `parityRate`: fraction of parity shards relative to data shards. Configurable by the client. Defaults to **0.125** (12.5%).
- `maxTotalSegments`: maximum number of total shards (data + parity) per message. Fixed, implementation-specific parameter. The reference implementation uses **256**.

**Reconstruction capability:** With the predefined parity rate, reconstruction is possible if **all data segments** are received or if **any combination of data + parity** segments totals at least the number of data segments (`segments_count`), i.e., the loss of up to as many segments as there are parity segments is tolerated.

**API simplicity:** Libraries **SHOULD** require only `segmentSize` from the application for normal operation.

### Support

- **Language / Package:** Nim; **Nimble** package manager
- **Intended for:** all Waku nodes at the application layer

---

## Security Considerations

### Privacy

`entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
Traffic analysis may still identify segmented flows.

### Integrity

Implementations **MUST** verify the Keccak256 hash post-reconstruction and discard on mismatch.

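The integrity check can be sketched for the simplest reconstruction path, where all data segments are present and are concatenated in index order. The function name `reconstruct_from_data` and its parameters are hypothetical, and Reed–Solomon recovery is deliberately left out. The hash is passed in as a callable because Python's standard library does not provide Keccak256: the demo uses `hashlib.sha3_256` purely as a stand-in, and it is **not** interchangeable with Keccak256 (the padding differs), so a real implementation would need an actual Keccak256 function.

```python
import hashlib
from typing import Callable, Dict, Optional


def reconstruct_from_data(
    chunks: Dict[int, bytes],
    segments_count: int,
    entire_message_hash: bytes,
    hash_fn: Callable[[bytes], bytes],
) -> Optional[bytes]:
    """Concatenate data chunks in index order and verify the result.

    Returns None while data segments are still missing.
    Raises ValueError on hash mismatch, so the caller can discard and log.
    """
    if any(i not in chunks for i in range(segments_count)):
        return None  # incomplete: keep waiting (or try parity recovery)
    payload = b"".join(chunks[i] for i in range(segments_count))
    if hash_fn(payload) != entire_message_hash:
        raise ValueError("reconstructed payload does not match entire_message_hash")
    return payload


def demo_hash(data: bytes) -> bytes:
    """Stand-in for the demo ONLY: SHA3-256 is not Keccak256."""
    return hashlib.sha3_256(data).digest()


original = b"hello segmented world"
digest = demo_hash(original)

# Segments may arrive out of order; the index restores the original order.
received = {2: original[16:], 0: original[:8], 1: original[8:16]}
print(reconstruct_from_data(received, 3, digest, demo_hash) == original)

# A corrupted chunk is detected by the hash check.
tampered = dict(received)
tampered[1] = b"XXXXXXXX"
try:
    reconstruct_from_data(tampered, 3, digest, demo_hash)
except ValueError as err:
    print("discarded:", err)
```

Returning `None` for an incomplete set and raising on a mismatch keeps the two outcomes distinct: the first is expected while segments are still arriving, the second is the case that must be discarded and logged.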
### Denial of Service

To mitigate resource exhaustion:

- Limit concurrent reconstructions and per-sender storage
- Enforce timeouts and size caps
- Validate segment counts (≤ 256)
- Consider rate-limiting using [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay)

### Compatibility

Nodes that do **not** implement this specification cannot reconstruct large messages.

---

## Deployment Considerations

**Overhead:**

- Bandwidth overhead from parity segments (if enabled) ≈ the predefined parity rate
- Additional per-segment overhead ≤ **100 bytes** (protobuf + metadata)

**Network impact:**

- Larger messages increase gossip traffic and storage; operators **SHOULD** consider policy limits

---

## References

1. [10/WAKU2 – Waku](https://rfc.vac.dev/waku/standards/core/10/waku2)
2. [11/WAKU2-RELAY – Relay](https://rfc.vac.dev/waku/standards/core/11/relay)
3. [14/WAKU2-MESSAGE – Message](https://rfc.vac.dev/waku/standards/core/14/message)
4. [64/WAKU2-NETWORK](https://rfc.vac.dev/waku/standards/core/64/network#message-size)
5. [nim-leopard](https://github.com/status-im/nim-leopard) – Nim bindings for Leopard-RS (Reed–Solomon)
6. [Leopard-RS](https://github.com/catid/leopard) – Fast Reed–Solomon erasure coding library
7. [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt) – Key words for use in RFCs to Indicate Requirement Levels