---
title: Message Segmentation and Reconstruction
name: Message Segmentation and Reconstruction
tags: [segmentation]
version: 0.1
status: draft
---
## Abstract
This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a transport/delivery service with a message-size limitation, when the original payload exceeds said limitation.
Applications partition the payload into multiple transport messages and reconstruct the original on receipt,
even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
The protocol optionally uses **Reed–Solomon** erasure coding for fault tolerance.
All messages are wrapped in a `SegmentMessageProto`, including those that fit in a single segment.
## Motivation
Many message transport and delivery protocols impose a maximum message size that restricts the size of application payloads.
For example, Waku Relay typically propagates messages up to **150 KB** as per [64/WAKU2-NETWORK - Message](https://rfc.vac.dev/waku/standards/core/64/network#message-size).
To support larger application payloads, a segmentation layer is required.
This specification enables larger messages by partitioning them into multiple envelopes and reconstructing them at the receiver.
Erasure-coded parity segments provide resilience against partial loss, and per-segment indices allow reassembly regardless of arrival order.
## Terminology
- **original payload**: the full application payload before segmentation.
- **data segment**: one of the partitioned chunks of the original message payload.
- **parity segment**: an erasure-coded segment derived from the set of data segments.
- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
- **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).

The key words **"MUST"**, **"MUST NOT"**, **"REQUIRED"**, **"SHALL"**, **"SHALL NOT"**, **"SHOULD"**, **"SHOULD NOT"**, **"RECOMMENDED"**, **"NOT RECOMMENDED"**, **"MAY"**, and **"OPTIONAL"** in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
## Wire Format
Each segmented message is encoded as a `SegmentMessageProto` protobuf message:
```protobuf
syntax = "proto3";

message SegmentMessageProto {
  // Keccak256(original payload), 32 bytes
  bytes entire_message_hash = 1;

  // Data segment indexing
  uint32 index = 2;          // zero-based sequence number for data segments
  uint32 segments_count = 3; // number of data segments (>= 1)

  // Segment payload (data or parity shard)
  bytes payload = 4;

  // Parity segment indexing
  uint32 parity_segment_index = 5;  // zero-based sequence number for parity segments
  uint32 parity_segments_count = 6; // number of parity segments

  // Segment type
  bool is_parity = 7; // true for parity segments, false (default) for data segments
}
```
**Field descriptions:**
- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
- `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., `segments_count - 1`). Set only on data segments.
- `segments_count`: Total number of data segments the original message was split into. Set on every segment (data and parity).
- `payload`: The actual chunk of data or parity information for this segment.
- `parity_segment_index`: Zero-based sequence number for parity segments. Set only on parity segments.
- `parity_segments_count`: Total number of parity segments generated. Set on every segment (data and parity) when Reed–Solomon parity is used; `0` (default) otherwise.
- `is_parity`: Explicit segment type marker. `false` (default) for data segments; `true` for parity segments.

A message is either a **data segment** (when `is_parity == false`) or a **parity segment** (when `is_parity == true`).
### Validation
Receivers **MUST** enforce:
- `entire_message_hash.length == 32`
- `segments_count >= 1`
- **Data segments** (`is_parity == false`):
  - `index < segments_count`
- **Parity segments** (`is_parity == true`):
  - `parity_segments_count > 0` AND `parity_segment_index < parity_segments_count`

No other combinations are permitted.

A `SegmentMessageProto` with `segments_count == 1` and `index == 0` is a valid single-segment data message: the `payload` field carries the entire original payload (see [Sending](#sending)).
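
The validation rules above can be sketched as a single predicate. This is a minimal illustration, not the reference implementation: the `Segment` dataclass and the `is_valid` function are hypothetical names that mirror the protobuf fields with their proto3 defaults.

```python
from dataclasses import dataclass

HASH_LEN = 32  # required length of entire_message_hash in bytes


@dataclass(frozen=True)
class Segment:
    """Hypothetical plain-Python mirror of SegmentMessageProto (proto3 defaults)."""
    entire_message_hash: bytes
    index: int = 0
    segments_count: int = 0
    payload: bytes = b""
    parity_segment_index: int = 0
    parity_segments_count: int = 0
    is_parity: bool = False


def is_valid(seg: Segment) -> bool:
    """Return True if the segment satisfies the validation rules."""
    if len(seg.entire_message_hash) != HASH_LEN:
        return False
    if seg.segments_count < 1:
        return False
    if seg.is_parity:
        # Parity segments: the parity count must be set and the index in range.
        return (seg.parity_segments_count > 0
                and seg.parity_segment_index < seg.parity_segments_count)
    # Data segments: the index must be in range.
    return seg.index < seg.segments_count
```

A receiver would typically call such a predicate before caching a segment and drop anything that fails it.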
## Segmentation
### Sending
To transmit a payload, the sender:

- **MUST** compute a 32-byte `entire_message_hash = Keccak256(original_payload)`.
- **MUST** split the payload into one or more **data segments**,
  each of size up to `segmentSize` bytes.
  A payload of size ≤ `segmentSize` produces a single data segment (`segments_count == 1`).
- **MAY** use Reed–Solomon erasure coding at the predefined parity rate.
- **MUST** encode every segment as a `SegmentMessageProto` with:
  - The `entire_message_hash`
  - `segments_count` (total number of data segments, always set)
  - When Reed–Solomon parity is used, `parity_segments_count` (total number of parity segments, set on every segment)
  - For data segments: `is_parity = false`, `index`
  - For parity segments: `is_parity = true`, `parity_segment_index`
  - The raw payload data
- Send each segment as an individual transport message according to the underlying transport protocol,
  preserving application-level metadata (e.g., content topic).

This yields a deterministic wire format: every transmitted payload is a `SegmentMessageProto`.
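
The sending steps, without parity, might look like the sketch below. Several things here are assumptions made only so the example runs on the Python standard library: segments are plain dictionaries rather than serialized protobuf messages, `segment_payload` and `placeholder_hash` are hypothetical names, an empty payload is mapped to one empty data segment, and the hash is SHA3-256, which is not Keccak256. A real implementation would pass a Keccak256 function as `hash_fn`.

```python
import hashlib
from typing import Callable, Dict, List

HashFn = Callable[[bytes], bytes]


def placeholder_hash(data: bytes) -> bytes:
    """Stand-in 32-byte hash so the sketch runs without extra packages.

    NOTE: SHA3-256 is NOT Keccak256 (the padding differs); plug in a real
    Keccak256 function for anything beyond illustration.
    """
    return hashlib.sha3_256(data).digest()


def segment_payload(payload: bytes, segment_size: int,
                    hash_fn: HashFn = placeholder_hash) -> List[Dict]:
    """Split `payload` into data segments of at most `segment_size` bytes."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    message_hash = hash_fn(payload)
    # Assumption: an empty payload still yields one (empty) data segment,
    # because segments_count must be >= 1.
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)] or [b""]
    return [
        {
            "entire_message_hash": message_hash,
            "index": i,
            "segments_count": len(chunks),
            "payload": chunk,
            "is_parity": False,
        }
        for i, chunk in enumerate(chunks)
    ]
```

For example, `segment_payload(b"a" * 250, segment_size=100)` yields three data segments whose payloads are 100, 100, and 50 bytes long.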
### Receiving
Upon receiving a segment message, the receiver:

- **MUST** validate each segment according to [Wire Format → Validation](#validation).
- **MUST** cache received segments.
- **MUST** attempt reconstruction once at least `segments_count` distinct segments (data and parity combined) have been received:
  - If all data segments are present, concatenate their `payload` fields in `index` order.
  - Otherwise, recover the payload via Reed–Solomon decoding over the available data and parity segments.
- **MUST** verify that `Keccak256(reconstructed_payload)` matches `entire_message_hash`.
  On mismatch,
  the message **MUST** be discarded and logged as invalid.
- Once verified,
  the reconstructed payload **SHALL** be delivered to the application.
- Incomplete reconstructions **SHOULD** be garbage-collected after a timeout.
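
The data-only path of the receiving steps can be sketched as follows. This is an illustration under stated assumptions: the `Reassembler` class is hypothetical, segments are assumed to have passed validation already, Reed–Solomon recovery and timeout-based cleanup are left out, and the hash is a SHA3-256 stand-in that is not Keccak256.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple


def placeholder_hash(data: bytes) -> bytes:
    # Stand-in only: SHA3-256 is NOT Keccak256; plug in a real Keccak256 function.
    return hashlib.sha3_256(data).digest()


class Reassembler:
    """Caches data segments per message hash and reassembles when complete."""

    def __init__(self, hash_fn: Callable[[bytes], bytes] = placeholder_hash):
        self._hash_fn = hash_fn
        # message hash -> (segments_count, {index: payload})
        self._cache: Dict[bytes, Tuple[int, Dict[int, bytes]]] = {}

    def add(self, message_hash: bytes, index: int, segments_count: int,
            payload: bytes) -> Optional[bytes]:
        """Store one validated data segment; return the payload once verified."""
        count, chunks = self._cache.setdefault(message_hash, (segments_count, {}))
        if segments_count != count:
            return None  # inconsistent with earlier segments (assumed policy: ignore)
        chunks.setdefault(index, payload)  # duplicates are ignored
        if len(chunks) < count:
            return None  # still waiting for more segments
        del self._cache[message_hash]
        # Arrival order does not matter: concatenate in index order.
        reconstructed = b"".join(chunks[i] for i in range(count))
        if self._hash_fn(reconstructed) != message_hash:
            return None  # hash mismatch: discard as invalid
        return reconstructed
```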
---
## Implementation Suggestions
### Reed–Solomon
Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize` .
The last data chunk **MUST** be padded to `segmentSize` for encoding.
The reference implementation uses **nim-leopard** (Leopard-RS) with a maximum of **256 total shards**.
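
Preparing fixed-size shards as encoder input could look like the sketch below. It is an assumption-laden illustration: `to_shards` is a hypothetical helper, and zero bytes are used as the padding value because the text above does not name one. It does not call Leopard-RS or any other erasure-coding library.

```python
from typing import List


def to_shards(payload: bytes, segment_size: int) -> List[bytes]:
    """Split `payload` into shards of exactly `segment_size` bytes.

    The last chunk is right-padded so that every shard has the same length.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)] or [b""]
    # Assumption: pad with zero bytes; the padding value is not specified above.
    return [chunk.ljust(segment_size, b"\x00") for chunk in chunks]
```

For instance, `to_shards(b"abcdefg", 3)` returns `[b"abc", b"def", b"g\x00\x00"]`.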
### Storage / Persistence
Segments **MAY** be persisted (e.g., SQLite) and indexed by `entire_message_hash` and by sender. The sender **MAY** be authenticated; sender authentication is out of scope of this specification.
Implementations **SHOULD** support:
- Duplicate detection and idempotent saves
- Completion flags to prevent duplicate processing
- Timeout-based cleanup of incomplete reconstructions
- Per-sender quotas for stored bytes and concurrent reconstructions
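
One way to get duplicate detection, idempotent saves, and timeout-based cleanup with SQLite is sketched below. The table layout, column names, and function names are all hypothetical. Completion flags and per-sender quotas are not shown.

```python
import sqlite3

# Hypothetical schema: the composite primary key makes duplicates detectable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS segments (
    sender        TEXT    NOT NULL,
    message_hash  BLOB    NOT NULL,
    is_parity     INTEGER NOT NULL,
    seg_index     INTEGER NOT NULL,
    payload       BLOB    NOT NULL,
    received_at   REAL    NOT NULL,
    PRIMARY KEY (sender, message_hash, is_parity, seg_index)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_segment(conn, sender, message_hash, is_parity, seg_index, payload, now) -> bool:
    """Idempotent save: returns True only if the row was newly inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO segments VALUES (?, ?, ?, ?, ?, ?)",
        (sender, message_hash, int(is_parity), seg_index, payload, now),
    )
    conn.commit()
    return cur.rowcount == 1


def purge_older_than(conn, cutoff) -> int:
    """Timeout-based cleanup: delete segments received before `cutoff`."""
    cur = conn.execute("DELETE FROM segments WHERE received_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Keying rows by sender as well as by hash keeps one sender's segments from colliding with another's, which is also what per-sender quotas would need.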
### Configuration
- `segmentSize`: maximum size in bytes of each data segment's payload chunk (before protobuf serialization).
  **REQUIRED** parameter, configurable by the client.
- `parityRate`: fraction of parity shards relative to data shards.
  Configurable by the client. Defaults to **0.125** (12.5%).
- `maxTotalSegments`: maximum number of total shards (data + parity) per message.
  Implementation-specific parameter, fixed. The reference implementation uses **256**.

**Reconstruction capability:**
With the predefined parity rate, reconstruction is possible if **all data segments** are received or if **any combination of data + parity** segments totals at least `segments_count` (i.e., up to `parity_segments_count` of the transmitted segments can be lost).

**API simplicity:**
Libraries **SHOULD** require only `segmentSize` from the application for normal operation.
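
The relationship between `parityRate`, `maxTotalSegments`, and the reconstruction condition can be illustrated with a little arithmetic. The function names are hypothetical, and rounding the parity count up is an assumption, since the text above does not fix a rounding rule.

```python
import math


def parity_count(data_segments: int, parity_rate: float = 0.125) -> int:
    """Number of parity segments for a given number of data segments.

    Assumption: round up; the rounding rule is not specified above.
    """
    return math.ceil(data_segments * parity_rate)


def fits_limit(data_segments: int, parity_rate: float = 0.125,
               max_total_segments: int = 256) -> bool:
    """Check that data + parity shards stay within the total shard limit."""
    total = data_segments + parity_count(data_segments, parity_rate)
    return total <= max_total_segments


def can_reconstruct(received_distinct: int, segments_count: int) -> bool:
    """Reconstruction is possible once segments_count distinct segments arrive."""
    return received_distinct >= segments_count
```

With the default rate, 16 data segments get 2 parity segments, so any 16 of the 18 transmitted segments suffice. Under the rounding assumption, 227 data segments is the largest count that stays within 256 total shards.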
### Support
- **Language / Package:** Nim, with the **Nimble** package manager
- **Intended for:** application-layer use over any transport with message-size constraints
---
## Security Considerations
### Privacy
`entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
To prevent this correlation, applications **SHOULD** encrypt each segment after segmentation (see [Encryption](#encryption)).
Traffic analysis may still identify segmented flows.
### Encryption
This specification does not provide confidentiality.
Applications **SHOULD** encrypt each segment after segmentation
(i.e., encrypt the serialized `SegmentMessageProto` prior to transmission),
so that `entire_message_hash` and other identifying fields are not visible to observers.
### Integrity
Implementations **MUST** verify the Keccak256 hash post-reconstruction and discard on mismatch.
### Denial of Service
To mitigate resource exhaustion:
- Limit concurrent reconstructions and per-sender storage
- Enforce timeouts and size caps
- Validate segment counts (data + parity ≤ `maxTotalSegments`, which is 256 in the reference implementation)
- Consider rate-limiting at the transport layer (for example, via [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) on Waku)
### Compatibility
Nodes that do **not** implement this specification cannot reconstruct large messages.

---
## Deployment Considerations
**Overhead:**

- Parity segments (if enabled) add bandwidth overhead approximately equal to the configured parity rate
- Additional per-segment overhead ≤ **100 bytes** (protobuf + metadata)

**Network impact:**

- Larger messages increase transport traffic and storage;
  operators **SHOULD** consider policy limits
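
A rough upper-bound estimate of what one payload costs on the wire follows from the overhead figures above. The sketch is only a back-of-the-envelope aid: `transmission_estimate` is a hypothetical helper, the parity count is rounded up by assumption, parity shards are assumed to be full `segmentSize` shards, and the 100-byte figure is treated as a per-segment ceiling.

```python
import math


def transmission_estimate(payload_bytes: int, segment_size: int,
                          parity_rate: float = 0.125,
                          per_segment_overhead: int = 100) -> dict:
    """Upper-bound estimate of messages and bytes sent for one payload."""
    data = max(1, math.ceil(payload_bytes / segment_size))
    parity = math.ceil(data * parity_rate)  # assumption: round up
    total = data + parity
    # Data chunks sum to payload_bytes; each parity shard is segment_size bytes.
    wire_bytes = payload_bytes + parity * segment_size + total * per_segment_overhead
    return {"data": data, "parity": parity, "total": total, "wire_bytes": wire_bytes}


# 1,000,000-byte payload with 100,000-byte segments:
# 10 data + 2 parity segments, at most 1,201,200 bytes on the wire.
estimate = transmission_estimate(1_000_000, 100_000)
```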
---
## References
1. [10/WAKU2 – Waku](https://rfc.vac.dev/waku/standards/core/10/waku2)
2. [11/WAKU2-RELAY – Relay](https://rfc.vac.dev/waku/standards/core/11/relay)
3. [14/WAKU2-MESSAGE – Message](https://rfc.vac.dev/waku/standards/core/14/message)
4. [64/WAKU2-NETWORK](https://rfc.vac.dev/waku/standards/core/64/network#message-size)
5. [nim-leopard](https://github.com/status-im/nim-leopard) – Nim bindings for Leopard-RS (Reed–Solomon)
6. [Leopard-RS](https://github.com/catid/leopard) – Fast Reed–Solomon erasure coding library
7. [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt) – Key words for use in RFCs to Indicate Requirement Levels