specs/standards/application/segmentation.md
2025-10-27 10:15:02 +02:00

---
title: Message Segmentation and Reconstruction
name: Message Segmentation and Reconstruction
tags: [waku-application, segmentation]
version: 0.1
status: draft
---
## Abstract
This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport or delivery service that limits message size, for use when the original payload exceeds that limit.
Applications partition the payload into multiple wire-message envelopes and reconstruct the original on receipt,
even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
The protocol uses **Reed-Solomon** erasure coding for fault tolerance.
Messages whose payload size is at most **`segmentSize`** are sent unmodified.
## Motivation
Waku Relay deployments typically propagate envelopes up to **150 KB** as per [64/WAKU2-NETWORK - Message](https://rfc.vac.dev/waku/standards/core/64/network#message-size).
To support larger application payloads,
a segmentation layer is required.
This specification enables larger messages by partitioning them into multiple envelopes and reconstructing them at the receiver.
Erasure-coded parity segments provide resilience against partial loss or reordering.
## Terminology
- **original payload**: the full application payload before segmentation.
- **data segment**: one of the partitioned chunks of the original message payload.
- **parity segment**: an erasure-coded segment derived from the set of data segments.
- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
- **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
- **sender public key**: the identifier of a message's origin, used to index persisted segments.

The key words **"MUST"**, **"MUST NOT"**, **"REQUIRED"**, **"SHALL"**, **"SHALL NOT"**, **"SHOULD"**, **"SHOULD NOT"**, **"RECOMMENDED"**, **"NOT RECOMMENDED"**, **"MAY"**, and **"OPTIONAL"** in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
## Wire Format
Each segment is encoded as a `SegmentMessageProto` protobuf message:
```protobuf
syntax = "proto3";
message SegmentMessageProto {
  // Keccak256(original payload), 32 bytes
  bytes entire_message_hash = 1;
  // Data segment indexing
  uint32 index = 2;          // zero-based sequence number; valid only if segments_count > 0
  uint32 segments_count = 3; // number of data segments (>= 2)
  // Segment payload (data or parity shard)
  bytes payload = 4;
  // Parity segment indexing (used if segments_count == 0)
  uint32 parity_segment_index = 5;  // zero-based sequence number for parity segments
  uint32 parity_segments_count = 6; // number of parity segments (> 0)
}
```
**Field descriptions:**
- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
- `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
- `segments_count`: Total number of data segments the original message was split into.
- `payload`: The actual chunk of data or parity information for this segment.
- `parity_segment_index`: Zero-based sequence number for parity segments.
- `parity_segments_count`: Total number of parity segments generated.

A message is either a **data segment** (when `segments_count > 0`) or a **parity segment** (when `segments_count == 0`).
### Validation
Receivers **MUST** enforce:
- `entire_message_hash.length == 32`
- **Data segments:**
  `segments_count >= 2` **AND** `index < segments_count`
- **Parity segments:**
  `segments_count == 0` **AND** `parity_segments_count > 0` **AND** `parity_segment_index < parity_segments_count`

No other combinations are permitted.
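
The validation rules above can be sketched as a single predicate. This is a minimal illustration, not the reference implementation: the `Segment` dataclass and the `is_valid_segment` name are hypothetical stand-ins for whatever type a protobuf library would generate from `SegmentMessageProto`.

```python
from dataclasses import dataclass

HASH_LENGTH = 32  # bytes, per the validation rules above


@dataclass(frozen=True)
class Segment:
    """Hypothetical in-memory mirror of SegmentMessageProto."""
    entire_message_hash: bytes
    index: int = 0
    segments_count: int = 0
    payload: bytes = b""
    parity_segment_index: int = 0
    parity_segments_count: int = 0


def is_valid_segment(seg: Segment) -> bool:
    """Return True only for the two permitted field combinations."""
    if len(seg.entire_message_hash) != HASH_LENGTH:
        return False
    if seg.segments_count >= 2:
        # Data segment: the index must address one of the data segments.
        return 0 <= seg.index < seg.segments_count
    if seg.segments_count == 0:
        # Parity segment: needs a positive count and an in-range index.
        return (seg.parity_segments_count > 0
                and 0 <= seg.parity_segment_index < seg.parity_segments_count)
    # segments_count == 1 (or any other value) matches neither rule.
    return False
```

Note that `segments_count == 1` is rejected: it satisfies neither the data-segment rule nor the parity-segment rule.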
## Segmentation
### Sending
When the original payload exceeds `segmentSize`, the sender:
- **MUST** compute a 32-byte `entire_message_hash = Keccak256(original_payload)`.
- **MUST** split the payload into two or more **data segments**,
  each of size up to `segmentSize` bytes.
- **MAY** generate **parity segments** using Reed-Solomon erasure coding at the predefined parity rate.
- Encode each segment as a `SegmentMessageProto` with:
  - The `entire_message_hash`
  - Either data-segment indices (`segments_count`, `index`) or parity-segment indices (`parity_segments_count`, `parity_segment_index`)
  - The raw payload data
- Send all segments as individual Waku envelopes,
  preserving application-level metadata (e.g., content topic).

Messages whose payload is smaller than or equal to `segmentSize` **SHALL** be transmitted unmodified.
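
The data-segment part of the sending steps can be sketched as below. Parity generation and protobuf serialization are left out. The function names and the dict layout are hypothetical, and the hash is a loudly labeled placeholder: Python's standard library has no Keccak256, and `hashlib.sha3_256` is a different function, so a real implementation would pass its own `hash_fn`.

```python
import hashlib
from typing import Callable, Dict, List

HashFn = Callable[[bytes], bytes]


def placeholder_hash(data: bytes) -> bytes:
    # ASSUMPTION: stand-in only. hashlib.sha3_256 is NIST SHA3-256, which is
    # NOT Keccak256; plug in a real Keccak256 function for interoperability.
    return hashlib.sha3_256(data).digest()


def split_into_data_segments(payload: bytes, segment_size: int,
                             hash_fn: HashFn = placeholder_hash) -> List[Dict]:
    """Return data-segment fields, or [] if the payload needs no segmentation."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if len(payload) <= segment_size:
        return []  # the caller sends the payload unmodified
    message_hash = hash_fn(payload)
    # Consecutive chunks of at most segment_size bytes; the last may be shorter.
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    return [
        {
            "entire_message_hash": message_hash,
            "index": i,
            "segments_count": len(chunks),
            "payload": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```

Because the function only segments payloads strictly larger than `segment_size`, it always yields at least two chunks, which matches the `segments_count >= 2` validation rule.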
### Receiving
Upon receiving a segment message, the receiver:
- **MUST** validate each segment according to [Wire Format → Validation](#validation).
- **MUST** cache received segments.
- **MUST** attempt reconstruction when the number of available (data + parity) segments equals or exceeds the data segment count, by:
  - Concatenating data segments if all are present, or
  - Applying Reed-Solomon decoding if parity segments are available.
- **MUST** verify that `Keccak256(reconstructed_payload)` matches `entire_message_hash`.
  On mismatch,
  the message **MUST** be discarded and logged as invalid.
- Once verified,
  the reconstructed payload **SHALL** be delivered to the application.
- Incomplete reconstructions **SHOULD** be garbage-collected after a timeout.
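
The concatenation path of the receiving steps can be sketched as follows. This is an illustration under stated assumptions: `try_reconstruct` and its arguments are hypothetical names, Reed-Solomon decoding is not shown, and the hash is a placeholder because `hashlib.sha3_256` is not Keccak256.

```python
import hashlib
from typing import Callable, Dict, Optional


def placeholder_hash(data: bytes) -> bytes:
    # ASSUMPTION: stand-in only. hashlib.sha3_256 is NOT Keccak256;
    # a real implementation must use a Keccak256 function here.
    return hashlib.sha3_256(data).digest()


def try_reconstruct(data_segments: Dict[int, bytes], segments_count: int,
                    expected_hash: bytes,
                    hash_fn: Callable[[bytes], bytes] = placeholder_hash
                    ) -> Optional[bytes]:
    """Rebuild a payload from cached data segments (index -> payload).

    Returns None while data segments are missing, the payload once all are
    present and the hash matches, and raises ValueError on a hash mismatch.
    """
    if any(i not in data_segments for i in range(segments_count)):
        # Keep waiting, or fall back to Reed-Solomon decoding (not shown).
        return None
    # Indices restore the original order even if segments arrived shuffled.
    payload = b"".join(data_segments[i] for i in range(segments_count))
    if hash_fn(payload) != expected_hash:
        raise ValueError("reconstructed payload does not match entire_message_hash")
    return payload
```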
---
## Implementation Suggestions
### Reed-Solomon
Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
The last data chunk **MUST** be padded to `segmentSize` for encoding.
The reference implementation uses **nim-leopard** (Leopard-RS) with a maximum of **256 total shards**.
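
The padding rule can be sketched as below. The zero fill byte and the `pad_for_encoding` name are assumptions: the text above fixes the shard length but does not state the fill value, nor how the unpadded length of the last chunk is recovered after decoding.

```python
from typing import List


def pad_for_encoding(payload: bytes, segment_size: int,
                     fill: bytes = b"\x00") -> List[bytes]:
    """Split payload into shards of exactly segment_size bytes.

    Only the last chunk can be short; it is right-padded with `fill`
    (ASSUMPTION: zero bytes) so every shard has the same length.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    return [chunk.ljust(segment_size, fill) for chunk in chunks]
```

For example, a 10-byte payload with `segment_size = 4` yields three 4-byte shards, the last carrying two bytes of padding.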
### Storage / Persistence
Segments **MAY** be persisted (e.g., SQLite) and indexed by `entire_message_hash` and sender public key.
Implementations **SHOULD** support:
- Duplicate detection and idempotent saves
- Completion flags to prevent duplicate processing
- Timeout-based cleanup of incomplete reconstructions
- Per-sender quotas for stored bytes and concurrent reconstructions
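
One way the four suggestions above could fit together is sketched here with an in-memory store. Everything in it is hypothetical (class name, method names, a byte quota as the only quota), and a persistent implementation would keep the same bookkeeping in a database instead.

```python
import time
from typing import Dict, Optional, Set, Tuple

Key = Tuple[bytes, bytes]  # (sender public key, entire_message_hash)


class SegmentStore:
    """Hypothetical in-memory segment cache, indexed by sender and hash."""

    def __init__(self, timeout_s: float, max_bytes_per_sender: int):
        self.timeout_s = timeout_s
        self.max_bytes_per_sender = max_bytes_per_sender
        self._segments: Dict[Key, Dict[Tuple[str, int], bytes]] = {}
        self._first_seen: Dict[Key, float] = {}
        self._completed: Set[Key] = set()  # unbounded here; prune in practice
        self._bytes_by_sender: Dict[bytes, int] = {}

    def save(self, sender: bytes, message_hash: bytes, kind: str, index: int,
             payload: bytes, now: Optional[float] = None) -> bool:
        """Store one segment; return False if it was ignored."""
        now = time.monotonic() if now is None else now
        key = (sender, message_hash)
        if key in self._completed:
            return False  # completion flag: message already processed
        bucket = self._segments.get(key, {})
        if (kind, index) in bucket:
            return False  # duplicate: saving again changes nothing
        used = self._bytes_by_sender.get(sender, 0)
        if used + len(payload) > self.max_bytes_per_sender:
            return False  # per-sender byte quota exceeded
        bucket[(kind, index)] = payload
        self._segments[key] = bucket
        self._first_seen.setdefault(key, now)
        self._bytes_by_sender[sender] = used + len(payload)
        return True

    def mark_completed(self, sender: bytes, message_hash: bytes) -> None:
        """Flag a message as reconstructed and free its segments."""
        key = (sender, message_hash)
        self._completed.add(key)
        self._drop(key)

    def expire(self, now: Optional[float] = None) -> int:
        """Drop incomplete reconstructions older than the timeout."""
        now = time.monotonic() if now is None else now
        stale = [k for k, t in self._first_seen.items()
                 if now - t >= self.timeout_s]
        for key in stale:
            self._drop(key)
        return len(stale)

    def _drop(self, key: Key) -> None:
        bucket = self._segments.pop(key, {})
        self._first_seen.pop(key, None)
        freed = sum(len(p) for p in bucket.values())
        sender = key[0]
        self._bytes_by_sender[sender] = self._bytes_by_sender.get(sender, 0) - freed
```

The `now` parameter exists only to make the timeout logic testable without sleeping; callers would normally omit it.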
### Configuration
**Required parameters:**
- `segmentSize`: **REQUIRED** configurable parameter;
  maximum size in bytes of each data segment's payload chunk (before protobuf serialization).

**Fixed parameters:**
- `parityRate`: fixed at **0.125** (12.5%)
- `maxTotalSegments`: fixed at **256**

**Reconstruction capability:**
With the predefined parity rate,
reconstruction is possible if **all data segments** are received or if **any combination of data + parity** segments totals at least `dataSegments`, the number of data segments (i.e., the receiver tolerates the loss of as many segments as there are parity segments).
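
The arithmetic behind this condition can be sketched as below. The rounding rule is an assumption (round down); the text above gives only the rate. Under that assumption, 16 data segments yield 2 parity segments, so any 16 of the 18 segments sent are enough, while messages with fewer than 8 data segments get no parity segment at all.

```python
import math

PARITY_RATE = 0.125        # fixed parityRate from this section
MAX_TOTAL_SEGMENTS = 256   # fixed maxTotalSegments from this section


def parity_segment_count(data_segments: int) -> int:
    # ASSUMPTION: round down; the rounding rule is not stated in the text.
    return math.floor(data_segments * PARITY_RATE)


def fits_segment_limit(data_segments: int) -> bool:
    """Check that data + parity segments stay within maxTotalSegments."""
    total = data_segments + parity_segment_count(data_segments)
    return total <= MAX_TOTAL_SEGMENTS


def can_reconstruct(received_data: int, received_parity: int,
                    data_segments: int) -> bool:
    """Any mix of data and parity works once it reaches the data count."""
    return received_data + received_parity >= data_segments
```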
**API simplicity:**
Libraries **SHOULD** require only `segmentSize` from the application for normal operation.
### Support
- **Language / Package:** Nim;
**Nimble** package manager
- **Intended for:** all Waku nodes at the application layer
---
## Security Considerations
### Privacy
`entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
Traffic analysis may still identify segmented flows.
### Integrity
Implementations **MUST** verify the Keccak256 hash post-reconstruction and discard on mismatch.
### Denial of Service
To mitigate resource exhaustion:
- Limit concurrent reconstructions and per-sender storage
- Enforce timeouts and size caps
- Validate segment counts (≤ 256)
- Consider rate-limiting using [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay)
### Compatibility
Nodes that do **not** implement this specification cannot reconstruct large messages.

---
## Deployment Considerations
**Overhead:**
- Parity segments (if enabled) add bandwidth overhead approximately equal to the predefined parity rate (12.5%)
- Additional per-segment overhead ≤ **100 bytes** (protobuf + metadata)

**Network impact:**
- Larger messages increase gossip traffic and storage;
  operators **SHOULD** consider policy limits
---
## References
1. [10/WAKU2 - Waku](https://rfc.vac.dev/waku/standards/core/10/waku2)
2. [11/WAKU2-RELAY - Relay](https://rfc.vac.dev/waku/standards/core/11/relay)
3. [14/WAKU2-MESSAGE - Message](https://rfc.vac.dev/waku/standards/core/14/message)
4. [64/WAKU2-NETWORK](https://rfc.vac.dev/waku/standards/core/64/network#message-size)
5. [nim-leopard](https://github.com/status-im/nim-leopard) - Nim bindings for Leopard-RS (Reed-Solomon)
6. [Leopard-RS](https://github.com/catid/leopard) - Fast Reed-Solomon erasure coding library
7. [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt) - Key words for use in RFCs to Indicate Requirement Levels