diff --git a/.wordlist.txt b/.wordlist.txt
index 8e754de..0fe9c8e 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -1,6 +1,9 @@
 ALLOC
+Changelog
 creativecommons
 danielkaiser
+dataSegments
+Deployability
 DHT
 DoS
 github
@@ -10,14 +13,27 @@
 GossipSub
 https
 iana
 IANA
+Keccak
 libp2p
+maxTotalSegments
 md
+Nim
+nim
+parityRate
+proto
+protobuf
 pubsub
 rfc
 RFC
+RLN
+SegmentMessageProto
+segmentSize
 SHARDING
 subnets
+uint
+waku
 Waku
+Waku's
 WAKU
 www
 ZXCV
\ No newline at end of file
diff --git a/standards/application/segmentation.md b/standards/application/segmentation.md
new file mode 100644
index 0000000..3d1d128
--- /dev/null
+++ b/standards/application/segmentation.md
@@ -0,0 +1,213 @@
+---
+title: Message Segmentation and Reconstruction
+name: Message Segmentation and Reconstruction
+tags: [waku-application, segmentation]
+version: 0.1
+status: draft
+---
+
+## Abstract
+
+This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a message transport or delivery service with a size limitation, for cases where the original payload exceeds that limitation.
+Applications partition the payload into multiple wire-message envelopes and reconstruct the original on receipt,
+even when segments arrive out of order or when up to a **predefined percentage** of segments is lost.
+The protocol uses **Reed–Solomon** erasure coding for fault tolerance.
+Messages whose payload size is **≤ `segmentSize`** are sent unmodified.
+
+## Motivation
+
+Waku Relay deployments typically propagate envelopes of up to **150 KB**, as specified in [64/WAKU2-NETWORK - Message](https://rfc.vac.dev/waku/standards/core/64/network#message-size).
+To support larger application payloads,
+a segmentation layer is required.
+This specification enables larger messages by partitioning them into multiple envelopes and reconstructing them at the receiver.
+Erasure-coded parity segments provide resilience against partial loss or reordering.
+
+## Terminology
+
+- **original payload**: the full application payload before segmentation.
+- **data segment**: one of the partitioned chunks of the original message payload.
+- **parity segment**: an erasure-coded segment derived from the set of data segments.
+- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
+- **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
+- **sender public key**: the origin identifier used to index persisted segments.
+
+The key words **"MUST"**, **"MUST NOT"**, **"REQUIRED"**, **"SHALL"**, **"SHALL NOT"**, **"SHOULD"**, **"SHOULD NOT"**, **"RECOMMENDED"**, **"NOT RECOMMENDED"**, **"MAY"**, and **"OPTIONAL"** in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
+
+## Wire Format
+
+Each segmented message is encoded as a `SegmentMessageProto` protobuf message:
+
+```protobuf
+syntax = "proto3";
+
+message SegmentMessageProto {
+  // Keccak256(original payload), 32 bytes
+  bytes entire_message_hash = 1;
+
+  // Data segment indexing
+  uint32 index = 2;                 // zero-based sequence number; valid only if segments_count > 0
+  uint32 segments_count = 3;        // number of data segments (>= 2)
+
+  // Segment payload (data or parity shard)
+  bytes payload = 4;
+
+  // Parity segment indexing (used if segments_count == 0)
+  uint32 parity_segment_index = 5;  // zero-based sequence number for parity segments
+  uint32 parity_segments_count = 6; // number of parity segments (> 0)
+}
+```
+
+**Field descriptions:**
+
+- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and to verify reconstruction integrity.
+- `index`: Zero-based sequence number identifying this data segment's position (0, 1, 2, ..., segments_count - 1).
+- `segments_count`: Total number of data segments the original message was split into.
+- `payload`: The actual chunk of data or parity information for this segment.
+- `parity_segment_index`: Zero-based sequence number for parity segments.
+- `parity_segments_count`: Total number of parity segments generated.
+
+A message is either a **data segment** (when `segments_count > 0`) or a **parity segment** (when `segments_count == 0`).
+
+### Validation
+
+Receivers **MUST** enforce:
+
+- `entire_message_hash.length == 32`
+- **Data segments:**
+  `segments_count >= 2` **AND** `index < segments_count`
+- **Parity segments:**
+  `segments_count == 0` **AND** `parity_segments_count > 0` **AND** `parity_segment_index < parity_segments_count`
+
+No other combinations are permitted.
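+
+As a non-normative illustration, the checks above can be expressed as follows in Nim
+(the reference implementation language).
+The in-memory record and proc names are hypothetical and merely stand in for the decoded `SegmentMessageProto`:
+
+```nim
+type
+  SegmentMessage = object
+    ## Hypothetical in-memory mirror of SegmentMessageProto.
+    entireMessageHash: seq[byte]  # Keccak256 of the original payload (32 bytes)
+    index: uint32                 # data segment index
+    segmentsCount: uint32         # number of data segments (0 for parity segments)
+    payload: seq[byte]            # data or parity shard
+    paritySegmentIndex: uint32    # parity segment index
+    paritySegmentsCount: uint32   # number of parity segments
+
+proc isValidSegment(msg: SegmentMessage): bool =
+  ## Applies the validation rules listed above.
+  if msg.entireMessageHash.len != 32:
+    return false
+  if msg.segmentsCount > 0'u32:
+    # data segment: at least two data segments and an in-range index
+    result = msg.segmentsCount >= 2'u32 and msg.index < msg.segmentsCount
+  else:
+    # parity segment: a positive parity count and an in-range parity index
+    result = msg.paritySegmentsCount > 0'u32 and
+             msg.paritySegmentIndex < msg.paritySegmentsCount
+```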
+
+## Segmentation
+
+### Sending
+
+When the original payload exceeds `segmentSize`, the sender:
+
+- **MUST** compute a 32-byte `entire_message_hash = Keccak256(original_payload)`.
+- **MUST** split the payload into two or more **data segments**,
+  each of size up to `segmentSize` bytes.
+- **MAY** generate **parity segments** using Reed–Solomon erasure coding at the predefined parity rate.
+- **MUST** encode each segment as a `SegmentMessageProto` with:
+  - The `entire_message_hash`
+  - Either data-segment indices (`segments_count`, `index`) or parity-segment indices (`parity_segments_count`, `parity_segment_index`)
+  - The raw payload data
+- **MUST** send all segments as individual Waku envelopes,
+  preserving application-level metadata (e.g., content topic).
+
+Payloads smaller than or equal to `segmentSize` **SHALL** be transmitted unmodified.
+
+### Receiving
+
+Upon receiving a segment message, the receiver:
+
+- **MUST** validate each segment according to [Wire Format → Validation](#validation).
+- **MUST** cache received segments, grouped by `entire_message_hash`.
+- **MUST** attempt reconstruction when the number of available (data + parity) segments equals or exceeds the data segment count (see the non-normative sketch below), by:
+  - concatenating the data segments if all are present, or
+  - applying Reed–Solomon decoding if parity segments are available.
+- **MUST** verify that `Keccak256(reconstructed_payload)` matches `entire_message_hash`.
+  On mismatch,
+  the message **MUST** be discarded and logged as invalid.
+- Once verified,
+  the reconstructed payload **SHALL** be delivered to the application.
+- Incomplete reconstructions **SHOULD** be garbage-collected after a timeout.
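+
+The following non-normative Nim sketch shows the round trip without parity:
+splitting a payload into data segments and reassembling it by concatenation.
+All names are illustrative, `keccak256` is a placeholder for a real hash implementation,
+and Reed–Solomon encoding and decoding of parity segments (e.g., via nim-leopard) is omitted:
+
+```nim
+import std/algorithm
+
+type
+  DataSegment = object
+    index: uint32          # position of this chunk within the original payload
+    segmentsCount: uint32  # total number of data segments
+    payload: seq[byte]     # chunk of the original payload
+
+# Placeholder only: a real implementation computes the 32-byte Keccak256 digest.
+proc keccak256(data: seq[byte]): seq[byte] =
+  discard
+
+proc split(payload: seq[byte], segmentSize: int): seq[DataSegment] =
+  ## Partitions `payload` into chunks of at most `segmentSize` bytes.
+  ## Payloads of size <= segmentSize are sent unmodified and never reach this point.
+  let count = (payload.len + segmentSize - 1) div segmentSize
+  for i in 0 ..< count:
+    let lo = i * segmentSize
+    let hi = min(lo + segmentSize, payload.len)
+    result.add(DataSegment(
+      index: uint32(i),
+      segmentsCount: uint32(count),
+      payload: payload[lo ..< hi]))
+
+proc reconstruct(segments: seq[DataSegment], expectedHash: seq[byte]): seq[byte] =
+  ## Reassembles the original payload once all data segments are available,
+  ## then checks integrity against `entire_message_hash`.
+  var segs = segments
+  segs.sort(proc (a, b: DataSegment): int = cmp(a.index, b.index))
+  doAssert segs.len > 0 and segs.len == int(segs[0].segmentsCount),
+    "all data segments required here; otherwise apply Reed-Solomon decoding"
+  for s in segs:
+    result.add(s.payload)
+  doAssert keccak256(result) == expectedHash, "hash mismatch: discard and log as invalid"
+```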
+
+---
+
+## Implementation Suggestions
+
+### Reed–Solomon
+
+Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
+The last data chunk **MUST** be padded to `segmentSize` for encoding.
+The reference implementation uses **nim-leopard** (Leopard-RS) with a maximum of **256 total shards**.
+
+### Storage / Persistence
+
+Segments **MAY** be persisted (e.g., in SQLite) and indexed by `entire_message_hash` and sender public key.
+Implementations **SHOULD** support:
+
+- Duplicate detection and idempotent saves
+- Completion flags to prevent duplicate processing
+- Timeout-based cleanup of incomplete reconstructions
+- Per-sender quotas for stored bytes and concurrent reconstructions
+
+### Configuration
+
+**Required parameters:**
+
+- `segmentSize` — **REQUIRED** configurable parameter;
+  maximum size in bytes of each data segment's payload chunk (before protobuf serialization).
+
+**Fixed parameters:**
+
+- `parityRate` — fixed at **0.125** (12.5%)
+- `maxTotalSegments` — **256**
+
+**Reconstruction capability:**
+With the predefined parity rate,
+reconstruction is possible when **all data segments** are received or when **any combination of data and parity segments** totals at least `dataSegments`;
+in other words, up to the predefined percentage of segments may be lost.
+
+**API simplicity:**
+Libraries **SHOULD** require only `segmentSize` from the application for normal operation.
+
+### Support
+
+- **Language / Package:** Nim;
+  **Nimble** package manager
+- **Intended for:** all Waku nodes at the application layer
+
+---
+
+## Security Considerations
+
+### Privacy
+
+`entire_message_hash` enables correlation of the segments that belong to the same original message but does not reveal content.
+Traffic analysis may still identify segmented flows.
+
+### Integrity
+
+Implementations **MUST** verify the Keccak256 hash after reconstruction and discard the message on mismatch.
+
+### Denial of Service
+
+To mitigate resource exhaustion:
+
+- Limit concurrent reconstructions and per-sender storage
+- Enforce timeouts and size caps
+- Validate segment counts (≤ 256)
+- Consider rate-limiting using [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay)
+
+### Compatibility
+
+Nodes that do **not** implement this specification cannot reconstruct large messages.
+
+---
+
+## Deployment Considerations
+
+**Overhead:**
+
+- Parity (if enabled) adds bandwidth overhead approximately equal to the predefined parity rate
+- Additional per-segment overhead ≤ **100 bytes** (protobuf + metadata)
+
+**Network impact:**
+
+- Larger messages increase gossip traffic and storage;
+  operators **SHOULD** consider policy limits
+
+---
+
+## References
+
+1. [10/WAKU2 – Waku](https://rfc.vac.dev/waku/standards/core/10/waku2)
+2. [11/WAKU2-RELAY – Relay](https://rfc.vac.dev/waku/standards/core/11/relay)
+3. [14/WAKU2-MESSAGE – Message](https://rfc.vac.dev/waku/standards/core/14/message)
+4. [64/WAKU2-NETWORK](https://rfc.vac.dev/waku/standards/core/64/network#message-size)
+5. [nim-leopard](https://github.com/status-im/nim-leopard) – Nim bindings for Leopard-RS (Reed–Solomon)
+6. [Leopard-RS](https://github.com/catid/leopard) – Fast Reed–Solomon erasure coding library
+7. [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt) – Key words for use in RFCs to Indicate Requirement Levels