feat: add segmentation spec (#91)

Pablo Lopez 2026-05-06 01:11:15 +03:00 committed by GitHub
parent a19d99b79f
commit 63aa0d31b9
GPG Key ID: B5690EEEBB952194
2 changed files with 268 additions and 28 deletions


@ -1,27 +1,26 @@
ABIs
acknowledgementTimeoutMs
aea
addListener
aea
ALLOC
api
AsyncAPI
autosharding
ba
AutoShardingConfig
ba
backend
Backend
backends
Baggs
Bande
Backend
backend
backends
BCP
bool
Bradner
camelCase
causalHistorySize
cd
da
df
centric
Changelog
channelConfig
channelId
ciphertext
@ -36,7 +35,10 @@ createNode
createReliableChannel
creativecommons
cryptographic
da
danielkaiser
dataSegments
dataSegments
decrypt
Decrypt
decrypted
@ -47,17 +49,20 @@ DefaultNetworkingConfig
DefaultRateLimitConfig
DefaultSdsConfig
DefaultSegmentationConfig
Deployability
dev
dev
df
DHT
discv
DISCV
DoS
ec
enableReedSolomon
encodings
encryptionKey
enrtree
enum
ec
epochPeriodSec
epochSizeMs
eth
@ -71,10 +76,10 @@ EventReliableMessageSent
EventSource
eventType
fb
fBF
fc
fd
Folgueira
fBF
getAvailableConfigs
getAvailableNodeInfoIds
getAvailableNodeInfoItems
@ -93,37 +98,39 @@ iana
IANA
IDL
IEncryption
IPersistence
implementers
implementor
implementors
Implementors
Inclusivity
Init
IPersistence
ipv
iterable
Jazzz
JSON
keccak
Keccak
KiB
Kozlov
lifecycle
liblogosdelivery
libp
libp2p
lifecycle
Lightpush
LIGHTPUSH
maxRetransmissions
maxTotalSegments
md
MessageDeliveredEvent
MessageDeliveryErrorEvent
MessageHash
MessageId
messageEnvelope
MessageEnvelope
MessageErrorEvent
messageEvents
MessageEvents
MessageHash
MessageId
MessagePropagatedEvent
MessageReceivedEvent
MessageSendErrorEvent
@ -136,6 +143,8 @@ namespace
NetworkConfig
NetworkingConfig
nim
nim
Nim
nodeConfig
NodeConfig
nodeInfoId
@ -146,37 +155,39 @@ onMessageDelivered
onMessageReceived
onMessageSendError
onMessageSent
openChannel
OpenAPI
openChannel
parityRate
PartiallyConnected
PascalCase
Pax
plaintext
PoV
permissionless
plaintext
pluggable
PoV
Prathi
pre
Prem
ProtocolsConfig
proto
protobuf
ProtocolsConfig
pubsub
rateLimitConfig
Req
RateLimitConfig
Raya
Raya's
ReliableChannel
ReliableChannelConfig
ReliableEnvelope
ReliableSendId
ReliableIrretrievableMessageEvent
ReliableMessageAcknowledgedEvent
ReliableMessageEvents
ReliableMessageReceivedEvent
ReliableMessageSendErrorEvent
ReliableMessageSentEvent
ReliableSendId
ReliableSyncStatusEvent
Req
requestId
RequestId
responder
@ -189,36 +200,41 @@ rfc
RFC
rln
RLN
RLN
RLN
RlnConfig
Royer
rpc
RPC
Saro
Scalable
Sirotin
sdk
SDK
sds
SDS
SDS'ed
sdsConfig
SdsConfig
sdk
senderId
syncStatus
SyncStatus
SyncStatusDetail
SDK
SegmentationConfig
segmentationConfig
SegmentationConfig
SegmentMessageProto
segmentSize
segmentSizeBytes
senderId
sharding
sharding
SHARDING
Sirotin
sqlite
subnets
SubscriptionError
syncStatus
SyncStatus
SyncStatusDetail
TBD
tcp
th
TCP
th
TheWakuNetworkMessageValidation
TheWakuNetworkPreset
TODO
@ -226,13 +242,17 @@ TWN
udp
UDP
uint
uint
unencrypted
unvalidated
UUID
UX
waku
waku
waku
Waku
WAKU
Waku's
WakuNode
www
xB


@ -0,0 +1,220 @@
---
title: Message Segmentation and Reconstruction
name: Message Segmentation and Reconstruction
tags: [segmentation]
version: 0.1
status: raw
---
## Abstract
This specification defines an application-layer protocol for **segmentation** and **reconstruction** of messages carried over a transport/delivery service with a message-size limitation, when the original payload exceeds said limitation.
Applications partition the payload into multiple transport messages and reconstruct the original on receipt,
even when segments arrive out of order or up to a **predefined percentage** of segments are lost.
The protocol optionally uses **Reed-Solomon** erasure coding for fault tolerance.
All messages are wrapped in a `SegmentMessageProto`, including those that fit in a single segment.
## Motivation
Many message transport and delivery protocols impose a maximum message size that restricts the size of application payloads.
For example, Waku Relay typically propagates messages up to **150 KB** as per [64/WAKU2-NETWORK - Message](https://rfc.vac.dev/waku/standards/core/64/network#message-size).
To support larger application payloads, a segmentation layer is required.
This specification enables larger messages by partitioning them into multiple envelopes and reconstructing them at the receiver.
Erasure-coded parity segments provide resilience against partial loss, and segment indices allow reconstruction regardless of arrival order.
## Terminology
- **original payload**: the full application payload before segmentation.
- **data segment**: one of the partitioned chunks of the original message payload.
- **parity segment**: an erasure-coded segment derived from the set of data segments.
- **segment message**: a wire-message whose `payload` field carries a serialized `SegmentMessageProto`.
- **`segmentSize`**: configured maximum size in bytes of each data segment's `payload` chunk (before protobuf serialization).
The key words **"MUST"**, **"MUST NOT"**, **"REQUIRED"**, **"SHALL"**, **"SHALL NOT"**, **"SHOULD"**, **"SHOULD NOT"**, **"RECOMMENDED"**, **"NOT RECOMMENDED"**, **"MAY"**, and **"OPTIONAL"** in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
## Wire Format
Each segment is encoded as a `SegmentMessageProto` protobuf message:
```protobuf
syntax = "proto3";
message SegmentMessageProto {
// Keccak256(original payload), 32 bytes
bytes entire_message_hash = 1;
// Data segment indexing
uint32 data_segment_index = 2; // zero-indexed sequence number for data segments
uint32 data_segment_count = 3; // number of data segments (>= 1)
// Segment payload (data or parity shard)
bytes payload = 4;
// Parity segment indexing
uint32 parity_segment_index = 5; // zero-based sequence number for parity segments
uint32 parity_segment_count = 6; // number of parity segments
// Segment type
bool is_parity = 7; // true for parity segments, false (default) for data segments
}
```
**Field descriptions:**
- `entire_message_hash`: A 32-byte Keccak256 hash of the original complete payload, used to identify which segments belong together and verify reconstruction integrity.
- `data_segment_index`: Zero-indexed sequence number identifying this data segment's position (0, 1, 2, ..., data_segment_count - 1). Set only on data segments.
- `data_segment_count`: Total number of data segments the original message was split into. Set on every segment (data and parity).
- `payload`: The actual chunk of data or parity information for this segment.
- `parity_segment_index`: Zero-based sequence number for parity segments. Set only on parity segments.
- `parity_segment_count`: Total number of parity segments generated. Set on every segment (data and parity) when Reed-Solomon parity is used; `0` (default) otherwise.
- `is_parity`: Explicit segment type marker. `false` (default) for data segments; `true` for parity segments.
A message is either a **data segment** (when `is_parity == false`) or a **parity segment** (when `is_parity == true`).
### Validation
Receivers **MUST** enforce:
- `entire_message_hash.length == 32`
- `data_segment_count >= 1`
- `data_segment_count + parity_segment_count < maxTotalSegments`
- **Data segments** (`is_parity == false`):
`data_segment_index < data_segment_count`
- **Parity segments** (`is_parity == true`):
`parity_segment_count > 0` AND `parity_segment_index < parity_segment_count`
No other combinations are permitted.
A `SegmentMessageProto` with `data_segment_count == 1` and `data_segment_index == 0` is a valid single-segment data message: the `payload` field carries the entire original payload (see [Sending](#sending)).
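
The rules above can be sketched as a single predicate. This is a minimal illustration, not the reference implementation: the `Segment` dataclass is a hypothetical in-memory mirror of `SegmentMessageProto`, `is_valid` is a made-up name, and the total-count bound uses the strict inequality exactly as written in the list.

```python
from dataclasses import dataclass

# Value the reference implementation is stated to use for maxTotalSegments.
MAX_TOTAL_SEGMENTS = 256


@dataclass(frozen=True)
class Segment:
    """Hypothetical in-memory mirror of SegmentMessageProto (not generated code)."""
    entire_message_hash: bytes
    data_segment_index: int = 0
    data_segment_count: int = 0
    payload: bytes = b""
    parity_segment_index: int = 0
    parity_segment_count: int = 0
    is_parity: bool = False


def is_valid(seg: Segment, max_total_segments: int = MAX_TOTAL_SEGMENTS) -> bool:
    """Return True if the segment passes every rule in the Validation list."""
    if len(seg.entire_message_hash) != 32:
        return False
    if seg.data_segment_count < 1:
        return False
    # Strict inequality, following the rule as written in the list above.
    if not (seg.data_segment_count + seg.parity_segment_count < max_total_segments):
        return False
    if seg.is_parity:
        # Parity segments need a non-zero parity count and an in-range parity index.
        return (seg.parity_segment_count > 0
                and seg.parity_segment_index < seg.parity_segment_count)
    # Data segments need an in-range data index.
    return seg.data_segment_index < seg.data_segment_count
```

A receiver would typically call such a predicate before caching a segment, so that malformed segments never consume buffer space.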
## Segmentation
### Sending
To transmit a payload, the sender:
- **MUST** compute a 32-byte `entire_message_hash = Keccak256(original_payload)`.
- **MUST** split the payload into one or more **data segments**,
each of size up to `segmentSize` bytes.
A payload of size ≤ `segmentSize` produces a single data segment (`data_segment_count == 1`).
- **MUST** pad the last data segment to `segmentSize` bytes if Reed-Solomon erasure coding is enabled.
- **MAY** use Reed-Solomon erasure coding at the predefined parity rate to generate parity segments.
- **MUST** encode every segment as a `SegmentMessageProto` with:
  - The `entire_message_hash`
  - `data_segment_count` (total number of data segments, always set)
  - When Reed-Solomon parity is used, `parity_segment_count` (total number of parity segments, set on every segment)
  - For data segments: `is_parity = false`, `data_segment_index`
  - For parity segments: `is_parity = true`, `parity_segment_index`
  - The raw payload data
- Send each segment as an individual transport message according to the underlying transport service.
This yields a deterministic wire format: every transmitted payload is a `SegmentMessageProto`.
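
The data-segment part of the sending steps can be sketched as follows. Several things here are assumptions for illustration only: the names `split_payload` and `DataSegment`, zero-byte padding of the last chunk (the specification does not state the pad value), and the handling of an empty payload as one empty segment. Python's standard library has no Keccak256, so the hash is injected; the default `standin_hash` uses SHA3-256, which is **not** Keccak256 and must be replaced by a real Keccak256 function in practice. Parity generation is omitted.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


def standin_hash(data: bytes) -> bytes:
    # Stand-in only: SHA3-256 differs from Keccak256 in its padding rule.
    return hashlib.sha3_256(data).digest()


@dataclass(frozen=True)
class DataSegment:
    """Hypothetical mirror of the data-segment fields of SegmentMessageProto."""
    entire_message_hash: bytes
    data_segment_index: int
    data_segment_count: int
    payload: bytes
    is_parity: bool = False


def split_payload(payload: bytes,
                  segment_size: int,
                  pad_last: bool = False,
                  hash_fn: Callable[[bytes], bytes] = standin_hash) -> List[DataSegment]:
    """Split a payload into data segments of at most segment_size bytes."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    message_hash = hash_fn(payload)  # hash of the original, unpadded payload
    chunks = [payload[i:i + segment_size]
              for i in range(0, len(payload), segment_size)]
    if not chunks:
        chunks = [b""]  # assumption: an empty payload still yields one segment
    if pad_last:
        # Assumption: zero-byte padding, applied only when erasure coding is on.
        chunks[-1] = chunks[-1].ljust(segment_size, b"\x00")
    count = len(chunks)
    return [DataSegment(message_hash, index, count, chunk)
            for index, chunk in enumerate(chunks)]
```

For example, a 10-byte payload with `segment_size=4` yields three data segments carrying 4, 4, and 2 bytes, each tagged with `data_segment_count == 3`.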
### Receiving
Upon receiving a segment message, the receiver:
- **MUST** validate each segment according to [Wire Format → Validation](#validation).
- **MUST** cache received segments.
- **MUST** attempt reconstruction once at least `data_segment_count` distinct segments (data and parity combined) have been received:
  - If all data segments are present, concatenate their `payload` fields in `data_segment_index` order.
  - Otherwise, recover the payload via Reed-Solomon decoding over the available data and parity segments.
- **MUST** verify `Keccak256(reconstructed_payload)` matches `entire_message_hash`.
On mismatch,
the message **MUST** be discarded and logged as invalid.
- Once verified,
the reconstructed payload **SHALL** be delivered to the application.
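
A minimal sketch of the receiving path for the case where every data segment arrives and no padding was applied. The `Reassembler` class and its method name are hypothetical, segments are assumed to have passed validation already, and Reed-Solomon recovery is deliberately left out. As Python's standard library has no Keccak256, the hash function is injected; the SHA3-256 default is a stand-in and is **not** Keccak256.

```python
import hashlib
import logging
from typing import Callable, Dict, Optional


def standin_hash(data: bytes) -> bytes:
    # Stand-in only: SHA3-256 is not Keccak256; inject a real Keccak256 in practice.
    return hashlib.sha3_256(data).digest()


class Reassembler:
    """Caches data segments per message hash and reconstructs when complete."""

    def __init__(self, hash_fn: Callable[[bytes], bytes] = standin_hash) -> None:
        self._hash_fn = hash_fn
        # message hash -> {data_segment_index: payload chunk}
        self._cache: Dict[bytes, Dict[int, bytes]] = {}

    def add_data_segment(self, message_hash: bytes, index: int,
                         count: int, payload: bytes) -> Optional[bytes]:
        """Store one data segment; return the full payload once it verifies."""
        chunks = self._cache.setdefault(message_hash, {})
        chunks.setdefault(index, payload)  # a duplicate index is ignored
        if not all(i in chunks for i in range(count)):
            return None  # still waiting for more data segments
        # Concatenate in data_segment_index order, regardless of arrival order.
        candidate = b"".join(chunks[i] for i in range(count))
        del self._cache[message_hash]
        if self._hash_fn(candidate) != message_hash:
            logging.warning("discarding message: hash mismatch after reconstruction")
            return None
        return candidate
```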
---
## Implementation Suggestions
### Reed-Solomon
Implementations that apply parity **SHALL** use fixed-size shards of length `segmentSize`.
The reference implementation uses **nim-leopard** (Leopard-RS) with a maximum of **256 total shards**.
### Storage / Persistence
Segments may be persisted (e.g., in SQLite) and indexed by `entire_message_hash` and by sender. The sender may be authenticated, but authentication is out of scope for this specification.
Implementations **SHOULD** support:
- Duplicate detection and idempotent saves
- Completion flags to prevent duplicate processing
- Timeout-based cleanup of incomplete reconstructions
- Per-sender quotas for stored bytes and concurrent reconstructions
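
Two of the items above, idempotent saves and timeout-based cleanup, can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and function names are hypothetical and chosen only for illustration; the composite primary key is what makes a repeated save a no-op.

```python
import sqlite3

# Hypothetical schema: one row per received segment.
SCHEMA = """
CREATE TABLE IF NOT EXISTS segments (
    message_hash BLOB    NOT NULL,
    sender_id    TEXT    NOT NULL,
    is_parity    INTEGER NOT NULL,
    seg_index    INTEGER NOT NULL,
    payload      BLOB    NOT NULL,
    received_at  REAL    NOT NULL,
    PRIMARY KEY (message_hash, sender_id, is_parity, seg_index)
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the segment store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_segment(conn: sqlite3.Connection, message_hash: bytes, sender_id: str,
                 is_parity: bool, seg_index: int, payload: bytes,
                 now: float) -> bool:
    """Insert a segment; return False if an identical key was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO segments VALUES (?, ?, ?, ?, ?, ?)",
        (message_hash, sender_id, int(is_parity), seg_index, payload, now),
    )
    conn.commit()
    return cur.rowcount == 1


def purge_older_than(conn: sqlite3.Connection, cutoff: float) -> int:
    """Delete segments received before cutoff; return how many were removed."""
    cur = conn.execute("DELETE FROM segments WHERE received_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Purging row by row is a simplification; an implementation might instead expire a whole incomplete reconstruction at once, keyed by `entire_message_hash`.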
### Configuration
- `segmentSize` — maximum size in bytes of each data segment's payload chunk (before protobuf serialization).
**REQUIRED** parameter, configurable by the client.
- `parityRate` — fraction of parity shards relative to data shards.
Configurable by the client. Defaults to **0.125** (12.5%).
- `maxTotalSegments` — maximum number of total shards (data + parity) per message.
Implementation-specific parameter, fixed. The reference implementation uses **256**.
**Reconstruction capability:**
Reconstruction is possible once any `data_segment_count` distinct segments (data and parity combined) have been received. With parity enabled, the receiver therefore tolerates the loss of up to `parity_segment_count` segments per message, which corresponds to the configured parity rate relative to the number of data segments.
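
A small worked example of this arithmetic. The function names are hypothetical, and rounding the parity count up is an assumption: the specification does not state how a fractional count is rounded. With 16 data segments and the default rate of 0.125, this gives 2 parity segments, 18 segments in total, and reconstruction from any 16 of them.

```python
import math


def parity_segment_count(data_segment_count: int, parity_rate: float = 0.125) -> int:
    """Number of parity segments for a message (assumption: rounded up)."""
    return math.ceil(data_segment_count * parity_rate)


def can_reconstruct(received_data: int, received_parity: int,
                    data_segment_count: int) -> bool:
    """True once enough distinct segments (data plus parity) have arrived."""
    return received_data + received_parity >= data_segment_count


data_count = 16
parity_count = parity_segment_count(data_count)   # 16 * 0.125 = 2
total = data_count + parity_count                  # 18 segments on the wire
max_losses = total - data_count                    # up to 2 segments may be lost
```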
**API simplicity:**
Libraries **SHOULD** require only `segmentSize` from the application for normal operation.
---
## Security Considerations
### Privacy
`entire_message_hash` enables correlation of segments that belong to the same original message but does not reveal content.
To prevent this correlation, applications **SHOULD** encrypt each segment after segmentation (see [Encryption](#encryption)).
Traffic analysis may still identify segmented flows.
### Encryption
This specification does not provide confidentiality.
Applications **SHOULD** encrypt each segment after segmentation
(i.e., encrypt the serialized `SegmentMessageProto` prior to transmission),
so that `entire_message_hash` and other identifying fields are not visible to observers.
### Integrity
Implementations **MUST** verify the Keccak256 hash post-reconstruction and discard on mismatch.
### Denial of Service
To mitigate resource exhaustion:
- Limit total concurrent reconstructions and aggregate buffered bytes
- When sender identity is available, apply the same two limits per sender
- Enforce timeouts and size caps
- Validate segment counts (≤ 256)
- Consider rate-limiting at the transport layer (for example, via [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) on Waku)
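
The first and third items can be sketched as an admission check run before a segment is buffered. Everything here is illustrative: the class and field names are hypothetical, the default limit values are arbitrary placeholders rather than recommendations, and per-sender accounting is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Limits:
    """Hypothetical limits; the default values are placeholders only."""
    max_reconstructions: int = 64
    max_buffered_bytes: int = 16 * 1024 * 1024
    timeout_s: float = 60.0


class ReconstructionBudget:
    """Tracks in-flight reconstructions and the bytes they hold."""

    def __init__(self, limits: Limits) -> None:
        self.limits = limits
        self._started: Dict[bytes, float] = {}  # message hash -> first-seen time
        self._bytes: Dict[bytes, int] = {}      # message hash -> buffered bytes

    def expire(self, now: float) -> int:
        """Drop reconstructions older than the timeout; return how many."""
        stale = [h for h, t in self._started.items()
                 if now - t > self.limits.timeout_s]
        for h in stale:
            del self._started[h]
            del self._bytes[h]
        return len(stale)

    def admit(self, message_hash: bytes, payload_len: int, now: float) -> bool:
        """Return True and account for the segment if it fits within the limits."""
        self.expire(now)
        is_new = message_hash not in self._started
        if is_new and len(self._started) >= self.limits.max_reconstructions:
            return False  # too many concurrent reconstructions
        if sum(self._bytes.values()) + payload_len > self.limits.max_buffered_bytes:
            return False  # aggregate buffer cap would be exceeded
        self._started.setdefault(message_hash, now)
        self._bytes[message_hash] = self._bytes.get(message_hash, 0) + payload_len
        return True
```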
---
## Deployment Considerations
**Overhead:**
- Bandwidth overhead from parity segments ≈ the configured parity rate (if parity is enabled)
- Additional per-segment overhead ≤ **100 bytes** (protobuf + metadata)
**Network impact:**
- Larger messages increase transport traffic and storage;
operators **SHOULD** consider policy limits
**Compatibility:**
- Nodes that do **not** implement this specification cannot reconstruct any messages.
---
## References
1. [10/WAKU2 Waku](https://rfc.vac.dev/waku/standards/core/10/waku2)
2. [11/WAKU2-RELAY Relay](https://rfc.vac.dev/waku/standards/core/11/relay)
3. [14/WAKU2-MESSAGE Message](https://rfc.vac.dev/waku/standards/core/14/message)
4. [64/WAKU2-NETWORK](https://rfc.vac.dev/waku/standards/core/64/network#message-size)
5. [nim-leopard](https://github.com/status-im/nim-leopard): Nim bindings for Leopard-RS (Reed-Solomon)
6. [Leopard-RS](https://github.com/catid/leopard): Fast Reed-Solomon erasure coding library
7. [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt): Key words for use in RFCs to Indicate Requirement Levels