docs: add store-independent messaging reliability design proposal

Design research on making the Store protocol a startup-only dependency and
giving the Messaging API layer store-free runtime reliability. Proposes a
Network Reliability Service that reuses the existing RBSR (store-sync) engine
decoupled from the archive, providing both receive-side gap recovery and
send-side confirmation by replacing store-presence with peer-set-presence.
Keeps SDS for channels and an MVDS-style e2e ACK for unicast.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KkHTPHnTiScRttjXQkfcdV
Claude 2026-06-28 22:08:45 +00:00
parent a0c637843c
commit 58feb187c4

@ -0,0 +1,369 @@
# Store-Independent Reliability for the Messaging API — Design Research & Proposal
> **Question.** Make the Store protocol a *startup-only* dependency (used to sync
> history at boot), and after start give the **Messaging API** layer a
> reliable, store-free network. Reliable Channels already get this from SDS — what
> is the equivalent for plain Messaging-API users, and do we need a new protocol?
>
> **Answer (one line).** Replace the two runtime store couplings with **one runtime
> peer-to-peer anti-entropy subsystem** (reuse the existing Range-Based Set
> Reconciliation engine, decoupled from the archive). The same subsystem provides
> *both* receive-side gap recovery *and* send-side delivery confirmation — by
> swapping **"is my message in a store node?"** for **"is my message in N peers'
> reconciliation sets?"**. Store stays only for cold-start history. No store at runtime.
---
## 1. Executive summary / recommendation
Reliability decomposes into two independent guarantees, and **today both are
store-backed**:
| Guarantee | Today (store-backed) | Proposed (store-free, runtime) |
|---|---|---|
| **Send-side confirmation** — "my message got out / is durable" | `SendService` periodically queries a **store node** for the message hash → `MessageSentEvent` | Message hash observed present in **≥N distinct peers' anti-entropy sets** → `MessageSentEvent` |
| **Receive-side completeness** — "I have everything on my topics" | `RecvService` queries a **store node** on reconnect to backfill the offline gap | **Runtime RBSR anti-entropy** with mesh peers over a rolling window + gossipsub IHAVE/IWANT for the live path |
The decisive insight from the research: **SDS cannot simply be "lifted" from the
channel layer to the messaging layer**, because SDS's reliability is a property of a
*bounded set of participants exchanging bidirectional traffic* (its acknowledgement is
"my message was observed as a causal dependency of a peer's later message"). The plain
Messaging API is open, sessionless, unbounded pub/sub — a pure publisher with no
responders gets **no implicit acks ever**. So we need a different primitive.
That primitive already exists in the codebase: the **store-sync Range-Based Set
Reconciliation (RBSR)** engine (`waku_store_sync/`). It is libp2p-native, peer-symmetric
(not client/server), and its reconciliation math runs over a pure in-memory hash set —
its *only* coupling to "store" is that it currently seeds from / writes to the local
archive. Decouple it from the archive, run it on a rolling window of recently-seen
messages, and it becomes a **general runtime anti-entropy service** that answers both
reliability questions without any store node.
**Recommendation:** introduce a **Network Reliability Service (NRS)** — runtime,
store-independent, RBSR-based — mounted on reliability-seeking nodes. Keep the store
protocol mounted **only** for (a) startup `StoreResume` (already cleanly startup-only)
and (b) an optional fallback "retrieval-hint" hash query. Phase it in behind the
existing `useP2PReliability` flag.
This is also **the direction Waku itself has committed to**. Paraphrasing its Reliable
Channel API work: no store in the Messaging API, with store functions reduced to hash
queries that support SDS retrieval hints and nothing more (exact quote and sources in §9).

---
## 2. Goal & constraints (restated)
- **G1.** Store protocol used **only at startup** to catch up history; no runtime store loop.
- **G2.** After start, the network must be reliable on its own (peer-to-peer / e2e).
- **G3.** Reliable Channels keep SDS (unchanged). The gap to close is the **Messaging API** layer (`logos_delivery/messaging/**`) used by non-channel callers.
- **C1.** Must keep emitting the existing broker events the upper layers depend on — especially `MessageSentEvent` (the Reliable Channel layer correlates per-segment on it by `RequestId`) and `MessageReceivedEvent`.
- **C2.** Must work for both full/relay nodes and light nodes (different capabilities).
- **C3.** Must not open a spam/replay hole (RLN must still gate recovered messages).
---
## 3. Current state — how reliability works today
### 3.1 The only two runtime store couplings to remove
From the exhaustive dependency inventory, **only two** usages are runtime-reliability
(everything else is startup, server-side, or REST/FFI/debug surface):
1. **Send-side validation**: `SendService.checkMsgsInStore`
(`logos_delivery/messaging/delivery_service/send_service/send_service.nim:132-182`).
Every ~3 s it runs `wakuStoreClient.queryToAny(StoreQueryRequest(includeData:false,
messageHashes:…))` for propagated-but-unvalidated hashes. Presence flips the task to
`SuccessfullyValidated`, which is the **only** producer of `MessageSentEvent`
(`:196-200`). Remove it with nothing in its place ⇒ `MessageSentEvent` is *never*
emitted; callers get only `MessagePropagatedEvent` (reached a neighbor), and the
Reliable Channel layer never sees segment confirmations.
2. **Receive-side backfill**: `RecvService.checkStore`
(`recv_service.nim:106-170`). On an offline→online edge
(`onConnectionStatusChange`, `:155-170`), it queries a store node over the offline
gap window, diffs against `recentReceivedMsgs`, fetches missing bodies
(`getMissingMsgsFromStore`, `:57-77`), and replays them as `MessageReceivedEvent`.
Remove it with nothing in its place ⇒ messages missed while offline are lost unless
SDS (channel layer) happens to detect the gap.
### 3.2 What is already store-independent
- **Live receive path** is already store-free: relay's per-message fan-out
(`node/subscription_manager.nim:57-78`) emits `MessageSeenEvent`, which `RecvService`
turns into `MessageReceivedEvent` with hash-dedup over a 7-min window. Gossipsub
itself self-heals short-horizon gaps via IHAVE/IWANT (≈6 s history, 2-min seen-TTL).
- **Send path** (relay publish + lightpush fallback + retry loop) is store-free; only
the *confirmation* step is store-backed.
### 3.3 What is already cleanly startup-only (keep as-is)
- **`StoreResume`** (`waku_store/resume.nim`) runs **once** at boot (3 retries), queries
`max(lastOnline, now - 6h) → now`, writes results into the local archive, and its only
ongoing task just persists a "last online" timestamp. This is already the model we
want to generalize — it is *not* a runtime reliability loop.
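
As a worked illustration of the resume window above, the lookback is the later of the last-online timestamp and six hours ago. This is a minimal Python sketch, not the Nim code in `waku_store/resume.nim`; the function and constant names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 6 h cap from the text, expressed as a constant for the example.
MAX_RESUME_LOOKBACK = timedelta(hours=6)


def resume_window(last_online, now):
    """Return the (start, end) range a boot-time resume would query.

    start is max(last_online, now - 6h): a short outage is covered exactly,
    while a long outage is capped at the six-hour lookback.
    """
    start = max(last_online, now - MAX_RESUME_LOOKBACK)
    return start, now


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(resume_window(now - timedelta(hours=1), now))   # starts one hour ago
    print(resume_window(now - timedelta(days=2), now))    # starts six hours ago
```

Anything older than the capped start is out of scope for resume, which is why §5.3 treats a long offline gap as a separate, bounded query.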
### 3.4 Why gossipsub alone isn't enough
`WakuRelay` is a thin `GossipSub` subclass and exposes **no** extended gap-recovery API.
Native gossipsub gives eager push + IHAVE/IWANT lazy pull, but the recovery window is
only ~6 s (`historyLength`) / 2 min (`seenTTL`); a node offline or partitioned longer
**misses the message entirely**, and light nodes (not in the mesh) get no lazy-pull at
all. Gossipsub is a best-effort *substrate*, not a guarantee.

---
## 4. The core problem decomposition
Two orthogonal questions, and **no single mechanism answers both store-free unless we
reframe them onto the same primitive**:
- **Receive-side:** "Do I have all messages published on my subscribed content topics?"
This is a **set-completeness** question → solved by anti-entropy (reconcile my set with
peers' sets, pull the difference).
- **Send-side:** "Did my message reach / become durable in the network?" In open
broadcast there is *no recipient set*, so "delivered to whom?" is undefined. But we can
answer a well-defined proxy: **"is my message now present in the reconciliation sets of
N independent peers?"** — i.e. it propagated and is being retained/served by the mesh.
This is the *same* set-presence signal store-validation uses today, just sourced from
**peers instead of a store server**.
**This is the unifying idea of the proposal: send-side confirmation and receive-side
recovery are the same anti-entropy protocol observed from two directions.** One
subsystem, two guarantees.
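
To make the anti-entropy primitive concrete, here is a deliberately simplified Python model of range-based set reconciliation. It is a sketch under stated assumptions, not the `waku_store_sync` implementation or its wire format: both sets live in one process instead of exchanging messages, the fingerprint is a toy XOR-plus-count, and `entry`, `in_range`, `fingerprint`, `reconcile`, and `leaf_size` are names invented for the example.

```python
import hashlib
from bisect import bisect_left


def entry(ts, payload):
    """Build a (timestamp, message-hash) pair; SHA-256 is an illustrative stand-in."""
    return (ts, hashlib.sha256(payload).digest())


def in_range(entries, lo, hi):
    """Slice of a sorted entry list with lo <= timestamp < hi."""
    return entries[bisect_left(entries, (lo,)):bisect_left(entries, (hi,))]


def fingerprint(entries):
    """Toy range fingerprint: XOR of all hashes plus the element count."""
    acc = 0
    for _, msg_hash in entries:
        acc ^= int.from_bytes(msg_hash, "big")
    return (acc, len(entries))


def reconcile(mine, theirs, lo, hi, leaf_size=4):
    """Return the hashes the peer holds in [lo, hi) that we are missing.

    Equal fingerprints mean the range is in sync and is skipped. A small
    range is resolved by comparing item lists directly. Anything else is
    split in half and each half is reconciled recursively.
    """
    ours, peers = in_range(mine, lo, hi), in_range(theirs, lo, hi)
    if fingerprint(ours) == fingerprint(peers):
        return set()
    if len(peers) <= leaf_size or hi - lo <= 1:
        return {h for _, h in peers} - {h for _, h in ours}
    mid = (lo + hi) // 2
    return (reconcile(mine, theirs, lo, mid, leaf_size)
            | reconcile(mine, theirs, mid, hi, leaf_size))


if __name__ == "__main__":
    theirs = sorted(entry(t, b"m%d" % t) for t in range(100))
    mine = [e for e in theirs if e[0] not in (17, 63)]
    print(len(reconcile(mine, theirs, 0, 100)))
```

The same comparison read from the sender's side answers the other question: a range whose fingerprint matches a peer's implies that peer holds every hash in it, including any we published.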
### Why not just generalize SDS? (the rejected obvious answer)
SDS needs three things the Messaging API doesn't have: a **channel/session id**, a
**participant set**, and **bidirectional traffic** so messages get acked by appearing in
others' causal history. It also imposes **Lamport causal ordering** the Messaging API
explicitly doesn't want. Applied to an open content topic with possibly zero responders,
every message would be retransmitted up to its maximum number of attempts and then
reported as unacknowledged. SDS is the right
tool for *channels* (bounded, bidirectional) and the wrong tool for *broadcast*. (Full
options matrix in §8.)

---
## 5. Proposed architecture — the Network Reliability Service (NRS)
```
Messaging API caller
│ send(envelope) ▲ MessageReceivedEvent
▼ │ MessageSentEvent
┌───────────────────────────────────────────────────────────────┐
│ MessagingClient (SendService / RecvService) │
│ • publish: relay (primary) → lightpush (fallback) │ ← unchanged
│ • live receive: MessageSeenEvent → dedup → emit │ ← unchanged
│ • confirmation & recovery: delegate to NRS ◄── NEW SEAM │
└───────────────┬───────────────────────────────────────────────┘
│ register hash / observe presence / pull gaps
┌───────────────────────────────────────────────────────────────┐
│ Network Reliability Service (NRS) = runtime RBSR anti-entropy│
│ • in-memory SeqStorage per subscribed content topic │
│ (rolling window, fed by MessageSeenEvent — NOT the archive) │
│ • reconciliation/1.0.0 + transfer/1.0.0 with mesh peers │
│ • emits: "hash present in ≥N peers" + "here are missed msgs" │
└───────────────┬───────────────────────────────────────────────┘
│ libp2p
WakuRelay (gossipsub eager push + IHAVE/IWANT lazy pull)
── store protocol: mounted ONLY for StoreResume (boot) + optional hint query ──
```
### 5.1 Receive-side completeness (replaces `RecvService.checkStore`)
- **Live path (unchanged):** gossipsub eager push → `MessageSeenEvent` → dedup →
`MessageReceivedEvent`; gossipsub IHAVE/IWANT heals short gaps (within its ≈6 s history window) for mesh members.
- **Recovery path (new, store-free):** the NRS maintains an in-memory `SeqStorage`
(`{timestamp, msgHash}` set) **per subscribed content topic**, fed by the same
`MessageSeenEvent` stream over a rolling window (reuse the existing 7-min
`MaxMessageLife`, tunable). On a timer (and opportunistically on reconnect) it runs the
existing **reconciliation** protocol with a few mesh peers; any hashes a peer has that
we don't are pulled via the **transfer** protocol and **replayed through
`processIncomingMessage`** so they surface as ordinary `MessageReceivedEvent`s. This is
a drop-in functional replacement for `checkStore`, with no store node involved.
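
The recovery path can be modelled roughly as follows. This is a Python sketch for illustration only, assuming a per-topic rolling window and two injected callables; `RollingWindow`, `recover`, `fetch_body`, and `process_incoming` are hypothetical names standing in for the in-memory `SeqStorage`, the transfer protocol, and `processIncomingMessage` respectively.

```python
from collections import defaultdict

WINDOW_SECONDS = 7 * 60  # the 7-minute window cited in the text (tunable)


class RollingWindow:
    """Per-content-topic record of recently seen message hashes."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.seen = defaultdict(dict)  # topic -> {msg_hash: timestamp}

    def on_seen(self, topic, msg_hash, ts):
        """Feed from the live receive path (the MessageSeenEvent stream)."""
        self.seen[topic][msg_hash] = ts

    def prune(self, now):
        """Drop entries that have aged out of the window."""
        cutoff = now - self.window
        for topic in list(self.seen):
            self.seen[topic] = {
                h: t for h, t in self.seen[topic].items() if t >= cutoff
            }


def recover(window, topic, peer_entries, fetch_body, process_incoming):
    """Replay every message a peer holds that we have not seen.

    peer_entries is an iterable of (msg_hash, timestamp) pairs learned from
    reconciliation. Returns the number of messages replayed.
    """
    recovered = 0
    for msg_hash, ts in peer_entries:
        if msg_hash in window.seen[topic]:
            continue  # already delivered on the live path or a prior recovery
        process_incoming(topic, fetch_body(msg_hash))
        window.on_seen(topic, msg_hash, ts)  # prevents a second replay
        recovered += 1
    return recovered
```

Recording a recovered hash in the same window that the live path feeds is what keeps the two paths from delivering a message twice.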
### 5.2 Send-side confirmation (replaces `SendService.checkMsgsInStore`)
- On `send`, register the message hash with the NRS as "awaiting confirmation."
- The NRS already learns, through reconciliation fingerprints, **which of our hashes are
present in which peers' sets**. When a hash is observed in **≥N distinct peers'**
reconciliation sets (N configurable, e.g. 2 to 3), mark the `DeliveryTask`
`SuccessfullyValidated` → emit **`MessageSentEvent`** (preserving constraint **C1**).
- Until then, the existing `serviceLoop` keeps the task `NextRoundRetry` and
**periodically re-broadcasts** (ephemeral) — this machinery already exists
(`send_service.nim:265-280`); we simply use *peer-set presence* instead of *store
presence* as the stop condition, with the same `MaxTimeInCache` timeout →
`MessageErrorEvent` on no confirmation.
**Net effect:** `MessageSentEvent` is now produced by peer-set presence, not a store
query — same contract, no store. The Reliable Channel layer keeps working unchanged
because it only cares about the event keyed by `RequestId`.
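
The stop condition described above amounts to counting distinct peers per pending hash, with a deadline for the failure case. A minimal Python sketch, assuming names invented for the example (`PresenceTracker`, `threshold`, `max_wait_s`); the real `MaxTimeInCache` value is not stated in this document, so the timeout here is a placeholder.

```python
class PresenceTracker:
    """Tracks which peers have been observed holding each sent hash."""

    def __init__(self, threshold=2, max_wait_s=60.0):
        self.threshold = threshold      # N distinct peers required
        self.max_wait_s = max_wait_s    # placeholder for the cache timeout
        self.pending = {}               # msg_hash -> (deadline, set of peer ids)

    def register(self, msg_hash, now):
        """Called on send: start waiting for confirmation of this hash."""
        self.pending.setdefault(msg_hash, (now + self.max_wait_s, set()))

    def observe(self, msg_hash, peer_id):
        """Record that peer_id holds msg_hash.

        Returns True exactly once, when the hash reaches the threshold of
        distinct peers; that is the point to emit the sent event.
        """
        record = self.pending.get(msg_hash)
        if record is None:
            return False  # not ours, or already confirmed or expired
        _, peers = record
        peers.add(peer_id)
        if len(peers) < self.threshold:
            return False
        del self.pending[msg_hash]
        return True

    def expire(self, now):
        """Return hashes whose deadline has passed without confirmation."""
        timed_out = [h for h, (deadline, _) in self.pending.items() if now >= deadline]
        for msg_hash in timed_out:
            del self.pending[msg_hash]
        return timed_out
```

Repeated observations from the same peer do not advance the count, so a single well-connected neighbor cannot confirm a message on its own.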
### 5.3 Store: startup-only
- Keep `StoreResume` at boot for history older than the NRS rolling window (cold start,
long offline). Optionally trigger one bounded resume-style query on a *long* offline
reconnect (gap ≫ window) — startup-style, not a loop.
- Optionally keep store **hash queries** as a *fallback hint resolver* (matches Waku's
"retrieval hints" direction): if the NRS knows a hash exists (from a peer fingerprint)
but no peer will transfer the body, fall back to a one-shot store fetch. This keeps the
store *client* available but off the steady-state path.
- A node that wants to *serve* others still mounts the store server + archive + store-sync
as today — that's orthogonal server-side capability, not this node's own reliability.
---
## 6. Light-node handling (constraint C2)
Light nodes don't join the gossipsub mesh, so they have neither eager push nor IHAVE/IWANT
nor a peer mesh to reconcile against. Options, in preference order:
1. **NRS against service peers.** A light node runs reconciliation/transfer against one or
two *service* nodes that advertise the capability (same way it already picks store /
filter / lightpush service peers via `serviceSlots`). This is bounded (1 to 2 sessions),
not a broadcast mesh, and replaces the per-reconnect store query with a per-interval
reconciliation that *also* yields send-side confirmation.
2. **Send-side via lightpush response.** Lightpush v3 already returns `relayPeerCount`;
treat "relayed to ≥1 peer" as propagation and let NRS-against-service-peer upgrade it to
confirmation.
3. **Pragmatic fallback.** For ultra-thin clients, a bounded `StoreResume`-style query on
reconnect (startup-style, not a loop) is acceptable and still satisfies "no runtime
store *loop*."
---
## 7. What must change — seams & phased plan
### Prerequisites (must land first)
- **P0 — Close the RLN-on-transfer gap.** `waku_store_sync/transfer.nim:173-174` has
`#TODO verify msg RLN proof`, and `archive.syncMessageIngress` skips the timestamp
validator. Before any node ingests messages transferred from arbitrary peers, recovered
messages **must pass the same RLN + timestamp validation** as relay ingress, or we open a
spam/replay vector (constraint C3). This is the single most important precondition.
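
One way to picture the P0 requirement is a single ingress gate that both the relay path and the transfer path must pass through. The Python sketch below is illustrative only: `Ingress`, `timestamp_validator`, and the skew bound are hypothetical, and RLN proof verification is represented by an injected callable because its real interface is not described here.

```python
def timestamp_validator(max_skew_s, clock):
    """Build a check that rejects messages too far from the local clock."""
    def validate(msg):
        return abs(clock() - msg["timestamp"]) <= max_skew_s
    return validate


class Ingress:
    """Shared gate: a message is delivered only if every validator accepts it."""

    def __init__(self, validators, deliver):
        self.validators = validators  # e.g. [timestamp check, RLN proof check]
        self.deliver = deliver

    def accept(self, msg):
        if not all(check(msg) for check in self.validators):
            return False  # dropped: never reaches the application
        self.deliver(msg)
        return True


if __name__ == "__main__":
    delivered = []
    # Stand-in for RLN verification; a real check would verify the proof.
    fake_rln_check = lambda msg: msg.get("proof") == "ok"
    gate = Ingress([timestamp_validator(20, lambda: 1000), fake_rln_check],
                   delivered.append)
    gate.accept({"timestamp": 1005, "proof": "ok"})   # accepted
    gate.accept({"timestamp": 900, "proof": "ok"})    # rejected: stale
    print(len(delivered))
```

Building the gate once and handing the same instance to both paths makes it hard for the recovery path to drift from relay validation.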
### Phase 1 — Decouple the RBSR engine from the archive
- Make `SyncReconciliation`/`SyncTransfer` constructible with a **pluggable backing set**
and **pluggable ingress/egress sinks** instead of a hard `wakuArchive`
(`reconciliation.nim:343-344`, `transfer.nim:117-130,175`). Default backing = the new
in-memory rolling window; archive remains an option for store-server nodes.
- Add a `mountNetworkReliability` seam **independent of `storeServiceConf`**
(today it's gated inside the store-service block, `node_factory.nim:220-230`).
- Advertise the `reconciliation`/`transfer` capability in the ENR bitfield (the `Sync`
capability bit already exists) so peers can discover NRS-capable nodes.
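
The decoupling in Phase 1 is essentially dependency injection: the engine depends on a small set interface and a sink rather than on the archive. A Python sketch of that shape, assuming hypothetical names (`BackingSet`, `InMemoryWindow`, `SyncEngine`, `on_transferred`) that do not correspond to the existing Nim types:

```python
from typing import Callable, Iterable, Protocol, Tuple

Entry = Tuple[int, bytes]  # (timestamp, message hash)


class BackingSet(Protocol):
    """What the reconciliation engine needs from its storage, and nothing more."""

    def insert(self, entry: Entry) -> None: ...

    def range(self, lo: int, hi: int) -> Iterable[Entry]: ...


class InMemoryWindow:
    """Default backing for regular nodes: a plain in-memory set."""

    def __init__(self) -> None:
        self._entries = set()

    def insert(self, entry: Entry) -> None:
        self._entries.add(entry)

    def range(self, lo: int, hi: int) -> Iterable[Entry]:
        return sorted(e for e in self._entries if lo <= e[0] < hi)


class SyncEngine:
    """Engine built from a backing set and an ingress sink, not an archive."""

    def __init__(self, backing: BackingSet, ingress: Callable[[bytes], None]) -> None:
        self.backing = backing
        self.ingress = ingress

    def on_transferred(self, ts: int, msg_hash: bytes, body: bytes) -> None:
        """Record a transferred message and hand its body to the sink."""
        self.backing.insert((ts, msg_hash))
        self.ingress(body)
```

A store-server node would pass an archive-backed implementation of the same interface, so the reconciliation logic itself stays identical across both deployments.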
### Phase 2 — Receive-side recovery via NRS
- Feed the NRS `SeqStorage` from `MessageSeenEvent` per subscribed content topic.
- Replace `RecvService.checkStore`/`onConnectionStatusChange` store calls
(`recv_service.nim:106-170`) with NRS recovery; route transferred messages through
`processIncomingMessage`. Keep `StoreResume` for boot/long-gap.
### Phase 3 — Send-side confirmation via NRS
- Register sent hashes with the NRS; expose a "hash present in ≥N peers" signal.
- Replace `SendService.checkMsgsInStore` (`send_service.nim:132-182`) with that signal as
the trigger for `SuccessfullyValidated`/`MessageSentEvent`; keep the existing
re-broadcast/timeout loop as the pacing/failure mechanism.
### Phase 4 — Make store startup-only by configuration
- Stop mounting the store *client* on the steady-state path; mount it for `StoreResume`
(and optional hint-fallback) only. `mountStoreClient` is currently unconditional
(`node_factory.nim:239`) — gate it.
- Light-node policy (§6) wired via service slots.
Each phase is independently shippable behind `useP2PReliability`; the contract
(`MessageSentEvent`/`MessageReceivedEvent` by `RequestId`) never changes, so the Reliable
Channel and FFI layers are untouched.

---
## 8. Alternatives considered (and why)
| Mechanism | Store-free? | Reuses code | New protocol | Main obstacle | Verdict |
|---|---|---|---|---|---|
| **RBSR runtime anti-entropy (NRS)** | Yes | High (whole `waku_store_sync`) | No (re-mount/decouple) | RLN-on-transfer; per-peer cost; recovers a *set* | **Chosen** — covers receive *and* (via peer-presence) send |
| Generalize SDS to per-topic/per-pair | Yes | High (`SdsHandler` generic) | meta marker | Needs bounded bidirectional participants; forces causal order | Rejected for broadcast; viable only for the node-pair sub-case (≈ e2e ACK) |
| Lightweight e2e ACK/NACK (MVDS-style) | Yes | Medium (retry loop + RequestId) | Yes (ack codec/topic) | No recipient set in pub/sub; ACK storms; acks need own anti-spam | **Complement** — adopt for *known-recipient / request-response* flows |
| Periodic re-broadcast of unconfirmed | Yes | Very high (`serviceLoop` already retries) | No | No stop condition alone; bandwidth | **Adopted as the send-side pacing**, paired with NRS presence as the stop condition |
| Bloom/IBLT digest gossip | Yes | Low to medium | Yes (full protocol) | Duplicates RBSR; false-positive blind spots | Rejected: RBSR dominates it |
| Gossipsub IHAVE/IWANT only | Yes | n/a (native) | No | ~6 s horizon; nothing for light nodes | **Kept for the live path**, insufficient alone |
The **MVDS** explicit-ACK model (OFFER/REQUEST/MESSAGE/ACK with per-peer state and
exponential-backoff retransmission until ACK) is the textbook store-free reliability
protocol and is worth adopting as the **known-recipient / unicast** complement (e.g.
request/response, direct messages), where a recipient set *is* defined. For open broadcast
it doesn't apply (no one to ACK), which is why NRS is the primary mechanism.

---
## 9. Alignment with Waku's own roadmap (external validation)
This proposal is not a detour from upstream — it is the same destination:
- **"No store in the Messaging API."** Waku's Reliable Channel API work explicitly states:
*"reducing the store API in Waku API: No store in Messaging API. Store related functions
on the Waku API need to be sufficient for reliable channel (SDS) and nothing more — so
exposing store hash queries, to find messages based on retrieval hints."* That is the
split proposed in §5.3: store → startup + hint-fallback, reliability → e2e/peer.
- **SDS is a group/participant protocol.** Waku: *"application reliability is handled by
data sync protocols, enabled by the fact that messages are published in groups with
active participants."* Confirms SDS can't cover sessionless broadcast (§4).
- **Store-node reliability is itself moving to set reconciliation** (the store-sync /
FTSTORE / Negentropy line of work), i.e. the same RBSR primitive we propose to reuse at
runtime.
- **MVDS** remains Waku's reference store-free e2e protocol for the unicast case.
Sources:
- [Waku — Message Reliability and Waku API](https://blog.waku.org/2024-06-20-message-reliability/)
- [Waku — A unified stack for scalable and reliable P2P communication](https://blog.waku.org/explanation-series-a-unified-stack-for-scalable-and-reliable-p2p-communication/)
- [Vac forum — Introducing the Reliable Channel API](https://forum.research.logos.co/t/introducing-the-reliable-channel-api/580)
- [Vac forum — The future of Waku Store](https://forum.vac.dev/t/the-future-of-waku-store/588)
- [SDS protocol RFC (vacp2p/rfc-index)](https://github.com/vacp2p/rfc-index/blob/main/vac/raw/sds.md)
- [MVDS spec (status-im/bigbrother-specs)](https://github.com/status-im/bigbrother-specs/blob/master/data_sync/mvds.md)
- [Waku docs — Reliable Channels](https://docs.waku.org/build/javascript/reliable-channels)
---
## 10. Risks & open questions
1. **RLN on recovered messages (blocking).** Must validate transferred messages exactly
like relay ingress before any of this is safe (`transfer.nim:173`). Non-negotiable.
2. **Privacy / metadata leak.** Reconciliation reveals which message hashes a node holds.
Waku flags that a missed-message protocol can leak the social graph. Mitigate by scoping
reconciliation strictly to content topics the node already subscribes to, and consider
not reconciling on topics with tiny anonymity sets.
3. **Bandwidth & per-peer cost.** Every reliability-seeking node now runs O(peers)
reconciliation sessions on a timer. RBSR is efficient for large sets but was sized for
store servers; tune window size, peer count, and interval; cap on light nodes (§6).
4. **Confirmation semantics.** "Present in ≥N peers' sets" is a *propagation/retention*
guarantee, not proof a specific human read it — which is the honest ceiling for open
broadcast. Where the app needs true delivery-to-a-recipient, use the e2e-ACK complement.
5. **Convergence of send-confirmation.** Choose N and the reconciliation cadence so
`MessageSentEvent` latency is comparable to today's ~3 s store-validation; validate under
churn.
6. **Light-node anonymity.** NRS-against-service-peers concentrates trust/metadata on a few
nodes; weigh against the bounded-store-resume fallback.
---
## 11. Bottom line
- **You do not need to invent a brand-new protocol.** The store-sync **RBSR engine already
in the tree** is the right runtime anti-entropy primitive; it just needs to be unhooked
from the archive and mounted as a first-class **Network Reliability Service** on regular
nodes.
- **One subsystem, both guarantees:** receive-side recovery *and* send-side confirmation,
by replacing **store-presence** with **peer-set-presence**.
- **SDS stays for channels; do not force it onto broadcast** — it structurally can't serve
a sessionless, possibly-unidirectional Messaging API.
- **Add MVDS-style e2e ACK only for known-recipient / unicast** flows where a recipient set
exists.
- **Store ends up exactly where you (and Waku) want it:** startup history sync + optional
retrieval-hint fallback, and nothing on the steady-state path.
- **Hard prerequisite:** fix RLN verification on transferred/recovered messages before
enabling peer-to-peer recovery.
This is implementable in four incremental phases behind the existing `useP2PReliability`
flag, with no change to the `MessageSentEvent`/`MessageReceivedEvent` contract that the
Reliable Channel and FFI layers depend on.