From 25e3f5a09a69224b01cbdd3374398919dda35bec Mon Sep 17 00:00:00 2001 From: Hanno Cornelius <68783915+jm-clius@users.noreply.github.com> Date: Thu, 20 Feb 2025 16:11:07 +0000 Subject: [PATCH] docs: add p2p reliability spec (#52) * docs: add p2p reliability spec * docs: fix link, explain p2p reliability better --- standards/application/p2p-reliability.md | 332 +++++++++++++++++++++++ 1 file changed, 332 insertions(+) create mode 100644 standards/application/p2p-reliability.md diff --git a/standards/application/p2p-reliability.md b/standards/application/p2p-reliability.md new file mode 100644 index 0000000..7a359c1 --- /dev/null +++ b/standards/application/p2p-reliability.md @@ -0,0 +1,332 @@ +--- +title: WAKU-P2P-RELIABILITY +name: Waku P2P Reliability +editor: Hanno Cornelius +contributors: + - Danish Arora + - Kaichao Sun + - Oleksandr Kozlov + - Prem Chaitanya Prathi +--- + +## Abstract + +This specification defines peer-to-peer (p2p) reliability within [10/WAKU2](https://rfc.vac.dev/waku/standards/core/10/waku2) networks, +addressing dependability of message propagation between Waku node hops. +It situates the problem of p2p reliability between the lower layer _transport_ reliability +and the higher layer _end-to-end_ reliability. +We also propose strategies to enhance message propagation reliability +and mechanisms to detect and mitigate message loss +between Waku nodes, +taking into account the trade-offs between reliability, bandwidth usage, latency, and resource consumption. + +## Reliability in a Waku context + +Reliability can be considered on three layers within a [10/WAKU2](https://rfc.vac.dev/waku/standards/core/10/waku2) context: + +1. **Transport layer reliability:** +Reliability within the networking layer of an individual node. +This includes reliability of the underlying libp2p layer, +constituent transports, +peer management, +and peer discovery. +1. 
**Peer-to-peer (p2p) reliability:** +Reliability between nodes within a [10/WAKU2](https://rfc.vac.dev/waku/standards/core/10/waku2) p2p network, +including message routing and discovery between peers. +Note that reliability at this layer is agnostic of the application. +At this layer, [10/WAKU2](https://rfc.vac.dev/waku/standards/core/10/waku2) nodes have no knowledge of the origin or intended destination of messages being routed. +1. **End-to-end (e2e) reliability:** +Reliability within the application layer on top of the Waku p2p routing layer. +This layer is concerned with message delivery between intended participants from the application's perspective +regardless of the underlying Waku network. +As such, e2e reliability mechanisms are usually implemented within the encrypted application payload. + +## Scope + +This specification focuses on p2p reliability. +It does _not_ cover: +- Transport-level reliability mechanisms, such as fine-tuning libp2p parameters, improving peer discovery and management, etc. +- End-to-end message reliability protocols, covered by higher-layer application logic. 
+ +We focus on p2p reliability within the context of the following [10/WAKU2](https://rfc.vac.dev/waku/standards/core/10/waku2) protocols: +- [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay): used by nodes to publish and receive messages on a pub/sub topic by participating in [libp2p gossipsub](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md) routing +- [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter): used by nodes to receive messages from a pub/sub topic without participating in gossipsub routing +- [WAKU2-LIGHTPUSH](../core/lightpush.md): used by nodes to publish messages to a pub/sub topic without participating in gossipsub routing +- [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store): used by any node to query and retrieve historical messages + +## Problem statement + +The primary challenge at the p2p layer is ensuring that messages sent from one Waku node propagate to peers +with a high probability of success +while minimising redundant transmissions and excessive bandwidth use. + +Factors affecting reliability include: +- **Network instability**: detected and undetected network connectivity drops can lead to message loss or duplication. +- **Network churn**: frequent changes in the network topology as peers connect and disconnect can lead to potential message loss. +- **Node resource constraints**: nodes with limited bandwidth or processing power may struggle to handle high message volumes, leading to potential drops. +- **Adversarial behaviour**: adversarial nodes may deliberately drop, delay, or selectively forward messages. + +## Reliability Strategies + +### General + +In this section, we first summarise strategies to improve p2p reliability +based on their functional effect before specifying the detailed strategies for each Waku protocol. 
+P2p reliability consists of either configuring individual Waku protocols +or composing several Waku protocols +to achieve one or both of the following two objectives: + +#### 1. Improve redundancy + +Redundancy improves the probability that a message reaches its intended recipients. +Redundancy strategies can focus on: + +- **Redundant publishing:** + +Publishers MAY improve the probability of a message propagating through the network +by publishing it simultaneously to several peers as the first hop. +Each of these peers would then continue routing the message. +[17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) already implements redundant publishing by forwarding a message _at least_ to all mesh peers +or, as is RECOMMENDED for Waku, publishing the message to _all_ peers using [flood publishing](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md#flood-publishing). +Configuring [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) is part of the transport layer and a detailed description falls outside the scope of this spec. +[WAKU2-LIGHTPUSH](../core/lightpush.md) MAY similarly be configured to publish to more than one lightpush service node at the same time. +See the [lightpush section](#lightpush) for a specification of the strategy. + +- **Redundant receiving:** + +Nodes MAY improve the probability of receiving all messages +by increasing the number of sources from which they receive messages. +[17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) already implements redundant receiving as messages are eagerly pushed from all mesh peers +with added gossip mechanisms to allow lazy pulling from all peers +as described in the [GossipSub v1.1 specification](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md).
+Configuring [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) is part of the transport layer and a detailed description falls outside the scope of this spec. +[12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) MAY similarly be configured to subscribe to receive messages from more than one filter service node at the same time. +See the [filter section](#filter) for a specification of the strategy. + +#### 2. Detect and remedy losses + +Nodes MAY combine different Waku protocols to detect and remedy possible losses. +Losses can occur either from the publisher's or the recipient's perspective. + +- **Failure to publish:** + +Nodes using either [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) or [WAKU2-LIGHTPUSH](../core/lightpush.md) to publish +MAY determine that a message has probably failed to publish +by combining these protocols with [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) or [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) +to attempt to retrieve the published message from other peers in the network. +Failure to detect the message could indicate a publishing failure. +The usual remedial action is to retransmit the message. +See [Store-based reliability](#store-based-reliability) for a specification of the strategy to combine [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) or [WAKU2-LIGHTPUSH](../core/lightpush.md) with [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store). +See [Lightpush](#lightpush) for a specification of the strategy to combine [WAKU2-LIGHTPUSH](../core/lightpush.md) with [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter). +Combining [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) with [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) is conceptually possible, but underspecified. 
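
The detect-and-retransmit idea described above for publishing failures can be sketched roughly as follows. This is a minimal illustration, not part of the specification: `PublishTracker`, `publish`, `is_present` and `max_attempts` are hypothetical names. `publish` stands in for a relay or lightpush publish call, and `is_present` stands in for whatever detection mechanism is used (for example a store presence query or a matching filter subscription). The caller is assumed to supply the deterministic message hash as a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PendingMessage:
    payload: bytes
    attempts: int = 0  # detection rounds in which the message was not found


@dataclass
class PublishTracker:
    """Keeps published messages until they are detected elsewhere in the network."""

    publish: Callable[[bytes], None]    # hypothetical: relay or lightpush publish
    is_present: Callable[[str], bool]   # hypothetical: store query or filter match
    max_attempts: int = 3               # illustrative limit, not taken from the spec
    pending: Dict[str, PendingMessage] = field(default_factory=dict)

    def send(self, msg_hash: str, payload: bytes) -> None:
        # Track the message as not yet detected, then publish it.
        self.pending[msg_hash] = PendingMessage(payload)
        self.publish(payload)

    def check(self) -> List[str]:
        """Run one detection round and return the hashes that were given up on."""
        failed: List[str] = []
        for msg_hash, msg in list(self.pending.items()):
            if self.is_present(msg_hash):
                # Detected at another peer: stop tracking.
                del self.pending[msg_hash]
                continue
            msg.attempts += 1
            if msg.attempts >= self.max_attempts:
                # Treat the publication as failed after too many misses.
                del self.pending[msg_hash]
                failed.append(msg_hash)
            else:
                # Usual remedial action: retransmit the message.
                self.publish(msg.payload)
        return failed
```

An application would call `check()` on a timer of its own choosing. The interval and the attempt limit trade detection latency against bandwidth and query load.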
+ +- **Failure to receive:** + +Nodes using either [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) or [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) to receive messages +MAY determine message losses +by combining these protocols with [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) +to compare their local message history with the historical messages cached in the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store). +The usual remedial action is to retrieve the missing messages from the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store). +See [Store-based reliability](#store-based-reliability) for a specification of this strategy. + +### Store-based reliability + +[13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) provides a way for nodes to query the existence of or fetch specific historical messages. +Nodes using any combination of [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay), [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) and [WAKU2-LIGHTPUSH](../core/lightpush.md) to publish/receive messages +MAY combine these protocols with [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) to improve p2p reliability. +Depending on the use case, such a node MAY use any of the strategies below: + +#### 1. Store-based reliability for publishing + +- A publishing node using this strategy MUST consider a published message as "unacknowledged" at first. +- The publisher MUST keep a copy of this message against its [deterministic message hash](https://rfc.vac.dev/waku/standards/core/14/message#deterministic-message-hashing) in a local outgoing message buffer. +It MAY keep the last several published messages in this buffer. 
+- The publisher MUST periodically perform a [presence query](https://rfc.vac.dev/waku/standards/core/13/store#presence-queries) to the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service +against the message hashes of all published messages in the outgoing buffer +to verify their existence in the store. + - If a message hash exists in the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service, +the publisher SHOULD consider the corresponding message as "acknowledged" and remove it from the outgoing buffer. + - If a message hash does not exist in the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service, +the publisher MUST consider the corresponding message as still "unacknowledged". +- The publisher SHOULD retransmit all unacknowledged messages either periodically or upon reception of the presence query response +until a positive presence query response is received for the corresponding message hash. +A publisher MAY consider a message publication as having failed irremediably after a set number of failed presence query attempts. + +#### 2. Store-based reliability for receiving + +- A receiving node using this strategy MUST have a local cache of the [deterministic message hashes](https://rfc.vac.dev/waku/standards/core/14/message#deterministic-message-hashing) of all received messages +spanning _at least_ the time period for which reliability is required. +- The node MUST periodically perform a [content filtered query](https://rfc.vac.dev/waku/standards/core/13/store#content-filtered-queries) to the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service, +spanning the time period for which reliability is required +and including all content topics over which it is interested to receive messages. +The `include_data` field SHOULD be set to `false` to retrieve only the matching message hashes from the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service. 
+- The node MUST compare the received message hashes to those in the local cache. +Any message hashes in the response that are not in the local cache MUST be considered "missing". +- The node SHOULD perform a [message hash lookup query](https://rfc.vac.dev/waku/standards/core/13/store#message-hash-lookup-queries) for all missing message hashes +to retrieve the full contents of the corresponding messages. +It MAY do so either periodically (in batches) or upon reception of the content filtered query response. +- The node MUST add to the local cache all message hashes corresponding to the retrieved messages, +in order to prevent them from being considered missing in future. + +#### 3. Combined store-based reliability for publishing and receiving messages + +- A node using this strategy MUST have a local cache of the [deterministic message hashes](https://rfc.vac.dev/waku/standards/core/14/message#deterministic-message-hashing) of all received messages +spanning _at least_ the time period for which reliability is required. +- In addition, the node MUST consider a published message as "unacknowledged" at first. The node MUST keep a copy of this message against its [deterministic message hash](https://rfc.vac.dev/waku/standards/core/14/message#deterministic-message-hashing) in a local outgoing message buffer. +It MAY keep the last several published messages in this buffer. +- The node MUST periodically perform a [content filtered query](https://rfc.vac.dev/waku/standards/core/13/store#content-filtered-queries) to the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service, +spanning the time period for which reliability is required +and including all content topics over which it is interested to both publish and receive messages. +The `include_data` field SHOULD be set to `false` to retrieve only the matching message hashes from the [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) service. 
+- The node MUST compare the message hashes in the content filtered query response to the message hashes of all published messages in the outgoing buffer +to verify their existence in the store. + - If a message hash is included in the query response, +the node SHOULD consider the corresponding message as "acknowledged" and remove it from the outgoing buffer. + - If a message hash is not included in the query response, +the node MUST consider the corresponding message as still "unacknowledged". +- In addition, the node MUST compare the message hashes in the content filtered query response to the message hashes in the local cache. +Any message hashes in the response that are not in the local cache MUST be considered "missing". +- The node SHOULD retransmit all unacknowledged messages in the outgoing buffer, +either periodically or upon reception of the query response, +until the corresponding message hashes are included in a follow-up query response. +The node MAY consider a message publication as having failed irremediably after a set number of query attempts without inclusion. +- In addition, the node SHOULD perform a [message hash lookup query](https://rfc.vac.dev/waku/standards/core/13/store#message-hash-lookup-queries) for all missing message hashes +to retrieve the full contents of the corresponding messages. +It MAY do so either periodically (in batches) or upon reception of the content filtered query response. +- The node MUST add to the local cache all message hashes corresponding to the retrieved messages, +in order to prevent them from being considered missing in future. + +### Lightpush + +[WAKU2-LIGHTPUSH](../core/lightpush.md) provides a way for client nodes to publish messages to a pub/sub topic via a lightpush service node without participating in gossipsub routing. +Lightpush clients MAY use any of the following strategies to improve reliability of the service: + +#### 1. 
Maintain a pool of reliable lightpush service nodes + +A lightpush client using this strategy MUST maintain a pool of reliable lightpush service nodes. +[Discovery](https://rfc.vac.dev/waku/standards/core/10/waku2#discovery-domain) of these service nodes falls outside the scope of this specification. +As a simple heuristic, a lightpush client MAY consider all discovered lightpush service nodes as reliable until it detects a service failure. +In case the lightpush service fails due to service node behaviour, +the client MAY disconnect from the service node and replace it with another service node from the pool. +Such a client SHOULD also remove the failing service node from the pool of reliable service nodes. + +We RECOMMEND replacing a lightpush service node after a single failure in the following categories: +- the connection to the service node is lost or a lightpush request times out +- the lightpush request fails due to a service-side error, for example if the response contains one of the following [error codes](../core/lightpush.md#examples-of-possible-error-codes): + - `UNSUPPORTED_PUBSUB_TOPIC` + - `INTERNAL_SERVER_ERROR` + - `NO_PEERS_TO_RELAY` +- the request fails without an error response + +#### 2. Redundant lightpush publishing + +A lightpush client using this strategy MUST publish each message simultaneously to two or more lightpush service nodes. +Note that bandwidth usage increases proportionally to the number of service nodes used. +For this reason, we RECOMMEND using only two lightpush service nodes at a time. + +#### 3. Retransmit on failure + +- A lightpush client using this strategy MUST wait for a [WAKU2-LIGHTPUSH response](../core/lightpush.md) after publishing a message. +- If the response times out or contains [a recoverable error code](../core/lightpush.md#examples-of-possible-error-codes) (e.g. `TOO_MANY_REQUESTS`), +the lightpush client SHOULD attempt to retransmit the message after some interval. 
+- The client MAY choose to continue retransmitting the message until an `OK` response is received from the service node. +The interval between retransmission attempts is up to the implementation, +but we RECOMMEND starting with `1 second` and increasing it after each failure. +- The client MAY consider a message publication as having failed irremediably after a set number of failed lightpush requests. + +#### 4. Retransmit on loss detection + +> *_Note:_* Lightpush clients participating in [Store-based reliability](#store-based-reliability) already perform this strategy and can ignore this section. + +- A lightpush client using this strategy MUST either use [Store-based reliability](#store-based-reliability) +or install one or more [12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) subscriptions matching the content topic(s) used for publishing. +In this way, the client can confirm that a published message did indeed reach the targeted store or filter service node(s). +- The store queries or filter subscriptions SHOULD be requested at service nodes different from those used for the lightpush service. +- If the client determines that a published message has not been received by the filter or store service node, +it SHOULD retransmit the message after some interval. +- The client MAY choose to continue retransmitting the message until it is confirmed by one or more service nodes. +The interval between retransmission attempts is up to the implementation, +but we RECOMMEND starting with `1 second` and increasing it after each attempt. +- The client MAY consider a message publication as having failed irremediably after a set number of failed lightpush requests. + +### Filter + +[12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) provides a way for client nodes to receive messages from a pub/sub topic via a filter service node without participating in gossipsub routing. 
+Filter clients MAY use any of the following strategies to improve reliability of the service: + +#### 1. Maintain a pool of reliable filter service nodes + +A filter client using this strategy MUST maintain a pool of reliable filter service nodes. +[Discovery](https://rfc.vac.dev/waku/standards/core/10/waku2#discovery-domain) of these service nodes falls outside the scope of this specification. +As a simple heuristic, a filter client MAY consider all discovered filter service nodes as reliable until it detects a service failure. +In case the filter service fails due to service node behaviour, +the client MAY disconnect from the service node and replace it with another service node from the pool. +Such a client SHOULD also remove the failing service node from the pool of reliable service nodes. + +We RECOMMEND replacing a filter service node under the following conditions: +- a [`SUBSCRIBE`](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) request fails due to a service-side error +- a [`SUBSCRIBER_PING`](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) request fails twice in a row + +#### 2. Redundant filter subscriptions + +A filter client using this strategy MUST subscribe to two or more filter service nodes. +Such clients SHOULD filter out duplicate messages by comparing [deterministic message hashes](https://rfc.vac.dev/waku/standards/core/14/message#deterministic-message-hashing). +Note that both bandwidth usage and computational complexity increase proportionally to the number of service nodes used. +For this reason, we RECOMMEND using only two filter service nodes at a time. + +#### 3. 
Maintaining healthy subscriptions + +- A filter client using this strategy MUST regularly send a [`SUBSCRIBER_PING`](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) to each of its filter service nodes, +to ensure that the service node is online and maintaining an active subscription for the client. +The interval between `SUBSCRIBER_PING` requests is up to the implementation, +but we RECOMMEND `1 minute`. +- If the `SUBSCRIBER_PING` request times out or returns an error code, +the client SHOULD attempt to reinstall its filter subscription with a [`SUBSCRIBE`](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) request. +- If a `SUBSCRIBER_PING` or `SUBSCRIBE` request fails more than once to the same filter service node, +the client MAY choose to replace this service node with another from the service node pool. +We RECOMMEND a strict policy of replacing filter service nodes after only two `SUBSCRIBER_PING` failures +or after a single `SUBSCRIBE` failure. +- A filter client MAY also choose to refresh its existing subscriptions periodically, +by submitting the same filter criteria as before in a new [`SUBSCRIBE`](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) request to the same service node. +This helps ensure that local and remote views of filter criteria remain synchronised. + +#### 4. Store query on loss detection + +> *_Note:_* Filter clients participating in [Store-based reliability](#store-based-reliability) already perform this strategy and can ignore this section. + +- A filter client using this strategy MUST use [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) queries to retrieve lost messages. 
+- Such clients MAY use [Store-based reliability](#store-based-reliability) to periodically detect and remedy message losses. +- If [Store-based reliability](#store-based-reliability) is unsuitable (e.g. due to the high resource usage of repeated store queries), +the client MAY perform opportunistic store queries covering periods over which it detected a disconnection. +For example, a client MAY consider itself offline over a period of repeated failed regular pings and perform a store query once the connection has been restored. +The store query SHOULD cover the period from the last successfully received message to the reestablishment of connectivity. + +## Tradeoffs + +All p2p reliability strategies set out in this document increase resource usage, +most prominently bandwidth usage but also processing power and storage in some circumstances. +As such, each strategy SHOULD be carefully considered and configured based on the intended use case. +Increasing redundancy 20-fold may significantly improve reliability, +but the accompanying increase in resource usage will be unacceptable for the average Waku user. +At the same time, +none of the reliability strategies described here can _guarantee_ end-to-end reliability from an application's perspective. +This is due to the inherent probabilistic nature of p2p message propagation on the routing layer. +For example, certain sections of the network may become temporarily unreachable due to a network split, +without this being visible on a hop-to-hop basis. +Only the application has the end-to-end view to ensure reliability spanning all routing layer hops between that application's publishers and intended recipients. +For applications with an integrated end-to-end reliability protocol, +most p2p reliability strategies can be minimally configured (or even disabled) to save resources. + +## References + +1. [10/WAKU2](https://rfc.vac.dev/waku/standards/core/10/waku2) +1. 
[12/WAKU2-FILTER](https://rfc.vac.dev/waku/standards/core/12/filter) +1. [13/WAKU2-STORE](https://rfc.vac.dev/waku/standards/core/13/store) +1. [17/WAKU2-RLN-RELAY](https://rfc.vac.dev/waku/standards/core/17/rln-relay) +1. [14/WAKU2-MESSAGE](https://rfc.vac.dev/waku/standards/core/14/message#deterministic-message-hashing) +1. [GossipSub v1.1](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md#flood-publishing) +1. [WAKU2-LIGHTPUSH](../core/lightpush.md) + +## Copyright + +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). \ No newline at end of file