---
title: RELIABILITY-RELAY
name: Reliability for Relay Protocol
category: Standards Track
tags: [reliability, application]
editor: Kaichao Sun <kaichao@status.im>
contributors:
  - Richard Ramos <richard@status.im>
---

## Abstract

The [Relay Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/11/relay.md) in Waku routes messages efficiently, but it does not guarantee that a message reaches its destination. For example, the receiver in a chat application may miss messages when network issues occur on either the sender or the receiver side.

In general, a message in the Waku network has one of three statuses from the sender's perspective:

- **outgoing**, the message has been posted by its creator but not yet confirmed by any other node
- **sent**, the message has been received by at least one other node in the network
- **delivered**, the message has been acknowledged by the receiver

Applications like Status already use [MVDS](https://github.com/vacp2p/rfc-index/blob/main/vac/2/mvds.md) for e2e acknowledgement in direct messages and group chats. There is an ongoing [discussion](https://forum.vac.dev/t/end-to-end-reliability-for-scalable-distributed-logs/293) about a more general and bandwidth-efficient solution for e2e reliability.

In other words, an application defines a payload over Waku and is interested in e2e delivery between application users. Waku provides a pub/sub broadcast transport, which is interested in reliably routing a message to all participants in the pub/sub broadcast group.

Until a complete design for e2e reliability is available, existing protocols need to be composed to increase the reliability of the relay protocol. This document proposes a few options for such composition. The approach proposed here can also be applied to scenarios that depend on the [Light Push](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/19/lightpush.md) and [Filter](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/12/filter.md) protocols, since they are service wrappers of the Relay protocol.

## Motivation

The [Store protocol](https://github.com/waku-org/specs/blob/master/standards/core/store.md) provides a way for nodes in the network to query the existence of messages or to fetch specific messages based on search criteria.

**Search criteria with message hash**

For nodes that may have connection issues when **publishing** messages via the relay network, this search criterion can be used to check whether a message has been propagated in the network. A message that exists in the store node can be moved from `outgoing` to `sent` by the application. If the message is not found in the store node, the application can resend it.

**Search criteria with topics and time range**

For nodes that may have connection issues when **receiving** messages via the relay network, this search criterion can be used to fetch missing messages from store nodes periodically after the network connection resumes.

In summary, by leveraging store nodes to provide such query services, applications can mitigate the reliability issues of the relay protocol.

However, this approach also introduces new challenges such as centralization, privacy, and scalability. It should be viewed as a temporary solution and deprecated once an e2e reliability solution is ready.

## Implementation Suggestions 

### Query with Message Hash

For outgoing messages, the processing flow can look like this:
- create a buffer for all "outgoing" messages
- send the message via the relay or lightpush protocol
- add the message hash to the buffer
- save the message to the local database with status "outgoing"
- check the buffer periodically
- query the store node with the message hashes in the buffer, limited to messages posted more than a few seconds ago
- if the message exists, update its status to "sent" and remove the message hash from the buffer
- if the message does not exist, resend the message
- if the message is still missing from the store node after a period of time, trigger the failed-to-send workflow and remove the message hash from the buffer

The implementation in Python may look like this:

```python
import time
from dataclasses import dataclass

# `waku` and `database` stand in for the application's Waku client and local storage
outgoingMessageHashes = []

@dataclass
class Message:
    hash: str
    postTime: int  # Unix timestamp in seconds
    status: str
    content: str

def send(message):
    # send message via relay or lightpush protocol, here use relay as example
    waku.relay.post(message)
    outgoingMessageHashes.append(message.hash)

    message.status = 'outgoing'
    database.saveMessage(message)

def checkOutgoingMessages(peerID):
    # iterate over a copy, because confirmed hashes are removed from the list
    for messageHash in list(outgoingMessageHashes):
        message = database.getMessage(messageHash)
        elapsed = time.time() - message.postTime
        # only query store node for outgoing messages posted more than 3 seconds ago
        if message.status == 'outgoing' and elapsed > 3:
            response = waku.store.queryMessage(peerID, messageHash)
            if response.exists():
                database.updateMessageStatus(messageHash, 'sent')
                outgoingMessageHashes.remove(messageHash)
            elif elapsed > 10:
                # resend the message if it's not stored in store node after 10 seconds
                waku.relay.post(message)

The function `checkOutgoingMessages` is called periodically, most likely every few seconds. Message hashes can be queried in batches to reduce the number of requests to store nodes; the batch size should not exceed the maximum size supported by the store node.
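
One way to bound the request size is to split the pending hashes into fixed-size chunks. The sketch below is illustrative only: `batchHashes` and `MAX_HASHES_PER_QUERY` are hypothetical names, and the value `100` is an assumption, since the actual limit depends on the store node.

```python
from typing import Iterator, List

# Hypothetical limit; the real maximum depends on the store node being queried.
MAX_HASHES_PER_QUERY = 100

def batchHashes(hashes: List[str], batchSize: int = MAX_HASHES_PER_QUERY) -> Iterator[List[str]]:
    """Yield consecutive slices of `hashes`, each holding at most `batchSize` items."""
    if batchSize <= 0:
        raise ValueError("batchSize must be positive")
    for start in range(0, len(hashes), batchSize):
        yield hashes[start:start + batchSize]
```

Each yielded batch would then be sent to the store node as a single query.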

The store node can be set and updated directly by the application, or selected from peers discovered by protocols like [discv5](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/33/discv5.md) or [peer exchange](https://github.com/waku-org/specs/blob/master/standards/core/peer-exchange.md).
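
A minimal sketch of such a selection policy follows. It assumes that an explicitly configured node takes precedence over discovered peers and that a discovered peer is picked at random; both choices, and the name `selectStoreNode`, are assumptions made for illustration and are not prescribed by this document.

```python
import random
from typing import Optional, Sequence

def selectStoreNode(configured: Optional[str], discovered: Sequence[str]) -> Optional[str]:
    """Return the peer ID of the store node to query, or None if none is available."""
    # Assumption: a node set directly by the application always wins.
    if configured:
        return configured
    # Otherwise fall back to a random peer found via discv5 or peer exchange.
    if discovered:
        return random.choice(list(discovered))
    return None
```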

A store node may only support specific pubsub topics, so the application should group message hashes by pubsub topic before sending the request.
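
The grouping step could be sketched as below. The helper name `groupHashesByPubsubTopic` and the `(pubsubTopic, messageHash)` pair representation are assumptions for illustration; an application would use whatever message record it already stores.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def groupHashesByPubsubTopic(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (pubsubTopic, messageHash) pairs into one hash list per pubsub topic."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for pubsubTopic, messageHash in entries:
        grouped[pubsubTopic].append(messageHash)
    # Return a plain dict so that looking up an unknown topic raises KeyError.
    return dict(grouped)
```

The application can then send one request per pubsub topic to a store node that supports that topic.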

When a persistent network issue occurs, resending failed messages indefinitely is undesirable. The application should have a mechanism to clear failed message hashes from the cache and trigger other retry logic after a few attempts.
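
One possible shape for such a mechanism is an attempt counter per pending message. In this sketch, `PendingMessage`, `recordFailedCheck`, and the limit `MAX_RESEND_ATTEMPTS = 3` are hypothetical; the document does not define a specific retry limit.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical limit on how many times a message is resent before giving up.
MAX_RESEND_ATTEMPTS = 3

@dataclass
class PendingMessage:
    hash: str
    attempts: int = 0

def recordFailedCheck(pending: Dict[str, PendingMessage], messageHash: str) -> bool:
    """Count one failed store check for `messageHash`.

    Returns True if the message should be resent. Returns False once the limit
    is exceeded, in which case the hash is removed from `pending` so the caller
    can trigger its failed-to-send workflow.
    """
    entry = pending[messageHash]
    entry.attempts += 1
    if entry.attempts > MAX_RESEND_ATTEMPTS:
        del pending[messageHash]
        return False
    return True
```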

### Query with Topics and Time Range

An application could use different pubsub topics and content topics. For example, a community may have its own pubsub topic, and each channel may have its own content topic. To fetch all missing messages in a specific channel, the application can query the store node with the corresponding pubsub topic, content topic, and time range.

For incoming messages, the processing flow can look like this:
- subscribe to the pubsub and content topics of interest
- periodically query the store node for message hashes, using the topics of interest and a time range
- check whether each message hash exists in the local database; if not, add it to a buffer, otherwise skip it
- fetch the missing messages in the buffer from the store node
- process the messages as needed
- update the last fetch time for the topic of interest

The implementation in Python may look like this:

```python
from dataclasses import dataclass
from typing import List

# `waku`, `queryDbMessageByHash`, `processMessage` and `updateFetchRecord`
# stand in for the application's Waku client and local storage helpers

@dataclass
class FetchRecord:
    pubsubTopic: str
    lastFetch: int

@dataclass
class QueryParams:
    pubsubTopic: str
    contentTopics: List[str]
    fromTime: int
    toTime: int

def fetchMissingMessages(peerID, queryParams):
    missingMessageHashes = []

    # get missing message identifiers first in order to reduce the data transfer
    response = waku.store.queryMessageHashes(peerID, queryParams)
    while not response.isComplete():
        # keep only the hashes that are not in the local database yet
        for messageHash in response.messages():
            message = queryDbMessageByHash(messageHash)
            if not message.exists():
                missingMessageHashes.append(messageHash)

        # process next page of the response
        response.Next()

    # fetch missing messages with hashes in batch
    response = waku.store.queryMessagesByHash(peerID, missingMessageHashes)
    for message in response.messages():
        processMessage(message)

    updateFetchRecord(queryParams.pubsubTopic, queryParams.toTime)
```

`QueryParams` includes all the information necessary to fetch missing messages. The application should iterate over all pubsub topics of interest, along with their content topics, to construct the `QueryParams`.

The function `fetchMissingMessages` runs periodically, for example every minute. It first fetches all message hashes in the specified time range and checks whether each message exists in the local database; the missing messages are then fetched in batches. The batch size should be bounded to avoid large data transfers and to stay within the maximum size supported by the store node.

After fetching the missing messages, the application should update the last fetch time in `FetchRecord`. The last fetch time can be used to calculate the time range for the next fetch and to avoid fetching the same messages again.
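
The time range calculation could look like the sketch below. The helper `buildQueryParams` and the one-hour `DEFAULT_LOOKBACK_SECONDS` used when no previous fetch exists are assumptions for illustration; the two dataclasses are repeated only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FetchRecord:
    pubsubTopic: str
    lastFetch: int

@dataclass
class QueryParams:
    pubsubTopic: str
    contentTopics: List[str]
    fromTime: int
    toTime: int

# Hypothetical window used for the very first fetch of a topic.
DEFAULT_LOOKBACK_SECONDS = 3600

def buildQueryParams(pubsubTopic: str, contentTopics: List[str],
                     record: Optional[FetchRecord], now: int) -> QueryParams:
    """Build the next query so that it starts where the previous fetch ended."""
    if record is not None:
        fromTime = record.lastFetch
    else:
        fromTime = now - DEFAULT_LOOKBACK_SECONDS
    return QueryParams(pubsubTopic, list(contentTopics), fromTime, now)
```

Passing `now` explicitly keeps the function deterministic; the caller would typically supply `int(time.time())` and store `toTime` as the new `lastFetch` once the fetch succeeds.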


### Unified Query

In some cases, outgoing and incoming messages are queried under similar conditions, for example at similar intervals. The application can combine the two workflows above into a unified query with better overall performance.

The workflow can look like this:
- create an outgoing buffer for all "outgoing" messages
- create an incoming buffer for all recently received message hashes
- periodically query the store node for message hashes based on topics and time range
- check the outgoing buffer against the returned message hashes; if a hash is included, mark the message as `sent`, otherwise resend if needed
- check the incoming buffer against the returned message hashes; if a hash is not included, fetch the missing message by its hash
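
The comparison at the heart of this workflow can be sketched as a pure function over the hashes returned by one store query. The name `reconcileBuffers` and the use of sets for both buffers are assumptions for illustration, not part of the Store protocol.

```python
from typing import Iterable, List, Set, Tuple

def reconcileBuffers(storeHashes: Iterable[str], outgoing: Set[str],
                     received: Set[str]) -> Tuple[List[str], List[str], List[str]]:
    """Compare one store response against the outgoing and incoming buffers.

    Returns three sorted lists:
      confirmed   - outgoing hashes found in the store (can be marked `sent`)
      unconfirmed - outgoing hashes not found in the store (resend candidates)
      missing     - stored hashes this node neither sent nor received (to fetch)
    """
    stored = set(storeHashes)
    confirmed = sorted(outgoing & stored)
    unconfirmed = sorted(outgoing - stored)
    missing = sorted(stored - outgoing - received)
    return confirmed, unconfirmed, missing
```

An outgoing hash counts as unconfirmed only in a meaningful way when the message was posted inside the queried time range, so the caller should restrict `outgoing` to those messages before resending.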

## Security and Performance Considerations

The message query request exposes client metadata to the store nodes, so a store node is able to associate messages with the clients interested in them.

The query requests add a fair amount of load to store nodes, and the load increases linearly as more users are onboarded. Store nodes should be able to scale up and down by monitoring or predicting the workload.

The store node can also be a target of DDoS attacks, and should have a mechanism to prevent such attacks.

Applications should provide options to configure different store nodes for their users; such nodes can be either self-hosted or public nodes with a good reputation.


## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

## References

1. [Relay Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/11/relay.md)
2. [Light Push Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/19/lightpush.md)
3. [Filter Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/12/filter.md)
4. [MVDS - Minimum Viable Data Synchronization](https://github.com/vacp2p/rfc-index/blob/main/vac/2/mvds.md)
5. [End-to-end reliability for scalable distributed logs](https://forum.vac.dev/t/end-to-end-reliability-for-scalable-distributed-logs/293)
6. [Waku Store Query](https://github.com/waku-org/specs/blob/master/standards/core/store.md)
7. [Waku v2 Discv5 Ambient Peer Discovery](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/33/discv5.md)
8. [Waku v2 Peer Exchange](https://github.com/waku-org/specs/blob/master/standards/core/peer-exchange.md)