SDS-Repair (SDS-R) Implementation Guide
Overview
SDS-R is an optional extension to the Scalable Data Sync (SDS) protocol that enables collaborative repair of missing messages within a limited time window. It's designed to work over Waku and assumes participants are already in a secure channel.
Core Concept
When a participant detects missing messages (via causal dependencies), it waits a hash-derived backoff period before requesting a repair. Other participants who hold the missing message wait their own backoff before responding. This deterministic timing, combined with response grouping, means that typically only one request and one response are sent per missing message.
Data Structures
Protobuf Schema Modifications
message HistoryEntry {
string message_id = 1;
optional bytes retrieval_hint = 2;
optional string sender_id = 3; // NEW: Original sender's ID (only for SDS-R)
}
message Message {
string sender_id = 1;
string message_id = 2;
string channel_id = 3;
optional int32 lamport_timestamp = 10;
repeated HistoryEntry causal_history = 11;
optional bytes bloom_filter = 12;
repeated HistoryEntry repair_request = 13; // NEW: List of missing messages
optional bytes content = 20;
}
Additional Participant State
Each participant must maintain:
- Outgoing Repair Request Buffer
  - Map: HistoryEntry -> T_req (timestamp), sorted by ascending T_req
  - Contains missing messages waiting to be requested
- Incoming Repair Request Buffer
  - Map: HistoryEntry -> T_resp (timestamp)
  - Contains repair requests from others that this participant can fulfill
  - Only includes requests where the participant is in the response group
- Augmented Local History
  - Change from base SDS: store full Message objects, not just message IDs
  - Only for messages where the participant could be a responder
  - Needed to rebroadcast messages when responding to repairs
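A minimal Python sketch of this per-participant state is shown below. The type and field names (RepairState, outgoing_repair_requests, etc.) are illustrative, not taken from the spec, and sender_id is held as an integer on the assumption that string IDs from the wire are converted once on receipt (see Participant ID Format below).

from dataclasses import dataclass, field

@dataclass(frozen=True)
class HistoryEntry:
    message_id: str
    retrieval_hint: bytes | None = None
    sender_id: int | None = None      # original sender's ID, converted to int for XOR

@dataclass
class RepairState:
    # Missing messages we plan to request: HistoryEntry -> T_req
    outgoing_repair_requests: dict[HistoryEntry, float] = field(default_factory=dict)
    # Requests from others that we can serve: HistoryEntry -> T_resp
    incoming_repair_requests: dict[HistoryEntry, float] = field(default_factory=dict)
    # Augmented local history: full Message objects indexed by message_id
    local_history: dict[str, object] = field(default_factory=dict)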
Global Configuration (per channel)
T_min = 30 seconds // Minimum wait before requesting repair
T_max = 120 seconds // Maximum wait for repair window (recommend 120-600)
num_response_groups = max(1, num_participants / 128) // Response group count
Critical Algorithms
1. Calculate T_req (When to Request Repair)
IMPORTANT BUG FIX: The spec has an off-by-one error. Use this corrected formula:
T_req = current_time + hash(participant_id, message_id) % (T_max - T_min) + T_min
- participant_id: your OWN participant ID (not the sender's)
- message_id: the missing message's ID
- Result: a timestamp between current_time + T_min and current_time + T_max
2. Calculate T_resp (When to Respond to Repair)
distance = participant_id XOR sender_id
T_resp = current_time + (distance * hash(message_id)) % T_max
- participant_id: your OWN participant ID
- sender_id: the original sender's ID from the HistoryEntry
- message_id: the requested message's ID
- Note: the original sender has distance = 0 and responds immediately
3. Determine Response Group Membership
is_in_group = (hash(participant_id, message_id) % num_response_groups) ==
(hash(sender_id, message_id) % num_response_groups)
- Only respond to repairs if is_in_group is true
- The original sender is always in their own response group
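The three formulas above can be sketched in Python as follows. SHA-256 and integer participant IDs are assumptions layered on top of the spec (it leaves the hash function and ID format open; see the Hash Function and Participant ID Format notes below):

import hashlib

def hash_to_int(*parts) -> int:
    # Hash the concatenated inputs and take the first 8 bytes as an unsigned integer
    digest = hashlib.sha256("|".join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], byteorder="big")

def calc_t_req(now: float, participant_id: int, message_id: str,
               t_min: int = 30, t_max: int = 120) -> float:
    # Corrected formula: result lands in [now + t_min, now + t_max)
    return now + hash_to_int(participant_id, message_id) % (t_max - t_min) + t_min

def calc_t_resp(now: float, participant_id: int, sender_id: int,
                message_id: str, t_max: int = 120) -> float:
    # XOR distance; the original sender has distance 0 and responds immediately
    distance = participant_id ^ sender_id
    return now + (distance * hash_to_int(message_id)) % t_max

def is_in_response_group(participant_id: int, sender_id: int,
                         message_id: str, num_response_groups: int) -> bool:
    return (hash_to_int(participant_id, message_id) % num_response_groups ==
            hash_to_int(sender_id, message_id) % num_response_groups)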
Protocol Implementation Steps
When SENDING a Message
- Check the outgoing repair request buffer for eligible entries (where T_req <= current_time)
- Take up to 3 eligible entries with the lowest T_req values
- Populate the repair_request field with these HistoryEntries:
  - Include message_id
  - Include retrieval_hint if available
  - Include sender_id (the original sender's ID)
- If there are no eligible entries, leave the repair_request field unset
- Continue with the normal SDS send procedure
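A sketch of this send-side hook, reusing the illustrative RepairState above and assuming the outgoing SDS Message object exposes a repair_request list mirroring the protobuf field:

MAX_REPAIRS_PER_MESSAGE = 3

def attach_repair_requests(message, state: RepairState, now: float) -> None:
    # Take up to 3 eligible entries (T_req <= now), lowest T_req first
    eligible = sorted(
        (e for e, t_req in state.outgoing_repair_requests.items() if t_req <= now),
        key=lambda e: state.outgoing_repair_requests[e],
    )[:MAX_REPAIRS_PER_MESSAGE]
    for entry in eligible:
        # Each entry carries message_id, retrieval_hint (if any) and sender_id
        message.repair_request.append(entry)
    # If nothing is eligible, repair_request simply stays unset/empty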
When RECEIVING a Message
- Clean up buffers:
  - Remove the received message_id from the outgoing repair request buffer
  - Remove the received message_id from the incoming repair request buffer
- Process causal dependencies. For each missing dependency in causal_history:
  - Add it to the outgoing repair request buffer
  - Calculate T_req using the formula above
  - Include sender_id from the causal history entry
- Process the repair_request field. For each repair request entry:
  a. Remove it from your own outgoing buffer (someone else is requesting it)
  b. Check if you have this message in local history
  c. Check if you are in the response group (use the formula above)
  d. If both b and c are true:
     - Add it to the incoming repair request buffer
     - Calculate T_resp using the formula above
- Continue with the normal SDS receive procedure
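A sketch of the receive-side handling, again using the illustrative state and helper functions above, and assuming msg mirrors the protobuf Message (message_id, causal_history, repair_request):

def handle_received_message(msg, state: RepairState, me: int,
                            num_response_groups: int, now: float) -> None:
    # 1. Clean up: the received message no longer needs requesting or answering
    state.outgoing_repair_requests = {
        e: t for e, t in state.outgoing_repair_requests.items()
        if e.message_id != msg.message_id}
    state.incoming_repair_requests = {
        e: t for e, t in state.incoming_repair_requests.items()
        if e.message_id != msg.message_id}

    # 2. Schedule repair requests for missing causal dependencies
    for dep in msg.causal_history:
        if dep.message_id not in state.local_history:
            state.outgoing_repair_requests.setdefault(
                dep, calc_t_req(now, me, dep.message_id))

    # 3. Process the repair_request field
    for req in msg.repair_request:
        state.outgoing_repair_requests.pop(req, None)   # someone else is asking already
        if (req.message_id in state.local_history
                and req.sender_id is not None
                and is_in_response_group(me, req.sender_id, req.message_id,
                                         num_response_groups)):
            state.incoming_repair_requests.setdefault(
                req, calc_t_resp(now, me, req.sender_id, req.message_id))

    # 4. Continue with the normal SDS receive procedure (delivery, buffering, etc.)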
Periodic Sweeps
Outgoing Repair Request Buffer Sweep (every ~5 seconds)
for entry, t_req in outgoing_buffer.items():
    if current_time >= t_req:
        # This entry will be included in the next message's repair_request
        # No action needed here; just wait for the next send
        pass
Incoming Repair Request Buffer Sweep (every ~5 seconds)
for entry, t_resp in list(incoming_buffer.items()):
    if current_time >= t_resp:
        message = get_from_local_history(entry.message_id)
        if message:
            broadcast(message)  # Rebroadcast the full original message
        remove_from_incoming_buffer(entry)
Periodic Sync Messages with SDS-R
When sending periodic sync messages:
- Check if there are eligible entries in outgoing repair request buffer
- If yes, send the sync message WITH repair_request field populated
- Unlike base SDS, don't suppress sync message even if others recently sent one
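A small sketch of that decision, where base_sds_would_suppress stands in for whatever suppression rule base SDS already applies (a hypothetical input, not a spec-defined name):

def should_send_sync(state: RepairState, now: float, base_sds_would_suppress: bool) -> bool:
    # SDS-R overrides suppression whenever eligible repair requests are waiting
    if any(t_req <= now for t_req in state.outgoing_repair_requests.values()):
        return True
    return not base_sds_would_suppress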
Implementation Notes and Edge Cases
Hash Function
CRITICAL: The spec doesn't specify which hash function to use. Recommend:
- Use SHA256 for cryptographic properties
- Convert to an integer for modulo operations, e.g. int.from_bytes(hash_bytes[:8], byteorder='big')
- Must be consistent across all participants
Participant ID Format
- Must support XOR operation for distance calculation
- Recommend using numeric IDs or convert string IDs to integers
- Must be globally unique within the channel
Memory Management
- Buffer limits: Implement max size for repair buffers (suggest 1000 entries)
- Eviction policy: Remove oldest T_req/T_resp when at capacity
- History retention: Only keep messages for T_max duration
- Response group optimization: Only cache full messages if you're likely to be in response group
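A capacity-bounded insert helper, assuming the suggested limit of 1000 entries and reading "remove oldest" as evicting the entry with the smallest (longest-waiting) timestamp:

MAX_BUFFER_ENTRIES = 1000   # suggested limit

def add_with_eviction(buffer: dict, entry, timestamp: float) -> None:
    # Duplicate requests are tracked only once (dict keys give Set semantics)
    if entry in buffer:
        return
    if len(buffer) >= MAX_BUFFER_ENTRIES:
        oldest = min(buffer, key=buffer.get)   # evict the oldest timestamp
        del buffer[oldest]
    buffer[entry] = timestamp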
Edge Cases to Handle
- Duplicate repair requests: Use Set semantics, only track once
- Expired repairs: If T_req > current_time + T_max, remove from buffer
- Non-numeric participant IDs: Hash to integer for XOR operations
- Missing sender_id: Cannot participate in repair for that message
- Circular dependencies: Set maximum recursion depth for dependency resolution
Typo to Fix
The spec has "Perdiodic" on line 461 - should be "Periodic"
Testing Scenarios
- Single missing message: Verify only one repair request and response
- Cascade recovery: Missing message A depends on missing message B
- Original sender offline: Verify next closest participant responds
- Response group isolation: Verify only in-group participants respond
- Buffer overflow: Test eviction policies
- Network partition: Test behavior when repair window expires
Integration with Base SDS
Modified State from Base SDS
- Local history stores full Messages, not just IDs
- Additional buffers for repair tracking
- Sender_id must be preserved in HistoryEntry
Unchanged from Base SDS
- Lamport timestamp management
- Bloom filter operations
- Causal dependency checking
- Message delivery and conflict resolution
Performance Recommendations
- Use priority queues for T_req/T_resp ordered buffers
- Index local history by message_id for O(1) lookup
- Batch repair requests in single message (up to 3)
- Cache response group calculation results
- Implement exponential backoff in future version (noted as TODO in spec)
Security Assumptions
- Operating within secure channel (via Waku)
- All participants are authenticated
- Rate limiting via Waku RLN-RELAY
- No additional authentication needed for repairs
- Trust all repair requests from channel members
This implementation guide should be sufficient to implement SDS-R without access to the original specification. The key insight is that SDS-R elegantly uses timing and randomization to coordinate distributed repair without central coordination or excessive network traffic.