nim-sds/tests/test_persistence.nim
NagyZoltanPeter 4ccdd122fc
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72)
* feat: propagate persistence backend errors via Result

The Persistence contract previously returned `Future[void]` for writes and
`Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no
way to report a failure, so a failed write or a failed/partial read was
silently swallowed — and on the read path a mid-scan failure could bootstrap
a *truncated* channel snapshot, corrupting the rebuilt bloom filter and
lamport clock across a restart.

Make every contract field Result-returning:
  * mutating ops  -> Future[Result[void, string]]
  * loadAllForChannel -> Future[Result[ChannelSnapshot, string]]

The backend-supplied error string is mapped to a new
`ReliabilityError.rePersistenceError` (logged once at the boundary via
`reliabilityErr`) and threaded up through every persistence-touching proc to
the public API, where the caller decides what to do. Request-driven paths
(wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate
the error; background maintenance loops (periodicBufferSweep,
periodicRepairSweep) log and retry on the next tick, since they have no
synchronous caller.
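
A minimal sketch of the boundary mapping described above, assuming a synchronous stand-in for the async contract. `WriteOutcome`, `Backend`, `OpResult`, `wrapMessage`, `alwaysFails` and the `saveOutgoing` signature are hypothetical names used for illustration; only `rePersistenceError` is taken from this commit.

```nim
type
  ReliabilityError = enum
    reInternalError
    rePersistenceError

  # Hypothetical stand-in for Result[void, string]: empty `error` means success.
  WriteOutcome = object
    error: string

  # Hypothetical one-proc backend; the real contract is async and larger.
  Backend = object
    saveOutgoing: proc(channelId, msgId: string): WriteOutcome

  OpResult = object
    ok: bool
    error: ReliabilityError

proc wrapMessage(b: Backend, channelId, msgId: string): OpResult =
  let outcome = b.saveOutgoing(channelId, msgId)
  if outcome.error.len > 0:
    # Log once at the boundary, then hand a typed error to the caller.
    echo "persistence error: ", outcome.error
    return OpResult(ok: false, error: rePersistenceError)
  OpResult(ok: true)

proc alwaysFails(channelId, msgId: string): WriteOutcome =
  WriteOutcome(error: "disk full")

let res = wrapMessage(Backend(saveOutgoing: alwaysFails), "testChannel", "msg-1")
doAssert not res.ok
doAssert res.error == rePersistenceError
```

The request-driven path returns the typed error; a background loop would instead log it and try again on its next tick.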

Tests: in-memory backend gains a `failingOps` injection hook; new
"Persistence: error propagation" suite asserts read/write/drop failures
surface as `rePersistenceError`. Full suite passes (90 OK).
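
The injection hook can be pictured roughly as follows. This is a sketch under assumptions: the trimmed-down `InMemoryStore`, the `saveOutgoing` proc and its string return value are hypothetical; only the `failingOps` name comes from this commit.

```nim
import std/sets

type
  # Hypothetical, trimmed-down store; the real InMemoryStore has more fields.
  InMemoryStore = ref object
    failingOps: HashSet[string]
    outgoing: seq[string]

proc saveOutgoing(store: InMemoryStore, msgId: string): string =
  ## Returns "" on success, or an error message when the op is set to fail.
  if "saveOutgoing" in store.failingOps:
    return "injected failure: saveOutgoing"
  store.outgoing.add(msgId)
  ""

let store = InMemoryStore(failingOps: initHashSet[string]())
store.failingOps.incl("saveOutgoing")
doAssert store.saveOutgoing("m1") == "injected failure: saveOutgoing"
store.failingOps.excl("saveOutgoing")
doAssert store.saveOutgoing("m1") == ""
doAssert store.outgoing == @["m1"]
```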

BREAKING CHANGE: the `Persistence` contract signature changed; custom
backends must return `Result` and `ok()` on success. Bumped to 0.3.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(persistence): add snapshot types and codec (phase 0)

Introduce atomic-snapshot persistence types that will replace the current
fine-grained 13-proc Persistence interface. This commit is purely additive:
no existing call site changes, no behaviour change.

New types (sds/types/):
- channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob),
  ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV
  (flattened map entries for protobuf wire shape).
- history_update.nim — HistoryUpdate (combined append/evict payload for
  the message log).

New codec (sds/snapshot_codec.nim):
- Protobuf encode/decode for all new types, reusing the existing
  SdsMessage and HistoryEntry encoders from sds/protobuf.nim.
- Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown
  versions loudly rather than silently truncating.
- Time encoded as int64 unix milliseconds.
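
The millisecond encoding and the version check can be sketched with `std/times` alone. The helper names (`toUnixMillis`, `fromUnixMillis`, `checkSchemaVersion`) and the `uint32` version type are assumptions for illustration, not the codec's actual API.

```nim
import std/times

const currentSchemaVersion = 1'u32 # value from the commit text; type assumed

proc toUnixMillis(t: Time): int64 =
  ## Whole milliseconds since the epoch; sub-millisecond precision is dropped.
  t.toUnix * 1000 + int64(t.nanosecond div 1_000_000)

proc fromUnixMillis(ms: int64): Time =
  ## Inverse of toUnixMillis for non-negative inputs.
  initTime(ms div 1000, int((ms mod 1000) * 1_000_000))

proc checkSchemaVersion(v: uint32): bool =
  ## Accept only the version this decoder understands.
  v == currentSchemaVersion

let original = initTime(1_700_000_000, 123_456_789)
let restored = fromUnixMillis(toUnixMillis(original))
doAssert restored.toUnix == original.toUnix
doAssert restored.nanosecond == 123_000_000
doAssert checkSchemaVersion(1)
doAssert not checkSchemaVersion(2)
```

A round trip keeps the second and millisecond parts and discards anything finer.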

Tests (tests/test_snapshot_codec.nim):
- 13 round-trip cases covering empty, single-entry, full-buffer, and
  repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants;
  schemaVersion rejection.

Planning artefacts:
- ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write
  divergence, chatty call rate, non-fatal-error policy gap).
- ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op
  and projected call rates.
- PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit
  implements phase 0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(persistence): add PersistenceV2 interface alongside legacy (phase 1)

Introduce the 5-proc snapshot-based Persistence interface that will
replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so
phase 2 can migrate protocol ops one at a time without breaking existing
callers.

New file:
- sds/types/persistence_v2.nim — `PersistenceV2` type with
  saveChannelMeta / updateHistory / loadChannel / dropChannel /
  setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture
  the atomicity pairing (meta save + history update issued back-to-back
  under the channel lock) and the non-fatal failure policy from PLAN §8.

Modified:
- sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2`
  field alongside `persistence`; constructor takes both, both default to
  no-op.
- sds.nim — `newReliabilityManager` plumbs the new optional parameter.
- AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 +
  phase 1 additions; symbol counts updated by `npx gitnexus analyze`.

No call site uses the new interface yet — that's phase 2. All existing
tests still pass against the legacy interface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1)

Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced
by a single trySaveMeta per *dirty* channel at the end of that channel's
sweep. Failure is logged but does NOT abort the sweep — in-memory state
is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8).

Helpers added in sds/sds_utils.nim:
- snapshotMeta(channel) — capture current ChannelContext as ChannelMeta
  blob (flattens Table-keyed buffers to seqs for the wire shape).
- trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save;
  logs on failure, never propagates.
- tryUpdateHistory(rm, channelId, append, evict) — best-effort history
  update; skips the call entirely when both lists are empty (HistoryUpdate
  contract).

Call-rate impact for runRepairSweep:
- Before: N persistence calls per expired entry per channel.
- After:  at most 1 saveChannelMeta per dirty channel; 0 on idle channels
  (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS).
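
The dirty-flag pattern behind these numbers might look like the sketch below. `Channel`, `sweep`, `saveCalls` and this synchronous `saveChannelMeta` are hypothetical stand-ins, not the real procs.

```nim
import std/tables

type
  Channel = object
    expiredIds: seq[string] # entries the sweep will drop
    repairs: Table[string, int]

var saveCalls = 0

proc saveChannelMeta(channelId: string, ch: Channel): bool =
  ## Hypothetical backend call; counts invocations and always succeeds.
  inc saveCalls
  true

proc sweep(channels: var Table[string, Channel]) =
  for channelId, ch in channels.mpairs:
    var dirty = false
    for id in ch.expiredIds:
      if id in ch.repairs:
        ch.repairs.del(id)
        dirty = true
    ch.expiredIds.setLen(0)
    # One snapshot per dirty channel; a failed save is logged, never raised.
    if dirty and not saveChannelMeta(channelId, ch):
      echo "snapshot save failed for ", channelId

var channels = initTable[string, Channel]()
channels["busy"] = Channel(
  expiredIds: @["a", "b", "c"], repairs: {"a": 1, "b": 2, "c": 3}.toTable
)
channels["idle"] = Channel()
sweep(channels)
doAssert saveCalls == 1
```

Three expired entries on the busy channel cost one save, and the idle channel costs none.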

All existing tests pass — including the 3 SDS-R Repair Sweep tests that
directly exercise this proc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2)

Per-entry saveOutgoing / removeOutgoing calls are replaced by one
trySaveMeta at the end of the pass, conditional on a dirty flag (resend
attempt incremented, or entry expired). Pass succeeds even if the save
fails — next tick reissues the snapshot.

Call-rate impact:
- Before: N persistence calls per affected entry per pass.
- After:  at most 1 saveChannelMeta per pass; 0 when nothing aged out.

All existing tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A)

Wires `trySaveMeta` into the three public protocol ops that mutate
per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and
markDependenciesMet — at the operation's end, under the channel lock.

Legacy fine-grained persistence calls REMAIN in place; this commit is
additive. Both interfaces persist the same state simultaneously, so all
existing tests pass and a real backend wired to either interface
continues to work. Phase 2B will strip the legacy calls.

Save points match the §"Save Points" table in
ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly:
- wrapOutgoingMessage: 1 save (always)
- unwrapReceivedMessage: 1 save on every path including duplicate
  (the duplicate path still mutates the repair buffers)
- markDependenciesMet: 1 save after the processIncomingBuffer cascade

Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues;
the protocol op never returns rePersistenceError for snapshot failures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D)

End-state of phase 2: the protocol code no longer issues any legacy
fine-grained Persistence calls. All state survives via the snapshot-based
PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory
batched inside addToHistory. The legacy Persistence field on
ReliabilityManager remains for backwards compatibility; phase 3 deletes it.

Protocol changes (sds.nim, sds/sds_utils.nim):
- reviewAckStatus, processIncomingBuffer, updateLamportTimestamp →
  pure in-memory; no per-mutation persistence.
- addToHistory: replaces appendLogEntry+removeLogEntry with a single
  tryUpdateHistory call carrying (append, evict) atomically.
- getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal.
- wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet:
  all per-row saveOutgoing / removeOutgoing / saveIncoming /
  removeIncoming / saveOutgoingRepair / removeOutgoingRepair /
  saveIncomingRepair / removeIncomingRepair calls removed (16 call
  sites in total). State is captured by the op-end trySaveMeta added
  in phase 2A.
- getOrCreateChannel: bootstraps from persistenceV2.loadChannel.
- dropChannelFromPersistence: uses persistenceV2.dropChannel.

Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8):
- Protocol ops (wrap, unwrap, markDeps) and background sweeps: non-fatal.
  trySaveMeta / tryUpdateHistory log and continue; the protocol op
  returns ok regardless of disk failure. In-memory state is the source
  of truth; the next op re-issues a complete snapshot and disk catches
  up automatically.
- Durability-intent ops (removeChannel, resetReliabilityManager via
  dropChannelFromPersistence; getOrCreateChannel via loadChannel):
  still propagate rePersistenceError, because the caller asked us to
  confirm a disk operation and we cannot silently lie.
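
The two policies can be contrasted in a small sketch. Everything here is a simplified, synchronous assumption: `Backend.failing`, `OpStatus`, `saveSnapshot`, `wrapOutgoing` and the ordering inside `removeChannel` are illustrative, not the real implementation.

```nim
type
  Backend = object
    failing: bool # hypothetical failure-injection switch

  OpStatus = enum
    opOk
    opPersistenceError

proc saveSnapshot(b: Backend): bool =
  not b.failing

proc dropChannel(b: Backend): bool =
  not b.failing

proc trySaveMeta(b: Backend) =
  ## Best effort: log and continue, memory stays authoritative.
  if not b.saveSnapshot():
    echo "snapshot save failed; the next op re-issues it"

proc wrapOutgoing(b: Backend, buffer: var seq[string], msgId: string): OpStatus =
  buffer.add(msgId) # the in-memory mutation always happens
  b.trySaveMeta()
  opOk

proc removeChannel(b: Backend, buffer: var seq[string]): OpStatus =
  ## Durability-intent: the caller asked for a confirmed disk wipe.
  if not b.dropChannel():
    return opPersistenceError
  buffer.setLen(0) # assumed ordering: memory is cleared after the drop succeeds
  opOk

var buf: seq[string]
let broken = Backend(failing: true)
doAssert wrapOutgoing(broken, buf, "m1") == opOk
doAssert buf == @["m1"]
doAssert removeChannel(broken, buf) == opPersistenceError
doAssert removeChannel(Backend(failing: false), buf) == opOk
```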

Test infrastructure:
- tests/in_memory_persistence_v2.nim: new V2 adapter mock that
  decomposes the meta blob into the existing InMemoryStore shape so
  test assertions on store.outgoing / store.incoming / etc. continue to
  work without change.
- tests/test_persistence.nim: 17 tests, all rewritten against V2.
  - 13 state-survival tests carry over with identical assertions.
  - "loadChannel failure surfaces as err on bootstrap" — bootstrap
    keeps durability-intent semantics.
  - "saveChannelMeta failure during send does NOT surface" — deliberate
    inversion of the legacy "write failure surfaces as err" test. Asserts
    the new non-fatal policy: op returns ok, in-memory state correct,
    disk re-syncs on the next op.
  - "updateHistory failure during send does NOT surface" — same policy
    applied to the history path.
  - "dropChannel failure during removeChannel surfaces as err" — kept.
- All 17 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3)

End-state of the snapshot-persistence refactor. The legacy 13-proc
Persistence interface and its noOpPersistence are gone; the 5-proc
snapshot-based interface (formerly PersistenceV2) takes their place under
the canonical name.

Source:
- sds/types/persistence.nim: replaced 13-proc contract with the 5-proc
  snapshot interface (saveChannelMeta, updateHistory, loadChannel,
  dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere
  and an empty ChannelData on load.
- sds/types/persistence_v2.nim: removed.
- sds/types/reliability_manager.nim: dropped the second persistenceV2
  field; constructor takes a single `persistence: Persistence`.
- sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments
  updated.
- sds.nim: dropped the persistenceV2 parameter from newReliabilityManager.

Tests:
- tests/in_memory_persistence_v2.nim: removed; its content moved to...
- tests/in_memory_persistence.nim: replaces the old legacy mock with the
  snapshot adapter under the canonical filename. Same InMemoryStore
  shape so test assertions stay unchanged.
- tests/test_persistence.nim: ctor param renamed, suite name de-prefixed.

FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean.
All 4 test suites pass:
- test_bloom
- test_reliability
- test_persistence (17 V2 tests)
- test_snapshot_codec (13 codec round-trip tests)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Keep the persistence redesign plan in the repo for reference

* refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix)

Addresses all three substantive review findings on PR #72 in one
structural change: fold the per-op accumulator and the R2 retry buffer
into a single queue on `ChannelContext`, flushed once at op end.

Changes:

- sds/types/channel_context.nim: add `pendingHistoryAppends`
  (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts`
  (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full
  SdsMessage is looked up from `messageHistory` at flush time. Documented
  invariant: every id in pendingHistoryAppends is also in messageHistory,
  upheld by the merge rule.

- sds/sds_utils.nim:
  * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel,
    msgId)` — "latest-wins" merge: append cancels any pending evict
    and vice versa. Symmetric, simple, handles the evict-then-re-add
    sequence correctly (SDS-R repair re-delivering an evicted message
    while the backend is unreachable).
  * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the
    channel's pending queue. Dual role: per-op accumulator (multiple
    `addToHistory` calls within one op queue together and flush as one
    round-trip) AND R2 retry buffer (a failed flush leaves the queue
    populated for the next op to retry).
  * `addToHistory` queues via the helpers; does not call persistence.
  * Pending queue cleared on `cleanup` and `removeChannel`.

- sds.nim:
  * `processIncomingBuffer` returns to its single-arg signature — the
    queue lives on the channel, no parameter threading needed.
  * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths),
    `markDependenciesMet` issue exactly one `trySaveMeta` +
    `tryUpdateHistory` pair at op end, under the lock, with no
    intervening `await`-of-other-work. Matches the Persistence atomicity
    contract documented in `sds/types/persistence.nim`.
  * Pending queue cleared in `resetReliabilityManager`.

- tests/test_persistence.nim:
  * Direct `addToHistory` callers (state-survival setup) now follow with
    explicit `tryUpdateHistory(channelId)` to flush. Reflects the
    production op-end flush pattern.
  * New: `updateHistory failure is retried via R2 pending-write queue` —
    verifies that two failed sends leave both messages on the queue,
    and a third successful send drains the whole queue in one call.
  * New: `pending queue survives idle ops` — verifies that an op with
    no history changes of its own still flushes a previously-failed
    batch at op end.
  * New: `evict-then-re-add merge rule preserves the re-added message
    on disk` — regression for the "latest-wins" merge rule. The original
    "evict-wins" rule would silently drop the re-add and leave the
    message permanently absent from disk; this test would fail under
    that rule and passes under the corrected one.
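
The "latest-wins" merge rule and the retry behaviour can be sketched with `std/sets`. `PendingQueue`, `queueAppend`, `queueEvict` and `flush` are hypothetical simplifications that use plain string ids and a `HashSet` as the "disk"; the real queue lives on `ChannelContext` and flushes through the backend.

```nim
import std/sets

type
  PendingQueue = object
    appends: OrderedSet[string]
    evicts: HashSet[string]

proc queueAppend(q: var PendingQueue, id: string) =
  q.evicts.excl(id) # a newer append cancels a pending evict
  q.appends.incl(id)

proc queueEvict(q: var PendingQueue, id: string) =
  q.appends.excl(id) # a newer evict cancels a pending append
  q.evicts.incl(id)

proc flush(q: var PendingQueue, backendUp: bool, disk: var HashSet[string]): bool =
  ## Applies the whole batch in one call; on failure the queue is left intact.
  if q.appends.len == 0 and q.evicts.len == 0:
    return true
  if not backendUp:
    return false
  for id in q.evicts:
    disk.excl(id)
  for id in q.appends:
    disk.incl(id)
  q.appends.clear()
  q.evicts.clear()
  true

var q = PendingQueue(
  appends: initOrderedSet[string](), evicts: initHashSet[string]()
)
var disk = initHashSet[string]()
q.queueAppend("m1")
doAssert not q.flush(false, disk) # backend down: m1 stays queued
q.queueEvict("m1") # evicted while the backend is unreachable
q.queueAppend("m1") # re-delivered: cancels the pending evict
doAssert "m1" notin q.evicts
doAssert q.flush(true, disk)
doAssert "m1" in disk
```

Under an "evict-wins" rule the second `queueAppend` would be dropped and `m1` would never reach the disk set.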

Resolves PR #72 review comments:
- #1 (delta loss on failed updateHistory) — R2 retry queue.
- #2 (cascade chattiness — N updateHistory calls per op) — queue collects
  cascaded entries, flushed as one batch.
- #3 (atomicity contract mismatch) — implementation now matches the
  documented "saveChannelMeta then updateHistory back-to-back" pairing.

Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests).
FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00


import results, std/[tables, sets, times]
import sds
import ./async_unittest
import ./in_memory_persistence

converter toParticipantID(s: string): SdsParticipantID =
  s.SdsParticipantID

const testChannel = "testChannel"

# Helper: build a ReliabilityManager wired to the in-memory snapshot
# persistence backend. The "V2" in the name is historical: phase 3 renamed
# PersistenceV2 to Persistence and removed the legacy interface.
proc newV2Manager(
    store: InMemoryStore, config = defaultConfig()
): ReliabilityManager =
  newReliabilityManager(
    participantId = "alice",
    config = config,
    persistence = newInMemoryPersistence(store),
  )
  .get()

suite "Persistence: write → restart → read-back":
  asyncTest "outgoing buffer survives restart":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    let wrapped = await rm1.wrapOutgoingMessage(@[1.byte, 2, 3], "msg-1", testChannel)
    check wrapped.isOk()
    check store.outgoing[testChannel].len == 1
    check "msg-1" in store.outgoing[testChannel]
    await rm1.cleanup()
    # Simulate restart: fresh manager, same backend.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let buf = await rm2.getOutgoingBuffer(testChannel)
    check buf.len == 1
    check buf[0].message.messageId == "msg-1"
    await rm2.cleanup()

  asyncTest "lamport clock survives restart":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    check (await rm1.updateLamportTimestamp(42, testChannel)).isOk()
    # updateLamportTimestamp is now pure; the mutation is persisted by the
    # next op-end save. Drive a wrap to force a trySaveMeta.
    discard await rm1.wrapOutgoingMessage(@[byte(1)], "tick", testChannel)
    # max(42,0)+1 then max(getTime().toUnix, 43)+1; whatever wrap sets is
    # what we'll see. We just assert it stayed monotonic.
    check store.lamports[testChannel] >= 43
    let savedLamport = store.lamports[testChannel]
    await rm1.cleanup()
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    check rm2.channels[testChannel].lamportTimestamp == savedLamport

  asyncTest "delivered messages survive restart and rebuild bloom":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    let msg = SdsMessage.init(
      messageId = "delivered-1",
      lamportTimestamp = 1,
      causalHistory = @[],
      channelId = testChannel,
      content = @[9.byte, 9],
      bloomFilter = @[],
      senderId = "alice",
    )
    check (await rm1.addToHistory(msg, testChannel)).isOk()
    # New design: addToHistory queues; tryUpdateHistory flushes. Tests
    # that drive addToHistory directly must follow with an explicit flush
    # (in production, the public protocol op issues the flush at op end).
    await rm1.tryUpdateHistory(testChannel)
    check store.log[testChannel].len == 1
    await rm1.cleanup()
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let ch = rm2.channels[testChannel]
    check ch.messageHistory.len == 1
    check "delivered-1" in ch.messageHistory
    # Bloom filter rebuilt from log on bootstrap.
    check ch.bloomFilter.contains("delivered-1")

  asyncTest "ack removes outgoing entry from persistence":
    let store = newInMemoryStore()
    let rm = newV2Manager(store)
    check (await rm.ensureChannel(testChannel)).isOk()
    discard await rm.wrapOutgoingMessage(@[1.byte], "msg-x", testChannel)
    check "msg-x" in store.outgoing[testChannel]
    # Synthesize an incoming message that ACKs msg-x via causal history.
    let ackMsg = SdsMessage.init(
      messageId = "ack-bearer",
      lamportTimestamp = 5,
      causalHistory = @[HistoryEntry.init("msg-x", @[])],
      channelId = testChannel,
      content = @[],
      bloomFilter = @[],
      senderId = "bob",
    )
    let serialized = serializeMessage(ackMsg).get()
    discard await rm.unwrapReceivedMessage(serialized)
    check "msg-x" notin store.outgoing[testChannel]
    await rm.cleanup()

  asyncTest "removeChannel issues exactly one dropChannel call and wipes all state":
    # Regression for PR #66 review: removal must be a single transactional
    # drop, not N per-row removes.
    let store = newInMemoryStore()
    let rm = newV2Manager(store)
    check (await rm.ensureChannel(testChannel)).isOk()
    discard await rm.wrapOutgoingMessage(@[1.byte], "msg-r", testChannel)
    check store.outgoing[testChannel].len == 1
    check store.lamports[testChannel] > 0
    check (await rm.removeChannel(testChannel)).isOk()
    check store.dropChannelCalls.getOrDefault(testChannel) == 1
    check testChannel notin store.outgoing
    check testChannel notin store.lamports
    check testChannel notin store.log
    check testChannel notin store.incoming
    check testChannel notin store.outgoingRepair
    check testChannel notin store.incomingRepair
    await rm.cleanup()

  asyncTest "noOpPersistence keeps existing manager working":
    # No backend is supplied, so the manager falls back to the default
    # no-op persistence.
    let rm = newReliabilityManager(participantId = "alice").get()
    check (await rm.ensureChannel(testChannel)).isOk()
    let wrapped = await rm.wrapOutgoingMessage(@[1.byte], "msg-n", testChannel)
    check wrapped.isOk()
    let buf = await rm.getOutgoingBuffer(testChannel)
    check buf.len == 1
    await rm.cleanup()

  asyncTest "continue operating after restart: lamport stays monotonic":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    discard await rm1.wrapOutgoingMessage(@[1.byte], "m1", testChannel)
    let lamportAfterSession1 = store.lamports[testChannel]
    check lamportAfterSession1 > 0
    await rm1.cleanup()
    # Restart and send another message: lamport must not regress.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    check rm2.channels[testChannel].lamportTimestamp == lamportAfterSession1
    discard await rm2.wrapOutgoingMessage(@[2.byte], "m2", testChannel)
    check store.lamports[testChannel] > lamportAfterSession1
    let buf = await rm2.getOutgoingBuffer(testChannel)
    check buf.len == 2
    await rm2.cleanup()

  asyncTest "multiple restart cycles preserve state":
    let store = newInMemoryStore()
    for i in 1 .. 3:
      let rm = newV2Manager(store)
      check (await rm.ensureChannel(testChannel)).isOk()
      discard await rm.wrapOutgoingMessage(@[byte(i)], "m" & $i, testChannel)
      await rm.cleanup()
    # Final session: all three messages must be in the buffer.
    let rmFinal = newV2Manager(store)
    check (await rmFinal.ensureChannel(testChannel)).isOk()
    let buf = await rmFinal.getOutgoingBuffer(testChannel)
    check buf.len == 3
    var ids = newSeq[string]()
    for unack in buf:
      ids.add(unack.message.messageId.string)
    check "m1" in ids
    check "m2" in ids
    check "m3" in ids
    await rmFinal.cleanup()

  asyncTest "incoming dep-waiting buffer survives restart with missingDeps intact":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    # Receive a message whose causal-history references an unknown predecessor.
    let depMsg = SdsMessage.init(
      messageId = "msg-with-deps",
      lamportTimestamp = 10,
      causalHistory = @[HistoryEntry.init("missing-dep", @[])],
      channelId = testChannel,
      content = @[7.byte],
      bloomFilter = @[],
      senderId = "carol",
    )
    let serialized = serializeMessage(depMsg).get()
    discard await rm1.unwrapReceivedMessage(serialized)
    check "msg-with-deps" in store.incoming[testChannel]
    await rm1.cleanup()
    # Restart: buffered message and its missing-deps set must be back.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let inbuf = await rm2.getIncomingBuffer(testChannel)
    check "msg-with-deps" in inbuf
    check "missing-dep" in inbuf["msg-with-deps"].missingDeps
    await rm2.cleanup()

  asyncTest "removeChannel + recreate does not inherit stale lamport":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    discard await rm1.wrapOutgoingMessage(@[1.byte], "m-old", testChannel)
    check store.lamports[testChannel] > 0
    check (await rm1.removeChannel(testChannel)).isOk()
    check testChannel notin store.lamports
    await rm1.cleanup()
    # Recreate the same channelId after a restart; it must start fresh.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    check rm2.channels[testChannel].lamportTimestamp == 0
    let buf = await rm2.getOutgoingBuffer(testChannel)
    check buf.len == 0
    await rm2.cleanup()

  asyncTest "SDS-R outgoing repair buffer survives restart with absolute t_req_at":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    let depMsg = SdsMessage.init(
      messageId = "msg-needs-repair",
      lamportTimestamp = 5,
      causalHistory = @[HistoryEntry.init("missing-dep", @[])],
      channelId = testChannel,
      content = @[1.byte],
      bloomFilter = @[],
      senderId = "bob",
    )
    discard await rm1.unwrapReceivedMessage(serializeMessage(depMsg).get())
    check "missing-dep" in store.outgoingRepair[testChannel]
    let originalTReqAt =
      store.outgoingRepair[testChannel]["missing-dep"].minTimeRepairReq
    check originalTReqAt.toUnix > 0
    await rm1.cleanup()
    # Restart: repair entry must be back with the SAME absolute time.
    # Codec serialises Time as int64 unix milliseconds (PLAN §1.5), so the
    # restored Time may differ by sub-millisecond precision from the
    # original. Compare at second resolution which is what the protocol
    # actually relies on.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let buf = rm2.channels[testChannel].outgoingRepairBuffer
    check "missing-dep" in buf
    check buf["missing-dep"].minTimeRepairReq.toUnix == originalTReqAt.toUnix
    await rm2.cleanup()

  asyncTest "FIFO eviction state survives restart":
    let store = newInMemoryStore()
    var smallCfg = defaultConfig()
    smallCfg.maxMessageHistory = 3
    smallCfg.bloomFilterCapacity = 3
    let rm1 = newV2Manager(store, smallCfg)
    check (await rm1.ensureChannel(testChannel)).isOk()
    # Add 5 delivered messages: the first 2 should be evicted by FIFO.
    for i in 1 .. 5:
      let m = SdsMessage.init(
        messageId = "m" & $i,
        lamportTimestamp = int64(i),
        causalHistory = @[],
        channelId = testChannel,
        content = @[byte(i)],
        bloomFilter = @[],
        senderId = "alice",
      )
      check (await rm1.addToHistory(m, testChannel)).isOk()
    await rm1.tryUpdateHistory(testChannel)
    check store.log[testChannel].len == 3
    check "m1" notin store.log[testChannel]
    check "m2" notin store.log[testChannel]
    await rm1.cleanup()
    # Restart: evicted entries must NOT come back; survivors keep order.
    let rm2 = newV2Manager(store, smallCfg)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let history = rm2.channels[testChannel].messageHistory
    check history.len == 3
    check "m1" notin history
    check "m2" notin history
    check "m3" in history
    check "m5" in history
    # FIFO continues correctly after restart: adding m6 evicts m3.
    let m6 = SdsMessage.init(
      messageId = "m6",
      lamportTimestamp = 6,
      causalHistory = @[],
      channelId = testChannel,
      content = @[6.byte],
      bloomFilter = @[],
      senderId = "alice",
    )
    check (await rm2.addToHistory(m6, testChannel)).isOk()
    await rm2.tryUpdateHistory(testChannel)
    check "m3" notin store.log[testChannel]
    check "m6" in store.log[testChannel]
    await rm2.cleanup()

  asyncTest "dep-clear cascade resumes correctly across a restart":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    # Receive c (deps on b), then b (deps on a). Both must buffer.
    let msgC = SdsMessage.init(
      messageId = "c",
      lamportTimestamp = 30,
      causalHistory = @[HistoryEntry.init("b", @[])],
      channelId = testChannel,
      content = @[3.byte],
      bloomFilter = @[],
      senderId = "carol",
    )
    let msgB = SdsMessage.init(
      messageId = "b",
      lamportTimestamp = 20,
      causalHistory = @[HistoryEntry.init("a", @[])],
      channelId = testChannel,
      content = @[2.byte],
      bloomFilter = @[],
      senderId = "bob",
    )
    discard await rm1.unwrapReceivedMessage(serializeMessage(msgC).get())
    discard await rm1.unwrapReceivedMessage(serializeMessage(msgB).get())
    check "c" in store.incoming[testChannel]
    check "b" in store.incoming[testChannel]
    await rm1.cleanup()
    # Restart: both still buffered with intact missingDeps.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let inbuf = await rm2.getIncomingBuffer(testChannel)
    check "c" in inbuf
    check "b" in inbuf
    # Now receive a (root): should cascade-deliver a, b, c.
    let msgA = SdsMessage.init(
      messageId = "a",
      lamportTimestamp = 10,
      causalHistory = @[],
      channelId = testChannel,
      content = @[1.byte],
      bloomFilter = @[],
      senderId = "alice",
    )
    discard await rm2.unwrapReceivedMessage(serializeMessage(msgA).get())
    let history = rm2.channels[testChannel].messageHistory
    check "a" in history
    check "b" in history
    check "c" in history
    let inbufFinal = await rm2.getIncomingBuffer(testChannel)
    check inbufFinal.len == 0
    await rm2.cleanup()

suite "Persistence: failure policy":
  asyncTest "loadChannel failure surfaces as rePersistenceError on bootstrap":
    # Bootstrap durability is the semantic intent of getOrCreateChannel:
    # the caller asked us to materialise a channel and we can't do that
    # without knowing prior state. So this op DOES propagate err on load
    # failure (PLAN §8).
    let store = newInMemoryStore()
    store.failingOps.incl("loadChannel")
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    let res = await rm.ensureChannel(testChannel)
    check res.isErr()
    check res.error == ReliabilityError.rePersistenceError

  asyncTest "saveChannelMeta failure during send does NOT surface — non-fatal policy":
    # PLAN §8: persistence failures during foreground ops are logged but
    # MUST NOT abort the op. The in-memory state is the source of truth;
    # the next op's snapshot will re-synchronise on-disk state. This test
    # is the inversion of the legacy "write failure surfaces as err" test;
    # the new policy is deliberate.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()
    store.failingOps.incl("saveChannelMeta")
    let res = await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    # Op succeeds: bytes were produced, protocol state is correct in
    # memory, the FFI caller is unaffected.
    check res.isOk()
    # In-memory state is correct even though disk save was rejected.
    let buf = await rm.getOutgoingBuffer(testChannel)
    check buf.len == 1
    check buf[0].message.messageId == "m1"
    # Recovery: clear the failure, drive another op, disk catches up.
    store.failingOps.excl("saveChannelMeta")
    let res2 = await rm.wrapOutgoingMessage(@[byte(2)], "m2", testChannel)
    check res2.isOk()
    check "m1" in store.outgoing[testChannel]
    check "m2" in store.outgoing[testChannel]

  asyncTest "updateHistory failure during send does NOT surface — non-fatal policy":
    # Same policy applied to the history-update path.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()
    store.failingOps.incl("updateHistory")
    let res = await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    check res.isOk()
    check rm.channels[testChannel].messageHistory.len == 1

  asyncTest "updateHistory failure is retried via R2 pending-write queue":
    # Fix for PR #72 review comment #1: a failed history write must not
    # silently drop the delta. The pending-write queue parks failed
    # entries and retries them on the next op end. Once the backend
    # recovers, the disk catches up automatically: no caller action
    # needed, no err surfaced.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()
    # Failure 1: send m1 while updateHistory is broken.
    store.failingOps.incl("updateHistory")
    discard await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    # In-memory state is correct; disk has no log entry for m1 yet.
    check rm.channels[testChannel].messageHistory.len == 1
    check testChannel notin store.log or "m1" notin store.log[testChannel]
    # Pending queue should be holding m1 for retry.
    check rm.channels[testChannel].pendingHistoryAppends.len == 1
    check "m1" in rm.channels[testChannel].pendingHistoryAppends
    # Failure 2: send m2 while still broken. Pending should now hold both.
    discard await rm.wrapOutgoingMessage(@[byte(2)], "m2", testChannel)
    check rm.channels[testChannel].pendingHistoryAppends.len == 2
    check "m1" in rm.channels[testChannel].pendingHistoryAppends
    check "m2" in rm.channels[testChannel].pendingHistoryAppends
    # Still nothing on disk.
    check testChannel notin store.log or store.log[testChannel].len == 0
    # Recovery: clear the backend failure, send m3. The op-end flush
    # should drain ALL pending entries plus the new one in a single call.
    store.failingOps.excl("updateHistory")
    discard await rm.wrapOutgoingMessage(@[byte(3)], "m3", testChannel)
    check rm.channels[testChannel].pendingHistoryAppends.len == 0
    check "m1" in store.log[testChannel]
    check "m2" in store.log[testChannel]
    check "m3" in store.log[testChannel]

  asyncTest "evict-then-re-add merge rule preserves the re-added message on disk":
    # Regression: with the original "evict-wins" merge rule, a message
    # re-added (e.g. via SDS-R repair) after being evicted during a
    # backend outage would have its append silently dropped because the
    # id was still in pendingHistoryEvicts. The "latest-wins" rule fixes
    # this: the re-add cancels the pending evict.
    let store = newInMemoryStore()
    var smallCfg = defaultConfig()
    smallCfg.maxMessageHistory = 2
    smallCfg.bloomFilterCapacity = 2
    let rm = newReliabilityManager(
        participantId = "alice",
        config = smallCfg,
        persistence = newInMemoryPersistence(store),
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    proc mkMsg(id: string, ts: int64): SdsMessage =
      SdsMessage.init(
        messageId = id,
        lamportTimestamp = ts,
        causalHistory = @[],
        channelId = testChannel,
        content = @[byte(ts)],
        bloomFilter = @[],
        senderId = "alice",
      )

    # Break the backend, then fill the channel past maxMessageHistory so
    # m1 gets evicted while we have no successful flush yet.
    store.failingOps.incl("updateHistory")
    check (await rm.addToHistory(mkMsg("m1", 1), testChannel)).isOk()
    await rm.tryUpdateHistory(testChannel) # fails, so m1 stays queued
    check (await rm.addToHistory(mkMsg("m2", 2), testChannel)).isOk()
    check (await rm.addToHistory(mkMsg("m3", 3), testChannel)).isOk()
    # m1 evicted by FIFO; pending should now have m2,m3 as appends and m1 as evict.
    check "m1" notin rm.channels[testChannel].messageHistory
    check "m1" in rm.channels[testChannel].pendingHistoryEvicts
    check "m1" notin rm.channels[testChannel].pendingHistoryAppends
    # SDS-R-style re-delivery of m1. With latest-wins, this MUST cancel
    # the pending evict and re-queue the append.
    check (await rm.addToHistory(mkMsg("m1", 4), testChannel)).isOk()
    check "m1" in rm.channels[testChannel].messageHistory
    check "m1" notin rm.channels[testChannel].pendingHistoryEvicts
    check "m1" in rm.channels[testChannel].pendingHistoryAppends
    # Recover and flush. m1 must land on disk.
    store.failingOps.excl("updateHistory")
    await rm.tryUpdateHistory(testChannel)
    check "m1" in store.log[testChannel]

  asyncTest "pending queue survives idle ops (flush on next op without history changes)":
    # Even if the next op makes no history changes of its own, it must
    # still flush the pending queue at op end. Otherwise a failed write
    # could sit indefinitely if the application only ever does
    # mark-deps-met-style ops after a failure.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()
    # Stage a pending entry by failing one send.
    store.failingOps.incl("updateHistory")
    discard await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    check rm.channels[testChannel].pendingHistoryAppends.len == 1
    # Now clear the failure and drive a markDependenciesMet on a no-op
    # input. It has no history changes of its own, but its op-end flush
    # must still retry the queue.
    store.failingOps.excl("updateHistory")
    check (await rm.markDependenciesMet(@["nonexistent"], testChannel)).isOk()
    check rm.channels[testChannel].pendingHistoryAppends.len == 0
    check "m1" in store.log[testChannel]

  asyncTest "dropChannel failure during removeChannel surfaces as rePersistenceError":
    # Durability is the semantic intent of removeChannel: the caller
    # asked us to confirm a disk wipe. We cannot silently lie. So this op
    # DOES propagate err on failure (PLAN §8).
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()
    store.failingOps.incl("dropChannel")
    let res = await rm.removeChannel(testChannel)
    check res.isErr()
    check res.error == ReliabilityError.rePersistenceError