Addresses all three substantive review findings on PR #72 in one
structural change: fold the per-op accumulator and the R2 retry buffer
into a single queue on `ChannelContext`, flushed once at op end.
Changes:
- sds/types/channel_context.nim: add `pendingHistoryAppends`
(`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts`
(`HashSet[SdsMessageID]`) fields. Only ids are stored — the full
SdsMessage is looked up from `messageHistory` at flush time. Documented
invariant: every id in pendingHistoryAppends is also in messageHistory,
upheld by the merge rule.
- sds/sds_utils.nim:
* `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel,
msgId)` — "latest-wins" merge: append cancels any pending evict
and vice versa. Symmetric, simple, handles the evict-then-re-add
sequence correctly (SDS-R repair re-delivering an evicted message
while the backend is unreachable).
* `tryUpdateHistory(rm, channelId)` — no more list params; flushes the
channel's pending queue. Dual role: per-op accumulator (multiple
`addToHistory` calls within one op queue together and flush as one
round-trip) AND R2 retry buffer (a failed flush leaves the queue
populated for the next op to retry).
* `addToHistory` queues via the helpers; does not call persistence.
* Pending queue cleared on `cleanup` and `removeChannel`.
- sds.nim:
* `processIncomingBuffer` returns to its single-arg signature — the
queue lives on the channel, no parameter threading needed.
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths),
`markDependenciesMet` issue exactly one `trySaveMeta` +
`tryUpdateHistory` pair at op end, under the lock, with no
intervening `await`-of-other-work. Matches the Persistence atomicity
contract documented in `sds/types/persistence.nim`.
* Pending queue cleared in `resetReliabilityManager`.
- tests/test_persistence.nim:
* Direct `addToHistory` callers (state-survival setup) now follow with
explicit `tryUpdateHistory(channelId)` to flush. Reflects the
production op-end flush pattern.
* New: `updateHistory failure is retried via R2 pending-write queue` —
verifies that two failed sends leave both messages on the queue,
and a third successful send drains the whole queue in one call.
* New: `pending queue survives idle ops` — verifies that an op with
no history changes of its own still flushes a previously-failed
batch at op end.
* New: `evict-then-re-add merge rule preserves the re-added message
on disk` — regression for the "latest-wins" merge rule. The original
"evict-wins" rule would silently drop the re-add and leave the
message permanently absent from disk; this test would fail under
that rule and passes under the corrected one.
Resolves PR #72 review comments:
- #1 (delta loss on failed updateHistory) — R2 retry queue.
- #2 (cascade chattiness — N updateHistory calls per op) — queue collects
cascaded entries, flushed as one batch.
- #3 (atomicity contract mismatch) — implementation now matches the
documented "saveChannelMeta then updateHistory back-to-back" pairing.
Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests).
FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End-state of the snapshot-persistence refactor. The legacy 13-proc
Persistence interface and its noOpPersistence are gone; the 5-proc
snapshot-based interface (formerly PersistenceV2) takes their place under
the canonical name.
Source:
- sds/types/persistence.nim: replaced 13-proc contract with the 5-proc
snapshot interface (saveChannelMeta, updateHistory, loadChannel,
dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere
and an empty ChannelData on load.
- sds/types/persistence_v2.nim: removed.
- sds/types/reliability_manager.nim: dropped the second persistenceV2
field; constructor takes a single `persistence: Persistence`.
- sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments
updated.
- sds.nim: dropped the persistenceV2 parameter from newReliabilityManager.
Tests:
- tests/in_memory_persistence_v2.nim: removed; its content moved to...
- tests/in_memory_persistence.nim: replaces the old legacy mock with the
snapshot adapter under the canonical filename. Same InMemoryStore
shape so test assertions stay unchanged.
- tests/test_persistence.nim: ctor param renamed, suite name de-prefixed.
FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean.
All 4 test suites pass:
- test_bloom
- test_reliability
- test_persistence (17 V2 tests)
- test_snapshot_codec (13 codec round-trip tests)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End-state of phase 2: the protocol code no longer issues any legacy
fine-grained Persistence calls. All state survives via the snapshot-based
PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory
batched inside addToHistory. The legacy Persistence field on
ReliabilityManager remains for backwards compatibility; phase 3 deletes it.
Protocol changes (sds.nim, sds/sds_utils.nim):
- reviewAckStatus, processIncomingBuffer, updateLamportTimestamp →
pure in-memory; no per-mutation persistence.
- addToHistory: replaces appendLogEntry+removeLogEntry with a single
tryUpdateHistory call carrying (append, evict) atomically.
- getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal.
- wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet:
all per-row saveOutgoing / removeOutgoing / saveIncoming /
removeIncoming / saveOutgoingRepair / removeOutgoingRepair /
saveIncomingRepair / removeIncomingRepair calls removed (16 call
sites in total). State is captured by the op-end trySaveMeta added
in phase 2A.
- getOrCreateChannel: bootstraps from persistenceV2.loadChannel.
- dropChannelFromPersistence: uses persistenceV2.dropChannel.
Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8):
- Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal —
trySaveMeta / tryUpdateHistory log and continue; the protocol op
returns ok regardless of disk failure. In-memory state is the source
of truth; the next op re-issues a complete snapshot and disk catches
up automatically.
- Durability-intent ops (removeChannel, resetReliabilityManager via
dropChannelFromPersistence; getOrCreateChannel via loadChannel):
still propagate rePersistenceError, because the caller asked us to
confirm a disk operation and we cannot silently lie.
Test infrastructure:
- tests/in_memory_persistence_v2.nim: new V2 adapter mock that
decomposes the meta blob into the existing InMemoryStore shape so
test assertions on store.outgoing / store.incoming / etc. continue to
work without change.
- tests/test_persistence.nim: 17 tests, all rewritten against V2.
- 13 state-survival tests carry over with identical assertions.
- "loadChannel failure surfaces as err on bootstrap" — bootstrap
keeps durability-intent semantics.
- "saveChannelMeta failure during send does NOT surface" — deliberate
inversion of the legacy "write failure surfaces as err" test. Asserts
the new non-fatal policy: op returns ok, in-memory state correct,
disk re-syncs on the next op.
- "updateHistory failure during send does NOT surface" — same policy
applied to the history path.
- "dropChannel failure during removeChannel surfaces as err" — kept.
- All 17 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Persistence contract previously returned `Future[void]` for writes and
`Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no
way to report a failure, so a failed write or a failed/partial read was
silently swallowed — and on the read path a mid-scan failure could bootstrap
a *truncated* channel snapshot, corrupting the rebuilt bloom filter and
lamport clock across a restart.
Make every contract field Result-returning:
* mutating ops -> Future[Result[void, string]]
* loadAllForChannel -> Future[Result[ChannelSnapshot, string]]
The backend-supplied error string is mapped to a new
`ReliabilityError.rePersistenceError` (logged once at the boundary via
`reliabilityErr`) and threaded up through every persistence-touching proc to
the public API, where the caller decides what to do. Request-driven paths
(wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate
the error; background maintenance loops (periodicBufferSweep,
periodicRepairSweep) log and retry on the next tick, since they have no
synchronous caller.
Tests: in-memory backend gains a `failingOps` injection hook; new
"Persistence: error propagation" suite asserts read/write/drop failures
surface as `rePersistenceError`. Full suite passes (90 OK).
BREAKING CHANGE: the `Persistence` contract signature changed; custom
backends must return `Result` and `ok()` on success. Bumped to 0.3.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat: make Persistence interface async
The 14 Persistence proc fields now return Future[...] with
{.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite,
encrypted file, network) to suspend rather than block the Chronos event
loop the manager runs on.
Propagates through:
- ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired
across awaits cleanly; matches the single-threaded Chronos worker the
FFI uses. Multi-OS-thread use is now explicitly the caller's
responsibility.
- sds_utils + sds.nim public API procs (wrapOutgoingMessage,
unwrapReceivedMessage, markDependenciesMet, setCallbacks,
resetReliabilityManager, cleanup, ensureChannel, removeChannel, the
getter snapshots, etc.) are now async.
- FFI request handlers in library/sds_thread/... await the new API.
- Tests converted via an asyncTest template that wraps each test body
in an async proc; setup/teardown use waitFor for their single async
call (ensureChannel / cleanup).
Lock scope is preserved exactly: the same call sites that held the
kernel Lock today hold AsyncLock now -- no new locking added.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* refactor: drop asyncSpawn, add asyncSetup/asyncTeardown
Three asyncSpawn usages removed:
- sds.nim startPeriodicTasks: stored the periodic-task futures on
ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so
cleanup can cancel them on shutdown instead of leaking the loops
against a cleared manager.
- library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing,
then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing
it. Aligns the worker with the SP-channel + lock assumption that
there are no concurrent requests; caller throughput is unchanged
because the caller only waits for receipt (fireSync), not processing.
- tests TestBus repair callback: replaced asyncSpawn(deliverExcept...)
with an explicit pending-delivery queue drained by `bus.drain()`.
Integration tests no longer rely on `sleepAsync(10ms)` to let
spawned deliveries finish — they await drain instead.
Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim)
so suite fixtures can `await` directly. All `waitFor` in setup/teardown
blocks is gone; only the top-level asyncTest wrapper still uses waitFor
(once, to drive the async proc to completion).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Correctly propagate error hidden by new async move
* Correctly handle future cancellation exceptions, +some housekeeping
* Apply suggestion from @Ivansete-status
Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
* Stylistics, async default implication addressed, nph style run
* Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>