nim-sds/sds/sds_utils.nim

503 lines
20 KiB
Nim
Raw Normal View History

refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
import std/[times, tables, sequtils, sets, hashes]
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
import chronos, chronicles, results
import ./rolling_bloom_filter
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
import
./types/[
sds_message_id, history_entry, sds_message, unacknowledged_message,
incoming_message, reliability_error, callbacks, app_callbacks, reliability_config,
repair_entry, channel_context, reliability_manager,
]
export
sds_message_id, history_entry, sds_message, unacknowledged_message, incoming_message,
reliability_error, callbacks, app_callbacks, reliability_config, repair_entry,
channel_context, reliability_manager
proc defaultConfig*(): ReliabilityConfig =
return ReliabilityConfig.init()
proc snapshotMeta(channel: ChannelContext): ChannelMeta {.gcsafe, raises: [].} =
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
## Captures the current in-memory state of a `ChannelContext` as a
## `ChannelMeta` blob, suitable for `Persistence.saveChannelMeta`.
##
## The in-memory shape uses `Table`-keyed buffers for fast lookup;
## `ChannelMeta` flattens them to `seq`s for stable wire serialization
## (see PLAN §6). The bloom filter and message history are intentionally
## excluded — the former is rebuilt from the latter on bootstrap, and
## the latter is persisted separately via `updateHistory`.
var meta = ChannelMeta.init()
meta.lamportTimestamp = channel.lamportTimestamp
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for u in channel.outgoingBuffer:
meta.outgoingBuffer.add(u)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for _, m in channel.incomingBuffer.pairs:
meta.incomingBuffer.add(m)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for id, e in channel.outgoingRepairBuffer.pairs:
meta.outgoingRepairBuffer.add(OutgoingRepairKV(messageId: id, entry: e))
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for id, e in channel.incomingRepairBuffer.pairs:
meta.incomingRepairBuffer.add(IncomingRepairKV(messageId: id, entry: e))
return meta
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
proc trySaveMeta*(
rm: ReliabilityManager, channelId: SdsChannelID, channel: ChannelContext
) {.async: (raises: []).} =
## Best-effort meta snapshot save. Per PLAN §8 the protocol op does NOT
## abort on persistence failure — in-memory state is the source of truth
## and the next op's snapshot will re-synchronise on-disk state.
##
## This helper is the single point where snapshot-save failures are
## logged; callers do not need to handle the Result.
(await rm.persistence.saveChannelMeta(channelId, snapshotMeta(channel))).isOkOr:
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
warn "snapshot save failed; in-memory state authoritative, next op will retry",
channelId = channelId, detail = error
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
proc queueHistoryAppend(channel: ChannelContext, msgId: SdsMessageID) =
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
## Push an append onto the pending history queue. Only the id is
## stored — the full SdsMessage is looked up from `messageHistory` at
## flush time (invariant: every queued id is present in messageHistory).
##
## Merge rule: **latest operation wins.** Cancels any pending evict for
## the same id, then adds. Handles the evict-then-re-add sequence
## correctly (e.g. SDS-R repair re-delivers a previously-evicted
## message while the backend is unreachable).
channel.pendingHistoryEvicts.excl(msgId)
channel.pendingHistoryAppends.incl(msgId)
proc queueHistoryEvict(channel: ChannelContext, msgId: SdsMessageID) =
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
## Push an evict onto the pending history queue. Merge rule symmetric
## with `queueHistoryAppend`: cancels any pending append for the same
## id (the just-evicted message no longer needs to be persisted as an
## addition), then adds to the evict set.
channel.pendingHistoryAppends.excl(msgId)
channel.pendingHistoryEvicts.incl(msgId)
proc tryUpdateHistory*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
) {.async: (raises: []).} =
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
## Flush the channel's pending history queue to disk.
##
## The pending queue (`channel.pendingHistoryAppends` /
## `pendingHistoryEvicts`) plays a DUAL role — and that's deliberate:
## 1. **Per-op accumulator.** Every `addToHistory` call pushes its
## mutation into this queue but does NOT persist. A protocol op
## that invokes `addToHistory` N times (e.g. a
## `processIncomingBuffer` cascade) leaves N entries queued and
## issues exactly ONE `tryUpdateHistory` at op end — one
## round-trip per op regardless of cascade depth. This fixes PR
## #72 review comments #2 and #3.
## 2. **R2 retry queue.** If the flush fails, the queue is NOT
## cleared. The next op's `addToHistory` calls add to it; the
## next op's `tryUpdateHistory` retries the merged batch. This
## fixes PR #72 review comment #1 (delta loss).
##
## Both roles share the same data structure because they want the same
## semantics: "merge everything pending into one batch and try to
## flush". Failure is non-fatal at the FFI boundary (PLAN §8) — the
## in-memory state is the source of truth.
##
## Callers MUST invoke this once at the end of every protocol op (even
## when this op had no history changes) — otherwise a previously-failed
## batch could sit on the queue indefinitely.
var channel: ChannelContext
try:
if channelId notin rm.channels:
return
channel = rm.channels[channelId]
except KeyError:
return # checked `in` above; unreachable, but tables can raise per spec
if channel.pendingHistoryAppends.len == 0 and channel.pendingHistoryEvicts.len == 0:
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
return # nothing to flush — no round-trip cost
var batch = HistoryUpdate.init()
# Look up each queued id in messageHistory (source of truth). The
# invariant on pendingHistoryAppends guarantees the id is present;
# the defensive check below logs any violation rather than crashing.
for id in channel.pendingHistoryAppends:
try:
if id in channel.messageHistory:
batch.append.add(channel.messageHistory[id])
else:
warn "queued append id missing from messageHistory; invariant violated, skipping",
channelId = channelId, msgId = id
except KeyError:
discard # unreachable — `in` was true
for id in channel.pendingHistoryEvicts:
batch.evict.add(id)
let res = await rm.persistence.updateHistory(channelId, batch)
if res.isOk:
channel.pendingHistoryAppends.clear()
channel.pendingHistoryEvicts.clear()
else:
warn "history update failed; queued for retry on next op",
channelId = channelId,
pendingAppends = channel.pendingHistoryAppends.len,
pendingEvicts = channel.pendingHistoryEvicts.len,
detail = res.error
if channel.pendingHistoryAppends.len > rm.config.maxMessageHistory:
warn "pending history queue exceeds maxMessageHistory; backend may be stuck",
channelId = channelId, pendingAppends = channel.pendingHistoryAppends.len
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
proc dropChannelFromPersistence*(
rm: ReliabilityManager, channelId: SdsChannelID
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
## Wipes all persisted state for a channel via a single backend call.
## Called by removeChannel / resetReliabilityManager before they clear
## in-memory state. Backend executes the wipe in one transaction.
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
##
## Phase 2D: uses `persistenceV2.dropChannel`. This op DOES propagate
## err on failure (durability is the semantic intent — the caller asked
## us to confirm a disk wipe; we cannot silently lie). See PLAN §8.
(await rm.persistence.dropChannel(channelId)).isOkOr:
warn "persistence operation failed", cause = error
return err(ReliabilityError.rePersistenceError)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
ok()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
proc cleanup*(rm: ReliabilityManager) {.async: (raises: []).} =
## Releases in-memory state. Does NOT wipe persistence — the manager may be
## reconstructed against the same backend after cleanup, so disk state must
## survive. For deliberate disk wipe, use `removeChannel` or
## `resetReliabilityManager`.
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
##
## Periodic tasks are cancelled BEFORE acquiring the lock so that a task
## currently blocked on `lock.acquire()` can unwind via CancelledError
## without deadlocking against cleanup itself.
if rm.isNil():
return
for task in rm.periodicTasks:
if not task.finished:
await task.cancelAndWait()
rm.periodicTasks.setLen(0)
try:
await rm.lock.acquire()
try:
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
for channelId, channel in rm.channels:
channel.outgoingBuffer.setLen(0)
channel.incomingBuffer.clear()
channel.messageHistory.clear()
channel.outgoingRepairBuffer.clear()
channel.incomingRepairBuffer.clear()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
channel.pendingHistoryAppends.clear()
channel.pendingHistoryEvicts.clear()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
rm.channels.clear()
finally:
rm.lock.release()
except CatchableError:
error "Error during cleanup", error = getCurrentExceptionMsg()
proc cleanBloomFilter*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
) {.async: (raises: []).} =
try:
await rm.lock.acquire()
try:
if channelId in rm.channels:
rm.channels[channelId].bloomFilter.clean()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
finally:
rm.lock.release()
except CatchableError:
error "Failed to clean bloom filter",
error = getCurrentExceptionMsg(), channelId = channelId
proc addToHistory*(
rm: ReliabilityManager, msg: SdsMessage, channelId: SdsChannelID
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
## Inserts a delivered message into the channel's history map, evicts
## the eldest entries past `maxMessageHistory`, and queues the resulting
## append+evict on the channel's pending-history queue. Does NOT issue
## a persistence call — the caller's op-end `tryUpdateHistory` flushes
## the queue in one round-trip.
##
## A cascade of N unblocked messages (e.g. `processIncomingBuffer`)
## therefore leaves N entries queued and triggers ONE persistence call
## at op end, not N. Fixes PR #72 review #2/#3.
##
## Direct callers (tests, ad-hoc) that want the disk write to land
## immediately should follow this with `await rm.tryUpdateHistory(channelId)`.
try:
if channelId in rm.channels:
let channel = rm.channels[channelId]
channel.messageHistory[msg.messageId] = msg
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
queueHistoryAppend(channel, msg.messageId)
while channel.messageHistory.len > rm.config.maxMessageHistory:
var firstKey: SdsMessageID
for k in channel.messageHistory.keys:
firstKey = k
break
channel.messageHistory.del(firstKey)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
queueHistoryEvict(channel, firstKey)
ok()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
except CatchableError:
error "Failed to add to history",
channelId = channelId, msgId = msg.messageId, error = getCurrentExceptionMsg()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
err(ReliabilityError.reInternalError)
proc updateLamportTimestamp*(
rm: ReliabilityManager, msgTs: int64, channelId: SdsChannelID
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
## Pure in-memory update (phase 2B). The new lamport value is captured
## by the op-end `trySaveMeta` issued by the calling protocol op; no
## per-mutation persistence call here.
try:
if channelId in rm.channels:
let channel = rm.channels[channelId]
channel.lamportTimestamp = max(msgTs, channel.lamportTimestamp) + 1
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
ok()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
except CatchableError:
error "Failed to update lamport timestamp",
channelId = channelId, msgTs = msgTs, error = getCurrentExceptionMsg()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
err(ReliabilityError.reInternalError)
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
proc newHistoryEntry*(
messageId: SdsMessageID, retrievalHint: seq[byte] = @[]
): HistoryEntry =
return HistoryEntry.init(messageId, retrievalHint)
proc toCausalHistory*(messageIds: seq[SdsMessageID]): seq[HistoryEntry] =
return messageIds.mapIt(newHistoryEntry(it))
proc getMessageIds*(causalHistory: seq[HistoryEntry]): seq[SdsMessageID] =
return causalHistory.mapIt(it.messageId)
## SDS-R: Repair computation functions
proc computeTReq*(
participantId: SdsParticipantID,
messageId: SdsMessageID,
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
tMin: times.Duration,
tMax: times.Duration,
): times.Duration =
## Computes the repair request backoff duration per SDS-R spec:
## T_req = hash(participant_id, message_id) % (T_max - T_min) + T_min
let h = abs(hash(participantId.string & messageId))
let rangeMs = tMax.inMilliseconds - tMin.inMilliseconds
if rangeMs <= 0:
return tMin
let offsetMs = h mod rangeMs
initDuration(milliseconds = tMin.inMilliseconds + offsetMs)
proc computeTResp*(
participantId: SdsParticipantID,
senderId: SdsParticipantID,
messageId: SdsMessageID,
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
tMax: times.Duration,
): times.Duration =
## Computes the repair response backoff duration per SDS-R spec:
## distance = hash(participant_id) XOR hash(sender_id)
## T_resp = distance * hash(message_id) % T_max
## Original sender has distance=0, so T_resp=0 (responds immediately).
let distance = abs(hash(participantId) xor hash(senderId))
let msgHash = abs(hash(messageId))
let tMaxMs = tMax.inMilliseconds
if tMaxMs <= 0 or distance == 0:
return initDuration(milliseconds = 0)
# Use uint64 to avoid overflow on multiplication
let d = uint64(distance mod tMaxMs)
let m = uint64(msgHash mod tMaxMs)
let offsetMs = int64((d * m) mod uint64(tMaxMs))
initDuration(milliseconds = offsetMs)
proc isInResponseGroup*(
participantId: SdsParticipantID,
senderId: SdsParticipantID,
messageId: SdsMessageID,
numResponseGroups: int,
): bool =
## Determines if this participant is in the response group for a given message per SDS-R spec:
## hash(participant_id, message_id) % num_groups == hash(sender_id, message_id) % num_groups
if numResponseGroups <= 1:
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
return true # All participants in the same group
let myGroup = abs(hash(participantId.string & messageId)) mod numResponseGroups
let senderGroup = abs(hash(senderId.string & messageId)) mod numResponseGroups
myGroup == senderGroup
proc getRecentHistoryEntries*(
rm: ReliabilityManager, n: int, channelId: SdsChannelID
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
): Future[Result[seq[HistoryEntry], ReliabilityError]] {.async: (raises: []).} =
## Get recent history entries for sending in causal history.
## Populates retrieval hints and senderId (SDS-R) for each entry.
try:
if channelId in rm.channels:
let channel = rm.channels[channelId]
var orderedIds: seq[SdsMessageID] = @[]
for msgId in channel.messageHistory.keys:
orderedIds.add(msgId)
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
let recentMessageIds = orderedIds[max(0, orderedIds.len - n) .. ^1]
var entries: seq[HistoryEntry] = @[]
for msgId in recentMessageIds:
var entry = HistoryEntry(messageId: msgId)
if not rm.onRetrievalHint.isNil():
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
{.cast(raises: []).}:
entry.retrievalHint = rm.onRetrievalHint(msgId)
entry.senderId = channel.messageHistory[msgId].senderId
entries.add(entry)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
ok(entries)
else:
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
ok(newSeq[HistoryEntry]())
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
except CatchableError:
error "Failed to get recent history entries",
channelId = channelId, n = n, error = getCurrentExceptionMsg()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
err(ReliabilityError.reInternalError)
proc checkDependencies*(
rm: ReliabilityManager, deps: seq[HistoryEntry], channelId: SdsChannelID
): seq[HistoryEntry] =
var missingDeps: seq[HistoryEntry] = @[]
try:
if channelId in rm.channels:
let channel = rm.channels[channelId]
for dep in deps:
if dep.messageId notin channel.messageHistory:
missingDeps.add(dep)
else:
missingDeps = deps
except Exception:
error "Failed to check dependencies",
channelId = channelId, error = getCurrentExceptionMsg()
missingDeps = deps
return missingDeps
proc getMessageHistory*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
): Future[seq[SdsMessageID]] {.async: (raises: []).} =
try:
await rm.lock.acquire()
try:
if channelId in rm.channels:
var ids: seq[SdsMessageID] = @[]
for msgId in rm.channels[channelId].messageHistory.keys:
ids.add(msgId)
return ids
else:
return @[]
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
finally:
rm.lock.release()
except CatchableError:
error "Failed to get message history",
channelId = channelId, error = getCurrentExceptionMsg()
return @[]
proc getOutgoingBuffer*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
): Future[seq[UnacknowledgedMessage]] {.async: (raises: []).} =
try:
await rm.lock.acquire()
try:
if channelId in rm.channels:
return rm.channels[channelId].outgoingBuffer
else:
return @[]
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
finally:
rm.lock.release()
except CatchableError:
error "Failed to get outgoing buffer",
channelId = channelId, error = getCurrentExceptionMsg()
return @[]
proc getIncomingBuffer*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
): Future[Table[SdsMessageID, IncomingMessage]] {.async: (raises: []), gcsafe.} =
try:
await rm.lock.acquire()
try:
if channelId in rm.channels:
return rm.channels[channelId].incomingBuffer
else:
return initTable[SdsMessageID, IncomingMessage]()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
finally:
rm.lock.release()
except CatchableError:
error "Failed to get incoming buffer",
channelId = channelId, error = getCurrentExceptionMsg()
return initTable[SdsMessageID, IncomingMessage]()
proc getOrCreateChannel*(
rm: ReliabilityManager, channelId: SdsChannelID
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
): Future[Result[ChannelContext, ReliabilityError]] {.async: (raises: []).} =
## Returns the channel context, creating and bootstrapping it from the
## persistence backend if it does not yet exist in memory. The bloom filter
## is rebuilt deterministically from the loaded message history rather than
## persisted directly. Caller is expected to hold rm.lock.
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
##
## Phase 2C: bootstrap via `persistenceV2.loadChannel`. Bootstrap DOES
## propagate err on load failure — the caller asked us to materialise a
## channel and we cannot do that without knowing the prior state. See
## PLAN §8.
try:
if channelId notin rm.channels:
let channel = ChannelContext.new(
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
RollingBloomFilter.init(
rm.config.bloomFilterCapacity, rm.config.bloomFilterErrorRate
)
)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
let data = (await rm.persistence.loadChannel(channelId)).valueOr:
warn "persistence operation failed", cause = error
return err(ReliabilityError.rePersistenceError)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
channel.lamportTimestamp = data.meta.lamportTimestamp
# Backend contract: messageHistory MUST be ordered oldest-first.
# If a backend violates this, FIFO eviction breaks across restarts.
for msg in data.messageHistory:
channel.messageHistory[msg.messageId] = msg
channel.bloomFilter.add(msg.messageId)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for unack in data.meta.outgoingBuffer:
channel.outgoingBuffer.add(unack)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for incoming in data.meta.incomingBuffer:
channel.incomingBuffer[incoming.message.messageId] = incoming
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
for kv in data.meta.outgoingRepairBuffer:
channel.outgoingRepairBuffer[kv.messageId] = kv.entry
for kv in data.meta.incomingRepairBuffer:
channel.incomingRepairBuffer[kv.messageId] = kv.entry
rm.channels[channelId] = channel
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
ok(rm.channels[channelId])
except CatchableError:
error "Failed to get or create channel",
channelId = channelId, error = getCurrentExceptionMsg()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
err(ReliabilityError.reInternalError)
proc ensureChannel*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
try:
await rm.lock.acquire()
try:
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
(await rm.getOrCreateChannel(channelId)).isOkOr:
return err(error)
return ok()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
finally:
rm.lock.release()
except CatchableError:
error "Failed to ensure channel (lock)",
channelId = channelId, msg = getCurrentExceptionMsg()
return err(ReliabilityError.reInternalError)
proc removeChannel*(
rm: ReliabilityManager, channelId: SdsChannelID
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
try:
await rm.lock.acquire()
try:
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
try:
if channelId in rm.channels:
let channel = rm.channels[channelId]
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
(await rm.dropChannelFromPersistence(channelId)).isOkOr:
return err(error)
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
channel.outgoingBuffer.setLen(0)
channel.incomingBuffer.clear()
channel.messageHistory.clear()
channel.outgoingRepairBuffer.clear()
channel.incomingRepairBuffer.clear()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. - addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. - "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
channel.pendingHistoryAppends.clear()
channel.pendingHistoryEvicts.clear()
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
rm.channels.del(channelId)
return ok()
except CatchableError:
error "Failed to remove channel",
channelId = channelId, msg = getCurrentExceptionMsg()
return err(ReliabilityError.reInternalError)
finally:
rm.lock.release()
except CatchableError:
error "Failed to remove channel (lock)",
channelId = channelId, msg = getCurrentExceptionMsg()
return err(ReliabilityError.reInternalError)