nim-sds/tests/test_persistence.nim

557 lines
22 KiB
Nim

feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. 
Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
import results, std/[tables, sets, times]
import sds
import ./async_unittest
import ./in_memory_persistence
converter toParticipantID(s: string): SdsParticipantID =
  s.SdsParticipantID

const testChannel = "testChannel"
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
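# Illustration only (not part of the original test file): a minimal sketch of
# the "latest-wins" merge rule described in the #72 commit message above.
# Everything here is hypothetical: `PendingHistorySketch`, `sketchQueueAppend`
# and `sketchQueueEvict` are stand-ins for the real `pendingHistoryAppends` /
# `pendingHistoryEvicts` fields and the `queueHistoryAppend` /
# `queueHistoryEvict` helpers, and plain strings stand in for SdsMessageID.
import std/sets

type PendingHistorySketch = object
  appends: OrderedSet[string] # ids waiting to be appended, in insertion order
  evicts: HashSet[string] # ids waiting to be evicted

proc sketchQueueAppend(p: var PendingHistorySketch, id: string) =
  # An append cancels any pending evict for the same id.
  p.evicts.excl(id)
  p.appends.incl(id)

proc sketchQueueEvict(p: var PendingHistorySketch, id: string) =
  # An evict cancels any pending append for the same id.
  p.appends.excl(id)
  p.evicts.incl(id)

block latestWinsSketch:
  var pending = PendingHistorySketch()
  # Evict-then-re-add: the later append must win, so the id ends up queued
  # for append only.
  pending.sketchQueueEvict("m1")
  pending.sketchQueueAppend("m1")
  doAssert "m1" in pending.appends
  doAssert "m1" notin pending.evicts
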
# Helper: build a ReliabilityManager wired only to the V2 in-memory
# persistence (no legacy backend). Mirrors how production callers will
# construct the manager once phase 3 deletes the legacy field.
proc newV2Manager(
    store: InMemoryStore, config = defaultConfig()
): ReliabilityManager =
  newReliabilityManager(
    participantId = "alice",
    config = config,
    persistence = newInMemoryPersistence(store),
  )
  .get()

suite "Persistence: write → restart → read-back":
  asyncTest "outgoing buffer survives restart":
    let store = newInMemoryStore()
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
# First manager: send one message and confirm it reaches the backing store.
let rm1 = newV2Manager(store)
check (await rm1.ensureChannel(testChannel)).isOk()
let wrapped = await rm1.wrapOutgoingMessage(@[1.byte, 2, 3], "msg-1", testChannel)
check wrapped.isOk()
check store.outgoing[testChannel].len == 1
check "msg-1" in store.outgoing[testChannel]
await rm1.cleanup()
# Simulate restart: fresh manager, same backend.
let rm2 = newV2Manager(store)
check (await rm2.ensureChannel(testChannel)).isOk()
let buf = await rm2.getOutgoingBuffer(testChannel)
check buf.len == 1
check buf[0].message.messageId == "msg-1"
await rm2.cleanup()
asyncTest "lamport clock survives restart":
  let store = newInMemoryStore()
  let rm1 = newV2Manager(store)
  check (await rm1.ensureChannel(testChannel)).isOk()
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
check (await rm1.updateLamportTimestamp(42, testChannel)).isOk()
# updateLamportTimestamp is now purely in-memory; the mutation is persisted by
# the next op-end save. Drive a wrap to force a trySaveMeta.
discard await rm1.wrapOutgoingMessage(@[byte(1)], "tick", testChannel)
# Expected: max(42, 0) + 1 = 43, then max(getTime().toUnix, 43) + 1 after the wrap.
# The exact value depends on what wrap sets, so we only assert it stayed monotonic.
check store.lamports[testChannel] >= 43
let savedLamport = store.lamports[testChannel]
await rm1.cleanup()
let rm2 = newV2Manager(store)
check (await rm2.ensureChannel(testChannel)).isOk()
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
    check rm2.channels[testChannel].lamportTimestamp == savedLamport
  asyncTest "delivered messages survive restart and rebuild bloom":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
check (await rm1.ensureChannel(testChannel)).isOk()
let msg = SdsMessage.init(
  messageId = "delivered-1",
  lamportTimestamp = 1,
  causalHistory = @[],
  channelId = testChannel,
  content = @[9.byte, 9],
  bloomFilter = @[],
  senderId = "alice",
)
check (await rm1.addToHistory(msg, testChannel)).isOk()
# New design: addToHistory queues; tryUpdateHistory flushes. Tests
# that drive addToHistory directly must follow with an explicit flush
# (in production, the public protocol op issues the flush at op end).
await rm1.tryUpdateHistory(testChannel)
check store.log[testChannel].len == 1
await rm1.cleanup()
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let ch = rm2.channels[testChannel]
    check ch.messageHistory.len == 1
    check "delivered-1" in ch.messageHistory
    # Bloom filter rebuilt from log on bootstrap.
    check ch.bloomFilter.contains("delivered-1")
  asyncTest "ack removes outgoing entry from persistence":
    let store = newInMemoryStore()
2026-06-01 12:24:38 +02:00
let rm = newV2Manager(store)
check (await rm.ensureChannel(testChannel)).isOk()
discard await rm.wrapOutgoingMessage(@[1.byte], "msg-x", testChannel)
check "msg-x" in store.outgoing[testChannel]
# Synthesize an incoming message that ACKs msg-x via causal history.
let ackMsg = SdsMessage.init(
  messageId = "ack-bearer",
  lamportTimestamp = 5,
  causalHistory = @[HistoryEntry.init("msg-x", @[])],
  channelId = testChannel,
  content = @[],
  bloomFilter = @[],
  senderId = "bob",
)
let serialized = serializeMessage(ackMsg).get()
discard await rm.unwrapReceivedMessage(serialized)
check "msg-x" notin store.outgoing[testChannel]
await rm.cleanup()
asyncTest "removeChannel issues exactly one dropChannel call and wipes all state":
# Regression for PR #66 review: removal must be a single transactional
# drop, not N per-row removes.
let store = newInMemoryStore()
let rm = newV2Manager(store)
check (await rm.ensureChannel(testChannel)).isOk()
discard await rm.wrapOutgoingMessage(@[1.byte], "msg-r", testChannel)
check store.outgoing[testChannel].len == 1
check store.lamports[testChannel] > 0
check (await rm.removeChannel(testChannel)).isOk()
check store.dropChannelCalls.getOrDefault(testChannel) == 1
check testChannel notin store.outgoing
check testChannel notin store.lamports
check testChannel notin store.log
check testChannel notin store.incoming
check testChannel notin store.outgoingRepair
check testChannel notin store.incomingRepair
await rm.cleanup()
asyncTest "noOpPersistence keeps existing manager working":
let rm = newReliabilityManager(participantId = "alice").get()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
# No persistence argument is passed, so the manager uses the default no-op backend.
check (await rm.ensureChannel(testChannel)).isOk()
let wrapped = await rm.wrapOutgoingMessage(@[1.byte], "msg-n", testChannel)
check wrapped.isOk()
let buf = await rm.getOutgoingBuffer(testChannel)
check buf.len == 1
await rm.cleanup()
asyncTest "continue operating after restart: lamport stays monotonic":
let store = newInMemoryStore()
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
  let rm1 = newV2Manager(store)
  check (await rm1.ensureChannel(testChannel)).isOk()
  discard await rm1.wrapOutgoingMessage(@[1.byte], "m1", testChannel)
  let lamportAfterSession1 = store.lamports[testChannel]
  check lamportAfterSession1 > 0
  await rm1.cleanup()
  # Restart and send another message; lamport must not regress.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    check rm2.channels[testChannel].lamportTimestamp == lamportAfterSession1
    discard await rm2.wrapOutgoingMessage(@[2.byte], "m2", testChannel)
    check store.lamports[testChannel] > lamportAfterSession1
    let buf = await rm2.getOutgoingBuffer(testChannel)
    check buf.len == 2
    await rm2.cleanup()
  asyncTest "multiple restart cycles preserve state":
    let store = newInMemoryStore()
    for i in 1 .. 3:
      let rm = newV2Manager(store)
check (await rm.ensureChannel(testChannel)).isOk()
discard await rm.wrapOutgoingMessage(@[byte(i)], "m" & $i, testChannel)
await rm.cleanup()
# Final session: all three messages must be in the buffer.
let rmFinal = newV2Manager(store)
check (await rmFinal.ensureChannel(testChannel)).isOk()
let buf = await rmFinal.getOutgoingBuffer(testChannel)
check buf.len == 3
var ids = newSeq[string]()
for unack in buf:
  ids.add(unack.message.messageId.string)
check "m1" in ids
check "m2" in ids
check "m3" in ids
await rmFinal.cleanup()
asyncTest "incoming dep-waiting buffer survives restart with missingDeps intact":
  let store = newInMemoryStore()
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
let rm1 = newV2Manager(store)
check (await rm1.ensureChannel(testChannel)).isOk()
# Receive a message whose causal-history references an unknown predecessor.
let depMsg = SdsMessage.init(
  messageId = "msg-with-deps",
  lamportTimestamp = 10,
  causalHistory = @[HistoryEntry.init("missing-dep", @[])],
  channelId = testChannel,
  content = @[7.byte],
  bloomFilter = @[],
  senderId = "carol",
)
let serialized = serializeMessage(depMsg).get()
discard await rm1.unwrapReceivedMessage(serialized)
check "msg-with-deps" in store.incoming[testChannel]
await rm1.cleanup()
# Restart: the buffered message and its missing-deps set must be restored from the store.
let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let inbuf = await rm2.getIncomingBuffer(testChannel)
    check "msg-with-deps" in inbuf
    check "missing-dep" in inbuf["msg-with-deps"].missingDeps
    await rm2.cleanup()

  asyncTest "removeChannel + recreate does not inherit stale lamport":
    let store = newInMemoryStore()
    let rm1 = newV2Manager(store)
    check (await rm1.ensureChannel(testChannel)).isOk()
    discard await rm1.wrapOutgoingMessage(@[1.byte], "m-old", testChannel)
    check store.lamports[testChannel] > 0
    check (await rm1.removeChannel(testChannel)).isOk()
    check testChannel notin store.lamports
    await rm1.cleanup()
    # Recreate the same channelId after a restart; it must start fresh.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    check rm2.channels[testChannel].lamportTimestamp == 0
    let buf = await rm2.getOutgoingBuffer(testChannel)
    check buf.len == 0
    await rm2.cleanup()
  asyncTest "SDS-R outgoing repair buffer survives restart with absolute t_req_at":
    let store = newInMemoryStore()
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
let rm1 = newV2Manager(store)
check (await rm1.ensureChannel(testChannel)).isOk()
let depMsg = SdsMessage.init(
  messageId = "msg-needs-repair",
  lamportTimestamp = 5,
  causalHistory = @[HistoryEntry.init("missing-dep", @[])],
  channelId = testChannel,
  content = @[1.byte],
  bloomFilter = @[],
  senderId = "bob",
)
discard await rm1.unwrapReceivedMessage(serializeMessage(depMsg).get())
check "missing-dep" in store.outgoingRepair[testChannel]
let originalTReqAt =
  store.outgoingRepair[testChannel]["missing-dep"].minTimeRepairReq
check originalTReqAt.toUnix > 0
await rm1.cleanup()
# Restart: the repair entry must come back with the SAME absolute time.
# The codec serialises Time as int64 unix milliseconds (PLAN §1.5), so the
# restored Time may lose sub-millisecond precision relative to the
# original. Compare at second resolution, which is what the protocol
# actually relies on.
let rm2 = newV2Manager(store)
check (await rm2.ensureChannel(testChannel)).isOk()
let buf = rm2.channels[testChannel].outgoingRepairBuffer
check "missing-dep" in buf
check buf["missing-dep"].minTimeRepairReq.toUnix == originalTReqAt.toUnix
await rm2.cleanup()
asyncTest "FIFO eviction state survives restart":
let store = newInMemoryStore()
var smallCfg = defaultConfig()
smallCfg.maxMessageHistory = 3
smallCfg.bloomFilterCapacity = 3
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
let rm1 = newV2Manager(store, smallCfg)
check (await rm1.ensureChannel(testChannel)).isOk()
# Add 5 delivered messages — first 2 should be evicted by FIFO.
for i in 1 .. 5:
  let m = SdsMessage.init(
    messageId = "m" & $i,
    lamportTimestamp = int64(i),
    causalHistory = @[],
    channelId = testChannel,
    content = @[byte(i)],
    bloomFilter = @[],
    senderId = "alice",
  )
  check (await rm1.addToHistory(m, testChannel)).isOk()
await rm1.tryUpdateHistory(testChannel)
check store.log[testChannel].len == 3
check "m1" notin store.log[testChannel]
check "m2" notin store.log[testChannel]
await rm1.cleanup()
# Restart — evicted entries must NOT come back; survivors keep order.
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
let rm2 = newV2Manager(store, smallCfg)
check (await rm2.ensureChannel(testChannel)).isOk()
let history = rm2.channels[testChannel].messageHistory
check history.len == 3
check "m1" notin history
check "m2" notin history
check "m3" in history
check "m5" in history
# FIFO continues correctly after restart: adding m6 evicts m3.
let m6 = SdsMessage.init(
      messageId = "m6",
      lamportTimestamp = 6,
      causalHistory = @[],
      channelId = testChannel,
      content = @[6.byte],
      bloomFilter = @[],
      senderId = "alice",
    )
    check (await rm2.addToHistory(m6, testChannel)).isOk()
    await rm2.tryUpdateHistory(testChannel)
    check "m3" notin store.log[testChannel]
    check "m6" in store.log[testChannel]
    await rm2.cleanup()
  asyncTest "dep-clear cascade resumes correctly across a restart":
    let store = newInMemoryStore()
2026-06-01 12:24:38 +02:00
let rm1 = newV2Manager(store)
check (await rm1.ensureChannel(testChannel)).isOk()
# Receive c (depends on b), then b (depends on a). Since a has not been delivered, both must buffer.
let msgC = SdsMessage.init(
messageId = "c",
lamportTimestamp = 30,
causalHistory = @[HistoryEntry.init("b", @[])],
channelId = testChannel,
content = @[3.byte],
bloomFilter = @[],
senderId = "carol",
)
let msgB = SdsMessage.init(
messageId = "b",
lamportTimestamp = 20,
causalHistory = @[HistoryEntry.init("a", @[])],
channelId = testChannel,
content = @[2.byte],
bloomFilter = @[],
senderId = "bob",
)
discard await rm1.unwrapReceivedMessage(serializeMessage(msgC).get())
discard await rm1.unwrapReceivedMessage(serializeMessage(msgB).get())
check "c" in store.incoming[testChannel]
check "b" in store.incoming[testChannel]
await rm1.cleanup()
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
    # Restart — both still buffered with intact missingDeps.
    let rm2 = newV2Manager(store)
    check (await rm2.ensureChannel(testChannel)).isOk()
    let inbuf = await rm2.getIncomingBuffer(testChannel)
    check "c" in inbuf
    check "b" in inbuf
    # Now receive a (root) — should cascade-deliver a, b, c.
    let msgA = SdsMessage.init(
      messageId = "a",
      lamportTimestamp = 10,
      causalHistory = @[],
      channelId = testChannel,
      content = @[1.byte],
      bloomFilter = @[],
      senderId = "alice",
    )
    discard await rm2.unwrapReceivedMessage(serializeMessage(msgA).get())
    let history = rm2.channels[testChannel].messageHistory
    check "a" in history
    check "b" in history
    check "c" in history
    let inbufFinal = await rm2.getIncomingBuffer(testChannel)
    check inbufFinal.len == 0
    await rm2.cleanup()

refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
  * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`.
  * Pending queue cleared in `resetReliabilityManager`.
- tests/test_persistence.nim:
  * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern.
  * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call.
  * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end.
  * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one.

Resolves PR #72 review comments:
- #1 (delta loss on failed updateHistory) — R2 retry queue.
- #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch.
- #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing.

Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests).
FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
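The "latest-wins" merge rule and the retry-on-next-op flush described in this commit can be sketched in isolation. This is a minimal, synchronous illustration and not the code in `sds/sds_utils.nim`: `PendingQueue`, `FakeBackend`, `initPendingQueue`, `queueAppend`, `queueEvict` and `tryFlush` are hypothetical names, plain strings stand in for `SdsMessageID`, and the real procs are async and operate on `ChannelContext`.

```nim
import std/sets

type
  PendingQueue = object
    ## Hypothetical stand-in for the two pending fields on ChannelContext.
    appends: OrderedSet[string]
    evicts: HashSet[string]

  FakeBackend = object
    ## Hypothetical backend that can be switched into a failing state.
    failing: bool
    log: seq[string]
    calls: int

proc initPendingQueue(): PendingQueue =
  PendingQueue(appends: initOrderedSet[string](), evicts: initHashSet[string]())

proc queueAppend(q: var PendingQueue, id: string) =
  # Latest-wins: a new append cancels any pending evict of the same id.
  q.evicts.excl(id)
  q.appends.incl(id)

proc queueEvict(q: var PendingQueue, id: string) =
  # Latest-wins: a new evict cancels any pending append of the same id.
  q.appends.excl(id)
  q.evicts.incl(id)

proc updateHistory(b: var FakeBackend, appends, evicts: seq[string]): bool =
  # One backend round-trip; returns false while the backend is unreachable.
  inc b.calls
  if b.failing:
    return false
  for id in evicts:
    let i = b.log.find(id)
    if i >= 0:
      b.log.delete(i)
  for id in appends:
    if id notin b.log:
      b.log.add(id)
  true

proc tryFlush(q: var PendingQueue, b: var FakeBackend) =
  # Flush the whole queue in one call; on failure keep it for the next op.
  if q.appends.len == 0 and q.evicts.len == 0:
    return
  var appends, evicts: seq[string]
  for id in q.appends:
    appends.add(id)
  for id in q.evicts:
    evicts.add(id)
  if b.updateHistory(appends, evicts):
    q.appends.clear()
    q.evicts.clear()

when isMainModule:
  var q = initPendingQueue()
  var b = FakeBackend(failing: true)
  q.queueAppend("m1")
  q.tryFlush(b) # backend down: m1 stays queued
  q.queueEvict("m1") # evicted during the outage
  q.queueAppend("m1") # re-delivered: cancels the pending evict
  b.failing = false
  q.tryFlush(b) # backend recovered: queue drains in one call
  doAssert b.log == @["m1"]
  doAssert q.appends.len == 0 and q.evicts.len == 0
  doAssert b.calls == 2
```

Under an "evict-wins" rule the second `queueAppend("m1")` would be dropped while the id sat in `evicts`, so the final flush would leave `m1` out of the log. That is the case the evict-then-re-add regression test guards against.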
2026-06-01 12:24:38 +02:00
suite "Persistence: failure policy":
  asyncTest "loadChannel failure surfaces as rePersistenceError on bootstrap":
    # Bootstrap durability is the semantic intent of getOrCreateChannel —
    # the caller asked us to materialise a channel and we can't do that
    # without knowing prior state. So this op DOES propagate err on load
    # failure (PLAN §8).
    let store = newInMemoryStore()
    store.failingOps.incl("loadChannel")
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    let res = await rm.ensureChannel(testChannel)
    check res.isErr()
    check res.error == ReliabilityError.rePersistenceError

  asyncTest "saveChannelMeta failure during send does NOT surface — non-fatal policy":
    # PLAN §8: persistence failures during foreground ops are logged but
    # MUST NOT abort the op. The in-memory state is the source of truth;
    # the next op's snapshot will re-synchronise on-disk state. This test
    # is the inversion of the legacy "write failure surfaces as err" —
    # the new policy is deliberate.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    store.failingOps.incl("saveChannelMeta")
    let res = await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    # Op succeeds: bytes were produced, protocol state is correct in
    # memory, the FFI caller is unaffected.
    check res.isOk()
    # In-memory state is correct even though disk save was rejected.
    let buf = await rm.getOutgoingBuffer(testChannel)
    check buf.len == 1
    check buf[0].message.messageId == "m1"

    # Recovery: clear the failure, drive another op, disk catches up.
    store.failingOps.excl("saveChannelMeta")
    let res2 = await rm.wrapOutgoingMessage(@[byte(2)], "m2", testChannel)
    check res2.isOk()
    check "m1" in store.outgoing[testChannel]
    check "m2" in store.outgoing[testChannel]

  asyncTest "updateHistory failure during send does NOT surface — non-fatal policy":
    # Same policy applied to the history-update path.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    store.failingOps.incl("updateHistory")
    let res = await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    check res.isOk()
    check rm.channels[testChannel].messageHistory.len == 1

  asyncTest "updateHistory failure is retried via R2 pending-write queue":
    # Fix for PR #72 review comment #1: a failed history write must not
    # silently drop the delta. The pending-write queue parks failed
    # entries and retries them on the next op end. Once the backend
    # recovers, the disk catches up automatically — no caller action
    # needed, no err surfaced.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    # Failure 1: send m1 while updateHistory is broken.
    store.failingOps.incl("updateHistory")
    discard await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    # In-memory state is correct; disk has no log entry for m1 yet.
    check rm.channels[testChannel].messageHistory.len == 1
    check testChannel notin store.log or "m1" notin store.log[testChannel]
    # Pending queue should be holding m1 for retry.
    check rm.channels[testChannel].pendingHistoryAppends.len == 1
    check "m1" in rm.channels[testChannel].pendingHistoryAppends

    # Failure 2: send m2 while still broken. Pending should now hold both.
    discard await rm.wrapOutgoingMessage(@[byte(2)], "m2", testChannel)
    check rm.channels[testChannel].pendingHistoryAppends.len == 2
    check "m1" in rm.channels[testChannel].pendingHistoryAppends
    check "m2" in rm.channels[testChannel].pendingHistoryAppends
    # Still nothing on disk.
    check testChannel notin store.log or store.log[testChannel].len == 0

    # Recovery: clear the backend failure, send m3. The op-end flush
    # should drain ALL pending entries plus the new one in a single call.
    store.failingOps.excl("updateHistory")
    discard await rm.wrapOutgoingMessage(@[byte(3)], "m3", testChannel)
    check rm.channels[testChannel].pendingHistoryAppends.len == 0
    check "m1" in store.log[testChannel]
    check "m2" in store.log[testChannel]
    check "m3" in store.log[testChannel]

  asyncTest "evict-then-re-add merge rule preserves the re-added message on disk":
    # Regression: with the original "evict-wins" merge rule, a message
    # re-added (e.g. via SDS-R repair) after being evicted during a
    # backend outage would have its append silently dropped because the
    # id was still in pendingHistoryEvicts. The "latest-wins" rule fixes
    # this — the re-add cancels the pending evict.
    let store = newInMemoryStore()
    var smallCfg = defaultConfig()
    smallCfg.maxMessageHistory = 2
    smallCfg.bloomFilterCapacity = 2
    let rm = newReliabilityManager(
        participantId = "alice",
        config = smallCfg,
        persistence = newInMemoryPersistence(store),
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    proc mkMsg(id: string, ts: int64): SdsMessage =
      SdsMessage.init(
        messageId = id,
        lamportTimestamp = ts,
        causalHistory = @[],
        channelId = testChannel,
        content = @[byte(ts)],
        bloomFilter = @[],
        senderId = "alice",
      )

    # Break the backend, then fill the channel past maxMessageHistory so
    # m1 gets evicted while we have no successful flush yet.
    store.failingOps.incl("updateHistory")
    check (await rm.addToHistory(mkMsg("m1", 1), testChannel)).isOk()
    await rm.tryUpdateHistory(testChannel) # fails — m1 queued
    check (await rm.addToHistory(mkMsg("m2", 2), testChannel)).isOk()
    check (await rm.addToHistory(mkMsg("m3", 3), testChannel)).isOk()
    # m1 evicted by FIFO; pending should now have m2,m3 as appends and m1 as evict.
    check "m1" notin rm.channels[testChannel].messageHistory
    check "m1" in rm.channels[testChannel].pendingHistoryEvicts
    check "m1" notin rm.channels[testChannel].pendingHistoryAppends

    # SDS-R-style re-delivery of m1. With latest-wins, this MUST cancel
    # the pending evict and re-queue the append.
    check (await rm.addToHistory(mkMsg("m1", 4), testChannel)).isOk()
    check "m1" in rm.channels[testChannel].messageHistory
    check "m1" notin rm.channels[testChannel].pendingHistoryEvicts
    check "m1" in rm.channels[testChannel].pendingHistoryAppends

    # Recover and flush. m1 must land on disk.
    store.failingOps.excl("updateHistory")
    await rm.tryUpdateHistory(testChannel)
    check "m1" in store.log[testChannel]

  asyncTest "pending queue survives idle ops (flush on next op without history changes)":
    # Even if the next op makes no history changes of its own, it must
    # still flush the pending queue at op end — otherwise a failed write
    # could sit indefinitely if the application only ever does
    # mark-deps-met-style ops after a failure.
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    # Stage a pending entry by failing one send.
    store.failingOps.incl("updateHistory")
    discard await rm.wrapOutgoingMessage(@[byte(1)], "m1", testChannel)
    check rm.channels[testChannel].pendingHistoryAppends.len == 1

    # Now clear the failure and drive a markDependenciesMet on a no-op
    # input — it has no history changes of its own but its op-end flush
    # must still retry the queue.
    store.failingOps.excl("updateHistory")
    check (await rm.markDependenciesMet(@["nonexistent"], testChannel)).isOk()
    check rm.channels[testChannel].pendingHistoryAppends.len == 0
    check "m1" in store.log[testChannel]

  asyncTest "dropChannel failure during removeChannel surfaces as rePersistenceError":
    # Durability is the semantic intent of removeChannel — the caller
    # asked us to confirm a disk wipe. We cannot silently lie. So this op
    # DOES propagate err on failure (PLAN §8).
    let store = newInMemoryStore()
    let rm = newReliabilityManager(
        participantId = "alice", persistence = newInMemoryPersistence(store)
      )
      .get()
    check (await rm.ensureChannel(testChannel)).isOk()

    store.failingOps.incl("dropChannel")
    let res = await rm.removeChannel(testChannel)
    check res.isErr()
    check res.error == ReliabilityError.rePersistenceError