nim-sds/sds.nim

622 lines
25 KiB
Nim
feat: make Persistence interface async (#69) * feat: make Persistence interface async The 14 Persistence proc fields now return Future[...] with {.async: (raises: []), gcsafe.}, allowing real I/O backends (SQLite, encrypted file, network) to suspend rather than block the Chronos event loop the manager runs on. Propagates through: - ReliabilityManager.lock: system.Lock -> chronos.AsyncLock. Acquired across awaits cleanly; matches the single-threaded Chronos worker the FFI uses. Multi-OS-thread use is now explicitly the caller's responsibility. - sds_utils + sds.nim public API procs (wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet, setCallbacks, resetReliabilityManager, cleanup, ensureChannel, removeChannel, the getter snapshots, etc.) are now async. - FFI request handlers in library/sds_thread/... await the new API. - Tests converted via an asyncTest template that wraps each test body in an async proc; setup/teardown use waitFor for their single async call (ensureChannel / cleanup). Lock scope is preserved exactly: the same call sites that held the kernel Lock today hold AsyncLock now -- no new locking added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor: drop asyncSpawn, add asyncSetup/asyncTeardown Three asyncSpawn usages removed: - sds.nim startPeriodicTasks: stored the periodic-task futures on ReliabilityManager (new field `periodicTasks: seq[FutureBase]`) so cleanup can cancel them on shutdown instead of leaking the loops against a cleared manager. - library/sds_thread/sds_thread.nim: fireSync moved BEFORE processing, then `await SdsThreadRequest.process(...)` instead of asyncSpawn'ing it. Aligns the worker with the SP-channel + lock assumption that there are no concurrent requests; caller throughput is unchanged because the caller only waits for receipt (fireSync), not processing. - tests TestBus repair callback: replaced asyncSpawn(deliverExcept...) with an explicit pending-delivery queue drained by `bus.drain()`. 
Integration tests no longer rely on `sleepAsync(10ms)` to let spawned deliveries finish — they await drain instead. Tests also pick up an asyncSetup/asyncTeardown pair (tests/async_unittest.nim) so suite fixtures can `await` directly. All `waitFor` in setup/teardown blocks is gone; only the top-level asyncTest wrapper still uses waitFor (once, to drive the async proc to completion). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Correctly propagate error hidden by new async move * Correctly handle future cancellation exceptions, +some housekeeping * Apply suggestion from @Ivansete-status Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com> * Stylistics, async default implication addressed, nph style run * Remove leaking CancelledFuture from public facing + as a consequence it is tuneled into handling CatchableError everywhere --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ivan FB <128452529+Ivansete-status@users.noreply.github.com>
2026-05-25 22:30:15 +02:00
import std/[algorithm, times, tables, sets, options]
import chronos, results, chronicles
import sds/[types, protobuf, sds_utils, rolling_bloom_filter]
export types, protobuf, sds_utils, rolling_bloom_filter
proc newReliabilityManager*(
    participantId: SdsParticipantID,
    config: ReliabilityConfig = defaultConfig(),
    persistence: Persistence = noOpPersistence(),
): Result[ReliabilityManager, ReliabilityError] =
  ## Creates a new multi-channel ReliabilityManager.
  ## `participantId` is REQUIRED (see `ReliabilityManager.new`).
  ## `persistence` defaults to a no-op backend; supply a real one to durably
  ## store SDS state across restarts.
  try:
    let rm = ReliabilityManager.new(participantId, config, persistence)
    return ok(rm)
  except Exception:
    error "Failed to create ReliabilityManager", msg = getCurrentExceptionMsg()
    return err(ReliabilityError.reOutOfMemory)
proc isAcknowledged*(
    msg: UnacknowledgedMessage,
    causalHistory: seq[HistoryEntry],
    rbf: Option[RollingBloomFilter],
): bool =
  if msg.message.messageId in causalHistory.getMessageIds():
    return true
  if rbf.isSome():
    return rbf.get().contains(msg.message.messageId)
  return false
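
# Illustrative sketch, not part of the upstream module: the two-step
# acknowledgement check used by `isAcknowledged`, reduced to standard-library
# types so it can be run in isolation. Every `Sketch*` / `sketch*` name is
# hypothetical. The exact-membership HashSet is an assumption standing in for
# the rolling bloom filter; a real bloom filter may also report false
# positives, which an exact set never does.
import std/[options, sets, sequtils]

type
  SketchHistoryEntry = object ## hypothetical stand-in for HistoryEntry
    messageId: string

  SketchFilter = object ## hypothetical stand-in for RollingBloomFilter
    ids: HashSet[string]

proc contains(f: SketchFilter, id: string): bool =
  ## Exact membership test on the stand-in filter.
  id in f.ids

proc sketchIsAcknowledged(
    messageId: string,
    causalHistory: seq[SketchHistoryEntry],
    filter: Option[SketchFilter],
): bool =
  ## Step 1: an id listed in the sender's causal history is acknowledged.
  if causalHistory.anyIt(it.messageId == messageId):
    return true
  ## Step 2: otherwise consult the filter, if the message carried one.
  if filter.isSome():
    return filter.get().contains(messageId)
  false

when isMainModule:
  let history = @[SketchHistoryEntry(messageId: "m1")]
  # Acknowledged through the causal history alone.
  doAssert sketchIsAcknowledged("m1", history, none(SketchFilter))
  # Absent from the history and no filter supplied: not acknowledged.
  doAssert not sketchIsAcknowledged("m2", history, none(SketchFilter))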
proc reviewAckStatus(
    rm: ReliabilityManager, msg: SdsMessage
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
  try:
    var rbf: Option[RollingBloomFilter]
    if msg.bloomFilter.len > 0:
      let bfResult = deserializeBloomFilter(msg.bloomFilter)
      if bfResult.isOk():
        let bf = bfResult.get()
        rbf = some(
          RollingBloomFilter.init(
            filter = bf,
            capacity = bf.capacity,
            minCapacity =
              (bf.capacity.float * (100 - CapacityFlexPercent).float / 100.0).int,
            maxCapacity =
              (bf.capacity.float * (100 + CapacityFlexPercent).float / 100.0).int,
          )
        )
      else:
        error "Failed to deserialize bloom filter", error = bfResult.error
        rbf = none[RollingBloomFilter]()
    else:
      rbf = none[RollingBloomFilter]()
    if msg.channelId notin rm.channels:
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
      return ok()
    let channel = rm.channels[msg.channelId]
    # Collect the indices of acknowledged entries first; deleting while
    # scanning would shift the positions of the entries not yet visited.
    var toDelete: seq[(int, SdsMessageID)] = @[]
    var i = 0
    while i < channel.outgoingBuffer.len:
      let outMsg = channel.outgoingBuffer[i]
      if outMsg.isAcknowledged(msg.causalHistory, rbf):
        if not rm.onMessageSent.isNil():
          {.cast(raises: []).}:
            rm.onMessageSent(outMsg.message.messageId, outMsg.message.channelId)
        toDelete.add((i, outMsg.message.messageId))
      inc i
    # Delete in reverse order so the remaining collected indices stay valid.
    for k in countdown(toDelete.high, 0):
      # Phase 2B: in-memory deletion only; the caller's op-end trySaveMeta
      # captures the new outgoingBuffer state. The msgId half of the
      # tuple is unused now that there is no per-row persistence call.
      channel.outgoingBuffer.delete(toDelete[k][0])
    ok()
  except CatchableError:
error "Failed to review ack status", msg = getCurrentExceptionMsg()
err(ReliabilityError.reInternalError)
proc wrapOutgoingMessage*(
    rm: ReliabilityManager,
    message: seq[byte],
    messageId: SdsMessageID,
    channelId: SdsChannelID,
): Future[Result[seq[byte], ReliabilityError]] {.async: (raises: []), gcsafe.} =
  ## Wraps an outgoing message with reliability metadata.
  if message.len == 0:
    return err(ReliabilityError.reInvalidArgument)
  if message.len > MaxMessageSize:
    return err(ReliabilityError.reMessageTooLarge)
  try:
    await rm.lock.acquire()
    try:
try:
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
let channel = (await rm.getOrCreateChannel(channelId)).valueOr:
  return err(error)
(await rm.updateLamportTimestamp(getTime().toUnix, channelId)).isOkOr:
  return err(error)
let bfResult = serializeBloomFilter(channel.bloomFilter.filter)
if bfResult.isErr():
  error "Failed to serialize bloom filter", channelId = channelId
  return err(ReliabilityError.reSerializationError)
# SDS-R: collect eligible expired repair requests to attach. Per
# spec (sds-r-send-message, RECOMMENDED), prioritise the entries with
# the smallest minTimeRepairReq — they are the most overdue and the
# ones the network most needs us to ask about.
var repairReqs: seq[HistoryEntry] = @[]
let now = getTime()
var expiredKeys: seq[SdsMessageID] = @[]
var eligible: seq[(SdsMessageID, OutgoingRepairEntry)] = @[]
for msgId, repairEntry in channel.outgoingRepairBuffer:
  if now >= repairEntry.minTimeRepairReq:
    eligible.add((msgId, repairEntry))
eligible.sort do(a, b: (SdsMessageID, OutgoingRepairEntry)) -> int:
  cmp(a[1].minTimeRepairReq, b[1].minTimeRepairReq)
let take = min(eligible.len, rm.config.maxRepairRequests)
for i in 0 ..< take:
  repairReqs.add(eligible[i][1].outHistEntry)
  expiredKeys.add(eligible[i][0])
for key in expiredKeys:
  channel.outgoingRepairBuffer.del(key)
# Phase 2B: in-memory deletion only; op-end trySaveMeta covers it.
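The selection step above (filter the entries whose deadline has passed, sort by deadline, cap the count, then delete the chosen keys) can be tried in isolation with the minimal sketch below. It is an illustration under stated assumptions, not the module's actual code: `MessageId`, `RepairEntry`, and `takeMostOverdue` are hypothetical stand-ins for `SdsMessageID`, `OutgoingRepairEntry`, and the inline logic, and `maxRequests` stands in for `rm.config.maxRepairRequests`.

```nim
import std/[algorithm, tables, times]

type
  MessageId = string     # hypothetical stand-in for SdsMessageID
  RepairEntry = object   # hypothetical stand-in for OutgoingRepairEntry
    minTimeRepairReq: Time

proc takeMostOverdue(
    buffer: var Table[MessageId, RepairEntry], now: Time, maxRequests: int
): seq[MessageId] =
  ## Removes and returns up to `maxRequests` ids whose minTimeRepairReq has
  ## passed, ordered from the earliest (most overdue) deadline onwards.
  var eligible: seq[(MessageId, RepairEntry)] = @[]
  for msgId, entry in buffer:
    if now >= entry.minTimeRepairReq:
      eligible.add((msgId, entry))
  # Smallest deadline first, mirroring the comparison used above.
  eligible.sort do(a, b: (MessageId, RepairEntry)) -> int:
    cmp(a[1].minTimeRepairReq, b[1].minTimeRepairReq)
  for i in 0 ..< min(eligible.len, maxRequests):
    result.add(eligible[i][0])
  # Delete only after iteration has finished, never while iterating the table.
  for id in result:
    buffer.del(id)

when isMainModule:
  let now = fromUnix(100)
  var buffer = {
    "a": RepairEntry(minTimeRepairReq: fromUnix(50)),
    "b": RepairEntry(minTimeRepairReq: fromUnix(10)),
    "c": RepairEntry(minTimeRepairReq: fromUnix(90)),
    "d": RepairEntry(minTimeRepairReq: fromUnix(200)) # not yet due
  }.toTable
  let picked = buffer.takeMostOverdue(now, 2)
  doAssert picked == @["b", "a"]
  doAssert buffer.len == 2 and "c" in buffer and "d" in buffer
```

Collecting the keys first and deleting them in a second pass keeps the table unmodified while it is being iterated. Entries that are due but fall outside the cap (here `"c"`) stay in the buffer for a later send.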
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
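The "latest-wins" merge rule described in the commit message above can be sketched in a few lines of Nim. This is a minimal illustration, not the code in `sds/sds_utils.nim`: the `PendingQueue` type, the `MsgId` alias (standing in for `SdsMessageID`), and the proc names `queueAppend` / `queueEvict` are hypothetical, and the real helpers operate on `ChannelContext` fields instead.

```nim
import std/sets

type
  MsgId = string # hypothetical stand-in for SdsMessageID
  PendingQueue = object # hypothetical; the real fields live on ChannelContext
    appends: OrderedSet[MsgId] # ids waiting to be appended, in insertion order
    evicts: HashSet[MsgId] # ids waiting to be evicted

proc queueAppend(q: var PendingQueue, id: MsgId) =
  ## Latest-wins: a new append cancels any pending evict for the same id.
  q.evicts.excl(id)
  q.appends.incl(id)

proc queueEvict(q: var PendingQueue, id: MsgId) =
  ## Latest-wins: a new evict cancels any pending append for the same id.
  q.appends.excl(id)
  q.evicts.incl(id)

when isMainModule:
  var q: PendingQueue
  q.queueAppend("m1") # message added to history
  q.queueEvict("m1") # evicted while the backend is unreachable
  q.queueAppend("m1") # re-delivered, e.g. by a repair
  doAssert "m1" in q.appends
  doAssert "m1" notin q.evicts
```

Because each call removes the id from the opposite set before recording it, an id is never in both sets at once, and the most recent operation for that id is the one that remains queued for the next flush.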
let causalHistory = (
  await rm.getRecentHistoryEntries(rm.config.maxCausalHistory, channelId)
).valueOr:
  return err(error)
let msg = SdsMessage.init(
  messageId = messageId,
  lamportTimestamp = channel.lamportTimestamp,
  causalHistory = causalHistory,
  channelId = channelId,
  content = message,
  bloomFilter = bfResult.get(),
  senderId = rm.participantId,
  repairRequest = repairReqs,
)
let unackMsg = UnacknowledgedMessage.init(
  message = msg, sendTime = getTime(), resendAttempts = 0
)
channel.outgoingBuffer.add(unackMsg)
# Phase 2B: in-memory append only; op-end trySaveMeta covers it.
channel.bloomFilter.add(msg.messageId)
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
      # addToHistory mutates in-memory state and queues the append/evict
      # on the channel's pending-history queue; persistence happens
      # ONCE at op end via tryUpdateHistory.
      (await rm.addToHistory(msg, channelId)).isOkOr:
        return err(error)
      # Op end: one meta snapshot + one history flush, paired under the
      # lock per the Persistence atomicity contract. tryUpdateHistory
      # flushes the channel's pending queue (this op's mutations PLUS
      # any leftovers from a prior failed write — R2 retry).
      await rm.trySaveMeta(channelId, channel)
      await rm.tryUpdateHistory(channelId)
      return serializeMessage(msg)
    except CatchableError:
      error "Failed to wrap message",
        channelId = channelId, msg = getCurrentExceptionMsg()
      return err(ReliabilityError.reSerializationError)
    finally:
      rm.lock.release()
  except CatchableError:
    error "Failed to wrap message (lock)",
      channelId = channelId, msg = getCurrentExceptionMsg()
    return err(ReliabilityError.reSerializationError)
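
# Illustrative sketch (not part of sds.nim): the "latest-wins" pending-history
# queue and the op-end flush that the comments above refer to. Every name
# below (MsgId, PendingQueue, FlushProc, queueAppend, queueEvict, tryFlush) is
# a hypothetical stand-in chosen for this example. The real helpers are async,
# live in sds/sds_utils.nim, and call the Persistence backend; this version is
# synchronous and models the backend as a proc returning false on failure.

```nim
import std/sets

type
  MsgId = string
  PendingQueue = object
    appends: OrderedSet[MsgId] # ids waiting to be appended, in queue order
    evicts: HashSet[MsgId] # ids waiting to be evicted
  # Hypothetical backend hook: returns false when the write fails.
  FlushProc = proc(appendIds, evictIds: seq[MsgId]): bool

proc initPendingQueue(): PendingQueue =
  PendingQueue(appends: initOrderedSet[MsgId](), evicts: initHashSet[MsgId]())

proc queueAppend(q: var PendingQueue, id: MsgId) =
  # Latest-wins: a new append cancels a pending evict of the same id.
  q.evicts.excl(id)
  q.appends.incl(id)

proc queueEvict(q: var PendingQueue, id: MsgId) =
  # Latest-wins: a new evict cancels a pending append of the same id.
  q.appends.excl(id)
  q.evicts.incl(id)

proc tryFlush(q: var PendingQueue, flush: FlushProc): bool =
  ## Flush everything queued so far in one backend call. On failure the
  ## queue is left untouched so that the next op retries the same batch.
  if q.appends.len == 0 and q.evicts.len == 0:
    return true # nothing to write; skip the backend call entirely
  var appendIds, evictIds: seq[MsgId]
  for id in q.appends:
    appendIds.add(id)
  for id in q.evicts:
    evictIds.add(id)
  if not flush(appendIds, evictIds):
    return false
  q.appends.clear()
  q.evicts.clear()
  true

when isMainModule:
  proc alwaysFails(appendIds, evictIds: seq[MsgId]): bool =
    false

  proc alwaysSucceeds(appendIds, evictIds: seq[MsgId]): bool =
    true

  var demo = initPendingQueue()
  demo.queueEvict("m1")
  demo.queueAppend("m1") # the re-add wins over the earlier evict
  doAssert not demo.tryFlush(alwaysFails) # failed write keeps the batch queued
  doAssert demo.appends.len == 1
  doAssert demo.tryFlush(alwaysSucceeds) # the retry drains the queue
  doAssert demo.appends.len == 0 and demo.evicts.len == 0
```

# Keeping only ids in the queue mirrors the commit notes, which say the full
# message is looked up from messageHistory at flush time; that lookup is
# omitted from this sketch.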
proc processIncomingBuffer(
    rm: ReliabilityManager, channelId: SdsChannelID
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
  ## Cascade-deliver any buffered messages whose dependencies are now met.
  ## Each `addToHistory` call queues its append/evict on the channel's
  ## pending-history queue; the *caller* (a public protocol op) issues
  ## ONE `tryUpdateHistory` at op end to flush the whole cascade in a
  ## single round-trip.
  try:
    await rm.lock.acquire()
    try:
      if channelId notin rm.channels:
        error "Channel does not exist", channelId = channelId
        return ok()
      let channel = rm.channels[channelId]
      if channel.incomingBuffer.len == 0:
2026-06-01 12:24:38 +02:00
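The "latest-wins" merge rule described in the commit message above can be sketched in isolation. This is a minimal illustration, not the code in `sds/sds_utils.nim`: `MsgId`, `PendingQueue`, `initPendingQueue`, `queueAppend`, `queueEvict`, and `flush` are hypothetical names, and the `backendUp` flag stands in for a real persistence call.

```nim
import std/sets

type
  MsgId = string # hypothetical stand-in for SdsMessageID
  PendingQueue = object
    appends: OrderedSet[MsgId] # ids waiting to be appended, in queue order
    evicts: HashSet[MsgId] # ids waiting to be evicted

proc initPendingQueue(): PendingQueue =
  PendingQueue(appends: initOrderedSet[MsgId](), evicts: initHashSet[MsgId]())

proc queueAppend(q: var PendingQueue, id: MsgId) =
  ## Latest-wins: a new append cancels any pending evict for the same id.
  q.evicts.excl(id)
  q.appends.incl(id)

proc queueEvict(q: var PendingQueue, id: MsgId) =
  ## Latest-wins: a new evict cancels any pending append for the same id.
  q.appends.excl(id)
  q.evicts.incl(id)

proc flush(q: var PendingQueue, backendUp: bool): bool =
  ## Assumed behaviour: an empty queue is a no-op, and a failed write
  ## leaves the queue populated so the next op can retry it.
  if q.appends.len == 0 and q.evicts.len == 0:
    return true
  if not backendUp:
    return false
  q.appends.clear()
  q.evicts.clear()
  true
```

Under this sketch, evicting an id and then re-adding it leaves the id only in the append set, which is the evict-then-re-add case the regression test in the commit message covers.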
    return ok()
  # Worklist of buffered messages whose dependencies are all satisfied.
  var processed = initHashSet[SdsMessageID]()
  var readyToProcess = newSeq[SdsMessageID]()

  for msgId, entry in channel.incomingBuffer:
    if entry.missingDeps.len == 0:
      readyToProcess.add(msgId)

  while readyToProcess.len > 0:
    let msgId = readyToProcess.pop()
    if msgId in processed:
      continue

    if msgId in channel.incomingBuffer:
      (await rm.addToHistory(channel.incomingBuffer[msgId].message, channelId)).isOkOr:
        return err(error)
      if not rm.onMessageReady.isNil():
        {.cast(raises: []).}:
          rm.onMessageReady(msgId, channelId)
      processed.incl(msgId)

      # Look for still-buffered messages that were waiting on msgId.
      for remainingId, entry in channel.incomingBuffer:
        if remainingId notin processed:
          if msgId in entry.missingDeps:
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
# Phase 2B: in-memory dep-set shrink only; the parent op
# (unwrap / markDeps) issues a single trySaveMeta at its
# end that captures the final incomingBuffer state.
channel.incomingBuffer[remainingId].missingDeps.excl(msgId)
if channel.incomingBuffer[remainingId].missingDeps.len == 0:
  readyToProcess.add(remainingId)
for msgId in processed:
# Phase 2B: in-memory deletion only; parent op's trySaveMeta covers
# the drained buffer state.
  channel.incomingBuffer.del(msgId)
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
ok()
finally:
rm.lock.release()
except CatchableError:
    error "Failed to process incoming buffer",
      channelId = channelId, msg = getCurrentExceptionMsg()
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
    err(ReliabilityError.reInternalError)

proc unwrapReceivedMessage*(
    rm: ReliabilityManager, message: seq[byte]
): Future[
  Result[
    tuple[message: seq[byte], missingDeps: seq[HistoryEntry], channelId: SdsChannelID],
    ReliabilityError,
  ]
] {.async: (raises: []).} =
  ## Unwraps a received message and processes its reliability metadata.
  try:
    let channelId = extractChannelId(message).valueOr:
      return err(ReliabilityError.reDeserializationError)
    let msg = deserializeMessage(message).valueOr:
      return err(ReliabilityError.reDeserializationError)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
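The "latest-wins" merge rule and the retry-on-failure flush described in the commit message can be illustrated with a small standalone sketch. This is not the code from `sds/sds_utils.nim`: `MsgId`, `PendingQueue`, `initPendingQueue`, `queueAppend`, `queueEvict`, `flush`, and the synchronous `write` callback are hypothetical names chosen for illustration, and the real helpers are async and operate on `ChannelContext`.

```nim
import std/sets

type
  MsgId = string                  # hypothetical stand-in for SdsMessageID
  PendingQueue = object           # hypothetical; the real fields live on ChannelContext
    appends: OrderedSet[MsgId]    # ids waiting to be appended, in queue order
    evicts: HashSet[MsgId]        # ids waiting to be evicted

proc initPendingQueue(): PendingQueue =
  PendingQueue(appends: initOrderedSet[MsgId](), evicts: initHashSet[MsgId]())

proc queueAppend(q: var PendingQueue, id: MsgId) =
  ## Latest wins: a new append cancels any pending evict of the same id.
  q.evicts.excl(id)
  q.appends.incl(id)

proc queueEvict(q: var PendingQueue, id: MsgId) =
  ## Latest wins: a new evict cancels any pending append of the same id.
  q.appends.excl(id)
  q.evicts.incl(id)

proc flush(q: var PendingQueue,
           write: proc(appends, evicts: seq[MsgId]): bool): bool =
  ## Sends the whole queue in one call. The queue is cleared only when the
  ## write succeeds, so a failed batch stays queued for the next attempt.
  if q.appends.len == 0 and q.evicts.len == 0:
    return true                   # nothing to write, skip the backend call
  var a, e: seq[MsgId]
  for id in q.appends: a.add(id)
  for id in q.evicts: e.add(id)
  result = write(a, e)
  if result:
    q.appends.clear()
    q.evicts.clear()
```

Under these assumptions, evicting `m1` and then re-adding it leaves `m1` only in the append set, which is the evict-then-re-add case the regression test targets. A `flush` whose `write` callback returns `false` leaves both sets untouched, so the next operation retries the same batch.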
let channel = (await rm.getOrCreateChannel(channelId)).valueOr:
  return err(error)
# SDS-R: opportunistic repair-buffer cleanup — applies to duplicates too,
# so rebroadcasts cancel redundant responses on peers that already have the message.
# Phase 2B: in-memory deletes only; op-end trySaveMeta covers it.
channel.outgoingRepairBuffer.del(msg.messageId)
channel.incomingRepairBuffer.del(msg.messageId)
if msg.messageId in channel.messageHistory:
  # Duplicate: no history change. Still flush the meta (repair-buffer
  # dels above are mutations) and the history queue (any pending
  # entries from a prior failed write get retried here too).
  await rm.trySaveMeta(channelId, channel)
  await rm.tryUpdateHistory(channelId)
  return ok((msg.content, @[], channelId))
channel.bloomFilter.add(msg.messageId)
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
(await rm.updateLamportTimestamp(msg.lamportTimestamp, channelId)).isOkOr:
  return err(error)
(await rm.reviewAckStatus(msg)).isOkOr:
  return err(error)
# SDS-R: process incoming repair requests from this message. We can only
# answer for messages we have actually delivered (i.e. that live in
# messageHistory) — buffered-but-undelivered messages are not in a state
# to confidently rebroadcast.
let now = getTime()
for repairEntry in msg.repairRequest:
  # Remove from our own outgoing repair buffer (someone else is also requesting).
  # Phase 2B: in-memory delete only; op-end trySaveMeta covers it.
  channel.outgoingRepairBuffer.del(repairEntry.messageId)
  if repairEntry.messageId in channel.messageHistory and rm.participantId.len > 0 and
      repairEntry.senderId.len > 0:
    if isInResponseGroup(
rm.participantId, repairEntry.senderId, repairEntry.messageId,
rm.config.numResponseGroups,
):
let serialized =
serializeMessage(channel.messageHistory[repairEntry.messageId])
if serialized.isOk():
let tResp = computeTResp(
rm.participantId, repairEntry.senderId, repairEntry.messageId,
rm.config.repairTMax,
)
# Cache the serialized message; the earliest response time is now + tResp.
let inEntry = IncomingRepairEntry(
inHistEntry: repairEntry,
cachedMessage: serialized.get(),
minTimeRepairResp: now + tResp,
)
# Phase 2B: in-memory insert only; op-end trySaveMeta covers it.
channel.incomingRepairBuffer[repairEntry.messageId] = inEntry
var missingDeps = rm.checkDependencies(msg.causalHistory, channelId)
# Nothing is missing, but a dependency may still be waiting in the incoming buffer.
if missingDeps.len == 0:
var depsInBuffer = false
for msgId, entry in channel.incomingBuffer.pairs():
if msgId in msg.causalHistory.getMessageIds():
depsInBuffer = true
break
if depsInBuffer:
let entry =
IncomingMessage.init(message = msg, missingDeps = initHashSet[SdsMessageID]())
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
  # Phase 2B: in-memory insert only; op-end trySaveMeta covers it.
  channel.incomingBuffer[msg.messageId] = entry
else:
  (await rm.addToHistory(msg, channelId)).isOkOr:
    return err(error)
  # Unblock any buffered messages that were waiting on this one.
  for pendingId, entry in channel.incomingBuffer:
    if msg.messageId in entry.missingDeps:
      channel.incomingBuffer[pendingId].missingDeps.excl(msg.messageId)
      # Cascade: addToHistory calls within processIncomingBuffer queue
      # their entries on the channel's pending-history queue, flushed
      # by the single op-end tryUpdateHistory below.
      (await rm.processIncomingBuffer(channelId)).isOkOr:
        return err(error)
      if not rm.onMessageReady.isNil():
        {.cast(raises: []).}:
          rm.onMessageReady(msg.messageId, channelId)
    else:
      let entry = IncomingMessage.init(
        message = msg, missingDeps = missingDeps.getMessageIds().toHashSet()
      )
      # Phase 2B: in-memory insert only; op-end trySaveMeta covers it.
      channel.incomingBuffer[msg.messageId] = entry
      if not rm.onMissingDependencies.isNil():
        {.cast(raises: []).}:
          rm.onMissingDependencies(msg.messageId, missingDeps, channelId)
      # SDS-R: add missing deps to outgoing repair buffer
      if rm.participantId.len > 0:
        for dep in missingDeps:
          if dep.messageId notin channel.outgoingRepairBuffer:
            let tReq = computeTReq(
              rm.participantId, dep.messageId, rm.config.repairTMin,
              rm.config.repairTMax,
            )
let outEntry =
  OutgoingRepairEntry(outHistEntry: dep, minTimeRepairReq: now + tReq)
# Phase 2B: in-memory insert only; op-end trySaveMeta covers it.
channel.outgoingRepairBuffer[dep.messageId] = outEntry
# Op end: one meta snapshot + one history flush, paired under the
# lock. The flush is the single point where any cascade-driven
# appends/evicts hit disk (R2 queue absorbs failures).
await rm.trySaveMeta(channelId, channel)
await rm.tryUpdateHistory(channelId)
return ok((msg.content, missingDeps, channelId))
  except CatchableError:
    error "Failed to unwrap message", msg = getCurrentExceptionMsg()
    return err(ReliabilityError.reDeserializationError)

proc markDependenciesMet*(
    rm: ReliabilityManager, messageIds: seq[SdsMessageID], channelId: SdsChannelID
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
  ## Marks the specified message dependencies as met.
  try:
    if channelId notin rm.channels:
      return err(ReliabilityError.reInvalidArgument)

    let channel = rm.channels[channelId]

    for msgId in messageIds:
      if not channel.bloomFilter.contains(msgId):
        channel.bloomFilter.add(msgId)
      # Phase 2B: in-memory dep-set shrink + repair-buffer dels only; the
      # op-end trySaveMeta below covers all mutations atomically.
      for pendingId, entry in channel.incomingBuffer:
        if msgId in entry.missingDeps:
          channel.incomingBuffer[pendingId].missingDeps.excl(msgId)
# SDS-R: clear from repair buffers (dependency now met).
channel.outgoingRepairBuffer.del(msgId)
channel.incomingRepairBuffer.del(msgId)
    (await rm.processIncomingBuffer(channelId)).isOkOr:
      return err(error)
    # Op end: one meta snapshot + one history flush, paired under the lock.
    # The flush covers any cascade-driven appends/evicts queued during
    # processIncomingBuffer.
    if channelId in rm.channels:
      await rm.trySaveMeta(channelId, rm.channels[channelId])
      await rm.tryUpdateHistory(channelId)
    return ok()
  except CatchableError:
    error "Failed to mark dependencies as met",
      channelId = channelId, msg = getCurrentExceptionMsg()
    return err(ReliabilityError.reInternalError)

proc setCallbacks*(
    rm: ReliabilityManager,
    onMessageReady: MessageReadyCallback,
    onMessageSent: MessageSentCallback,
    onMissingDependencies: MissingDependenciesCallback,
    onPeriodicSync: PeriodicSyncCallback = nil,
    onRetrievalHint: RetrievalHintProvider = nil,
    onRepairReady: RepairReadyCallback = nil,
) {.async: (raises: []).} =
  ## Sets the callback functions for various events in the ReliabilityManager.
  try:
    await rm.lock.acquire()
    try:
      rm.onMessageReady = onMessageReady
      rm.onMessageSent = onMessageSent
      rm.onMissingDependencies = onMissingDependencies
      rm.onPeriodicSync = onPeriodicSync
      rm.onRetrievalHint = onRetrievalHint
      rm.onRepairReady = onRepairReady
    finally:
      rm.lock.release()
  except CatchableError:
    error "Failed to set callbacks", msg = getCurrentExceptionMsg()

proc checkUnacknowledgedMessages(
    rm: ReliabilityManager, channelId: SdsChannelID
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
  ## Persistence model (PLAN_SNAPSHOT_PERSISTENCE.md phase 2.2): per-entry
  ## saveOutgoing / removeOutgoing calls are replaced by a single
  ## `trySaveMeta` at the end of the pass, issued *only* if the buffer
  ## actually changed (resend attempt incremented, or entry expired).
  ## A failed save is logged but does not abort the pass; the next tick
  ## reissues a fresh snapshot.
  try:
    await rm.lock.acquire()
    try:
      if channelId notin rm.channels:
        error "Channel does not exist", channelId = channelId
        return ok()
      let channel = rm.channels[channelId]
      let now = getTime()
      var newOutgoingBuffer: seq[UnacknowledgedMessage] = @[]
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
var dirty = false
for unackMsg in channel.outgoingBuffer:
  let elapsed = now - unackMsg.sendTime
  if elapsed > rm.config.resendInterval:
    if unackMsg.resendAttempts < rm.config.maxResendAttempts:
      var updatedMsg = unackMsg
      updatedMsg.resendAttempts += 1
      updatedMsg.sendTime = now
      newOutgoingBuffer.add(updatedMsg)
      dirty = true
else:
if not rm.onMessageSent.isNil():
{.cast(raises: []).}:
rm.onMessageSent(unackMsg.message.messageId, channelId)
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72) * feat: propagate persistence backend errors via Result The Persistence contract previously returned `Future[void]` for writes and `Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no way to report a failure, so a failed write or a failed/partial read was silently swallowed — and on the read path a mid-scan failure could bootstrap a *truncated* channel snapshot, corrupting the rebuilt bloom filter and lamport clock across a restart. Make every contract field Result-returning: * mutating ops -> Future[Result[void, string]] * loadAllForChannel -> Future[Result[ChannelSnapshot, string]] The backend-supplied error string is mapped to a new `ReliabilityError.rePersistenceError` (logged once at the boundary via `reliabilityErr`) and threaded up through every persistence-touching proc to the public API, where the caller decides what to do. Request-driven paths (wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate the error; background maintenance loops (periodicBufferSweep, periodicRepairSweep) log and retry on the next tick, since they have no synchronous caller. Tests: in-memory backend gains a `failingOps` injection hook; new "Persistence: error propagation" suite asserts read/write/drop failures surface as `rePersistenceError`. Full suite passes (90 OK). BREAKING CHANGE: the `Persistence` contract signature changed; custom backends must return `Result` and `ok()` on success. Bumped to 0.3.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add snapshot types and codec (phase 0) Introduce atomic-snapshot persistence types that will replace the current fine-grained 13-proc Persistence interface. This commit is purely additive: no existing call site changes, no behaviour change. 
New types (sds/types/): - channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob), ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV (flattened map entries for protobuf wire shape). - history_update.nim — HistoryUpdate (combined append/evict payload for the message log). New codec (sds/snapshot_codec.nim): - Protobuf encode/decode for all new types, reusing the existing SdsMessage and HistoryEntry encoders from sds/protobuf.nim. - Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown versions loudly rather than silently truncating. - Time encoded as int64 unix milliseconds. Tests (tests/test_snapshot_codec.nim): - 13 round-trip cases covering empty, single-entry, full-buffer, and repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants; schemaVersion rejection. Planning artefacts: - ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write divergence, chatty call rate, non-fatal-error policy gap). - ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op and projected call rates. - PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit implements phase 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persistence): add PersistenceV2 interface alongside legacy (phase 1) Introduce the 5-proc snapshot-based Persistence interface that will replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so phase 2 can migrate protocol ops one at a time without breaking existing callers. New file: - sds/types/persistence_v2.nim — `PersistenceV2` type with saveChannelMeta / updateHistory / loadChannel / dropChannel / setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture the atomicity pairing (meta save + history update issued back-to-back under the channel lock) and the non-fatal failure policy from PLAN §8. 
Modified: - sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2` field alongside `persistence`; constructor takes both, both default to no-op. - sds.nim — `newReliabilityManager` plumbs the new optional parameter. - AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 + phase 1 additions; symbol counts updated by `npx gitnexus analyze`. No call site uses the new interface yet — that's phase 2. All existing tests still pass against the legacy interface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1) Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced by a single trySaveMeta per *dirty* channel at the end of that channel's sweep. Failure is logged but does NOT abort the sweep — in-memory state is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8). Helpers added in sds/sds_utils.nim: - snapshotMeta(channel) — capture current ChannelContext as ChannelMeta blob (flattens Table-keyed buffers to seqs for the wire shape). - trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save; logs on failure, never propagates. - tryUpdateHistory(rm, channelId, append, evict) — best-effort history update; skips the call entirely when both lists are empty (HistoryUpdate contract). Call-rate impact for runRepairSweep: - Before: N persistence calls per expired entry per channel. - After: at most 1 saveChannelMeta per dirty channel; 0 on idle channels (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS). All existing tests pass — including the 3 SDS-R Repair Sweep tests that directly exercise this proc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2) Per-entry saveOutgoing / removeOutgoing calls are replaced by one trySaveMeta at the end of the pass, conditional on a dirty flag (resend attempt incremented, or entry expired). 
Pass succeeds even if the save fails — next tick reissues the snapshot. Call-rate impact: - Before: N persistence calls per affected entry per pass. - After: at most 1 saveChannelMeta per pass; 0 when nothing aged out. All existing tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A) Wires `trySaveMeta` into the three public protocol ops that mutate per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and markDependenciesMet — at the operation's end, under the channel lock. Legacy fine-grained persistence calls REMAIN in place; this commit is additive. Both interfaces persist the same state simultaneously, so all existing tests pass and a real backend wired to either interface continues to work. Phase 2B will strip the legacy calls. Save points match the §"Save Points" table in ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly: - wrapOutgoingMessage: 1 save (always) - unwrapReceivedMessage: 1 save on every path including duplicate (the duplicate path still mutates the repair buffers) - markDependenciesMet: 1 save after the processIncomingBuffer cascade Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues; the protocol op never returns rePersistenceError for snapshot failures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D) End-state of phase 2: the protocol code no longer issues any legacy fine-grained Persistence calls. All state survives via the snapshot-based PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory batched inside addToHistory. The legacy Persistence field on ReliabilityManager remains for backwards compatibility; phase 3 deletes it. Protocol changes (sds.nim, sds/sds_utils.nim): - reviewAckStatus, processIncomingBuffer, updateLamportTimestamp → pure in-memory; no per-mutation persistence. 
- addToHistory: replaces appendLogEntry+removeLogEntry with a single tryUpdateHistory call carrying (append, evict) atomically. - getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal. - wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet: all per-row saveOutgoing / removeOutgoing / saveIncoming / removeIncoming / saveOutgoingRepair / removeOutgoingRepair / saveIncomingRepair / removeIncomingRepair calls removed (16 call sites in total). State is captured by the op-end trySaveMeta added in phase 2A. - getOrCreateChannel: bootstraps from persistenceV2.loadChannel. - dropChannelFromPersistence: uses persistenceV2.dropChannel. Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8): - Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal — trySaveMeta / tryUpdateHistory log and continue; the protocol op returns ok regardless of disk failure. In-memory state is the source of truth; the next op re-issues a complete snapshot and disk catches up automatically. - Durability-intent ops (removeChannel, resetReliabilityManager via dropChannelFromPersistence; getOrCreateChannel via loadChannel): still propagate rePersistenceError, because the caller asked us to confirm a disk operation and we cannot silently lie. Test infrastructure: - tests/in_memory_persistence_v2.nim: new V2 adapter mock that decomposes the meta blob into the existing InMemoryStore shape so test assertions on store.outgoing / store.incoming / etc. continue to work without change. - tests/test_persistence.nim: 17 tests, all rewritten against V2. - 13 state-survival tests carry over with identical assertions. - "loadChannel failure surfaces as err on bootstrap" — bootstrap keeps durability-intent semantics. - "saveChannelMeta failure during send does NOT surface" — deliberate inversion of the legacy "write failure surfaces as err" test. Asserts the new non-fatal policy: op returns ok, in-memory state correct, disk re-syncs on the next op. 
- "updateHistory failure during send does NOT surface" — same policy applied to the history path. - "dropChannel failure during removeChannel surfaces as err" — kept. - All 17 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3) End-state of the snapshot-persistence refactor. The legacy 13-proc Persistence interface and its noOpPersistence are gone; the 5-proc snapshot-based interface (formerly PersistenceV2) takes their place under the canonical name. Source: - sds/types/persistence.nim: replaced 13-proc contract with the 5-proc snapshot interface (saveChannelMeta, updateHistory, loadChannel, dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere and an empty ChannelData on load. - sds/types/persistence_v2.nim: removed. - sds/types/reliability_manager.nim: dropped the second persistenceV2 field; constructor takes a single `persistence: Persistence`. - sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments updated. - sds.nim: dropped the persistenceV2 parameter from newReliabilityManager. Tests: - tests/in_memory_persistence_v2.nim: removed; its content moved to... - tests/in_memory_persistence.nim: replaces the old legacy mock with the snapshot adapter under the canonical filename. Same InMemoryStore shape so test assertions stay unchanged. - tests/test_persistence.nim: ctor param renamed, suite name de-prefixed. FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean. 
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
dirty = true # entry dropped from newOutgoingBuffer
else:
newOutgoingBuffer.add(unackMsg)
channel.outgoingBuffer = newOutgoingBuffer
2026-06-01 12:24:38 +02:00
if dirty:
await rm.trySaveMeta(channelId, channel)
ok()
finally:
rm.lock.release()
except CatchableError:
error "Failed to check unacknowledged messages",
channelId = channelId, msg = getCurrentExceptionMsg()
err(ReliabilityError.reInternalError)
proc periodicBufferSweep(rm: ReliabilityManager) {.async: (raises: [CancelledError]).} =
while true:
try:
for channelId, channel in rm.channels:
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
# Background maintenance has no caller to return to: a persistence
# error is logged (by reliabilityErr) and the sweep continues; the
# next tick retries.
discard await rm.checkUnacknowledgedMessages(channelId)
await rm.cleanBloomFilter(channelId)
except CatchableError:
error "Error in periodic buffer sweep", msg = getCurrentExceptionMsg()
await sleepAsync(chronos.milliseconds(rm.config.bufferSweepInterval.inMilliseconds))
proc periodicSyncMessage(rm: ReliabilityManager) {.async: (raises: [CancelledError]).} =
  while true:
    try:
      if not rm.onPeriodicSync.isNil():
        {.cast(raises: []).}:
          rm.onPeriodicSync()
    except CatchableError:
      error "Error in periodic sync", msg = getCurrentExceptionMsg()
    await sleepAsync(chronos.seconds(rm.config.syncMessageInterval.inSeconds))
proc runRepairSweep*(
    rm: ReliabilityManager
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
  ## SDS-R: Runs a single pass of the repair sweep.
  ## - Incoming: fires onRepairReady for expired T_resp entries and removes them
  ## - Outgoing: drops entries past T_max window
  ## Exposed so it can be driven directly in tests; also invoked by periodicRepairSweep.
  ## Acquires rm.lock so the repair buffers cannot be observed mid-mutation by
  ## a concurrent wrapOutgoingMessage / unwrapReceivedMessage on another task.
  ##
  ## Persistence model (PLAN_SNAPSHOT_PERSISTENCE.md phase 2.1): per-entry
  ## removeIncomingRepair / removeOutgoingRepair calls are replaced by a
  ## single `trySaveMeta` per *dirty* channel at the end of that channel's
  ## sweep. A persistence failure is logged but DOES NOT abort the sweep:
  ## in-memory state is the source of truth and the next op (or sweep tick)
  ## will issue a fresh self-contained snapshot.
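  ##
  ## Usage sketch (illustrative only; assumes `rm` is an already-constructed
  ## ReliabilityManager and that the caller is itself inside an async proc):
  ##   let res = await rm.runRepairSweep()
  ##   if res.isErr():
  ##     discard # e.g. log `res.error` and wait for the next sweep tick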
  try:
    await rm.lock.acquire()
    try:
      let now = getTime()
      for channelId, channel in rm.channels:
2026-06-01 12:24:38 +02:00
var dirty = false
try:
  # Check incoming repair buffer for expired T_resp (time to rebroadcast).
  # Collect ids first so the table is not mutated while it is being iterated.
  var toRebroadcast: seq[SdsMessageID] = @[]
  for msgId, entry in channel.incomingRepairBuffer:
    if now >= entry.minTimeRepairResp:
      toRebroadcast.add(msgId)
  for msgId in toRebroadcast:
    let entry = channel.incomingRepairBuffer[msgId]
    channel.incomingRepairBuffer.del(msgId)
    dirty = true
    if not rm.onRepairReady.isNil():
      # Override effect tracking for the callback invocation only.
      {.cast(raises: []).}:
        rm.onRepairReady(entry.cachedMessage, channelId)
  # Drop expired outgoing repair entries past T_max.
  # Same collect-then-delete pattern as above.
  var toRemove: seq[SdsMessageID] = @[]
  let tMaxDuration = rm.config.repairTMax
  for msgId, entry in channel.outgoingRepairBuffer:
    if now - entry.minTimeRepairReq > tMaxDuration:
      toRemove.add(msgId)
  for msgId in toRemove:
    channel.outgoingRepairBuffer.del(msgId)
2026-06-01 12:24:38 +02:00
dirty = true
except CatchableError:
  error "Error in repair sweep for channel",
    channelId = channelId, msg = getCurrentExceptionMsg()
# Snapshot only if this channel actually mutated. Skipping the call
# when clean honours the dirty-flag guard in ANALYSIS_SNAPSHOT_SAVE_POINTS
# — otherwise an idle node still issues 0.2 saves/s/channel just
# because the periodic sweep ran.
if dirty:
  await rm.trySaveMeta(channelId, channel)
ok()
  finally:
    rm.lock.release()
except CatchableError:
  error "Error in repair sweep", msg = getCurrentExceptionMsg()
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
err(ReliabilityError.reInternalError)
proc periodicRepairSweep(rm: ReliabilityManager) {.async: (raises: [CancelledError]).} =
  ## SDS-R: Periodically checks repair buffers for expired entries.
  while true:
    try:
      # Background maintenance: log a failed pass and retry next tick.
      discard await rm.runRepairSweep()
    except CatchableError:
      error "Error in periodic repair sweep", msg = getCurrentExceptionMsg()

    await sleepAsync(chronos.milliseconds(rm.config.repairSweepInterval.inMilliseconds))

proc startPeriodicTasks*(rm: ReliabilityManager) =
  ## Starts the periodic background tasks (buffer sweep, sync message,
  ## SDS-R repair sweep). The futures are kept on the manager so `cleanup`
  ## can cancel them; without that, the loops would outlive a cleaned-up
  ## manager and keep firing against cleared state.
  rm.periodicTasks.add(FutureBase(rm.periodicBufferSweep()))
  rm.periodicTasks.add(FutureBase(rm.periodicSyncMessage()))
  rm.periodicTasks.add(FutureBase(rm.periodicRepairSweep()))

proc resetReliabilityManager*(
    rm: ReliabilityManager
): Future[Result[void, ReliabilityError]] {.async: (raises: []).} =
  ## Resets the ReliabilityManager to its initial state.
  try:
    await rm.lock.acquire()
    try:
      try:
        for channelId, channel in rm.channels:
          (await rm.dropChannelFromPersistence(channelId)).isOkOr:
            return err(error)
          channel.lamportTimestamp = 0
          channel.messageHistory.clear()
          channel.outgoingBuffer.setLen(0)
          channel.incomingBuffer.clear()
          channel.outgoingRepairBuffer.clear()
          channel.incomingRepairBuffer.clear()
All 4 test suites pass: - test_bloom - test_reliability - test_persistence (17 V2 tests) - test_snapshot_codec (13 codec round-trip tests) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Persisting persistence redesign plan for reference * refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix) Addresses all three substantive review findings on PR #72 in one structural change: fold the per-op accumulator and the R2 retry buffer into a single queue on `ChannelContext`, flushed once at op end. Changes: - sds/types/channel_context.nim: add `pendingHistoryAppends` (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts` (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full SdsMessage is looked up from `messageHistory` at flush time. Documented invariant: every id in pendingHistoryAppends is also in messageHistory, upheld by the merge rule. - sds/sds_utils.nim: * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel, msgId)` — "latest-wins" merge: append cancels any pending evict and vice versa. Symmetric, simple, handles the evict-then-re-add sequence correctly (SDS-R repair re-delivering an evicted message while the backend is unreachable). * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the channel's pending queue. Dual role: per-op accumulator (multiple `addToHistory` calls within one op queue together and flush as one round-trip) AND R2 retry buffer (a failed flush leaves the queue populated for the next op to retry). * `addToHistory` queues via the helpers; does not call persistence. * Pending queue cleared on `cleanup` and `removeChannel`. - sds.nim: * `processIncomingBuffer` returns to its single-arg signature — the queue lives on the channel, no parameter threading needed. 
* `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths), `markDependenciesMet` issue exactly one `trySaveMeta` + `tryUpdateHistory` pair at op end, under the lock, with no intervening `await`-of-other-work. Matches the Persistence atomicity contract documented in `sds/types/persistence.nim`. * Pending queue cleared in `resetReliabilityManager`. - tests/test_persistence.nim: * Direct `addToHistory` callers (state-survival setup) now follow with explicit `tryUpdateHistory(channelId)` to flush. Reflects the production op-end flush pattern. * New: `updateHistory failure is retried via R2 pending-write queue` — verifies that two failed sends leave both messages on the queue, and a third successful send drains the whole queue in one call. * New: `pending queue survives idle ops` — verifies that an op with no history changes of its own still flushes a previously-failed batch at op end. * New: `evict-then-re-add merge rule preserves the re-added message on disk` — regression for the "latest-wins" merge rule. The original "evict-wins" rule would silently drop the re-add and leave the message permanently absent from disk; this test would fail under that rule and passes under the corrected one. Resolves PR #72 review comments: - #1 (delta loss on failed updateHistory) — R2 retry queue. - #2 (cascade chattiness — N updateHistory calls per op) — queue collects cascaded entries, flushed as one batch. - #3 (atomicity contract mismatch) — implementation now matches the documented "saveChannelMeta then updateHistory back-to-back" pairing. Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests). FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00
        # Drop queued history writes so a later flush cannot replay stale ids.
        channel.pendingHistoryAppends.clear()
        channel.pendingHistoryEvicts.clear()
        channel.bloomFilter = RollingBloomFilter.init(
          rm.config.bloomFilterCapacity, rm.config.bloomFilterErrorRate
        )
      rm.channels.clear()
      return ok()
    except CatchableError:
      error "Failed to reset ReliabilityManager", msg = getCurrentExceptionMsg()
      return err(ReliabilityError.reInternalError)
    finally:
      rm.lock.release()
  except CatchableError:
    error "Failed to reset ReliabilityManager (lock)", msg = getCurrentExceptionMsg()
    return err(ReliabilityError.reInternalError)