nim-sds/CLAUDE.md
NagyZoltanPeter 4ccdd122fc
refactor(persistence): snapshot-based interface (5 procs, atomic per-op) (#72)
* feat: propagate persistence backend errors via Result

The Persistence contract previously returned `Future[void]` for writes and
`Future[ChannelSnapshot]` for the loader, with `raises: []`. Backends had no
way to report a failure, so a failed write or a failed/partial read was
silently swallowed — and on the read path a mid-scan failure could bootstrap
a *truncated* channel snapshot, corrupting the rebuilt bloom filter and
lamport clock across a restart.

Make every contract field Result-returning:
  * mutating ops  -> Future[Result[void, string]]
  * loadAllForChannel -> Future[Result[ChannelSnapshot, string]]

The backend-supplied error string is mapped to a new
`ReliabilityError.rePersistenceError` (logged once at the boundary via
`reliabilityErr`) and threaded up through every persistence-touching proc to
the public API, where the caller decides what to do. Request-driven paths
(wrap/unwrap/markDependenciesMet/ensureChannel/removeChannel/reset) propagate
the error; background maintenance loops (periodicBufferSweep,
periodicRepairSweep) log and retry on the next tick, since they have no
synchronous caller.

Tests: in-memory backend gains a `failingOps` injection hook; new
"Persistence: error propagation" suite asserts read/write/drop failures
surface as `rePersistenceError`. Full suite passes (90 OK).

BREAKING CHANGE: the `Persistence` contract signature changed; custom
backends must return `Result` and `ok()` on success. Bumped to 0.3.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(persistence): add snapshot types and codec (phase 0)

Introduce atomic-snapshot persistence types that will replace the current
fine-grained 13-proc Persistence interface. This commit is purely additive:
no existing call site changes, no behaviour change.

New types (sds/types/):
- channel_meta.nim — ChannelMeta (atomic per-channel snapshot blob),
  ChannelData (bootstrap payload), OutgoingRepairKV / IncomingRepairKV
  (flattened map entries for protobuf wire shape).
- history_update.nim — HistoryUpdate (combined append/evict payload for
  the message log).

New codec (sds/snapshot_codec.nim):
- Protobuf encode/decode for all new types, reusing the existing
  SdsMessage and HistoryEntry encoders from sds/protobuf.nim.
- Explicit schemaVersion=1 on ChannelMeta; decoder rejects unknown
  versions loudly rather than silently truncating.
- Time encoded as int64 unix milliseconds.

Tests (tests/test_snapshot_codec.nim):
- 13 round-trip cases covering empty, single-entry, full-buffer, and
  repair-heavy snapshots; ChannelData ordering; HistoryUpdate variants;
  schemaVersion rejection.

Planning artefacts:
- ANALYSIS_SDS_PERSISTENCE.md — problem statement (partial-write
  divergence, chatty call rate, non-fatal-error policy gap).
- ANALYSIS_SNAPSHOT_SAVE_POINTS.md — exact save points per protocol op
  and projected call rates.
- PLAN_SNAPSHOT_PERSISTENCE.md — phased refactor plan; this commit
  implements phase 0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(persistence): add PersistenceV2 interface alongside legacy (phase 1)

Introduce the 5-proc snapshot-based Persistence interface that will
replace the legacy 13-proc one. Both coexist on `ReliabilityManager` so
phase 2 can migrate protocol ops one at a time without breaking existing
callers.

New file:
- sds/types/persistence_v2.nim — `PersistenceV2` type with
  saveChannelMeta / updateHistory / loadChannel / dropChannel /
  setRetrievalHint. `noOpPersistenceV2()` default. Doc-comments capture
  the atomicity pairing (meta save + history update issued back-to-back
  under the channel lock) and the non-fatal failure policy from PLAN §8.

Modified:
- sds/types/reliability_manager.nim — adds `persistenceV2: PersistenceV2`
  field alongside `persistence`; constructor takes both, both default to
  no-op.
- sds.nim — `newReliabilityManager` plumbs the new optional parameter.
- AGENTS.md / CLAUDE.md — GitNexus index re-indexed after phase 0 +
  phase 1 additions; symbol counts updated by `npx gitnexus analyze`.

No call site uses the new interface yet — that's phase 2. All existing
tests still pass against the legacy interface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): migrate runRepairSweep to PersistenceV2 (phase 2.1)

Per-entry removeIncomingRepair / removeOutgoingRepair calls are replaced
by a single trySaveMeta per *dirty* channel at the end of that channel's
sweep. Failure is logged but does NOT abort the sweep — in-memory state
is the source of truth (PLAN_SNAPSHOT_PERSISTENCE.md §8).

Helpers added in sds/sds_utils.nim:
- snapshotMeta(channel) — capture current ChannelContext as ChannelMeta
  blob (flattens Table-keyed buffers to seqs for the wire shape).
- trySaveMeta(rm, channelId, channel) — best-effort meta snapshot save;
  logs on failure, never propagates.
- tryUpdateHistory(rm, channelId, append, evict) — best-effort history
  update; skips the call entirely when both lists are empty (HistoryUpdate
  contract).

Call-rate impact for runRepairSweep:
- Before: N persistence calls per expired entry per channel.
- After:  at most 1 saveChannelMeta per dirty channel; 0 on idle channels
  (matches the dirty-flag floor in ANALYSIS_SNAPSHOT_SAVE_POINTS).

All existing tests pass — including the 3 SDS-R Repair Sweep tests that
directly exercise this proc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): migrate checkUnacknowledgedMessages to PersistenceV2 (phase 2.2)

Per-entry saveOutgoing / removeOutgoing calls are replaced by one
trySaveMeta at the end of the pass, conditional on a dirty flag (resend
attempt incremented, or entry expired). Pass succeeds even if the save
fails — next tick reissues the snapshot.

Call-rate impact:
- Before: N persistence calls per affected entry per pass.
- After:  at most 1 saveChannelMeta per pass; 0 when nothing aged out.

All existing tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): add V2 meta snapshot saves to foreground ops (phase 2A)

Wires `trySaveMeta` into the three public protocol ops that mutate
per-channel state — wrapOutgoingMessage, unwrapReceivedMessage, and
markDependenciesMet — at the operation's end, under the channel lock.

Legacy fine-grained persistence calls REMAIN in place; this commit is
additive. Both interfaces persist the same state simultaneously, so all
existing tests pass and a real backend wired to either interface
continues to work. Phase 2B will strip the legacy calls.

Save points match the §"Save Points" table in
ANALYSIS_SNAPSHOT_SAVE_POINTS.md exactly:
- wrapOutgoingMessage: 1 save (always)
- unwrapReceivedMessage: 1 save on every path including duplicate
  (the duplicate path still mutates the repair buffers)
- markDependenciesMet: 1 save after the processIncomingBuffer cascade

Non-fatal failure policy (PLAN §8): trySaveMeta logs and continues;
the protocol op never returns rePersistenceError for snapshot failures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): strip legacy interface from protocol path; migrate tests to V2 (phase 2B+2C+2D)

End-state of phase 2: the protocol code no longer issues any legacy
fine-grained Persistence calls. All state survives via the snapshot-based
PersistenceV2 interface — one trySaveMeta per op end, plus tryUpdateHistory
batched inside addToHistory. The legacy Persistence field on
ReliabilityManager remains for backwards compatibility; phase 3 deletes it.

Protocol changes (sds.nim, sds/sds_utils.nim):
- reviewAckStatus, processIncomingBuffer, updateLamportTimestamp →
  pure in-memory; no per-mutation persistence.
- addToHistory: replaces appendLogEntry+removeLogEntry with a single
  tryUpdateHistory call carrying (append, evict) atomically.
- getRecentHistoryEntries: setRetrievalHint switched to V2; non-fatal.
- wrapOutgoingMessage, unwrapReceivedMessage, markDependenciesMet:
  all per-row saveOutgoing / removeOutgoing / saveIncoming /
  removeIncoming / saveOutgoingRepair / removeOutgoingRepair /
  saveIncomingRepair / removeIncomingRepair calls removed (16 call
  sites in total). State is captured by the op-end trySaveMeta added
  in phase 2A.
- getOrCreateChannel: bootstraps from persistenceV2.loadChannel.
- dropChannelFromPersistence: uses persistenceV2.dropChannel.

Failure policy (PLAN_SNAPSHOT_PERSISTENCE.md §8):
- Foreground ops (wrap, unwrap, markDeps, sweeps): non-fatal —
  trySaveMeta / tryUpdateHistory log and continue; the protocol op
  returns ok regardless of disk failure. In-memory state is the source
  of truth; the next op re-issues a complete snapshot and disk catches
  up automatically.
- Durability-intent ops (removeChannel, resetReliabilityManager via
  dropChannelFromPersistence; getOrCreateChannel via loadChannel):
  still propagate rePersistenceError, because the caller asked us to
  confirm a disk operation and we cannot silently lie.

Test infrastructure:
- tests/in_memory_persistence_v2.nim: new V2 adapter mock that
  decomposes the meta blob into the existing InMemoryStore shape so
  test assertions on store.outgoing / store.incoming / etc. continue to
  work without change.
- tests/test_persistence.nim: 17 tests, all rewritten against V2.
  - 13 state-survival tests carry over with identical assertions.
  - "loadChannel failure surfaces as err on bootstrap" — bootstrap
    keeps durability-intent semantics.
  - "saveChannelMeta failure during send does NOT surface" — deliberate
    inversion of the legacy "write failure surfaces as err" test. Asserts
    the new non-fatal policy: op returns ok, in-memory state correct,
    disk re-syncs on the next op.
  - "updateHistory failure during send does NOT surface" — same policy
    applied to the history path.
  - "dropChannel failure during removeChannel surfaces as err" — kept.
- All 17 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persistence): delete legacy interface; rename PersistenceV2 -> Persistence (phase 3)

End-state of the snapshot-persistence refactor. The legacy 13-proc
Persistence interface and its noOpPersistence are gone; the 5-proc
snapshot-based interface (formerly PersistenceV2) takes their place under
the canonical name.

Source:
- sds/types/persistence.nim: replaced 13-proc contract with the 5-proc
  snapshot interface (saveChannelMeta, updateHistory, loadChannel,
  dropChannel, setRetrievalHint). noOpPersistence returns ok everywhere
  and an empty ChannelData on load.
- sds/types/persistence_v2.nim: removed.
- sds/types/reliability_manager.nim: dropped the second persistenceV2
  field; constructor takes a single `persistence: Persistence`.
- sds/sds_utils.nim: rm.persistenceV2.X -> rm.persistence.X; doc-comments
  updated.
- sds.nim: dropped the persistenceV2 parameter from newReliabilityManager.

Tests:
- tests/in_memory_persistence_v2.nim: removed; its content moved to...
- tests/in_memory_persistence.nim: replaces the old legacy mock with the
  snapshot adapter under the canonical filename. Same InMemoryStore
  shape so test assertions stay unchanged.
- tests/test_persistence.nim: ctor param renamed, suite name de-prefixed.

FFI smoke (`nimble libsdsDynamicMac`, refc/threads:on): builds clean.
All 4 test suites pass:
- test_bloom
- test_reliability
- test_persistence (17 V2 tests)
- test_snapshot_codec (13 codec round-trip tests)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Persisting persistence redesign plan for reference

* refactor(persistence): R2 pending-write queue + per-op accumulator (PR #72 review fix)

Addresses all three substantive review findings on PR #72 in one
structural change: fold the per-op accumulator and the R2 retry buffer
into a single queue on `ChannelContext`, flushed once at op end.

Changes:

- sds/types/channel_context.nim: add `pendingHistoryAppends`
  (`OrderedSet[SdsMessageID]`) and `pendingHistoryEvicts`
  (`HashSet[SdsMessageID]`) fields. Only ids are stored — the full
  SdsMessage is looked up from `messageHistory` at flush time. Documented
  invariant: every id in pendingHistoryAppends is also in messageHistory,
  upheld by the merge rule.

- sds/sds_utils.nim:
  * `queueHistoryAppend(channel, msgId)` / `queueHistoryEvict(channel,
    msgId)` — "latest-wins" merge: append cancels any pending evict
    and vice versa. Symmetric, simple, handles the evict-then-re-add
    sequence correctly (SDS-R repair re-delivering an evicted message
    while the backend is unreachable).
  * `tryUpdateHistory(rm, channelId)` — no more list params; flushes the
    channel's pending queue. Dual role: per-op accumulator (multiple
    `addToHistory` calls within one op queue together and flush as one
    round-trip) AND R2 retry buffer (a failed flush leaves the queue
    populated for the next op to retry).
  * `addToHistory` queues via the helpers; does not call persistence.
  * Pending queue cleared on `cleanup` and `removeChannel`.

- sds.nim:
  * `processIncomingBuffer` returns to its single-arg signature — the
    queue lives on the channel, no parameter threading needed.
  * `wrapOutgoingMessage`, `unwrapReceivedMessage` (all three paths),
    `markDependenciesMet` issue exactly one `trySaveMeta` +
    `tryUpdateHistory` pair at op end, under the lock, with no
    intervening `await`-of-other-work. Matches the Persistence atomicity
    contract documented in `sds/types/persistence.nim`.
  * Pending queue cleared in `resetReliabilityManager`.

- tests/test_persistence.nim:
  * Direct `addToHistory` callers (state-survival setup) now follow with
    explicit `tryUpdateHistory(channelId)` to flush. Reflects the
    production op-end flush pattern.
  * New: `updateHistory failure is retried via R2 pending-write queue` —
    verifies that two failed sends leave both messages on the queue,
    and a third successful send drains the whole queue in one call.
  * New: `pending queue survives idle ops` — verifies that an op with
    no history changes of its own still flushes a previously-failed
    batch at op end.
  * New: `evict-then-re-add merge rule preserves the re-added message
    on disk` — regression for the "latest-wins" merge rule. The original
    "evict-wins" rule would silently drop the re-add and leave the
    message permanently absent from disk; this test would fail under
    that rule and passes under the corrected one.

Resolves PR #72 review comments:
- #1 (delta loss on failed updateHistory) — R2 retry queue.
- #2 (cascade chattiness — N updateHistory calls per op) — queue collects
  cascaded entries, flushed as one batch.
- #3 (atomicity contract mismatch) — implementation now matches the
  documented "saveChannelMeta then updateHistory back-to-back" pairing.

Test summary: 50 tests pass (47 prior + 3 new R2/merge-rule tests).
FFI dylib (`nimble libsdsDynamicMac`, refc + threads:on): clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 12:24:38 +02:00

209 lines
9.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# nim-sds — Claude Code Guide
## Project Overview
**nim-sds** is a Nim implementation of the **Scalable Data Sync (SDS)** protocol ([spec](https://lip.logos.co/ift-ts/raw/sds.html), IFT LIP-109). SDS achieves end-to-end reliability when consolidating distributed logs in a decentralized manner — participants broadcast messages over a P2P transport, maintain per-channel append-only logs, and use causal ordering to reach consistent state across all nodes.
The library exposes its functionality via a C-compatible FFI so it can be embedded in applications on any platform. Go bindings are maintained in a separate repo: [logos-messaging/sds-go-bindings](https://github.com/logos-messaging/sds-go-bindings).
---
## Protocol Concepts
Understanding these is essential before modifying any core code.
### Message format
Each SDS message carries (`sds/message.nim`, `sds/protobuf.nim`):
| Field | Purpose |
|---|---|
| `message_id` | Globally unique, immutable identifier |
| `sender_id` | Originating participant |
| `channel_id` | Communication channel |
| `lamport_timestamp` | Logical clock for ordering |
| `causal_history` | IDs of the 23 most recent messages the sender has seen (dependencies) |
| `bloom_filter` | Compact summary of all message IDs the sender has received |
| `content` | Application payload |
### Sending a message (`sds/sds_utils.nim` → `wrapOutgoingMessage`)
1. Increment the per-channel Lamport timestamp to `max(current_time_ms, timestamp + 1)`.
2. Attach causal history from the local log tail.
3. Embed the current bloom filter snapshot.
### Receiving a message (`sds/sds_utils.nim` → `unwrapReceivedMessage`)
1. Deduplicate by `message_id`.
2. Check causal dependencies — if any predecessor is missing, buffer the message.
3. When all dependencies are met, deliver: insert into the ordered local log (Lamport timestamp, tie-break by ascending `message_id`).
4. Record `message_id` in the bloom filter.
### Periodic sync
A node periodically broadcasts a message with empty content carrying an updated Lamport timestamp and bloom filter. These sync messages are not persisted and are excluded from causal chains. They help peers detect gaps in their logs.
### SDS-R (Repair extension)
Defined in the spec but **not yet implemented** in this library.
### Bloom filter (`sds/bloom.nim`, `sds/rolling_bloom_filter.nim`)
Used to compactly summarise which messages a node has received, so peers can identify gaps without exchanging full ID lists. The rolling variant automatically resets when capacity is exceeded.
---
## Repository Layout
```
sds/ # Core protocol (pure Nim, no FFI)
message.nim # SdsMessage, HistoryEntry, config constants
sds_utils.nim # ReliabilityManager — send/receive/buffer logic
protobuf.nim # Protobuf encode/decode for SdsMessage
protobufutil.nim # Low-level protobuf helpers
bloom.nim # Bloom filter implementation
rolling_bloom_filter.nim # Adaptive rolling bloom filter
library/ # C FFI wrapper around the core
libsds.nim # Exported C-compatible entry points
libsds.h # C header
ffi_types.nim # C-compatible types and return codes
alloc.nim # Memory allocation helpers
sds_thread/ # Per-context Chronos async worker thread
events/ # JSON serialisation for event callbacks
tests/
test_bloom.nim # Bloom filter unit tests
test_reliability.nim # Protocol-level unit tests
sds.nim # Root module — re-exports public API
sds.nimble # Package manifest + build tasks
flake.nix / Makefile # Reproducible cross-platform build system
```
---
## Key Types
| Type | File | Role |
|---|---|---|
| `ReliabilityManager` | `sds_utils.nim` | Per-channel protocol state: Lamport clock, bloom filter, log, buffers |
| `ReliabilityConfig` | `sds_utils.nim` | Tunable parameters (bloom capacity, history length, resend interval) |
| `SdsMessage` | `message.nim` | Wire message |
| `HistoryEntry` | `message.nim` | `message_id` + optional retrieval hint |
| `UnacknowledgedMessage` | `message.nim` | Outgoing message with resend counter |
| `IncomingMessage` | `message.nim` | Buffered message waiting on missing dependencies |
---
## FFI API (`library/libsds.nim`)
The C API wraps `ReliabilityManager` behind an opaque `SdsContext` handle:
| Export | Maps to |
|---|---|
| `SdsNewReliabilityManager` | Create context |
| `SdsWrapOutgoingMessage` | `wrapOutgoingMessage` |
| `SdsUnwrapReceivedMessage` | `unwrapReceivedMessage` |
| `SdsMarkDependenciesMet` | Notify buffered-message dependencies satisfied |
| `SdsSetEventCallback` | Register event handler (JSON payloads) |
| `SdsSetRetrievalHintProvider` | Register hint-provider callback |
| `SdsStartPeriodicTasks` | Start periodic sync loop |
| `SdsCleanupReliabilityManager` | Free context |
| `SdsResetReliabilityManager` | Reset state without freeing |
Each `SdsContext` runs a dedicated Chronos async loop on a worker thread; application threads communicate with it via SPSC channels.
---
## Running Tests
```bash
nimble test
```
Nix can also provide the environment if a local Nim install is not available:
```bash
nix develop --command nimble test
```
---
## Code Conventions
- **Types**: PascalCase (`ReliabilityManager`, `SdsMessage`)
- **Variables/procs**: camelCase
- **Public exports**: trailing `*`
- **Errors**: `Result[T, ReliabilityError]` — use `.valueOr`, `.isOk()`, `.isErr()`
- **Locks**: `withLock` macro (RAII); all exported procs are `{.gcsafe.}`
- **Constants**: `Default` prefix (e.g., `DefaultBloomFilterCapacity`)
- **Backward compat**: `sds/protobuf.nim` supports old and new causal history formats — do not remove the legacy decode path
---
## Building
**Nimble** is the primary build tool. Desktop library targets:
```bash
nimble libsdsDynamicMac # macOS .dylib
nimble libsdsDynamicLinux # Linux .so
nimble libsdsStaticMac # macOS .a
nimble libsdsStaticLinux # Linux .a
```
**Nix** (`flake.nix`) and **Make** (`Makefile`) are optional conveniences that wrap Nimble for reproducible and cross-platform (including Android/iOS) builds.
## Dependency Management
Nimble dependencies are locked in `nimble.lock`.
```bash
nimble setup -l # local setup
nimble lock # update lock after changing sds.nimble
```
If using Nix, also recalculate the fixed-output hash in `nix/deps.nix` after updating `nimble.lock` (run `nix build`, copy the expected hash from the error, paste into `outputHash`).
<!-- gitnexus:start -->
# GitNexus — Code Intelligence
This project is indexed by GitNexus as **nim-sds** (1100 symbols, 1820 relationships, 66 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.
> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first.
## Always Do
- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user.
- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows.
- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits.
- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance.
- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`.
## Never Do
- NEVER edit a function, class, or method without first running `gitnexus_impact` on it.
- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis.
- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph.
- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope.
## Resources
| Resource | Use for |
|----------|---------|
| `gitnexus://repo/nim-sds/context` | Codebase overview, check index freshness |
| `gitnexus://repo/nim-sds/clusters` | All functional areas |
| `gitnexus://repo/nim-sds/processes` | All execution flows |
| `gitnexus://repo/nim-sds/process/{name}` | Step-by-step execution trace |
## CLI
| Task | Read this skill file |
|------|---------------------|
| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` |
| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` |
| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` |
| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` |
| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` |
| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` |
<!-- gitnexus:end -->