
Observation Runtime Plan

Why this work exists

TF is good at deployment plumbing. It is weak at continuous observation.

Today, the same problems are solved repeatedly with custom loops:

  • TF block feed logic in Logos
  • Cucumber manual-cluster polling loops
  • ad hoc catch-up scans for wallet and chain state
  • app-local state polling in expectations

That is the gap this work should close.

The goal is not a generic "distributed systems DSL". The goal is one reusable observation runtime that:

  • continuously collects data from dynamic sources
  • keeps typed materialized state
  • exposes both current snapshot and delta/history views
  • fits naturally in TF scenarios and Cucumber manual-cluster code

Constraints

TF constraints

  • TF abstractions must stay universal and simple.
  • TF must not know app semantics like blocks, wallets, leaders, jobs, or topics.
  • TF must remain useful for simple apps such as openraft_kv, not only Logos.

App constraints

  • Apps must be able to build richer abstractions on top of TF.
  • Logos must be able to support:
    • current block-feed replacement
    • fork-aware chain state
    • public-peer sync targets
    • multi-wallet UTXO tracking
  • Apps must be able to adopt this incrementally.

Migration constraints

  • We do not want a flag-day rewrite.
  • Existing loops can coexist with the new runtime until replacements are proven.

Non-goals

This work should not:

  • move the block feed back onto the base Application trait
  • build app-specific semantics into TF core
  • replace filesystem blockchain snapshots used for startup/restore
  • force every app to use continuous observation
  • introduce a large public abstraction stack that nobody can explain

Core idea

Introduce one TF-level observation runtime.

That runtime owns:

  • source refresh
  • scheduling
  • polling/ingestion
  • bounded history
  • latest snapshot caching
  • delta publication
  • freshness/error tracking
  • lifecycle hooks for TF and Cucumber

Apps own:

  • source types
  • raw observation logic
  • materialized state
  • snapshot shape
  • delta/event shape
  • higher-level projections such as wallet state

Public TF surface

The TF public surface should stay small.

ObservedSource<S>

A named source instance.

Used for:

  • local node clients
  • public peer endpoints
  • any other app-owned source type

SourceProvider<S>

Returns the current source set.

This must support dynamic source lists because:

  • manual cluster nodes come and go
  • Cucumber worlds may attach public peers
  • node control may restart or replace sources during a run

Observer

App-owned observation logic.

It defines:

  • Source
  • State
  • Snapshot
  • Event

And it implements:

  • init(...)
  • poll(...)
  • snapshot(...)

The important boundary is:

  • TF owns the runtime
  • app code owns materialization

ObservationRuntime

The engine that:

  • starts the loop
  • refreshes sources
  • calls poll(...)
  • stores history
  • publishes deltas
  • updates latest snapshot
  • tracks last error and freshness
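
The bullets above can be sketched as one synchronous tick. This is illustrative only: the real runtime would run an async loop around this, and names like `MiniRuntime` and `history_cap` are placeholders, not proposed API.

```rust
use std::collections::VecDeque;

// Illustrative, synchronous sketch of a single runtime tick, showing the
// order of operations: refresh sources -> poll -> bounded history ->
// publish deltas -> track last error.
struct MiniRuntime<S, St, Ev> {
    sources: Vec<(String, S)>, // stands in for Vec<ObservedSource<S>>
    state: St,
    history: VecDeque<Ev>,
    history_cap: usize,
    last_error: Option<String>,
}

impl<S, St, Ev: Clone> MiniRuntime<S, St, Ev> {
    fn tick(
        &mut self,
        refresh: impl Fn() -> Vec<(String, S)>,
        poll: impl Fn(&[(String, S)], &mut St) -> Result<Vec<Ev>, String>,
    ) -> Vec<Ev> {
        // 1. refresh the dynamic source set
        self.sources = refresh();
        // 2. poll: app-owned materialization runs here
        match poll(&self.sources, &mut self.state) {
            Ok(events) => {
                // 3. keep bounded history
                for ev in &events {
                    if self.history.len() == self.history_cap {
                        self.history.pop_front();
                    }
                    self.history.push_back(ev.clone());
                }
                // 4. clear error, publish deltas (returned here; the real
                // runtime would broadcast to subscribers)
                self.last_error = None;
                events
            }
            Err(e) => {
                self.last_error = Some(e);
                Vec::new()
            }
        }
    }
}
```

The point of the sketch is the ordering, not the types: source refresh happens every tick, so dynamic source lists fall out of the design rather than being bolted on.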

ObservationHandle

The read-side interface for workloads, expectations, and Cucumber steps.

It should expose at least:

  • latest snapshot
  • delta subscription
  • bounded history
  • last error
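
A hypothetical shape for the handle; method names are illustrative, and the delta subscription is omitted to keep the sketch std-only (a real handle would likely hand out a broadcast receiver or cursor).

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Read-side view over state the runtime keeps updating. The Arc/Mutex
// cells are shared with the runtime's write side.
pub struct ObservationHandle<Snap, Ev> {
    latest: Arc<Mutex<Option<Snap>>>,
    history: Arc<Mutex<VecDeque<Ev>>>,
    last_error: Arc<Mutex<Option<String>>>,
}

impl<Snap: Clone, Ev: Clone> ObservationHandle<Snap, Ev> {
    /// Current materialized snapshot, if a poll has completed.
    pub fn latest(&self) -> Option<Snap> {
        self.latest.lock().unwrap().clone()
    }

    /// Bounded history of published deltas, oldest first.
    pub fn history(&self) -> Vec<Ev> {
        self.history.lock().unwrap().iter().cloned().collect()
    }

    /// Error from the most recent poll, if it failed.
    pub fn last_error(&self) -> Option<String> {
        self.last_error.lock().unwrap().clone()
    }
}
```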

Intended shape

use async_trait::async_trait;

/// Boxed error type returned by observer methods.
pub type DynError = Box<dyn std::error::Error + Send + Sync>;

pub struct ObservedSource<S> {
    pub name: String,
    pub source: S,
}

#[async_trait]
pub trait SourceProvider<S>: Send + Sync + 'static {
    async fn sources(&self) -> Vec<ObservedSource<S>>;
}

#[async_trait]
pub trait Observer: Send + Sync + 'static {
    type Source: Clone + Send + Sync + 'static;
    type State: Send + Sync + 'static;
    type Snapshot: Clone + Send + Sync + 'static;
    type Event: Clone + Send + Sync + 'static;

    async fn init(
        &self,
        sources: &[ObservedSource<Self::Source>],
    ) -> Result<Self::State, DynError>;

    async fn poll(
        &self,
        sources: &[ObservedSource<Self::Source>],
        state: &mut Self::State,
    ) -> Result<Vec<Self::Event>, DynError>;

    fn snapshot(&self, state: &Self::State) -> Self::Snapshot;
}

This is enough.

If more helper layers are needed, they should stay internal first.

How current use cases fit

openraft_kv

Use one simple observer.

  • sources: node clients
  • state: latest per-node Raft state
  • snapshot: sorted node-state view
  • events: optional deltas, possibly empty at first

This is the simplest proving case. It validates the runtime without dragging in Logos complexity.
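
A synchronous sketch of that observer's core, with the async plumbing and HTTP client elided; `RaftNodeState` and the `fetch` signature are assumptions about the `/state` response shape, not the real openraft_kv types.

```rust
use std::collections::BTreeMap;

// Assumed minimal view of a node's /state response.
#[derive(Clone, Debug, PartialEq)]
struct RaftNodeState {
    term: u64,
    is_leader: bool,
}

// State: latest per-node Raft state, keyed by source name.
type KvState = BTreeMap<String, RaftNodeState>;

// Snapshot: sorted node-state view (BTreeMap iterates in key order).
type KvSnapshot = Vec<(String, RaftNodeState)>;

// One poll pass: fetch each node's state and emit an event only when a
// node's state actually changed. `fetch` stands in for a GET of /state.
fn poll_once(
    sources: &[String],
    state: &mut KvState,
    fetch: impl Fn(&str) -> Option<RaftNodeState>,
) -> Vec<(String, RaftNodeState)> {
    let mut events = Vec::new();
    for name in sources {
        if let Some(observed) = fetch(name) {
            if state.get(name) != Some(&observed) {
                state.insert(name.clone(), observed.clone());
                events.push((name.clone(), observed));
            }
        }
    }
    events
}

fn snapshot(state: &KvState) -> KvSnapshot {
    state.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}
```

Note that "events: optional deltas" costs almost nothing here: change detection against the stored state gives deltas for free.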

Logos block feed replacement

Use one shared chain observer.

  • sources: local node clients
  • state:
    • node heads
    • block graph
    • heights
    • seen headers
    • recent history
  • snapshot:
    • current head/lib/graph summary
  • events:
    • newly discovered blocks

This covers both existing Logos feed use cases:

  • current snapshot consumers
  • delta/subscription consumers
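
As a rough sketch of that state/snapshot/event split, under the assumption that headers are identified by string ids (the real Logos types will differ):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical shape of the shared chain observer's state; field names
// are illustrative only.
struct ChainState {
    node_heads: HashMap<String, String>,          // node name -> head header id
    block_graph: HashMap<String, Option<String>>, // header id -> parent id
    heights: HashMap<String, u64>,                // header id -> height
    seen_headers: HashSet<String>,
    recent: VecDeque<String>,                     // bounded recent history
}

#[derive(Clone)]
struct ChainSnapshot {
    head: Option<String>,
    lib: Option<String>, // last irreversible block
    block_count: usize,
}

#[derive(Clone)]
enum ChainEvent {
    NewBlock { header: String, height: u64 },
}
```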

Cucumber manual-cluster sync

Use the same observer runtime with a different source set.

  • sources:
    • local manual-cluster node clients
    • public peer endpoints
  • state:
    • local consensus views
    • public consensus views
    • derived majority public target
  • snapshot:
    • current local and public sync picture

This removes custom poll/sleep loops from steps.
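
The derived majority public target could be computed roughly like this; `Head` and the height tie-break are assumptions, not settled design.

```rust
use std::collections::HashMap;

// Each public peer reports a head block id at some height (assumed shape).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Head {
    height: u64,
    id: String,
}

// Derive the majority public sync target: the head reported by the most
// public peers, ties broken in favor of greater height.
fn majority_target(public_heads: &[Head]) -> Option<Head> {
    let mut counts: HashMap<&Head, usize> = HashMap::new();
    for h in public_heads {
        *counts.entry(h).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|(head, count)| (*count, head.height))
        .map(|(head, _)| head.clone())
}
```

Because the observer keeps public consensus views as maintained state, steps can read this target directly instead of re-polling peers inline.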

Multi-wallet fork-aware tracking

This should not be a TF concept.

It should be a Logos projection built on top of the shared chain observer.

  • input: chain observer state
  • output: a wallet state cache keyed by block header
  • property: naturally fork-aware because it follows actual ancestry

That replaces repeated backward scans from tip with continuous maintained state.
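
A minimal sketch of such a projection, assuming a parent-linked block graph and a single per-block balance delta (real wallet state is richer than one integer):

```rust
use std::collections::HashMap;

// Assumed minimal block-graph shape; real Logos types will differ.
#[derive(Clone)]
struct Block {
    parent: Option<String>, // parent header id; None at genesis
    delta: i64,             // net wallet balance change in this block
}

// Per-header wallet state, extended incrementally along actual ancestry.
// Because each entry derives from its parent's entry, two forks get
// independent wallet states for free.
fn wallet_balance_at(
    graph: &HashMap<String, Block>,
    cache: &mut HashMap<String, i64>,
    header: &str,
) -> i64 {
    if let Some(v) = cache.get(header) {
        return *v;
    }
    let block = &graph[header];
    let parent_balance = match &block.parent {
        Some(p) => wallet_balance_at(graph, cache, p),
        None => 0,
    };
    let balance = parent_balance + block.delta;
    cache.insert(header.to_string(), balance);
    balance
}
```

The cache only ever extends forward from already-computed ancestors, which is exactly the "continuous maintained state" property described above.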

Logos layering

Logos should not put every concern into one giant impl.

Recommended layering:

  1. Chain source adapter

    • local node reads
    • public peer reads
  2. Shared chain observer

    • catch-up
    • continuous ingestion
    • graph/history materialization
  3. Logos projections

    • head view
    • public sync target
    • fork graph queries
    • wallet state
    • tx inclusion helpers

TF provides the runtime. Logos provides the domain model built on top.

Adoption plan

Phase 1: add TF observation runtime

  • add ObservedSource, SourceProvider, Observer, ObservationRuntime, ObservationHandle
  • keep the public API small
  • no app migrations yet

Phase 2: prove it on openraft_kv

  • add one simple observer over /state
  • migrate one expectation to use the observation handle
  • validate local, compose, and k8s

Phase 3: add Logos shared chain observer

  • implement it alongside current feed/loops
  • do not remove existing consumers yet
  • prove snapshot and delta outputs are useful

Phase 4: migrate one Logos consumer at a time

Suggested order:

  1. fork/head snapshot consumer
  2. tx inclusion consumer
  3. Cucumber sync-to-public-chain logic
  4. wallet/UTXO tracking

Phase 5: delete old loops and feed paths

  • only after the new runtime has replaced real consumers cleanly

Validation gates

Each phase should have clear checks.

Runtime-level

  • crate-level cargo check
  • targeted tests for runtime lifecycle and history retention
  • explicit tests for dynamic source refresh
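
A dynamic-source-refresh test could be as small as this sketch; `DynamicProvider` and `tick` are stand-ins for the runtime's provider refresh, not proposed names.

```rust
use std::cell::RefCell;

// Stand-in provider whose source list changes between ticks, the way a
// manual-cluster node set comes and goes.
struct DynamicProvider {
    sources: RefCell<Vec<String>>,
}

impl DynamicProvider {
    fn sources(&self) -> Vec<String> {
        self.sources.borrow().clone()
    }
    fn replace(&self, new: Vec<String>) {
        *self.sources.borrow_mut() = new;
    }
}

// Each tick must re-query the provider so replaced or late-arriving
// sources are observed; `polled` records what each tick actually saw.
fn tick(provider: &DynamicProvider, polled: &mut Vec<Vec<String>>) {
    polled.push(provider.sources());
}
```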

App-level

  • openraft_kv:
    • local failover
    • compose failover
    • k8s failover
  • Logos:
    • one snapshot consumer migrated
    • one delta consumer migrated
  • Cucumber:
    • one manual-cluster sync path migrated

Open questions

These should stay open until implementation forces a decision:

  • whether ObservationHandle should expose full history directly or only cursor/subscription access
  • how much error/freshness metadata belongs in the generic runtime vs app snapshot types
  • whether multiple observers should share one scheduler/runtime instance or simply run independently first

Design guardrails

When implementing this work:

  • keep TF public abstractions minimal
  • keep app semantics out of TF core
  • do not chase a generic testing DSL
  • build from reusable blocks, not one-off mega impls
  • keep migration incremental
  • prefer simple, explainable runtime behavior over clever abstraction