logos-blockchain-testing/docs/external-network-architecture.md

External Network Integration Architecture (High-Level)

Purpose

Extend the current testing framework without breaking existing scenarios:

  • Keep existing managed deployer flow.
  • Add optional support for attaching to existing clusters.
  • Add optional support for explicit external nodes.
  • Unify all nodes behind one runtime inventory and capability model.

Architecture Diagram

flowchart TD
    A[ScenarioSpec]
    BO[Bootstrap Orchestrator]

    A --> B[Managed Nodes Spec\ncount/config/patches]
    A --> C[Attach Spec\ntyped k8s/compose source]
    A --> D[External Nodes Spec\nstatic endpoints]

    B --> E[Deployer\nlocal/docker/k8s]
    C --> F[AttachProvider\nk8s/compose/...]
    D --> BO

    E --> G[Managed Node Handles\norigin=Managed, ownership=Owned]
    F --> H[Attached Node Handles\norigin=Attached, ownership=Borrowed]
    D --> I[External Node Handles\norigin=External, ownership=Borrowed]

    G --> BO
    H --> BO
    I --> BO

    BO --> J[NodeInventory]
    BO --> BR[Readiness Barrier]
    BR --> J

    J --> K[Scenario Validator\ncapability + ownership checks]
    K --> L[Scenario Execution\nsteps/workloads/assertions]

    J --> M[NodeHandle API]
    M --> N[Query / Tx Submit]
    M --> O[Lifecycle Ops\nstart/stop/restart]
    M --> P[Config Patch]

    Q[Capabilities]
    Q --> N
    Q --> O
    Q --> P

    R[Ownership Policy]
    R --> O
    R --> P
    R --> S[Cleanup Controller\nOwned only]

    T[Observability]
    T --> U[Inventory table at start]
    T --> V[Progress + retry logs]
    T --> W[Per-node diagnostics]

Component Responsibilities

  • ScenarioSpec: declares managed, attached, and external node sources.
  • Deployer: provisions nodes owned by the framework.
  • AttachProvider: discovers pre-existing nodes from an external system.
  • External Nodes Spec: explicit static endpoints for already-running nodes.
  • NodeInventory: single runtime list of all nodes used by scenario steps.
  • NodeHandle: unified node interface with origin, ownership, capabilities, and client.
  • Bootstrap Orchestrator: coordinates provisioning, discovery, peer/bootstrap policy, and readiness.
  • Scenario Validator: rejects unsupported operations before execution.
  • Cleanup Controller: tears down only owned resources.
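
The responsibilities above suggest a small set of domain types. The following is a minimal sketch, not the framework's actual definitions: the `Origin` and `Ownership` variants follow the diagram labels, while the `Capabilities` field names, the `managed`/`borrowed` constructors, and the `endpoint` field are hypothetical. The per-node client is omitted to keep the example self-contained.

```rust
/// Where a node came from. Variant names follow the diagram labels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Origin {
    Managed,
    Attached,
    External,
}

/// Whether the framework may mutate or delete the node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ownership {
    Owned,
    Borrowed,
}

/// Operations a node supports (hypothetical field names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Capabilities {
    pub query: bool,
    pub submit_tx: bool,
    pub lifecycle: bool,
    pub config_patch: bool,
}

/// Unified node interface; the client is left out of this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeHandle {
    pub id: String,
    pub endpoint: String,
    pub origin: Origin,
    pub ownership: Ownership,
    pub capabilities: Capabilities,
}

impl NodeHandle {
    /// Deployer-provisioned node: owned, all operations available.
    pub fn managed(id: &str, endpoint: &str) -> Self {
        Self {
            id: id.to_string(),
            endpoint: endpoint.to_string(),
            origin: Origin::Managed,
            ownership: Ownership::Owned,
            capabilities: Capabilities { query: true, submit_tx: true, lifecycle: true, config_patch: true },
        }
    }

    /// Attached or external node: borrowed, query/submit only by default.
    pub fn borrowed(id: &str, endpoint: &str, origin: Origin) -> Self {
        Self {
            id: id.to_string(),
            endpoint: endpoint.to_string(),
            origin,
            ownership: Ownership::Borrowed,
            capabilities: Capabilities { query: true, submit_tx: true, lifecycle: false, config_patch: false },
        }
    }
}

/// Single runtime list of all nodes used by scenario steps.
#[derive(Debug, Default)]
pub struct NodeInventory {
    nodes: Vec<NodeHandle>,
}

impl NodeInventory {
    pub fn push(&mut self, node: NodeHandle) {
        self.nodes.push(node);
    }

    pub fn iter(&self) -> impl Iterator<Item = &NodeHandle> + '_ {
        self.nodes.iter()
    }

    /// Nodes the Cleanup Controller would be allowed to tear down.
    pub fn owned(&self) -> impl Iterator<Item = &NodeHandle> + '_ {
        self.nodes.iter().filter(|n| n.ownership == Ownership::Owned)
    }
}
```

With this shape, steps can iterate the inventory without knowing which provider produced a node, and cleanup only ever sees the result of `owned()`.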

Bootstrap Control Flow (Coordinator Responsibility)

The Bootstrap Orchestrator owns deployment-time coordination:

  1. Resolve ScenarioSpec inputs (managed, attach, external).
  2. Ask Deployer to provision/start managed nodes.
  3. Ask AttachProvider to discover attached nodes.
  4. Normalize all outputs, including the static external endpoints, into NodeHandles.
  5. Merge into NodeInventory with stable IDs and dedup.
  6. Apply bootstrap policy (seeds/peers/network join strategy).
  7. Wait on readiness barrier (required nodes or quorum).
  8. Run preflight validation (capability + ownership constraints).
  9. Hand off to scenario execution.
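
The control flow above can be sketched as a single function. This is an illustration under stated assumptions: `Node`, `Spec`, `BootstrapError`, `FakeDeployer`, and the trait method signatures are all hypothetical, dedup uses a plain string ID, the readiness barrier is reduced to an all-ready check, and the bootstrap policy and preflight validation steps are left as comments.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Node {
    pub id: String,
    pub owned: bool,
    pub ready: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootstrapError {
    Provision(String),
    Discovery(String),
    NotReady(Vec<String>),
}

pub trait Deployer {
    fn provision(&self, count: usize) -> Result<Vec<Node>, BootstrapError>;
}

pub trait AttachProvider {
    fn discover(&self) -> Result<Vec<Node>, BootstrapError>;
}

/// Hypothetical, already-resolved scenario inputs (step 1).
pub struct Spec {
    pub managed_count: usize,
    pub external: Vec<Node>,
}

/// In-memory stand-in for a real deployer, used only for illustration.
pub struct FakeDeployer;

impl Deployer for FakeDeployer {
    fn provision(&self, count: usize) -> Result<Vec<Node>, BootstrapError> {
        Ok((0..count)
            .map(|i| Node { id: format!("managed-{}", i), owned: true, ready: true })
            .collect())
    }
}

pub fn bootstrap(
    spec: &Spec,
    deployer: &dyn Deployer,
    attach: Option<&dyn AttachProvider>,
) -> Result<Vec<Node>, BootstrapError> {
    let mut nodes = Vec::new();

    // Step 2: provision managed nodes; any failure aborts via `?`.
    if spec.managed_count > 0 {
        nodes.extend(deployer.provision(spec.managed_count)?);
    }
    // Step 3: discover attached nodes only when a provider is configured.
    if let Some(provider) = attach {
        nodes.extend(provider.discover()?);
    }
    // Step 4: external endpoints join the same list.
    nodes.extend(spec.external.iter().cloned());

    // Step 5: dedup by ID, keeping the first occurrence.
    let mut seen = HashSet::new();
    nodes.retain(|n| seen.insert(n.id.clone()));

    // Step 6 (bootstrap policy) is omitted from this sketch.

    // Step 7: readiness barrier, simplified to "all nodes ready".
    let not_ready: Vec<String> =
        nodes.iter().filter(|n| !n.ready).map(|n| n.id.clone()).collect();
    if !not_ready.is_empty() {
        return Err(BootstrapError::NotReady(not_ready));
    }

    // Step 8 (preflight validation) is omitted; step 9 is the return value.
    Ok(nodes)
}
```

Keeping the sequence in one function makes the ordering explicit: nothing reaches scenario execution unless provisioning, discovery, and readiness have all succeeded.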

Bootstrap Flow Diagram

sequenceDiagram
    participant SS as ScenarioSpec
    participant BO as Bootstrap Orchestrator
    participant D as Deployer
    participant AP as AttachProvider
    participant NI as NodeInventory
    participant SV as Scenario Validator
    participant SE as Scenario Execution

    SS->>BO: Build request (managed/attach/external)
    BO->>D: Provision/start managed nodes
    D-->>BO: Managed node handles
    BO->>AP: Discover attached cluster nodes
    AP-->>BO: Attached node handles
    BO->>BO: Normalize + dedup + apply bootstrap policy
    BO->>NI: Construct unified inventory
    BO->>BO: Readiness barrier (all/quorum policy)
    BO->>SV: Validate capabilities + ownership
    SV-->>BO: OK / typed error
    BO->>SE: Start scenario runtime

Key Semantics

  • Backward-compatible by default: managed-only scenarios work unchanged.
  • managed_count = 0 is valid for external-only or attach-only scenarios.
  • Lifecycle and config patch operations are gated by capability + ownership.
  • Steps operate on NodeInventory, not on deployer-specific logic.

Ownership and Capability Model

  • Owned nodes: may allow lifecycle and patch operations; included in cleanup.
  • Borrowed nodes: lifecycle and patch operations are denied by default; only query and transaction submit are allowed unless mutation is explicitly enabled.
  • Capability checks happen before action execution and return typed, contextual errors.
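
A minimal sketch of such a gate, assuming a hypothetical `Operation` enum, an `allow_borrowed_mutation` opt-in flag, and an `OpError` type (none of these names come from the framework). The check runs before any action and returns an error that names both the node and the operation.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ownership {
    Owned,
    Borrowed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Query,
    SubmitTx,
    Lifecycle,
    ConfigPatch,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Node {
    pub id: String,
    pub ownership: Ownership,
    /// Operations the node is technically able to perform.
    pub supported: Vec<Operation>,
    /// Explicit opt-in for mutating a borrowed node (hypothetical flag).
    pub allow_borrowed_mutation: bool,
}

/// Typed, contextual errors: each variant carries the node and operation.
#[derive(Debug, PartialEq, Eq)]
pub enum OpError {
    Unsupported { node: String, op: Operation },
    BorrowedDenied { node: String, op: Operation },
}

impl fmt::Display for OpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OpError::Unsupported { node, op } => {
                write!(f, "node {} does not support {:?}", node, op)
            }
            OpError::BorrowedDenied { node, op } => {
                write!(f, "{:?} denied on borrowed node {}", op, node)
            }
        }
    }
}

impl std::error::Error for OpError {}

/// Capability check first, then the ownership policy for mutating operations.
pub fn check(node: &Node, op: Operation) -> Result<(), OpError> {
    if !node.supported.contains(&op) {
        return Err(OpError::Unsupported { node: node.id.clone(), op });
    }
    let mutating = matches!(op, Operation::Lifecycle | Operation::ConfigPatch);
    if mutating && node.ownership == Ownership::Borrowed && !node.allow_borrowed_mutation {
        return Err(OpError::BorrowedDenied { node: node.id.clone(), op });
    }
    Ok(())
}
```

Separating the two error variants lets a caller distinguish "this node cannot do that" from "this node could, but policy forbids it".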

Manual Cluster Compatibility

Manual cluster mode maps naturally to the same model:

  • If the manual cluster starts the node processes itself: treat nodes as Managed + Owned.
  • If the manual cluster connects to existing nodes: treat nodes as Attached/External + Borrowed.

This keeps scenario logic reusable while preserving explicit safety boundaries.

Critical Design Decisions To Lock Early

  • Identity/dedup rule: define a canonical node identity (peer_id preferred over endpoint) to prevent duplicate handles.
  • Bootstrap policy: define how peers are selected across mixed sources (managed/attached/external).
  • Readiness semantics: decide whether to require all nodes, a subset, or a quorum, and define per-step override rules.
  • Safety boundaries: default deny lifecycle/patch operations for borrowed nodes.
  • Compatibility checks: fail fast on incompatible network/genesis/protocol versions.
  • Failure policy: decide when attach/discovery failures are fatal vs degradable.

Proposed defaults for these decisions:

  • Node identity: use peer_id as canonical key; fall back to (host, port) only when peer_id is unavailable.
  • Dedup merge: if same canonical identity appears from multiple sources, keep one handle and record all origins for diagnostics.
  • Bootstrap peers: every managed node gets at least 2 seed peers from distinct origins when possible.
  • Readiness gate: phase 1 default is AllReady (all known nodes must pass readiness). Keep policy extensible for Quorum and future SourceAware readiness.
  • Borrowed node safety: lifecycle and config patch disabled by default for borrowed nodes; explicit opt-in required.
  • Compatibility preflight: enforce matching chain/network id + protocol version before scenario start.
  • Failure handling:
    • managed provisioning failure: fatal
    • attach discovery empty result: fatal if attach requested
    • partial attach discovery: warn + continue only if readiness quorum still satisfiable
  • Cleanup: delete owned artifacts only; never mutate or delete borrowed node resources.
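
The identity and dedup-merge rules could look roughly like this. The type names (`NodeIdentity`, `Discovered`, `Merged`) and the `dedup` function are hypothetical. The sketch keeps one entry per canonical identity and records every origin that reported it. Note one limitation: a node reported once with a peer_id and once without is not merged, because its two identities differ.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Origin {
    Managed,
    Attached,
    External,
}

/// Canonical identity: peer_id when known, otherwise (host, port).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum NodeIdentity {
    PeerId(String),
    Endpoint { host: String, port: u16 },
}

/// A node as reported by one source, before merging.
#[derive(Debug, Clone)]
pub struct Discovered {
    pub peer_id: Option<String>,
    pub host: String,
    pub port: u16,
    pub origin: Origin,
}

impl Discovered {
    pub fn identity(&self) -> NodeIdentity {
        match &self.peer_id {
            Some(id) => NodeIdentity::PeerId(id.clone()),
            None => NodeIdentity::Endpoint { host: self.host.clone(), port: self.port },
        }
    }
}

/// One entry per canonical identity, with all origins kept for diagnostics.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Merged {
    pub identity: NodeIdentity,
    pub origins: Vec<Origin>,
}

pub fn dedup(nodes: &[Discovered]) -> Vec<Merged> {
    // BTreeMap gives a deterministic output order for stable inventory tables.
    let mut by_identity: BTreeMap<NodeIdentity, Vec<Origin>> = BTreeMap::new();
    for node in nodes {
        let origins = by_identity.entry(node.identity()).or_default();
        if !origins.contains(&node.origin) {
            origins.push(node.origin);
        }
    }
    by_identity
        .into_iter()
        .map(|(identity, origins)| Merged { identity, origins })
        .collect()
}
```

Recording all origins on the merged entry is what later makes an ownership conflict detectable: an entry with both `Managed` and a borrowed origin can be rejected or flagged.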

Source Combination Modes

Use a typed source enum so invalid combinations are unrepresentable:

  • Managed { external }: deployer-managed nodes with optional external overlays.
  • Attached { attach, external }: attached cluster with optional external overlays.
  • ExternalOnly { external }: explicit external-only mode.

Validation rules:

  • Managed requires managed deployment to produce nodes (managed_count > 0).
  • Attached requires managed deployment to produce zero nodes (managed + attached is disallowed).
  • ExternalOnly requires non-empty external and zero managed nodes.
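
A sketch of the enum and its validation rules, assuming hypothetical payload types (`ExternalNode`, `AttachSource`) and a hypothetical `SourceError`. The variant shapes mirror the three modes listed above; `validate` takes the managed node count as an input and checks it against the declared mode.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExternalNode {
    pub endpoint: String,
}

/// Placeholder for a typed k8s/compose attach source (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AttachSource {
    pub selector: String,
}

/// Managed + attached cannot be expressed: no variant carries both.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScenarioSources {
    Managed { external: Vec<ExternalNode> },
    Attached { attach: AttachSource, external: Vec<ExternalNode> },
    ExternalOnly { external: Vec<ExternalNode> },
}

#[derive(Debug, PartialEq, Eq)]
pub enum SourceError {
    ManagedRequiresNodes,
    ManagedNotAllowed { mode: &'static str, managed_count: usize },
    ExternalOnlyRequiresEndpoints,
}

impl ScenarioSources {
    /// Counts validate the declared mode; they never change it.
    pub fn validate(&self, managed_count: usize) -> Result<(), SourceError> {
        match self {
            ScenarioSources::Managed { .. } if managed_count == 0 => {
                Err(SourceError::ManagedRequiresNodes)
            }
            ScenarioSources::Managed { .. } => Ok(()),
            ScenarioSources::Attached { .. } if managed_count > 0 => {
                Err(SourceError::ManagedNotAllowed { mode: "Attached", managed_count })
            }
            ScenarioSources::Attached { .. } => Ok(()),
            ScenarioSources::ExternalOnly { .. } if managed_count > 0 => {
                Err(SourceError::ManagedNotAllowed { mode: "ExternalOnly", managed_count })
            }
            ScenarioSources::ExternalOnly { external } if external.is_empty() => {
                Err(SourceError::ExternalOnlyRequiresEndpoints)
            }
            ScenarioSources::ExternalOnly { .. } => Ok(()),
        }
    }
}
```

The enum rules out the managed + attached combination at the type level, so `validate` only has to cover what the type system cannot: node counts and empty endpoint lists.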

Use a layered module structure so responsibilities stay isolated.

Module Map

testing-framework/core/src/
  domain/
    scenario_spec.rs
    node_handle.rs
    node_inventory.rs
  bootstrap/
    orchestrator.rs
    readiness.rs
    validation.rs
  providers/
    deployer/
      mod.rs
      local.rs
      docker.rs
      k8s.rs
    attach/
      mod.rs
      static.rs
      k8s.rs
      compose.rs
  runtime/
    node_ops.rs
    scenario_runtime.rs
  errors/
    bootstrap.rs
    provider.rs
    validation.rs

Layer Responsibilities

  • domain: source-of-truth types and invariants (ScenarioSpec, NodeHandle, NodeInventory).
  • bootstrap: deployment-time coordination flow, readiness barrier, and preflight checks.
  • providers/deployer: create and control owned nodes.
  • providers/attach: discover existing non-owned nodes.
  • runtime: step-facing operations over NodeInventory.
  • errors: typed errors grouped by layer for explicit failure context.

Guardrails To Keep It Clean

  • Steps/workloads must depend on runtime + domain, never on provider internals.
  • Deployer and AttachProvider are adapters only; orchestration logic belongs in bootstrap/orchestrator.
  • Capability and ownership checks run centrally in bootstrap/validation, not ad hoc in step code.
  • Keep env/config parsing in one place; expose typed config downstream.
  • Keep cleanup ownership-aware: only owned artifacts are mutable/deletable.

Non-Breaking Changes To Start Now

These changes help future external-network support while preserving current public API behavior.

  • Introduce internal NodeHandle + NodeInventory and route existing managed-only flow through them.
  • Add AttachProvider trait internally with default no-op wiring (None), without exposing new required API.
  • Add optional config/spec fields (attach, external, readiness_policy) with safe defaults.
  • Centralize readiness and capability checks behind one internal validation entry point.
  • Add internal node metadata (origin, ownership, capabilities) defaulted to managed semantics.
  • Standardize node identity and dedup helpers (peer_id preferred, endpoint fallback).
  • Keep current env vars/flags intact, but parse via a single typed config layer.
  • Add a single source-orchestration match path (ScenarioSources) inside deployers; unsupported source modes fail fast with typed errors until attach/external registration lands.
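
As one illustration of the no-op wiring, the attach provider can be an optional field that defaults to `None`, so managed-only callers never see it. Everything here is an assumption for illustration: the trait method `discover`, the `NodeEndpoint`, `AttachError`, `StaticAttach`, and `BootstrapConfig` names are hypothetical and not the framework's API.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeEndpoint {
    pub host: String,
    pub port: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AttachError {
    Unreachable(String),
}

/// Adapter that discovers pre-existing nodes; it holds no orchestration logic.
pub trait AttachProvider {
    fn discover(&self) -> Result<Vec<NodeEndpoint>, AttachError>;
}

/// Simplest possible provider: returns a fixed endpoint list.
pub struct StaticAttach {
    pub endpoints: Vec<NodeEndpoint>,
}

impl AttachProvider for StaticAttach {
    fn discover(&self) -> Result<Vec<NodeEndpoint>, AttachError> {
        Ok(self.endpoints.clone())
    }
}

pub struct BootstrapConfig {
    pub managed_count: usize,
    /// `None` keeps today's managed-only behavior.
    pub attach: Option<Box<dyn AttachProvider>>,
}

impl BootstrapConfig {
    /// Existing callers keep using this constructor unchanged.
    pub fn managed_only(managed_count: usize) -> Self {
        Self { managed_count, attach: None }
    }

    /// With no provider configured, discovery is a no-op, not an error.
    pub fn discover_attached(&self) -> Result<Vec<NodeEndpoint>, AttachError> {
        match &self.attach {
            Some(provider) => provider.discover(),
            None => Ok(Vec::new()),
        }
    }
}
```

Because the field is optional and defaulted, adding it does not force any change on existing scenario definitions.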

Open Risks and Required Clarifications

Before full rollout, lock these semantics explicitly:

  • Source enum precedence: typed ScenarioSources variants are the primary control plane. Runtime counts validate, but never redefine, source intent.
  • Ownership conflict resolution: define behavior when a deduped node appears from multiple sources with different ownership (for example, fail-fast by default; optional override if needed).
  • Source-aware readiness: avoid quorum rules that can hide managed deployment failures. Require per-source readiness constraints (for example, minimum managed-ready + global quorum).
  • Readiness rollout: phase 1 uses AllReady; later rollout can add SourceAware constraints once mixed-source behavior is validated.
  • Bootstrap mutation boundary: peer/bootstrap policy mutates managed nodes only unless an attach provider explicitly supports controlled mutation.
  • Compatibility contract expansion: preflight checks should include API/auth/genesis compatibility class, not only network/protocol identifiers.
  • Deterministic membership policy: define strict vs degradable attach behavior so partial discovery does not silently change scenario semantics.
  • Step migration boundary: after NodeInventory handoff, scenario steps must not read deployer-specific state directly.
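
The source-aware readiness concern can be made concrete with a small policy enum. The `AllReady`, `Quorum`, and `SourceAware` names come from the text above; the field names (`min_ready`, `min_managed_ready`, `min_total_ready`), the `NodeStatus` type, and the `is_satisfied` method are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Origin {
    Managed,
    Attached,
    External,
}

#[derive(Debug, Clone, Copy)]
pub struct NodeStatus {
    pub origin: Origin,
    pub ready: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadinessPolicy {
    /// Phase 1 default: every known node must be ready.
    AllReady,
    /// At least `min_ready` nodes, regardless of source.
    Quorum { min_ready: usize },
    /// Global quorum plus a floor on managed nodes.
    SourceAware { min_managed_ready: usize, min_total_ready: usize },
}

impl ReadinessPolicy {
    pub fn is_satisfied(&self, nodes: &[NodeStatus]) -> bool {
        let total_ready = nodes.iter().filter(|n| n.ready).count();
        let managed_ready = nodes
            .iter()
            .filter(|n| n.ready && n.origin == Origin::Managed)
            .count();
        match *self {
            ReadinessPolicy::AllReady => total_ready == nodes.len(),
            ReadinessPolicy::Quorum { min_ready } => total_ready >= min_ready,
            ReadinessPolicy::SourceAware { min_managed_ready, min_total_ready } => {
                managed_ready >= min_managed_ready && total_ready >= min_total_ready
            }
        }
    }
}
```

With one unready managed node and three ready attached nodes, `Quorum { min_ready: 3 }` passes even though the managed deployment failed, while `SourceAware { min_managed_ready: 1, min_total_ready: 3 }` and `AllReady` both reject it.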