diff --git a/docs/external-network-architecture.md b/docs/external-network-architecture.md
new file mode 100644
index 0000000..b1e9351
--- /dev/null
+++ b/docs/external-network-architecture.md
@@ -0,0 +1,222 @@
+# External Network Integration Architecture (High-Level)
+
+## Purpose
+
+Extend the current testing framework without breaking existing scenarios:
+
+- Keep existing managed deployer flow.
+- Add optional support for attaching to existing clusters.
+- Add optional support for explicit external nodes.
+- Unify all nodes behind one runtime inventory and capability model.
+
+## Architecture Diagram
+
+```mermaid
+flowchart TD
+    A[ScenarioSpec]
+    BO[Bootstrap Orchestrator]
+
+    A --> B[Managed Nodes Spec\ncount/config/patches]
+    A --> C[Attach Spec\nprovider + selector]
+    A --> D[External Nodes Spec\nstatic endpoints]
+
+    B --> E[Deployer\nlocal/docker/k8s]
+    C --> F[AttachProvider\nstatic/k8s/compose/...]
+    D --> BO
+
+    E --> G[Managed Node Handles\norigin=Managed, ownership=Owned]
+    F --> H[Attached Node Handles\norigin=Attached, ownership=Borrowed]
+    D --> I[External Node Handles\norigin=External, ownership=Borrowed]
+
+    G --> BO
+    H --> BO
+    I --> BO
+
+    BO --> J[NodeInventory]
+    BO --> BR[Readiness Barrier]
+    BR --> J
+
+    J --> K[Scenario Validator\ncapability + ownership checks]
+    K --> L[Scenario Execution\nsteps/workloads/assertions]
+
+    J --> M[NodeHandle API]
+    M --> N[Query / Tx Submit]
+    M --> O[Lifecycle Ops\nstart/stop/restart]
+    M --> P[Config Patch]
+
+    Q[Capabilities]
+    Q --> N
+    Q --> O
+    Q --> P
+
+    R[Ownership Policy]
+    R --> O
+    R --> P
+    R --> S[Cleanup Controller\nOwned only]
+
+    T[Observability]
+    T --> U[Inventory table at start]
+    T --> V[Progress + retry logs]
+    T --> W[Per-node diagnostics]
+```
+
+## Component Responsibilities
+
+- `ScenarioSpec`: declares managed, attached, and external node sources.
+- `Deployer`: provisions nodes owned by the framework.
+- `AttachProvider`: discovers pre-existing nodes from an external system.
+- `External Nodes Spec`: explicit static endpoints for already-running nodes.
+- `NodeInventory`: single runtime list of all nodes used by scenario steps.
+- `NodeHandle`: unified node interface with origin, ownership, capabilities, and client.
+- `Bootstrap Orchestrator`: coordinates provisioning, discovery, peer/bootstrap policy, and readiness.
+- `Scenario Validator`: rejects unsupported operations before execution.
+- `Cleanup Controller`: tears down only owned resources.
+
+## Bootstrap Control Flow (Coordinator Responsibility)
+
+`Bootstrap Orchestrator` owns deployment-time coordination:
+
+1. Resolve `ScenarioSpec` inputs (`managed`, `attach`, `external`).
+2. Ask `Deployer` to provision/start managed nodes.
+3. Ask `AttachProvider` to discover attached nodes.
+4. Normalize all outputs into `NodeHandle`s.
+5. Merge into `NodeInventory` with stable IDs and dedup.
+6. Apply bootstrap policy (seeds/peers/network join strategy).
+7. Wait on readiness barrier (required nodes or quorum).
+8. Run preflight validation (capability + ownership constraints).
+9. Hand off to scenario execution.
+
+### Bootstrap Flow Diagram
+
+```mermaid
+sequenceDiagram
+    participant SS as ScenarioSpec
+    participant BO as Bootstrap Orchestrator
+    participant D as Deployer
+    participant AP as AttachProvider
+    participant NI as NodeInventory
+    participant SV as Scenario Validator
+    participant SE as Scenario Execution
+
+    SS->>BO: Build request (managed/attach/external)
+    BO->>D: Provision/start managed nodes
+    D-->>BO: Managed node handles
+    BO->>AP: Discover attached cluster nodes
+    AP-->>BO: Attached node handles
+    BO->>BO: Normalize + dedup + apply bootstrap policy
+    BO->>NI: Construct unified inventory
+    BO->>BO: Readiness barrier (all/quorum policy)
+    BO->>SV: Validate capabilities + ownership
+    SV-->>BO: OK / typed error
+    BO->>SE: Start scenario runtime
+```
+
+## Key Semantics
+
+- Backward-compatible by default: managed-only scenarios work unchanged.
+- `managed_count = 0` is valid for external-only or attach-only scenarios.
+- Lifecycle and config patch operations are gated by capability + ownership.
+- Steps operate on `NodeInventory`, not on deployer-specific logic.
+
+## Ownership and Capability Model
+
+- `Owned` nodes: may allow lifecycle and patch operations; included in cleanup.
+- `Borrowed` nodes: default read-only lifecycle policy (query/submit only unless explicitly enabled).
+- Capability checks happen before action execution and return typed, contextual errors.
+
+## Manual Cluster Compatibility
+
+Manual cluster mode maps naturally to the same model:
+
+- If manual cluster starts processes itself: treat nodes as `Managed` + `Owned`.
+- If manual cluster connects to existing nodes: treat nodes as `Attached/External` + `Borrowed`.
+
+This keeps scenario logic reusable while preserving explicit safety boundaries.
+
+## Critical Design Decisions To Lock Early
+
+- **Identity/dedup rule**: define canonical node identity (peer id > endpoint) to prevent duplicate handles.
+- **Bootstrap policy**: define how peers are selected across mixed sources (managed/attached/external).
+- **Readiness semantics**: require all nodes, subset, or quorum; and per-step override rules.
+- **Safety boundaries**: default deny lifecycle/patch operations for borrowed nodes.
+- **Compatibility checks**: fail fast on incompatible network/genesis/protocol versions.
+- **Failure policy**: decide when attach/discovery failures are fatal vs degradable.
+
+## Recommended Default Policies
+
+- **Node identity**: use `peer_id` as canonical key; fallback to `(host, port)` only when peer id is unavailable.
+- **Dedup merge**: if same canonical identity appears from multiple sources, keep one handle and record all origins for diagnostics.
+- **Bootstrap peers**: every managed node gets at least 2 seed peers from distinct origins when possible.
+- **Readiness gate**: default to quorum (`>= 2` or `>= 50%`, whichever is greater); allow strict-all via scenario override.
+- **Borrowed node safety**: lifecycle and config patch disabled by default for borrowed nodes; explicit opt-in required.
+- **Compatibility preflight**: enforce matching chain/network id + protocol version before scenario start.
+- **Failure handling**:
+  - managed provisioning failure: fatal
+  - attach discovery empty result: fatal if attach requested
+  - partial attach discovery: warn + continue only if readiness quorum still satisfiable
+- **Cleanup**: delete owned artifacts only; never mutate or delete borrowed node resources.
+
+## Clean Codebase Layout (Recommended)
+
+Use a layered module structure so responsibilities stay isolated.
+
+### Module Map
+
+```text
+testing-framework/core/src/
+  domain/
+    scenario_spec.rs
+    node_handle.rs
+    node_inventory.rs
+  bootstrap/
+    orchestrator.rs
+    readiness.rs
+    validation.rs
+  providers/
+    deployer/
+      mod.rs
+      local.rs
+      docker.rs
+      k8s.rs
+    attach/
+      mod.rs
+      static.rs
+      k8s.rs
+      compose.rs
+  runtime/
+    node_ops.rs
+    scenario_runtime.rs
+  errors/
+    bootstrap.rs
+    provider.rs
+    validation.rs
+```
+
+### Layer Responsibilities
+
+- `domain`: source-of-truth types and invariants (`ScenarioSpec`, `NodeHandle`, `NodeInventory`).
+- `bootstrap`: deployment-time coordination flow, readiness barrier, and preflight checks.
+- `providers/deployer`: create and control owned nodes.
+- `providers/attach`: discover existing non-owned nodes.
+- `runtime`: step-facing operations over `NodeInventory`.
+- `errors`: typed errors grouped by layer for explicit failure context.
+
+### Guardrails To Keep It Clean
+
+- Steps/workloads must depend on `runtime` + `domain`, never on provider internals.
+- `Deployer` and `AttachProvider` are adapters only; orchestration logic belongs in `bootstrap/orchestrator`.
+- Capability and ownership checks run centrally in bootstrap/validation, not ad hoc in step code.
+- Keep env/config parsing in one place; expose typed config downstream.
+- Keep cleanup ownership-aware: only owned artifacts are mutable/deletable.
+
+## Non-Breaking Changes To Start Now
+
+These changes help future external-network support while preserving current public API behavior.
+
+- Introduce internal `NodeHandle` + `NodeInventory` and route existing managed-only flow through them.
+- Add `AttachProvider` trait internally with default no-op wiring (`None`), without exposing new required API.
+- Add optional config/spec fields (`attach`, `external`, `readiness_policy`) with safe defaults.
+- Centralize readiness and capability checks behind one internal validation entry point.
+- Add internal node metadata (`origin`, `ownership`, `capabilities`) defaulted to managed semantics.
+- Standardize node identity and dedup helpers (`peer_id` preferred, endpoint fallback).
+- Keep current env vars/flags intact, but parse via a single typed config layer.
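
The first non-breaking steps listed above (an internal `NodeHandle` and `NodeInventory`, node metadata, and identity/dedup helpers) could look roughly like the Rust sketch below. It is only an illustration under assumptions: the type names `NodeHandle`, `NodeInventory`, the origin and ownership values, and the `peer_id`-then-endpoint identity rule come from this document, while every field, method, `NodeKey`, `OpError`, and the `lifecycle_opt_in` flag are hypothetical and not part of any existing API.

```rust
use std::collections::HashMap;

/// Source a node came from (variant names follow the architecture diagram).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Origin {
    Managed,
    Attached,
    External,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ownership {
    Owned,
    Borrowed,
}

/// Hypothetical canonical identity: peer id preferred, endpoint as fallback.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum NodeKey {
    PeerId(String),
    Endpoint(String, u16),
}

/// Hypothetical typed error for a rejected lifecycle operation.
#[derive(Debug, PartialEq, Eq)]
pub enum OpError {
    BorrowedNode(NodeKey),
}

/// Field layout is an assumption made for this sketch.
#[derive(Debug, Clone)]
pub struct NodeHandle {
    pub peer_id: Option<String>,
    pub host: String,
    pub port: u16,
    pub ownership: Ownership,
    pub origins: Vec<Origin>,
    pub lifecycle_opt_in: bool,
}

impl NodeHandle {
    /// Managed nodes are Owned; attached and external nodes are Borrowed.
    pub fn new(peer_id: Option<&str>, host: &str, port: u16, origin: Origin) -> Self {
        let ownership = match origin {
            Origin::Managed => Ownership::Owned,
            Origin::Attached | Origin::External => Ownership::Borrowed,
        };
        Self {
            peer_id: peer_id.map(str::to_owned),
            host: host.to_owned(),
            port,
            ownership,
            origins: vec![origin],
            lifecycle_opt_in: false,
        }
    }

    /// Use the peer id when known, otherwise fall back to (host, port).
    pub fn key(&self) -> NodeKey {
        match &self.peer_id {
            Some(id) => NodeKey::PeerId(id.clone()),
            None => NodeKey::Endpoint(self.host.clone(), self.port),
        }
    }

    /// Default-deny lifecycle operations on borrowed nodes unless opted in.
    pub fn check_lifecycle(&self) -> Result<(), OpError> {
        if self.ownership == Ownership::Owned || self.lifecycle_opt_in {
            Ok(())
        } else {
            Err(OpError::BorrowedNode(self.key()))
        }
    }
}

#[derive(Debug, Default)]
pub struct NodeInventory {
    nodes: Vec<NodeHandle>,
    index: HashMap<NodeKey, usize>,
}

impl NodeInventory {
    /// Keep one handle per canonical key and record every origin that reported it.
    pub fn insert(&mut self, handle: NodeHandle) {
        let key = handle.key();
        if let Some(&i) = self.index.get(&key) {
            for origin in handle.origins {
                if !self.nodes[i].origins.contains(&origin) {
                    self.nodes[i].origins.push(origin);
                }
            }
            return;
        }
        self.index.insert(key, self.nodes.len());
        self.nodes.push(handle);
    }

    /// Handles in first-seen order.
    pub fn handles(&self) -> &[NodeHandle] {
        &self.nodes
    }
}
```

In this sketch, inserting the same `peer_id` from a managed and an attached source yields a single handle whose `origins` holds both values, and the first handle seen decides ownership. One limitation: a node reported once with a peer id and once with only its endpoint is not merged, so a real implementation would need a resolution step for that case.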