Add testnet image build flow and runner docs

This commit is contained in:
andrussal 2025-11-26 12:15:07 +01:00
parent e04af1441a
commit 92e855741a
46 changed files with 3543 additions and 63 deletions

book/book.toml Normal file

@@ -0,0 +1,13 @@
[book]
authors = ["Nomos Testing"]
language = "en"
multilingual = false
src = "src"
title = "Nomos Testing Book"
[build]
# Keep book output in target/ to avoid polluting the workspace root.
build-dir = "../target/book"
[output.html]
default-theme = "light"

book/combined.md Normal file

@@ -0,0 +1,549 @@
# Nomos Testing Framework — Combined Reference
## Project Context Primer
This book focuses on the Nomos Testing Framework. It assumes familiarity with
the Nomos architecture, but for completeness, here is a short primer.
- **Nomos** is a modular blockchain protocol composed of validators, executors,
and a data-availability (DA) subsystem.
- **Validators** participate in consensus and produce blocks.
- **Executors** run application logic or off-chain computations referenced by
blocks.
- **Data Availability (DA)** ensures that data referenced in blocks is
published and retrievable, including blobs or channel data used by workloads.
These roles interact tightly, which is why meaningful testing must be performed
in multi-node environments that include real networking, timing, and DA
interaction.
## What You Will Learn
This book gives you a clear mental model for Nomos multi-node testing, shows how
to author scenarios that pair realistic workloads with explicit expectations,
and guides you to run them across local, containerized, and cluster environments
without changing the plan.
## Part I — Foundations
### Introduction
The Nomos Testing Framework is a purpose-built toolkit for exercising Nomos in
realistic, multi-node environments. It solves the gap between small, isolated
tests and full-system validation by letting teams describe a cluster layout,
drive meaningful traffic, and assert the outcomes in one coherent plan.
It is for protocol engineers, infrastructure operators, and QA teams who need
repeatable confidence that validators, executors, and data-availability
components work together under network and timing constraints.
Multi-node integration testing is required because many Nomos behaviors—block
progress, data availability, liveness under churn—only emerge when several
roles interact over real networking and time. This framework makes those checks
declarative, observable, and portable across environments.
### Architecture Overview
The framework follows a clear flow: **Topology → Scenario → Runner → Workloads → Expectations**.
- **Topology** describes the cluster: how many nodes, their roles, and the high-level network and data-availability parameters they should follow.
- **Scenario** combines that topology with the activities to run and the checks to perform, forming a single plan.
- **Deployer/Runner** pair turns the plan into a live environment on the chosen backend (local processes, Docker Compose, or Kubernetes) and brokers readiness.
- **Workloads** generate traffic and conditions that exercise the system.
- **Expectations** observe the run and judge success or failure once activity completes.
Conceptual diagram:
```
Topology  →  Scenario  →  Runner          →  Workloads  →  Expectations
(shape       (plan)       (deploy            (drive         (verify
 cluster)                 & orchestrate)     traffic)       outcomes)
```
Mermaid view:
```mermaid
flowchart LR
A(Topology<br/>shape cluster) --> B(Scenario<br/>plan)
B --> C(Deployer/Runner<br/>deploy & orchestrate)
C --> D(Workloads<br/>drive traffic)
D --> E(Expectations<br/>verify outcomes)
```
Each layer has a narrow responsibility so that cluster shape, deployment choice, traffic generation, and health checks can evolve independently while fitting together predictably.
### Testing Philosophy
- **Declarative over imperative**: describe the desired cluster shape, traffic, and success criteria; let the framework orchestrate the run.
- **Observable health signals**: prefer liveness and inclusion signals that reflect real user impact instead of internal debug state.
- **Determinism first**: default scenarios aim for repeatable outcomes with fixed topologies and traffic rates; variability is opt-in.
- **Targeted non-determinism**: introduce randomness (e.g., restarts) only when probing resilience or operational robustness.
- **Protocol time, not wall time**: reason in blocks and protocol-driven intervals to reduce dependence on host speed or scheduler noise.
- **Minimum run window**: always allow enough block production to make assertions meaningful; very short runs risk false confidence.
- **Use chaos with intent**: chaos workloads are for recovery and fault-tolerance validation, not for baseline functional checks.
### Scenario Lifecycle (Conceptual)
1. **Build the plan**: Declare a topology, attach workloads and expectations, and set the run window. The plan is the single source of truth for what will happen.
2. **Deploy**: Hand the plan to a runner. It provisions the environment on the chosen backend and waits for nodes to signal readiness.
3. **Drive workloads**: Start traffic and behaviors (transactions, data-availability activity, restarts) for the planned duration.
4. **Observe blocks and signals**: Track block progression and other high-level metrics during or after the run window to ground assertions in protocol time.
5. **Evaluate expectations**: Once activity stops (and optional cooldown completes), check liveness and workload-specific outcomes to decide pass or fail.
6. **Cleanup**: Tear down resources so successive runs start fresh and do not inherit leaked state.
Conceptual lifecycle diagram:
```
Plan → Deploy → Readiness → Drive Workloads → Observe → Evaluate → Cleanup
```
Mermaid view:
```mermaid
flowchart LR
P[Plan<br/>topology + workloads + expectations] --> D[Deploy<br/>runner provisions]
D --> R[Readiness<br/>wait for nodes]
R --> W[Drive Workloads]
W --> O[Observe<br/>blocks/metrics]
O --> E[Evaluate Expectations]
E --> C[Cleanup]
```
### Design Rationale
- **Modular crates** keep configuration, orchestration, workloads, and runners decoupled so each can evolve without breaking the others.
- **Pluggable runners** let the same scenario run on a laptop, a Docker host, or a Kubernetes cluster, making validation portable across environments.
- **Separated workloads and expectations** clarify intent: what traffic to generate versus how to judge success. This simplifies review and reuse.
- **Declarative topology** makes cluster shape explicit and repeatable, reducing surprise when moving between CI and developer machines.
- **Maintainability through predictability**: a clear flow from plan to deployment to verification lowers the cost of extending the framework and interpreting failures.
## Part II — User Guide
### Workspace Layout
The workspace focuses on multi-node integration testing and sits alongside a `nomos-node` checkout. Its crates separate concerns to keep scenarios repeatable and portable:
- **Configs**: prepares high-level node, network, tracing, and wallet settings used across test environments.
- **Core scenario orchestration**: the engine that holds topology descriptions, scenario plans, runtimes, workloads, and expectations.
- **Workflows**: ready-made workloads (transactions, data-availability, chaos) and reusable expectations assembled into a user-facing DSL.
- **Runners**: deployment backends for local processes, Docker Compose, and Kubernetes, all consuming the same scenario plan.
- **Test workflows**: example scenarios and integration checks that show how the pieces fit together.
This split keeps configuration, orchestration, reusable traffic patterns, and deployment adapters loosely coupled while sharing one mental model for tests.
### Annotated Tree
High-level view of the workspace and how pieces relate:
```
nomos-testing/
├─ testing-framework/
│ ├─ configs/ # shared configuration helpers
│ ├─ core/ # scenario model, runtime, topology
│ ├─ workflows/ # workloads, expectations, DSL extensions
│ └─ runners/ # local, compose, k8s deployment backends
├─ tests/ # integration scenarios using the framework
└─ scripts/ # supporting setup utilities (e.g., assets)
```
Each area maps to a responsibility: describe configs, orchestrate scenarios, package common traffic and assertions, adapt to environments, and demonstrate end-to-end usage.
### Authoring Scenarios
Creating a scenario is a declarative exercise:
1. **Shape the topology**: decide how many validators and executors to run, and what high-level network and data-availability characteristics matter for the test.
2. **Attach workloads**: pick traffic generators that align with your goals (transactions, data-availability blobs, or chaos for resilience probes).
3. **Define expectations**: specify the health signals that must hold when the run finishes (e.g., consensus liveness, inclusion of submitted activity; see [Core Content: Workloads & Expectations](workloads.md)).
4. **Set duration**: choose a run window long enough to observe meaningful block progression and the effects of your workloads.
5. **Choose a runner**: target local processes for fast iteration, Docker Compose for reproducible multi-node stacks, or Kubernetes for cluster-grade validation. For environment considerations, see [Operations](operations.md).
Keep scenarios small and explicit: make the intended behavior and the success criteria clear so failures are easy to interpret and act upon.
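As a concrete sketch of these five steps, here is what a plan might look like in the fluent builder style. Every method name below (`ScenarioBuilder`, `with_validators`, `with_transactions_per_block`, `expect_consensus_liveness`, `with_run_duration`) is an illustrative assumption, not the framework's confirmed API; the real helpers live in the workflows DSL.
```rust
use std::time::Duration;

// Hypothetical builder API for illustration only; names and signatures
// may differ from the actual workflows DSL.
fn smoke_plan() -> Scenario {
    ScenarioBuilder::new()
        .with_validators(2)                          // 1. shape the topology
        .with_transactions_per_block(4)              // 2. attach a workload
        .expect_consensus_liveness()                 // 3. define expectations
        .with_run_duration(Duration::from_secs(120)) // 4. set the run window
        .build()                                     // 5. hand off to a runner
}
```
The resulting plan stays runner-agnostic, which is what lets the same scenario move between local, compose, and k8s backends.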
### Core Content: Workloads & Expectations
Workloads describe the activity a scenario generates; expectations describe the signals that must hold when that activity completes. Both are pluggable so scenarios stay readable and purpose-driven.
#### Workloads
- **Transaction workload**: submits user-level transactions at a configurable rate and can limit how many distinct actors participate.
- **Data-availability workload**: drives blob and channel activity to exercise data-availability paths.
- **Chaos workload**: triggers controlled node restarts to test resilience and recovery behaviors (requires a runner that can control nodes).
#### Expectations
- **Consensus liveness**: verifies the system continues to produce blocks in line with the planned workload and timing window.
- **Workload-specific checks**: each workload can attach its own success criteria (e.g., inclusion of submitted activity) so scenarios remain concise.
Together, workloads and expectations let you express both the pressure applied to the system and the definition of “healthy” for that run.
Workload pipeline (conceptual):
```
Inputs (topology + wallets + rates)
                ↓
Workload init → Drive traffic → Collect signals
                ↓
        Expectations evaluate
```
Mermaid view:
```mermaid
flowchart TD
I["Inputs<br/>(topology + wallets + rates)"] --> Init[Workload init]
Init --> Drive[Drive traffic]
Drive --> Collect[Collect signals]
Collect --> Eval[Expectations evaluate]
```
### Core Content: ScenarioBuilderExt Patterns
Patterns that keep scenarios readable and reusable:
- **Topology-first**: start by shaping the cluster (counts, layout) so later steps inherit a clear foundation.
- **Bundle defaults**: use the DSL helpers to attach common expectations (like liveness) whenever you add a matching workload, reducing forgotten checks.
- **Intentional rates**: express traffic in per-block terms to align with protocol timing rather than wall-clock assumptions.
- **Opt-in chaos**: enable restart patterns only in scenarios meant to probe resilience; keep functional smoke tests deterministic.
- **Wallet clarity**: seed only the number of actors you need; it keeps transaction scenarios deterministic and interpretable.
These patterns make scenario definitions self-explanatory while staying aligned with the framework's block-oriented timing model.
### Best Practices
- **State your intent**: document the goal of each scenario (throughput, DA validation, resilience) so expectation choices are obvious.
- **Keep runs meaningful**: choose durations that allow multiple blocks and make timing-based assertions trustworthy.
- **Separate concerns**: start with deterministic workloads for functional checks; add chaos in dedicated resilience scenarios to avoid noisy failures.
- **Reuse patterns**: standardize on shared topology and workload presets so results are comparable across environments and teams.
- **Observe first, tune second**: rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology.
- **Environment fit**: pick runners that match the feedback loop you need—local for speed, compose for reproducible stacks, k8s for cluster-grade fidelity.
- **Minimal surprises**: seed only necessary wallets and keep configuration deltas explicit when moving between CI and developer machines.
### Examples
Concrete scenario shapes that illustrate how to combine topologies, workloads, and expectations. Adjust counts, rates, and durations to fit your environment.
#### Simple 2-validator transaction workload
- **Topology**: two validators.
- **Workload**: transaction submissions at a modest per-block rate with a small set of wallet actors.
- **Expectations**: consensus liveness and inclusion of submitted activity.
- **When to use**: smoke tests for consensus and transaction flow on minimal hardware.
#### DA + transaction workload
- **Topology**: validators plus executors if available.
- **Workloads**: data-availability blobs/channels and transactions running together to stress both paths.
- **Expectations**: consensus liveness and workload-level inclusion/availability checks.
- **When to use**: end-to-end coverage of transaction and DA layers in one run.
#### Chaos + liveness check
- **Topology**: validators (optionally executors) with node control enabled.
- **Workloads**: baseline traffic (transactions or DA) plus chaos restarts on selected roles.
- **Expectations**: consensus liveness to confirm the system keeps progressing despite restarts; workload-specific inclusion if traffic is present.
- **When to use**: resilience validation and operational readiness drills.
### Advanced & Artificial Examples
These illustrative scenarios stretch the framework to show how to build new workloads, expectations, deployers, and topology tricks. They are intentionally “synthetic” to teach capabilities rather than prescribe production tests.
#### Synthetic Delay Workload (Network Latency Simulation)
- **Idea**: inject fake latency between node interactions using internal timers, not OS-level tooling.
- **Demonstrates**: sequencing control inside a workload, verifying protocol progression under induced lag, using timers to pace submissions.
- **Shape**: wrap submissions in delays that mimic slow peers; ensure the expectation checks blocks still progress.
#### Oscillating Load Workload (Traffic Waves)
- **Idea**: traffic rate changes every block or N seconds (e.g., blocks 1-3 low, 4-5 high, 6-7 zero, repeat).
- **Demonstrates**: dynamic, stateful workloads that use `RunMetrics` to time phases; modeling real-world burstiness.
- **Shape**: schedule per-phase rates; confirm inclusion/liveness across peaks and troughs.
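A minimal sketch of this oscillating pattern, reusing the `Workload` trait shape from the Rust example later in this book; the phase schedule, the `first()` accessor, and the `health_check` stand-in for a real submission are assumptions:
```rust
use std::time::Duration;

use async_trait::async_trait;
use testing_framework_core::scenario::{
    DynError, Expectation, RunContext, RunMetrics, Workload,
};
use testing_framework_core::topology::GeneratedTopology;

/// Traffic waves: each phase pairs a length with a submission count.
pub struct OscillatingLoad {
    phases: Vec<(Duration, u32)>,
}

#[async_trait]
impl Workload for OscillatingLoad {
    fn name(&self) -> &'static str {
        "oscillating_load"
    }

    fn expectations(&self) -> Vec<Box<dyn Expectation>> {
        Vec::new() // pair with a liveness expectation in the scenario
    }

    fn init(
        &mut self,
        _topology: &GeneratedTopology,
        _metrics: &RunMetrics,
    ) -> Result<(), DynError> {
        Ok(())
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .first()
            .ok_or("no validator client")?;
        // Assumption: the runtime stops workloads when the run window
        // closes, so cycling through the schedule forever is safe.
        loop {
            for (window, rate) in &self.phases {
                for _ in 0..*rate {
                    // Stand-in action; a real workload would submit a
                    // transaction or blob here.
                    if let Err(e) = client.health_check().await {
                        return Err(e.into());
                    }
                }
                tokio::time::sleep(*window).await;
            }
        }
    }
}
```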
#### Byzantine Behavior Mock
- **Idea**: a workload that drops half its planned submissions, sometimes double-submits, and intentionally triggers expectation failures.
- **Demonstrates**: negative testing, resilience checks, and the value of clear expectations when behavior is adversarial by design.
- **Shape**: parameterize drop/double-submit probabilities; pair with an expectation that documents what “bad” looks like.
#### Custom Expectation: Block Finality Drift
- **Idea**: assert the last few blocks differ and block time stays within a tolerated drift budget.
- **Demonstrates**: consuming `BlockFeed` or time-series metrics to validate protocol cadence; crafting post-run assertions around block diversity and timing.
- **Shape**: collect recent blocks, confirm no duplicates, and compare observed intervals to a drift threshold.
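A sketch of that expectation; `ObservedBlock` and the `recent_blocks` helper are assumptions standing in for whatever the block feed actually exposes:
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{DynError, Expectation, RunContext};

/// Assumed shape of one observation taken from the block feed.
pub struct ObservedBlock {
    pub hash: [u8; 32],
    pub secs: f64, // observation time in seconds
}

// Stub for illustration: a real expectation would read these from the
// runtime's block feed.
fn recent_blocks(_ctx: &RunContext) -> Vec<ObservedBlock> {
    Vec::new()
}

pub struct FinalityDrift {
    pub max_interval_secs: f64,
}

#[async_trait]
impl Expectation for FinalityDrift {
    fn name(&self) -> &str {
        "block_finality_drift"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        let blocks = recent_blocks(ctx);
        for pair in blocks.windows(2) {
            if pair[0].hash == pair[1].hash {
                return Err("duplicate block in recent window".into());
            }
            let interval = pair[1].secs - pair[0].secs;
            if interval > self.max_interval_secs {
                return Err(format!(
                    "block interval {interval:.1}s exceeds drift budget"
                )
                .into());
            }
        }
        Ok(())
    }
}
```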
#### Custom Deployer: Dry-Run Deployer
- **Idea**: a deployer that never starts nodes; it emits configs, simulates readiness, and provides fake blockfeed/metrics.
- **Demonstrates**: full power of the deployer interface for CI dry-runs, config verification, and ultra-fast feedback without Nomos binaries.
- **Shape**: produce logs/artifacts, stub readiness, and feed synthetic blocks so expectations can still run.
#### Stochastic Topology Generator
- **Idea**: topology parameters change at runtime (random validators, DA settings, network shapes).
- **Demonstrates**: randomized property testing and fuzzing approaches to topology building.
- **Shape**: pick roles and network layouts randomly per run; keep expectations tolerant to variability while still asserting core liveness.
#### Multi-Phase Scenario (“Pipelines”)
- **Idea**: scenario runs in phases (e.g., phase 1 transactions, phase 2 DA, phase 3 restarts, phase 4 sync check).
- **Demonstrates**: multi-stage tests, modular scenario assembly, and deliberate lifecycle control.
- **Shape**: drive phase-specific workloads/expectations sequentially; enforce clear boundaries and post-phase checks.
### Running Scenarios
Running a scenario follows the same conceptual flow regardless of environment:
1. Select or author a scenario plan that pairs a topology with workloads, expectations, and a suitable run window.
2. Choose a runner aligned with your environment (local, compose, or k8s) and ensure its prerequisites are available.
3. Deploy the plan through the runner; wait for readiness signals before starting workloads.
4. Let workloads drive activity for the planned duration; keep observability signals visible so you can correlate outcomes.
5. Evaluate expectations and capture results as the primary pass/fail signal.
Use the same plan across different runners to compare behavior between local development and CI or cluster settings. For environment prerequisites and flags, see [Operations](operations.md).
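In code, the whole flow reduces to a couple of lines; the runner type and its `run` entry point below are hypothetical stand-ins for whichever backend you pick:
```rust
// Hypothetical runner API; the real backends live under
// testing-framework/runners.
async fn run_smoke() -> Result<(), DynError> {
    let plan = smoke_plan(); // the same plan works on local, compose, or k8s
    // deploy → readiness → drive workloads → evaluate → cleanup
    LocalRunner::default().run(plan).await
}
```
Swapping `LocalRunner` for a compose or k8s runner changes only that one line, not the plan.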
### Runners
Runners turn a scenario plan into a live environment while keeping the plan unchanged. Choose based on feedback speed, reproducibility, and fidelity. For environment and operational considerations, see [Operations](operations.md).
#### Local runner
- Launches node processes directly on the host.
- Fastest feedback loop and minimal orchestration overhead.
- Best for development-time iteration and debugging.
#### Docker Compose runner
- Starts nodes in containers to provide a reproducible multi-node stack on a single machine.
- Discovers service ports and wires observability for convenient inspection.
- Good balance between fidelity and ease of setup.
#### Kubernetes runner
- Deploys nodes onto a cluster for higher-fidelity, longer-running scenarios.
- Suits CI or shared environments where cluster behavior and scheduling matter.
#### Common expectations
- All runners require at least one validator and, for transaction scenarios, access to seeded wallets.
- Readiness probes gate workload start so traffic begins only after nodes are reachable.
- Environment flags can relax timeouts or increase tracing when diagnostics are needed.
Runner abstraction:
```
Scenario Plan
      ↓
Runner (local | compose | k8s)
      │ provisions env + readiness
      ↓
Runtime + Observability
      ↓
Workloads / Expectations execute
```
Mermaid view:
```mermaid
flowchart TD
Plan[Scenario Plan] --> RunSel{"Runner<br/>(local | compose | k8s)"}
RunSel --> Provision[Provision & readiness]
Provision --> Runtime[Runtime + observability]
Runtime --> Exec[Workloads & Expectations execute]
```
### Operations
Operational readiness focuses on prerequisites, environment fit, and clear signals:
- **Prerequisites**: keep a sibling `nomos-node` checkout available; ensure the chosen runner's platform needs are met (local binaries for host runs, Docker for compose, cluster access for k8s).
- **Artifacts**: some scenarios depend on prover or circuit assets; fetch them ahead of time with the provided helper scripts when needed.
- **Environment flags**: use slow-environment toggles to relax timeouts, enable tracing when debugging, and adjust observability ports to avoid clashes.
- **Readiness checks**: verify runners report node readiness before starting workloads; this avoids false negatives from starting too early.
- **Failure triage**: map failures to missing prerequisites (wallet seeding, node control availability), runner platform issues, or unmet expectations. Start with liveness signals, then dive into workload-specific assertions.
Treat operational hygiene—assets present, prerequisites satisfied, observability reachable—as the first step to reliable scenario outcomes.
Metrics and observability flow:
```
Runner exposes endpoints/ports
              ↓
Runtime collects block/health signals
              ↓
Expectations consume signals to decide pass/fail
              ↓
Operators inspect logs/metrics when failures arise
```
Mermaid view:
```mermaid
flowchart TD
Expose[Runner exposes endpoints/ports] --> Collect[Runtime collects block/health signals]
Collect --> Consume[Expectations consume signals<br/>decide pass/fail]
Consume --> Inspect[Operators inspect logs/metrics<br/>when failures arise]
```
## Part III — Developer Reference
### Scenario Model (Developer Level)
The scenario model defines clear, composable responsibilities:
- **Topology**: a declarative description of the cluster—how many nodes, their roles, and the broad network and data-availability characteristics. It represents the intended shape of the system under test.
- **Scenario**: a plan combining topology, workloads, expectations, and a run window. Building a scenario validates prerequisites (like seeded wallets) and ensures the run lasts long enough to observe meaningful block progression.
- **Workloads**: asynchronous tasks that generate traffic or conditions. They use shared context to interact with the deployed cluster and may bundle default expectations.
- **Expectations**: post-run assertions. They can capture baselines before workloads start and evaluate success once activity stops.
- **Runtime**: coordinates workloads and expectations for the configured duration, enforces cooldowns when control actions occur, and ensures cleanup so runs do not leak resources.
Developers extending the model should keep these boundaries strict: topology describes, scenarios assemble, runners deploy, workloads drive, and expectations judge outcomes. For guidance on adding new capabilities, see [Extending the Framework](extending.md).
### Extending the Framework
#### Adding a workload
1) Implement the workload contract: provide a name, optional bundled expectations, validate prerequisites up front, and drive asynchronous activity against the deployed cluster.
2) Export it through the workflows layer and consider adding DSL helpers for ergonomic wiring.
#### Adding an expectation
1) Implement the expectation contract: capture baselines if needed and evaluate outcomes after workloads finish; report meaningful errors to aid debugging.
2) Expose reusable expectations from the workflows layer so scenarios can attach them declaratively.
#### Adding a runner
1) Implement the deployer contract for the target backend, producing a runtime context with client access, metrics endpoints, and optional node control.
2) Preserve cleanup guarantees so resources are reclaimed even when runs fail; mirror readiness and observation signals used by existing runners for consistency.
#### Adding topology helpers
Extend the topology description with new layouts or presets while keeping defaults safe and predictable; favor declarative inputs over ad hoc logic so scenarios stay reviewable.
### Internal Crate Reference
High-level roles of the crates that make up the framework:
- **Configs**: prepares reusable configuration primitives for nodes, networking, tracing, data availability, and wallets, shared by all scenarios and runners.
- **Core scenario orchestration**: houses the topology and scenario model, runtime coordination, node clients, and readiness/health probes.
- **Workflows**: packages workloads and expectations into reusable building blocks and offers a fluent DSL to assemble them.
- **Runners**: implements deployment backends (local host, Docker Compose, Kubernetes) that all consume the same scenario plan.
- **Test workflows**: example scenarios and integration checks that exercise the framework end to end and serve as living documentation.
Use this map to locate where to add new capabilities: configuration primitives in configs, orchestration changes in core, reusable traffic/assertions in workflows, environment adapters in runners, and demonstrations in tests.
### Example: New Workload & Expectation (Rust)
A minimal, end-to-end illustration of adding a custom workload and matching expectation. This shows the shape of the traits and where to plug into the framework; expand the logic to fit your real test.
#### Workload: simple reachability probe
Key ideas:
- **name**: identifies the workload in logs.
- **expectations**: workloads can bundle defaults so callers don't forget checks.
- **init**: derive inputs from the generated topology (e.g., pick a target node).
- **start**: drive async activity using the shared `RunContext`.
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{
    DynError, Expectation, RunContext, RunMetrics, Workload,
};
use testing_framework_core::topology::GeneratedTopology;

pub struct ReachabilityWorkload {
    target_idx: usize,
}

impl ReachabilityWorkload {
    pub fn new(target_idx: usize) -> Self {
        Self { target_idx }
    }
}

#[async_trait]
impl Workload for ReachabilityWorkload {
    fn name(&self) -> &'static str {
        "reachability_workload"
    }

    fn expectations(&self) -> Vec<Box<dyn Expectation>> {
        // Rebuild the bundled default: boxed trait objects are not `Clone`.
        vec![Box::new(ReachabilityExpectation::new(self.target_idx))]
    }

    fn init(
        &mut self,
        topology: &GeneratedTopology,
        _metrics: &RunMetrics,
    ) -> Result<(), DynError> {
        if topology.validators().get(self.target_idx).is_none() {
            return Err("no validator at requested index".into());
        }
        Ok(())
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .get(self.target_idx)
            .ok_or("missing target client")?;
        // Pseudo-action: issue a lightweight RPC to prove reachability.
        client.health_check().await.map_err(|e| e.into())
    }
}
```
#### Expectation: confirm the target stayed reachable
Key ideas:
- **start_capture**: snapshot baseline if needed (not used here).
- **evaluate**: assert the condition after workloads finish.
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{DynError, Expectation, RunContext};

pub struct ReachabilityExpectation {
    target_idx: usize,
}

impl ReachabilityExpectation {
    pub fn new(target_idx: usize) -> Self {
        Self { target_idx }
    }
}

#[async_trait]
impl Expectation for ReachabilityExpectation {
    fn name(&self) -> &str {
        "target_reachable"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .get(self.target_idx)
            .ok_or("missing target client")?;
        client.health_check().await.map_err(|e| {
            format!("target became unreachable during run: {e}").into()
        })
    }
}
```
#### How to wire it
- Build your scenario as usual and call `.with_workload(ReachabilityWorkload::new(0))`.
- The bundled expectation is attached automatically; you can add more with `.with_expectation(...)` if needed.
- Keep the logic minimal and fast for smoke tests; grow it into richer probes for deeper scenarios.
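Putting those steps together, a hedged sketch of the wiring (`ScenarioBuilder` and `ConsensusLiveness` are illustrative assumptions; `ReachabilityWorkload` is the type defined above):
```rust
use std::time::Duration;

// `ScenarioBuilder` and `ConsensusLiveness` are illustrative assumptions.
fn wired_plan() -> Scenario {
    ScenarioBuilder::new()
        .with_validators(1)
        .with_workload(ReachabilityWorkload::new(0)) // bundles target_reachable
        .with_expectation(ConsensusLiveness::default()) // optional extra check
        .with_run_duration(Duration::from_secs(60))
        .build()
}
```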
## Part IV — Appendix
### DSL Cheat Sheet
The framework offers a fluent builder style to keep scenarios readable. Common knobs:
- **Topology shaping**: set validator and executor counts, pick a network layout style, and adjust high-level data-availability traits.
- **Wallet seeding**: define how many users participate and the total funds available for transaction workloads.
- **Workload tuning**: configure transaction rates, data-availability channel and blob rates, and whether chaos restarts should include validators, executors, or both.
- **Expectations**: attach liveness and workload-specific checks so success is explicit.
- **Run window**: set a minimum duration long enough for multiple blocks to be observed and verified.
Use these knobs to express intent clearly, keeping scenario definitions concise and consistent across teams.
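A compact sketch mapping each knob onto a hypothetical fluent call (every method name is an illustrative assumption; check the workflows DSL for the real surface):
```rust
use std::time::Duration;

// Illustrative method names only, one per knob above.
fn tuned_plan() -> Scenario {
    ScenarioBuilder::new()
        .with_validators(3)
        .with_executors(1)                           // topology shaping
        .with_wallets(10, 1_000_000)                 // wallet seeding
        .with_transactions_per_block(5)              // workload tuning
        .with_da_blobs_per_block(2)
        .with_chaos_restarts_on_validators()
        .expect_consensus_liveness()                 // expectations
        .with_run_duration(Duration::from_secs(300)) // run window
        .build()
}
```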
### Troubleshooting Scenarios
Common symptoms and likely causes:
- **No or slow block progression**: runner started workloads before readiness, insufficient run window, or environment too slow—extend duration or enable slow-environment tuning.
- **Transactions not included**: missing or insufficient wallet seeding, misaligned transaction rate with block cadence, or network instability—reduce rate and verify wallet setup.
- **Chaos stalls the run**: node control not available for the chosen runner or restart cadence too aggressive—enable control capability and widen restart intervals.
- **Observability gaps**: metrics or logs unreachable because ports clash or services are not exposed—adjust observability ports and confirm runner wiring.
- **Flaky behavior across runs**: mixing chaos with functional smoke tests or inconsistent topology between environments—separate deterministic and chaos scenarios and standardize topology presets.
### FAQ
**Why block-oriented timing?**
Using block cadence reduces dependence on host speed and keeps assertions aligned with protocol behavior.
**Can I reuse the same scenario across runners?**
Yes. The plan stays the same; swap runners (local, compose, k8s) to target different environments.
**When should I enable chaos workloads?**
Only when testing resilience or operational recovery; keep functional smoke tests deterministic.
**How long should runs be?**
Long enough for multiple blocks so liveness and inclusion checks are meaningful; very short runs risk false confidence.
**Do I always need seeded wallets?**
Only for transaction scenarios. Data-availability or pure chaos scenarios may not require them, but liveness checks still need validators producing blocks.
**What if expectations fail but workloads “look fine”?**
Trust expectations first—they capture the intended success criteria. Use the observability signals and runner logs to pinpoint why the system missed the target.
### Glossary
- **Validator**: node role responsible for participating in consensus and block production.
- **Executor**: node role that processes transactions or workloads delegated by validators.
- **DA (Data Availability)**: subsystem ensuring blobs or channel data are published and retrievable for validation.
- **Workload**: traffic or behavior generator that exercises the system during a scenario run.
- **Expectation**: post-run assertion that judges whether the system met the intended success criteria.
- **Topology**: declarative description of the cluster shape, roles, and high-level parameters for a scenario.
- **Blockfeed**: stream of block observations used for liveness or inclusion signals during a run.
- **Control capability**: the ability for a runner to start, stop, or restart nodes, used by chaos workloads.

File diff suppressed because it is too large

book/src/SUMMARY.md Normal file

@@ -0,0 +1,31 @@
# Summary
- [Project Context Primer](project-context-primer.md)
- [What You Will Learn](what-you-will-learn.md)
- [Part I — Foundations](part-i.md)
  - [Introduction](introduction.md)
  - [Architecture Overview](architecture-overview.md)
  - [Testing Philosophy](testing-philosophy.md)
  - [Scenario Lifecycle (Conceptual)](scenario-lifecycle.md)
  - [Design Rationale](design-rationale.md)
- [Part II — User Guide](part-ii.md)
  - [Workspace Layout](workspace-layout.md)
  - [Annotated Tree](annotated-tree.md)
  - [Authoring Scenarios](authoring-scenarios.md)
  - [Core Content: Workloads & Expectations](workloads.md)
  - [Core Content: ScenarioBuilderExt Patterns](scenario-builder-ext-patterns.md)
  - [Best Practices](best-practices.md)
  - [Examples](examples.md)
  - [Advanced & Artificial Examples](examples-advanced.md)
  - [Running Scenarios](running-scenarios.md)
  - [Runners](runners.md)
  - [Operations](operations.md)
- [Part III — Developer Reference](part-iii.md)
  - [Scenario Model (Developer Level)](scenario-model.md)
  - [Extending the Framework](extending.md)
  - [Example: New Workload & Expectation (Rust)](custom-workload-example.md)
  - [Internal Crate Reference](internal-crate-reference.md)
- [Part IV — Appendix](part-iv.md)
  - [DSL Cheat Sheet](dsl-cheat-sheet.md)
  - [Troubleshooting Scenarios](troubleshooting.md)
  - [FAQ](faq.md)
  - [Glossary](glossary.md)

book/src/annotated-tree.md Normal file

@@ -0,0 +1,17 @@
# Annotated Tree
High-level view of the workspace and how pieces relate:
```
nomos-testing/
├─ testing-framework/
│ ├─ configs/ # shared configuration helpers
│ ├─ core/ # scenario model, runtime, topology
│ ├─ workflows/ # workloads, expectations, DSL extensions
│ └─ runners/ # local, compose, k8s deployment backends
├─ tests/ # integration scenarios using the framework
└─ scripts/ # supporting setup utilities (e.g., assets)
```
Each area maps to a responsibility: describe configs, orchestrate scenarios,
package common traffic and assertions, adapt to environments, and demonstrate
end-to-end usage.

book/src/architecture-overview.md Normal file

@@ -0,0 +1,29 @@
# Architecture Overview
The framework follows a clear flow: **Topology → Scenario → Runner → Workloads → Expectations**.
- **Topology** describes the cluster: how many nodes, their roles, and the high-level network and data-availability parameters they should follow.
- **Scenario** combines that topology with the activities to run and the checks to perform, forming a single plan.
- **Deployer/Runner** pair turns the plan into a live environment on the chosen backend (local processes, Docker Compose, or Kubernetes) and brokers readiness.
- **Workloads** generate traffic and conditions that exercise the system.
- **Expectations** observe the run and judge success or failure once activity completes.
Conceptual diagram:
```
Topology  →  Scenario  →  Runner          →  Workloads  →  Expectations
(shape       (plan)       (deploy            (drive         (verify
 cluster)                 & orchestrate)     traffic)       outcomes)
```
Mermaid view:
```mermaid
flowchart LR
A(Topology<br/>shape cluster) --> B(Scenario<br/>plan)
B --> C(Deployer/Runner<br/>deploy & orchestrate)
C --> D(Workloads<br/>drive traffic)
D --> E(Expectations<br/>verify outcomes)
```
Each layer has a narrow responsibility so that cluster shape, deployment choice,
traffic generation, and health checks can evolve independently while fitting
together predictably.

book/src/authoring-scenarios.md Normal file

@@ -0,0 +1,20 @@
# Authoring Scenarios
Creating a scenario is a declarative exercise:
1. **Shape the topology**: decide how many validators and executors to run, and
what high-level network and data-availability characteristics matter for the
test.
2. **Attach workloads**: pick traffic generators that align with your goals
(transactions, data-availability blobs, or chaos for resilience probes).
3. **Define expectations**: specify the health signals that must hold when the
run finishes (e.g., consensus liveness, inclusion of submitted activity; see
[Core Content: Workloads & Expectations](workloads.md)).
4. **Set duration**: choose a run window long enough to observe meaningful
block progression and the effects of your workloads.
5. **Choose a runner**: target local processes for fast iteration, Docker
Compose for reproducible multi-node stacks, or Kubernetes for cluster-grade
validation. For environment considerations, see [Operations](operations.md).
Keep scenarios small and explicit: make the intended behavior and the success
criteria clear so failures are easy to interpret and act upon.
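As a concrete sketch of these five steps, here is what a plan might look like in the fluent builder style. Every method name below (`ScenarioBuilder`, `with_validators`, `with_transactions_per_block`, `expect_consensus_liveness`, `with_run_duration`) is an illustrative assumption, not the framework's confirmed API; the real helpers live in the workflows DSL.
```rust
use std::time::Duration;

// Hypothetical builder API for illustration only; names and signatures
// may differ from the actual workflows DSL.
fn smoke_plan() -> Scenario {
    ScenarioBuilder::new()
        .with_validators(2)                          // 1. shape the topology
        .with_transactions_per_block(4)              // 2. attach a workload
        .expect_consensus_liveness()                 // 3. define expectations
        .with_run_duration(Duration::from_secs(120)) // 4. set the run window
        .build()                                     // 5. hand off to a runner
}
```
The resulting plan stays runner-agnostic, which is what lets the same scenario move between local, compose, and k8s backends.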

book/src/best-practices.md Normal file

@@ -0,0 +1,16 @@
# Best Practices
- **State your intent**: document the goal of each scenario (throughput, DA
validation, resilience) so expectation choices are obvious.
- **Keep runs meaningful**: choose durations that allow multiple blocks and make
timing-based assertions trustworthy.
- **Separate concerns**: start with deterministic workloads for functional
checks; add chaos in dedicated resilience scenarios to avoid noisy failures.
- **Reuse patterns**: standardize on shared topology and workload presets so
results are comparable across environments and teams.
- **Observe first, tune second**: rely on liveness and inclusion signals to
interpret outcomes before tweaking rates or topology.
- **Environment fit**: pick runners that match the feedback loop you need—local
for speed, compose for reproducible stacks, k8s for cluster-grade fidelity.
- **Minimal surprises**: seed only necessary wallets and keep configuration
deltas explicit when moving between CI and developer machines.

book/src/custom-workload-example.md Normal file

@@ -0,0 +1,116 @@
# Example: New Workload & Expectation (Rust)
A minimal, end-to-end illustration of adding a custom workload and matching
expectation. This shows the shape of the traits and where to plug into the
framework; expand the logic to fit your real test.
## Workload: simple reachability probe
Key ideas:
- **name**: identifies the workload in logs.
- **expectations**: workloads can bundle defaults so callers don't forget checks.
- **init**: derive inputs from the generated topology (e.g., pick a target node).
- **start**: drive async activity using the shared `RunContext`.
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{
    DynError, Expectation, RunContext, RunMetrics, Workload,
};
use testing_framework_core::topology::GeneratedTopology;

pub struct ReachabilityWorkload {
    target_idx: usize,
}

impl ReachabilityWorkload {
    pub fn new(target_idx: usize) -> Self {
        Self { target_idx }
    }
}

#[async_trait]
impl Workload for ReachabilityWorkload {
    fn name(&self) -> &'static str {
        "reachability_workload"
    }

    fn expectations(&self) -> Vec<Box<dyn Expectation>> {
        // Rebuild the bundled default: boxed trait objects are not `Clone`.
        vec![Box::new(ReachabilityExpectation::new(self.target_idx))]
    }

    fn init(
        &mut self,
        topology: &GeneratedTopology,
        _metrics: &RunMetrics,
    ) -> Result<(), DynError> {
        if topology.validators().get(self.target_idx).is_none() {
            return Err("no validator at requested index".into());
        }
        Ok(())
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .get(self.target_idx)
            .ok_or("missing target client")?;
        // Pseudo-action: issue a lightweight RPC to prove reachability.
        client.health_check().await.map_err(|e| e.into())
    }
}
```
## Expectation: confirm the target stayed reachable
Key ideas:
- **start_capture**: snapshot baseline if needed (not used here).
- **evaluate**: assert the condition after workloads finish.
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{DynError, Expectation, RunContext};

pub struct ReachabilityExpectation {
    target_idx: usize,
}

impl ReachabilityExpectation {
    pub fn new(target_idx: usize) -> Self {
        Self { target_idx }
    }
}

#[async_trait]
impl Expectation for ReachabilityExpectation {
    fn name(&self) -> &str {
        "target_reachable"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .get(self.target_idx)
            .ok_or("missing target client")?;
        client.health_check().await.map_err(|e| {
            format!("target became unreachable during run: {e}").into()
        })
    }
}
```
## How to wire it
- Build your scenario as usual and call `.with_workload(ReachabilityWorkload::new(0))`.
- The bundled expectation is attached automatically; you can add more with
`.with_expectation(...)` if needed.
- Keep the logic minimal and fast for smoke tests; grow it into richer probes
for deeper scenarios.
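Putting those steps together, a hedged sketch of the wiring (`ScenarioBuilder` and `ConsensusLiveness` are illustrative assumptions; `ReachabilityWorkload` is the type defined above):
```rust
use std::time::Duration;

// `ScenarioBuilder` and `ConsensusLiveness` are illustrative assumptions.
fn wired_plan() -> Scenario {
    ScenarioBuilder::new()
        .with_validators(1)
        .with_workload(ReachabilityWorkload::new(0)) // bundles target_reachable
        .with_expectation(ConsensusLiveness::default()) // optional extra check
        .with_run_duration(Duration::from_secs(60))
        .build()
}
```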

book/src/design-rationale.md Normal file

@@ -0,0 +1,7 @@
# Design Rationale
- **Modular crates** keep configuration, orchestration, workloads, and runners decoupled so each can evolve without breaking the others.
- **Pluggable runners** let the same scenario run on a laptop, a Docker host, or a Kubernetes cluster, making validation portable across environments.
- **Separated workloads and expectations** clarify intent: what traffic to generate versus how to judge success. This simplifies review and reuse.
- **Declarative topology** makes cluster shape explicit and repeatable, reducing surprise when moving between CI and developer machines.
- **Maintainability through predictability**: a clear flow from plan to deployment to verification lowers the cost of extending the framework and interpreting failures.

book/src/dsl-cheat-sheet.md Normal file

@@ -0,0 +1,19 @@
# DSL Cheat Sheet
The framework offers a fluent builder style to keep scenarios readable. Common
knobs:
- **Topology shaping**: set validator and executor counts, pick a network layout
style, and adjust high-level data-availability traits.
- **Wallet seeding**: define how many users participate and the total funds
available for transaction workloads.
- **Workload tuning**: configure transaction rates, data-availability channel
and blob rates, and whether chaos restarts should include validators,
executors, or both.
- **Expectations**: attach liveness and workload-specific checks so success is
explicit.
- **Run window**: set a minimum duration long enough for multiple blocks to be
observed and verified.
Use these knobs to express intent clearly, keeping scenario definitions concise
and consistent across teams.
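A compact sketch mapping each knob onto a hypothetical fluent call (every method name is an illustrative assumption; check the workflows DSL for the real surface):
```rust
use std::time::Duration;

// Illustrative method names only, one per knob above.
fn tuned_plan() -> Scenario {
    ScenarioBuilder::new()
        .with_validators(3)
        .with_executors(1)                           // topology shaping
        .with_wallets(10, 1_000_000)                 // wallet seeding
        .with_transactions_per_block(5)              // workload tuning
        .with_da_blobs_per_block(2)
        .with_chaos_restarts_on_validators()
        .expect_consensus_liveness()                 // expectations
        .with_run_duration(Duration::from_secs(300)) // run window
        .build()
}
```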

book/src/examples-advanced.md Normal file

@@ -0,0 +1,62 @@
# Advanced & Artificial Examples
These illustrative scenarios stretch the framework to show how to build new
workloads, expectations, deployers, and topology tricks. They are intentionally
“synthetic” to teach capabilities rather than prescribe production tests.
## Synthetic Delay Workload (Network Latency Simulation)
- **Idea**: inject fake latency between node interactions using internal timers,
not OS-level tooling.
- **Demonstrates**: sequencing control inside a workload, verifying protocol
progression under induced lag, using timers to pace submissions.
- **Shape**: wrap submissions in delays that mimic slow peers; ensure the
expectation checks blocks still progress.
## Oscillating Load Workload (Traffic Waves)
- **Idea**: traffic rate changes every block or N seconds (e.g., blocks 1-3 low,
4-5 high, 6-7 zero, repeat).
- **Demonstrates**: dynamic, stateful workloads that use `RunMetrics` to time
phases; modeling real-world burstiness.
- **Shape**: schedule per-phase rates; confirm inclusion/liveness across peaks
and troughs.
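A minimal sketch of this oscillating pattern, reusing the `Workload` trait shape from the [custom workload example](custom-workload-example.md); the phase schedule, the `first()` accessor, and the `health_check` stand-in for a real submission are assumptions:
```rust
use std::time::Duration;

use async_trait::async_trait;
use testing_framework_core::scenario::{
    DynError, Expectation, RunContext, RunMetrics, Workload,
};
use testing_framework_core::topology::GeneratedTopology;

/// Traffic waves: each phase pairs a length with a submission count.
pub struct OscillatingLoad {
    phases: Vec<(Duration, u32)>,
}

#[async_trait]
impl Workload for OscillatingLoad {
    fn name(&self) -> &'static str {
        "oscillating_load"
    }

    fn expectations(&self) -> Vec<Box<dyn Expectation>> {
        Vec::new() // pair with a liveness expectation in the scenario
    }

    fn init(
        &mut self,
        _topology: &GeneratedTopology,
        _metrics: &RunMetrics,
    ) -> Result<(), DynError> {
        Ok(())
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let client = ctx
            .clients()
            .validators()
            .first()
            .ok_or("no validator client")?;
        // Assumption: the runtime stops workloads when the run window
        // closes, so cycling through the schedule forever is safe.
        loop {
            for (window, rate) in &self.phases {
                for _ in 0..*rate {
                    // Stand-in action; a real workload would submit a
                    // transaction or blob here.
                    if let Err(e) = client.health_check().await {
                        return Err(e.into());
                    }
                }
                tokio::time::sleep(*window).await;
            }
        }
    }
}
```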
## Byzantine Behavior Mock
- **Idea**: a workload that drops half its planned submissions, sometimes
double-submits, and intentionally triggers expectation failures.
- **Demonstrates**: negative testing, resilience checks, and the value of clear
expectations when behavior is adversarial by design.
- **Shape**: parameterize drop/double-submit probabilities; pair with an
expectation that documents what “bad” looks like.
## Custom Expectation: Block Finality Drift
- **Idea**: assert the last few blocks differ and block time stays within a
tolerated drift budget.
- **Demonstrates**: consuming `BlockFeed` or time-series metrics to validate
protocol cadence; crafting post-run assertions around block diversity and
timing.
- **Shape**: collect recent blocks, confirm no duplicates, and compare observed
intervals to a drift threshold.
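A sketch of that expectation; `ObservedBlock` and the `recent_blocks` helper are assumptions standing in for whatever the block feed actually exposes:
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{DynError, Expectation, RunContext};

/// Assumed shape of one observation taken from the block feed.
pub struct ObservedBlock {
    pub hash: [u8; 32],
    pub secs: f64, // observation time in seconds
}

// Stub for illustration: a real expectation would read these from the
// runtime's block feed.
fn recent_blocks(_ctx: &RunContext) -> Vec<ObservedBlock> {
    Vec::new()
}

pub struct FinalityDrift {
    pub max_interval_secs: f64,
}

#[async_trait]
impl Expectation for FinalityDrift {
    fn name(&self) -> &str {
        "block_finality_drift"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        let blocks = recent_blocks(ctx);
        for pair in blocks.windows(2) {
            if pair[0].hash == pair[1].hash {
                return Err("duplicate block in recent window".into());
            }
            let interval = pair[1].secs - pair[0].secs;
            if interval > self.max_interval_secs {
                return Err(format!(
                    "block interval {interval:.1}s exceeds drift budget"
                )
                .into());
            }
        }
        Ok(())
    }
}
```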
## Custom Deployer: Dry-Run Deployer
- **Idea**: a deployer that never starts nodes; it emits configs, simulates
readiness, and provides fake blockfeed/metrics.
- **Demonstrates**: full power of the deployer interface for CI dry-runs,
config verification, and ultra-fast feedback without Nomos binaries.
- **Shape**: produce logs/artifacts, stub readiness, and feed synthetic blocks
so expectations can still run.
## Stochastic Topology Generator
- **Idea**: topology parameters change at runtime (random validators, DA
settings, network shapes).
- **Demonstrates**: randomized property testing and fuzzing approaches to
topology building.
- **Shape**: pick roles and network layouts randomly per run; keep expectations
tolerant to variability while still asserting core liveness.
## Multi-Phase Scenario (“Pipelines”)
- **Idea**: scenario runs in phases (e.g., phase 1 transactions, phase 2 DA,
phase 3 restarts, phase 4 sync check).
- **Demonstrates**: multi-stage tests, modular scenario assembly, and deliberate
lifecycle control.
- **Shape**: drive phase-specific workloads/expectations sequentially; enforce
clear boundaries and post-phase checks.

book/src/examples.md Normal file

@@ -0,0 +1,28 @@
# Examples
Concrete scenario shapes that illustrate how to combine topologies, workloads,
and expectations. Adjust counts, rates, and durations to fit your environment.
## Simple 2-validator transaction workload
- **Topology**: two validators.
- **Workload**: transaction submissions at a modest per-block rate with a small
set of wallet actors.
- **Expectations**: consensus liveness and inclusion of submitted activity.
- **When to use**: smoke tests for consensus and transaction flow on minimal
hardware.
## DA + transaction workload
- **Topology**: validators plus executors if available.
- **Workloads**: data-availability blobs/channels and transactions running
together to stress both paths.
- **Expectations**: consensus liveness and workload-level inclusion/availability
checks.
- **When to use**: end-to-end coverage of transaction and DA layers in one run.
## Chaos + liveness check
- **Topology**: validators (optionally executors) with node control enabled.
- **Workloads**: baseline traffic (transactions or DA) plus chaos restarts on
selected roles.
- **Expectations**: consensus liveness to confirm the system keeps progressing
despite restarts; workload-specific inclusion if traffic is present.
- **When to use**: resilience validation and operational readiness drills.

book/src/extending.md Normal file

@@ -0,0 +1,31 @@
# Extending the Framework
## Adding a workload
1) Implement `testing_framework_core::scenario::Workload`:
- Provide a name and any bundled expectations.
- In `init`, derive inputs from `GeneratedTopology` and `RunMetrics`; fail
fast if prerequisites are missing (e.g., wallet data, node addresses).
- In `start`, drive async traffic using the `RunContext` clients.
2) Expose the workload from a module under `testing-framework/workflows` and
consider adding a DSL helper for ergonomic wiring.
## Adding an expectation
1) Implement `testing_framework_core::scenario::Expectation`:
- Use `start_capture` to snapshot baseline metrics.
- Use `evaluate` to assert outcomes after workloads finish; return all errors
so the runner can aggregate them.
2) Export it from `testing-framework/workflows` if it is reusable.
## Adding a runner
1) Implement `testing_framework_core::scenario::Deployer` for your backend.
- Produce a `RunContext` with `NodeClients`, metrics endpoints, and optional
`NodeControlHandle`.
- Guard cleanup with `CleanupGuard` to reclaim resources even on failures.
2) Mirror the readiness and block-feed probes used by the existing runners so
workloads can rely on consistent signals.
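A skeleton of the deployer contract, borrowing the "dry run" idea from the advanced examples; the exact `Deployer` signature is an assumption inferred from the steps above:
```rust
use async_trait::async_trait;
use testing_framework_core::scenario::{Deployer, DynError, RunContext};
use testing_framework_core::topology::GeneratedTopology;

/// Never starts nodes: emits configs and simulated readiness so
/// expectations can run against synthetic signals.
pub struct DryRunDeployer;

#[async_trait]
impl Deployer for DryRunDeployer {
    // Assumed signature: deploy a generated topology, return a RunContext.
    async fn deploy(
        &self,
        _topology: &GeneratedTopology,
    ) -> Result<RunContext, DynError> {
        // 1. Write node configs to disk for inspection.
        // 2. Report readiness immediately (no processes to wait for).
        // 3. Hand back a RunContext with stub NodeClients, a synthetic
        //    block feed, and a no-op CleanupGuard.
        todo!("construct stub clients, metrics endpoints, and cleanup guard")
    }
}
```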
## Adding topology helpers
- Extend `testing_framework_core::topology::TopologyBuilder` with new layouts or
configuration presets (e.g., specialized DA parameters). Keep defaults safe:
ensure at least one participant and clamp dispersal factors as the current
helpers do.

book/src/faq.md Normal file

@@ -0,0 +1,26 @@
# FAQ
**Why block-oriented timing?**
Using block cadence reduces dependence on host speed and keeps assertions aligned
with protocol behavior.
**Can I reuse the same scenario across runners?**
Yes. The plan stays the same; swap runners (local, compose, k8s) to target
different environments.
**When should I enable chaos workloads?**
Only when testing resilience or operational recovery; keep functional smoke
tests deterministic.
**How long should runs be?**
Long enough for multiple blocks so liveness and inclusion checks are
meaningful; very short runs risk false confidence.
**Do I always need seeded wallets?**
Only for transaction scenarios. Data-availability or pure chaos scenarios may
not require them, but liveness checks still need validators producing blocks.
**What if expectations fail but workloads “look fine”?**
Trust expectations first—they capture the intended success criteria. Use the
observability signals and runner logs to pinpoint why the system missed the
target.

book/src/glossary.md Normal file

@@ -0,0 +1,18 @@
# Glossary
- **Validator**: node role responsible for participating in consensus and block
production.
- **Executor**: node role that processes transactions or workloads delegated by
validators.
- **DA (Data Availability)**: subsystem ensuring blobs or channel data are
published and retrievable for validation.
- **Workload**: traffic or behavior generator that exercises the system during a
scenario run.
- **Expectation**: post-run assertion that judges whether the system met the
intended success criteria.
- **Topology**: declarative description of the cluster shape, roles, and
high-level parameters for a scenario.
- **Blockfeed**: stream of block observations used for liveness or inclusion
signals during a run.
- **Control capability**: the ability for a runner to start, stop, or restart
nodes, used by chaos workloads.

book/src/internal-crate-reference.md Normal file

@@ -0,0 +1,18 @@
# Internal Crate Reference
High-level roles of the crates that make up the framework:
- **Configs**: prepares reusable configuration primitives for nodes, networking,
tracing, data availability, and wallets, shared by all scenarios and runners.
- **Core scenario orchestration**: houses the topology and scenario model,
runtime coordination, node clients, and readiness/health probes.
- **Workflows**: packages workloads and expectations into reusable building
blocks and offers a fluent DSL to assemble them.
- **Runners**: implements deployment backends (local host, Docker Compose,
Kubernetes) that all consume the same scenario plan.
- **Test workflows**: example scenarios and integration checks that exercise the
framework end to end and serve as living documentation.
Use this map to locate where to add new capabilities: configuration primitives
in configs, orchestration changes in core, reusable traffic/assertions in
workflows, environment adapters in runners, and demonstrations in tests.

book/src/introduction.md Normal file

@@ -0,0 +1,15 @@
# Introduction
The Nomos Testing Framework is a purpose-built toolkit for exercising Nomos in
realistic, multi-node environments. It solves the gap between small, isolated
tests and full-system validation by letting teams describe a cluster layout,
drive meaningful traffic, and assert the outcomes in one coherent plan.
It is for protocol engineers, infrastructure operators, and QA teams who need
repeatable confidence that validators, executors, and data-availability
components work together under network and timing constraints.
Multi-node integration testing is required because many Nomos behaviors—block
progress, data availability, liveness under churn—only emerge when several
roles interact over real networking and time. This framework makes those checks
declarative, observable, and portable across environments.

book/src/operations.md Normal file

@@ -0,0 +1,42 @@
# Operations
Operational readiness focuses on prerequisites, environment fit, and clear
signals:
- **Prerequisites**: keep a sibling `nomos-node` checkout available; ensure the
chosen runner's platform needs are met (local binaries for host runs, Docker
for compose, cluster access for k8s).
- **Artifacts**: some scenarios depend on prover or circuit assets; fetch them
ahead of time with the provided helper scripts when needed.
- **Environment flags**: use slow-environment toggles to relax timeouts, enable
tracing when debugging, and adjust observability ports to avoid clashes.
- **Readiness checks**: verify runners report node readiness before starting
workloads; this avoids false negatives from starting too early.
- **Failure triage**: map failures to missing prerequisites (wallet seeding,
node control availability), runner platform issues, or unmet expectations.
Start with liveness signals, then dive into workload-specific assertions.
Treat operational hygiene—assets present, prerequisites satisfied, observability
reachable—as the first step to reliable scenario outcomes.
Metrics and observability flow:
```
Runner exposes endpoints/ports
              ↓
Runtime collects block/health signals
              ↓
Expectations consume signals to decide pass/fail
              ↓
Operators inspect logs/metrics when failures arise
```
Mermaid view:
```mermaid
flowchart TD
Expose[Runner exposes endpoints/ports] --> Collect[Runtime collects block/health signals]
Collect --> Consume[Expectations consume signals<br/>decide pass/fail]
Consume --> Inspect[Operators inspect logs/metrics<br/>when failures arise]
```

book/src/part-i.md Normal file

@@ -0,0 +1,4 @@
# Part I — Foundations
Conceptual chapters that establish the mental model for the framework and how
it approaches multi-node testing.

book/src/part-ii.md Normal file

@@ -0,0 +1,4 @@
# Part II — User Guide
Practical guidance for shaping scenarios, combining workloads and expectations,
and running them across different environments.

book/src/part-iii.md Normal file

@@ -0,0 +1,4 @@
# Part III — Developer Reference
Deep dives for contributors who extend the framework, evolve its abstractions,
or maintain the crate set.

book/src/part-iv.md Normal file

@@ -0,0 +1,4 @@
# Part IV — Appendix
Quick-reference material and supporting guidance to keep scenarios discoverable,
debuggable, and consistent.

book/src/project-context-primer.md Normal file

@@ -0,0 +1,16 @@
# Project Context Primer
This book focuses on the Nomos Testing Framework. It assumes familiarity with
the Nomos architecture, but for completeness, here is a short primer.
- **Nomos** is a modular blockchain protocol composed of validators, executors,
and a data-availability (DA) subsystem.
- **Validators** participate in consensus and produce blocks.
- **Executors** run application logic or off-chain computations referenced by
blocks.
- **Data Availability (DA)** ensures that data referenced in blocks is
published and retrievable, including blobs or channel data used by workloads.
These roles interact tightly, which is why meaningful testing must be performed
in multi-node environments that include real networking, timing, and DA
interaction.

book/src/runners.md Normal file

@@ -0,0 +1,51 @@
# Runners
Runners turn a scenario plan into a live environment while keeping the plan
unchanged. Choose based on feedback speed, reproducibility, and fidelity. For
environment and operational considerations, see [Operations](operations.md).
## Local runner
- Launches node processes directly on the host.
- Fastest feedback loop and minimal orchestration overhead.
- Best for development-time iteration and debugging.
## Docker Compose runner
- Starts nodes in containers to provide a reproducible multi-node stack on a
single machine.
- Discovers service ports and wires observability for convenient inspection.
- Good balance between fidelity and ease of setup.
## Kubernetes runner
- Deploys nodes onto a cluster for higher-fidelity, longer-running scenarios.
- Suits CI or shared environments where cluster behavior and scheduling matter.
### Common expectations
- All runners require at least one validator and, for transaction scenarios,
access to seeded wallets.
- Readiness probes gate workload start so traffic begins only after nodes are
reachable.
- Environment flags can relax timeouts or increase tracing when diagnostics are
needed.
Runner abstraction:
```
Scenario Plan
      ↓
Runner (local | compose | k8s)
      │ provisions env + readiness
      ↓
Runtime + Observability
      ↓
Workloads / Expectations execute
```
Mermaid view:
```mermaid
flowchart TD
Plan[Scenario Plan] --> RunSel{"Runner<br/>(local | compose | k8s)"}
RunSel --> Provision[Provision & readiness]
Provision --> Runtime[Runtime + observability]
Runtime --> Exec[Workloads & Expectations execute]
```

book/src/running-scenarios.md Normal file

@@ -0,0 +1,17 @@
# Running Scenarios
Running a scenario follows the same conceptual flow regardless of environment:
1. Select or author a scenario plan that pairs a topology with workloads,
expectations, and a suitable run window.
2. Choose a runner aligned with your environment (local, compose, or k8s) and
ensure its prerequisites are available.
3. Deploy the plan through the runner; wait for readiness signals before
starting workloads.
4. Let workloads drive activity for the planned duration; keep observability
signals visible so you can correlate outcomes.
5. Evaluate expectations and capture results as the primary pass/fail signal.
Use the same plan across different runners to compare behavior between local
development and CI or cluster settings. For environment prerequisites and
flags, see [Operations](operations.md).
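In code, the whole flow reduces to a couple of lines; the runner type and its `run` entry point below are hypothetical stand-ins for whichever backend you pick, and `smoke_plan` is the sketch from [Authoring Scenarios](authoring-scenarios.md):
```rust
// Hypothetical runner API; the real backends live under
// testing-framework/runners.
async fn run_smoke() -> Result<(), DynError> {
    let plan = smoke_plan(); // the same plan works on local, compose, or k8s
    // deploy → readiness → drive workloads → evaluate → cleanup
    LocalRunner::default().run(plan).await
}
```
Swapping `LocalRunner` for a compose or k8s runner changes only that one line, not the plan.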

book/src/scenario-builder-ext-patterns.md Normal file

@@ -0,0 +1,17 @@
# Core Content: ScenarioBuilderExt Patterns
Patterns that keep scenarios readable and reusable:
- **Topology-first**: start by shaping the cluster (counts, layout) so later
steps inherit a clear foundation.
- **Bundle defaults**: use the DSL helpers to attach common expectations (like
liveness) whenever you add a matching workload, reducing forgotten checks.
- **Intentional rates**: express traffic in per-block terms to align with
protocol timing rather than wall-clock assumptions.
- **Opt-in chaos**: enable restart patterns only in scenarios meant to probe
resilience; keep functional smoke tests deterministic.
- **Wallet clarity**: seed only the number of actors you need; it keeps
transaction scenarios deterministic and interpretable.
These patterns make scenario definitions self-explanatory while staying aligned
with the framework's block-oriented timing model.
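A hedged sketch of how these patterns might read in the DSL; every method name
below is illustrative, chosen only to mirror the patterns above:
```rust
// Hypothetical DSL sketch; method names are assumptions, not the crate's API.
let plan = ScenarioBuilder::topology(|t| t.validators(3).executors(1)) // topology-first
    .seed_wallets(4)              // wallet clarity: only the actors you need
    .transactions_per_block(2)    // intentional rates, expressed in protocol time
    .with_default_liveness()      // bundled default expectation, never forgotten
    // .with_random_restarts(..)  // opt-in chaos: leave out of smoke tests
    .run_for_blocks(30)
    .build()?;
```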

View File

@ -0,0 +1,24 @@
# Scenario Lifecycle (Conceptual)
1. **Build the plan**: Declare a topology, attach workloads and expectations, and set the run window. The plan is the single source of truth for what will happen.
2. **Deploy**: Hand the plan to a runner. It provisions the environment on the chosen backend and waits for nodes to signal readiness.
3. **Drive workloads**: Start traffic and behaviors (transactions, data-availability activity, restarts) for the planned duration.
4. **Observe blocks and signals**: Track block progression and other high-level metrics during or after the run window to ground assertions in protocol time.
5. **Evaluate expectations**: Once activity stops (and optional cooldown completes), check liveness and workload-specific outcomes to decide pass or fail.
6. **Cleanup**: Tear down resources so successive runs start fresh and do not inherit leaked state.
Conceptual lifecycle diagram:
```
Plan → Deploy → Readiness → Drive Workloads → Observe → Evaluate → Cleanup
```
Mermaid view:
```mermaid
flowchart LR
P[Plan<br/>topology + workloads + expectations] --> D[Deploy<br/>runner provisions]
D --> R[Readiness<br/>wait for nodes]
R --> W[Drive Workloads]
W --> O[Observe<br/>blocks/metrics]
O --> E[Evaluate Expectations]
E --> C[Cleanup]
```
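The same ordering as a hedged Rust sketch; every method name here is an
assumption, and only the sequence mirrors the steps above:
```rust
// Illustrative lifecycle driver; not the crate's actual API.
async fn run_scenario(plan: ScenarioPlan, runner: impl Deployer) -> anyhow::Result<()> {
    let ctx = runner.deploy(&plan).await?;   // 2. provision and await readiness
    plan.capture_baselines(&ctx).await?;     //    expectations record a baseline
    plan.drive_workloads(&ctx).await?;       // 3. traffic for the run window
    plan.cooldown(&ctx).await;               //    settle after control actions
    plan.evaluate_expectations(&ctx).await?; // 5. the pass/fail decision
    Ok(())                                   // 6. cleanup guards fire on drop
}
```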

View File

@ -0,0 +1,23 @@
# Scenario Model (Developer Level)
The scenario model defines clear, composable responsibilities:
- **Topology**: a declarative description of the cluster—how many nodes, their
roles, and the broad network and data-availability characteristics. It
represents the intended shape of the system under test.
- **Scenario**: a plan combining topology, workloads, expectations, and a run
window. Building a scenario validates prerequisites (like seeded wallets) and
ensures the run lasts long enough to observe meaningful block progression.
- **Workloads**: asynchronous tasks that generate traffic or conditions. They
use shared context to interact with the deployed cluster and may bundle
default expectations.
- **Expectations**: post-run assertions. They can capture baselines before
workloads start and evaluate success once activity stops.
- **Runtime**: coordinates workloads and expectations for the configured
duration, enforces cooldowns when control actions occur, and ensures cleanup
so runs do not leak resources.
Developers extending the model should keep these boundaries strict: topology
describes, scenarios assemble, runners deploy, workloads drive, and expectations
judge outcomes. For guidance on adding new capabilities, see
[Extending the Framework](extending.md).
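These boundaries suggest trait shapes roughly like the sketch below. The trait
names and signatures are assumptions; only `RunContext` is a name that appears
in the runner code:
```rust
// Conceptual sketch only; signatures are assumptions, not the crate's API.
#[async_trait::async_trait]
trait Workload {
    /// Drive traffic against the deployed cluster via the shared context.
    async fn drive(&self, ctx: &RunContext) -> anyhow::Result<()>;
    /// Workloads may bundle default expectations so checks are not forgotten.
    fn default_expectations(&self) -> Vec<Box<dyn Expectation>> { Vec::new() }
}

#[async_trait::async_trait]
trait Expectation {
    /// Capture a baseline before workloads start.
    async fn baseline(&mut self, ctx: &RunContext) -> anyhow::Result<()>;
    /// Judge the outcome once activity stops.
    async fn evaluate(&self, ctx: &RunContext) -> anyhow::Result<()>;
}
```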

View File

@ -0,0 +1,9 @@
# Testing Philosophy
- **Declarative over imperative**: describe the desired cluster shape, traffic, and success criteria; let the framework orchestrate the run.
- **Observable health signals**: prefer liveness and inclusion signals that reflect real user impact instead of internal debug state.
- **Determinism first**: default scenarios aim for repeatable outcomes with fixed topologies and traffic rates; variability is opt-in.
- **Targeted non-determinism**: introduce randomness (e.g., restarts) only when probing resilience or operational robustness.
- **Protocol time, not wall time**: reason in blocks and protocol-driven intervals to reduce dependence on host speed or scheduler noise.
- **Minimum run window**: always allow enough block production to make assertions meaningful; very short runs risk false confidence.
- **Use chaos with intent**: chaos workloads are for recovery and fault-tolerance validation, not for baseline functional checks.

View File

@ -0,0 +1,9 @@
# Troubleshooting Scenarios
Common symptoms and likely causes:
- **No or slow block progression**: runner started workloads before readiness, insufficient run window, or environment too slow—extend duration or enable slow-environment tuning.
- **Transactions not included**: missing or insufficient wallet seeding, misaligned transaction rate with block cadence, or network instability—reduce rate and verify wallet setup.
- **Chaos stalls the run**: node control not available for the chosen runner or restart cadence too aggressive—enable control capability and widen restart intervals.
- **Observability gaps**: metrics or logs unreachable because ports clash or services are not exposed—adjust observability ports and confirm runner wiring.
- **Flaky behavior across runs**: mixing chaos with functional smoke tests or inconsistent topology between environments—separate deterministic and chaos scenarios and standardize topology presets.

View File

@ -0,0 +1,7 @@
# Usage Patterns
- **Shape a topology, pick a runner**: choose local for quick iteration, compose for reproducible multi-node stacks with observability, or k8s for cluster-grade validation.
- **Compose workloads deliberately**: pair transactions and data-availability traffic for end-to-end coverage; add chaos only when assessing recovery and resilience.
- **Align expectations with goals**: use liveness-style checks to confirm the system keeps up with planned activity, and add workload-specific assertions for inclusion or availability.
- **Reuse plans across environments**: keep the scenario constant while swapping runners to compare behavior between developer machines and CI clusters.
- **Iterate with clear signals**: treat expectation outcomes as the primary pass/fail indicator, and adjust topology or workloads based on what those signals reveal.

View File

@ -0,0 +1,6 @@
# What You Will Learn
This book gives you a clear mental model for Nomos multi-node testing, shows how
to author scenarios that pair realistic workloads with explicit expectations,
and guides you to run them across local, containerized, and cluster environments
without changing the plan.

42
book/src/workloads.md Normal file
View File

@ -0,0 +1,42 @@
# Core Content: Workloads & Expectations
Workloads describe the activity a scenario generates; expectations describe the
signals that must hold when that activity completes. Both are pluggable so
scenarios stay readable and purpose-driven.
## Workloads
- **Transaction workload**: submits user-level transactions at a configurable
rate and can limit how many distinct actors participate.
- **Data-availability workload**: drives blob and channel activity to exercise
data-availability paths.
- **Chaos workload**: triggers controlled node restarts to test resilience and
recovery behaviors (requires a runner that can control nodes).
## Expectations
- **Consensus liveness**: verifies the system continues to produce blocks in
line with the planned workload and timing window.
- **Workload-specific checks**: each workload can attach its own success
criteria (e.g., inclusion of submitted activity) so scenarios remain concise.
Together, workloads and expectations let you express both the pressure applied
to the system and the definition of “healthy” for that run.
Workload pipeline (conceptual):
```
Inputs (topology + wallets + rates)
        │
Workload init → Drive traffic → Collect signals
        │
Expectations evaluate
```
Mermaid view:
```mermaid
flowchart TD
    I["Inputs<br/>(topology + wallets + rates)"] --> Init[Workload init]
Init --> Drive[Drive traffic]
Drive --> Collect[Collect signals]
Collect --> Eval[Expectations evaluate]
```
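To make the per-block pacing concrete, here is a hedged sketch of a transaction
driver keyed to block arrival rather than wall time; `BlockFeed`, `Wallet`, and
their methods are illustrative assumptions:
```rust
// Illustrative sketch: submit `rate` transactions per observed block.
async fn drive_transactions(
    feed: &mut BlockFeed,
    wallets: &[Wallet],
    rate: usize,
) -> anyhow::Result<()> {
    let mut next = 0usize;
    while let Some(_block) = feed.next_block().await {
        for _ in 0..rate {
            // Rotate through the seeded actors deterministically.
            wallets[next % wallets.len()].submit_transfer().await?;
            next += 1;
        }
    }
    Ok(())
}
```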

View File

@ -0,0 +1,19 @@
# Workspace Layout
The workspace focuses on multi-node integration testing and sits alongside a
`nomos-node` checkout. Its crates separate concerns to keep scenarios
repeatable and portable:
- **Configs**: prepares high-level node, network, tracing, and wallet settings
used across test environments.
- **Core scenario orchestration**: the engine that holds topology descriptions,
scenario plans, runtimes, workloads, and expectations.
- **Workflows**: ready-made workloads (transactions, data-availability, chaos)
and reusable expectations assembled into a user-facing DSL.
- **Runners**: deployment backends for local processes, Docker Compose, and
Kubernetes, all consuming the same scenario plan.
- **Test workflows**: example scenarios and integration checks that show how
the pieces fit together.
This split keeps configuration, orchestration, reusable traffic patterns, and
deployment adapters loosely coupled while sharing one mental model for tests.
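As a rough map of that split (directory names here are illustrative, not the
actual crate names):
```
workspace/
├── configs/          # node, network, tracing, and wallet settings
├── core/             # topology, scenario plans, runtime, workload/expectation traits
├── workflows/        # ready-made workloads and the user-facing DSL
├── runners/          # local, compose, and k8s deployment backends
└── test-workflows/   # example scenarios and integration checks
```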

View File

@ -21,6 +21,20 @@ if [ ! -d "$CIRCUITS_DIR" ]; then
exit 1
fi
system_gmp_package() {
local multiarch
multiarch="$(gcc -print-multiarch 2>/dev/null || echo aarch64-linux-gnu)"
local lib_path="/usr/lib/${multiarch}/libgmp.a"
if [ ! -f "$lib_path" ]; then
echo "system libgmp.a not found at $lib_path" >&2
return 1
fi
mkdir -p depends/gmp/package_aarch64/lib depends/gmp/package_aarch64/include
cp "$lib_path" depends/gmp/package_aarch64/lib/
# Headers are small; copy the public ones the build expects.
cp /usr/include/gmp*.h depends/gmp/package_aarch64/include/ || true
}
case "$TARGET_ARCH" in
arm64 | aarch64)
;;
@ -41,12 +55,23 @@ git submodule update --init --recursive >&2
if [ "${RAPIDSNARK_BUILD_GMP:-1}" = "1" ]; then
GMP_TARGET="${RAPIDSNARK_GMP_TARGET:-aarch64}"
./build_gmp.sh "$GMP_TARGET" >&2
else
echo "Using system libgmp to satisfy rapidsnark dependencies" >&2
system_gmp_package
fi
MAKE_TARGET="${RAPIDSNARK_MAKE_TARGET:-host_arm64}"
PACKAGE_DIR="${RAPIDSNARK_PACKAGE_DIR:-package_arm64}"
make "$MAKE_TARGET" -j"$(nproc)" >&2
rm -rf build_prover_arm64
mkdir build_prover_arm64
cd build_prover_arm64
cmake .. \
-DTARGET_PLATFORM=aarch64 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="../${PACKAGE_DIR}" \
-DBUILD_SHARED_LIBS=OFF >&2
cmake --build . --target prover verifier -- -j"$(nproc)" >&2
install -m 0755 "${PACKAGE_DIR}/bin/prover" "$CIRCUITS_DIR/prover"
install -m 0755 "src/prover" "$CIRCUITS_DIR/prover"
install -m 0755 "src/verifier" "$CIRCUITS_DIR/verifier"
echo "rapidsnark prover installed to $CIRCUITS_DIR/prover" >&2

View File

@ -121,7 +121,7 @@ download_release() {
print_error "Please check that version ${VERSION} exists for platform ${platform}"
print_error "Available releases: https://github.com/${REPO}/releases"
rm -rf "$temp_dir"
exit 1
return 1
fi
print_success "Download complete"
@ -132,7 +132,7 @@ download_release() {
if ! tar -xzf "${temp_dir}/${artifact}" -C "$INSTALL_DIR" --strip-components=1; then
print_error "Failed to extract archive"
rm -rf "$temp_dir"
exit 1
return 1
fi
rm -rf "$temp_dir"
@ -171,8 +171,18 @@ main() {
# Check existing installation
check_existing_installation
# Download and extract
download_release "$platform"
# Download and extract (retry with x86_64 bundle on aarch64 if needed)
if ! download_release "$platform"; then
if [[ "$platform" == linux-aarch64 ]]; then
print_warning "Falling back to linux-x86_64 circuits bundle; will rebuild prover for aarch64."
rm -rf "$INSTALL_DIR"
if ! download_release "linux-x86_64"; then
exit 1
fi
else
exit 1
fi
fi
# Handle macOS quarantine if needed
if [[ "$platform" == macos-* ]]; then

View File

@ -82,7 +82,7 @@ pub fn create_executor_config(config: GeneralConfig) -> ExecutorConfig {
// non-string keys and keep services alive.
recovery_file: PathBuf::new(),
bootstrap: chain_service::BootstrapConfig {
prolonged_bootstrap_period: Duration::from_secs(3),
prolonged_bootstrap_period: config.bootstrapping_config.prolonged_bootstrap_period,
force_bootstrap: false,
offline_grace_period: chain_service::OfflineGracePeriodConfig {
grace_period: Duration::from_secs(20 * 60),

View File

@ -204,7 +204,8 @@ fn build_values(topology: &GeneratedTopology) -> HelmValues {
let validators = topology
.validators()
.iter()
.map(|validator| {
.enumerate()
.map(|(index, validator)| {
let mut env = BTreeMap::new();
env.insert(
"CFG_NETWORK_PORT".into(),
@ -225,6 +226,8 @@ fn build_values(topology: &GeneratedTopology) -> HelmValues {
.port()
.to_string(),
);
env.insert("CFG_HOST_KIND".into(), "validator".into());
env.insert("CFG_HOST_IDENTIFIER".into(), format!("validator-{index}"));
NodeValues {
api_port: validator.general.api_config.address.port(),
@ -237,7 +240,8 @@ fn build_values(topology: &GeneratedTopology) -> HelmValues {
let executors = topology
.executors()
.iter()
.map(|executor| {
.enumerate()
.map(|(index, executor)| {
let mut env = BTreeMap::new();
env.insert(
"CFG_NETWORK_PORT".into(),
@ -258,6 +262,8 @@ fn build_values(topology: &GeneratedTopology) -> HelmValues {
.port()
.to_string(),
);
env.insert("CFG_HOST_KIND".into(), "executor".into());
env.insert("CFG_HOST_IDENTIFIER".into(), format!("executor-{index}"));
NodeValues {
api_port: executor.general.api_config.address.port(),

View File

@ -22,7 +22,7 @@ use crate::{
helm::{HelmError, install_release},
host::node_host,
logs::dump_namespace_logs,
wait::{ClusterPorts, ClusterWaitError, NodeConfigPorts, wait_for_cluster_ready},
wait::{ClusterPorts, ClusterReady, ClusterWaitError, NodeConfigPorts, wait_for_cluster_ready},
};
pub struct K8sRunner {
@ -66,6 +66,7 @@ struct ClusterEnvironment {
executor_api_ports: Vec<u16>,
executor_testing_ports: Vec<u16>,
prometheus_port: u16,
port_forwards: Vec<std::process::Child>,
}
impl ClusterEnvironment {
@ -75,6 +76,7 @@ impl ClusterEnvironment {
release: String,
cleanup: RunnerCleanup,
ports: &ClusterPorts,
port_forwards: Vec<std::process::Child>,
) -> Self {
Self {
client,
@ -86,6 +88,7 @@ impl ClusterEnvironment {
executor_api_ports: ports.executors.iter().map(|ports| ports.api).collect(),
executor_testing_ports: ports.executors.iter().map(|ports| ports.testing).collect(),
prometheus_port: ports.prometheus,
port_forwards,
}
}
@ -97,15 +100,17 @@ impl ClusterEnvironment {
"k8s stack failure; collecting diagnostics"
);
dump_namespace_logs(&self.client, &self.namespace).await;
kill_port_forwards(&mut self.port_forwards);
if let Some(guard) = self.cleanup.take() {
Box::new(guard).cleanup();
}
}
fn into_cleanup(mut self) -> RunnerCleanup {
self.cleanup
.take()
.expect("cleanup guard should be available")
fn into_cleanup(self) -> (RunnerCleanup, Vec<std::process::Child>) {
(
self.cleanup.expect("cleanup guard should be available"),
self.port_forwards,
)
}
}
@ -264,12 +269,15 @@ impl Deployer for K8sRunner {
return Err(err);
}
};
let cleanup = cluster
let (cleanup, port_forwards) = cluster
.take()
.expect("cluster should still be available")
.into_cleanup();
let cleanup_guard: Box<dyn CleanupGuard> =
Box::new(K8sCleanupGuard::new(cleanup, block_feed_guard));
let cleanup_guard: Box<dyn CleanupGuard> = Box::new(K8sCleanupGuard::new(
cleanup,
block_feed_guard,
port_forwards,
));
let context = RunContext::new(
descriptors,
None,
@ -301,6 +309,14 @@ fn ensure_supported_topology(descriptors: &GeneratedTopology) -> Result<(), K8sR
Ok(())
}
fn kill_port_forwards(handles: &mut Vec<std::process::Child>) {
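    // Best-effort teardown: kill each forward and reap it; errors from
    // already-exited children are deliberately ignored.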
for handle in handles.iter_mut() {
let _ = handle.kill();
let _ = handle.wait();
}
handles.clear();
}
fn collect_port_specs(descriptors: &GeneratedTopology) -> PortSpecs {
let validators = descriptors
.validators()
@ -386,11 +402,11 @@ async fn setup_cluster(
let mut cleanup_guard =
Some(install_stack(client, &assets, &namespace, &release, validators, executors).await?);
let cluster_ports =
let cluster_ready =
wait_for_ports_or_cleanup(client, &namespace, &release, specs, &mut cleanup_guard).await?;
info!(
prometheus_port = cluster_ports.prometheus,
prometheus_port = cluster_ready.ports.prometheus,
"discovered prometheus endpoint"
);
@ -401,7 +417,8 @@ async fn setup_cluster(
cleanup_guard
.take()
.expect("cleanup guard must exist after successful cluster startup"),
&cluster_ports,
&cluster_ready.ports,
cluster_ready.port_forwards,
);
if readiness_checks {
@ -448,7 +465,7 @@ async fn wait_for_ports_or_cleanup(
release: &str,
specs: &PortSpecs,
cleanup_guard: &mut Option<RunnerCleanup>,
) -> Result<ClusterPorts, K8sRunnerError> {
) -> Result<ClusterReady, K8sRunnerError> {
match wait_for_cluster_ready(
client,
namespace,
@ -498,13 +515,19 @@ async fn ensure_cluster_readiness(
struct K8sCleanupGuard {
cleanup: RunnerCleanup,
block_feed: Option<BlockFeedTask>,
port_forwards: Vec<std::process::Child>,
}
impl K8sCleanupGuard {
const fn new(cleanup: RunnerCleanup, block_feed: BlockFeedTask) -> Self {
const fn new(
cleanup: RunnerCleanup,
block_feed: BlockFeedTask,
port_forwards: Vec<std::process::Child>,
) -> Self {
Self {
cleanup,
block_feed: Some(block_feed),
port_forwards,
}
}
}
@ -514,6 +537,7 @@ impl CleanupGuard for K8sCleanupGuard {
if let Some(block_feed) = self.block_feed.take() {
CleanupGuard::cleanup(Box::new(block_feed));
}
kill_port_forwards(&mut self.port_forwards);
CleanupGuard::cleanup(Box::new(self.cleanup));
}
}

View File

@ -1,4 +1,9 @@
use std::time::Duration;
use std::{
net::{Ipv4Addr, TcpListener, TcpStream},
process::{Command as StdCommand, Stdio},
thread,
time::Duration,
};
use k8s_openapi::api::{apps::v1::Deployment, core::v1::Service};
use kube::{Api, Client, Error as KubeError};
@ -9,7 +14,12 @@ use tokio::time::sleep;
use crate::host::node_host;
const DEPLOYMENT_TIMEOUT: Duration = Duration::from_secs(180);
const NODE_HTTP_TIMEOUT: Duration = Duration::from_secs(240);
const NODE_HTTP_PROBE_TIMEOUT: Duration = Duration::from_secs(30);
const HTTP_POLL_INTERVAL: Duration = Duration::from_secs(1);
const PROMETHEUS_HTTP_PORT: u16 = 9090;
const PROMETHEUS_HTTP_TIMEOUT: Duration = Duration::from_secs(240);
const PROMETHEUS_HTTP_PROBE_TIMEOUT: Duration = Duration::from_secs(30);
const PROMETHEUS_SERVICE_NAME: &str = "prometheus";
#[derive(Clone, Copy)]
@ -30,6 +40,11 @@ pub struct ClusterPorts {
pub prometheus: u16,
}
pub struct ClusterReady {
pub ports: ClusterPorts,
pub port_forwards: Vec<std::process::Child>,
}
#[derive(Debug, Error)]
pub enum ClusterWaitError {
#[error("deployment {name} in namespace {namespace} did not become ready within {timeout:?}")]
@ -62,6 +77,13 @@ pub enum ClusterWaitError {
},
#[error("timeout waiting for prometheus readiness on NodePort {port}")]
PrometheusTimeout { port: u16 },
#[error("failed to start port-forward for service {service} port {port}: {source}")]
PortForward {
service: String,
port: u16,
#[source]
source: anyhow::Error,
},
}
pub async fn wait_for_deployment_ready(
@ -159,7 +181,7 @@ pub async fn wait_for_cluster_ready(
release: &str,
validator_ports: &[NodeConfigPorts],
executor_ports: &[NodeConfigPorts],
) -> Result<ClusterPorts, ClusterWaitError> {
) -> Result<ClusterReady, ClusterWaitError> {
if validator_ports.is_empty() {
return Err(ClusterWaitError::MissingValidator);
}
@ -177,11 +199,40 @@ pub async fn wait_for_cluster_ready(
});
}
let mut port_forwards = Vec::new();
let validator_api_ports: Vec<u16> = validator_allocations
.iter()
.map(|ports| ports.api)
.collect();
wait_for_node_http(&validator_api_ports, NodeRole::Validator).await?;
if wait_for_node_http_nodeport(
&validator_api_ports,
NodeRole::Validator,
NODE_HTTP_PROBE_TIMEOUT,
)
.await
.is_err()
{
// Fall back to port-forwarding when NodePorts are unreachable from the host.
validator_allocations.clear();
port_forwards = port_forward_group(
namespace,
release,
"validator",
validator_ports,
&mut validator_allocations,
)?;
let validator_api_ports: Vec<u16> = validator_allocations
.iter()
.map(|ports| ports.api)
.collect();
if let Err(err) =
wait_for_node_http_port_forward(&validator_api_ports, NodeRole::Validator).await
{
kill_port_forwards(&mut port_forwards);
return Err(err);
}
}
let mut executor_allocations = Vec::with_capacity(executor_ports.len());
for (index, ports) in executor_ports.iter().enumerate() {
@ -195,39 +246,102 @@ pub async fn wait_for_cluster_ready(
});
}
if !executor_allocations.is_empty() {
let executor_api_ports: Vec<u16> = executor_allocations.iter().map(|ports| ports.api).collect();
if !executor_allocations.is_empty()
&& wait_for_node_http_nodeport(
&executor_api_ports,
NodeRole::Executor,
NODE_HTTP_PROBE_TIMEOUT,
)
.await
.is_err()
{
executor_allocations.clear();
match port_forward_group(
namespace,
release,
"executor",
executor_ports,
&mut executor_allocations,
) {
Ok(forwards) => port_forwards.extend(forwards),
Err(err) => {
kill_port_forwards(&mut port_forwards);
return Err(err);
}
}
let executor_api_ports: Vec<u16> =
executor_allocations.iter().map(|ports| ports.api).collect();
wait_for_node_http(&executor_api_ports, NodeRole::Executor).await?;
if let Err(err) =
wait_for_node_http_port_forward(&executor_api_ports, NodeRole::Executor).await
{
kill_port_forwards(&mut port_forwards);
return Err(err);
}
}
let prometheus_port = find_node_port(
let mut prometheus_port = find_node_port(
client,
namespace,
PROMETHEUS_SERVICE_NAME,
PROMETHEUS_HTTP_PORT,
)
.await?;
wait_for_prometheus_http(prometheus_port).await?;
if wait_for_prometheus_http_nodeport(prometheus_port, PROMETHEUS_HTTP_PROBE_TIMEOUT)
.await
.is_err()
{
let (local_port, forward) =
port_forward_service(namespace, PROMETHEUS_SERVICE_NAME, PROMETHEUS_HTTP_PORT)
.map_err(|err| {
kill_port_forwards(&mut port_forwards);
err
})?;
prometheus_port = local_port;
port_forwards.push(forward);
if let Err(err) =
wait_for_prometheus_http_port_forward(prometheus_port, PROMETHEUS_HTTP_TIMEOUT).await
{
kill_port_forwards(&mut port_forwards);
return Err(err);
}
}
Ok(ClusterPorts {
validators: validator_allocations,
executors: executor_allocations,
prometheus: prometheus_port,
Ok(ClusterReady {
ports: ClusterPorts {
validators: validator_allocations,
executors: executor_allocations,
prometheus: prometheus_port,
},
port_forwards,
})
}
async fn wait_for_node_http(ports: &[u16], role: NodeRole) -> Result<(), ClusterWaitError> {
async fn wait_for_node_http_nodeport(
ports: &[u16],
role: NodeRole,
timeout: Duration,
) -> Result<(), ClusterWaitError> {
let host = node_host();
http_probe::wait_for_http_ports_with_host(
ports,
role,
&host,
Duration::from_secs(240),
Duration::from_secs(1),
)
.await
.map_err(map_http_error)
wait_for_node_http_on_host(ports, role, &host, timeout).await
}
async fn wait_for_node_http_port_forward(
ports: &[u16],
role: NodeRole,
) -> Result<(), ClusterWaitError> {
wait_for_node_http_on_host(ports, role, "127.0.0.1", NODE_HTTP_TIMEOUT).await
}
async fn wait_for_node_http_on_host(
ports: &[u16],
role: NodeRole,
host: &str,
timeout: Duration,
) -> Result<(), ClusterWaitError> {
http_probe::wait_for_http_ports_with_host(ports, role, host, timeout, HTTP_POLL_INTERVAL)
.await
.map_err(map_http_error)
}
const fn map_http_error(error: HttpReadinessError) -> ClusterWaitError {
@ -238,11 +352,30 @@ const fn map_http_error(error: HttpReadinessError) -> ClusterWaitError {
}
}
pub async fn wait_for_prometheus_http(port: u16) -> Result<(), ClusterWaitError> {
let client = reqwest::Client::new();
let url = format!("http://{}:{port}/-/ready", node_host());
pub async fn wait_for_prometheus_http_nodeport(
port: u16,
timeout: Duration,
) -> Result<(), ClusterWaitError> {
let host = node_host();
wait_for_prometheus_http(&host, port, timeout).await
}
for _ in 0..240 {
pub async fn wait_for_prometheus_http_port_forward(
port: u16,
timeout: Duration,
) -> Result<(), ClusterWaitError> {
wait_for_prometheus_http("127.0.0.1", port, timeout).await
}
pub async fn wait_for_prometheus_http(
host: &str,
port: u16,
timeout: Duration,
) -> Result<(), ClusterWaitError> {
let client = reqwest::Client::new();
let url = format!("http://{host}:{port}/-/ready");
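    // Each iteration is treated as roughly one second of the timeout budget.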
for _ in 0..timeout.as_secs() {
if let Ok(resp) = client.get(&url).send().await
&& resp.status().is_success()
{
@ -253,3 +386,101 @@ pub async fn wait_for_prometheus_http(port: u16) -> Result<(), ClusterWaitError>
Err(ClusterWaitError::PrometheusTimeout { port })
}
fn port_forward_group(
namespace: &str,
release: &str,
kind: &str,
ports: &[NodeConfigPorts],
allocations: &mut Vec<NodePortAllocation>,
) -> Result<Vec<std::process::Child>, ClusterWaitError> {
let mut forwards = Vec::new();
for (index, ports) in ports.iter().enumerate() {
let service = format!("{release}-{kind}-{index}");
let (api_port, api_forward) = match port_forward_service(namespace, &service, ports.api) {
Ok(forward) => forward,
Err(err) => {
kill_port_forwards(&mut forwards);
return Err(err);
}
};
let (testing_port, testing_forward) =
match port_forward_service(namespace, &service, ports.testing) {
Ok(forward) => forward,
Err(err) => {
kill_port_forwards(&mut forwards);
return Err(err);
}
};
allocations.push(NodePortAllocation {
api: api_port,
testing: testing_port,
});
forwards.push(api_forward);
forwards.push(testing_forward);
}
Ok(forwards)
}
fn port_forward_service(
namespace: &str,
service: &str,
remote_port: u16,
) -> Result<(u16, std::process::Child), ClusterWaitError> {
let local_port = allocate_local_port().map_err(|source| ClusterWaitError::PortForward {
service: service.to_owned(),
port: remote_port,
source,
})?;
let mut child = StdCommand::new("kubectl")
.arg("port-forward")
.arg("-n")
.arg(namespace)
.arg(format!("svc/{service}"))
.arg(format!("{local_port}:{remote_port}"))
.stdout(Stdio::null())
.stderr(Stdio::null())
.spawn()
.map_err(|source| ClusterWaitError::PortForward {
service: service.to_owned(),
port: remote_port,
source: source.into(),
})?;
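    // Give the tunnel up to ~5 seconds (20 probes at 250 ms) to accept connections.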
for _ in 0..20 {
if let Ok(Some(status)) = child.try_wait() {
return Err(ClusterWaitError::PortForward {
service: service.to_owned(),
port: remote_port,
source: anyhow::anyhow!("kubectl exited with {status}"),
});
}
if TcpStream::connect((Ipv4Addr::LOCALHOST, local_port)).is_ok() {
return Ok((local_port, child));
}
thread::sleep(Duration::from_millis(250));
}
let _ = child.kill();
Err(ClusterWaitError::PortForward {
service: service.to_owned(),
port: remote_port,
source: anyhow::anyhow!("port-forward did not become ready"),
})
}
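// Bind an ephemeral port to discover a free local port, then release it for
// kubectl to claim. A small race exists between dropping the listener and
// spawning kubectl, which is acceptable for test orchestration.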
fn allocate_local_port() -> anyhow::Result<u16> {
let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
let port = listener.local_addr()?.port();
drop(listener);
Ok(port)
}
fn kill_port_forwards(handles: &mut Vec<std::process::Child>) {
for handle in handles.iter_mut() {
let _ = handle.kill();
let _ = handle.wait();
}
handles.clear();
}

View File

@ -2,7 +2,8 @@
# check=skip=SecretsUsedInArgOrEnv
# Ignore warnings about sensitive information as this is test data.
ARG VERSION=v0.2.0
ARG VERSION=v0.3.1
ARG CIRCUITS_OVERRIDE
# ===========================
# BUILD IMAGE
@ -11,24 +12,61 @@ ARG VERSION=v0.2.0
FROM rust:1.91.0-slim-bookworm AS builder
ARG VERSION
ARG CIRCUITS_OVERRIDE
LABEL maintainer="augustinas@status.im" \
source="https://github.com/logos-co/nomos-node" \
description="Nomos testnet build image"
WORKDIR /nomos
WORKDIR /workspace
COPY . .
# Install dependencies needed for building RocksDB.
RUN apt-get update && apt-get install -yq \
git gcc g++ clang libssl-dev pkg-config ca-certificates curl
git gcc g++ clang make cmake m4 xz-utils libgmp-dev libssl-dev pkg-config ca-certificates curl wget
RUN chmod +x scripts/setup-nomos-circuits.sh && \
scripts/setup-nomos-circuits.sh "$VERSION" "/opt/circuits"
RUN mkdir -p /opt/circuits && \
select_circuits_source() { \
# Prefer an explicit override when it exists (file or directory). \
if [ -n "$CIRCUITS_OVERRIDE" ] && [ -e "/workspace/${CIRCUITS_OVERRIDE}" ]; then \
echo "/workspace/${CIRCUITS_OVERRIDE}"; \
return 0; \
fi; \
# Fall back to the workspace bundle shipped with the repo. \
if [ -e "/workspace/tests/kzgrs/kzgrs_test_params" ]; then \
echo "/workspace/tests/kzgrs/kzgrs_test_params"; \
return 0; \
fi; \
return 1; \
}; \
if CIRCUITS_PATH="$(select_circuits_source)"; then \
echo "Using prebuilt circuits bundle from ${CIRCUITS_PATH#/workspace/}"; \
if [ -d "$CIRCUITS_PATH" ]; then \
cp -R "${CIRCUITS_PATH}/." /opt/circuits; \
else \
cp "${CIRCUITS_PATH}" /opt/circuits/; \
fi; \
fi; \
if [ ! -f "/opt/circuits/pol/verification_key.json" ]; then \
echo "Local circuits missing pol artifacts; downloading ${VERSION} bundle and rebuilding"; \
chmod +x scripts/setup-nomos-circuits.sh && \
NOMOS_CIRCUITS_REBUILD_RAPIDSNARK=1 \
RAPIDSNARK_BUILD_GMP=1 \
scripts/setup-nomos-circuits.sh "$VERSION" "/opt/circuits"; \
fi
ENV NOMOS_CIRCUITS=/opt/circuits
ENV CARGO_TARGET_DIR=/workspace/target
RUN cargo build --release --all-features
# Fetch the nomos-node sources pinned in Cargo.lock and build the runtime binaries.
RUN git clone https://github.com/logos-co/nomos-node.git /workspace/nomos-node && \
cd /workspace/nomos-node && \
git fetch --depth 1 origin 2f60a0372c228968c3526c341ebc7e58bbd178dd && \
git checkout 2f60a0372c228968c3526c341ebc7e58bbd178dd && \
cargo build --release --all-features --bins
# Build cfgsync binaries from this workspace.
RUN cargo build --release --locked --manifest-path /workspace/testnet/cfgsync/Cargo.toml --bins
# ===========================
# NODE IMAGE
@ -50,11 +88,11 @@ RUN apt-get update && apt-get install -yq \
COPY --from=builder /opt/circuits /opt/circuits
COPY --from=builder /nomos/target/release/nomos-node /usr/bin/nomos-node
COPY --from=builder /nomos/target/release/nomos-executor /usr/bin/nomos-executor
COPY --from=builder /nomos/target/release/nomos-cli /usr/bin/nomos-cli
COPY --from=builder /nomos/target/release/cfgsync-server /usr/bin/cfgsync-server
COPY --from=builder /nomos/target/release/cfgsync-client /usr/bin/cfgsync-client
COPY --from=builder /workspace/target/release/nomos-node /usr/bin/nomos-node
COPY --from=builder /workspace/target/release/nomos-executor /usr/bin/nomos-executor
COPY --from=builder /workspace/target/release/nomos-cli /usr/bin/nomos-cli
COPY --from=builder /workspace/target/release/cfgsync-server /usr/bin/cfgsync-server
COPY --from=builder /workspace/target/release/cfgsync-client /usr/bin/cfgsync-client
ENV NOMOS_CIRCUITS=/opt/circuits

View File

@ -0,0 +1,38 @@
#!/bin/bash
set -euo pipefail
# Builds the testnet image with circuits. Prefers a local circuits bundle
# (tests/kzgrs/kzgrs_test_params) or a custom override; otherwise downloads
# from logos-co/nomos-circuits.
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
IMAGE_TAG="${IMAGE_TAG:-nomos-testnet:local}"
VERSION="${VERSION:-v0.3.1}"
CIRCUITS_OVERRIDE="${CIRCUITS_OVERRIDE:-tests/kzgrs/kzgrs_test_params}"
echo "Workspace root: ${ROOT_DIR}"
echo "Image tag: ${IMAGE_TAG}"
echo "Circuits override: ${CIRCUITS_OVERRIDE:-<none>}"
echo "Circuits version (fallback download): ${VERSION}"
build_args=(
-f "${ROOT_DIR}/testnet/Dockerfile"
-t "${IMAGE_TAG}"
"${ROOT_DIR}"
)
# Pass override/version args to the Docker build.
if [ -n "${CIRCUITS_OVERRIDE}" ]; then
build_args+=(--build-arg "CIRCUITS_OVERRIDE=${CIRCUITS_OVERRIDE}")
fi
build_args+=(--build-arg "VERSION=${VERSION}")
echo "Running: docker build ${build_args[*]}"
docker build "${build_args[@]}"
cat <<EOF
Build complete.
- Use this image in k8s/compose by exporting NOMOS_TESTNET_IMAGE=${IMAGE_TAG}
- Circuits source: ${CIRCUITS_OVERRIDE:-download ${VERSION}}
EOF

View File

@ -14,5 +14,9 @@ export CFG_FILE_PATH="/config.yaml" \
# persist state.
mkdir -p /recovery
/usr/bin/cfgsync-client && \
exec /usr/bin/nomos-executor /config.yaml
/usr/bin/cfgsync-client
# Align bootstrap timing with validators to keep configs consistent.
sed -i "s/prolonged_bootstrap_period: .*/prolonged_bootstrap_period: '3.000000000'/" /config.yaml
exec /usr/bin/nomos-executor /config.yaml

View File

@ -14,5 +14,9 @@ export CFG_FILE_PATH="/config.yaml" \
# persist state.
mkdir -p /recovery
/usr/bin/cfgsync-client && \
exec /usr/bin/nomos-node /config.yaml
/usr/bin/cfgsync-client
# Align bootstrap timing with executors to keep configs consistent.
sed -i "s/prolonged_bootstrap_period: .*/prolonged_bootstrap_period: '3.000000000'/" /config.yaml
exec /usr/bin/nomos-node /config.yaml

View File

@ -0,0 +1,76 @@
#!/bin/bash
#
# Setup script for nomos-circuits
#
# Usage: ./setup-nomos-circuits.sh [VERSION] [INSTALL_DIR]
# VERSION - Optional. Version to install (default: v0.3.1)
# INSTALL_DIR - Optional. Installation directory (default: $HOME/.nomos-circuits)
#
# Examples:
# ./setup-nomos-circuits.sh # Install default version to default location
# ./setup-nomos-circuits.sh v0.2.0 # Install specific version to default location
# ./setup-nomos-circuits.sh v0.2.0 /opt/circuits # Install to custom location
set -euo pipefail
VERSION="${1:-v0.3.1}"
DEFAULT_INSTALL_DIR="$HOME/.nomos-circuits"
INSTALL_DIR="${2:-$DEFAULT_INSTALL_DIR}"
REPO="logos-co/nomos-circuits"
detect_platform() {
local os=""
local arch=""
case "$(uname -s)" in
Linux*) os="linux" ;;
Darwin*) os="macos" ;;
MINGW*|MSYS*|CYGWIN*) os="windows" ;;
*) echo "Unsupported operating system: $(uname -s)" >&2; exit 1 ;;
esac
case "$(uname -m)" in
x86_64) arch="x86_64" ;;
aarch64|arm64) arch="aarch64" ;;
*) echo "Unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac
echo "${os}-${arch}"
}
download_release() {
local platform="$1"
local artifact="nomos-circuits-${VERSION}-${platform}.tar.gz"
local url="https://github.com/${REPO}/releases/download/${VERSION}/${artifact}"
local temp_dir
temp_dir=$(mktemp -d)
echo "Downloading nomos-circuits ${VERSION} for ${platform}..."
if [ -n "${GITHUB_TOKEN:-}" ]; then
auth_header="Authorization: Bearer ${GITHUB_TOKEN}"
else
auth_header=""
fi
if ! curl -L ${auth_header:+-H "$auth_header"} -o "${temp_dir}/${artifact}" "${url}"; then
echo "Failed to download release artifact from ${url}" >&2
rm -rf "${temp_dir}"
exit 1
fi
echo "Extracting to ${INSTALL_DIR}..."
rm -rf "${INSTALL_DIR}"
mkdir -p "${INSTALL_DIR}"
if ! tar -xzf "${temp_dir}/${artifact}" -C "${INSTALL_DIR}" --strip-components=1; then
echo "Failed to extract ${artifact}" >&2
rm -rf "${temp_dir}"
exit 1
fi
rm -rf "${temp_dir}"
}
platform=$(detect_platform)
echo "Setting up nomos-circuits ${VERSION} for ${platform}"
echo "Installing to ${INSTALL_DIR}"
download_release "${platform}"
echo "Installation complete. Circuits installed at: ${INSTALL_DIR}"
echo "If using a custom directory, set NOMOS_CIRCUITS=${INSTALL_DIR}"