diff --git a/book/src/troubleshooting.md b/book/src/troubleshooting.md
index f07a466..cd4480c 100644
--- a/book/src/troubleshooting.md
+++ b/book/src/troubleshooting.md
@@ -202,48 +202,67 @@ Run a minimal baseline test (e.g., 2 validators, consensus liveness only). If it
 ## Common Error Messages
 
 ### "Consensus liveness expectation failed"
-- **Cause**: Not enough blocks produced during run window, missing `POL_PROOF_DEV_MODE=true` (causes slow proof generation), or missing KZG assets for DA workloads
-- **Fix**:
-  1. Verify `POL_PROOF_DEV_MODE=true` is set (REQUIRED for all runners)
-  2. Verify KZG assets exist at `testing-framework/assets/stack/kzgrs_test_params/` (for DA workloads)
-  3. Extend `with_run_duration()` to allow more blocks
-  4. Check node logs for proof generation or DA errors
-  5. Reduce transaction/DA rate if nodes are overwhelmed
+
+- **Cause**: Not enough blocks produced during the run window, missing
+  `POL_PROOF_DEV_MODE=true` (causes slow proof generation), or missing KZG
+  assets for DA workloads.
+- **Fix**:
+  1. Verify `POL_PROOF_DEV_MODE=true` is set (REQUIRED for all runners).
+  2. Verify KZG assets exist at
+     `testing-framework/assets/stack/kzgrs_test_params/` (for DA workloads).
+  3. Extend `with_run_duration()` to allow more blocks.
+  4. Check node logs for proof generation or DA errors.
+  5. Reduce transaction/DA rate if nodes are overwhelmed.
 
 ### "Wallet seeding failed"
-- **Cause**: Topology doesn't have enough funded wallets for the workload
-- **Fix**: Increase `.wallets(N)` count or reduce `.users(M)` in transaction workload (ensure N ≥ M)
+
+- **Cause**: Topology doesn't have enough funded wallets for the workload.
+- **Fix**: Increase `.wallets(N)` count or reduce `.users(M)` in the transaction
+  workload (ensure N ≥ M).
 
 ### "Node control not available"
-- **Cause**: Runner doesn't support node control (only ComposeDeployer does), or `enable_node_control()` wasn't called
-- **Fix**:
-  1. Use ComposeDeployer for chaos tests (LocalDeployer and K8sDeployer don't support node control)
-  2. Ensure `.enable_node_control()` is called in scenario before `.chaos()`
+
+- **Cause**: Runner doesn't support node control (only ComposeDeployer does), or
+  `enable_node_control()` wasn't called.
+- **Fix**:
+  1. Use ComposeDeployer for chaos tests (LocalDeployer and K8sDeployer don't
+     support node control).
+  2. Ensure `.enable_node_control()` is called in the scenario before `.chaos()`.
 
 ### "Readiness timeout"
-- **Cause**: Nodes didn't become responsive within expected time (often due to missing prerequisites)
-- **Fix**:
-  1. **Verify `POL_PROOF_DEV_MODE=true` is set** (REQUIRED for all runners—without it, proof generation is too slow)
-  2. Check node logs for startup errors (port conflicts, missing assets)
-  3. Verify network connectivity between nodes
-  4. For DA workloads, ensure KZG circuit assets are present
+
+- **Cause**: Nodes didn't become responsive within expected time (often due to
+  missing prerequisites).
+- **Fix**:
+  1. **Verify `POL_PROOF_DEV_MODE=true` is set** (REQUIRED for all runners—without
+     it, proof generation is too slow).
+  2. Check node logs for startup errors (port conflicts, missing assets).
+  3. Verify network connectivity between nodes.
+  4. For DA workloads, ensure KZG circuit assets are present.
 
 ### "Port already in use"
-- **Cause**: Previous test didn't clean up, or another process holds the port
-- **Fix**: Kill orphaned processes (`pkill nomos-node`), wait for Docker cleanup (`docker compose down`), or restart Docker
+
+- **Cause**: Previous test didn't clean up, or another process holds the port.
+- **Fix**: Kill orphaned processes (`pkill nomos-node`), wait for Docker cleanup
+  (`docker compose down`), or restart Docker.
 
 ### "Image not found: nomos-testnet:local"
-- **Cause**: Docker image not built for Compose/K8s runners, or KZG assets not baked into image
-- **Fix**:
-  1. Fetch KZG assets: `scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits`
-  2. Copy to assets: `cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/`
-  3. Build image: `testing-framework/assets/stack/scripts/build_test_image.sh`
+
+- **Cause**: Docker image not built for Compose/K8s runners, or KZG assets not
+  baked into the image.
+- **Fix**:
+  1. Fetch KZG assets: `scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits`.
+  2. Copy to assets:
+     `cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/`.
+  3. Build image: `testing-framework/assets/stack/scripts/build_test_image.sh`.
 
 ### "Failed to load KZG parameters" or "Circuit file not found"
-- **Cause**: DA workload requires KZG circuit assets that aren't present
-- **Fix**:
-  1. Fetch assets: `scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits`
-  2. Copy to expected path: `cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/`
-  3. For Compose/K8s: rebuild image with assets baked in
+
+- **Cause**: DA workload requires KZG circuit assets that aren't present.
+- **Fix**:
+  1. Fetch assets: `scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits`.
+  2. Copy to expected path:
+     `cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/`.
+  3. For Compose/K8s: rebuild image with assets baked in.
 
 For detailed logging configuration and observability setup, see [Operations](operations.md).
diff --git a/book/src/usage-patterns.md b/book/src/usage-patterns.md
index d95ba77..76dddba 100644
--- a/book/src/usage-patterns.md
+++ b/book/src/usage-patterns.md
@@ -1,7 +1,16 @@
 # Usage Patterns
 
-- **Shape a topology, pick a runner**: choose local for quick iteration, compose for reproducible multi-node stacks with observability, or k8s for cluster-grade validation.
-- **Compose workloads deliberately**: pair transactions and data-availability traffic for end-to-end coverage; add chaos only when assessing recovery and resilience.
-- **Align expectations with goals**: use liveness-style checks to confirm the system keeps up with planned activity, and add workload-specific assertions for inclusion or availability.
-- **Reuse plans across environments**: keep the scenario constant while swapping runners to compare behavior between developer machines and CI clusters.
-- **Iterate with clear signals**: treat expectation outcomes as the primary pass/fail indicator, and adjust topology or workloads based on what those signals reveal.
+- **Shape a topology, pick a runner**: choose local for quick iteration, compose
+  for reproducible multi-node stacks with observability, or k8s for cluster-grade
+  validation.
+- **Compose workloads deliberately**: pair transactions and data-availability
+  traffic for end-to-end coverage; add chaos only when assessing recovery and
+  resilience.
+- **Align expectations with goals**: use liveness-style checks to confirm the
+  system keeps up with planned activity, and add workload-specific assertions for
+  inclusion or availability.
+- **Reuse plans across environments**: keep the scenario constant while swapping
+  runners to compare behavior between developer machines and CI clusters.
+- **Iterate with clear signals**: treat expectation outcomes as the primary
+  pass/fail indicator, and adjust topology or workloads based on what those
+  signals reveal.
diff --git a/book/src/workloads.md b/book/src/workloads.md
index 145df58..99f1414 100644
--- a/book/src/workloads.md
+++ b/book/src/workloads.md
@@ -5,6 +5,7 @@ signals that must hold when that activity completes. Both are pluggable so
 scenarios stay readable and purpose-driven.
 
 ## Workloads
+
 - **Transaction workload**: submits user-level transactions at a configurable
   rate and can limit how many distinct actors participate.
 - **Data-availability workload**: drives blob and channel activity to exercise
@@ -13,6 +14,7 @@ scenarios stay readable and purpose-driven.
   recovery behaviors (requires a runner that can control nodes).
 
 ## Expectations
+
 - **Consensus liveness**: verifies the system continues to produce blocks in line
   with the planned workload and timing window.
 - **Workload-specific checks**: each workload can attach its own success