Topology & Chaos Patterns

This page focuses on cluster manipulation: node control, chaos patterns, and what the tooling supports today.

Node control availability

  • Supported: the compose runner exposes restart and peer block/unblock control via NodeControlHandle (see the sketch below).
  • Not supported: the local runner does not expose node control, and the k8s runner does not support it yet.
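
The sketch below shows, roughly, how node control could be driven from a test. It is a minimal illustration in Rust (using the tokio and anyhow crates); the NodeControl trait and its methods are assumptions standing in for whatever NodeControlHandle actually exposes, not the real API.

    // Illustrative only: this trait approximates what a node-control handle
    // might offer; the real NodeControlHandle surface may differ.
    use std::time::Duration;

    trait NodeControl {
        async fn restart(&self) -> anyhow::Result<()>;
        async fn block_peer(&self, peer: &str) -> anyhow::Result<()>;
        async fn unblock_peer(&self, peer: &str) -> anyhow::Result<()>;
    }

    /// Restart a node, then cut it off from one peer for a window before healing.
    async fn isolate_then_heal<C: NodeControl>(node: &C, peer: &str) -> anyhow::Result<()> {
        node.restart().await?;
        node.block_peer(peer).await?;
        tokio::time::sleep(Duration::from_secs(30)).await; // isolation window
        node.unblock_peer(peer).await?;
        Ok(())
    }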

Chaos patterns to consider

  • Restarts: random restarts with a minimum delay/cooldown between faults to exercise recovery (see the restart sketch after this list).
  • Partitions: block/unblock peers to simulate partial isolation, then assert height convergence after healing.
  • Validator churn: stop one validator and start another (new key) mid-run to test membership changes; expect convergence.
  • Load SLOs: push tx/DA rates and assert inclusion/availability budgets instead of only liveness.
  • API probes: poll HTTP/RPC endpoints during chaos to confirm external contracts stay healthy in both response shape and latency (see the probe sketch after this list).
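
For the restart pattern, a driver loop along these lines (reusing the hypothetical NodeControl trait from the sketch above, plus the rand, tokio, and anyhow crates) picks a random node, restarts it, and enforces a minimum delay with jitter before the next fault:

    use std::time::Duration;
    use rand::Rng;

    /// Randomly restart nodes, enforcing a minimum cooldown between faults.
    async fn random_restarts<C: NodeControl>(
        nodes: &[C],
        rounds: usize,
        min_delay: Duration,
    ) -> anyhow::Result<()> {
        for _ in 0..rounds {
            // Pick a target and jitter up front so the RNG is not held across awaits.
            let (idx, jitter_ms) = {
                let mut rng = rand::thread_rng();
                (rng.gen_range(0..nodes.len()), rng.gen_range(0u64..=5_000))
            };
            nodes[idx].restart().await?;
            // Wait at least `min_delay`, plus jitter, so the restarted node can rejoin.
            tokio::time::sleep(min_delay + Duration::from_millis(jitter_ms)).await;
        }
        Ok(())
    }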
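
For the API probe pattern, a poller like the one below (reqwest, tokio, anyhow; the latency budget and one-second cadence are placeholders) keeps hitting an endpoint during the chaos window and fails fast on errors, non-success statuses, or responses slower than the budget:

    use std::time::{Duration, Instant};

    /// Poll an HTTP endpoint during a chaos window; fail on request errors,
    /// non-success statuses, or responses slower than `budget`.
    async fn probe_api(url: &str, budget: Duration, attempts: usize) -> anyhow::Result<()> {
        let client = reqwest::Client::new();
        for _ in 0..attempts {
            let start = Instant::now();
            let resp = client.get(url).send().await?;
            anyhow::ensure!(resp.status().is_success(), "probe got {}", resp.status());
            anyhow::ensure!(start.elapsed() <= budget, "probe exceeded latency budget");
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
        Ok(())
    }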

Expectations to pair

  • Liveness/height convergence after chaos windows.
  • SLO checks: inclusion latency, DA responsiveness, API latency/shape.
  • Recovery checks: ensure nodes that were isolated or restarted catch up to the cluster height within a timeout (see the convergence sketch below).
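
The convergence check can be expressed independently of the runner. In the sketch below, fetch_heights is a hypothetical caller-supplied probe that reads every node's current height; the helper polls until all heights are within a tolerance of each other or the timeout expires.

    use std::future::Future;
    use std::time::Duration;

    /// Wait until all nodes report heights within `tolerance` of each other,
    /// or fail once `timeout` elapses.
    async fn wait_for_convergence<F, Fut>(
        mut fetch_heights: F,
        tolerance: u64,
        timeout: Duration,
    ) -> anyhow::Result<()>
    where
        F: FnMut() -> Fut,
        Fut: Future<Output = anyhow::Result<Vec<u64>>>,
    {
        let deadline = tokio::time::Instant::now() + timeout;
        loop {
            let heights = fetch_heights().await?;
            if let (Some(min), Some(max)) = (heights.iter().min(), heights.iter().max()) {
                if max - min <= tolerance {
                    return Ok(()); // every node is within `tolerance` of the tip
                }
            }
            anyhow::ensure!(
                tokio::time::Instant::now() < deadline,
                "cluster did not converge within {timeout:?}"
            );
            tokio::time::sleep(Duration::from_secs(2)).await;
        }
    }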

Guidance

  • Keep chaos realistic: avoid flapping and other failure patterns you would not realistically see in production.
  • Scope chaos: choose validators vs executors intentionally; don't restart all nodes at once unless you're testing full outages.
  • Combine chaos with observability: capture block feed/metrics and API health so failures are diagnosable.