Topology & Chaos Patterns

This page focuses on cluster manipulation: node control, chaos patterns, and what the tooling supports today.

Node control availability

  • Supported: the compose runner exposes restart and peer block/unblock control via NodeControlHandle (see the sketch below).
  • Not supported: the local runner does not expose node control, and the k8s runner does not support it yet.
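
The sketch below shows, roughly, how node control could be driven from a test. It is a minimal illustration in Rust (using the tokio and anyhow crates); the NodeControl trait and its methods are assumptions standing in for whatever NodeControlHandle actually exposes, not the real API.

    // Illustrative only: this trait approximates what a node-control handle
    // might offer; the real NodeControlHandle surface may differ.
    use std::time::Duration;

    trait NodeControl {
        async fn restart(&self) -> anyhow::Result<()>;
        async fn block_peer(&self, peer: &str) -> anyhow::Result<()>;
        async fn unblock_peer(&self, peer: &str) -> anyhow::Result<()>;
    }

    /// Restart a node, then cut it off from one peer for a window before healing.
    async fn isolate_then_heal<C: NodeControl>(node: &C, peer: &str) -> anyhow::Result<()> {
        node.restart().await?;
        node.block_peer(peer).await?;
        tokio::time::sleep(Duration::from_secs(30)).await; // isolation window
        node.unblock_peer(peer).await?;
        Ok(())
    }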

Chaos patterns to consider

  • Restarts: random restarts with a minimum delay/cooldown between faults to exercise recovery (see the restart sketch after this list).
  • Partitions: block/unblock peers to simulate partial isolation, then assert height convergence after healing.
  • Validator churn: stop one validator and start another (new key) mid-run to test membership changes; expect convergence.
  • Load SLOs: push tx/DA rates and assert inclusion/availability budgets instead of only liveness.
  • API probes: poll HTTP/RPC endpoints during chaos to confirm external contracts stay healthy in both response shape and latency (see the probe sketch after this list).
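
For the restart pattern, a driver loop along these lines (reusing the hypothetical NodeControl trait from the sketch above, plus the rand, tokio, and anyhow crates) picks a random node, restarts it, and enforces a minimum delay with jitter before the next fault:

    use std::time::Duration;
    use rand::Rng;

    /// Randomly restart nodes, enforcing a minimum cooldown between faults.
    async fn random_restarts<C: NodeControl>(
        nodes: &[C],
        rounds: usize,
        min_delay: Duration,
    ) -> anyhow::Result<()> {
        for _ in 0..rounds {
            // Pick a target and jitter up front so the RNG is not held across awaits.
            let (idx, jitter_ms) = {
                let mut rng = rand::thread_rng();
                (rng.gen_range(0..nodes.len()), rng.gen_range(0u64..=5_000))
            };
            nodes[idx].restart().await?;
            // Wait at least `min_delay`, plus jitter, so the restarted node can rejoin.
            tokio::time::sleep(min_delay + Duration::from_millis(jitter_ms)).await;
        }
        Ok(())
    }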
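
For the API probe pattern, a poller like the one below (reqwest, tokio, anyhow; the latency budget and one-second cadence are placeholders) keeps hitting an endpoint during the chaos window and fails fast on errors, non-success statuses, or responses slower than the budget:

    use std::time::{Duration, Instant};

    /// Poll an HTTP endpoint during a chaos window; fail on request errors,
    /// non-success statuses, or responses slower than `budget`.
    async fn probe_api(url: &str, budget: Duration, attempts: usize) -> anyhow::Result<()> {
        let client = reqwest::Client::new();
        for _ in 0..attempts {
            let start = Instant::now();
            let resp = client.get(url).send().await?;
            anyhow::ensure!(resp.status().is_success(), "probe got {}", resp.status());
            anyhow::ensure!(start.elapsed() <= budget, "probe exceeded latency budget");
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
        Ok(())
    }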

Expectations to pair

  • Liveness/height convergence after chaos windows.
  • SLO checks: inclusion latency, DA responsiveness, API latency/shape.
  • Recovery checks: ensure nodes that were isolated or restarted catch up to the cluster height within a timeout (see the convergence sketch below).
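
The convergence check can be expressed independently of the runner. In the sketch below, fetch_heights is a hypothetical caller-supplied probe that reads every node's current height; the helper polls until all heights are within a tolerance of each other or the timeout expires.

    use std::future::Future;
    use std::time::Duration;

    /// Wait until all nodes report heights within `tolerance` of each other,
    /// or fail once `timeout` elapses.
    async fn wait_for_convergence<F, Fut>(
        mut fetch_heights: F,
        tolerance: u64,
        timeout: Duration,
    ) -> anyhow::Result<()>
    where
        F: FnMut() -> Fut,
        Fut: Future<Output = anyhow::Result<Vec<u64>>>,
    {
        let deadline = tokio::time::Instant::now() + timeout;
        loop {
            let heights = fetch_heights().await?;
            if let (Some(min), Some(max)) = (heights.iter().min(), heights.iter().max()) {
                if max - min <= tolerance {
                    return Ok(()); // every node is within `tolerance` of the tip
                }
            }
            anyhow::ensure!(
                tokio::time::Instant::now() < deadline,
                "cluster did not converge within {timeout:?}"
            );
            tokio::time::sleep(Duration::from_secs(2)).await;
        }
    }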

Guidance

  • Keep chaos realistic: avoid flapping and other failure patterns you would not realistically see in production.
  • Scope chaos: choose validators vs executors intentionally; don't restart all nodes at once unless you're testing full outages.
  • Combine chaos with observability: capture block feed/metrics and API health so failures are diagnosable.