FFI benchmarks
This directory holds two Nim benchmarks, one micro and one stress. Neither is part of nimble test.
- bench_codec.nim: cbor vs c (cwire) wire-format codec microbenchmark (documented below). Pure measurement, not a gate.
- bench_ffi_submit.nim: concurrent-submit stress test + throughput benchmark for sendRequestToFFIThread (documented next). Carries a scaling gate that fails CI until the per-request submit lock is replaced.
sendRequestToFFIThread concurrent-submit stress / throughput
bench_ffi_submit.nim motivates issue #90: every foreign-thread call serialises the whole trySend + reqSignal.fireSync + reqReceivedSignal.waitSync cycle under a single ctx.lock. The lock is load-bearing because reqChannel is single-slot and the accept handshake waits on a shared reqReceivedSignal, so producers cannot overlap.
The bench fans K producer threads (1 → 8) at one context, each firing the same per-thread volume of no-op requests. It times the submit phase only — from the start gate until every producer returns from its last sendRequestToFFIThread — because that is the path the fix parallelises; completion is bounded by the single FFI thread and deliberately excluded. Each thread count runs FFI_SUBMIT_ITERS times (default 5) and the median submit/sec is reported, so run-to-run noise can't move the verdict.
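The median reporting described above can be sketched as follows. This is a minimal illustration, not the bench's actual code: `median` and `submitsPerSec` are hypothetical helper names, and the sample rates are made-up values chosen to show one noisy outlier being ignored.

```nim
import std/algorithm

proc median(samples: openArray[float]): float =
  ## Median of a non-empty sample set; averages the two middle
  ## values when the sample count is even.
  doAssert samples.len > 0, "need at least one sample"
  let s = sorted(samples)
  let mid = s.len div 2
  if s.len mod 2 == 1:
    result = s[mid]
  else:
    result = (s[mid - 1] + s[mid]) / 2.0

proc submitsPerSec(totalSubmits: int, elapsedSec: float): float =
  ## Submit-phase throughput for one iteration (hypothetical helper).
  totalSubmits.float / elapsedSec

when isMainModule:
  # Five illustrative iterations; the third is a noisy outlier
  # that the median discards.
  let rates = [1.00e6, 1.02e6, 0.40e6, 1.01e6, 0.99e6]
  echo "median submit/sec: ", median(rates)
```

A mean over the same samples would be dragged down by the outlier, which is why a median is the more stable input for a pass/fail verdict.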
It is also a correctness stress test: the aggregate callback count must match the submit count exactly (no drops or double-fires), with zero submit errors and (under asan/lsan/tsan) zero leaks or races.
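The exact-count invariant can be illustrated with a small self-contained sketch. It does not call the real FFI layer: `onDone` stands in for the completion callback, `producer` for a submitting thread, and the constants are arbitrary. It assumes a build with thread support enabled (the default in Nim 2.x; `--threads:on` on older compilers).

```nim
import std/atomics

const
  Producers = 4      # hypothetical producer-thread count
  PerThread = 1000   # hypothetical per-thread volume

var callbackCount: Atomic[int]

proc onDone() =
  ## Stand-in for the completion callback: one atomic increment per fire.
  discard callbackCount.fetchAdd(1)

proc producer(n: int) {.thread.} =
  ## Stand-in for a producer: each "submit" completes exactly once.
  for _ in 0 ..< n:
    onDone()

proc runStress(): int =
  ## Runs all producers to completion and returns the aggregate callback count.
  callbackCount.store(0)
  var threads: array[Producers, Thread[int]]
  for t in threads.mitems:
    createThread(t, producer, PerThread)
  joinThreads(threads)
  result = callbackCount.load()

when isMainModule:
  let submitted = Producers * PerThread
  let fired = runStress()
  # A drop makes fired < submitted; a double-fire makes fired > submitted.
  doAssert fired == submitted, "callback count mismatch"
  echo "callbacks fired: ", fired
```

Comparing for equality rather than "at least" is what lets a single check catch both drops and double-fires.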
nimble bench_ffi_submit
# smaller / faster (handy under sanitizers — they distort timing, so disable the gate):
FFI_SUBMIT_PER_THREAD=2000 FFI_SUBMIT_ITERS=1 FFI_SCALING_GATE=0 nimble bench_ffi_submit
# under a sanitizer (proves no leaks/races; gate off — see below):
NIM_FFI_SAN=tsan FFI_SUBMIT_PER_THREAD=2000 FFI_SCALING_GATE=0 nimble bench_ffi_submit
Env knobs: FFI_SUBMIT_PER_THREAD (volume per producer, default 20000), FFI_SUBMIT_ITERS (median sample count, default 5), FFI_SCALING_GATE (default 1; set 0 to report numbers without failing).
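One way such knobs could be read is sketched below. The knob names and defaults come from the list above; the `envInt` helper and its fall-back-on-invalid-input behaviour are assumptions, not necessarily what the bench does.

```nim
import std/[os, strutils]

proc envInt(name: string, default: int): int =
  ## Reads an integer knob from the environment. Falls back to `default`
  ## when the variable is unset, empty, or not a valid integer
  ## (the fallback policy is an assumption of this sketch).
  let raw = getEnv(name)
  if raw.len == 0:
    return default
  try:
    result = parseInt(raw.strip())
  except ValueError:
    result = default

when isMainModule:
  let perThread = envInt("FFI_SUBMIT_PER_THREAD", 20000)
  let iters = envInt("FFI_SUBMIT_ITERS", 5)
  let gateOn = envInt("FFI_SCALING_GATE", 1) != 0
  echo "per-thread volume: ", perThread
  echo "iterations: ", iters
  echo "scaling gate enabled: ", gateOn
```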
Scaling gate — red until the lock is replaced
By default the bench fails (non-zero exit) unless submit throughput at 8 threads is at least 1.5x the 1-thread rate. This is a forcing function: it cannot pass while sendRequestToFFIThread holds ctx.lock across the synchronous reqReceivedSignal accept, because that serialises every submit no matter how many producers run.
Baseline measured 2026-06-24 (16-core Linux, orc, -d:danger, median of 5): submit scaling held at 0.98–1.16x across thread counts, which is flat, as the lock dictates. The 1.5x threshold sits above that noise ceiling (so the lock-bound code fails reliably) and well below the >=2x that parallel lock-free MPSC ingress yields on any multicore host (so the fix clears it with margin). Once the fix lands and the gate turns green, keep the gate as a regression guard.
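The gate's decision rule reduces to a single ratio comparison. The sketch below is illustrative only: `gateVerdict` is a hypothetical name, and the rates passed in are made-up figures shaped like the baseline (1.16x) and the expected post-fix (2x) cases.

```nim
proc gateVerdict(rate1, rate8: float; minRatio = 1.5;
                 gateEnabled = true): int =
  ## Returns a process exit code: 0 when the gate passes or is disabled,
  ## 1 when 8-thread throughput is below `minRatio` times the 1-thread rate.
  if not gateEnabled:
    return 0
  if rate1 <= 0.0:
    return 1  # no usable baseline: treat as failure
  result = if rate8 / rate1 >= minRatio: 0 else: 1

when isMainModule:
  # Lock-bound shape: 1.16x scaling stays under the 1.5x threshold.
  echo "lock-bound verdict: ", gateVerdict(1.0e6, 1.16e6)
  # Parallel-ingress shape: 2x scaling clears the threshold.
  echo "parallel verdict: ", gateVerdict(1.0e6, 2.0e6)
  # Same lock-bound numbers with the gate switched off.
  echo "gate-off verdict: ", gateVerdict(1.0e6, 1.16e6, gateEnabled = false)
```

A driver would pass the result to `quit` so that CI sees the non-zero exit.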
The gate runs in the non-sanitized Submit Scaling Gate CI job (.github/workflows/ci.yml); the sanitized jobs run the same bench with FFI_SCALING_GATE=0 for leak/race coverage only, since sanitizer instrumentation makes throughput scaling meaningless.
FFI wire-format codec benchmark
bench_codec.nim is a single-process Nim microbenchmark comparing the two FFI
wire-format codecs head-to-head on identical payloads:
- cbor: cborEncode/cborDecode, self-describing bytes over seq[byte]. The codec the cbor ABI uses on every boundary crossing.
- c (cwire): cwirePack/cwireUnpack/cwireFree, flat C-struct shared-memory packing. The codec the c ABI uses, emitted for every {.ffi: "abi = c".} type as its <T>_CWire companion.
Both paths run in the same process on the same values, so the numbers isolate codec cost only — no thread hop, no callback dispatch, no chronos work. The full FFI round-trip (thread channel + callback) is identical for both ABIs, so the codec is where the ABI difference actually lives.
Running
nimble bench_codec
# or directly, with the size sweep extended to 1 MiB:
nim c -r --mm:orc -d:danger tests/bench/bench_codec.nim --include-1mib
Build with -d:danger (the nimble task does) so the figures reflect optimized
codegen rather than a debug build.
Payload shapes covered
| Type | Shape |
|---|---|
| EchoRequest | 1 string + 1 int (small struct) |
| EchoResponse | 2 strings |
| ComplexRequest | seq[EchoRequest], seq[string], Option[...] |
| BytesPayload | seq[byte], swept 100 B → 150 KiB (--include-1mib adds 1 MiB) |
Interpreting
Small structs are dominated by CBOR's per-field tag/length framing, so cwire
wins by a large factor. As payloads grow into big seq[byte] blobs, both codecs
become memcpy-bound and the ratio converges toward ~1×. The byte-blob sweep
also reports throughput (MiB/s) for each codec.
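For reference, a MiB/s figure of this kind can be derived from bytes processed and elapsed time, roughly as sketched here. The real codecs are not called: `copyBlob` is a hypothetical stand-in that only copies the buffer, and the iteration count is arbitrary.

```nim
import std/[monotimes, times]

proc mibPerSec(totalBytes: int, elapsedNs: int64): float =
  ## Throughput in MiB/s (1 MiB = 1024 * 1024 bytes).
  let mib = totalBytes.float / (1024.0 * 1024.0)
  let sec = elapsedNs.float / 1e9
  result = mib / sec

proc copyBlob(src: seq[byte]): seq[byte] =
  ## Stand-in for a codec round-trip: a plain copy, i.e. the memcpy-bound
  ## floor that large payloads converge toward.
  result = src

when isMainModule:
  let payload = newSeq[byte](150 * 1024)  # 150 KiB, the top of the default sweep
  const iters = 1000
  var processed = 0
  let t0 = getMonoTime()
  for _ in 0 ..< iters:
    processed += copyBlob(payload).len
  let elapsedNs = (getMonoTime() - t0).inNanoseconds
  echo mibPerSec(processed, elapsedNs), " MiB/s"
```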
Note: this benchmark exercises the c ABI codec (the cwire companions on the Nim side). Wiring the c ABI through the full proc-dispatch path and the foreign (C++/Rust) generators is tracked separately; only cbor currently generates working end-to-end bindings.