* try parallel reduction in batch add; alas, it's slower than custom chunking, except maybe on architectures with performance/efficiency cores
* initial impl of parallel MSM; scaling still to debug, threads are not woken up fast enough
* improve comment [skip ci]
* skip top window when c divides the number of bits
* for some reason, parallel-for loops scale on 5+ threads while spawn scales only on 2x threads. Thread wakeup issue?
* Add counters and timers to audit threadpool bottlenecks
* metrics and profiling fixes, (slower) latency hiding, activate tests
* fix thief thread trying to wake another before canceling its own sleep
* make metrics easier to sort, add parallel endomorphism application
* selective endomorphism acceleration
* some tuning
* spawn can now handle compile-time literals, static parameters and type parameters; also introduce spawnAwaitable to await void procs
* improve MSM overview [skip ci]
* bench cleanup