* Pasta bench
* cleanup env variables
* [MSM]: generate benchmark coef-points pairs in parallel
* try to fix windows Ci
* add diagnostic info
* fix old test for new codecs/io primitives
* Ensure the projective point at infinity is not all zeros, but (0, 1, 0)
* move tests
* move threadpool to root path
* fix hints and warnings, print nim versions for tests for debugging the new strange issue in CI
* print nim version
* mixup on branches
* mixup on branches reloaded
* rework assembler register/mem and constraint declarations
* Introduce constraint UnmutatedPointerToWriteMem
* Create invidual memory cell operands
* [Assembly] fully support indirect memory addressing
* fix calling convention for exported procs
* Prepare for switch to intel syntax to avoid clang constant propagation asm symbol name interfering OR pointer+offset addressing
* use modifiers to prevent bad string mixin fo assembler to linker of propagated consts
* Assembly: switch to intel syntax
* with working memory operand - now works with LTO on both GCC and clang and constant folding
* use memory operand in more places
* remove some inline now that we have lto
* cleanup compiler config and benches
* tracer shouldn't force dependencies when unused
* fix cc on linux
* nimble fixes
* update README [skip CI]
* update MacOS CI with Homebrew Clang
* oops nimble bindings disappeared
* more nimble fixes
* fix sha256 exported symbol
* improve constraints on modular addition
* Add extra constraint to force reloading of pointer in reg inputs
* Fix LLVM gold linker running out of registers
* workaround MinGW64 GCC 12.2 bad codegen in t_pairing_cyclotomic_subgroup with LTO
* [testsuite] Rework parallel test runner to buffer beyond 65536 chars and properly wait for process exit
* [testsuite] improve error reporting
* rework openArray[byte/char] for BLS signature C API
* Prepare for optimized library and bindings
* properly link to constantine
* Compiler fixes, global sanitizers, GCC bug with --opt:size
* workaround/fix #229: don't inline field reduction in Fp2
* fix clang running out of registers with LTO
* [C API] missed length parameters for ctt_eth_bls_fast_aggregate_verify
* double-precision asm is too large for inlining, try to fix Linux and MacOS woes at https://github.com/mratsim/constantine/pull/228#issuecomment-1512773460
* Use FORTIFY_SOURCE for testing
* Fix#230 - gcc miscompiles Fp6 mul with LTO
* disable LTO for now, PR is too long
* try parallel reduction in batch add, but alas it's slower than custom chunking. Except maybe on arch with performance/efficiency cores
* initial impl of parallel MSM - scaling to debug, threads not woken fast enough
* improve comment [skip ci]
* skip top window when c divides the number of bits
* for some reason parallel-for loops scale on 5+ threads while spawn only on 2x threads. Thread wakeup issue?
* Add counters and timers to audit threadpool bottlenecks
* metrics and profiling fixes, (slower) latency hiding, activate tests
* fix thief thread trying to wake another before canceling its own sleep
* easier to sort metrics and parallel endomorphism application
* selective endomorphism acceleration
* some tuning
* spawn can handle compile-time literals, static and type parameters. Also introduce spawnAwaitable to await void procs
* improve MSM overview [skip ci]
* bench cleanup
* Threadpool: eventcount didn't put threads to actual sleep :/
* rework task awaiter sleep to prevent use-after-free race condition after task completion
* Need memory fence for StoreLoad synchronization ordering
* update design doc
* set memory order in sleep of eventcount
* cleanup debug logs
* comment cleanup [skip ci]
* remove reserve threads
* recover last perf diff: 1. don't import primitives, cpu features detection globals are noticeable, 2. noinit + conditional zeroMem are unnecessary when sync is inline 3. inline 'newSpawn' and don't init the loop part
* avoid syscalls if possible if thred is awake but idle
* renaming eventLoop
* remove unused code: steal-half
* renaming
* no need for 0-init sync, T can be large in cryptography
* introduce reserve threads to minimize latency and maximize throughput when awaiting a future
* introduce a ceilDiv proc
* threadpool: implement parallel-for loops
* 10x perf improvement by not waking reserveBackoff on syncAll
* bench overhead: new reserve system might introduce too much wakeup latency, 2x slower, for fine-grained parallelism
* add parallelForStrided
* Threadpool: Implement parallel reductions
* refactor parallel loop codegen: introduce descriptor, parsing and codegen stages
* parallel strided, test transpose bench
* tight loop is faster when backoff is not inline
* no POSIX stuff on windows, larger types for histogram bench
* fix tests
* max RSS overflow?
* missed an undefined var
* exit histogram on 32-bit
* forgot to return early dor 32-bit
* unoptimized msm
* MSM: reorder loops
* add a signed windowed recoding technique
* improve wNAF table access
* use batchAffine
* revamp EC tests
* MSM signed digit support
* refactor MSM: recode signed ahead of time
* missing test vector
* refactor allocs and Alloca sideeffect
* add an endomorphism threshold
* Add Jacobian extended coordinates
* refactor recodings, prepare for parallelizable on-the-fly signed recoding
* recoding changes, introduce proper NAF for pairings
* more pairings refactoring, introduce miller accumulator for EVM
* some optim to the addchain miller loop
* start optimizing multi-pairing
* finish multi-miller loop refactoring
* minor tuning
* MSM: signed encoding suitable for parallelism (no precompute)
* cleanup signed window encoding
* add prefetching
* add metering
* properly init result to infinity
* comment on prefetching
* introduce vartime inversion for batch additions
* fix JacExt infinity conversion
* add batchAffine for MSM, though slower than JacExtended at the moment
* add a batch affine scheduler for MSM
* Add Multi-Scalar-Multiplication endomorphism acceleration
* some tuning
* signed integer fixes + 32-bit + tuning
* Some more tuning
* common msm bench + don't use affine for c < 9
* nit
* create a codecs.nim file for hex/base64 and other encoding conversions
* improve maintenance/readability of hex conversion
* add skeleton of constant-time base64 decoding
* use raw casts
* use raw casts only for same size types
* [Threadpool] Fix syncAll releasing while a thread was attempting to steal + force no exception in tasks
* fix unguarded access on MacOS barriers
* parallel batchadd
* moved import
* Implement a threadpool
* int and SomeUnsignedInt ...
* Type conversion for windows SynchronizationBarrier
* Use the latest MacOS 11, Big Sur API (jan 2021) for MacOS futexes, Github action offers MacOS 12 and can test them
* bench need posix timer not available on windows and darwin futex
* Windows: nimble exec empty line is an error, Mac: use defined(osx) instead of defined(macos)
* file rename
* okay, that's the last one hopefully
* deactivate stealHalf for now
* sha256: separate message scheduling and state updates to help implement specific use-cases like #205; also implement SSSE3 acceleration (2006, Intel Core 2 Duo)
* sha256: simplify update flow, store less metadata in context
* sha256: Fix reworked update function
* Implement x86 hardware SHA acceleration
* typo