* try parallel reduction in batch add, but alas it's slower than custom chunking. Except maybe on arch with performance/efficiency cores
* initial impl of parallel MSM - scaling to debug, threads not woken fast enough
* improve comment [skip ci]
* skip top window when c divides the number of bits
* for some reason parallel-for loops scale on 5+ threads while spawn only on 2x threads. Thread wakeup issue?
* Add counters and timers to audit threadpool bottlenecks
* metrics and profiling fixes, (slower) latency hiding, activate tests
* fix thief thread trying to wake another before canceling its own sleep
* easier to sort metrics and parallel endomorphism application
* selective endomorphism acceleration
* some tuning
* spawn can handle compile-time literals, static and type parameters. Also introduce spawnAwaitable to await void procs
* improve MSM overview [skip ci]
* bench cleanup
* introduce reserve threads to minimize latency and maximize throughput when awaiting a future
* introduce a ceilDiv proc
* threadpool: implement parallel-for loops
* 10x perf improvement by not waking reserveBackoff on syncAll
* bench overhead: new reserve system might introduce too much wakeup latency, 2x slower, for fine-grained parallelism
* add parallelForStrided
* Threadpool: Implement parallel reductions
* refactor parallel loop codegen: introduce descriptor, parsing and codegen stages
* parallel strided, test transpose bench
* tight loop is faster when backoff is not inline
* no POSIX stuff on windows, larger types for histogram bench
* fix tests
* max RSS overflow?
* missed an undefined var
* exit histogram on 32-bit
* forgot to return early dor 32-bit
* unoptimized msm
* MSM: reorder loops
* add a signed windowed recoding technique
* improve wNAF table access
* use batchAffine
* revamp EC tests
* MSM signed digit support
* refactor MSM: recode signed ahead of time
* missing test vector
* refactor allocs and Alloca sideeffect
* add an endomorphism threshold
* Add Jacobian extended coordinates
* refactor recodings, prepare for parallelizable on-the-fly signed recoding
* recoding changes, introduce proper NAF for pairings
* more pairings refactoring, introduce miller accumulator for EVM
* some optim to the addchain miller loop
* start optimizing multi-pairing
* finish multi-miller loop refactoring
* minor tuning
* MSM: signed encoding suitable for parallelism (no precompute)
* cleanup signed window encoding
* add prefetching
* add metering
* properly init result to infinity
* comment on prefetching
* introduce vartime inversion for batch additions
* fix JacExt infinity conversion
* add batchAffine for MSM, though slower than JacExtended at the moment
* add a batch affine scheduler for MSM
* Add Multi-Scalar-Multiplication endomorphism acceleration
* some tuning
* signed integer fixes + 32-bit + tuning
* Some more tuning
* common msm bench + don't use affine for c < 9
* nit
* create a codecs.nim file for hex/base64 and other encoding conversions
* improve maintenance/readability of hex conversion
* add skeleton of constant-time base64 decoding
* use raw casts
* use raw casts only for same size types
* [Threadpool] Fix syncAll releasing while a thread was attempting to steal + force no exception in tasks
* fix unguarded access on MacOS barriers
* parallel batchadd
* moved import
* sha256: separate message scheduling and state updates to help implement specific use-cases like #205; also implement SSSE3 acceleration (2006, Intel Core 2 Duo)
* sha256: simplify update flow, store less metadata in context
* sha256: Fix reworked update function
* Implement x86 hardware SHA acceleration
* typo
* Example+Test C API vs GMP
* Create build directory for bindings test
* --nimMainPrefix is 1.6 only
* Add libdl for dynamic loading
* absolute paths
* add static link test
* Fix man main, rename Nimmain to init_NimMain
* Deal with MacOS annoying linker w.r.t. static libraries
* use .exe extension to satisfy windows (?)
* annoying GCC which doesn't create paths
* Try skipping DLL test on windows
* windows extensions ...
* no lib prefix on windows
* Try to compile with GMP on windows and 32-bit linux
* remove leftover msys shell
* Don't use GMP Mersenne Twister, bad randomness and untested Nim wrapper
* properly cache nim
* fix path after cache
* run pacman in msys2 env
* rework msys2 ... again
* shell compat for file clearing
* shell compat try-again for file clearing
* force bash for clearing parallel builds on windows
* Use nimscript directly (why didn't it work last time?)
* Avoid IO redirection to support any shell
* Avoid IO redirection v2 to support any shell
* add debug data
* add debug again
* Introduce pararun, a parallel test runner to remove need of GNU parallel
* pararun: style
* First draft at bindings generation
* finite field bindings PoC
* support openarray, export NimMain
* PoC extension fields and elliptic curve bindings
* Pasta
* expose more bindings, remove nimZeroMem, remove tracer when unused, codegen name_mangling`gensym issue
* workaround bad C gensym codegen with {.inline.} pragma in non-dirty template nested in generic proc instantiated by template
* Skeleton of hash to curve for BLS12-381 G1
* Remove isodegree parameter
* Fix polynomial evaluation of hashToG1
* Optimize hash_to_curve and add bench for hash to G1
* slight optim of jacobian isomap + v7 test vectors
* try 1.6 CI
* Try CI with 1.6 and windows.
* Bend the knee
* have fun debugging CI
* have fun debugging CI
* more CI spam
* branch -> nim_version
* fight or flight
* properly detect windows
* Fix galore
* 🐍🐍 snake:
* meh give up on parallelizing windows and dealing with windows PATH issues
* ¯\_ (ツ)_/¯
* Add specific fromMont conversion routine. Rename montyResidue to getMont
* missed test file
* Add x86_64 ASM for fromMont
* Add x86_64 MULX/ADCX/ADOX for fromMont
* rework Montgomery Multiplication with prefetch/latency hiding techniques
* Fix ADX autodetection, closes#174. Rollback faster mul_mont attempt, no improvement and debug pain.
* finalSub in fromMont & adx_bmi -> adx
* Some {.noInit.} to avoid Nim zeroMem (which should be optimized away but who knows)
* Uniformize name 'op+domain': mulmod - mulmont
* Fix asm codegen bug "0x0000555555565930 <+896>: sbb 0x20(%r8),%r8" with Clang in final substraction
* Prepare for skipping final substraction
* Don't forget to copy the result when we skip the final substraction
* Seems like we need to stash the idea of skipping the final substraction for now, needs bounds analysis https://eprint.iacr.org/2017/1057.pdf
* fix condition for ASM 32-bit
* optim modular addition when sparebit is available
* split modular inversion in its own file
* Stash fast GCD inversion https://eprint.iacr.org/2020/972.pdf
* Stash Pornin's bingcd -> issue with inner modular reduction
* Implement Bernstein-Yang inversion
* Avoid Nim checks on signed integers (32-bit runtime issue)
* cleanup: remove old inversion impls
* cleanup: static moduli, move div2
* small comments (skip ci)
* comment cleanup (skip ci)
* fix total iterations on 32-bit
* Add batch conversion to affine coordinates using simultaneous inversion trick
* fix conditional setZero and batchAffine conversion
* cleanup unneeded branches following affine conversion unification
* Fix batchAffine with zero inputs and add fuzz failure to test suite
* Move cofactor clearing to dedicated per-curve subgroups file
* Add BLS12-381 fast subgroup checks
* Implement fast cofactor clearing for BN254_snarks
* Add fast subgroup check to BN254Snarks
* add BLS12_377 optimized cofactor and subgroup functions
* Add BN254_Nogami
* Add GT-subgroup tests
* Use the new subgroup checks for Eth1 EVM precompiles
* add more Fp tests for Twisted Edwards curves
* add fused sqrt+division bench
* Significant fused sqrt+division improvement for any prime field over algorithm described in "High-Speed High-Security Signature", Bernstein et al, p15 "Fast decompression", https://ed25519.cr.yp.to/ed25519-20110705.pdf
* Activate secp256k1 field benches + spring renaming of field multiplication
* addition chains for inversion and sqrt of Curve25519
* Make isSquare use addition chains
* add double-prec mul/square bench for <256-bit prime fields.
* Point decoding: optimized sqrt for p ≡ 5 (mod 8) (Curve25519)
* Implement fused sqrt(u/v) for twisted edwards point deserialization
* Introduce twisted edwards affine
* Allow declaration of curve field elements (and fight against recursive dependencies
* Twisted edwards group law + tests
* Add support for jubjub and bandersnatch #162
* test twisted edwards scalar mul
* Hash to Curve: impl expand_message_xmd
* Try to precompute part of hash to curve at compile-time
* sha256 bench - use the new hashes module
* [WIP] smoke test hash to field
* Implement hash_to_field with expected output
* unoptimized hash-to-curve G2 for BLS12-381
* Don't run sanitizer on hash to field as it uses GC-ed strings