* Pasta bench
* cleanup env variables
* [MSM]: generate benchmark coef-points pairs in parallel
* try to fix windows Ci
* add diagnostic info
* fix old test for new codecs/io primitives
* Ensure the projective point at infinity is not all zeros, but (0, 1, 0)
* move tests
* move threadpool to root path
* fix hints and warnings, print nim versions for tests for debugging the new strange issue in CI
* print nim version
* mixup on branches
* mixup on branches reloaded
* rework assembler register/mem and constraint declarations
* Introduce constraint UnmutatedPointerToWriteMem
* Create invidual memory cell operands
* [Assembly] fully support indirect memory addressing
* fix calling convention for exported procs
* Prepare for switch to intel syntax to avoid clang constant propagation asm symbol name interfering OR pointer+offset addressing
* use modifiers to prevent bad string mixin fo assembler to linker of propagated consts
* Assembly: switch to intel syntax
* with working memory operand - now works with LTO on both GCC and clang and constant folding
* use memory operand in more places
* remove some inline now that we have lto
* cleanup compiler config and benches
* tracer shouldn't force dependencies when unused
* fix cc on linux
* nimble fixes
* update README [skip CI]
* update MacOS CI with Homebrew Clang
* oops nimble bindings disappeared
* more nimble fixes
* fix sha256 exported symbol
* improve constraints on modular addition
* Add extra constraint to force reloading of pointer in reg inputs
* Fix LLVM gold linker running out of registers
* workaround MinGW64 GCC 12.2 bad codegen in t_pairing_cyclotomic_subgroup with LTO
* [testsuite] Rework parallel test runner to buffer beyond 65536 chars and properly wait for process exit
* [testsuite] improve error reporting
* rework openArray[byte/char] for BLS signature C API
* Prepare for optimized library and bindings
* properly link to constantine
* Compiler fixes, global sanitizers, GCC bug with --opt:size
* workaround/fix #229: don't inline field reduction in Fp2
* fix clang running out of registers with LTO
* [C API] missed length parameters for ctt_eth_bls_fast_aggregate_verify
* double-precision asm is too large for inlining, try to fix Linux and MacOS woes at https://github.com/mratsim/constantine/pull/228#issuecomment-1512773460
* Use FORTIFY_SOURCE for testing
* Fix#230 - gcc miscompiles Fp6 mul with LTO
* disable LTO for now, PR is too long
* try parallel reduction in batch add, but alas it's slower than custom chunking. Except maybe on arch with performance/efficiency cores
* initial impl of parallel MSM - scaling to debug, threads not woken fast enough
* improve comment [skip ci]
* skip top window when c divides the number of bits
* for some reason parallel-for loops scale on 5+ threads while spawn only on 2x threads. Thread wakeup issue?
* Add counters and timers to audit threadpool bottlenecks
* metrics and profiling fixes, (slower) latency hiding, activate tests
* fix thief thread trying to wake another before canceling its own sleep
* easier to sort metrics and parallel endomorphism application
* selective endomorphism acceleration
* some tuning
* spawn can handle compile-time literals, static and type parameters. Also introduce spawnAwaitable to await void procs
* improve MSM overview [skip ci]
* bench cleanup
* remove reserve threads
* recover last perf diff: 1. don't import primitives, cpu features detection globals are noticeable, 2. noinit + conditional zeroMem are unnecessary when sync is inline 3. inline 'newSpawn' and don't init the loop part
* avoid syscalls if possible if thred is awake but idle
* renaming eventLoop
* remove unused code: steal-half
* renaming
* no need for 0-init sync, T can be large in cryptography
* introduce reserve threads to minimize latency and maximize throughput when awaiting a future
* introduce a ceilDiv proc
* threadpool: implement parallel-for loops
* 10x perf improvement by not waking reserveBackoff on syncAll
* bench overhead: new reserve system might introduce too much wakeup latency, 2x slower, for fine-grained parallelism
* add parallelForStrided
* Threadpool: Implement parallel reductions
* refactor parallel loop codegen: introduce descriptor, parsing and codegen stages
* parallel strided, test transpose bench
* tight loop is faster when backoff is not inline
* no POSIX stuff on windows, larger types for histogram bench
* fix tests
* max RSS overflow?
* missed an undefined var
* exit histogram on 32-bit
* forgot to return early dor 32-bit
* unoptimized msm
* MSM: reorder loops
* add a signed windowed recoding technique
* improve wNAF table access
* use batchAffine
* revamp EC tests
* MSM signed digit support
* refactor MSM: recode signed ahead of time
* missing test vector
* refactor allocs and Alloca sideeffect
* add an endomorphism threshold
* Add Jacobian extended coordinates
* refactor recodings, prepare for parallelizable on-the-fly signed recoding
* recoding changes, introduce proper NAF for pairings
* more pairings refactoring, introduce miller accumulator for EVM
* some optim to the addchain miller loop
* start optimizing multi-pairing
* finish multi-miller loop refactoring
* minor tuning
* MSM: signed encoding suitable for parallelism (no precompute)
* cleanup signed window encoding
* add prefetching
* add metering
* properly init result to infinity
* comment on prefetching
* introduce vartime inversion for batch additions
* fix JacExt infinity conversion
* add batchAffine for MSM, though slower than JacExtended at the moment
* add a batch affine scheduler for MSM
* Add Multi-Scalar-Multiplication endomorphism acceleration
* some tuning
* signed integer fixes + 32-bit + tuning
* Some more tuning
* common msm bench + don't use affine for c < 9
* nit
* [Threadpool] Fix syncAll releasing while a thread was attempting to steal + force no exception in tasks
* fix unguarded access on MacOS barriers
* parallel batchadd
* moved import
* Implement a threadpool
* int and SomeUnsignedInt ...
* Type conversion for windows SynchronizationBarrier
* Use the latest MacOS 11, Big Sur API (jan 2021) for MacOS futexes, Github action offers MacOS 12 and can test them
* bench need posix timer not available on windows and darwin futex
* Windows: nimble exec empty line is an error, Mac: use defined(osx) instead of defined(macos)
* file rename
* okay, that's the last one hopefully
* deactivate stealHalf for now
* Example+Test C API vs GMP
* Create build directory for bindings test
* --nimMainPrefix is 1.6 only
* Add libdl for dynamic loading
* absolute paths
* add static link test
* Fix man main, rename Nimmain to init_NimMain
* Deal with MacOS annoying linker w.r.t. static libraries
* use .exe extension to satisfy windows (?)
* annoying GCC which doesn't create paths
* Try skipping DLL test on windows
* windows extensions ...
* no lib prefix on windows
* Try to compile with GMP on windows and 32-bit linux
* remove leftover msys shell
* Don't use GMP Mersenne Twister, bad randomness and untested Nim wrapper
* properly cache nim
* fix path after cache
* run pacman in msys2 env
* rework msys2 ... again
* shell compat for file clearing
* shell compat try-again for file clearing
* force bash for clearing parallel builds on windows
* Use nimscript directly (why didn't it work last time?)
* Avoid IO redirection to support any shell
* Avoid IO redirection v2 to support any shell
* add debug data
* add debug again
* Introduce pararun, a parallel test runner to remove need of GNU parallel
* pararun: style
* First draft at bindings generation
* finite field bindings PoC
* support openarray, export NimMain
* PoC extension fields and elliptic curve bindings
* Pasta
* expose more bindings, remove nimZeroMem, remove tracer when unused, codegen name_mangling`gensym issue
* workaround bad C gensym codegen with {.inline.} pragma in non-dirty template nested in generic proc instantiated by template
* try 1.6 CI
* Try CI with 1.6 and windows.
* Bend the knee
* have fun debugging CI
* have fun debugging CI
* more CI spam
* branch -> nim_version
* fight or flight
* properly detect windows
* Fix galore
* 🐍🐍 snake:
* meh give up on parallelizing windows and dealing with windows PATH issues
* ¯\_ (ツ)_/¯
* split modular inversion in its own file
* Stash fast GCD inversion https://eprint.iacr.org/2020/972.pdf
* Stash Pornin's bingcd -> issue with inner modular reduction
* Implement Bernstein-Yang inversion
* Avoid Nim checks on signed integers (32-bit runtime issue)
* cleanup: remove old inversion impls
* cleanup: static moduli, move div2
* small comments (skip ci)
* comment cleanup (skip ci)
* fix total iterations on 32-bit
* Add batch conversion to affine coordinates using simultaneous inversion trick
* fix conditional setZero and batchAffine conversion
* cleanup unneeded branches following affine conversion unification
* Fix batchAffine with zero inputs and add fuzz failure to test suite
* Move cofactor clearing to dedicated per-curve subgroups file
* Add BLS12-381 fast subgroup checks
* Implement fast cofactor clearing for BN254_snarks
* Add fast subgroup check to BN254Snarks
* add BLS12_377 optimized cofactor and subgroup functions
* Add BN254_Nogami
* Add GT-subgroup tests
* Use the new subgroup checks for Eth1 EVM precompiles
* Point decoding: optimized sqrt for p ≡ 5 (mod 8) (Curve25519)
* Implement fused sqrt(u/v) for twisted edwards point deserialization
* Introduce twisted edwards affine
* Allow declaration of curve field elements (and fight against recursive dependencies
* Twisted edwards group law + tests
* Add support for jubjub and bandersnatch #162
* test twisted edwards scalar mul
* Hash to Curve: impl expand_message_xmd
* Try to precompute part of hash to curve at compile-time
* sha256 bench - use the new hashes module
* [WIP] smoke test hash to field
* Implement hash_to_field with expected output
* unoptimized hash-to-curve G2 for BLS12-381
* Don't run sanitizer on hash to field as it uses GC-ed strings
* Pairing with affine: align API to BLST and Gurvy and common use-case.
* Implement multi-pairing / aggregate verif for BLS12-381 (+2% pairing perf)
* Generalize the optimized miller loop for single pairing
* Immplement the miller loop addchain for BLS12-377
* Miller addition chain for BN254-Nogami
* no Miller adchain for BN254-Snarks
* Update the line test with new tower https://github.com/mratsim/constantine/pull/153
* Somewhat sparse for Fp2 M-Twist
* Implement line by line multiplication for Fp12 D-Twist
* Somewhat sparse Mul for Fp12 D-Twist
* Finish the sparse and somewhat sparse multiplications