* introduce reserve threads to minimize latency and maximize throughput when awaiting a future
* introduce a ceilDiv proc
* threadpool: implement parallel-for loops
* 10x perf improvement by not waking reserveBackoff on syncAll
* bench overhead: new reserve system might introduce too much wakeup latency, 2x slower, for fine-grained parallelism
* add parallelForStrided
* Threadpool: Implement parallel reductions
* refactor parallel loop codegen: introduce descriptor, parsing and codegen stages
* parallel strided, test transpose bench
* tight loop is faster when backoff is not inline
* no POSIX stuff on windows, larger types for histogram bench
* fix tests
* max RSS overflow?
* missed an undefined var
* exit histogram on 32-bit
* forgot to return early dor 32-bit
* unoptimized msm
* MSM: reorder loops
* add a signed windowed recoding technique
* improve wNAF table access
* use batchAffine
* revamp EC tests
* MSM signed digit support
* refactor MSM: recode signed ahead of time
* missing test vector
* refactor allocs and Alloca sideeffect
* add an endomorphism threshold
* Add Jacobian extended coordinates
* refactor recodings, prepare for parallelizable on-the-fly signed recoding
* recoding changes, introduce proper NAF for pairings
* more pairings refactoring, introduce miller accumulator for EVM
* some optim to the addchain miller loop
* start optimizing multi-pairing
* finish multi-miller loop refactoring
* minor tuning
* MSM: signed encoding suitable for parallelism (no precompute)
* cleanup signed window encoding
* add prefetching
* add metering
* properly init result to infinity
* comment on prefetching
* introduce vartime inversion for batch additions
* fix JacExt infinity conversion
* add batchAffine for MSM, though slower than JacExtended at the moment
* add a batch affine scheduler for MSM
* Add Multi-Scalar-Multiplication endomorphism acceleration
* some tuning
* signed integer fixes + 32-bit + tuning
* Some more tuning
* common msm bench + don't use affine for c < 9
* nit
* Implement a threadpool
* int and SomeUnsignedInt ...
* Type conversion for windows SynchronizationBarrier
* Use the latest MacOS 11, Big Sur API (jan 2021) for MacOS futexes, Github action offers MacOS 12 and can test them
* bench need posix timer not available on windows and darwin futex
* Windows: nimble exec empty line is an error, Mac: use defined(osx) instead of defined(macos)
* file rename
* okay, that's the last one hopefully
* deactivate stealHalf for now
* Try to compile with GMP on windows and 32-bit linux
* remove leftover msys shell
* Don't use GMP Mersenne Twister, bad randomness and untested Nim wrapper
* properly cache nim
* fix path after cache
* run pacman in msys2 env
* rework msys2 ... again
* shell compat for file clearing
* shell compat try-again for file clearing
* force bash for clearing parallel builds on windows
* Use nimscript directly (why didn't it work last time?)
* Avoid IO redirection to support any shell
* Avoid IO redirection v2 to support any shell
* add debug data
* add debug again
* Introduce pararun, a parallel test runner to remove need of GNU parallel
* pararun: style
* Add specific fromMont conversion routine. Rename montyResidue to getMont
* missed test file
* Add x86_64 ASM for fromMont
* Add x86_64 MULX/ADCX/ADOX for fromMont
* rework Montgomery Multiplication with prefetch/latency hiding techniques
* Fix ADX autodetection, closes#174. Rollback faster mul_mont attempt, no improvement and debug pain.
* finalSub in fromMont & adx_bmi -> adx
* Some {.noInit.} to avoid Nim zeroMem (which should be optimized away but who knows)
* Uniformize name 'op+domain': mulmod - mulmont
* Fix asm codegen bug "0x0000555555565930 <+896>: sbb 0x20(%r8),%r8" with Clang in final substraction
* Prepare for skipping final substraction
* Don't forget to copy the result when we skip the final substraction
* Seems like we need to stash the idea of skipping the final substraction for now, needs bounds analysis https://eprint.iacr.org/2017/1057.pdf
* fix condition for ASM 32-bit
* optim modular addition when sparebit is available
* Point decoding: optimized sqrt for p ≡ 5 (mod 8) (Curve25519)
* Implement fused sqrt(u/v) for twisted edwards point deserialization
* Introduce twisted edwards affine
* Allow declaration of curve field elements (and fight against recursive dependencies
* Twisted edwards group law + tests
* Add support for jubjub and bandersnatch #162
* test twisted edwards scalar mul
* consistent naming for dbl-width
* Isolate double-width Fp2 mul
* Implement double-width complex multiplication
* Lay out Fp4 double-width mul
* Off by p in square Fp4 as well :/
* less copies and stack space in addition chains
* Address https://github.com/mratsim/constantine/issues/154 partly
* Fix#154, faster Fp4 square: less non-residue, no Mul, only square (bit more ops total)
* Fix typo
* better assembly scheduling for add/sub
* Double-width -> Double-precision
* Unred -> Unr
* double-precision modular addition
* Replace canUseNoCarryMontyMul and canUseNoCarryMontySquare by getSpareBits
* Complete the double-precision implementation
* Use double-precision path for Fp4 squaring and mul
* remove mixin annotations
* Lazy reduction in Fp4 prod
* Fix assembly for sum2xMod
* Assembly for double-precision negation
* reduce white spaces in pairing benchmarks
* ADX implies BMI2
* Fix affine instantiation
* drop concept from the codebase
* Remove alignment requirement, this cases problem in sequences on 32-bit for t_fp12_anti_regression
* slight sparse optim
* naive removal of out-of-place mul by non residue
* Use {.inline.} in a consistent manner across the codebase
* Handle aliasing for quadratic multiplication
* reorg optimization
* Handle aliasing for quadratic squaring
* handle aliasing in mul_sparse_complex_by_0y
* Rework multiplication by nonresidue, assume tower and twist use same non-residue
* continue rework
* continue on non-residues
* Remove "NonResidue *" calls
* handle aliasing in Chung-Hasan SQR2
* Handla aliasing in Chung-Hasan SQR3
* Use one less temporary in Chung Hasan sqr2
* handle aliasing in cubic extensions
* merge extension tower in the same file to reduce duplicate proc and allow better inlining
* handle aliasing in cubic inversion
* drop out-of-place proc from BigInt and finite fields as well
* less copies in line_projective
* remove a copy in fp12 by lines
* Add Fp, Fp2, Fp6 support for BW6-761
* Add G1 for BW6-761
* Prepare to support G2 twists on the same field as G1
* Remove a useless dependent type for lines
* Implement G2 for BW6-761
* Fix Line leftover
* Pairing - initial commit
- line functions
- sparse Fp12 functions
* Small fixes:
- Line parametrized by twist for generic algorithm
- Add a conjugate operator for quadratic extensions
- Have frobenius use it
- Create an Affine coordinate type for elliptic curve
* Implement (failing) pairing test
* Stash pairing debug session, temp switch Fp12 over Fp4
* Proper naive pairing on BLS12-381
* Frobenius map
* Implement naive pairing for BN curves
* Add pairing tests to CI + reduce time spent on lower-level tests
* Test without assembler in Github Actions + less base layers test iterations
* Add sage script for BN254
* Implement (failing) scalar multiplication tests
* Add a first test against sagemath
* Finish the tests against SAGE for BN254
* Add significant test coverage of scalar multiplication with reference checks for BN254_Snarks and BLS12_381