* unoptimized MSM
* MSM: reorder loops
* add a signed windowed recoding technique (sketched after this list)
* improve wNAF table access
* use batchAffine
* revamp EC tests
* MSM signed digit support
* refactor MSM: recode signed ahead of time
* missing test vector
* refactor allocs and Alloca sideeffect
* add an endomorphism threshold
* Add Jacobian extended coordinates
* refactor recodings, prepare for parallelizable on-the-fly signed recoding
* recoding changes, introduce proper NAF for pairings (sketched after this list)
* more pairing refactoring, introduce a Miller accumulator for EVM
* some optimization of the addchain Miller loop
* start optimizing multi-pairing
* finish multi-Miller loop refactoring
* minor tuning
* MSM: signed encoding suitable for parallelism (no precompute)
* cleanup signed window encoding
* add prefetching
* add metering
* properly init result to infinity
* comment on prefetching
* introduce vartime inversion for batch additions
* fix JacExt infinity conversion
* add batchAffine for MSM, though slower than JacExtended at the moment
* add a batch affine scheduler for MSM
* Add Multi-Scalar-Multiplication endomorphism acceleration
* some tuning
* signed integer fixes + 32-bit + tuning
* Some more tuning
* common MSM bench + don't use affine for c < 9
* nit
* Add specific fromMont conversion routine (sketched after this list). Rename montyResidue to getMont
* missed test file
* Add x86_64 ASM for fromMont
* Add x86_64 MULX/ADCX/ADOX for fromMont
* rework Montgomery Multiplication with prefetch/latency hiding techniques
* Fix ADX autodetection, closes #174. Roll back the faster mul_mont attempt: no improvement and debug pain.
* finalSub in fromMont & adx_bmi -> adx
* Some {.noInit.} to avoid Nim zeroMem (which should be optimized away but who knows)
* Uniformize name 'op+domain': mulmod - mulmont
* Fix asm codegen bug "0x0000555555565930 <+896>: sbb 0x20(%r8),%r8" with Clang in final subtraction
* Prepare for skipping the final subtraction
* Don't forget to copy the result when we skip the final subtraction
* Seems like we need to stash the idea of skipping the final subtraction for now, it needs bounds analysis: https://eprint.iacr.org/2017/1057.pdf
* fix condition for ASM 32-bit
* optimize modular addition when a spare bit is available
* split modular inversion in its own file
* Stash fast GCD inversion https://eprint.iacr.org/2020/972.pdf
* Stash Pornin's bingcd -> issue with inner modular reduction
* Implement Bernstein-Yang inversion
* Avoid Nim checks on signed integers (32-bit runtime issue)
* cleanup: remove old inversion impls
* cleanup: static moduli, move div2
* small comments (skip ci)
* comment cleanup (skip ci)
* fix total iterations on 32-bit
* Add batch conversion to affine coordinates using the simultaneous inversion trick (sketched after this list)
* fix conditional setZero and batchAffine conversion
* cleanup unneeded branches following affine conversion unification
* Fix batchAffine with zero inputs and add fuzz failure to test suite
* add more Fp tests for Twisted Edwards curves
* add fused sqrt+division bench
* Significant fused sqrt+division improvement for any prime field over the algorithm described in "High-speed high-security signatures", Bernstein et al., p. 15 "Fast decompression", https://ed25519.cr.yp.to/ed25519-20110705.pdf (sketched after this list)
* Activate secp256k1 field benches + spring renaming of field multiplication
* addition chains for inversion and sqrt of Curve25519
* Make isSquare use addition chains
* add double-prec mul/square bench for <256-bit prime fields.
* Clear cofactor in BN254 G2 testgen and frobenius
* Implement G2 endomorphism acceleration in Sage
* Somewhat working accelerated scalar mul on G2 (2.2x faster)
- OK for BN254_Snarks
- Some test failing for BLS12-381
* Fix negative miniscalars by adding an extra bit of encoding
* Cleanup accel params
* Small recoding optimizations
* Implement double-width field multiplication for double-width towering
* Fp2 mul acceleration via double-width lazy reduction (pure Nim; sketched after this list)
* Inline assembly for basic add and sub
* Use 2 registers instead of 12+ for ASM conditional copy
* Prepare assembly for extended multiprecision multiplication support
* Add assembly for mul
* initial implementation of assembly reduction
* stash current progress of assembly reduction
* Fix clobbering issue, only P256 comparison remains buggy
* Fix asm montgomery reduction for NIST P256 as well
* MULX/ADCX/ADOX multi-precision multiplication
* MULX/ADCX/ADOX reduction v1
* Add (deactivated) assembly for double-width subtraction + rework benches
* Add bench to nimble and deactivate double-width for now, slower than classic
* Fix x86-32 running out of registers for mul
* Clang needs to be at v9 to support flag output constraints (Xcode 11.4.2 / OSX Catalina)
* 32-bit doesn't have enough registers for ASM mul
* Fix again Travis Clang 9 issues
* LLVM 9 is not whitelisted in Travis
* deactivated assembler with Travis Clang
* syntax error
* another
* ...
* missing space, yeah ...
* Proof-of-Concept Assembly code generator
* Tag inline per procedure so we can easily track the tradeoff on tower fields
* Implement Assembly for modular addition (but a very curious off-by-one)
* Fix off-by-one for moduli with the MSB not set
* Stash (super fast) alternative, but still off by a carry
* Fix GCC optimizing ASM away
* Save 1 register to allow compiling for BLS12-381 (in the GMP test)
* The compiler cannot find enough registers if the ASM file is not compiled with -O3
* Add modsub
* Add field negation
* Implement no-carry Assembly optimized field multiplication
* Expose UseX86ASM to the EC benchmark
* omit frame pointer to save registers instead of hardcoding -O3. Also ensure early clobber constraints for Clang
* Prepare for assembly fallback
* Implement fallback for CPUs that don't support ADX and BMI2
* Add CPU runtime detection
* Update README, closes #66
* Remove commented out code
* Mention that the inverse of 0 is 0 (TODO tests)
* Introduce "Higher-Kinded tower extensions"
* rename isComplexExtension -> fromComplexExtension
* update benchmarks with the new tower scheme
* Try to recover some speed on mul/squaring for an optimal tower (but this was not it)
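
A few of the techniques referenced above, sketched in illustrative Python (not the library's Nim code). First, the signed windowed recoding used by the MSM entries: the scalar is recoded into base-2^c digits in roughly [-2^(c-1), 2^(c-1)], so the bucket method only needs buckets for positive digit values and handles negative digits by negating the point, which is nearly free on an elliptic curve. Function names and the pure-integer arithmetic are assumptions for the sketch.

```python
def signed_window_recode(scalar: int, c: int) -> list[int]:
    """Recode a non-negative scalar into signed base-2^c digits, each in
    [-(2^(c-1) - 1), 2^(c-1)], least-significant window first."""
    digits = []
    carry = 0
    while scalar > 0 or carry != 0:
        d = (scalar & ((1 << c) - 1)) + carry
        scalar >>= c
        if d > (1 << (c - 1)):
            d -= 1 << c          # make the digit negative ...
            carry = 1            # ... and borrow from the next window
        else:
            carry = 0
        digits.append(d)
    return digits

# Sanity check: recomposing the digits gives back the scalar.
def recompose(digits: list[int], c: int) -> int:
    return sum(d << (c * i) for i, d in enumerate(digits))

assert recompose(signed_window_recode(0xDEADBEEF, 5), 5) == 0xDEADBEEF
```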
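
The "proper NAF" mentioned for pairings is the classic non-adjacent form: signed binary digits in {-1, 0, 1} with no two adjacent non-zero digits, which reduces the number of additions in a double-and-add or Miller loop. A minimal sketch, least-significant digit first:

```python
def naf(k: int) -> list[int]:
    """Non-adjacent form of a non-negative integer k: digits in {-1, 0, 1},
    least-significant first, with no two consecutive non-zero digits."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k & 3)   # +1 if k ≡ 1 (mod 4), -1 if k ≡ 3 (mod 4)
            k -= d            # now k ≡ 0 (mod 4), forcing the next digit to 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

assert sum(d << i for i, d in enumerate(naf(0b101101))) == 0b101101
```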
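
The fromMont routine converts a value out of Montgomery form; mathematically it is just a Montgomery reduction (REDC) of a*R mod p, since REDC(a*R) = a. A word-by-word sketch under assumed names and word size; the real routine works on fixed-size limbs and in constant time:

```python
def from_mont(a_mont: int, p: int, num_words: int, word_bits: int = 64) -> int:
    """Convert a_mont = a * R mod p out of Montgomery form, where
    R = 2^(word_bits * num_words), via word-by-word Montgomery reduction."""
    W = 1 << word_bits
    m0 = (-pow(p, -1, W)) % W          # -p^-1 mod 2^word_bits (precomputed in practice)
    t = a_mont
    for _ in range(num_words):
        q = (t * m0) % W               # chosen so that t + q*p ≡ 0 (mod 2^word_bits)
        t = (t + q * p) >> word_bits   # exact division by the word base
    if t >= p:                         # final conditional subtraction (cf. finalSub above)
        t -= p
    return t

p = 2**255 - 19
R = 2**256                             # 4 words of 64 bits
a = 123456789
assert from_mont(a * R % p, p, num_words=4) == a
```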
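
Batch conversion to affine coordinates needs one field inversion per point (1/Z); the simultaneous-inversion (Montgomery batch inversion) trick replaces n inversions by a single inversion plus about 3n multiplications. An illustrative sketch over a prime field, following the zero-maps-to-zero convention from the batchAffine zero-input fix above:

```python
def batch_inverse(xs: list[int], p: int) -> list[int]:
    """Simultaneous inversion: one modular inversion + ~3*len(xs) multiplications.
    Zero inputs are passed through as zero (convention: 0^-1 = 0)."""
    acc = 1
    partials = []                       # partials[i] = product of non-zero xs[0..i-1]
    for x in xs:
        partials.append(acc)
        if x != 0:
            acc = acc * x % p
    inv = pow(acc, -1, p)               # the single expensive inversion
    out = [0] * len(xs)
    for i in reversed(range(len(xs))):  # unwind: peel off one factor per element
        if xs[i] != 0:
            out[i] = inv * partials[i] % p
            inv = inv * xs[i] % p
    return out

p = 2**255 - 19
zs = [3, 0, 7, 11]
assert batch_inverse(zs, p) == [pow(z, -1, p) if z else 0 for z in zs]
```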
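
The fused sqrt+division computes sqrt(u/v) with a single exponentiation, avoiding a separate inversion. Below is a sketch of the "fast decompression" candidate-root formula from the cited paper for p ≡ 5 (mod 8), the Curve25519 case; the changelog entry generalizes the idea to any prime field, which this sketch does not attempt. The function name and return convention are assumptions.

```python
def sqrt_div(u: int, v: int, p: int) -> tuple[bool, int]:
    """Return (True, x) with v*x^2 ≡ u (mod p) if u/v is a square, else (False, 0).
    Requires p ≡ 5 (mod 8) and v != 0. One exponentiation, no division."""
    assert p % 8 == 5 and v % p != 0
    v3 = v * v % p * v % p
    v7 = v3 * v3 % p * v % p
    beta = u * v3 % p * pow(u * v7 % p, (p - 5) // 8, p) % p   # candidate for (u/v)^((p+3)/8)
    vb2 = v * beta % p * beta % p
    if vb2 == u % p:
        return True, beta
    if vb2 == (-u) % p:                  # off by a factor sqrt(-1): fix it up
        return True, beta * pow(2, (p - 1) // 4, p) % p
    return False, 0

p = 2**255 - 19
u, v = 4, 9                              # u/v = (2/3)^2 is a square
ok, x = sqrt_div(u, v, p)
assert ok and (v * x * x - u) % p == 0
```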
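
Finally, the double-width lazy reduction behind the Fp2 mul acceleration: the Karatsuba partial products are kept unreduced (double-width) and each output coefficient is reduced only once, instead of reducing after every base-field multiplication. A minimal sketch over Fp[i]/(i^2 + 1), where `% p` stands in for the dedicated double-width Montgomery reduction:

```python
def fp2_mul_lazy(a: tuple[int, int], b: tuple[int, int], p: int) -> tuple[int, int]:
    """Multiply a = a0 + a1*i and b = b0 + b1*i in Fp[i]/(i^2 + 1) with
    3 base-field multiplications (Karatsuba) and only 2 reductions."""
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0                 # kept double-width, not reduced
    t1 = a1 * b1
    t2 = (a0 + a1) * (b0 + b1)
    c0 = (t0 - t1) % p           # one reduction per output coefficient
    c1 = (t2 - t0 - t1) % p      # (real code offsets by a multiple of p to stay non-negative)
    return c0, c1

p = 2**255 - 19
a, b = (12345, 67890), (424242, 171717)
# Cross-check against the schoolbook formula.
assert fp2_mul_lazy(a, b, p) == ((a[0]*b[0] - a[1]*b[1]) % p,
                                 (a[0]*b[1] + a[1]*b[0]) % p)
```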