* rework assembler register/mem and constraint declarations
* Introduce constraint UnmutatedPointerToWriteMem
* Create invidual memory cell operands
* [Assembly] fully support indirect memory addressing
* fix calling convention for exported procs
* Prepare for switch to intel syntax to avoid clang constant propagation asm symbol name interfering OR pointer+offset addressing
* use modifiers to prevent bad string mixin fo assembler to linker of propagated consts
* Assembly: switch to intel syntax
* with working memory operand - now works with LTO on both GCC and clang and constant folding
* use memory operand in more places
* remove some inline now that we have lto
* cleanup compiler config and benches
* tracer shouldn't force dependencies when unused
* fix cc on linux
* nimble fixes
* update README [skip CI]
* update MacOS CI with Homebrew Clang
* oops nimble bindings disappeared
* more nimble fixes
* fix sha256 exported symbol
* improve constraints on modular addition
* Add extra constraint to force reloading of pointer in reg inputs
* Fix LLVM gold linker running out of registers
* workaround MinGW64 GCC 12.2 bad codegen in t_pairing_cyclotomic_subgroup with LTO
* [Threadpool] Fix syncAll releasing while a thread was attempting to steal + force no exception in tasks
* fix unguarded access on MacOS barriers
* parallel batchadd
* moved import
* add more Fp tests for Twisted Edwards curves
* add fused sqrt+division bench
* Significant fused sqrt+division improvement for any prime field over algorithm described in "High-Speed High-Security Signature", Bernstein et al, p15 "Fast decompression", https://ed25519.cr.yp.to/ed25519-20110705.pdf
* Activate secp256k1 field benches + spring renaming of field multiplication
* addition chains for inversion and sqrt of Curve25519
* Make isSquare use addition chains
* add double-prec mul/square bench for <256-bit prime fields.
* consistent naming for dbl-width
* Isolate double-width Fp2 mul
* Implement double-width complex multiplication
* Lay out Fp4 double-width mul
* Off by p in square Fp4 as well :/
* less copies and stack space in addition chains
* Address https://github.com/mratsim/constantine/issues/154 partly
* Fix#154, faster Fp4 square: less non-residue, no Mul, only square (bit more ops total)
* Fix typo
* better assembly scheduling for add/sub
* Double-width -> Double-precision
* Unred -> Unr
* double-precision modular addition
* Replace canUseNoCarryMontyMul and canUseNoCarryMontySquare by getSpareBits
* Complete the double-precision implementation
* Use double-precision path for Fp4 squaring and mul
* remove mixin annotations
* Lazy reduction in Fp4 prod
* Fix assembly for sum2xMod
* Assembly for double-precision negation
* reduce white spaces in pairing benchmarks
* ADX implies BMI2