Assembly backend (#69)

* Proof-of-Concept Assembly code generator
* Tag inline per procedure so we can easily track the tradeoff on tower fields
* Implement Assembly for modular addition (but with a very curious off-by-one)
* Fix off-by-one for moduli whose most significant bit is not set
* Stash (super fast) alternative that is still off by a carry
* Fix GCC optimizing the ASM away
* Save 1 register to allow compiling for BLS12-381 (in the GMP test)
* The compiler cannot find enough registers if the ASM file is not compiled with -O3
* Add modsub
* Add field negation
* Implement no-carry Assembly-optimized field multiplication
* Expose UseX86ASM to the EC benchmark
* Omit the frame pointer to save registers instead of hardcoding -O3. Also ensure early-clobber constraints for Clang
* Prepare for assembly fallback
* Implement fallback for CPUs that don't support ADX and BMI2
* Add CPU runtime detection
* Update README, closes #66
* Remove commented-out code
parent 504e2a9c25
commit d97bc9b61c
@@ -99,6 +99,9 @@ script:
   - nimble refresh
   - nimble install gmp stew
   - nimble test_parallel
+  - if [[ "$ARCH" != "arm64" ]]; then
+      nimble test_parallel_no_assembler;
+    fi
 branches:
   except:
     - gh-pages
README.md (117 lines changed)
@@ -19,12 +19,18 @@ You can install the development version of the library through nimble with the
 nimble install https://github.com/mratsim/constantine@#master
 ```
 
-For speed it is recommended to prefer Clang, MSVC or ICC over GCC.
-GCC does not properly optimize add-with-carry and sub-with-borrow loops (see [Compiler-caveats](#Compiler-caveats)).
+For speed it is recommended to prefer Clang, MSVC or ICC over GCC (see [Compiler-caveats](#Compiler-caveats)).
 
 Further if using GCC, GCC 7 at minimum is required, previous versions
 generated incorrect add-with-carry code.
 
+On x86-64, inline assembly is used to work around compilers having issues optimizing large integer arithmetic,
+and also to ensure constant-time code.
+This can be deactivated with `"-d:ConstantineASM=false"`:
+- at a significant performance cost with GCC (~50% slower than Clang).
+- at a missed opportunity on recent CPUs that support MULX/ADCX/ADOX instructions (~60% faster than Clang).
+- There is a 2.4x performance ratio between plain GCC and GCC with inline assembly.
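The flag-gating mechanism described above can be pictured with a minimal Nim sketch (hypothetical names, not Constantine's actual config module):

```nim
# Hypothetical sketch: how a -d:ConstantineASM=false switch can gate
# the assembly backend at compile time. Names are illustrative.
const ConstantineASM {.booldefine.} = true

const UseX86ASM = ConstantineASM and defined(amd64) and
                  (defined(gcc) or defined(clang))

when UseX86ASM:
  # import and dispatch to the x86-64 inline-assembly kernels
  echo "using inline-assembly kernels"
else:
  # portable pure-Nim fallback
  echo "using portable fallback kernels"
```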
 ## Target audience
 
 The library aims to be a portable, compact and hardened library for elliptic curve cryptography needs, in particular for blockchain protocols and zero-knowledge proof systems.
@@ -39,10 +45,13 @@ in this order
 ## Curves supported
 
 At the moment the following curves are supported, adding a new curve only requires adding the prime modulus
-and its bitsize in [constantine/config/curves.nim](constantine/config/curves.nim).
+and its bitsize in [constantine/config/curves.nim](constantine/config/curves_declaration.nim).
 
 The following curves are configured:
 
+> Note: At the moment, finite field arithmetic is fully supported
+> but elliptic curve arithmetic is work-in-progress.
+
 ### ECDH / ECDSA curves
 
 - NIST P-224
@@ -58,7 +67,8 @@ Families:
 - FKM: Fotiadis-Konstantinou-Martindale
 
 Curves:
-- BN254 (Zero-Knowledge Proofs, Snarks, Starks, Zcash, Ethereum 1)
+- BN254_Nogami
+- BN254_Snarks (Zero-Knowledge Proofs, Snarks, Starks, Zcash, Ethereum 1)
 - BLS12-377 (Zexe)
 - BLS12-381 (Algorand, Chia Networks, Dfinity, Ethereum 2, Filecoin, Zcash Sapling)
 - BN446
@@ -137,8 +147,13 @@ To measure the performance of Constantine
 
 ```bash
 git clone https://github.com/mratsim/constantine
-nimble bench_fp_clang
-nimble bench_fp2_clang
+nimble bench_fp        # Using Assembly (+ GCC)
+nimble bench_fp_clang  # Using Clang only
+nimble bench_fp_gcc    # Using GCC only (very slow)
+nimble bench_fp2
+# ...
+nimble bench_ec_g1
+nimble bench_ec_g2
 ```
 
 As mentioned in the [Compiler caveats](#compiler-caveats) section, GCC is up to 2x slower than Clang due to mishandling of carries and register usage.
@@ -146,33 +161,51 @@ As mentioned in the [Compiler caveats](#compiler-caveats) section, GCC is up to
 On my machine, for selected benchmarks on the prime field for popular pairing-friendly curves.
 
 ```
-⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
-==========================================================================================================
-All benchmarks are using constant-time implementations to protect against side-channel attacks.
-
-Compiled with Clang
-Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz (overclocked all-core Turbo @4.1GHz)
-
---------------------------------------------------------------------------------
-Addition        Fp[BN254]         0 ns        0 cycles
-Substraction    Fp[BN254]         0 ns        0 cycles
-Negation        Fp[BN254]         0 ns        0 cycles
-Multiplication  Fp[BN254]        21 ns       65 cycles
-Squaring        Fp[BN254]        18 ns       55 cycles
-Inversion       Fp[BN254]      6266 ns    18799 cycles
---------------------------------------------------------------------------------
-Addition        Fp[BLS12_381]     0 ns        0 cycles
-Substraction    Fp[BLS12_381]     0 ns        0 cycles
-Negation        Fp[BLS12_381]     0 ns        0 cycles
-Multiplication  Fp[BLS12_381]    45 ns      136 cycles
-Squaring        Fp[BLS12_381]    39 ns      118 cycles
-Inversion       Fp[BLS12_381] 15683 ns    47050 cycles
---------------------------------------------------------------------------------
+Compiled with GCC
+Optimization level =>
+  no optimization: false
+  release: true
+  danger: true
+  inline assembly: true
+Using Constantine with 64-bit limbs
+Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz
+
+⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
+i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)
+
+=================================================================================================================
+
+-------------------------------------------------------------------------------------------------------------------------------------------------
+Addition                                           Fp[BN254_Snarks]    333333333.333 ops/s        3 ns/op        9 CPU cycles (approx)
+Substraction                                       Fp[BN254_Snarks]    500000000.000 ops/s        2 ns/op        8 CPU cycles (approx)
+Negation                                           Fp[BN254_Snarks]   1000000000.000 ops/s        1 ns/op        3 CPU cycles (approx)
+Multiplication                                     Fp[BN254_Snarks]     71428571.429 ops/s       14 ns/op       44 CPU cycles (approx)
+Squaring                                           Fp[BN254_Snarks]     71428571.429 ops/s       14 ns/op       44 CPU cycles (approx)
+Inversion (constant-time Euclid)                   Fp[BN254_Snarks]       122579.063 ops/s     8158 ns/op    24474 CPU cycles (approx)
+Inversion via exponentiation p-2 (Little Fermat)   Fp[BN254_Snarks]       153822.489 ops/s     6501 ns/op    19504 CPU cycles (approx)
+Square Root + square check (constant-time)         Fp[BN254_Snarks]       153491.942 ops/s     6515 ns/op    19545 CPU cycles (approx)
+Exp curve order (constant-time) - 254-bit          Fp[BN254_Snarks]       104580.632 ops/s     9562 ns/op    28687 CPU cycles (approx)
+Exp curve order (Leak exponent bits) - 254-bit     Fp[BN254_Snarks]       153798.831 ops/s     6502 ns/op    19506 CPU cycles (approx)
+-------------------------------------------------------------------------------------------------------------------------------------------------
+Addition                                           Fp[BLS12_381]       250000000.000 ops/s        4 ns/op       14 CPU cycles (approx)
+Substraction                                       Fp[BLS12_381]       250000000.000 ops/s        4 ns/op       13 CPU cycles (approx)
+Negation                                           Fp[BLS12_381]      1000000000.000 ops/s        1 ns/op        4 CPU cycles (approx)
+Multiplication                                     Fp[BLS12_381]        35714285.714 ops/s       28 ns/op       84 CPU cycles (approx)
+Squaring                                           Fp[BLS12_381]        35714285.714 ops/s       28 ns/op       85 CPU cycles (approx)
+Inversion (constant-time Euclid)                   Fp[BLS12_381]           43763.676 ops/s    22850 ns/op    68552 CPU cycles (approx)
+Inversion via exponentiation p-2 (Little Fermat)   Fp[BLS12_381]           63983.620 ops/s    15629 ns/op    46889 CPU cycles (approx)
+Square Root + square check (constant-time)         Fp[BLS12_381]           63856.960 ops/s    15660 ns/op    46982 CPU cycles (approx)
+Exp curve order (constant-time) - 255-bit          Fp[BLS12_381]           68535.399 ops/s    14591 ns/op    43775 CPU cycles (approx)
+Exp curve order (Leak exponent bits) - 255-bit     Fp[BLS12_381]           93222.709 ops/s    10727 ns/op    32181 CPU cycles (approx)
+-------------------------------------------------------------------------------------------------------------------------------------------------
 Notes:
-  GCC is significantly slower than Clang on multiprecision arithmetic.
-  The simplest operations might be optimized away by the compiler.
+  - Compilers:
+    Compilers are severely limited on multiprecision arithmetic.
+    Inline Assembly is used by default (nimble bench_fp).
+    Bench without assembly can use "nimble bench_fp_gcc" or "nimble bench_fp_clang".
+    GCC is significantly slower than Clang on multiprecision arithmetic due to catastrophic handling of carries.
+  - The simplest operations might be optimized away by the compiler.
+  - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)
 ```
 
 ### Compiler caveats
@@ -234,25 +267,15 @@ add256:
 retq
 ```
 
+As a workaround, key procedures use inline assembly.
+
 ### Inline assembly
 
-Constantine uses inline assembly for a very restricted use-case: "conditional mov",
-and a temporary use-case "hardware 128-bit division" that will be replaced ASAP (as hardware division is not constant-time).
+While using intrinsics significantly improves code readability, portability, auditability and maintainability,
+Constantine uses inline assembly on x86-64 to ensure performance portability despite poor optimization (for GCC)
+and also to use the dedicated large-integer instructions MULX, ADCX, ADOX that compilers cannot generate.
 
-Using intrinsics otherwise significantly improve code readability, portability, auditability and maintainability.
+The speed improvement on finite field arithmetic is up to 60% with MULX, ADCX, ADOX on BLS12-381 (6 limbs).
 
-#### Future optimizations
-
-In the future more inline assembly primitives might be added provided the performance benefit outvalues the significant complexity.
-In particular, multiprecision multiplication and squaring on x86 can use the instructions MULX, ADCX and ADOX
-to multiply-accumulate on 2 carry chains in parallel (with instruction-level parallelism)
-and improve performance by 15~20% over an uint128-based implementation.
-As no compiler is able to generate such code even when using the `_mulx_u64` and `_addcarryx_u64` intrinsics,
-either the assembly for each supported bigint size must be hardcoded
-or a "compiler" must be implemented in macros that will generate the required inline assembly at compile-time.
-
-Such a compiler can also be used to overcome GCC codegen deficiencies, here is an example for add-with-carry:
-https://github.com/mratsim/finite-fields/blob/d7f6d8bb/macro_add_carry.nim
-
 ## Sizes: code size, stack usage
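To make the GCC caveat concrete, the following is the kind of portable add-with-carry loop the text refers to. It is a minimal sketch, not Constantine's internal API:

```nim
# Portable add-with-carry over 64-bit limbs (illustrative sketch).
# Mapping the carry recovery below onto a single ADC chain is
# exactly what GCC fails to do.
type Limbs4 = array[4, uint64]

func addCarryChain(a: var Limbs4, b: Limbs4): uint64 =
  ## a += b, returning the carry out of the most significant limb.
  var carry = 0'u64
  for i in 0 ..< 4:
    let t = a[i] + b[i]
    let c1 = uint64(t < a[i])   # overflow of a[i] + b[i]
    let s = t + carry
    let c2 = uint64(s < t)      # overflow of adding the previous carry
    a[i] = s
    carry = c1 or c2            # at most one of c1/c2 is set
  carry
```

Clang typically lowers the `c1`/`c2` recovery to a single ADC chain, while GCC tends to materialize the carries in extra registers, which is the behavior described above.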
@@ -286,3 +309,7 @@ or
 * Apache License, Version 2.0, ([LICENSE-APACHEv2](LICENSE-APACHEv2) or http://www.apache.org/licenses/LICENSE-2.0)
 
 at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+This library has **no external dependencies**.
+In particular GMP is used only for testing and differential fuzzing
+and is not linked in the library.
@@ -186,12 +186,19 @@ steps:
       echo "PATH=${PATH}"
       export ucpu=${UCPU}
       nimble test_parallel
-    displayName: 'Testing the package (including GMP)'
+    displayName: 'Testing Constantine with Assembler and with GMP'
+    condition: ne(variables['Agent.OS'], 'Windows_NT')
+
+  - bash: |
+      echo "PATH=${PATH}"
+      export ucpu=${UCPU}
+      nimble test_parallel_no_assembler
+    displayName: 'Testing Constantine without Assembler and with GMP'
     condition: ne(variables['Agent.OS'], 'Windows_NT')
 
   - bash: |
       echo "PATH=${PATH}"
       export ucpu=${UCPU}
       nimble test_no_gmp
-    displayName: 'Testing the package (without GMP)'
+    displayName: 'Testing the package (without Assembler or GMP)'
     condition: eq(variables['Agent.OS'], 'Windows_NT')
@@ -64,8 +64,4 @@ proc main() =
   separator()
 
 main()
-echo "\nNotes:"
-echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
-echo " - The simplest operations might be optimized away by the compiler."
-echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
+notes()
@@ -65,8 +65,4 @@ proc main() =
   separator()
 
 main()
-echo "\nNotes:"
-echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
-echo " - The simplest operations might be optimized away by the compiler."
-echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
+notes()
@@ -14,7 +14,7 @@
 
 import
   # Internals
-  ../constantine/config/curves,
+  ../constantine/config/[curves, common],
   ../constantine/arithmetic,
   ../constantine/io/io_bigints,
   ../constantine/elliptic/[ec_weierstrass_projective, ec_scalar_mul, ec_endomorphism_accel],
@@ -57,7 +57,11 @@ elif defined(icc):
 else:
   echo "\nCompiled with an unknown compiler"
 
-echo "Optimization level => no optimization: ", not defined(release), " | release: ", defined(release), " | danger: ", defined(danger)
+echo "Optimization level => "
+echo "  no optimization: ", not defined(release)
+echo "  release: ", defined(release)
+echo "  danger: ", defined(danger)
+echo "  inline assembly: ", UseX86ASM
 
 when (sizeof(int) == 4) or defined(Constantine32):
   echo "⚠️ Warning: using Constantine with 32-bit limbs"
@@ -84,6 +88,16 @@ proc report(op, elliptic: string, start, stop: MonoTime, startClk, stopClk: int6
   else:
     echo &"{op:<60} {elliptic:<40} {throughput:>15.3f} ops/s {ns:>9} ns/op"
 
+proc notes*() =
+  echo "Notes:"
+  echo "  - Compilers:"
+  echo "    Compilers are severely limited on multiprecision arithmetic."
+  echo "    Inline Assembly is used by default (nimble bench_fp)."
+  echo "    Bench without assembly can use \"nimble bench_fp_gcc\" or \"nimble bench_fp_clang\"."
+  echo "    GCC is significantly slower than Clang on multiprecision arithmetic due to catastrophic handling of carries."
+  echo "  - The simplest operations might be optimized away by the compiler."
+  echo "  - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
+
 macro fixEllipticDisplay(T: typedesc): untyped =
   # At compile-time, enums are integers and their display is buggy
   # we get the Curve ID instead of the curve name.
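For reference, the ops/s and ns/op columns printed by `report` are two views of the same measurement. A minimal sketch of that arithmetic (assumed shape; the real proc also takes TSC readings for the cycle column):

```nim
import std/[monotimes, times]

# Sketch of the bench reporting arithmetic (assumed shape, not the
# real `report`, which also prints approximate CPU cycles).
proc reportSketch(op: string, iters: int64, start, stop: MonoTime) =
  let elapsedNs = inNanoseconds(stop - start)   # total wall time in ns
  let nsPerOp = elapsedNs div iters             # ns/op column
  let throughput = 1e9 * float64(iters) / float64(elapsedNs)  # ops/s column
  echo op, ": ", throughput, " ops/s, ", nsPerOp, " ns/op"
```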
@@ -14,7 +14,7 @@
 
 import
   # Internals
-  ../constantine/config/curves,
+  ../constantine/config/[curves, common],
   ../constantine/arithmetic,
   ../constantine/towers,
   # Helpers
@@ -54,7 +54,11 @@ elif defined(icc):
 else:
   echo "\nCompiled with an unknown compiler"
 
-echo "Optimization level => no optimization: ", not defined(release), " | release: ", defined(release), " | danger: ", defined(danger)
+echo "Optimization level => "
+echo "  no optimization: ", not defined(release)
+echo "  release: ", defined(release)
+echo "  danger: ", defined(danger)
+echo "  inline assembly: ", UseX86ASM
 
 when (sizeof(int) == 4) or defined(Constantine32):
   echo "⚠️ Warning: using Constantine with 32-bit limbs"
@@ -81,6 +85,16 @@ proc report(op, field: string, start, stop: MonoTime, startClk, stopClk: int64,
   else:
     echo &"{op:<50} {field:<18} {throughput:>15.3f} ops/s {ns:>9} ns/op"
 
+proc notes*() =
+  echo "Notes:"
+  echo "  - Compilers:"
+  echo "    Compilers are severely limited on multiprecision arithmetic."
+  echo "    Inline Assembly is used by default (nimble bench_fp)."
+  echo "    Bench without assembly can use \"nimble bench_fp_gcc\" or \"nimble bench_fp_clang\"."
+  echo "    GCC is significantly slower than Clang on multiprecision arithmetic due to catastrophic handling of carries."
+  echo "  - The simplest operations might be optimized away by the compiler."
+  echo "  - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
+
 macro fixFieldDisplay(T: typedesc): untyped =
   # At compile-time, enums are integers and their display is buggy
   # we get the Curve ID instead of the curve name.
@@ -59,8 +59,4 @@ proc main() =
   separator()
 
 main()
-echo "Notes:"
-echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
-echo " - The simplest operations might be optimized away by the compiler."
-echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
+notes()
@@ -50,8 +50,4 @@ proc main() =
   separator()
 
 main()
-echo "Notes:"
-echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
-echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
-echo " - The tower of extension fields chosen can lead to a large difference of performance between primes of similar bitwidth."
+notes()
@@ -51,8 +51,4 @@ proc main() =
   separator()
 
 main()
-echo "Notes:"
-echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
-echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
-echo " - The tower of extension fields chosen can lead to a large difference of performance between primes of similar bitwidth."
+notes()
@@ -50,8 +50,4 @@ proc main() =
   separator()
 
 main()
-echo "Notes:"
-echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
-echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
-echo " - The tower of extension fields chosen can lead to a large difference of performance between primes of similar bitwidth."
+notes()
@@ -106,7 +106,7 @@ proc runBench(benchName: string, compiler = "") =
 
   var cc = ""
   if compiler != "":
-    cc = "--cc:" & compiler
+    cc = "--cc:" & compiler & " -d:ConstantineASM=false"
   exec "nim c " & cc &
     " -d:danger --verbosity:0 -o:build/" & benchName & "_" & compiler &
    " -r --hints:off --warnings:off benchmarks/" & benchName & ".nim"
@@ -209,6 +209,45 @@ task test_parallel, "Run all tests in parallel (via GNU parallel)":
     runBench("bench_ec_g1")
     runBench("bench_ec_g2")
 
+task test_parallel_no_assembler, "Run all tests (without macro assembler) in parallel (via GNU parallel)":
+  # -d:testingCurves is configured in a *.nim.cfg for convenience
+  let cmdFile = true # open(buildParallel, mode = fmWrite) # Nimscript doesn't support IO :/
+  exec "> " & buildParallel
+
+  for td in testDesc:
+    if td.path in useDebug:
+      test "-d:debugConstantine -d:ConstantineASM=false", td.path, cmdFile
+    else:
+      test " -d:ConstantineASM=false", td.path, cmdFile
+
+  # cmdFile.close()
+  # Execute everything in parallel with GNU parallel
+  exec "parallel --keep-order --group < " & buildParallel
+
+  exec "> " & buildParallel
+  if sizeof(int) == 8: # 32-bit tests on 64-bit arch
+    for td in testDesc:
+      if td.path in useDebug:
+        test "-d:Constantine32 -d:debugConstantine -d:ConstantineASM=false", td.path, cmdFile
+      else:
+        test "-d:Constantine32 -d:ConstantineASM=false", td.path, cmdFile
+  # cmdFile.close()
+  # Execute everything in parallel with GNU parallel
+  exec "parallel --keep-order --group < " & buildParallel
+
+  # Now run the benchmarks
+  #
+  # Benchmarks compile and run
+  # Ensure benchmarks stay relevant. Ignore Windows 32-bit at the moment
+  if not defined(windows) or not (existsEnv"UCPU" or getEnv"UCPU" == "i686"):
+    runBench("bench_fp")
+    runBench("bench_fp2")
+    runBench("bench_fp6")
+    runBench("bench_fp12")
+    runBench("bench_ec_g1")
+    runBench("bench_ec_g2")
+
 task test_parallel_no_gmp, "Run all tests in parallel (via GNU parallel)":
   # -d:testingCurves is configured in a *.nim.cfg for convenience
   let cmdFile = true # open(buildParallel, mode = fmWrite) # Nimscript doesn't support IO :/
@@ -29,11 +29,13 @@ import
   ../config/[common, type_fp, curves],
   ./bigints, ./limbs_montgomery
 
+when UseX86ASM:
+  import ./finite_fields_asm_x86
+
 export Fp
 
 # No exceptions allowed
 {.push raises: [].}
-{.push inline.}
 
 # ############################################################
 #
@@ -41,15 +43,15 @@ export Fp
 #
 # ############################################################
 
-func fromBig*[C: static Curve](T: type Fp[C], src: BigInt): Fp[C] {.noInit.} =
+func fromBig*[C: static Curve](T: type Fp[C], src: BigInt): Fp[C] {.noInit, inline.} =
   ## Convert a BigInt to its Montgomery form
   result.mres.montyResidue(src, C.Mod, C.getR2modP(), C.getNegInvModWord(), C.canUseNoCarryMontyMul())
 
-func fromBig*[C: static Curve](dst: var Fp[C], src: BigInt) =
+func fromBig*[C: static Curve](dst: var Fp[C], src: BigInt) {.inline.} =
   ## Convert a BigInt to its Montgomery form
   dst.mres.montyResidue(src, C.Mod, C.getR2modP(), C.getNegInvModWord(), C.canUseNoCarryMontyMul())
 
-func toBig*(src: Fp): auto {.noInit.} =
+func toBig*(src: Fp): auto {.noInit, inline.} =
   ## Convert a finite-field element to a BigInt in natural representation
   var r {.noInit.}: typeof(src.mres)
   r.redc(src.mres, Fp.C.Mod, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontyMul())
@@ -58,14 +60,17 @@ func toBig*(src: Fp): auto {.noInit.} =
 # Copy
 # ------------------------------------------------------------
 
-func ccopy*(a: var Fp, b: Fp, ctl: SecretBool) =
+func ccopy*(a: var Fp, b: Fp, ctl: SecretBool) {.inline.} =
   ## Constant-time conditional copy
   ## If ctl is true: b is copied into a
   ## if ctl is false: b is not copied and a is unmodified
   ## Time and memory accesses are the same whether a copy occurs or not
-  ccopy(a.mres, b.mres, ctl)
+  when UseX86ASM:
+    ccopy_asm(a.mres.limbs, b.mres.limbs, ctl)
+  else:
+    ccopy(a.mres, b.mres, ctl)
 
-func cswap*(a, b: var Fp, ctl: CTBool) =
+func cswap*(a, b: var Fp, ctl: CTBool) {.inline.} =
   ## Swap ``a`` and ``b`` if ``ctl`` is true
   ##
   ## Constant-time:
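The portable `ccopy` branch relies on masked selection. A one-limb sketch of the idea (illustrative only, not Constantine's actual implementation):

```nim
# One-limb sketch of constant-time conditional copy via masking.
# The same sequence of operations runs whether ctl is true or false,
# so timing and memory accesses do not leak the condition.
func ccopySketch(a: var uint64, b: uint64, ctl: bool) =
  let mask = 0'u64 - uint64(ctl)        # all-ones if ctl, all-zeros otherwise
  a = (a and not mask) or (b and mask)  # select b when mask is all-ones
```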
@@ -93,80 +98,108 @@ func cswap*(a, b: var Fp, ctl: CTBool) =
 # In practice I'm not aware of such prime being using in elliptic curves.
 # 2^127 - 1 and 2^521 - 1 are used but 127 and 521 are not multiple of 32/64
 
-func `==`*(a, b: Fp): SecretBool =
+func `==`*(a, b: Fp): SecretBool {.inline.} =
   ## Constant-time equality check
   a.mres == b.mres
 
-func isZero*(a: Fp): SecretBool =
+func isZero*(a: Fp): SecretBool {.inline.} =
   ## Constant-time check if zero
   a.mres.isZero()
 
-func isOne*(a: Fp): SecretBool =
+func isOne*(a: Fp): SecretBool {.inline.} =
   ## Constant-time check if one
   a.mres == Fp.C.getMontyOne()
 
-func setZero*(a: var Fp) =
+func setZero*(a: var Fp) {.inline.} =
   ## Set ``a`` to zero
   a.mres.setZero()
 
-func setOne*(a: var Fp) =
+func setOne*(a: var Fp) {.inline.} =
   ## Set ``a`` to one
   # Note: we need 1 in Montgomery residue form
   # TODO: Nim codegen is not optimal, it uses a temporary
   #       Check if the compiler optimizes it away
   a.mres = Fp.C.getMontyOne()
 
-func `+=`*(a: var Fp, b: Fp) =
+func `+=`*(a: var Fp, b: Fp) {.inline.} =
   ## In-place addition modulo p
-  var overflowed = add(a.mres, b.mres)
-  overflowed = overflowed or not(a.mres < Fp.C.Mod)
-  discard csub(a.mres, Fp.C.Mod, overflowed)
+  when UseX86ASM and a.mres.limbs.len <= 6: # TODO: handle spilling
+    addmod_asm(a.mres.limbs, b.mres.limbs, Fp.C.Mod.limbs)
+  else:
+    var overflowed = add(a.mres, b.mres)
+    overflowed = overflowed or not(a.mres < Fp.C.Mod)
+    discard csub(a.mres, Fp.C.Mod, overflowed)
 
-func `-=`*(a: var Fp, b: Fp) =
+func `-=`*(a: var Fp, b: Fp) {.inline.} =
   ## In-place substraction modulo p
-  let underflowed = sub(a.mres, b.mres)
-  discard cadd(a.mres, Fp.C.Mod, underflowed)
+  when UseX86ASM and a.mres.limbs.len <= 6: # TODO: handle spilling
+    submod_asm(a.mres.limbs, b.mres.limbs, Fp.C.Mod.limbs)
+  else:
+    let underflowed = sub(a.mres, b.mres)
+    discard cadd(a.mres, Fp.C.Mod, underflowed)
 
-func double*(a: var Fp) =
+func double*(a: var Fp) {.inline.} =
   ## Double ``a`` modulo p
-  var overflowed = double(a.mres)
-  overflowed = overflowed or not(a.mres < Fp.C.Mod)
-  discard csub(a.mres, Fp.C.Mod, overflowed)
+  when UseX86ASM and a.mres.limbs.len <= 6: # TODO: handle spilling
+    addmod_asm(a.mres.limbs, a.mres.limbs, Fp.C.Mod.limbs)
+  else:
+    var overflowed = double(a.mres)
+    overflowed = overflowed or not(a.mres < Fp.C.Mod)
+    discard csub(a.mres, Fp.C.Mod, overflowed)
 
-func sum*(r: var Fp, a, b: Fp) =
+func sum*(r: var Fp, a, b: Fp) {.inline.} =
   ## Sum ``a`` and ``b`` into ``r`` modulo p
   ## r is initialized/overwritten
-  var overflowed = r.mres.sum(a.mres, b.mres)
-  overflowed = overflowed or not(r.mres < Fp.C.Mod)
-  discard csub(r.mres, Fp.C.Mod, overflowed)
+  when UseX86ASM and a.mres.limbs.len <= 6: # TODO: handle spilling
+    r = a
+    addmod_asm(r.mres.limbs, b.mres.limbs, Fp.C.Mod.limbs)
+  else:
+    var overflowed = r.mres.sum(a.mres, b.mres)
+    overflowed = overflowed or not(r.mres < Fp.C.Mod)
+    discard csub(r.mres, Fp.C.Mod, overflowed)
 
-func diff*(r: var Fp, a, b: Fp) =
+func diff*(r: var Fp, a, b: Fp) {.inline.} =
   ## Substract `b` from `a` and store the result into `r`.
   ## `r` is initialized/overwritten
-  var underflowed = r.mres.diff(a.mres, b.mres)
-  discard cadd(r.mres, Fp.C.Mod, underflowed)
+  when UseX86ASM and a.mres.limbs.len <= 6: # TODO: handle spilling
+    var t = a # Handle aliasing r == b
+    submod_asm(t.mres.limbs, b.mres.limbs, Fp.C.Mod.limbs)
+    r = t
+  else:
+    var underflowed = r.mres.diff(a.mres, b.mres)
+    discard cadd(r.mres, Fp.C.Mod, underflowed)
 
-func double*(r: var Fp, a: Fp) =
+func double*(r: var Fp, a: Fp) {.inline.} =
   ## Double ``a`` into ``r``
   ## `r` is initialized/overwritten
-  var overflowed = r.mres.double(a.mres)
-  overflowed = overflowed or not(r.mres < Fp.C.Mod)
-  discard csub(r.mres, Fp.C.Mod, overflowed)
+  when UseX86ASM and a.mres.limbs.len <= 6: # TODO: handle spilling
+    r = a
+    addmod_asm(r.mres.limbs, a.mres.limbs, Fp.C.Mod.limbs)
+  else:
+    var overflowed = r.mres.double(a.mres)
+    overflowed = overflowed or not(r.mres < Fp.C.Mod)
+    discard csub(r.mres, Fp.C.Mod, overflowed)
 
-func prod*(r: var Fp, a, b: Fp) =
+func prod*(r: var Fp, a, b: Fp) {.inline.} =
   ## Store the product of ``a`` by ``b`` modulo p into ``r``
   ## ``r`` is initialized / overwritten
   r.mres.montyMul(a.mres, b.mres, Fp.C.Mod, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontyMul())
 
-func square*(r: var Fp, a: Fp) =
+func square*(r: var Fp, a: Fp) {.inline.} =
   ## Squaring modulo p
   r.mres.montySquare(a.mres, Fp.C.Mod, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontySquare())
 
-func neg*(r: var Fp, a: Fp) =
+func neg*(r: var Fp, a: Fp) {.inline.} =
   ## Negate modulo p
-  discard r.mres.diff(Fp.C.Mod, a.mres)
+  when UseX86ASM and defined(gcc):
+    # Clang and every compiler besides GCC
+    # can cleanly optimize this, especially on Fp2
+    negmod_asm(r.mres.limbs, a.mres.limbs, Fp.C.Mod.limbs)
+  else:
+    discard r.mres.diff(Fp.C.Mod, a.mres)
 
-func div2*(a: var Fp) =
+func div2*(a: var Fp) {.inline.} =
   ## Modular division by 2
   a.mres.div2_modular(Fp.C.getPrimePlus1div2())
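The add-then-conditionally-subtract pattern used by `+=`, `sum` and `double` above reads most clearly on a single limb. A branchless sketch, assuming a, b < M (illustrative; the real code works on multi-limb bigints with constant-time comparisons):

```nim
# Single-limb sketch of branchless modular addition: a = (a + b) mod M,
# assuming a, b < M. Illustrative only; Constantine operates on bigints.
func addmodSketch(a: var uint64, b, M: uint64) =
  let s = a + b                      # may wrap around 2^64
  let carry = uint64(s < a)          # 1 if the addition wrapped
  let geM = uint64(s >= M)           # 1 if the sum reached the modulus
  let mask = 0'u64 - (carry or geM)  # all-ones when a subtraction is needed
  a = s - (M and mask)               # wraps back to the correct residue
```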
@@ -178,7 +211,7 @@ func div2*(a: var Fp) =
 #
 # Internally those procedures will allocate extra scratchspace on the stack
 
-func pow*(a: var Fp, exponent: BigInt) =
+func pow*(a: var Fp, exponent: BigInt) {.inline.} =
   ## Exponentiation modulo p
   ## ``a``: a field element to be exponentiated
   ## ``exponent``: a big integer
@@ -191,7 +224,7 @@ func pow*(a: var Fp, exponent: BigInt) =
     Fp.C.canUseNoCarryMontySquare()
   )
 
-func pow*(a: var Fp, exponent: openarray[byte]) =
+func pow*(a: var Fp, exponent: openarray[byte]) {.inline.} =
   ## Exponentiation modulo p
   ## ``a``: a field element to be exponentiated
   ## ``exponent``: a big integer in canonical big endian representation
@@ -204,7 +237,7 @@ func pow*(a: var Fp, exponent: openarray[byte]) =
     Fp.C.canUseNoCarryMontySquare()
   )
 
-func powUnsafeExponent*(a: var Fp, exponent: BigInt) =
+func powUnsafeExponent*(a: var Fp, exponent: BigInt) {.inline.} =
   ## Exponentiation modulo p
   ## ``a``: a field element to be exponentiated
   ## ``exponent``: a big integer
@@ -224,7 +257,7 @@ func powUnsafeExponent*(a: var Fp, exponent: BigInt) =
     Fp.C.canUseNoCarryMontySquare()
   )
 
-func powUnsafeExponent*(a: var Fp, exponent: openarray[byte]) =
+func powUnsafeExponent*(a: var Fp, exponent: openarray[byte]) {.inline.} =
   ## Exponentiation modulo p
   ## ``a``: a field element to be exponentiated
   ## ``exponent``: a big integer in canonical big endian representation
@@ -250,7 +283,7 @@ func powUnsafeExponent*(a: var Fp, exponent: openarray[byte]) =
 #
 # ############################################################
 
-func isSquare*[C](a: Fp[C]): SecretBool =
+func isSquare*[C](a: Fp[C]): SecretBool {.inline.} =
   ## Returns true if ``a`` is a square (quadratic residue) in 𝔽p
   ##
   ## Assumes that the prime modulus ``p`` is public.
@@ -272,7 +305,7 @@ func isSquare*[C](a: Fp[C]): SecretBool =
     xi.mres == C.getMontyPrimeMinus1()
   )
 
-func sqrt_p3mod4[C](a: var Fp[C]) =
+func sqrt_p3mod4[C](a: var Fp[C]) {.inline.} =
   ## Compute the square root of ``a``
   ##
   ## This requires ``a`` to be a square
@@ -286,7 +319,7 @@ func sqrt_p3mod4[C](a: var Fp[C]) =
   static: doAssert BaseType(C.Mod.limbs[0]) mod 4 == 3
   a.powUnsafeExponent(C.getPrimePlus1div4_BE())
 
-func sqrt_invsqrt_p3mod4[C](sqrt, invsqrt: var Fp[C], a: Fp[C]) =
+func sqrt_invsqrt_p3mod4[C](sqrt, invsqrt: var Fp[C], a: Fp[C]) {.inline.} =
   ## If ``a`` is a square, compute the square root of ``a`` in sqrt
   ## and the inverse square root of a in invsqrt
   ##
@@ -307,7 +340,7 @@ func sqrt_invsqrt_p3mod4[C](sqrt, invsqrt: var Fp[C], a: Fp[C]) =
   # √a ≡ a * 1/√a ≡ a^((p+1)/4) (mod p)
   sqrt.prod(invsqrt, a)
 
-func sqrt_invsqrt_if_square_p3mod4[C](sqrt, invsqrt: var Fp[C], a: Fp[C]): SecretBool =
+func sqrt_invsqrt_if_square_p3mod4[C](sqrt, invsqrt: var Fp[C], a: Fp[C]): SecretBool {.inline.} =
   ## If ``a`` is a square, compute the square root of ``a`` in sqrt
   ## and the inverse square root of a in invsqrt
   ##
@@ -319,7 +352,7 @@ func sqrt_invsqrt_if_square_p3mod4[C](sqrt, invsqrt: var Fp[C], a: Fp[C]): Secre
   euler.prod(sqrt, invsqrt)
   result = not(euler.mres == C.getMontyPrimeMinus1())
 
-func sqrt_if_square_p3mod4[C](a: var Fp[C]): SecretBool =
+func sqrt_if_square_p3mod4[C](a: var Fp[C]): SecretBool {.inline.} =
   ## If ``a`` is a square, compute the square root of ``a``
   ## if not, ``a`` is unmodified.
   ##
@@ -334,7 +367,7 @@ func sqrt_if_square_p3mod4[C](a: var Fp[C]): SecretBool =
   result = sqrt_invsqrt_if_square_p3mod4(sqrt, invsqrt, a)
   a.ccopy(sqrt, result)
 
-func sqrt*[C](a: var Fp[C]) =
+func sqrt*[C](a: var Fp[C]) {.inline.} =
   ## Compute the square root of ``a``
   ##
   ## This requires ``a`` to be a square
@@ -349,7 +382,7 @@ func sqrt*[C](a: var Fp[C]) =
   else:
     {.error: "Square root is only implemented for p ≡ 3 (mod 4)".}
 
-func sqrt_if_square*[C](a: var Fp[C]): SecretBool =
+func sqrt_if_square*[C](a: var Fp[C]): SecretBool {.inline.} =
   ## If ``a`` is a square, compute the square root of ``a``
   ## if not, ``a`` is unmodified.
   ##
@@ -361,7 +394,7 @@ func sqrt_if_square*[C](a: var Fp[C]): SecretBool =
   else:
     {.error: "Square root is only implemented for p ≡ 3 (mod 4)".}
 
-func sqrt_invsqrt*[C](sqrt, invsqrt: var Fp[C], a: Fp[C]) =
+func sqrt_invsqrt*[C](sqrt, invsqrt: var Fp[C], a: Fp[C]) {.inline.} =
   ## Compute the square root and inverse square root of ``a``
   ##
   ## This requires ``a`` to be a square
@@ -376,7 +409,7 @@ func sqrt_invsqrt*[C](sqrt, invsqrt: var Fp[C], a: Fp[C]) =
   else:
     {.error: "Square root is only implemented for p ≡ 3 (mod 4)".}
 
-func sqrt_invsqrt_if_square*[C](sqrt, invsqrt: var Fp[C], a: Fp[C]): SecretBool =
+func sqrt_invsqrt_if_square*[C](sqrt, invsqrt: var Fp[C], a: Fp[C]): SecretBool {.inline.} =
   ## Compute the square root and inverse square root of ``a``
   ##
   ## This returns true if ``a`` is square and sqrt/invsqrt contains the square root/inverse square root
@@ -403,15 +436,15 @@ func sqrt_invsqrt_if_square*[C](sqrt, invsqrt: var Fp[C], a: Fp[C]): SecretBool
 # - Those that return a field element
 # - Those that internally allocate a temporary field element
 
-func `+`*(a, b: Fp): Fp {.noInit.} =
+func `+`*(a, b: Fp): Fp {.noInit, inline.} =
   ## Addition modulo p
   result.sum(a, b)
 
-func `-`*(a, b: Fp): Fp {.noInit.} =
+func `-`*(a, b: Fp): Fp {.noInit, inline.} =
   ## Substraction modulo p
   result.diff(a, b)
 
-func `*`*(a, b: Fp): Fp {.noInit.} =
+func `*`*(a, b: Fp): Fp {.noInit, inline.} =
   ## Multiplication modulo p
   ##
   ## It is recommended to assign with {.noInit.}
@@ -419,11 +452,11 @@ func `*`*(a, b: Fp): Fp {.noInit.} =
   ## routine will zero init internally the result.
   result.prod(a, b)
 
-func `*=`*(a: var Fp, b: Fp) =
+func `*=`*(a: var Fp, b: Fp) {.inline.} =
   ## Multiplication modulo p
   a.prod(a, b)
 
-func square*(a: var Fp) =
+func square*(a: var Fp) {.inline.} =
   ## Squaring modulo p
   a.mres.montySquare(a.mres, Fp.C.Mod, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontySquare())
@@ -0,0 +1,223 @@
# Constantine
# Copyright (c) 2018-2019    Status Research & Development GmbH
# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
# Licensed and distributed under either of
#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
# at your option. This file may not be copied, modified, or distributed except according to those terms.

import
  # Standard library
  std/macros,
  # Internal
  ../config/common,
  ../primitives,
  ./limbs

# ############################################################
#
#        Assembly implementation of finite fields
#
# ############################################################

# Note: We can refer to at most 30 registers in inline assembly
#       and "InputOutput" registers count double
#       They are nice to let the compiler deal with mov
#       but too constraining so we move things ourselves.

static: doAssert UseX86ASM

# Necessary for the compiler to find enough registers (enabled at -O1)
{.localPassC:"-fomit-frame-pointer".}

# Montgomery multiplication
# ------------------------------------------------------------
# Fallback when no ADX and BMI2 support (MULX, ADCX, ADOX)

proc finalSub*(
       ctx: var Assembler_x86,
       r: Operand or OperandArray,
       t, M, scratch: OperandArray
     ) =
  ## Reduce `t` into `r` modulo `M`
  let N = M.len
  ctx.comment "Final substraction"
  for i in 0 ..< N:
    ctx.mov scratch[i], t[i]
    if i == 0:
      ctx.sub scratch[i], M[i]
    else:
      ctx.sbb scratch[i], M[i]

  # If we borrowed it means that we were smaller than
  # the modulus and we don't need "scratch"
  for i in 0 ..< N:
    ctx.cmovnc t[i], scratch[i]
    ctx.mov r[i], t[i]

macro montMul_CIOS_nocarry_gen[N: static int](r_MM: var Limbs[N], a_MM, b_MM, M_MM: Limbs[N], m0ninv_MM: BaseType): untyped =
  ## Generate an optimized Montgomery Multiplication kernel
  ## using the CIOS method
  ##
  ## The multiplication and reduction are further merged in the same loop
  ##
  ## This requires the most significant word of the Modulus
  ##   M[^1] < high(SecretWord) shr 2 (i.e. less than 0b00111...1111)
  ## https://hackmd.io/@zkteam/modular_multiplication

  result = newStmtList()

  var ctx = init(Assembler_x86, BaseType)
  let
    scratchSlots = max(N, 6)

    # We could force M as immediate by specializing per moduli
    M = init(OperandArray, nimSymbol = M_MM, N, PointerInReg, Input)
    # If N is too big, we need to spill registers. TODO.
    t = init(OperandArray, nimSymbol = ident"t", N, ElemsInReg, Output_EarlyClobber)
    # MultiPurpose Register slots
    scratch = init(OperandArray, nimSymbol = ident"scratch", scratchSlots, ElemsInReg, InputOutput_EnsureClobber)

    # MUL requires RAX and RDX
    rRAX = Operand(
      desc: OperandDesc(
        asmId: "[rax]",
        nimSymbol: ident"rax",
        rm: RAX,
        constraint: Output_EarlyClobber,
        cEmit: "rax"
      )
    )

    rRDX = Operand(
      desc: OperandDesc(
        asmId: "[rdx]",
        nimSymbol: ident"rdx",
        rm: RDX,
        constraint: Output_EarlyClobber,
        cEmit: "rdx"
      )
    )

    m0ninv = Operand(
      desc: OperandDesc(
        asmId: "[m0ninv]",
        nimSymbol: m0ninv_MM,
        rm: MemOffsettable,
        constraint: Input,
        cEmit: "&" & $m0ninv_MM
      )
    )

    # We're really constrained by registers and somehow setting as memory doesn't help
    # So we store the result `r` in the scratch space and then reload it in RDX
    # before the scratchspace is used in final substraction
    a = scratch[0].asArrayAddr(len = N) # Store the `a` operand
    b = scratch[1].asArrayAddr(len = N) # Store the `b` operand
    A = scratch[2]                      # High part of extended precision multiplication
    C = scratch[3]
    m = scratch[4]                      # Stores (t[0] * m0ninv) mod 2^w
    r = scratch[5]                      # Stores the `r` operand

  # Registers used:
  # - 1 for `M`
  # - 6 for `t` (at most)
  # - 6 for `scratch`
  # - 2 for RAX and RDX
  # Total 15 out of 16
  # We can save 1 by hardcoding M as immediate (and m0ninv)
  # but this prevents reusing the same code for multiple curves like BLS12-377 and BLS12-381
  # We might be able to save registers by having `r` and `M` be memory operands as well

  let tsym = t.nimSymbol
  let scratchSym = scratch.nimSymbol
  let eax = rRAX.desc.nimSymbol
  let edx = rRDX.desc.nimSymbol
  result.add quote do:
    static: doAssert: sizeof(SecretWord) == sizeof(ByteAddress)

    var `tsym`: typeof(`r_MM`) # zero init
    # Assumes 64-bit limbs on 64-bit arch (or you can't store an address)
    var `scratchSym` {.noInit.}: Limbs[`scratchSlots`]
    var `eax`{.noInit.}, `edx`{.noInit.}: BaseType

    `scratchSym`[0] = cast[SecretWord](`a_MM`[0].unsafeAddr)
    `scratchSym`[1] = cast[SecretWord](`b_MM`[0].unsafeAddr)
    `scratchSym`[5] = cast[SecretWord](`r_MM`[0].unsafeAddr)

  # Algorithm
  # -----------------------------------------
  # for i=0 to N-1
  #   (A, t[0]) <- a[0] * b[i] + t[0]
  #   m <- (t[0] * m0ninv) mod 2^w
  #   (C, _) <- m * M[0] + t[0]
  #   for j=1 to N-1
  #     (A, t[j]) <- a[j] * b[i] + A + t[j]
  #     (C, t[j-1]) <- m * M[j] + C + t[j]
  #
  #   t[N-1] = C + A

  # No register spilling handling
  doAssert N <= 6, "The Assembly-optimized montgomery multiplication requires at most 6 limbs."

  for i in 0 ..< N:
    # (A, t[0]) <- a[0] * b[i] + t[0]
    ctx.mov rRAX, a[0]
    ctx.mul rdx, rax, b[i], rax
    if i == 0: # overwrite t[0]
      ctx.mov t[0], rRAX
    else:      # Accumulate in t[0]
      ctx.add t[0], rRAX
      ctx.adc rRDX, 0
    ctx.mov A, rRDX

    # m <- (t[0] * m0ninv) mod 2^w
    ctx.mov m, m0ninv
    ctx.imul m, t[0]

    # (C, _) <- m * M[0] + t[0]
    ctx.`xor` C, C
    ctx.mov rRAX, M[0]
    ctx.mul rdx, rax, m, rax
    ctx.add rRAX, t[0]
    ctx.adc C, rRDX

    for j in 1 ..< N:
      # (A, t[j]) <- a[j] * b[i] + A + t[j]
      ctx.mov rRAX, a[j]
      ctx.mul rdx, rax, b[i], rax
      if i == 0:
        ctx.mov t[j], A
      else:
        ctx.add t[j], A
        ctx.adc rRDX, 0
      ctx.`xor` A, A
      ctx.add t[j], rRAX
      ctx.adc A, rRDX

      # (C, t[j-1]) <- m * M[j] + C + t[j]
      ctx.mov rRAX, M[j]
      ctx.mul rdx, rax, m, rax
      ctx.add C, t[j]
      ctx.adc rRDX, 0
      ctx.add C, rRAX
      ctx.adc rRDX, 0
      ctx.mov t[j-1], C
      ctx.mov C, rRDX

    ctx.add A, C
    ctx.mov t[N-1], A

  ctx.mov rRDX, r
  let r2 = rRDX.asArrayAddr(len = N)

  ctx.finalSub(
    r2, t, M,
    scratch
  )

  result.add ctx.generate

func montMul_CIOS_nocarry_asm*(r: var Limbs, a, b, M: Limbs, m0ninv: BaseType) =
  ## Constant-time modular multiplication
  montMul_CIOS_nocarry_gen(r, a, b, M, m0ninv)
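The recurrence in the algorithm comment above can be cross-checked against a plain-Nim reference. The following sketch uses 32-bit limbs so that ordinary 64-bit products suffice (illustrative only; the generated assembly uses 64-bit limbs with MUL/ADC and the same final subtraction):

```nim
# Reference sketch of the no-carry CIOS Montgomery multiplication,
# with 32-bit limbs so plain uint64 products suffice. Illustrative only.
const N = 8  # e.g. 8 x 32-bit limbs for a 254-bit modulus

func montMulRef(r: var array[N, uint32], a, b, M: array[N, uint32], m0ninv: uint32) =
  # Requires M's top limb to leave spare bits (the no-carry condition).
  var t: array[N, uint32]                    # zero-initialized
  for i in 0 ..< N:
    # (A, t[0]) <- a[0] * b[i] + t[0]
    var acc = uint64(a[0]) * uint64(b[i]) + uint64(t[0])
    var A = acc shr 32
    t[0] = uint32(acc)
    # m <- (t[0] * m0ninv) mod 2^32 (wraps in uint32)
    let m = t[0] * m0ninv
    # (C, _) <- m * M[0] + t[0]
    acc = uint64(m) * uint64(M[0]) + uint64(t[0])
    var C = acc shr 32
    for j in 1 ..< N:
      # (A, t[j]) <- a[j] * b[i] + A + t[j]
      acc = uint64(a[j]) * uint64(b[i]) + A + uint64(t[j])
      A = acc shr 32
      t[j] = uint32(acc)
      # (C, t[j-1]) <- m * M[j] + C + t[j]
      acc = uint64(m) * uint64(M[j]) + C + uint64(t[j])
      C = acc shr 32
      t[j-1] = uint32(acc)
    t[N-1] = uint32(C + A)                   # fits thanks to the spare bits
  # Final subtraction: t is in [0, 2M), subtract M once if t >= M.
  var s: array[N, uint32]
  var borrow = 0'u64
  for j in 0 ..< N:
    let d = uint64(t[j]) - uint64(M[j]) - borrow
    s[j] = uint32(d)
    borrow = d shr 63                        # 1 if the subtraction borrowed
  for j in 0 ..< N:
    r[j] = if borrow == 1: t[j] else: s[j]   # the real code uses CMOV/masks
```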
@ -0,0 +1,282 @@
# Constantine
# Copyright (c) 2018-2019 Status Research & Development GmbH
# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
# Licensed and distributed under either of
#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
# at your option. This file may not be copied, modified, or distributed except according to those terms.

import
  # Standard library
  std/macros,
  # Internal
  ../config/common,
  ../primitives,
  ./limbs,
  ./finite_fields_asm_mul_x86

# ############################################################
#
#        Assembly implementation of finite fields
#
# ############################################################

# Note: We can refer to at most 30 registers in inline assembly
#       and "InputOutput" registers count double
#       They are nice to let the compiler deal with mov
#       but too constraining so we move things ourselves.

static: doAssert UseX86ASM

# MULX/ADCX/ADOX
{.localPassC:"-madx -mbmi2".}
# Necessary for the compiler to find enough registers (enabled at -O1)
{.localPassC:"-fomit-frame-pointer".}

# Montgomery Multiplication
# ------------------------------------------------------------

proc mulx_by_word(
       ctx: var Assembler_x86,
       C: Operand,
       t: OperandArray,
       a: Operand, # Pointer in scratchspace
       word: Operand,
       S, rRDX: Operand
     ) =
  ## Multiply the `a[0..<N]` by `word` and store in `t[0..<N]`
  ## and carry register `C` (t[N])
  ## `t` and `C` are overwritten
  ## `S` is a scratchspace carry register
  ## `rRDX` is the RDX register descriptor
  let N = t.len

  doAssert N >= 2, "The Assembly-optimized montgomery multiplication requires at least 2 limbs."
  ctx.comment " Outer loop i = 0"
  ctx.`xor` rRDX, rRDX # Clear flags - TODO: necessary?
  ctx.mov rRDX, word

  # for j=0 to N-1
  #   (C, t[j]) := t[j] + a[j]*b[i] + C

  # First limb
  ctx.mulx t[1], t[0], a[0], rdx

  # Steady state
  for j in 1 ..< N-1:
    ctx.mulx t[j+1], S, a[j], rdx
    ctx.adox t[j], S # TODO, we probably can use ADC here

  # Last limb
  ctx.mulx C, S, a[N-1], rdx
  ctx.adox t[N-1], S

  # Final carries
  ctx.comment " Mul carries i = 0"
  ctx.mov rRDX, 0 # Set to 0 without clearing flags
  ctx.adcx C, rRDX
  ctx.adox C, rRDX

proc mulaccx_by_word(
       ctx: var Assembler_x86,
       C: Operand,
       t: OperandArray,
       a: Operand, # Pointer in scratchspace
       i: int,
       word: Operand,
       S, rRDX: Operand
     ) =
  ## Multiply the `a[0..<N]` by `word`
  ## and accumulate in `t[0..<N]`
  ## and carry register `C` (t[N])
  ## `t` and `C` are multiply-accumulated
  ## `S` is a scratchspace register
  ## `rRDX` is the RDX register descriptor
  let N = t.len

  doAssert N >= 2, "The Assembly-optimized montgomery multiplication requires at least 2 limbs."
  doAssert i != 0

  ctx.comment " Outer loop i = " & $i
  ctx.`xor` rRDX, rRDX # Clear flags - TODO: necessary?
  ctx.mov rRDX, word

  # for j=0 to N-1
  #   (C, t[j]) := t[j] + a[j]*b[i] + C

  # Steady state
  for j in 0 ..< N-1:
    ctx.mulx C, S, a[j], rdx
    ctx.adox t[j], S
    ctx.adcx t[j+1], C

  # Last limb
  ctx.mulx C, S, a[N-1], rdx
  ctx.adox t[N-1], S

  # Final carries
  ctx.comment " Mul carries i = " & $i
  ctx.mov rRDX, 0 # Set to 0 without clearing flags
  ctx.adcx C, rRDX
  ctx.adox C, rRDX

proc partialRedx(
       ctx: var Assembler_x86,
       C: Operand,
       t: OperandArray,
       M: OperandArray,
       m0ninv: Operand,
       lo, S, rRDX: Operand
     ) =
  ## Partial Montgomery reduction
  ## For CIOS method
  ## `C` is the update carry flag (represents t[N])
  ## `t[0..<N]` the array to reduce
  ## `M[0..<N]` the prime modulus
  ## `m0ninv` the Montgomery magic number -1/M[0]
  ## `lo` and `S` are scratchspace registers
  ## `rRDX` is the RDX register descriptor

  let N = M.len

  # m = t[0] * m0ninv mod 2^w
  ctx.comment " Reduction"
  ctx.comment " m = t[0] * m0ninv mod 2^w"
  ctx.mov rRDX, t[0]
  ctx.mulx S, rRDX, m0ninv, rdx # (S, RDX) <- m0ninv * RDX

  # Clear carry flags - TODO: necessary?
  ctx.`xor` S, S

  # S,_ := t[0] + m*M[0]
  ctx.comment " S,_ := t[0] + m*M[0]"
  ctx.mulx S, lo, M[0], rdx
  ctx.adcx lo, t[0] # set the carry flag for the future ADCX
  ctx.mov t[0], S

  # for j=1 to N-1
  #   (S, t[j-1]) := t[j] + m*M[j] + S
  ctx.comment " for j=1 to N-1"
  ctx.comment "   (S, t[j-1]) := t[j] + m*M[j] + S"
  for j in 1 ..< N:
    ctx.adcx t[j-1], t[j]
    ctx.mulx t[j], S, M[j], rdx
    ctx.adox t[j-1], S

  # Last carries
  # t[N-1] = S + C
  ctx.comment " Reduction carry"
  ctx.mov S, 0
  ctx.adcx t[N-1], S
  ctx.adox t[N-1], C

macro montMul_CIOS_nocarry_adx_bmi2_gen[N: static int](r_MM: var Limbs[N], a_MM, b_MM, M_MM: Limbs[N], m0ninv_MM: BaseType): untyped =
  ## Generate an optimized Montgomery Multiplication kernel
  ## using the CIOS method
  ## This requires the most significant word of the Modulus
  ##   M[^1] < high(SecretWord) shr 2 (i.e. less than 0b00111...1111)
  ## https://hackmd.io/@zkteam/modular_multiplication

  result = newStmtList()

  var ctx = init(Assembler_x86, BaseType)
  let
    scratchSlots = max(N, 6)

    r = init(OperandArray, nimSymbol = r_MM, N, PointerInReg, InputOutput)
    # We could force M as immediate by specializing per moduli
    M = init(OperandArray, nimSymbol = M_MM, N, PointerInReg, Input)
    # If N is too big, we need to spill registers. TODO.
    t = init(OperandArray, nimSymbol = ident"t", N, ElemsInReg, Output_EarlyClobber)
    # MultiPurpose Register slots
    scratch = init(OperandArray, nimSymbol = ident"scratch", scratchSlots, ElemsInReg, InputOutput)

    # MULX requires RDX
    rRDX = Operand(
      desc: OperandDesc(
        asmId: "[rdx]",
        nimSymbol: ident"rdx",
        rm: RDX,
        constraint: Output_EarlyClobber,
        cEmit: "rdx"
      )
    )

    a = scratch[0].asArrayAddr(len = N) # Store the `a` operand
    b = scratch[1].asArrayAddr(len = N) # Store the `b` operand
    A = scratch[2]                      # High part of extended precision multiplication
    C = scratch[3]
    m0ninv = scratch[4]                 # Modular inverse of M[0]
    lo = scratch[5]                     # Discard "lo" part of partial Montgomery Reduction

  # Registers used:
  # - 1 for `r`
  # - 1 for `M`
  # - 6 for `t` (at most)
  # - 6 for `scratch`
  # - 1 for RDX
  # Total 15 out of 16
  # We can save 1 by hardcoding M as immediate (and m0ninv)
  # but this prevents reusing the same code for multiple curves like BLS12-377 and BLS12-381
  # We might be able to save registers by having `r` and `M` be memory operands as well

  let tsym = t.nimSymbol
  let scratchSym = scratch.nimSymbol
  let edx = rRDX.desc.nimSymbol
  result.add quote do:
    static: doAssert: sizeof(SecretWord) == sizeof(ByteAddress)

    var `tsym`: typeof(`r_MM`) # zero init
    # Assumes 64-bit limbs on 64-bit arch (or you can't store an address)
    var `scratchSym` {.noInit.}: Limbs[`scratchSlots`]
    var `edx`{.noInit.}: BaseType

    `scratchSym`[0] = cast[SecretWord](`a_MM`[0].unsafeAddr)
    `scratchSym`[1] = cast[SecretWord](`b_MM`[0].unsafeAddr)
    `scratchSym`[4] = SecretWord `m0ninv_MM`

  # Algorithm
  # -----------------------------------------
  # for i=0 to N-1
  #   for j=0 to N-1
  #     (A, t[j]) := t[j] + a[j]*b[i] + A
  #   m := t[0]*m0ninv mod W
  #   C,_ := t[0] + m*M[0]
  #   for j=1 to N-1
  #     (C, t[j-1]) := t[j] + m*M[j] + C
  #   t[N-1] = C + A

  # No register spilling handling
  doAssert N <= 6, "The Assembly-optimized montgomery multiplication requires at most 6 limbs."

  for i in 0 ..< N:
    if i == 0:
      ctx.mulx_by_word(
        A, t,
        a,
        b[0],
        C, rRDX
      )
    else:
      ctx.mulaccx_by_word(
        A, t,
        a, i,
        b[i],
        C, rRDX
      )

    ctx.partialRedx(
      A, t,
      M, m0ninv,
      lo, C, rRDX
    )

  ctx.finalSub(
    r, t, M,
    scratch
  )

  result.add ctx.generate

func montMul_CIOS_nocarry_asm_adx_bmi2*(r: var Limbs, a, b, M: Limbs, m0ninv: BaseType) =
  ## Constant-time modular multiplication
  montMul_CIOS_nocarry_adx_bmi2_gen(r, a, b, M, m0ninv)
@ -0,0 +1,340 @@
# Constantine
# Copyright (c) 2018-2019 Status Research & Development GmbH
# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
# Licensed and distributed under either of
#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
# at your option. This file may not be copied, modified, or distributed except according to those terms.

import
  # Standard library
  std/macros,
  # Internal
  ../config/common,
  ../primitives,
  ./limbs

# ############################################################
#
#        Assembly implementation of finite fields
#
# ############################################################

# Note: We can refer to at most 30 registers in inline assembly
#       and "InputOutput" registers count double
#       They are nice to let the compiler deal with mov
#       but too constraining so we move things ourselves.

static: doAssert UseX86ASM

{.localPassC:"-fomit-frame-pointer".} # Needed so that the compiler finds enough registers

# Copy
# ------------------------------------------------------------
macro ccopy_gen[N: static int](a: var Limbs[N], b: Limbs[N], ctl: SecretBool): untyped =
  ## Generate an optimized conditional copy kernel
  result = newStmtList()

  var ctx = init(Assembler_x86, BaseType)

  let
    arrA = init(OperandArray, nimSymbol = a, N, PointerInReg, InputOutput)
    arrB = init(OperandArray, nimSymbol = b, N, PointerInReg, Input)
    # If N is too big, we need to spill registers. TODO.
    arrT = init(OperandArray, nimSymbol = ident"t", N, ElemsInReg, Output_EarlyClobber)

    control = Operand(
      desc: OperandDesc(
        asmId: "[ctl]",
        nimSymbol: ctl,
        rm: Reg,
        constraint: Input,
        cEmit: "ctl"
      )
    )

  ctx.test control, control
  for i in 0 ..< N:
    ctx.mov arrT[i], arrA[i]
    ctx.cmovnz arrT[i], arrB[i]
    ctx.mov arrA[i], arrT[i]

  let t = arrT.nimSymbol
  let c = control.desc.nimSymbol
  result.add quote do:
    var `t` {.noInit.}: typeof(`a`)
  result.add ctx.generate()

func ccopy_asm*(a: var Limbs, b: Limbs, ctl: SecretBool) {.inline.} =
  ## Constant-time conditional copy
  ## If ctl is true: b is copied into a
  ## if ctl is false: b is not copied and a is untouched
  ## Time and memory accesses are the same whether a copy occurs or not
  ccopy_gen(a, b, ctl)

# Field addition
# ------------------------------------------------------------

macro addmod_gen[N: static int](a: var Limbs[N], b, M: Limbs[N]): untyped =
  ## Generate an optimized modular addition kernel
  # Register pressure note:
  #   We could generate a kernel per modulus M by hardcoding it as immediate
  #   however this requires
  #   - duplicating the kernel and also
  #   - 64-bit immediate encoding is quite large

  result = newStmtList()

  var ctx = init(Assembler_x86, BaseType)
  let
    arrA = init(OperandArray, nimSymbol = a, N, PointerInReg, InputOutput)
    # We reuse the reg used for B for overflow detection
    arrB = init(OperandArray, nimSymbol = b, N, PointerInReg, InputOutput)
    # We could force M as immediate by specializing per moduli
    arrM = init(OperandArray, nimSymbol = M, N, PointerInReg, Input)
    # If N is too big, we need to spill registers. TODO.
    arrT = init(OperandArray, nimSymbol = ident"t", N, ElemsInReg, Output_EarlyClobber)
    arrTsub = init(OperandArray, nimSymbol = ident"tsub", N, ElemsInReg, Output_EarlyClobber)

  # Addition
  for i in 0 ..< N:
    ctx.mov arrT[i], arrA[i]
    if i == 0:
      ctx.add arrT[0], arrB[0]
    else:
      ctx.adc arrT[i], arrB[i]
    # Interleaved copy in a second buffer as well
    ctx.mov arrTsub[i], arrT[i]

  # Mask: overflowed contains 0xFFFF...FFFF or 0x0000...0000
  let overflowed = arrB.reuseRegister()
  ctx.sbb overflowed, overflowed

  # Now subtract the modulus
  for i in 0 ..< N:
    if i == 0:
      ctx.sub arrTsub[0], arrM[0]
    else:
      ctx.sbb arrTsub[i], arrM[i]

  # If it overflows here, it means that it was
  # smaller than the modulus and we don't need arrTsub
  ctx.sbb overflowed, 0

  # Conditional mov and store result
  for i in 0 ..< N:
    ctx.cmovnc arrT[i], arrTsub[i]
    ctx.mov arrA[i], arrT[i]

  let t = arrT.nimSymbol
  let tsub = arrTsub.nimSymbol
  result.add quote do:
    var `t`{.noinit.}, `tsub` {.noInit.}: typeof(`a`)
  result.add ctx.generate

func addmod_asm*(a: var Limbs, b, M: Limbs) =
  ## Constant-time modular addition
  addmod_gen(a, b, M)
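
As a reference for the flag bookkeeping above, here is a minimal C sketch of the same branchless technique: full addition, a trial subtraction of the modulus, then a constant-time select. It is illustrative only, not the library's code; the names are ours and 4 limbs are assumed. The subtraction kernel that follows mirrors this trick: subtract, derive an underflow mask, and add back a masked copy of the modulus.

```C
#include <stdint.h>

#define N 4
typedef unsigned __int128 uint128_t;

/* a <- (a + b) mod M, for a, b in [0, M). Mirrors addmod_asm:
 * full add, trial subtraction of M, constant-time select of the result. */
void addmod_sketch(uint64_t a[N], const uint64_t b[N], const uint64_t M[N]) {
  uint64_t t[N], tsub[N], carry = 0, borrow = 0;

  for (int i = 0; i < N; ++i) {  /* t = a + b with carry propagation */
    uint128_t s = (uint128_t)a[i] + b[i] + carry;
    t[i] = (uint64_t)s;
    carry = (uint64_t)(s >> 64);
  }
  for (int i = 0; i < N; ++i) {  /* tsub = t - M with borrow propagation */
    uint128_t d = (uint128_t)t[i] - M[i] - borrow;
    tsub[i] = (uint64_t)d;
    borrow = (uint64_t)(d >> 64) & 1;
  }
  /* t is the result only if the addition did not carry out
   * AND the trial subtraction borrowed (i.e. a + b < M). */
  uint64_t keep_t = (uint64_t)0 - (borrow & (carry ^ 1));
  for (int i = 0; i < N; ++i)
    a[i] = (t[i] & keep_t) | (tsub[i] & ~keep_t);
}
```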

# Field subtraction
# ------------------------------------------------------------

macro submod_gen[N: static int](a: var Limbs[N], b, M: Limbs[N]): untyped =
  ## Generate an optimized modular subtraction kernel
  # Register pressure note:
  #   We could generate a kernel per modulus M by hardcoding it as immediate
  #   however this requires
  #   - duplicating the kernel and also
  #   - 64-bit immediate encoding is quite large

  result = newStmtList()

  var ctx = init(Assembler_x86, BaseType)
  let
    arrA = init(OperandArray, nimSymbol = a, N, PointerInReg, InputOutput)
    # We reuse the reg used for B for overflow detection
    arrB = init(OperandArray, nimSymbol = b, N, PointerInReg, InputOutput)
    # We could force M as immediate by specializing per moduli
    arrM = init(OperandArray, nimSymbol = M, N, PointerInReg, Input)
    # If N is too big, we need to spill registers. TODO.
    arrT = init(OperandArray, nimSymbol = ident"t", N, ElemsInReg, Output_EarlyClobber)
    arrTadd = init(OperandArray, nimSymbol = ident"tadd", N, ElemsInReg, Output_EarlyClobber)

  # Subtraction
  for i in 0 ..< N:
    ctx.mov arrT[i], arrA[i]
    if i == 0:
      ctx.sub arrT[0], arrB[0]
    else:
      ctx.sbb arrT[i], arrB[i]
    # Interleaved copy of the modulus to hide SBB latencies
    ctx.mov arrTadd[i], arrM[i]

  # Mask: underflowed contains 0xFFFF...FFFF or 0x0000...0000
  let underflowed = arrB.reuseRegister()
  ctx.sbb underflowed, underflowed

  # Now mask the adder, with 0 or the modulus limbs
  for i in 0 ..< N:
    ctx.`and` arrTadd[i], underflowed

  # Add the masked modulus
  for i in 0 ..< N:
    if i == 0:
      ctx.add arrT[0], arrTadd[0]
    else:
      ctx.adc arrT[i], arrTadd[i]
    ctx.mov arrA[i], arrT[i]

  let t = arrT.nimSymbol
  let tadd = arrTadd.nimSymbol
  result.add quote do:
    var `t`{.noinit.}, `tadd` {.noInit.}: typeof(`a`)
  result.add ctx.generate

func submod_asm*(a: var Limbs, b, M: Limbs) =
  ## Constant-time modular subtraction
  ## Warning, does not handle aliasing of a and b
  submod_gen(a, b, M)

# Field negation
# ------------------------------------------------------------

macro negmod_gen[N: static int](r: var Limbs[N], a, M: Limbs[N]): untyped =
  ## Generate an optimized modular negation kernel

  result = newStmtList()

  var ctx = init(Assembler_x86, BaseType)
  let
    arrA = init(OperandArray, nimSymbol = a, N, PointerInReg, Input)
    arrR = init(OperandArray, nimSymbol = r, N, ElemsInReg, InputOutput)
    # We could force M as immediate by specializing per moduli
    arrM = init(OperandArray, nimSymbol = M, N, PointerInReg, Input)

  # Negation: r = M - a
  for i in 0 ..< N:
    ctx.mov arrR[i], arrM[i]
    if i == 0:
      ctx.sub arrR[0], arrA[0]
    else:
      ctx.sbb arrR[i], arrA[i]

  result.add ctx.generate

func negmod_asm*(r: var Limbs, a, M: Limbs) {.inline.} =
  ## Constant-time modular negation
  negmod_gen(r, a, M)

# Sanity checks
# ----------------------------------------------------------

when isMainModule:
  import ../config/type_bigint, algorithm, strutils

  proc mainAdd() =
    var a = [SecretWord 0xE3DF60E8F6D0AF9A'u64, SecretWord 0x7B2665C2258A7625'u64, SecretWord 0x68FC9A1D0977C8E0'u64, SecretWord 0xF3DC61ED7DE76883'u64]
    var b = [SecretWord 0x78E9C2EF58BB6B78'u64, SecretWord 0x547F65BD19014254'u64, SecretWord 0x556A115819EAD4B5'u64, SecretWord 0x8CA844A546935DC3'u64]
    var M = [SecretWord 0xFFFFFFFF00000001'u64, SecretWord 0x0000000000000000'u64, SecretWord 0x00000000FFFFFFFF'u64, SecretWord 0xFFFFFFFFFFFFFFFF'u64]
    var s = "0x5cc923d94f8c1b11cfa5cb7f3e8bb879be66ab7423629d968084a692c47ac647"

    a.reverse()
    b.reverse()
    M.reverse()

    debugecho "--------------------------------"
    debugecho "before:"
    debugecho "  a: ", a.toHex()
    debugecho "  b: ", b.toHex()
    debugecho "  m: ", M.toHex()
    addmod_asm(a, b, M)
    debugecho "after:"
    debugecho "  a: ", a.toHex().tolower
    debugecho "  s: ", s
    debugecho " ok: ", a.toHex().tolower == s

    a = [SecretWord 0x00935a991ca215a6'u64, SecretWord 0x5fbdac6294679337'u64, SecretWord 0x1e41793877b80f12'u64, SecretWord 0x5724cd93cb32932d'u64]
    b = [SecretWord 0x19dd4ecfda64ef80'u64, SecretWord 0x92deeb1532169c3d'u64, SecretWord 0x69ce4ee28421cd30'u64, SecretWord 0x4d90ab5a40295321'u64]
    M = [SecretWord 0x2523648240000001'u64, SecretWord 0xba344d8000000008'u64, SecretWord 0x6121000000000013'u64, SecretWord 0xa700000000000013'u64]
    s = "0x1a70a968f7070526f29c9777c67e2f74880fc81afbd9dc42a4b578ee0b5be64e"

    a.reverse()
    b.reverse()
    M.reverse()

    debugecho "--------------------------------"
    debugecho "before:"
    debugecho "  a: ", a.toHex()
    debugecho "  b: ", b.toHex()
    debugecho "  m: ", M.toHex()
    addmod_asm(a, b, M)
    debugecho "after:"
    debugecho "  a: ", a.toHex().tolower
    debugecho "  s: ", s
    debugecho " ok: ", a.toHex().tolower == s

    a = [SecretWord 0x1c7d810f37fc6e0b'u64, SecretWord 0xb91aba4ce339cea3'u64, SecretWord 0xd9f5571ccc4dfd1a'u64, SecretWord 0xf5906ee9df91f554'u64]
    b = [SecretWord 0x18394ffe94874c9f'u64, SecretWord 0x6e8a8ad032fc5f15'u64, SecretWord 0x7533a2b46b7e9530'u64, SecretWord 0x2849996b4bb61b48'u64]
    M = [SecretWord 0x2523648240000001'u64, SecretWord 0xba344d8000000008'u64, SecretWord 0x6121000000000013'u64, SecretWord 0xa700000000000013'u64]
    s = "0x0f936c8b8c83baa96d70f79d16362db0ee07f9d137cc923776da08552b481089"

    a.reverse()
    b.reverse()
    M.reverse()

    debugecho "--------------------------"
    debugecho "before:"
    debugecho "  a: ", a.toHex()
    debugecho "  b: ", b.toHex()
    debugecho "  m: ", M.toHex()
    addmod_asm(a, b, M)
    debugecho "after:"
    debugecho "  a: ", a.toHex().tolower
    debugecho "  s: ", s
    debugecho " ok: ", a.toHex().tolower == s

    a = [SecretWord 0xe9d55643'u64, SecretWord 0x580ec4cc3f91cef3'u64, SecretWord 0x11ecbb7d35b36449'u64, SecretWord 0x35535ca31c5dc2ba'u64]
    b = [SecretWord 0x97f7ed94'u64, SecretWord 0xbad96eb98204a622'u64, SecretWord 0xbba94400f9a061d6'u64, SecretWord 0x60d3521a0d3dd9eb'u64]
    M = [SecretWord 0xffffffff'u64, SecretWord 0xffffffffffffffff'u64, SecretWord 0xffffffff00000000'u64, SecretWord 0x0000000000000001'u64]
    s = "0x0000000081cd43d812e83385c1967515cd95ff7f2f53c61f9626aebd299b9ca4"

    a.reverse()
    b.reverse()
    M.reverse()

    debugecho "--------------------------"
    debugecho "before:"
    debugecho "  a: ", a.toHex()
    debugecho "  b: ", b.toHex()
    debugecho "  m: ", M.toHex()
    addmod_asm(a, b, M)
    debugecho "after:"
    debugecho "  a: ", a.toHex().tolower
    debugecho "  s: ", s
    debugecho " ok: ", a.toHex().tolower == s

  mainAdd()

  proc mainSub() =
    var a = [SecretWord 0xf9c32e89b80b17bd'u64, SecretWord 0xdbd3069d4ca0e1c3'u64, SecretWord 0x980d4c70d39d5e17'u64, SecretWord 0xd9f0252845f18c3a'u64]
    var b = [SecretWord 0x215075604bfd64de'u64, SecretWord 0x36dc488149fc5d3e'u64, SecretWord 0x91fff665385d20fd'u64, SecretWord 0xe980a5a203b43179'u64]
    var M = [SecretWord 0xFFFFFFFFFFFFFFFF'u64, SecretWord 0xFFFFFFFFFFFFFFFF'u64, SecretWord 0xFFFFFFFFFFFFFFFF'u64, SecretWord 0xFFFFFFFEFFFFFC2F'u64]
    var s = "0xd872b9296c0db2dfa4f6be1c02a48485060d560b9b403d19f06f7f86423d5ac1"

    a.reverse()
    b.reverse()
    M.reverse()

    debugecho "--------------------------------"
    debugecho "before:"
    debugecho "  a: ", a.toHex()
    debugecho "  b: ", b.toHex()
    debugecho "  m: ", M.toHex()
    submod_asm(a, b, M)
    debugecho "after:"
    debugecho "  a: ", a.toHex().tolower
    debugecho "  s: ", s
    debugecho " ok: ", a.toHex().tolower == s

  mainSub()
@ -104,9 +104,6 @@ func ccopy*(a: var Limbs, b: Limbs, ctl: SecretBool) =
  ## If ctl is true: b is copied into a
  ## if ctl is false: b is not copied and a is untouched
  ## Time and memory accesses are the same whether a copy occurs or not
-  # TODO: on x86, we use inline assembly for CMOV
-  # the codegen is a bit inefficient as the condition `ctl`
-  # is tested for each limb.
  for i in 0 ..< a.len:
    ctl.ccopy(a[i], b[i])
@ -7,13 +7,18 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import
-  # Stadard library
+  # Standard library
   std/macros,
   # Internal
   ../config/common,
   ../primitives,
   ./limbs
+
+when UseX86ASM:
+  import
+    ./finite_fields_asm_mul_x86,
+    ./finite_fields_asm_mul_x86_adx_bmi2

 # ############################################################
 #
 # Multiprecision Montgomery Arithmetic
@ -343,34 +348,43 @@ func montyMul*(
 # - specialize/duplicate code for m0ninv == 1 (especially if only 1 curve is needed)
 # - keep it generic and optimize code size
 when canUseNoCarryMontyMul:
+  when UseX86ASM and a.len in {2 .. 6}: # TODO: handle spilling
+    if ({.noSideEffect.}: hasBmi2()) and ({.noSideEffect.}: hasAdx()):
+      montMul_CIOS_nocarry_asm_adx_bmi2(r, a, b, M, m0ninv)
+    else:
+      montMul_CIOS_nocarry_asm(r, a, b, M, m0ninv)
+  else:
     montyMul_CIOS_nocarry(r, a, b, M, m0ninv)
 else:
   montyMul_FIPS(r, a, b, M, m0ninv)

 func montySquare*(r: var Limbs, a, M: Limbs,
-                  m0ninv: static BaseType, canUseNoCarryMontySquare: static bool) =
+                  m0ninv: static BaseType, canUseNoCarryMontySquare: static bool) {.inline.} =
   ## Compute r <- a^2 (mod M) in the Montgomery domain
   ## `m0ninv` = -1/M (mod SecretWord). Our words are 2^31 or 2^63
-  when canUseNoCarryMontySquare:
-    # TODO: Deactivated
-    # Off-by one on 32-bit on the least significant bit
-    # for Fp[BLS12-381] with inputs
-    # - -0x091F02EFA1C9B99C004329E94CD3C6B308164CBE02037333D78B6C10415286F7C51B5CD7F917F77B25667AB083314B1B
-    # - -0x0B7C8AFE5D43E9A973AF8649AD8C733B97D06A78CFACD214CBE9946663C3F682362E0605BC8318714305B249B505AFD9
-    # montySquare_CIOS_nocarry(r, a, M, m0ninv)
-    montyMul_CIOS_nocarry(r, a, a, M, m0ninv)
-  else:
-    # TODO: Deactivated
-    # Off-by one on 32-bit for Fp[2^127 - 1] with inputs
-    # - -0x75bfffefbfffffff7fd9dfd800000000
-    # - -0x7ff7ffffffffffff1dfb7fafc0000000
-    # Squaring the number and its opposite
-    # should give the same result, but those are off-by-one
-    # montySquare_CIOS(r, a, M, m0ninv) # TODO <--- Fix this
-    montyMul_FIPS(r, a, a, M, m0ninv)
+  # TODO: needs optimization similar to multiplication
+  montyMul(r, a, a, M, m0ninv, canUseNoCarryMontySquare)
+
+  # when canUseNoCarryMontySquare:
+  #   # TODO: Deactivated
+  #   # Off-by one on 32-bit on the least significant bit
+  #   # for Fp[BLS12-381] with inputs
+  #   # - -0x091F02EFA1C9B99C004329E94CD3C6B308164CBE02037333D78B6C10415286F7C51B5CD7F917F77B25667AB083314B1B
+  #   # - -0x0B7C8AFE5D43E9A973AF8649AD8C733B97D06A78CFACD214CBE9946663C3F682362E0605BC8318714305B249B505AFD9
+  #   #
+  #   # montySquare_CIOS_nocarry(r, a, M, m0ninv)
+  #   montyMul_CIOS_nocarry(r, a, a, M, m0ninv)
+  # else:
+  #   # TODO: Deactivated
+  #   # Off-by one on 32-bit for Fp[2^127 - 1] with inputs
+  #   # - -0x75bfffefbfffffff7fd9dfd800000000
+  #   # - -0x7ff7ffffffffffff1dfb7fafc0000000
+  #   # Squaring the number and its opposite
+  #   # should give the same result, but those are off-by-one
+  #   #
+  #   # montySquare_CIOS(r, a, M, m0ninv) # TODO <--- Fix this
+  #   montyMul_FIPS(r, a, a, M, m0ninv)

 func redc*(r: var Limbs, a, one, M: Limbs,
            m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) =
@ -41,6 +41,11 @@ const
   One* = SecretWord(1)
   MaxWord* = SecretWord(high(BaseType))

+# TODO, we restrict assembly to 64-bit words
+# We need to support register spills for large limbs
+const ConstantineASM {.booldefine.} = true
+const UseX86ASM* = WordBitWidth == 64 and ConstantineASM and X86 and GCC_Compatible
+
 # ############################################################
 #
 # Instrumentation
@ -21,3 +21,7 @@ export
   addcarry_subborrow,
   extended_precision,
   bithacks
+
+when X86 and GCC_Compatible:
+  import primitives/[cpuinfo_x86, macro_assembler_x86]
+  export cpuinfo_x86, macro_assembler_x86
@ -5,8 +5,10 @@ This folder holds:
 - the constant-time primitives, implemented as distinct types
   to have the compiler enforce proper usage
 - extended precision multiplication and division primitives
-- assembly primitives
+- assembly or builtin int128 primitives
 - intrinsics
+- an assembler
+- Runtime CPU features detection

 ## Security
@ -30,13 +32,48 @@ on random or user inputs to constrain them to the prime field
 of the elliptic curves.
 Constantine internals are built to avoid costly constant-time divisions.

-## Performance and code size
-
-It is recommended to prefer Clang, MSVC or ICC over GCC if possible.
-GCC code is significantly slower and bigger for multiprecision arithmetic
-even when using dedicated intrinsics.
-See https://gcc.godbolt.org/z/2h768y
+## Assembler
+
+For both security and performance purposes, Constantine uses inline assembly for field arithmetic.
+
+### Assembly Security
+
+General-purpose compilers can and do rewrite code as long as any observable effect is preserved. Unfortunately, timing is not considered an observable effect, and as compilers and processor branch predictors both get smarter, compilers recognize and rewrite more and more initially branchless code into code with branches, potentially exposing secret data.
+
+A typical example is the conditional mov, which is required to be constant-time any time secrets are involved (https://tools.ietf.org/html/draft-irtf-cfrg-hash-to-curve-08#section-4).
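For reference, this is the branchless selection pattern such code relies on. Pure C cannot guarantee it stays branchless after optimization, which is exactly what the inline assembly pins down (a sketch, not Constantine's API; the name is ours):

```C
#include <stdint.h>

// Branchless select: returns x when choice == 1, y when choice == 0.
// A compiler may still lower this to a branch; inline assembly with
// an explicit CMOV is the only way to pin the code generation down.
static inline uint64_t ct_select(uint64_t choice, uint64_t x, uint64_t y) {
  uint64_t mask = (uint64_t)0 - choice;  // all-ones if choice == 1, else 0
  return (x & mask) | (y & ~mask);
}
```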
+The paper `What you get is what you C: Controlling side effects in mainstream C compilers` (https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf) exposes how compiler "improvements" are detrimental to cryptography:
+![image](https://user-images.githubusercontent.com/22738317/83965485-60cf4f00-a8b4-11ea-866f-4cc8e742f7a8.png)
+
+Another example is securely erasing secret data, which is often elided as an optimization.
+
+Those are not theoretical exploits, as explained in the `When constant-time doesn't save you` article (https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/), which describes an attack against Curve25519, a curve that was designed to be easily implemented in a constant-time manner.
+This attack is due to an "optimization" in the MSVC compiler:
+> **every code compiled in 32-bit with MSVC on 64-bit architectures will call llmul every time a 64-bit multiplication is executed.**
+- [When Constant-Time Source Yields Variable-Time Binary: Exploiting Curve25519-donna Built with MSVC 2015.](https://infoscience.epfl.ch/record/223794/files/32_1.pdf)
+
+#### Verification of Assembly
+
+The generated assembly needs special tooling for formal verification, different from the C-code verification tracked in https://github.com/mratsim/constantine/issues/6.
+Recently Microsoft Research introduced Vale:
+- Vale: Verifying High-Performance Cryptographic Assembly Code\
+  Barry Bond and Chris Hawblitzel, Microsoft Research; Manos Kapritsos, University of Michigan; K. Rustan M. Leino and Jacob R. Lorch, Microsoft Research; Bryan Parno, Carnegie Mellon University; Ashay Rane, The University of Texas at Austin; Srinath Setty, Microsoft Research; Laure Thompson, Cornell University\
+  https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-bond.pdf
+  https://github.com/project-everest/vale
+Vale can be used to verify assembly crypto code against the architecture, and also to detect timing attacks.
+
+### Assembly Performance
+
+Beyond security, compilers do not expose several primitives that are necessary for multiprecision arithmetic.
+
+#### Add with carry, sub with borrow
+
+The most egregious example is add-with-carry, which led the GMP team to implement everything in assembly even though this is a most basic need and almost all processors have an ADC instruction; some, like the 6502 from 30 years ago, only have ADC and no plain ADD.
+See:
+- https://gmplib.org/manual/Assembly-Carry-Propagation.html
+- ![image](https://user-images.githubusercontent.com/22738317/83965806-8f4e2980-a8b6-11ea-9fbb-719e42d119dc.png)
+
+Some specific platforms might expose add-with-carry, for example x86, but even then the code generation might be extremely poor: https://gcc.godbolt.org/z/2h768y
 ```C
 #include <stdint.h>
 #include <x86intrin.h>
@ -47,7 +84,6 @@ void add256(uint64_t a[4], uint64_t b[4]){
     carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
 }
 ```

 GCC
 ```asm
 add256:
@ -70,7 +106,6 @@ add256:
   adcq %rax, 24(%rdi)
   ret
 ```

 Clang
 ```asm
 add256:
@ -84,8 +119,9 @@ add256:
   adcq %rax, 24(%rdi)
   retq
 ```

-### Inline assembly
+(Reported fixed but it is not? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67317)
+
+And there is no way to use ADC for ARM architectures with GCC.
+
+Clang does offer `__builtin_addcll`, which might work now or [not](https://stackoverflow.com/questions/33690791/producing-good-add-with-carry-code-from-clang), as fixing add-with-carry for x86 took years. Alternatively, Clang introduced arbitrary-width integers a month ago, called ExtInt (http://blog.llvm.org/2020/04/the-new-clang-extint-feature-provides.html); it is unknown however whether the generated code is guaranteed to be constant-time.
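As a sketch, a 256-bit addition written with that builtin might look as follows (whether Clang actually emits a clean ADC chain must be verified per compiler version):

```C
#include <stdint.h>

// 256-bit addition with Clang's checked-carry builtin: each call adds two
// limbs plus the incoming carry and writes the outgoing carry.
void add256_builtin(unsigned long long a[4], const unsigned long long b[4]) {
  unsigned long long carry = 0;
  for (int i = 0; i < 4; ++i)
    a[i] = __builtin_addcll(a[i], b[i], carry, &carry);
}
```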
-Using inline assembly will sacrifice code readability, portability, auditability and maintainability.
-That said the performance might be worth it.
+
+See also: https://stackoverflow.com/questions/29029572/multi-word-addition-using-the-carry-flag/29212615
@ -0,0 +1,779 @@
# From awr1: https://github.com/nim-lang/Nim/pull/11816/files
|
||||||
|
|
||||||
|
proc cpuidX86(eaxi, ecxi: int32): tuple[eax, ebx, ecx, edx: int32] {.used.}=
|
||||||
|
when defined(vcc):
|
||||||
|
# limited inline asm support in vcc, so intrinsics, here we go:
|
||||||
|
proc cpuidVcc(cpuInfo: ptr int32; functionID, subFunctionID: int32)
|
||||||
|
{.cdecl, importc: "__cpuidex", header: "intrin.h".}
|
||||||
|
cpuidVcc(addr result.eax, eaxi, ecxi)
|
||||||
|
else:
|
||||||
|
var (eaxr, ebxr, ecxr, edxr) = (0'i32, 0'i32, 0'i32, 0'i32)
|
||||||
|
asm """
|
||||||
|
cpuid
|
||||||
|
:"=a"(`eaxr`), "=b"(`ebxr`), "=c"(`ecxr`), "=d"(`edxr`)
|
||||||
|
:"a"(`eaxi`), "c"(`ecxi`)"""
|
||||||
|
(eaxr, ebxr, ecxr, edxr)
|
||||||
|
|
||||||
|
proc cpuNameX86(): string {.used.}=
|
||||||
|
var leaves {.global.} = cast[array[48, char]]([
|
||||||
|
cpuidX86(eaxi = 0x80000002'i32, ecxi = 0),
|
||||||
|
cpuidX86(eaxi = 0x80000003'i32, ecxi = 0),
|
||||||
|
cpuidX86(eaxi = 0x80000004'i32, ecxi = 0)])
|
||||||
|
result = $cast[cstring](addr leaves[0])
|
||||||
|
|
||||||
|
type
|
||||||
|
X86Feature {.pure.} = enum
|
||||||
|
HypervisorPresence, Hyperthreading, NoSMT, IntelVtx, Amdv, X87fpu, Mmx,
|
||||||
|
MmxExt, F3DNow, F3DNowEnhanced, Prefetch, Sse, Sse2, Sse3, Ssse3, Sse4a,
|
||||||
|
Sse41, Sse42, Avx, Avx2, Avx512f, Avx512dq, Avx512ifma, Avx512pf,
|
||||||
|
Avx512er, Avx512cd, Avx512bw, Avx512vl, Avx512vbmi, Avx512vbmi2,
|
||||||
|
Avx512vpopcntdq, Avx512vnni, Avx512vnniw4, Avx512fmaps4, Avx512bitalg,
|
||||||
|
Avx512bfloat16, Avx512vp2intersect, Rdrand, Rdseed, MovBigEndian, Popcnt,
|
||||||
|
Fma3, Fma4, Xop, Cas8B, Cas16B, Abm, Bmi1, Bmi2, TsxHle, TsxRtm, Adx, Sgx,
|
||||||
|
Gfni, Aes, Vaes, Vpclmulqdq, Pclmulqdq, NxBit, Float16c, Sha, Clflush,
|
||||||
|
ClflushOpt, Clwb, PrefetchWT1, Mpx
|
||||||
|
|
||||||
|
let
|
||||||
|
leaf1 = cpuidX86(eaxi = 1, ecxi = 0)
|
||||||
|
leaf7 = cpuidX86(eaxi = 7, ecxi = 0)
|
||||||
|
leaf8 = cpuidX86(eaxi = 0x80000001'i32, ecxi = 0)
|
||||||
|
|
||||||
|
# The reason why we don't just evaluate these directly in the `let` variable
|
||||||
|
# list is so that we can internally organize features by their input (leaf)
|
||||||
|
# and output registers.
|
||||||
|
proc testX86Feature(feature: X86Feature): bool =
|
||||||
|
proc test(input, bit: int): bool =
|
||||||
|
((1 shl bit) and input) != 0
|
||||||
|
|
||||||
|
# see: https://en.wikipedia.org/wiki/CPUID#Calling_CPUID
|
||||||
|
# see: Intel® Architecture Instruction Set Extensions and Future Features
|
||||||
|
# Programming Reference
|
||||||
|
result = case feature
|
||||||
|
# leaf 1, edx
|
||||||
|
of X87fpu:
|
||||||
|
leaf1.edx.test(0)
|
||||||
|
of Clflush:
|
||||||
|
leaf1.edx.test(19)
|
||||||
|
of Mmx:
|
||||||
|
leaf1.edx.test(23)
|
||||||
|
of Sse:
|
||||||
|
leaf1.edx.test(25)
|
||||||
|
of Sse2:
|
||||||
|
leaf1.edx.test(26)
|
||||||
|
of Hyperthreading:
|
||||||
|
leaf1.edx.test(28)
|
||||||
|
|
||||||
|
# leaf 1, ecx
|
||||||
|
of Sse3:
|
||||||
|
leaf1.ecx.test(0)
|
||||||
|
of Pclmulqdq:
|
||||||
|
leaf1.ecx.test(1)
|
||||||
|
of IntelVtx:
|
||||||
|
leaf1.ecx.test(5)
|
||||||
|
of Ssse3:
|
||||||
|
leaf1.ecx.test(9)
|
||||||
|
of Fma3:
|
||||||
|
leaf1.ecx.test(12)
|
||||||
|
of Cas16B:
|
||||||
|
leaf1.ecx.test(13)
|
||||||
|
of Sse41:
|
||||||
|
leaf1.ecx.test(19)
|
||||||
|
of Sse42:
|
||||||
|
leaf1.ecx.test(20)
|
||||||
|
of MovBigEndian:
|
||||||
|
leaf1.ecx.test(22)
|
||||||
|
of Popcnt:
|
||||||
|
leaf1.ecx.test(23)
|
||||||
|
of Aes:
|
||||||
|
leaf1.ecx.test(25)
|
||||||
|
of Avx:
|
||||||
|
leaf1.ecx.test(28)
|
||||||
|
of Float16c:
|
||||||
|
leaf1.ecx.test(29)
|
||||||
|
of Rdrand:
|
||||||
|
leaf1.ecx.test(30)
|
||||||
|
of HypervisorPresence:
|
||||||
|
leaf1.ecx.test(31)
|
||||||
|
|
||||||
|
# leaf 7, ecx
|
||||||
|
of PrefetchWT1:
|
||||||
|
leaf7.ecx.test(0)
|
||||||
|
of Avx512vbmi:
|
||||||
|
leaf7.ecx.test(1)
|
||||||
|
of Avx512vbmi2:
|
||||||
|
leaf7.ecx.test(6)
|
||||||
|
of Gfni:
|
||||||
|
leaf7.ecx.test(8)
|
||||||
|
of Vaes:
|
||||||
|
leaf7.ecx.test(9)
|
||||||
|
of Vpclmulqdq:
|
||||||
|
leaf7.ecx.test(10)
|
||||||
|
of Avx512vnni:
|
||||||
|
leaf7.ecx.test(11)
|
||||||
|
of Avx512bitalg:
|
||||||
|
leaf7.ecx.test(12)
|
||||||
|
of Avx512vpopcntdq:
|
||||||
|
leaf7.ecx.test(14)
|
||||||
|
|
||||||
|
# lead 7, eax
|
||||||
|
of Avx512bfloat16:
|
||||||
|
leaf7.eax.test(5)
|
||||||
|
|
||||||
|
# leaf 7, ebx
|
||||||
|
of Sgx:
|
||||||
|
leaf7.ebx.test(2)
|
||||||
|
of Bmi1:
|
||||||
|
leaf7.ebx.test(3)
|
||||||
|
of TsxHle:
|
||||||
|
leaf7.ebx.test(4)
|
||||||
|
of Avx2:
|
||||||
|
leaf7.ebx.test(5)
|
||||||
|
of Bmi2:
|
||||||
|
leaf7.ebx.test(8)
|
||||||
|
of TsxRtm:
|
||||||
|
leaf7.ebx.test(11)
|
||||||
|
of Mpx:
|
||||||
|
leaf7.ebx.test(14)
|
||||||
|
of Avx512f:
|
||||||
|
leaf7.ebx.test(16)
|
||||||
|
of Avx512dq:
|
||||||
|
leaf7.ebx.test(17)
|
||||||
|
of Rdseed:
|
||||||
|
leaf7.ebx.test(18)
|
||||||
|
of Adx:
|
||||||
|
leaf7.ebx.test(19)
|
||||||
|
of Avx512ifma:
|
||||||
|
leaf7.ebx.test(21)
|
||||||
|
of ClflushOpt:
|
||||||
|
leaf7.ebx.test(23)
|
||||||
|
of Clwb:
|
||||||
|
leaf7.ebx.test(24)
|
||||||
|
of Avx512pf:
|
||||||
|
leaf7.ebx.test(26)
|
||||||
|
of Avx512er:
|
||||||
|
leaf7.ebx.test(27)
|
||||||
|
of Avx512cd:
|
||||||
|
leaf7.ebx.test(28)
|
||||||
|
of Sha:
|
||||||
|
leaf7.ebx.test(29)
|
||||||
|
of Avx512bw:
|
||||||
|
leaf7.ebx.test(30)
|
||||||
|
of Avx512vl:
|
||||||
|
leaf7.ebx.test(31)
|
||||||
|
|
||||||
|
# leaf 7, edx
|
||||||
|
of Avx512vnniw4:
|
||||||
|
leaf7.edx.test(2)
|
||||||
|
of Avx512fmaps4:
|
||||||
|
leaf7.edx.test(3)
|
||||||
|
of Avx512vp2intersect:
|
||||||
|
leaf7.edx.test(8)
|
||||||
|
|
||||||
|
# leaf 8, edx
|
||||||
|
of NoSMT:
|
||||||
|
leaf8.edx.test(1)
|
||||||
|
of Cas8B:
|
||||||
|
leaf8.edx.test(8)
|
||||||
|
of NxBit:
|
||||||
|
leaf8.edx.test(20)
|
||||||
|
of MmxExt:
|
||||||
|
leaf8.edx.test(22)
|
||||||
|
of F3DNowEnhanced:
|
||||||
|
leaf8.edx.test(30)
|
||||||
|
of F3DNow:
|
||||||
|
leaf8.edx.test(31)
|
||||||
|
|
||||||
|
# leaf 8, ecx
|
||||||
|
of Amdv:
|
||||||
|
leaf8.ecx.test(2)
|
||||||
|
of Abm:
|
||||||
|
leaf8.ecx.test(5)
|
||||||
|
of Sse4a:
|
||||||
|
leaf8.ecx.test(6)
|
||||||
|
of Prefetch:
|
||||||
|
leaf8.ecx.test(8)
|
||||||
|
of Xop:
|
||||||
|
leaf8.ecx.test(11)
|
||||||
|
of Fma4:
|
||||||
|
leaf8.ecx.test(16)
|
||||||
|
|
||||||
|
let
|
||||||
|
isHypervisorPresentImpl = testX86Feature(HypervisorPresence)
|
||||||
|
hasSimultaneousMultithreadingImpl =
|
||||||
|
testX86Feature(Hyperthreading) or not testX86Feature(NoSMT)
|
||||||
|
hasIntelVtxImpl = testX86Feature(IntelVtx)
|
||||||
|
hasAmdvImpl = testX86Feature(Amdv)
|
||||||
|
hasX87fpuImpl = testX86Feature(X87fpu)
|
||||||
|
hasMmxImpl = testX86Feature(Mmx)
|
||||||
|
hasMmxExtImpl = testX86Feature(MmxExt)
|
||||||
|
has3DNowImpl = testX86Feature(F3DNow)
|
||||||
|
has3DNowEnhancedImpl = testX86Feature(F3DNowEnhanced)
|
||||||
|
hasPrefetchImpl = testX86Feature(Prefetch) or testX86Feature(F3DNow)
|
||||||
|
hasSseImpl = testX86Feature(Sse)
|
||||||
|
hasSse2Impl = testX86Feature(Sse2)
|
||||||
|
hasSse3Impl = testX86Feature(Sse3)
|
||||||
|
hasSsse3Impl = testX86Feature(Ssse3)
|
||||||
|
hasSse4aImpl = testX86Feature(Sse4a)
|
||||||
|
hasSse41Impl = testX86Feature(Sse41)
|
||||||
|
hasSse42Impl = testX86Feature(Sse42)
|
||||||
|
hasAvxImpl = testX86Feature(Avx)
|
||||||
|
hasAvx2Impl = testX86Feature(Avx2)
|
||||||
|
hasAvx512fImpl = testX86Feature(Avx512f)
|
||||||
|
hasAvx512dqImpl = testX86Feature(Avx512dq)
|
||||||
|
hasAvx512ifmaImpl = testX86Feature(Avx512ifma)
|
||||||
|
hasAvx512pfImpl = testX86Feature(Avx512pf)
|
||||||
|
hasAvx512erImpl = testX86Feature(Avx512er)
|
||||||
|
hasAvx512cdImpl = testX86Feature(Avx512dq)
|
||||||
|
hasAvx512bwImpl = testX86Feature(Avx512bw)
|
||||||
|
hasAvx512vlImpl = testX86Feature(Avx512vl)
|
||||||
|
hasAvx512vbmiImpl = testX86Feature(Avx512vbmi)
|
||||||
|
hasAvx512vbmi2Impl = testX86Feature(Avx512vbmi2)
|
||||||
|
hasAvx512vpopcntdqImpl = testX86Feature(Avx512vpopcntdq)
|
||||||
|
hasAvx512vnniImpl = testX86Feature(Avx512vnni)
|
||||||
|
hasAvx512vnniw4Impl = testX86Feature(Avx512vnniw4)
|
||||||
|
hasAvx512fmaps4Impl = testX86Feature(Avx512fmaps4)
|
||||||
|
hasAvx512bitalgImpl = testX86Feature(Avx512bitalg)
|
||||||
|
hasAvx512bfloat16Impl = testX86Feature(Avx512bfloat16)
|
||||||
|
hasAvx512vp2intersectImpl = testX86Feature(Avx512vp2intersect)
|
||||||
|
hasRdrandImpl = testX86Feature(Rdrand)
|
||||||
|
hasRdseedImpl = testX86Feature(Rdseed)
|
||||||
|
hasMovBigEndianImpl = testX86Feature(MovBigEndian)
|
||||||
|
hasPopcntImpl = testX86Feature(Popcnt)
|
||||||
|
hasFma3Impl = testX86Feature(Fma3)
|
||||||
|
hasFma4Impl = testX86Feature(Fma4)
|
||||||
|
hasXopImpl = testX86Feature(Xop)
|
||||||
|
hasCas8BImpl = testX86Feature(Cas8B)
|
||||||
|
hasCas16BImpl = testX86Feature(Cas16B)
|
||||||
|
hasAbmImpl = testX86Feature(Abm)
|
||||||
|
hasBmi1Impl = testX86Feature(Bmi1)
|
||||||
|
hasBmi2Impl = testX86Feature(Bmi2)
|
||||||
|
hasTsxHleImpl = testX86Feature(TsxHle)
|
||||||
|
hasTsxRtmImpl = testX86Feature(TsxRtm)
|
||||||
|
hasAdxImpl = testX86Feature(TsxHle)
|
||||||
|
hasSgxImpl = testX86Feature(Sgx)
|
||||||
|
hasGfniImpl = testX86Feature(Gfni)
|
||||||
|
hasAesImpl = testX86Feature(Aes)
|
||||||
|
hasVaesImpl = testX86Feature(Vaes)
|
||||||
|
hasVpclmulqdqImpl = testX86Feature(Vpclmulqdq)
|
||||||
|
hasPclmulqdqImpl = testX86Feature(Pclmulqdq)
|
||||||
|
hasNxBitImpl = testX86Feature(NxBit)
|
||||||
|
hasFloat16cImpl = testX86Feature(Float16c)
|
||||||
|
hasShaImpl = testX86Feature(Sha)
|
||||||
|
hasClflushImpl = testX86Feature(Clflush)
|
||||||
|
hasClflushOptImpl = testX86Feature(ClflushOpt)
|
||||||
|
hasClwbImpl = testX86Feature(Clwb)
|
||||||
|
hasPrefetchWT1Impl = testX86Feature(PrefetchWT1)
|
||||||
|
hasMpxImpl = testX86Feature(Mpx)
|
||||||
|
|
||||||
|
# NOTE: We use procedures here (layered over the variables) to keep the API
|
||||||
|
# consistent and usable against possible future heterogenous systems with ISA
|
||||||
|
# differences between cores (a possibility that has historical precedents, for
|
||||||
|
# instance, the PPU/SPU relationship found on the IBM Cell). If future systems
|
||||||
|
# do end up having disparate ISA features across multiple cores, expect there to
|
||||||
|
# be a "cpuCore" argument added to the feature procs.
|
||||||
|
|
||||||
|
proc isHypervisorPresent*(): bool {.inline.} =
|
||||||
|
return isHypervisorPresentImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if this application is running inside of a virtual machine
|
||||||
|
## (this is by no means foolproof).
|
||||||
|
|
||||||
|
proc hasSimultaneousMultithreading*(): bool {.inline.} =
|
||||||
|
return hasSimultaneousMultithreadingImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware is utilizing simultaneous multithreading
|
||||||
|
## (branded as *"hyperthreads"* on Intel processors).
|
||||||
|
|
||||||
|
proc hasIntelVtx*(): bool {.inline.} =
|
||||||
|
return hasIntelVtxImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the Intel virtualization extensions (VT-x) are available.
|
||||||
|
|
||||||
|
proc hasAmdv*(): bool {.inline.} =
|
||||||
|
return hasAmdvImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the AMD virtualization extensions (AMD-V) are available.
|
||||||
|
|
||||||
|
proc hasX87fpu*(): bool {.inline.} =
|
||||||
|
return hasX87fpuImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use x87 floating-point instructions
|
||||||
|
## (includes support for single, double, and 80-bit percision floats as per
|
||||||
|
## IEEE 754-1985).
|
||||||
|
##
|
||||||
|
## By virtue of SSE2 enforced compliance on AMD64 CPUs, this should always be
|
||||||
|
## `true` on 64-bit x86 processors. It should be noted that support of these
|
||||||
|
## instructions is deprecated on 64-bit versions of Windows - see MSDN_.
|
||||||
|
##
|
||||||
|
## .. _MSDN: https://docs.microsoft.com/en-us/windows/win32/dxtecharts/sixty-four-bit-programming-for-game-developers#porting-applications-to-64-bit-platforms
|
||||||
|
|
||||||
|
proc hasMmx*(): bool {.inline.} =
|
||||||
|
return hasMmxImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use MMX SIMD instructions.
|
||||||
|
##
|
||||||
|
## By virtue of SSE2 enforced compliance on AMD64 CPUs, this should always be
|
||||||
|
## `true` on 64-bit x86 processors. It should be noted that support of these
|
||||||
|
## instructions is deprecated on 64-bit versions of Windows (see MSDN_ for
|
||||||
|
## more info).
|
||||||
|
##
|
||||||
|
## .. _MSDN: https://docs.microsoft.com/en-us/windows/win32/dxtecharts/sixty-four-bit-programming-for-game-developers#porting-applications-to-64-bit-platforms
|
||||||
|
|
||||||
|
proc hasMmxExt*(): bool {.inline.} =
|
||||||
|
return hasMmxExtImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use "Extended MMX" SIMD instructions.
|
||||||
|
##
|
||||||
|
## It should be noted that support of these instructions is deprecated on
|
||||||
|
## 64-bit versions of Windows (see MSDN_ for more info).
|
||||||
|
##
|
||||||
|
## .. _MSDN: https://docs.microsoft.com/en-us/windows/win32/dxtecharts/sixty-four-bit-programming-for-game-developers#porting-applications-to-64-bit-platforms
|
||||||
|
|
||||||
|
proc has3DNow*(): bool {.inline.} =
|
||||||
|
return has3DNowImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use 3DNow! SIMD instructions.
|
||||||
|
##
|
||||||
|
## It should be noted that support of these instructions is deprecated on
|
||||||
|
## 64-bit versions of Windows (see MSDN_ for more info), and that the 3DNow!
|
||||||
|
## instructions (with an exception made for the prefetch instructions, see the
|
||||||
|
## `hasPrefetch` procedure) have been phased out of AMD processors since 2010
|
||||||
|
## (see `AMD Developer Central`_ for more info).
|
||||||
|
##
|
||||||
|
## .. _MSDN: https://docs.microsoft.com/en-us/windows/win32/dxtecharts/sixty-four-bit-programming-for-game-developers#porting-applications-to-64-bit-platforms
|
||||||
|
## .. _`AMD Developer Central`: https://web.archive.org/web/20131109151245/http://developer.amd.com/community/blog/2010/08/18/3dnow-deprecated/
|
||||||
|
|
||||||
|
proc has3DNowEnhanced*(): bool {.inline.} =
|
||||||
|
return has3DNowEnhancedImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use "Enhanced 3DNow!" SIMD instructions.
|
||||||
|
##
|
||||||
|
## It should be noted that support of these instructions is deprecated on
|
||||||
|
## 64-bit versions of Windows (see MSDN_ for more info), and that the 3DNow!
|
||||||
|
## instructions (with an exception made for the prefetch instructions, see the
|
||||||
|
## `hasPrefetch` procedure) have been phased out of AMD processors since 2010
|
||||||
|
## (see `AMD Developer Central`_ for more info).
|
||||||
|
##
|
||||||
|
## .. _MSDN: https://docs.microsoft.com/en-us/windows/win32/dxtecharts/sixty-four-bit-programming-for-game-developers#porting-applications-to-64-bit-platforms
|
||||||
|
## .. _`AMD Developer Central`: https://web.archive.org/web/20131109151245/http://developer.amd.com/community/blog/2010/08/18/3dnow-deprecated/
|
||||||
|
|
||||||
|
proc hasPrefetch*(): bool {.inline.} =
|
||||||
|
return hasPrefetchImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use the `PREFETCH` and `PREFETCHW`
|
||||||
|
## instructions. These instructions originally included as part of 3DNow!, but
|
||||||
|
## potentially indepdendent from the rest of it due to changes in contemporary
|
||||||
|
## AMD processors (see above).
|
||||||
|
|
||||||
|
proc hasSse*(): bool {.inline.} =
|
||||||
|
return hasSseImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use the SSE (Streaming SIMD Extensions)
|
||||||
|
## 1.0 instructions, which introduced 128-bit SIMD on x86 machines.
|
||||||
|
##
|
||||||
|
## By virtue of SSE2 enforced compliance on AMD64 CPUs, this should always be
|
||||||
|
## `true` on 64-bit x86 processors.
|
||||||
|
|
||||||
|
proc hasSse2*(): bool {.inline.} =
|
||||||
|
return hasSse2Impl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use the SSE (Streaming SIMD Extensions)
|
||||||
|
## 2.0 instructions.
|
||||||
|
##
|
||||||
|
## By virtue of SSE2 enforced compliance on AMD64 CPUs, this should always be
|
||||||
|
## `true` on 64-bit x86 processors.
|
||||||
|
|
||||||
|
proc hasSse3*(): bool {.inline.} =
|
||||||
|
return hasSse3Impl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use SSE (Streaming SIMD Extensions) 3.0
|
||||||
|
## instructions.
|
||||||
|
|
||||||
|
proc hasSsse3*(): bool {.inline.} =
|
||||||
|
return hasSsse3Impl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use Supplemental SSE (Streaming SIMD
|
||||||
|
## Extensions) 3.0 instructions.
|
||||||
|
|
||||||
|
proc hasSse4a*(): bool {.inline.} =
|
||||||
|
return hasSse4aImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use Supplemental SSE (Streaming SIMD
|
||||||
|
## Extensions) 4a instructions.
|
||||||
|
|
||||||
|
proc hasSse41*(): bool {.inline.} =
|
||||||
|
return hasSse41Impl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use Supplemental SSE (Streaming SIMD
|
||||||
|
## Extensions) 4.1 instructions.
|
||||||
|
|
||||||
|
proc hasSse42*(): bool {.inline.} =
|
||||||
|
return hasSse42Impl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use Supplemental SSE (Streaming SIMD
|
||||||
|
## Extensions) 4.2 instructions.
|
||||||
|
|
||||||
|
proc hasAvx*(): bool {.inline.} =
|
||||||
|
return hasAvxImpl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
|
||||||
|
## 1.0 instructions, which introduced 256-bit SIMD on x86 machines along with
|
||||||
|
## addded reencoded versions of prior 128-bit SSE instructions into the more
|
||||||
|
## code-dense and non-backward compatible VEX (Vector Extensions) format.
|
||||||
|
|
||||||
|
proc hasAvx2*(): bool {.inline.} =
|
||||||
|
return hasAvx2Impl
|
||||||
|
## **(x86 Only)**
|
||||||
|
##
|
||||||
|
## Reports `true` if the hardware can use AVX (Advanced Vector Extensions) 2.0
|
||||||
|
## instructions.
|
||||||
|
|
||||||
|
proc hasAvx512f*(): bool {.inline.} =
  return hasAvx512fImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit F (Foundation) instructions.

proc hasAvx512dq*(): bool {.inline.} =
  return hasAvx512dqImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit DQ (Doubleword + Quadword) instructions.

proc hasAvx512ifma*(): bool {.inline.} =
  return hasAvx512ifmaImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit IFMA (Integer Fused Multiply Accumulation) instructions.

proc hasAvx512pf*(): bool {.inline.} =
  return hasAvx512pfImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit PF (Prefetch) instructions.

proc hasAvx512er*(): bool {.inline.} =
  return hasAvx512erImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit ER (Exponential and Reciprocal) instructions.

proc hasAvx512cd*(): bool {.inline.} =
  return hasAvx512cdImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit CD (Conflict Detection) instructions.

proc hasAvx512bw*(): bool {.inline.} =
  return hasAvx512bwImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit BW (Byte and Word) instructions.

proc hasAvx512vl*(): bool {.inline.} =
  return hasAvx512vlImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit VL (Vector Length) instructions.

proc hasAvx512vbmi*(): bool {.inline.} =
  return hasAvx512vbmiImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit VBMI (Vector Byte Manipulation) 1.0 instructions.

proc hasAvx512vbmi2*(): bool {.inline.} =
  return hasAvx512vbmi2Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit VBMI (Vector Byte Manipulation) 2.0 instructions.

proc hasAvx512vpopcntdq*(): bool {.inline.} =
  return hasAvx512vpopcntdqImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use the AVX (Advanced Vector Extensions)
  ## 512-bit `VPOPCNTDQ` (population count, i.e. determine number of flipped
  ## bits) instruction.

proc hasAvx512vnni*(): bool {.inline.} =
  return hasAvx512vnniImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit VNNI (Vector Neural Network) instructions.

proc hasAvx512vnniw4*(): bool {.inline.} =
  return hasAvx512vnniw4Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit 4VNNIW (Vector Neural Network Word Variable Precision)
  ## instructions.

proc hasAvx512fmaps4*(): bool {.inline.} =
  return hasAvx512fmaps4Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit 4FMAPS (Fused Multiply Accumulation Single-precision) instructions.

proc hasAvx512bitalg*(): bool {.inline.} =
  return hasAvx512bitalgImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit BITALG (Bit Algorithms) instructions.

proc hasAvx512bfloat16*(): bool {.inline.} =
  return hasAvx512bfloat16Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit BFLOAT16 (8-bit exponent, 7-bit mantissa) instructions used by
  ## Intel DL (Deep Learning) Boost.

proc hasAvx512vp2intersect*(): bool {.inline.} =
  return hasAvx512vp2intersectImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware can use AVX (Advanced Vector Extensions)
  ## 512-bit VP2INTERSECT (Compute Intersections between Doublewords + Quadwords)
  ## instructions.
proc hasRdrand*(): bool {.inline.} =
  return hasRdrandImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `RDRAND` instruction,
  ## i.e. Intel on-CPU hardware random number generation.

proc hasRdseed*(): bool {.inline.} =
  return hasRdseedImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `RDSEED` instruction,
  ## i.e. Intel on-CPU hardware random number generation (used for seeding other
  ## PRNGs).

proc hasMovBigEndian*(): bool {.inline.} =
  return hasMovBigEndianImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `MOVBE` instruction for
  ## endianness/byte-order switching.

proc hasPopcnt*(): bool {.inline.} =
  return hasPopcntImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `POPCNT` (population
  ## count, i.e. determine number of flipped bits) instruction.
proc hasFma3*(): bool {.inline.} =
  return hasFma3Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the FMA3 (Fused Multiply
  ## Accumulation 3-operand) SIMD instructions.

proc hasFma4*(): bool {.inline.} =
  return hasFma4Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the FMA4 (Fused Multiply
  ## Accumulation 4-operand) SIMD instructions.

proc hasXop*(): bool {.inline.} =
  return hasXopImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the XOP (eXtended
  ## Operations) SIMD instructions. These instructions are exclusive to the
  ## Bulldozer AMD microarchitecture family (i.e. Bulldozer, Piledriver,
  ## Steamroller, and Excavator) and were phased out with the release of the Zen
  ## design.
proc hasCas8B*(): bool {.inline.} =
  return hasCas8BImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the (`LOCK`-able)
  ## `CMPXCHG8B` 64-bit compare-and-swap instruction.

proc hasCas16B*(): bool {.inline.} =
  return hasCas16BImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the (`LOCK`-able)
  ## `CMPXCHG16B` 128-bit compare-and-swap instruction.
proc hasAbm*(): bool {.inline.} =
  return hasAbmImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for ABM (Advanced Bit
  ## Manipulation) instructions (i.e. `POPCNT` for counting set bits and
  ## `LZCNT` for counting leading zeroes).

proc hasBmi1*(): bool {.inline.} =
  return hasBmi1Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for BMI (Bit Manipulation) 1.0
  ## instructions.

proc hasBmi2*(): bool {.inline.} =
  return hasBmi2Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for BMI (Bit Manipulation) 2.0
  ## instructions.
proc hasTsxHle*(): bool {.inline.} =
  return hasTsxHleImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for HLE (Hardware Lock Elision)
  ## as part of Intel's TSX (Transactional Synchronization Extensions).

proc hasTsxRtm*(): bool {.inline.} =
  return hasTsxRtmImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for RTM (Restricted
  ## Transactional Memory) as part of Intel's TSX (Transactional Synchronization
  ## Extensions).
proc hasAdx*(): bool {.inline.} =
  return hasAdxImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for ADX (Multi-Precision
  ## Add-Carry Extensions) instructions.
proc hasSgx*(): bool {.inline.} =
  return hasSgxImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for SGX (Software Guard
  ## eXtensions) memory encryption technology.

proc hasGfni*(): bool {.inline.} =
  return hasGfniImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for GFNI (Galois Field Affine
  ## Transformation) instructions.

proc hasAes*(): bool {.inline.} =
  return hasAesImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for AES-NI (Advanced Encryption
  ## Standard New Instructions).
proc hasVaes*(): bool {.inline.} =
  return hasVaesImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for VAES (Vectorized Advanced
  ## Encryption Standard) instructions.

proc hasVpclmulqdq*(): bool {.inline.} =
  return hasVpclmulqdqImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for `VPCLMULQDQ` (512 and 256-bit
  ## Carryless Multiplication) instructions.

proc hasPclmulqdq*(): bool {.inline.} =
  return hasPclmulqdqImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for `PCLMULQDQ` (128-bit
  ## Carryless Multiplication) instructions.
proc hasNxBit*(): bool {.inline.} =
  return hasNxBitImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for NX-bit (No-eXecute)
  ## technology for marking pages of memory as non-executable.

proc hasFloat16c*(): bool {.inline.} =
  return hasFloat16cImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for F16C instructions, used for
  ## converting 16-bit "half-precision" floating-point values to and from
  ## single-precision floating-point values.
proc hasSha*(): bool {.inline.} =
  return hasShaImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for SHA (Secure Hash Algorithm)
  ## instructions.

proc hasClflush*(): bool {.inline.} =
  return hasClflushImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `CLFLUSH` (Cache-line
  ## Flush) instruction.

proc hasClflushOpt*(): bool {.inline.} =
  return hasClflushOptImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `CLFLUSHOPT` (Cache-line
  ## Flush Optimized) instruction.

proc hasClwb*(): bool {.inline.} =
  return hasClwbImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `CLWB` (Cache-line Write
  ## Back) instruction.
proc hasPrefetchWT1*(): bool {.inline.} =
  return hasPrefetchWT1Impl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for the `PREFETCHWT1`
  ## instruction.

proc hasMpx*(): bool {.inline.} =
  return hasMpxImpl
  ## **(x86 Only)**
  ##
  ## Reports `true` if the hardware has support for MPX (Memory Protection
  ## eXtensions).
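# ----------------------------------------------------------------------------
# Illustrative sketch (not part of the module above): how these queries feed
# the runtime dispatch between an ADX/BMI2 assembly kernel and a portable
# fallback. `fieldMulASM` and `fieldMulFallback` are hypothetical names
# standing in for the two implementations.

proc fieldMul(r: var openArray[uint64], a, b: openArray[uint64]) =
  if hasAdx() and hasBmi2():
    fieldMulASM(r, a, b)       # MULX/ADCX/ADOX no-carry multiplication
  else:
    fieldMulFallback(r, a, b)  # plain add-with-carry chains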
@ -0,0 +1,620 @@
# Constantine
# Copyright (c) 2018-2019 Status Research & Development GmbH
# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
# Licensed and distributed under either of
#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
# at your option. This file may not be copied, modified, or distributed except according to those terms.

import std/[macros, strutils, sets, hashes]

# A compile-time inline assembler
type
  RM* = enum
    ## Register or Memory operand
    # https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html
    Reg            = "r"
    Mem            = "m"
    AnyRegOrMem    = "rm" # use "r, m" instead?
    Imm            = "i"
    MemOffsettable = "o"
    AnyRegMemImm   = "g"
    AnyMemOffImm   = "oi"
    AnyRegImm      = "ri"

    PointerInReg   = "r" # Store an array pointer
    ElemsInReg     = "r" # Store each individual array element in a register

    # Specific registers
    RCX            = "c"
    RDX            = "d"
    R8             = "r8"

    RAX            = "a"

  Register* = enum
    rbx, rdx, r8, rax

  Constraint* = enum
    ## GCC extended assembly modifier
    Input               = ""
    Input_Commutative   = "%"
    Input_EarlyClobber  = "&"
    Output_Overwrite    = "="
    Output_EarlyClobber = "=&"
    InputOutput         = "+"
    InputOutput_EnsureClobber = "+&" # For register InputOutput, Clang needs "+&" - a bug?

  OpKind = enum
    kRegister
    kFromArray
    kArrayAddr

  Operand* = object
    desc*: OperandDesc
    case kind: OpKind
    of kRegister:
      discard
    of kFromArray:
      offset: int
    of kArrayAddr:
      buf: seq[Operand]

  OperandDesc* = ref object
    asmId*: string      # [a] - ASM id
    nimSymbol*: NimNode # a - Nim symbol
    rm*: RM
    constraint*: Constraint
    cEmit*: string      # C emit, for example a->limbs

  OperandArray* = object
    nimSymbol*: NimNode
    buf: seq[Operand]

  OperandReuse* = object
    # Allow reusing a register
    asmId*: string

  Assembler_x86* = object
    code: string
    operands: HashSet[OperandDesc]
    wordBitWidth*: int
    wordSize: int
    areFlagsClobbered: bool
    isStackClobbered: bool

  Stack* = object
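# Illustrative: because the enums carry their GCC spelling as string values,
# a (Constraint, RM) pair concatenates directly into an extended-assembly
# constraint string - exactly what `generate` below relies on.
static:
  doAssert $InputOutput & $Reg == "+r"           # input/output register
  doAssert $Output_EarlyClobber & $Reg == "=&r"  # early-clobbered output
  doAssert $Input & $Mem == "m"                  # plain memory input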
const SpecificRegisters = {RCX, RDX, R8, RAX}
const OutputReg = {Output_EarlyClobber, InputOutput, InputOutput_EnsureClobber, Output_Overwrite}

func hash(od: OperandDesc): Hash =
  {.noSideEffect.}:
    hash($od.nimSymbol)

# TODO: remove the need of OperandArray

func len*(opArray: OperandArray): int =
  opArray.buf.len

proc `[]`*(opArray: OperandArray, index: int): Operand =
  opArray.buf[index]

func `[]`*(opArray: var OperandArray, index: int): var Operand =
  opArray.buf[index]

func `[]`*(arrayAddr: Operand, index: int): Operand =
  arrayAddr.buf[index]

func `[]`*(arrayAddr: var Operand, index: int): var Operand =
  arrayAddr.buf[index]
func init*(T: type Assembler_x86, Word: typedesc[SomeUnsignedInt]): Assembler_x86 =
  result.wordSize = sizeof(Word)
  result.wordBitWidth = result.wordSize * 8

func init*(T: type OperandArray, nimSymbol: NimNode, len: int, rm: RM, constraint: Constraint): OperandArray =
  doAssert rm in {
    MemOffsettable,
    AnyMemOffImm,
    PointerInReg,
    ElemsInReg
  } or rm in SpecificRegisters

  result.buf.setLen(len)

  # We need to dereference the hidden pointer of var params
  let isHiddenDeref = nimSymbol.kind == nnkHiddenDeref
  let nimSymbol = if isHiddenDeref: nimSymbol[0]
                  else: nimSymbol
  {.noSideEffect.}:
    let symStr = $nimSymbol

  result.nimSymbol = nimSymbol

  if rm in {PointerInReg, MemOffsettable, AnyMemOffImm} or
     rm in SpecificRegisters:
    let desc = OperandDesc(
      asmId: "[" & symStr & "]",
      nimSymbol: nimSymbol,
      rm: rm,
      constraint: constraint,
      cEmit: symStr
    )
    for i in 0 ..< len:
      result.buf[i] = Operand(
        desc: desc,
        kind: kFromArray,
        offset: i
      )
  else:
    # We can't store an array in a register, so we assign
    # an individual register per array element instead.
    for i in 0 ..< len:
      result.buf[i] = Operand(
        desc: OperandDesc(
          asmId: "[" & symStr & $i & "]",
          nimSymbol: ident(symStr & $i),
          rm: rm,
          constraint: constraint,
          cEmit: symStr & "[" & $i & "]"
        ),
        kind: kRegister
      )
func asArrayAddr*(op: Operand, len: int): Operand =
  ## Use the value stored in an operand as an array address
  doAssert op.desc.rm in {Reg, PointerInReg, ElemsInReg} + SpecificRegisters
  result = Operand(
    kind: kArrayAddr,
    desc: nil,
    buf: newSeq[Operand](len)
  )
  for i in 0 ..< len:
    result.buf[i] = Operand(
      desc: op.desc,
      kind: kFromArray,
      offset: i
    )
# Code generation
# ------------------------------------------------------------------------------------------------------------

func generate*(a: Assembler_x86): NimNode =
  ## Generate the inline assembly code from
  ## the desired instructions

  var
    outOperands: seq[string]
    inOperands: seq[string]
    memClobbered = false

  for odesc in a.operands.items():
    var decl: string
    if odesc.rm in SpecificRegisters:
      # [a] "rbx" (`a`)
      decl = odesc.asmId & "\"" & $odesc.constraint & $odesc.rm & "\"" &
             " (`" & odesc.cEmit & "`)"
    elif odesc.rm in {Mem, AnyRegOrMem, MemOffsettable, AnyRegMemImm, AnyMemOffImm}:
      # [a] "+r" (`*a`)
      # We need to deref the pointer to memory
      decl = odesc.asmId & " \"" & $odesc.constraint & $odesc.rm & "\"" &
             " (`*" & odesc.cEmit & "`)"
    else:
      # [a] "+r" (`a[0]`)
      decl = odesc.asmId & " \"" & $odesc.constraint & $odesc.rm & "\"" &
             " (`" & odesc.cEmit & "`)"

    if odesc.constraint in {Input, Input_Commutative}:
      inOperands.add decl
    else:
      outOperands.add decl

    if odesc.rm == PointerInReg and odesc.constraint in {Output_Overwrite, Output_EarlyClobber, InputOutput, InputOutput_EnsureClobber}:
      memClobbered = true

  var params: string
  params.add ": " & outOperands.join(", ") & '\n'
  params.add ": " & inOperands.join(", ") & '\n'

  let clobbers = [(a.isStackClobbered, "sp"),
                  (a.areFlagsClobbered, "cc"),
                  (memClobbered, "memory")]
  var clobberList = ": "
  for (clobbered, str) in clobbers:
    if clobbered:
      if clobberList.len == 2:
        clobberList.add "\"" & str & '\"'
      else:
        clobberList.add ", \"" & str & '\"'

  params.add clobberList

  # GCC will optimize the ASM away if there is no
  # memory operand or volatile + memory clobber
  # https://stackoverflow.com/questions/34244185/looping-over-arrays-with-inline-assembly

  # result = nnkAsmStmt.newTree(
  #   newEmptyNode(),
  #   newLit(asmStmt & params)
  # )

  var asmStmt = "\"" & a.code.replace("\n", "\\n\"\n\"")
  asmStmt.setLen(asmStmt.len - 1) # drop the last quote

  result = nnkPragma.newTree(
    nnkExprColonExpr.newTree(
      ident"emit",
      newLit(
        "asm volatile(\n" & asmStmt & params & ");"
      )
    )
  )
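# Illustrative (hand-reconstructed shape, not verbatim output): for a single
# register-to-register `adc`, the pragma spliced back by `generate` emits C
# along the lines of
#
#   asm volatile(
#   "adcq %[b], %[a]\n"
#   : [a] "+r" (`a`)
#   : [b] "r" (`b`)
#   : "cc");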
func getStrOffset(a: Assembler_x86, op: Operand): string =
  if op.kind != kFromArray:
    return "%" & op.desc.asmId

  # Beware GCC / Clang differences with array offsets
  # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html

  if op.desc.rm in {Mem, AnyRegOrMem, MemOffsettable, AnyMemOffImm, AnyRegMemImm}:
    # Directly accessing memory
    if op.offset == 0:
      return "%" & op.desc.asmId
    if defined(gcc):
      return $(op.offset * a.wordSize) & "+%" & op.desc.asmId
    elif defined(clang):
      return $(op.offset * a.wordSize) & "%" & op.desc.asmId
    else:
      error "Unconfigured compiler"
  elif op.desc.rm == PointerInReg or
       op.desc.rm in SpecificRegisters or
       (op.desc.rm == ElemsInReg and op.kind == kFromArray):
    if op.offset == 0:
      return "(%" & $op.desc.asmId & ')'
    if defined(gcc):
      return $(op.offset * a.wordSize) & "+(%" & $op.desc.asmId & ')'
    elif defined(clang):
      return $(op.offset * a.wordSize) & "(%" & $op.desc.asmId & ')'
    else:
      error "Unconfigured compiler"
  else:
    error "Unsupported: " & $op.desc.rm.ord
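# Illustrative: for limb 2 of a 64-bit limb array behind a pointer-in-register
# operand `[a]`, the two compilers expect different offset syntax:
#   GCC:   "16+(%[a])"
#   Clang: "16(%[a])"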
func codeFragment(a: var Assembler_x86, instr: string, op: Operand) =
  # Generate a code fragment
  let off = a.getStrOffset(op)

  if a.wordBitWidth == 64:
    a.code &= instr & "q " & off & '\n'
  elif a.wordBitWidth == 32:
    a.code &= instr & "l " & off & '\n'
  else:
    error "Unsupported bitwidth: " & $a.wordBitWidth

  a.operands.incl op.desc

func codeFragment(a: var Assembler_x86, instr: string, op0, op1: Operand) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  let off0 = a.getStrOffset(op0)
  let off1 = a.getStrOffset(op1)

  if a.wordBitWidth == 64:
    a.code &= instr & "q " & off0 & ", " & off1 & '\n'
  elif a.wordBitWidth == 32:
    a.code &= instr & "l " & off0 & ", " & off1 & '\n'
  else:
    error "Unsupported bitwidth: " & $a.wordBitWidth

  a.operands.incl op0.desc
  a.operands.incl op1.desc

func codeFragment(a: var Assembler_x86, instr: string, imm: int, op: Operand) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  let off = a.getStrOffset(op)

  if a.wordBitWidth == 64:
    a.code &= instr & "q $" & $imm & ", " & off & '\n'
  else:
    a.code &= instr & "l $" & $imm & ", " & off & '\n'

  a.operands.incl op.desc

func codeFragment(a: var Assembler_x86, instr: string, imm: int, reg: Register) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  if a.wordBitWidth == 64:
    a.code &= instr & "q $" & $imm & ", %%" & $reg & '\n'
  else:
    a.code &= instr & "l $" & $imm & ", %%" & $reg & '\n'

func codeFragment(a: var Assembler_x86, instr: string, reg0, reg1: Register) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  if a.wordBitWidth == 64:
    a.code &= instr & "q %%" & $reg0 & ", %%" & $reg1 & '\n'
  else:
    a.code &= instr & "l %%" & $reg0 & ", %%" & $reg1 & '\n'

func codeFragment(a: var Assembler_x86, instr: string, imm: int, reg: OperandReuse) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  if a.wordBitWidth == 64:
    a.code &= instr & "q $" & $imm & ", %" & $reg.asmId & '\n'
  else:
    a.code &= instr & "l $" & $imm & ", %" & $reg.asmId & '\n'

func codeFragment(a: var Assembler_x86, instr: string, reg0, reg1: OperandReuse) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  if a.wordBitWidth == 64:
    a.code &= instr & "q %" & $reg0.asmId & ", %" & $reg1.asmId & '\n'
  else:
    a.code &= instr & "l %" & $reg0.asmId & ", %" & $reg1.asmId & '\n'

func codeFragment(a: var Assembler_x86, instr: string, reg0: OperandReuse, reg1: Operand) =
  # Generate a code fragment
  # ⚠️ Warning:
  # The caller should deal with destination/source operand
  # so that it fits GNU Assembly
  if a.wordBitWidth == 64:
    a.code &= instr & "q %" & $reg0.asmId & ", %" & $reg1.desc.asmId & '\n'
  else:
    a.code &= instr & "l %" & $reg0.asmId & ", %" & $reg1.desc.asmId & '\n'

  a.operands.incl reg1.desc
func reuseRegister*(reg: OperandArray): OperandReuse =
  # TODO: disable the reg input
  doAssert reg.buf[0].desc.constraint == InputOutput
  result.asmId = reg.buf[0].desc.asmId

func comment*(a: var Assembler_x86, comment: string) =
  # Add a comment
  a.code &= "# " & comment & '\n'

func repackRegisters*(regArr: OperandArray, regs: varargs[Operand]): OperandArray =
  ## Extend an array of registers with extra registers
  result.buf = regArr.buf
  result.buf.add regs
  result.nimSymbol = nil
# Instructions
# ------------------------------------------------------------------------------------------------------------

func add*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- dst + src
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("add", src, dst)
  a.areFlagsClobbered = true

func adc*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- dst + src + carry
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("adc", src, dst)
  a.areFlagsClobbered = true

  if dst.desc.rm != Reg:
    {.warning: "Using addcarry with a memory destination, this incurs significant performance penalties.".}
func adc*(a: var Assembler_x86, dst: Operand, imm: int) =
  ## Does: dst <- dst + imm + carry
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("adc", imm, dst)
  a.areFlagsClobbered = true

  if dst.desc.rm != Reg:
    {.warning: "Using addcarry with a memory destination, this incurs significant performance penalties.".}
func sub*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- dst - src
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("sub", src, dst)
  a.areFlagsClobbered = true

func sbb*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- dst - src - borrow
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("sbb", src, dst)
  a.areFlagsClobbered = true

  if dst.desc.rm != Reg:
    {.warning: "Using subborrow with a memory destination, this incurs significant performance penalties.".}

func sbb*(a: var Assembler_x86, dst: Operand, imm: int) =
  ## Does: dst <- dst - imm - borrow
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("sbb", imm, dst)
  a.areFlagsClobbered = true

  if dst.desc.rm != Reg:
    {.warning: "Using subborrow with a memory destination, this incurs significant performance penalties.".}

func sbb*(a: var Assembler_x86, dst: Register, imm: int) =
  ## Does: dst <- dst - imm - borrow
  a.codeFragment("sbb", imm, dst)
  a.areFlagsClobbered = true

func sbb*(a: var Assembler_x86, dst, src: Register) =
  ## Does: dst <- dst - src - borrow
  a.codeFragment("sbb", src, dst)
  a.areFlagsClobbered = true

func sbb*(a: var Assembler_x86, dst: OperandReuse, imm: int) =
  ## Does: dst <- dst - imm - borrow
  a.codeFragment("sbb", imm, dst)
  a.areFlagsClobbered = true

func sbb*(a: var Assembler_x86, dst, src: OperandReuse) =
  ## Does: dst <- dst - src - borrow
  a.codeFragment("sbb", src, dst)
  a.areFlagsClobbered = true
func sar*(a: var Assembler_x86, dst: Operand, imm: int) =
  ## Does Arithmetic Right Shift (i.e. with sign extension)
  doAssert dst.desc.constraint in OutputReg
  a.codeFragment("sar", imm, dst)
  a.areFlagsClobbered = true

func `and`*(a: var Assembler_x86, dst: OperandReuse, imm: int) =
  ## Compute the bitwise AND of dst and imm and
  ## set the Sign, Zero and Parity flags
  a.codeFragment("and", imm, dst)
  a.areFlagsClobbered = true

func `and`*(a: var Assembler_x86, dst, src: Operand) =
  ## Compute the bitwise AND of dst and src and
  ## set the Sign, Zero and Parity flags
  a.codeFragment("and", src, dst)
  a.areFlagsClobbered = true

func `and`*(a: var Assembler_x86, dst: Operand, src: OperandReuse) =
  ## Compute the bitwise AND of dst and src and
  ## set the Sign, Zero and Parity flags
  a.codeFragment("and", src, dst)
  a.areFlagsClobbered = true

func test*(a: var Assembler_x86, x, y: Operand) =
  ## Compute the bitwise AND of x and y and
  ## set the Sign, Zero and Parity flags
  a.codeFragment("test", x, y)
  a.areFlagsClobbered = true

func `xor`*(a: var Assembler_x86, x, y: Operand) =
  ## Compute the bitwise xor of x and y and
  ## reset all flags
  a.codeFragment("xor", x, y)
  a.areFlagsClobbered = true
func mov*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- src
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("mov", src, dst)
  # No clobber

func mov*(a: var Assembler_x86, dst: Operand, imm: int) =
  ## Does: dst <- imm
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("mov", imm, dst)
  # No clobber

func cmovc*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- src if the carry flag is set
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("cmovc", src, dst)
  # No clobber

func cmovnc*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- src if the carry flag is not set
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  doAssert dst.desc.constraint in {Output_EarlyClobber, InputOutput, Output_Overwrite}, $dst.repr

  a.codeFragment("cmovnc", src, dst)
  # No clobber

func cmovz*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- src if the zero flag is set
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("cmovz", src, dst)
  # No clobber

func cmovnz*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- src if the zero flag is not set
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("cmovnz", src, dst)
  # No clobber

func cmovs*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- src if the sign flag is set
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("cmovs", src, dst)
  # No clobber
func mul*(a: var Assembler_x86, dHi, dLo: Register, src0: Operand, src1: Register) =
  ## Does (dHi, dLo) <- src0 * src1
  doAssert src1 == rax, "MUL requires the RAX register"
  doAssert dHi == rdx, "MUL requires the RDX register"
  doAssert dLo == rax, "MUL requires the RAX register"

  a.codeFragment("mul", src0)

func imul*(a: var Assembler_x86, dst, src: Operand) =
  ## Does dst <- dst * src, keeping only the low half
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  doAssert dst.desc.constraint in OutputReg, $dst.repr

  a.codeFragment("imul", src, dst)

func mulx*(a: var Assembler_x86, dHi, dLo, src0: Operand, src1: Register) =
  ## Does (dHi, dLo) <- src0 * src1
  doAssert src1 == rdx, "MULX requires the RDX register"
  doAssert dHi.desc.rm in {Reg, ElemsInReg} or dHi.desc.rm in SpecificRegisters,
    "The destination operand must be a register " & $dHi.repr
  doAssert dLo.desc.rm in {Reg, ElemsInReg} or dLo.desc.rm in SpecificRegisters,
    "The destination operand must be a register " & $dLo.repr
  doAssert dHi.desc.constraint in OutputReg
  doAssert dLo.desc.constraint in OutputReg

  let off0 = a.getStrOffset(src0)

  # Annoying AT&T syntax
  if a.wordBitWidth == 64:
    a.code &= "mulxq " & off0 & ", %" & $dLo.desc.asmId & ", %" & $dHi.desc.asmId & '\n'
  else:
    a.code &= "mulxl " & off0 & ", %" & $dLo.desc.asmId & ", %" & $dHi.desc.asmId & '\n'

  a.operands.incl src0.desc
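# Illustrative: with 64-bit words the fragment above emits, e.g.,
#   mulxq 8(%[b]), %[t0], %[t1]
# i.e. the full product RDX * b[1], low half in t0 and high half in t1,
# without touching any flags.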
func adcx*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- dst + src + carry
  ## and only sets the carry flag
  doAssert dst.desc.constraint in OutputReg, $dst.repr
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  a.codeFragment("adcx", src, dst)
  a.areFlagsClobbered = true

func adox*(a: var Assembler_x86, dst, src: Operand) =
  ## Does: dst <- dst + src + overflow
  ## and only sets the overflow flag
  doAssert dst.desc.constraint in OutputReg, $dst.repr
  doAssert dst.desc.rm in {Reg, ElemsInReg}, "The destination operand must be a register: " & $dst.repr
  a.codeFragment("adox", src, dst)
  a.areFlagsClobbered = true
func push*(a: var Assembler_x86, _: type Stack, reg: Operand) =
  ## Push the content of a register on the stack
  doAssert reg.desc.rm in {Reg, PointerInReg, ElemsInReg}+SpecificRegisters, "The pushed operand must be a register: " & $reg.repr
  a.codeFragment("push", reg)
  a.isStackClobbered = true

func pop*(a: var Assembler_x86, _: type Stack, reg: Operand) =
  ## Pop the top of the stack into a register
  doAssert reg.desc.rm in {Reg, PointerInReg, ElemsInReg}+SpecificRegisters, "The destination operand must be a register: " & $reg.repr
  a.codeFragment("pop", reg)
  a.isStackClobbered = true
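# ----------------------------------------------------------------------------
# Illustrative usage (a sketch, not part of this file): the primitives above
# compose inside a macro that binds the Nim symbols of the limb arrays as
# operands, chains add/adc, and splices the generated emit pragma back into
# the caller. `addCarry4_gen` is a hypothetical name.

macro addCarry4_gen(r, a: untyped): untyped =
  var ctx = Assembler_x86.init(uint64)
  let rr = OperandArray.init(r, 4, PointerInReg, InputOutput)
  let aa = OperandArray.init(a, 4, PointerInReg, Input)

  ctx.add rr[0], aa[0]      # r[0] <- r[0] + a[0], sets the carry flag
  for i in 1 ..< 4:
    ctx.adc rr[i], aa[i]    # r[i] <- r[i] + a[i] + carry

  result = ctx.generate()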
@ -1,45 +0,0 @@
# Compiler for generic inline assembly code-generation

This folder holds alternative implementations of primitives
that use inline assembly.

This avoids the pitfalls of traditional compilers' bad code generation
for multiprecision arithmetic (see GCC https://gcc.godbolt.org/z/2h768y)
or unsupported features like handling 2 carry chains for
multiplication using MULX/ADOX/ADCX.

To be generic over multiple curves,
for example BN254 requires 4 words and BLS12-381 requires 6 words of size 64 bits,
the compiler is implemented as a set of macros that generate inline assembly, as sketched below.
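As a minimal sketch (reusing the proof-of-concept `addCarryGen_u64` macro kept further below), the call site stays generic: the limb count is derived from the type's bit size at compile time, so a 4-limb and a 6-limb field share one code path:

```
func `+=`(a: var BigInt, b: BigInt) =
  addCarryGen_u64(a, b, BigInt.bits)
```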
⚠⚠⚠ Warning! Warning! Warning!

This is a significant sacrifice of code readability, portability, auditability and maintainability in favor of performance.

This combines 2 of the most notorious ways to obfuscate your code:
* metaprogramming and macros
* inline assembly

Adventurers beware: not for the faint of heart.

This is unfinished, untested, unused, unfuzzed and just a proof-of-concept at the moment.*

_* I take no responsibility if this smashes your stack, eats your cat, hides a skeleton in your closet, warps a pink elephant in the room, summons untold eldritch horrors or causes the heat death of the universe. You have been warned._

_The road to debugging hell is paved with metaprogrammed assembly optimizations._

_For my defence, OpenSSL assembly is generated by a Perl script and neither Perl nor the generated Assembly are type-checked by a dependently-typed compiler._

## References

Multiprecision (Montgomery) Multiplication & Squaring in Assembly

- Intel MULX/ADCX/ADOX Table 2 p13: https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
- Squaring: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf
- https://eprint.iacr.org/eprint-bin/getfile.pl?entry=2017/558&version=20170608:200345&file=558.pdf
- https://github.com/intel/ipp-crypto
- https://github.com/herumi/mcl

Experimentations in Nim

- https://github.com/mratsim/finite-fields
@ -1,133 +0,0 @@
# Constantine
# Copyright (c) 2018-2019 Status Research & Development GmbH
# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
# Licensed and distributed under either of
#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
# at your option. This file may not be copied, modified, or distributed except according to those terms.

# ############################################################
#
#        Add-with-carry and Sub-with-borrow
#
# ############################################################
#
# This is a proof-of-concept optimal add-with-carry
# compiler implemented as Nim macros.
#
# This overcomes the bad GCC codegen even with the addcarry_u64 intrinsic.

import std/macros

func wordsRequired(bits: int): int {.compileTime.} =
  ## Compute the number of limbs required
  ## from the announced bit length
  (bits + 64 - 1) div 64

type
  BigInt[bits: static int] {.byref.} = object
    ## BigInt
    ## Enforce passing by reference, otherwise uint128 are passed by stack
    ## which causes issues with the inline assembly
    limbs: array[bits.wordsRequired, uint64]

macro addCarryGen_u64(a, b: untyped, bits: static int): untyped =
  var asmStmt = (block:
    "      movq %[b], %[tmp]\n" &
    "      addq %[tmp], %[a]\n"
  )

  let maxByteOffset = bits div 8
  const wsize = sizeof(uint64)

  when defined(gcc):
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        "      movq " & $byteOffset & "+%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        "      adcq %[tmp], " & $byteOffset & "+%[a]\n"
      )
  elif defined(clang):
    # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
      asmStmt.add (block:
        "\n" &
        # movq 8+%[b], %[tmp]
        "      movq " & $byteOffset & "%[b], %[tmp]\n" &
        # adcq %[tmp], 8+%[a]
        "      adcq %[tmp], " & $byteOffset & "%[a]\n"
      )

  let tmp = ident("tmp")
  asmStmt.add (block:
    ": [tmp] \"+r\" (`" & $tmp & "`), [a] \"+m\" (`" & $a & "->limbs[0]`)\n" &
    ": [b] \"m\"(`" & $b & "->limbs[0]`)\n" &
    ": \"cc\""
  )

  result = newStmtList()
  result.add quote do:
    var `tmp`{.noinit.}: uint64

  result.add nnkAsmStmt.newTree(
    newEmptyNode(),
    newLit asmStmt
  )

  echo result.toStrLit

func `+=`(a: var BigInt, b: BigInt) {.noinline.}=
  # Depending on inline or noinline,
  # the generated ASM addressing must be tweaked for Clang
  # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
  addCarryGen_u64(a, b, BigInt.bits)

# #############################################
when isMainModule:
  import std/random
  proc rand(T: typedesc[BigInt]): T =
    for i in 0 ..< result.limbs.len:
      result.limbs[i] = uint64(rand(high(int)))

  proc main() =
    block:
      let a = BigInt[128](limbs: [high(uint64), 0])
      let b = BigInt[128](limbs: [1'u64, 0])

      echo "a: ", a
      echo "b: ", b
      echo "------------------------------------------------------"

      var a1 = a
      a1 += b
      echo a1
      echo "======================================================"

    block:
      let a = rand(BigInt[256])
      let b = rand(BigInt[256])

      echo "a: ", a
      echo "b: ", b
      echo "------------------------------------------------------"

      var a1 = a
      a1 += b
      echo a1
      echo "======================================================"

    block:
      let a = rand(BigInt[384])
      let b = rand(BigInt[384])

      echo "a: ", a
      echo "b: ", b
      echo "------------------------------------------------------"

      var a1 = a
      a1 += b
      echo a1

  main()
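# Illustrative (hand-expanded from the GCC branch above): for BigInt[128] the
# macro generates an asm statement shaped like
#
#     movq %[b], %[tmp]
#     addq %[tmp], %[a]
#     movq 8+%[b], %[tmp]
#     adcq %[tmp], 8+%[a]
#   : [tmp] "+r" (`tmp`), [a] "+m" (`a->limbs[0]`)
#   : [b] "m"(`b->limbs[0]`)
#   : "cc"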
Binary file not shown.
@ -20,22 +20,13 @@ import
echo "\n------------------------------------------------------\n"

var RNG {.compileTime.} = initRand(1234)

const CurveParams = [
  P224,
  BN254_Nogami,
  BN254_Snarks,
  Curve25519,
  P256,
  Secp256k1,
  BLS12_377,
  BLS12_381,
  BN446,
  FKM12_447,
  BLS12_461,
  BN462
]

const AvailableCurves = [P224, BN254_Nogami, BN254_Snarks, P256, Secp256k1, BLS12_381]

const AvailableCurves = [
  P224,
  BN254_Nogami, BN254_Snarks,
  P256, Secp256k1,
  BLS12_381
]

const # https://gmplib.org/manual/Integer-Import-and-Export.html
  GMP_WordLittleEndian = -1'i32
@ -140,6 +140,14 @@ proc main() =

      check: p == hex

  test "Round trip on prime field of BN254 Snarks curve":
    block: # 2^126
      const p = "0x0000000000000000000000000000000040000000000000000000000000000000"
      let x = Fp[BN254_Snarks].fromBig BigInt[254].fromHex(p)
      let hex = x.toHex(bigEndian)

      check: p == hex

  test "Round trip on prime field of BLS12_381 curve":
    block: # 2^126
      const p = "0x000000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000"