Internals refactor + renewed focus on perf (#17)

* Lay out the refactoring objectives and tradeoffs * Refactor the 32 and 64-bit primitives [skip ci] * BigInts and Modular BigInts compile * Make the bigints test compile * Fix modular reduction * Fix reduction tests vs GMP * Implement montegomery mul, pow, inverse, WIP finite field compilation * Make FiniteField compile * Fix exponentiation compilation * Fix Montgomery magic constant computation for 2^64 words * Fix typo in non-optimized CIOS - passing finite fields IO tests * Add limbs comparisons [skip ci] * Fix on precomputation of the Montgomery magic constant * Passing all tests including 𝔽p2 * modular addition, the test for mersenne prime was wrong * update benches * Fix "nimble test" + typo on out-of-place field addition * bigint division, normalization is needed: https://travis-ci.com/github/mratsim/constantine/jobs/298359743 * missing conversion in subborrow non-x86 fallback - https://travis-ci.com/github/mratsim/constantine/jobs/298359744 * Fix little-endian serialization * Constantine32 flag to run 32-bit constantine on 64-bit machines * IO Field test, ensure that BaseType is used instead of uint64 when the prime can field in uint32 * Implement proper addcarry and subborrow fallback for the compile-time VM * Fix export issue when the logical wordbitwidth == physical wordbitwidth - passes all tests (32-bit and 64-bit) * Fix uint128 on ARM * Fix C++ conditional copy and ARM addcarry/subborrow * Add investigation for SIGFPE in Travis * Fix debug display for unsafeDiv2n1n * multiplexer typo * moveMem bug in glibc of Ubuntu 16.04? * Was probably missing an early clobbered register annotation on conditional mov * Note on Montgomery-friendly moduli * Strongly suspect a GCC before GCC 7 codegen bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87139) * hex conversion was (for debugging) not taking requested order into account + inlining comment * Use 32-bit limbs on ARM64, uint128 builtin __udivti4 bug? * Revert "Use 32-bit limbs on ARM64, uint128 builtin __udivti4 bug?" This reverts commit 087f9aa7fb40bbd058d05cbd8eec7fc082911f49. * Fix subborrow fallback for non-x86 (need to maks the borrow)
2025-03-04 04:10:40 +00:00 · 2020-03-16 16:33:51 +01:00 · 2020-03-16 16:33:51 +01:00 · 4ff0e3d90b
commit 4ff0e3d90b
parent 191bb7710c
44 changed files with 2413 additions and 1385 deletions
--- a/.travis.yml
+++ b/.travis.yml
@ -11,21 +11,23 @@ matrix:
    # Constantine only works with Nim devel
    # Build and test using both gcc and clang
    # Build and test on both x86-64 and ARM64
-    - os: linux
+    # Ubuntu Bionic (18.04) is needed, it includes
+    # GCC 7 codegen fixes to addcarry_u64.
+    - dist: bionic
      arch: amd64
      env:
        - ARCH=amd64
        - CHANNEL=devel
      compiler: gcc

-    - os: linux
+    - dist: bionic
      arch: arm64
      env:
        - ARCH=arm64
        - CHANNEL=devel
      compiler: gcc

-    - os: linux
+    - dist: bionic
      arch: amd64
      env:
        - ARCH=amd64
--- a/README.md
+++ b/README.md
@ -18,14 +18,20 @@ You can install the developement version of the library through nimble with the
 nimble install https://github.com/mratsim/constantine@#master
 ```

+For speed it is recommended to prefer Clang, MSVC or ICC over GCC.
+GCC does not properly optimize add-with-carry and sub-with-borrow loops (see [Compiler-caveats](#Compiler-caveats)).
+
+Further if using GCC, GCC 7 at minimum is required, previous versions
+generated incorrect add-with-carry code.
+
 ## Target audience

 The library aims to be a portable, compact and hardened library for elliptic curve cryptography needs, in particular for blockchain protocols and zero-knowledge proofs system.

 The library focuses on following properties:
 - constant-time (not leaking secret data via side-channels)
- generated code size, datatype size and stack usage
 - performance
+- generated code size, datatype size and stack usage

 in this order

@ -54,6 +60,128 @@ actively hinder you by:
 A growing number of attack vectors is being collected for your viewing pleasure
 at https://github.com/mratsim/constantine/wiki/Constant-time-arithmetics

+## Performance
+
+High-performance is a sought out property.
+Note that security and side-channel resistance takes priority over performance.
+
+New applications of elliptic curve cryptography like zero-knowledge proofs or
+proof-of-stake based blockchain protocols are bottlenecked by cryptography.
+
+### In blockchain
+
+Ethereum 2 clients spent or use to spend anywhere between 30% to 99% of their processing time verifying the signatures of block validators on R&D testnets
+Assuming we want nodes to handle a thousand peers, if a cryptographic pairing takes 1ms, that represents 1s of cryptography per block to sign with a target
+block frequency of 1 every 6 seconds.
+
+### In zero-knowledge proofs
+
+According to https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0
+a 16-core CPU can prove 20 transfers/second or 10 transactions/second.
+The previous implementation was 15x slower and one of the key optimizations
+was changing the elliptic curve cryptography backend.
+It had a direct implication on hardware cost and/or cloud computing resources required.
+
+### Compiler caveats
+
+Unfortunately compilers and in particular GCC are not very good at optimizing big integers and/or cryptographic code even when using intrinsics like `addcarry_u64`.
+
+Compilers with proper support of `addcarry_u64` like Clang, MSVC and ICC
+may generate code up to 20~25% faster than GCC.
+
+This is explained by the GMP team: https://gmplib.org/manual/Assembly-Carry-Propagation.html
+and can be reproduced with the following C code.
+
+See https://gcc.godbolt.org/z/2h768y
+```C
+#include <stdint.h>
+#include <x86intrin.h>
+
+void add256(uint64_t a[4], uint64_t b[4]){
+  uint8_t carry = 0;
+  for (int i = 0; i < 4; ++i)
+    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
+}
+```
+
+GCC
+```asm
+add256:
+        movq    (%rsi), %rax
+        addq    (%rdi), %rax
+        setc    %dl
+        movq    %rax, (%rdi)
+        movq    8(%rdi), %rax
+        addb    $-1, %dl
+        adcq    8(%rsi), %rax
+        setc    %dl
+        movq    %rax, 8(%rdi)
+        movq    16(%rdi), %rax
+        addb    $-1, %dl
+        adcq    16(%rsi), %rax
+        setc    %dl
+        movq    %rax, 16(%rdi)
+        movq    24(%rsi), %rax
+        addb    $-1, %dl
+        adcq    %rax, 24(%rdi)
+        ret
+```
+
+Clang
+```asm
+add256:
+        movq    (%rsi), %rax
+        addq    %rax, (%rdi)
+        movq    8(%rsi), %rax
+        adcq    %rax, 8(%rdi)
+        movq    16(%rsi), %rax
+        adcq    %rax, 16(%rdi)
+        movq    24(%rsi), %rax
+        adcq    %rax, 24(%rdi)
+        retq
+```
+
+### Inline assembly
+
+Constantine uses inline assembly for a very restricted use-case: "conditional mov",
+and a temporary use-case "hardware 128-bit division" that will be replaced ASAP (as hardware division is not constant-time).
+
+Using intrinsics otherwise significantly improve code readability, portability, auditability and maintainability.
+
+#### Future optimizations
+
+In the future more inline assembly primitives might be added provided the performance benefit outvalues the significant complexity.
+In particular, multiprecision multiplication and squaring on x86 can use the instructions MULX, ADCX and ADOX
+to multiply-accumulate on 2 carry chains in parallel (with instruction-level parallelism)
+and improve performance by 15~20% over an uint128-based implementation.
+As no compiler is able to generate such code even when using the `_mulx_u64` and `_addcarryx_u64` intrinsics,
+either the assembly for each supported bigint size must be hardcoded
+or a "compiler" must be implemented in macros that will generate the required inline assembly at compile-time.
+
+Such a compiler can also be used to overcome GCC codegen deficiencies, here is an example for add-with-carry:
+https://github.com/mratsim/finite-fields/blob/d7f6d8bb/macro_add_carry.nim
+
+## Sizes: code size, stack usage
+
+Thanks to 10x smaller key sizes for the same security level as RSA, elliptic curve cryptography
+is widely used on resource-constrained devices.
+
+Constantine is actively optimize for code-size and stack usage.
+Constantine does not use heap allocation.
+
+At the moment Constantine is optimized for 32-bit and 64-bit CPUs.
+
+When performance and code size conflicts, a careful and informed default is chosen.
+In the future, a compile-time flag that goes beyond the compiler `-Os` might be provided.
+
+### Example tradeoff
+
+Unrolling Montgomery Multiplication brings about 15% performance improvement
+which translate to ~15% on all operations in Constantine as field multiplication bottlenecks
+all cryptographic primitives.
+This is considered a worthwhile tradeoff on all but the most constrained CPUs
+with those CPUs probably being 8-bit or 16-bit.
+
 ## License

 Licensed and distributed under either of
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@ -19,12 +19,12 @@ strategy:
      CHANNEL: devel
      TEST_LANG: cpp
    Linux_devel_64bit:
-      VM: 'ubuntu-16.04'
+      VM: 'ubuntu-18.04'
      UCPU: amd64
      CHANNEL: devel
      TEST_LANG: c
    Linux_cpp_devel_64bit:
-      VM: 'ubuntu-16.04'
+      VM: 'ubuntu-18.04'
      UCPU: amd64
      CHANNEL: devel
      WEAVE_TEST_LANG: cpp
--- a/BIN
+++ b/BIN
--- a/BIN
+++ b/BIN
--- a/benchmarks/bls12_381_fp.nim
+++ b/benchmarks/bls12_381_fp.nim
@ -23,7 +23,7 @@

 import
  ../constantine/config/[common, curves],
-  ../constantine/arithmetic/[bigints_checked, finite_fields],
+  ../constantine/arithmetic/[bigints, finite_fields],
  ../constantine/io/[io_bigints, io_fields],
  random, std/monotimes, times, strformat,
  ./timers
--- a/benchmarks/bn254_fp.nim
+++ b/benchmarks/bn254_fp.nim
@ -23,7 +23,7 @@

 import
  ../constantine/config/[common, curves],
-  ../constantine/arithmetic/[bigints_checked, finite_fields],
+  ../constantine/arithmetic/[bigints, finite_fields],
  ../constantine/io/[io_bigints, io_fields],
  random, std/monotimes, times, strformat,
  ./timers
--- a/benchmarks/secp256k1_fp.nim
+++ b/benchmarks/secp256k1_fp.nim
@ -23,7 +23,7 @@

 import
  ../constantine/config/[common, curves],
-  ../constantine/arithmetic/[bigints_checked, finite_fields],
+  ../constantine/arithmetic/[bigints, finite_fields],
  ../constantine/io/[io_bigints, io_fields],
  random, std/monotimes, times, strformat,
  ./timers
--- a/constantine.nimble
+++ b/constantine.nimble
@ -9,7 +9,7 @@ srcDir        = "src"
 requires "nim >= 1.1.0"

 ### Helper functions
-proc test(path: string) =
+proc test(flags, path: string) =
  if not dirExists "build":
    mkDir "build"
  # Compilation language is controlled by WEAVE_TEST_LANG
@ -17,35 +17,64 @@ proc test(path: string) =
  if existsEnv"TEST_LANG":
    lang = getEnv"TEST_LANG"

+  var cc = ""
+  if existsEnv"CC":
+    cc = " --cc:" & getEnv"CC"
+
  echo "\n========================================================================================"
-  echo "Running ", path
+  echo "Running [flags: ", flags, "] ", path
  echo "========================================================================================"
-  exec "nim " & lang & " --verbosity:0 --outdir:build -r --hints:off --warnings:off " & path
+  exec "nim " & lang & cc & " " & flags & " --verbosity:0 --outdir:build -r --hints:off --warnings:off " & path

 ### tasks
 task test, "Run all tests":
  # -d:testingCurves is configured in a *.nim.cfg for convenience
-  test "tests/test_primitives.nim"
+  test "", "tests/test_primitives.nim"

-  test "tests/test_io_bigints.nim"
-  test "tests/test_bigints.nim"
-  test "tests/test_bigints_multimod.nim"
+  test "", "tests/test_io_bigints.nim"
+  test "", "tests/test_bigints.nim"
+  test "", "tests/test_bigints_multimod.nim"

-  test "tests/test_io_fields"
-  test "tests/test_finite_fields.nim"
-  test "tests/test_finite_fields_powinv.nim"
+  test "", "tests/test_io_fields"
+  test "", "tests/test_finite_fields.nim"
+  test "", "tests/test_finite_fields_powinv.nim"

-  test "tests/test_bigints_vs_gmp.nim"
-  test "tests/test_finite_fields_vs_gmp.nim"
+  test "", "tests/test_bigints_vs_gmp.nim"
+  test "", "tests/test_finite_fields_vs_gmp.nim"
+
+  if sizeof(int) == 8: # 32-bit tests
+    test "-d:Constantine32", "tests/test_primitives.nim"
+
+    test "-d:Constantine32", "tests/test_io_bigints.nim"
+    test "-d:Constantine32", "tests/test_bigints.nim"
+    test "-d:Constantine32", "tests/test_bigints_multimod.nim"
+
+    test "-d:Constantine32", "tests/test_io_fields"
+    test "-d:Constantine32", "tests/test_finite_fields.nim"
+    test "-d:Constantine32", "tests/test_finite_fields_powinv.nim"
+
+    test "-d:Constantine32", "tests/test_bigints_vs_gmp.nim"
+    test "-d:Constantine32", "tests/test_finite_fields_vs_gmp.nim"

 task test_no_gmp, "Run tests that don't require GMP":
  # -d:testingCurves is configured in a *.nim.cfg for convenience
-  test "tests/test_primitives.nim"
+  test "", "tests/test_primitives.nim"

-  test "tests/test_io_bigints.nim"
-  test "tests/test_bigints.nim"
-  test "tests/test_bigints_multimod.nim"
+  test "", "tests/test_io_bigints.nim"
+  test "", "tests/test_bigints.nim"
+  test "", "tests/test_bigints_multimod.nim"

-  test "tests/test_io_fields"
-  test "tests/test_finite_fields.nim"
-  test "tests/test_finite_fields_powinv.nim"
+  test "", "tests/test_io_fields"
+  test "", "tests/test_finite_fields.nim"
+  test "", "tests/test_finite_fields_powinv.nim"
+
+  if sizeof(int) == 8: # 32-bit tests
+    test "-d:Constantine32", "tests/test_primitives.nim"
+
+    test "-d:Constantine32", "tests/test_io_bigints.nim"
+    test "-d:Constantine32", "tests/test_bigints.nim"
+    test "-d:Constantine32", "tests/test_bigints_multimod.nim"
+
+    test "-d:Constantine32", "tests/test_io_fields"
+    test "-d:Constantine32", "tests/test_finite_fields.nim"
+    test "-d:Constantine32", "tests/test_finite_fields_powinv.nim"
--- a/constantine/arithmetic/README.md
+++ b/constantine/arithmetic/README.md
@ -4,6 +4,17 @@ This folder contains the implementation of
 - big integers
 - finite field arithmetic (i.e. modular arithmetic)

+As a tradeoff between speed, code size and compiler-enforced dependent type checking, the library is structured the following way:
+- Finite Field: statically parametrized by an elliptic curve
+- Big Integers: statically parametrized by the bit width of the field modulus
+- Limbs: statically parametrized by the number of words to handle the bitwidth
+
+This allows to reuse the same implementation at the limbs-level for
+curves that required the same number of words to save on code size,
+for example secp256k1 and BN254.
+It also enables compiler unrolling, inlining and register optimization,
+where code size is not an issue for example for multi-precision addition.
+
 ## References

 - Analyzing and Comparing Montgomery Multiplication Algorithms
@ -18,3 +29,7 @@ This folder contains the implementation of
  Chapter 5 of Guide to Pairing-Based Cryptography\
  Jean Luc Beuchat, Luis J. Dominguez Perez, Sylvain Duquesne, Nadia El Mrabet, Laura Fuentes-Castañeda, Francisco Rodríguez-Henríquez, 2017\
  https://www.researchgate.net/publication/319538235_Arithmetic_of_Finite_Fields
+
+- Faster big-integer modular multiplication for most moduli\
+  Gautam Botrel, Gus Gutoski, and Thomas Piellard, 2020\
+  https://hackmd.io/@zkteam/modular_multiplication
--- a/constantine/arithmetic/bigints_checked.nim
+++ b/constantine/arithmetic/bigints_checked.nim
@ -7,32 +7,59 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import
-  ./bigints_raw,
-  ../primitives/constant_time,
-  ../config/common
+  ../config/common,
+  ../primitives,
+  ./limbs,
+  ./montgomery

 # ############################################################
 #
-#                BigInts type-checked API
+#                        BigInts
 #
 # ############################################################

-# The "checked" API is exported as a building blocks
-# with enforced compile-time checking of BigInt bitsize
+# The API is exported as a building block
+# with enforced compile-time checking of BigInt bitwidth
 # and memory ownership.
+
+# ############################################################
+# Design
 #
-# The "raw" compute API uses views to avoid code duplication
-# due to generic/static monomorphization.
+# Control flow should only depends on the static maximum number of bits
+# This number is defined per Finite Field/Prime/Elliptic Curve
 #
-# The "checked" API is a thin wrapper above the "raw" API to get the best of both world:
-# - small code footprint
-# - compiler enforced checks: types, bitsizes (dependant types)
-# - compiler enforced memory: stack allocation and buffer ownership
+# Data Layout
+#
+# The previous implementation of Constantine used type-erased views
+# to optimized code-size (1)
+# Also instead of using the full 64-bit of an uint64 it used
+# 63-bit with the last bit to handle carries (2)
+#
+# (1) brought an advantage in terms of code-size if multiple curves
+# were supported.
+# However it prevented unrolling for some performance critical routines
+# like addition and Montgomery multiplication. Furthermore, addition
+# is only 1 or 2 instructions per limbs meaning unrolling+inlining
+# is probably smaller in code-size than a function call.
+#
+# (2) Not using the full 64-bit eased carry and borrow handling.
+# Also on older x86 Arch, the add-with-carry "ADC" instruction
+# may be up to 6x slower than plain "ADD" with memory operand in a carry-chain.
+#
+# However, recent CPUs (less than 5 years) have reasonable or lower ADC latencies
+# compared to the shifting and masking required when using 63 bits.
+# Also we save on words to iterate on (1 word for BN254, secp256k1, BLS12-381)
+#
+# Furthermore, pairing curves are not fast-reduction friendly
+# meaning that lazy reductions and lazy carries are impractical
+# and so it's simpler to always carry additions instead of
+# having redundant representations that forces costly reductions before multiplications.
+# https://github.com/mratsim/constantine/issues/15

 func wordsRequired(bits: int): int {.compileTime.} =
  ## Compute the number of limbs required
  # from the **announced** bit length
-  (bits + WordBitSize - 1) div WordBitSize
+  (bits + WordBitWidth - 1) div WordBitWidth

 type
  BigInt*[bits: static int] = object
@ -41,28 +68,18 @@ type
    ## - "bits" is the announced bit-length of the BigInt
    ##   This is public data, usually equal to the curve prime bitlength.
    ##
-    ## - "bitLength" is the internal bitlength of the integer
-    ##   This differs from the canonical bit-length as
-    ##   Constantine word-size is smaller than a machine word.
-    ##   This value should never be used as-is to prevent leaking secret data.
-    ##   Computing this value requires constant-time operations.
-    ##   Using this value requires converting it to the # of limbs in constant-time
-    ##
    ## - "limbs" is an internal field that holds the internal representation
    ##   of the big integer. Least-significant limb first. Within limbs words are native-endian.
    ##
    ## This internal representation can be changed
    ## without notice and should not be used by external applications or libraries.
-    bitLength*: uint32
    limbs*: array[bits.wordsRequired, Word]

-template view*(a: BigInt): BigIntViewConst =
-  ## Returns a borrowed type-erased immutable view to a bigint
-  BigIntViewConst(cast[BigIntView](a.unsafeAddr))
-
-template view*(a: var BigInt): BigIntViewMut =
-  ## Returns a borrowed type-erased mutable view to a mutable bigint
-  BigIntViewMut(cast[BigIntView](a.addr))
+# For unknown reason, `bits` doesn't semcheck if
+#   `limbs: Limbs[bits.wordsRequired]`
+# with
+#   `Limbs[N: static int] = distinct array[N, Word]`
+# so we don't set Limbs as a distinct type

 debug:
  import strutils
@ -70,9 +87,7 @@ debug:
  func `$`*(a: BigInt): string =
    result = "BigInt["
    result.add $BigInt.bits
-    result.add "](bitLength: "
-    result.add $a.bitLength
-    result.add ", limbs: ["
+    result.add "](limbs: ["
    result.add $BaseType(a.limbs[0]) & " (0x" & toHex(BaseType(a.limbs[0])) & ')'
    for i in 1 ..< a.limbs.len:
      result.add ", "
@ -83,54 +98,40 @@ debug:
 {.push raises: [].}
 {.push inline.}

-func setInternalBitLength*(a: var BigInt) =
-  ## Derive the actual bitsize used internally of a BigInt
-  ## from the announced BigInt bitsize
-  ## and set the bitLength field of that BigInt
-  ## to that computed value.
-  a.bitLength = uint32 static(a.bits + a.bits div WordBitSize)
-
 func `==`*(a, b: BigInt): CTBool[Word] =
  ## Returns true if 2 big ints are equal
  ## Comparison is constant-time
-  var accum: Word
-  for i in static(0 ..< a.limbs.len):
-    accum = accum or (a.limbs[i] xor b.limbs[i])
-  result = accum.isZero
+  a.limbs == b.limbs

 func isZero*(a: BigInt): CTBool[Word] =
  ## Returns true if a big int is equal to zero
-  a.view.isZero
+  a.limbs.isZero

 func setZero*(a: var BigInt) =
  ## Set a BigInt to 0
-  a.setInternalBitLength()
-  zeroMem(a.limbs[0].unsafeAddr, a.limbs.len * sizeof(Word))
+  a.limbs.setZero()

 func setOne*(a: var BigInt) =
  ## Set a BigInt to 1
-  a.setInternalBitLength()
-  a.limbs[0] = Word(1)
-  when a.limbs.len > 1:
-    zeroMem(a.limbs[1].unsafeAddr, (a.limbs.len-1) * sizeof(Word))
+  a.limbs.setOne()

 func cadd*(a: var BigInt, b: BigInt, ctl: CTBool[Word]): CTBool[Word] =
  ## Constant-time in-place conditional addition
  ## The addition is only performed if ctl is "true"
  ## The result carry is always computed.
-  cadd(a.view, b.view, ctl)
+  (CTBool[Word]) cadd(a.limbs, b.limbs, ctl)

 func csub*(a: var BigInt, b: BigInt, ctl: CTBool[Word]): CTBool[Word] =
  ## Constant-time in-place conditional addition
  ## The addition is only performed if ctl is "true"
  ## The result carry is always computed.
-  csub(a.view, b.view, ctl)
+  (CTBool[Word]) csub(a.limbs, b.limbs, ctl)

 func cdouble*(a: var BigInt, ctl: CTBool[Word]): CTBool[Word] =
  ## Constant-time in-place conditional doubling
  ## The doubling is only performed if ctl is "true"
  ## The result carry is always computed.
-  cadd(a.view, a.view, ctl)
+  (CTBool[Word]) cadd(a.limbs, a.limbs, ctl)

 # ############################################################
 #
@ -143,38 +144,38 @@ func cdouble*(a: var BigInt, ctl: CTBool[Word]): CTBool[Word] =
 func add*(a: var BigInt, b: BigInt): CTBool[Word] =
  ## Constant-time in-place addition
  ## Returns the carry
-  add(a.view, b.view)
+  (CTBool[Word]) add(a.limbs, b.limbs)

 func sub*(a: var BigInt, b: BigInt): CTBool[Word] =
  ## Constant-time in-place substraction
  ## Returns the borrow
-  sub(a.view, b.view)
+  (CTBool[Word]) sub(a.limbs, b.limbs)

 func double*(a: var BigInt): CTBool[Word] =
  ## Constant-time in-place doubling
  ## Returns the carry
-  add(a.view, a.view)
+  (CTBool[Word]) add(a.limbs, a.limbs)

 func sum*(r: var BigInt, a, b: BigInt): CTBool[Word] =
  ## Sum `a` and `b` into `r`.
  ## `r` is initialized/overwritten
  ##
  ## Returns the carry
-  sum(r.view, a.view, b.view)
+  (CTBool[Word]) sum(r.limbs, a.limbs, b.limbs)

 func diff*(r: var BigInt, a, b: BigInt): CTBool[Word] =
  ## Substract `b` from `a` and store the result into `r`.
  ## `r` is initialized/overwritten
  ##
  ## Returns the borrow
-  diff(r.view, a.view, b.view)
+  (CTBool[Word]) diff(r.limbs, a.limbs, b.limbs)

 func double*(r: var BigInt, a: BigInt): CTBool[Word] =
  ## Double `a` into `r`.
  ## `r` is initialized/overwritten
  ##
  ## Returns the carry
-  sum(r.view, a.view, a.view)
+  (CTBool[Word]) sum(r.limbs, a.limbs, a.limbs)

 # ############################################################
 #
@ -182,9 +183,13 @@ func double*(r: var BigInt, a: BigInt): CTBool[Word] =
 #
 # ############################################################

-# Use "csub", which unfortunately requires the first operand to be mutable.
-# for example for a <= b, we now that if a-b borrows then b > a and so a<=b is false
-# This can be tested with "not csub(a, b, CtFalse)"
+func `<`*(a, b: BigInt): CTBool[Word] =
+  ## Returns true if a < b
+  a.limbs < b.limbs
+
+func `<=`*(a, b: BigInt): CTBool[Word] =
+  ## Returns true if a <= b
+  a.limbs <= b.limbs

 # ############################################################
 #
@ -202,9 +207,15 @@ func reduce*[aBits, mBits](r: var BigInt[mBits], a: BigInt[aBits], M: BigInt[mBi
  # Note: for all cryptographic intents and purposes the modulus is known at compile-time
  # but we don't want to inline it as it would increase codesize, better have Nim
  # pass a pointer+length to a fixed session of the BSS.
-  reduce(r.view, a.view, M.view)
+  reduce(r.limbs, a.limbs, aBits, M.limbs, mBits)

-func montyResidue*(mres: var BigInt, a, N, r2modN: BigInt, negInvModWord: static BaseType) =
+# ############################################################
+#
+#                 Montgomery Arithmetic
+#
+# ############################################################
+
+func montyResidue*(mres: var BigInt, a, N, r2modM: BigInt, m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) =
  ## Convert a BigInt from its natural representation
  ## to the Montgomery n-residue form
  ##
@ -213,9 +224,15 @@ func montyResidue*(mres: var BigInt, a, N, r2modN: BigInt, negInvModWord: static
  ## Caller must take care of properly switching between
  ## the natural and montgomery domain.
  ## Nesting Montgomery form is possible by applying this function twice.
-  montyResidue(mres.view, a.view, N.view, r2modN.view, Word(negInvModWord))
+  ##
+  ## The Montgomery Magic Constants:
+  ## - `m0ninv` is µ = -1/N (mod M)
+  ## - `r2modM` is R² (mod M)
+  ## with W = M.len
+  ## and R = (2^WordBitSize)^W
+  montyResidue(mres.limbs, a.limbs, N.limbs, r2modM.limbs, m0ninv, canUseNoCarryMontyMul)

-func redc*[mBits](r: var BigInt[mBits], a, N: BigInt[mBits], negInvModWord: static BaseType) =
+func redc*[mBits](r: var BigInt[mBits], a, M: BigInt[mBits], m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) =
  ## Convert a BigInt from its Montgomery n-residue form
  ## to the natural representation
  ##
@ -227,31 +244,27 @@ func redc*[mBits](r: var BigInt[mBits], a, N: BigInt[mBits], negInvModWord: stat
    var one {.noInit.}: BigInt[mBits]
    one.setOne()
    one
-  redc(r.view, a.view, one.view, N.view, Word(negInvModWord))
+  redc(r.limbs, a.limbs, one.limbs, M.limbs, m0ninv, canUseNoCarryMontyMul)

-# ############################################################
-#
-#                 Montgomery Arithmetic
-#
-# ############################################################
-
-func montyMul*(r: var BigInt, a, b, M: BigInt, negInvModWord: static BaseType) =
+func montyMul*(r: var BigInt, a, b, M: BigInt, negInvModWord: static BaseType, canUseNoCarryMontyMul: static bool) =
  ## Compute r <- a*b (mod M) in the Montgomery domain
  ##
  ## This resets r to zero before processing. Use {.noInit.}
  ## to avoid duplicating with Nim zero-init policy
-  montyMul(r.view, a.view, b.view, M.view, Word(negInvModWord))
+  montyMul(r.limbs, a.limbs, b.limbs, M.limbs, negInvModWord, canUseNoCarryMontyMul)

-func montySquare*(r: var BigInt, a, M: BigInt, negInvModWord: static BaseType) =
+func montySquare*(r: var BigInt, a, M: BigInt, negInvModWord: static BaseType, canUseNoCarryMontyMul: static bool) =
  ## Compute r <- a^2 (mod M) in the Montgomery domain
  ##
  ## This resets r to zero before processing. Use {.noInit.}
  ## to avoid duplicating with Nim zero-init policy
-  montySquare(r.view, a.view, M.view, Word(negInvModWord))
+  montySquare(r.limbs, a.limbs, M.limbs, negInvModWord, canUseNoCarryMontyMul)

 func montyPow*[mBits, eBits: static int](
       a: var BigInt[mBits], exponent: BigInt[eBits],
-       M, one: BigInt[mBits], negInvModWord: static BaseType, windowSize: static int) =
+       M, one: BigInt[mBits], negInvModWord: static BaseType, windowSize: static int,
+       canUseNoCarryMontyMul: static bool
+      ) =
  ## Compute a <- a^exponent (mod M)
  ## ``a`` in the Montgomery domain
  ## ``exponent`` is any BigInt, in the canonical domain
@ -268,16 +281,14 @@ func montyPow*[mBits, eBits: static int](

  const scratchLen = if windowSize == 1: 2
                     else: (1 shl windowSize) + 1
-  var scratchSpace {.noInit.}: array[scratchLen, BigInt[mBits]]
-  var scratchPtrs {.noInit.}: array[scratchLen, BigIntViewMut]
-  for i in 0 ..< scratchLen:
-    scratchPtrs[i] = scratchSpace[i].view()
-
-  montyPow(a.view, expBE, M.view, one.view, Word(negInvModWord), scratchPtrs)
+  var scratchSpace {.noInit.}: array[scratchLen, Limbs[mBits.wordsRequired]]
+  montyPow(a.limbs, expBE, M.limbs, one.limbs, negInvModWord, scratchSpace, canUseNoCarryMontyMul)

 func montyPowUnsafeExponent*[mBits, eBits: static int](
       a: var BigInt[mBits], exponent: BigInt[eBits],
-       M, one: BigInt[mBits], negInvModWord: static BaseType, windowSize: static int) =
+       M, one: BigInt[mBits], negInvModWord: static BaseType, windowSize: static int,
+       canUseNoCarryMontyMul: static bool
+      ) =
  ## Compute a <- a^exponent (mod M)
  ## ``a`` in the Montgomery domain
  ## ``exponent`` is any BigInt, in the canonical domain
@ -298,16 +309,14 @@ func montyPowUnsafeExponent*[mBits, eBits: static int](

  const scratchLen = if windowSize == 1: 2
                     else: (1 shl windowSize) + 1
-  var scratchSpace {.noInit.}: array[scratchLen, BigInt[mBits]]
-  var scratchPtrs {.noInit.}: array[scratchLen, BigIntViewMut]
-  for i in 0 ..< scratchLen:
-    scratchPtrs[i] = scratchSpace[i].view()
-
-  montyPowUnsafeExponent(a.view, expBE, M.view, one.view, Word(negInvModWord), scratchPtrs)
+  var scratchSpace {.noInit.}: array[scratchLen, Limbs[mBits.wordsRequired]]
+  montyPowUnsafeExponent(a.limbs, expBE, M.limbs, one.limbs, negInvModWord, scratchSpace, canUseNoCarryMontyMul)

 func montyPowUnsafeExponent*[mBits: static int](
       a: var BigInt[mBits], exponent: openarray[byte],
-       M, one: BigInt[mBits], negInvModWord: static BaseType, windowSize: static int) =
+       M, one: BigInt[mBits], negInvModWord: static BaseType, windowSize: static int,
+       canUseNoCarryMontyMul: static bool
+      ) =
  ## Compute a <- a^exponent (mod M)
  ## ``a`` in the Montgomery domain
  ## ``exponent`` is a BigInt in canonical representation
@ -324,9 +333,5 @@ func montyPowUnsafeExponent*[mBits: static int](

  const scratchLen = if windowSize == 1: 2
                     else: (1 shl windowSize) + 1
-  var scratchSpace {.noInit.}: array[scratchLen, BigInt[mBits]]
-  var scratchPtrs {.noInit.}: array[scratchLen, BigIntViewMut]
-  for i in 0 ..< scratchLen:
-    scratchPtrs[i] = scratchSpace[i].view()
-
-  montyPowUnsafeExponent(a.view, exponent, M.view, one.view, Word(negInvModWord), scratchPtrs)
+  var scratchSpace {.noInit.}: array[scratchLen, Limbs[mBits.wordsRequired]]
+  montyPowUnsafeExponent(a.limbs, exponent, M.limbs, one.limbs, negInvModWord, scratchSpace, canUseNoCarryMontyMul)
--- a/constantine/arithmetic/bigints_raw.nim
+++ b/constantine/arithmetic/bigints_raw.nim
@ -1,830 +0,0 @@
-# Constantine
-# Copyright (c) 2018-2019    Status Research & Development GmbH
-# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
-# Licensed and distributed under either of
-#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
-#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
-# at your option. This file may not be copied, modified, or distributed except according to those terms.
-
-# ############################################################
-#
-#         BigInt Raw representation and operations
-#
-# ############################################################
-#
-# This file holds the raw operations done on big ints
-# The representation is optimized for:
-# - constant-time (not leaking secret data via side-channel)
-# - generated code size, datatype size and stack usage
-# - performance
-# in this order
-
-# ############################################################
-# Design
-
-# To avoid carry issues we don't use the
-# most significant bit of each machine word.
-# i.e. for a uint64 base we only use 63-bit.
-# More info: https://github.com/status-im/nim-constantine/wiki/Constant-time-arithmetics#guidelines
-# Especially:
-#    - https://bearssl.org/bigint.html
-#    - https://cryptojedi.org/peter/data/pairing-20131122.pdf
-#    - http://docs.milagro.io/en/amcl/milagro-crypto-library-white-paper.html
-#
-# Note that this might also be beneficial in terms of performance.
-# Due to opcode latency, on Nehalem ADC is 6x times slower than ADD
-# if it has dependencies (i.e the ADC depends on a previous ADC result)
-#
-# Control flow should only depends on the static maximum number of bits
-# This number is defined per Finite Field/Prime/Elliptic Curve
-#
-# We internally order the limbs in little-endian
-# So the least significant limb is limb[0]
-# This is independent from the base type endianness.
-#
-# Constantine uses Nim generic integer to prevent mixing
-# BigInts of different bitlength at compile-time and
-# properly statically size the BigInt buffers.
-#
-# To avoid code-bloat due to monomorphization (i.e. duplicating code per announced bitlength)
-# actual computation is deferred to type-erased routines.
-
-import
-  ../primitives/constant_time,
-  ../primitives/extended_precision,
-  ../config/common
-from typetraits import distinctBase
-
-# ############################################################
-#
-#                BigInts type-erased API
-#
-# ############################################################
-
-# The "checked" API is exported as a building blocks
-# with enforced compile-time checking of BigInt bitsize
-# and memory ownership.
-#
-# The "raw" compute API uses views to avoid code duplication
-# due to generic/static monomorphization.
-#
-# The "checked" API is a thin wrapper above the "raw" API to get the best of both world:
-# - small code footprint
-# - compiler enforced checks: types, bitsizes
-# - compiler enforced memory: stack allocation and buffer ownership
-
-type
-  BigIntView* = ptr object
-    ## Type-erased fixed-precision big integer
-    ##
-    ## This type mirrors the BigInt type and is used
-    ## for the low-level computation API
-    ## This design
-    ## - avoids code bloat due to generic monomorphization
-    ##   otherwise each bigint routines would have an instantiation for
-    ##   each static `bits` parameter.
-    ## - while not forcing the caller to preallocate computation buffers
-    ##   for the high-level API and enforcing bitsizes
-    ## - avoids runtime bound-checks on the view
-    ##   for performance
-    ##   and to ensure exception-free code
-    ##   even when compiled in non "-d:danger" mode
-    ##
-    ## As with the BigInt type:
-    ## - "bitLength" is the internal bitlength of the integer
-    ##   This differs from the canonical bit-length as
-    ##   Constantine word-size is smaller than a machine word.
-    ##   This value should never be used as-is to prevent leaking secret data.
-    ##   Computing this value requires constant-time operations.
-    ##   Using this value requires converting it to the # of limbs in constant-time
-    ##
-    ## - "limbs" is an internal field that holds the internal representation
-    ##   of the big integer. Least-significant limb first. Within limbs words are native-endian.
-    ##
-    ## This internal representation can be changed
-    ## without notice and should not be used by external applications or libraries.
-    ##
-    ## Accesses should be done via BigIntViewConst / BigIntViewConst
-    ## to have the compiler check for mutability
-    bitLength: uint32
-    limbs: UncheckedArray[Word]
-
-  # "Indirection" to enforce pointer types deep immutability
-  BigIntViewConst* = distinct BigIntView
-    ## Immutable view into a BigInt
-  BigIntViewMut* = distinct BigIntView
-    ## Mutable view into a BigInt
-  BigIntViewAny* = BigIntViewConst or BigIntViewMut
-
-# No exceptions allowed
-{.push raises: [].}
-
-# ############################################################
-#
-#                 Deep Mutability safety
-#
-# ############################################################
-
-template `[]`*(v: BigIntViewConst, limbIdx: int): Word =
-  distinctBase(type v)(v).limbs[limbIdx]
-
-template `[]`*(v: BigIntViewMut, limbIdx: int): var Word =
-  distinctBase(type v)(v).limbs[limbIdx]
-
-template `[]=`*(v: BigIntViewMut, limbIdx: int, val: Word) =
-  distinctBase(type v)(v).limbs[limbIdx] = val
-
-template bitSizeof(v: BigIntViewAny): uint32 =
-  bind BigIntView
-  distinctBase(type v)(v).bitLength
-
-const divShiftor = log2(uint32(WordPhysBitSize))
-template numLimbs*(v: BigIntViewAny): int =
-  ## Compute the number of limbs from
-  ## the **internal** bitlength
-  (bitSizeof(v).int + WordPhysBitSize - 1) shr divShiftor
-
-template setBitLength(v: BigIntViewMut, internalBitLength: uint32) =
-  distinctBase(type v)(v).bitLength = internalBitLength
-
-# TODO: Check if repeated v.numLimbs calls are optimized away
-
-template `[]`*(v: BigIntViewConst, limbIdxFromEnd: BackwardsIndex): Word =
-  distinctBase(type v)(v).limbs[numLimbs(v).int - int limbIdxFromEnd]
-
-template `[]`*(v: BigIntViewMut, limbIdxFromEnd: BackwardsIndex): var Word =
-  distinctBase(type v)(v).limbs[numLimbs(v).int - int limbIdxFromEnd]
-
-template `[]=`*(v: BigIntViewMut, limbIdxFromEnd: BackwardsIndex, val: Word) =
-  distinctBase(type v)(v).limbs[numLimbs(v).int - int limbIdxFromEnd] = val
-
-# ############################################################
-#
-#           Checks and debug/test only primitives
-#
-# ############################################################
-
-template checkMatchingBitlengths(a, b: distinct BigIntViewAny) =
-  ## Check that bitlengths of bigints match
-  ## This is only checked
-  ## with "-d:debugConstantine" and when assertions are on.
-  debug:
-    assert distinctBase(type a)(a).bitLength ==
-      distinctBase(type b)(b).bitLength, "Internal Error: operands bitlength do not match"
-
-template checkValidModulus(m: BigIntViewConst) =
-  ## Check that the modulus is valid
-  ## The check is approximate, it only checks that
-  ## the most-significant words is non-zero instead of
-  ## checking that the last announced bit is 1.
-  ## This is only checked
-  ## with "-d:debugConstantine" and when assertions are on.
-  debug:
-    assert not isZero(m[^1]).bool, "Internal Error: the modulus must use all declared bits"
-
-template checkOddModulus(m: BigIntViewConst) =
-  ## CHeck that the modulus is odd
-  ## and valid for use in the Montgomery n-residue representation
-  debug:
-    assert bool(BaseType(m[0]) and 1), "Internal Error: the modulus must be odd to use the Montgomery representation."
-
-template checkWordShift(k: int) =
-  ## Checks that the shift is less than the word bit size
-  debug:
-    assert k <= WordBitSize, "Internal Error: the shift must be less than the word bit size"
-
-template checkPowScratchSpaceLen(len: int) =
-  ## Checks that there is a minimum of scratchspace to hold the temporaries
-  debug:
-    assert len >= 2, "Internal Error: the scratchspace for powmod should be equal or greater than 2"
-
-debug:
-  func `$`*(a: BigIntViewAny): string =
-    let len = a.numLimbs()
-    result = "["
-    for i in 0 ..< len - 1:
-      result.add $a[i]
-      result.add ", "
-    result.add $a[len-1]
-    result.add "] ("
-    result.add $a.bitSizeof
-    result.add " bits)"
-
-# ############################################################
-#
-#                    BigInt primitives
-#
-# ############################################################
-
-func `==`*(a, b: distinct BigIntViewAny): CTBool[Word] =
-  ## Returns true if 2 big ints are equal
-  ## Comparison is constant-time
-  checkMatchingBitlengths(a, b)
-  var accum: Word
-  for i in 0 ..< a.numLimbs():
-    accum = accum or (a[i] xor b[i])
-  result = accum.isZero
-
-func isZero*(a: BigIntViewAny): CTBool[Word] =
-  ## Returns true if a big int is equal to zero
-  var accum: Word
-  for i in 0 ..< a.numLimbs():
-    accum = accum or a[i]
-  result = accum.isZero()
-
-func setZero(a: BigIntViewMut) =
-  ## Set a BigInt to 0
-  ## It's bit size is unchanged
-  zeroMem(a[0].unsafeAddr, a.numLimbs() * sizeof(Word))
-
-func ccopy*(a: BigIntViewMut, b: BigIntViewAny, ctl: CTBool[Word]) =
-  ## Constant-time conditional copy
-  ## If ctl is true: b is copied into a
-  ## if ctl is false: b is not copied and a is untouched
-  ## Time and memory accesses are the same whether a copy occurs or not
-  checkMatchingBitlengths(a, b)
-  for i in 0 ..< a.numLimbs():
-    a[i] = ctl.mux(b[i], a[i])
-
-# The arithmetic primitives all accept a control input that indicates
-# if it is a placebo operation. It stills performs the
-# same memory accesses to be side-channel attack resistant.
-
-func cadd*(a: BigIntViewMut, b: BigIntViewAny, ctl: CTBool[Word]): CTBool[Word] =
-  ## Constant-time in-place conditional addition
-  ## The addition is only performed if ctl is "true"
-  ## The result carry is always computed.
-  ##
-  ## a and b MAY be the same buffer
-  ## a and b MUST have the same announced bitlength (i.e. `bits` static parameters)
-  checkMatchingBitlengths(a, b)
-
-  for i in 0 ..< a.numLimbs():
-    let new_a = a[i] + b[i] + Word(result)
-    result = new_a.isMsbSet()
-    a[i] = ctl.mux(new_a.mask(), a[i])
-
-func csub*(a: BigIntViewMut, b: BigIntViewAny, ctl: CTBool[Word]): CTBool[Word] =
-  ## Constant-time in-place conditional substraction
-  ## The substraction is only performed if ctl is "true"
-  ## The result carry is always computed.
-  ##
-  ## a and b MAY be the same buffer
-  ## a and b MUST have the same announced bitlength (i.e. `bits` static parameters)
-  checkMatchingBitlengths(a, b)
-
-  for i in 0 ..< a.numLimbs():
-    let new_a = a[i] - b[i] - Word(result)
-    result = new_a.isMsbSet()
-    a[i] = ctl.mux(new_a.mask(), a[i])
-
-func dec*(a: BigIntViewMut, w: Word): CTBool[Word] =
-  ## Decrement a big int by a small word
-  ## Returns the result carry
-
-  a[0] -= w
-  result = a[0].isMsbSet()
-  a[0] = a[0].mask()
-  for i in 1 ..< a.numLimbs():
-    a[i] -= Word(result)
-    result = a[i].isMsbSet()
-    a[i] = a[i].mask()
-
-func shiftRight*(a: BigIntViewMut, k: int) =
-  ## Shift right by k.
-  ##
-  ## k MUST be less than the base word size (2^31 or 2^63)
-  # We don't reuse shr for this in-place operation
-  # Do we need to return the shifted out part?
-  #
-  # Note: for speed, loading a[i] and a[i+1]
-  #       instead of a[i-1] and a[i]
-  #       is probably easier to parallelize for the compiler
-  #       (antidependence WAR vs loop-carried dependence RAW)
-  checkWordShift(k)
-
-  let len = a.numLimbs()
-  for i in 0 ..< len-1:
-    a[i] = (a[i] shr k) or mask(a[i+1] shl (WordBitSize - k))
-  a[len-1] = a[len-1] shr k
-
-# ############################################################
-#
-#          BigInt Primitives Optimized for speed
-#
-# ############################################################
-#
-# This section implements primitives that improve the speed
-# of common use-cases at the expense of a slight increase in code-size.
-# Where code size is a concern, the high-level API should use
-# copy and/or the conditional operations.
-
-func add*(a: BigIntViewMut, b: BigIntViewAny): CTBool[Word] =
-  ## Constant-time in-place addition
-  ## Returns the carry
-  ##
-  ## a and b MAY be the same buffer
-  ## a and b MUST have the same announced bitlength (i.e. `bits` static parameters)
-  checkMatchingBitlengths(a, b)
-
-  for i in 0 ..< a.numLimbs():
-    a[i] = a[i] + b[i] + Word(result)
-    result = a[i].isMsbSet()
-    a[i] = a[i].mask()
-
-func sub*(a: BigIntViewMut, b: BigIntViewAny): CTBool[Word] =
-  ## Constant-time in-place substraction
-  ## Returns the borrow
-  ##
-  ## a and b MAY be the same buffer
-  ## a and b MUST have the same announced bitlength (i.e. `bits` static parameters)
-  checkMatchingBitlengths(a, b)
-
-  for i in 0 ..< a.numLimbs():
-    a[i] = a[i] - b[i] - Word(result)
-    result = a[i].isMsbSet()
-    a[i] = a[i].mask()
-
-func sum*(r: BigIntViewMut, a, b: distinct BigIntViewAny): CTBool[Word] =
-  ## Sum `a` and `b` into `r`.
-  ## `r` is initialized/overwritten
-  ##
-  ## Returns the carry
-  checkMatchingBitlengths(a, b)
-
-  r.setBitLength(bitSizeof(a))
-
-  for i in 0 ..< a.numLimbs():
-    r[i] = a[i] + b[i] + Word(result)
-    result = r[i].isMsbSet()
-    r[i] = r[i].mask()
-
-func diff*(r: BigIntViewMut, a, b: distinct BigIntViewAny): CTBool[Word] =
-  ## Substract `b` from `a` and store the result into `r`.
-  ## `r` is initialized/overwritten
-  ##
-  ## Returns the borrow
-  checkMatchingBitlengths(a, b)
-
-  r.setBitLength(bitSizeof(a))
-
-  for i in 0 ..< a.numLimbs():
-    r[i] = a[i] - b[i] - Word(result)
-    result = r[i].isMsbSet()
-    r[i] = r[i].mask()
-
-# ############################################################
-#
-#                   Modular BigInt
-#
-# ############################################################
-
-func shlAddMod(a: BigIntViewMut, c: Word, M: BigIntViewConst) =
-  ## Fused modular left-shift + add
-  ## Shift input `a` by a word and add `c` modulo `M`
-  ##
-  ## With a word W = 2^WordBitSize and a modulus M
-  ## Does a <- a * W + c (mod M)
-  ##
-  ## The modulus `M` MUST announced most-significant bit must be set.
-  checkValidModulus(M)
-
-  let aLen = a.numLimbs()
-  let mBits = bitSizeof(M)
-
-  if mBits <= WordBitSize:
-    # If M fits in a single limb
-    var q: Word
-
-    # (hi, lo) = a * 2^63 + c
-    let hi = a[0] shr 1                   # 64 - 63 = 1
-    let lo = (a[0] shl WordBitSize) or c  # Assumes most-significant bit in c is not set
-    unsafeDiv2n1n(q, a[0], hi, lo, M[0])  # (hi, lo) mod M
-    return
-
-  else:
-    ## Multiple limbs
-    let hi = a[^1]                                          # Save the high word to detect carries
-    let R = mBits and WordBitSize                       # R = mBits mod 64
-
-    var a0, a1, m0: Word
-    if R == 0:                                              # If the number of mBits is a multiple of 64
-      a0 = a[^1]                                        #
-      moveMem(a[1].addr, a[0].addr, (aLen-1) * Word.sizeof) # we can just shift words
-      a[0] = c                                              # and replace the first one by c
-      a1 = a[^1]
-      m0 = M[^1]
-    else:                                                   # Else: need to deal with partial word shifts at the edge.
-      a0 = mask((a[^1] shl (WordBitSize-R)) or (a[^2] shr R))
-      moveMem(a[1].addr, a[0].addr, (aLen-1) * Word.sizeof)
-      a[0] = c
-      a1 = mask((a[^1] shl (WordBitSize-R)) or (a[^2] shr R))
-      m0 = mask((M[^1] shl (WordBitSize-R)) or (M[^2] shr R))
-
-    # m0 has its high bit set. (a0, a1)/p0 fits in a limb.
-    # Get a quotient q, at most we will be 2 iterations off
-    # from the true quotient
-
-    let
-      a_hi = a0 shr 1                              # 64 - 63 = 1
-      a_lo = (a0 shl WordBitSize) or a1
-    var q, r: Word
-    unsafeDiv2n1n(q, r, a_hi, a_lo, m0)            # Estimate quotient
-    q = mux(                                       # If n_hi == divisor
-          a0 == m0, MaxWord,                       # Quotient == MaxWord (0b0111...1111)
-          mux(
-            q.isZero, Zero,                        # elif q == 0, true quotient = 0
-            q - One                                # else instead of being of by 0, 1 or 2
-          )                                        # we returning q-1 to be off by -1, 0 or 1
-        )
-
-    # Now substract a*2^63 - q*p
-    var carry = Zero
-    var over_p = CtTrue                            # Track if quotient greater than the modulus
-
-    for i in 0 ..< M.numLimbs():
-      var qp_lo: Word
-
-      block: # q*p
-        # q * p + carry (doubleword) carry from previous limb
-        unsafeFMA(carry, qp_lo, q, M[i], carry)
-
-      block: # a*2^63 - q*p
-        a[i] -= qp_lo
-        carry += Word(a[i].isMsbSet)               # Adjust if borrow
-        a[i] = a[i].mask()                        # Normalize to u63
-
-      over_p = mux(
-                a[i] == M[i], over_p,
-                a[i] > M[i]
-              )
-
-    # Fix quotient, the true quotient is either q-1, q or q+1
-    #
-    # if carry < q or carry == q and over_p we must do "a -= p"
-    # if carry > hi (negative result) we must do "a += p"
-
-    let neg = carry > hi
-    let tooBig = not neg and (over_p or (carry < hi))
-
-    discard a.cadd(M, ctl = neg)
-    discard a.csub(M, ctl = tooBig)
-    return
-
-func reduce*(r: BigIntViewMut, a: BigIntViewAny, M: BigIntViewConst) =
-  ## Reduce `a` modulo `M` and store the result in `r`
-  ##
-  ## The modulus `M` MUST announced most-significant bit must be set.
-  ## The result `r` buffer size MUST be at least the size of `M` buffer
-  ##
-  ## CT: Depends only on the bitlength of `a` and the modulus `M`
-
-  # Note: for all cryptographic intents and purposes the modulus is known at compile-time
-  # but we don't want to inline it as it would increase codesize, better have Nim
-  # pass a pointer+length to a fixed session of the BSS.
-  checkValidModulus(M)
-
-  let aBits = bitSizeof(a)
-  let mBits = bitSizeof(M)
-  let aLen = a.numLimbs()
-
-  r.setBitLength(bitSizeof(M))
-
-  if aBits < mBits:
-    # if a uses less bits than the modulus,
-    # it is guaranteed < modulus.
-    # This relies on the precondition that the modulus uses all declared bits
-    copyMem(r[0].addr, a[0].unsafeAddr, aLen * sizeof(Word))
-    for i in aLen ..< r.numLimbs():
-      r[i] = Zero
-  else:
-    # a length i at least equal to the modulus.
-    # we can copy modulus.limbs-1 words
-    # and modular shift-left-add the rest
-    let mLen = M.numLimbs()
-    let aOffset = aLen - mLen
-    copyMem(r[0].addr, a[aOffset+1].unsafeAddr, (mLen-1) * sizeof(Word))
-    r[^1] = Zero
-    # Now shift-left the copied words while adding the new word modulo M
-    for i in countdown(aOffset, 0):
-      r.shlAddMod(a[i], M)
-
-# ############################################################
-#
-#                 Montgomery Arithmetic
-#
-# ############################################################
-
-template wordMul(a, b: Word): Word =
-  mask(a * b)
-
-func montyMul*(
-       r: BigIntViewMut, a, b: distinct BigIntViewAny,
-       M: BigIntViewConst, negInvModWord: Word) =
-  ## Compute r <- a*b (mod M) in the Montgomery domain
-  ## `negInvModWord` = -1/M (mod Word). Our words are 2^31 or 2^63
-  ##
-  ## This resets r to zero before processing. Use {.noInit.}
-  ## to avoid duplicating with Nim zero-init policy
-  ## The result `r` buffer size MUST be at least the size of `M` buffer
-  ##
-  ##
-  ## Assuming 63-bit wors, the magic constant should be:
-  ##
-  ## - µ ≡ -1/M[0] (mod 2^63) for a general multiplication
-  ##   This can be precomputed with `negInvModWord`
-  ## - 1 for conversion from Montgomery to canonical representation
-  ##   The library implements a faster `redc` primitive for that use-case
-  ## - R^2 (mod M) for conversion from canonical to Montgomery representation
-  ##
-  # i.e. c'R <- a'R b'R * R^-1 (mod M) in the natural domain
-  # as in the Montgomery domain all numbers are scaled by R
-
-  checkValidModulus(M)
-  checkOddModulus(M)
-  checkMatchingBitlengths(a, M)
-  checkMatchingBitlengths(b, M)
-
-  let nLen = M.numLimbs()
-  r.setBitLength(bitSizeof(M))
-  setZero(r)
-
-  var r_hi = Zero   # represents the high word that is used in intermediate computation before reduction mod M
-  for i in 0 ..< nLen:
-
-    let zi = (r[0] + wordMul(a[i], b[0])).wordMul(negInvModWord)
-    var carry: Word
-    # (carry, _) <- a[i] * b[0] + zi * M[0] + r[0]
-    unsafeFMA2_hi(carry, a[i], b[0], zi, M[0], r[0])
-
-    for j in 1 ..< nLen:
-      # (carry, r[j-1]) <- a[i] * b[j] + zi * M[j] + r[j] + carry
-      unsafeFMA2(carry, r[j-1], a[i], b[j], zi, M[j], r[j], carry)
-
-    r_hi += carry
-    r[^1] = r_hi.mask()
-    r_hi = r_hi shr WordBitSize
-
-  # If the extra word is not zero or if r-M does not borrow (i.e. r > M)
-  # Then substract M
-  discard r.csub(M, r_hi.isNonZero() or not r.csub(M, CtFalse))
-
-func redc*(r: BigIntViewMut, a: BigIntViewAny, one, N: BigIntViewConst, negInvModWord: Word) {.inline.} =
-  ## Transform a bigint ``a`` from it's Montgomery N-residue representation (mod N)
-  ## to the regular natural representation (mod N)
-  ##
-  ## with W = N.numLimbs()
-  ## and R = (2^WordBitSize)^W
-  ##
-  ## Does "a * R^-1 (mod N)"
-  ##
-  ## This is called a Montgomery Reduction
-  ## The Montgomery Magic Constant is µ = -1/N mod N
-  ## is used internally and can be precomputed with negInvModWord(Curve)
-  # References:
-  #   - https://eprint.iacr.org/2017/1057.pdf (Montgomery)
-  #     page: Radix-r interleaved multiplication algorithm
-  #   - https://en.wikipedia.org/wiki/Montgomery_modular_multiplication#Montgomery_arithmetic_on_multiprecision_(variable-radix)_integers
-  #   - http://langevin.univ-tln.fr/cours/MLC/extra/montgomery.pdf
-  #     Montgomery original paper
-  #
-  montyMul(r, a, one, N, negInvModWord)
-
-func montyResidue*(
-       r: BigIntViewMut, a: BigIntViewAny,
-       N, r2modN: BigIntViewConst, negInvModWord: Word) {.inline.} =
-  ## Transform a bigint ``a`` from it's natural representation (mod N)
-  ## to a the Montgomery n-residue representation
-  ##
-  ## Montgomery-Multiplication - based
-  ##
-  ## with W = N.numLimbs()
-  ## and R = (2^WordBitSize)^W
-  ##
-  ## Does "a * R (mod N)"
-  ##
-  ## `a`: The source BigInt in the natural representation. `a` in [0, N) range
-  ## `N`: The field modulus. N must be odd.
-  ## `r2modN`: 2^WordBitSize mod `N`. Can be precomputed with `r2mod` function
-  ##
-  ## Important: `r` is overwritten
-  ## The result `r` buffer size MUST be at least the size of `M` buffer
-  # Reference: https://eprint.iacr.org/2017/1057.pdf
-  montyMul(r, a, r2ModN, N, negInvModWord)
-
-func montySquare*(
-       r: BigIntViewMut, a: BigIntViewAny,
-       M: BigIntViewConst, negInvModWord: Word) {.inline.} =
-  ## Compute r <- a^2 (mod M) in the Montgomery domain
-  ## `negInvModWord` = -1/M (mod Word). Our words are 2^31 or 2^63
-  montyMul(r, a, a, M, negInvModWord)
-
-# Montgomery Modular Exponentiation
-# ------------------------------------------
-# We use fixed-window based exponentiation
-# that is constant-time: i.e. the number of multiplications
-# does not depend on the number of set bits in the exponents
-# those are always done and conditionally copied.
-#
-# The exponent MUST NOT be private data (until audited otherwise)
-# - Power attack on RSA, https://www.di.ens.fr/~fouque/pub/ches06.pdf
-# - Flush-and-reload on Sliding window exponentiation: https://tutcris.tut.fi/portal/files/8966761/p1639_pereida_garcia.pdf
-# - Sliding right into disaster, https://eprint.iacr.org/2017/627.pdf
-# - Fixed window leak: https://www.scirp.org/pdf/JCC_2019102810331929.pdf
-# - Constructing sliding-windows leak, https://easychair.org/publications/open/fBNC
-#
-# For pairing curves, this is the case since exponentiation is only
-# used for inversion via the Little Fermat theorem.
-# For RSA, some exponentiations uses private exponents.
-#
-# Note:
-# - Implementation closely follows Thomas Pornin's BearSSL
-# - Apache Milagro Crypto has an alternative implementation
-#   that is more straightforward however:
-#   - the exponent hamming weight is used as loop bounds
-#   - the base^k is stored at each index of a temp table of size k
-#   - the base^k to use is indexed by the hamming weight
-#     of the exponent, leaking this to cache attacks
-#   - in contrast BearSSL touches the whole table to
-#     hide the actual selection
-
-func getWindowLen(bufLen: int): uint =
-  ## Compute the maximum window size that fits in the scratchspace buffer
-  checkPowScratchSpaceLen(bufLen)
-  result = 5
-  while (1 shl result) + 1 > bufLen:
-    dec result
-
-func montyPowPrologue(
-       a: BigIntViewMut, M, one: BigIntViewConst,
-       negInvModWord: Word,
-       scratchspace: openarray[BigIntViewMut]
-     ): tuple[window: uint, bigIntSize: int] {.inline.}=
-  # Due to the high number of parameters,
-  # forcing this inline actually reduces the code size
-
-  result.window = scratchspace.len.getWindowLen()
-  result.bigIntSize = a.numLimbs() * sizeof(Word) +
-                      offsetof(BigIntView, limbs) +
-                      sizeof(BigIntView.bitLength)
-
-  # Precompute window content, special case for window = 1
-  # (i.e scratchspace has only space for 2 temporaries)
-  # The content scratchspace[2+k] is set at a^k
-  # with scratchspace[0] untouched
-  if result.window == 1:
-    copyMem(pointer scratchspace[1], pointer a, result.bigIntSize)
-  else:
-    scratchspace[1].setBitLength(bitSizeof(M))
-    copyMem(pointer scratchspace[2], pointer a, result.bigIntSize)
-    for k in 2 ..< 1 shl result.window:
-      scratchspace[k+1].montyMul(scratchspace[k], a, M, negInvModWord)
-
-  # Set a to one
-  copyMem(pointer a, pointer one, result.bigIntSize)
-
-func montyPowSquarings(
-        a: BigIntViewMut,
-        exponent: openarray[byte],
-        M: BigIntViewConst,
-        negInvModWord: Word,
-        tmp: BigIntViewMut,
-        window: uint,
-        bigIntSize: int,
-        acc, acc_len: var uint,
-        e: var int,
-      ): tuple[k, bits: uint] {.inline.}=
-  ## Squaring step of exponentiation by squaring
-  ## Get the next k bits in range [1, window)
-  ## Square k times
-  ## Returns the number of squarings done and the corresponding bits
-  ##
-  ## Updates iteration variables and accumulators
-  # Due to the high number of parameters,
-  # forcing this inline actually reduces the code size
-
-  # Get the next bits
-  var k = window
-  if acc_len < window:
-    if e < exponent.len:
-      acc = (acc shl 8) or exponent[e].uint
-      inc e
-      acc_len += 8
-    else: # Drained all exponent bits
-      k = acc_len
-
-  let bits = (acc shr (acc_len - k)) and ((1'u32 shl k) - 1)
-  acc_len -= k
-
-  # We have k bits and can do k squaring
-  for i in 0 ..< k:
-    tmp.montySquare(a, M, negInvModWord)
-    copyMem(pointer a, pointer tmp, bigIntSize)
-
-  return (k, bits)
-
-func montyPow*(
-       a: BigIntViewMut,
-       exponent: openarray[byte],
-       M, one: BigIntViewConst,
-       negInvModWord: Word,
-       scratchspace: openarray[BigIntViewMut]
-      ) =
-  ## Modular exponentiation r = a^exponent mod M
-  ## in the Montgomery domain
-  ##
-  ## This uses fixed-window optimization if possible
-  ##
-  ## - On input ``a`` is the base, on ``output`` a = a^exponent (mod M)
-  ##   ``a`` is in the Montgomery domain
-  ## - ``exponent`` is the exponent in big-endian canonical format (octet-string)
-  ##   Use ``exportRawUint`` for conversion
-  ## - ``M`` is the modulus
-  ## - ``one`` is 1 (mod M) in montgomery representation
-  ## - ``negInvModWord`` is the montgomery magic constant "-1/M[0] mod 2^WordBitSize"
-  ## - ``scratchspace`` with k the window bitsize of size up to 5
-  ##   This is a buffer that can hold between 2^k + 1 big-ints
-  ##   A window of of 1-bit (no window optimization) requires only 2 big-ints
-  ##
-  ## Note that the best window size require benchmarking and is a tradeoff between
-  ## - performance
-  ## - stack usage
-  ## - precomputation
-  ##
-  ## For example BLS12-381 window size of 5 is 30% faster than no window,
-  ## but windows of size 2, 3, 4 bring no performance benefit, only increased stack space.
-  ## A window of size 5 requires (2^5 + 1)*(381 + 7)/8 = 33 * 48 bytes = 1584 bytes
-  ## of scratchspace (on the stack).
-
-  let (window, bigIntSize) = montyPowPrologue(a, M, one, negInvModWord, scratchspace)
-
-  # We process bits with from most to least significant.
-  # At each loop iteration with have acc_len bits in acc.
-  # To maintain constant-time the number of iterations
-  # or the number of operations or memory accesses should be the same
-  # regardless of acc & acc_len
-  var
-    acc, acc_len: uint
-    e = 0
-  while acc_len > 0 or e < exponent.len:
-    let (k, bits) = montyPowSquarings(
-      a, exponent, M, negInvModWord,
-      scratchspace[0], window, bigIntSize,
-      acc, acc_len, e
-    )
-
-    # Window lookup: we set scratchspace[1] to the lookup value.
-    # If the window length is 1, then it's already set.
-    if window > 1:
-      # otherwise we need a constant-time lookup
-      # in particular we need the same memory accesses, we can't
-      # just index the openarray with the bits to avoid cache attacks.
-      for i in 1 ..< 1 shl k:
-        let ctl = Word(i) == Word(bits)
-        scratchspace[1].ccopy(scratchspace[1+i], ctl)
-
-    # Multiply with the looked-up value
-    # we keep the product only if the exponent bits are not all zero
-    scratchspace[0].montyMul(a, scratchspace[1], M, negInvModWord)
-    a.ccopy(scratchspace[0], Word(bits) != Zero)
-
-func montyPowUnsafeExponent*(
-       a: BigIntViewMut,
-       exponent: openarray[byte],
-       M, one: BigIntViewConst,
-       negInvModWord: Word,
-       scratchspace: openarray[BigIntViewMut]
-      ) =
-  ## Modular exponentiation r = a^exponent mod M
-  ## in the Montgomery domain
-  ##
-  ## Warning ⚠️ :
-  ## This is an optimization for public exponent
-  ## Otherwise bits of the exponent can be retrieved with:
-  ## - memory access analysis
-  ## - power analysis
-  ## - timing analysis
-
-  # TODO: scratchspace[1] is unused when window > 1
-
-  let (window, bigIntSize) = montyPowPrologue(
-         a, M, one, negInvModWord, scratchspace)
-
-  var
-    acc, acc_len: uint
-    e = 0
-  while acc_len > 0 or e < exponent.len:
-    let (k, bits) = montyPowSquarings(
-      a, exponent, M, negInvModWord,
-      scratchspace[0], window, bigIntSize,
-      acc, acc_len, e
-    )
-
-    ## Warning ⚠️: Exposes the exponent bits
-    if bits != 0:
-      if window > 1:
-        scratchspace[0].montyMul(a, scratchspace[1+bits], M, negInvModWord)
-      else:
-        # scratchspace[1] holds the original `a`
-        scratchspace[0].montyMul(a, scratchspace[1], M, negInvModWord)
-      copyMem(pointer a, pointer scratchspace[0], bigIntSize)
--- a/constantine/arithmetic/finite_fields.nim
+++ b/constantine/arithmetic/finite_fields.nim
@ -25,9 +25,9 @@
 #     which requires a prime

 import
-  ../primitives/constant_time,
+  ../primitives,
  ../config/[common, curves],
-  ./bigints_checked
+  ./bigints, ./montgomery

 # type
 #   `Fp`*[C: static Curve] = object
@ -57,16 +57,16 @@ debug:

 func fromBig*[C: static Curve](T: type Fp[C], src: BigInt): Fp[C] {.noInit.} =
  ## Convert a BigInt to its Montgomery form
-  result.mres.montyResidue(src, C.Mod.mres, C.getR2modP(), C.getNegInvModWord())
+  result.mres.montyResidue(src, C.Mod.mres, C.getR2modP(), C.getNegInvModWord(), C.canUseNoCarryMontyMul())

 func fromBig*[C: static Curve](dst: var Fp[C], src: BigInt) {.noInit.} =
  ## Convert a BigInt to its Montgomery form
-  dst.mres.montyResidue(src, C.Mod.mres, C.getR2modP(), C.getNegInvModWord())
+  dst.mres.montyResidue(src, C.Mod.mres, C.getR2modP(), C.getNegInvModWord(), C.canUseNoCarryMontyMul())

 func toBig*(src: Fp): auto {.noInit.} =
  ## Convert a finite-field element to a BigInt in natral representation
  var r {.noInit.}: typeof(src.mres)
-  r.redc(src.mres, Fp.C.Mod.mres, Fp.C.getNegInvModWord())
+  r.redc(src.mres, Fp.C.Mod.mres, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontyMul())
  return r

 # ############################################################
@ -83,6 +83,12 @@ func toBig*(src: Fp): auto {.noInit.} =
 #       - Golden Primes (φ^2 - φ - 1 with φ = 2^k for example Ed448-Goldilocks: 2^448 - 2^224 - 1)
 #       exist and can be implemented with compile-time specialization.

+# Note: for `+=`, double, sum
+#       not(a.mres < Fp.C.Mod.mres) is unnecessary if the prime has the form
+#       (2^64)^w - 1 (if using uint64 words).
+# In practice I'm not aware of such prime being using in elliptic curves.
+# 2^127 - 1 and 2^521 - 1 are used but 127 and 521 are not multiple of 32/64
+
 func `==`*(a, b: Fp): CTBool[Word] =
  ## Constant-time equality check
  a.mres == b.mres
@ -101,7 +107,7 @@ func setOne*(a: var Fp) =
 func `+=`*(a: var Fp, b: Fp) =
  ## In-place addition modulo p
  var overflowed = add(a.mres, b.mres)
-  overflowed = overflowed or not csub(a.mres, Fp.C.Mod.mres, CtFalse) # a >= P
+  overflowed = overflowed or not(a.mres < Fp.C.Mod.mres)
  discard csub(a.mres, Fp.C.Mod.mres, overflowed)

 func `-=`*(a: var Fp, b: Fp) =
@ -112,14 +118,14 @@ func `-=`*(a: var Fp, b: Fp) =
 func double*(a: var Fp) =
  ## Double ``a`` modulo p
  var overflowed = double(a.mres)
-  overflowed = overflowed or not csub(a.mres, Fp.C.Mod.mres, CtFalse) # a >= P
+  overflowed = overflowed or not(a.mres < Fp.C.Mod.mres)
  discard csub(a.mres, Fp.C.Mod.mres, overflowed)

 func sum*(r: var Fp, a, b: Fp) =
  ## Sum ``a`` and ``b`` into ``r`` module p
  ## r is initialized/overwritten
  var overflowed = r.mres.sum(a.mres, b.mres)
-  overflowed = overflowed or not csub(r.mres, Fp.C.Mod.mres, CtFalse) # r >= P
+  overflowed = overflowed or not(r.mres < Fp.C.Mod.mres)
  discard csub(r.mres, Fp.C.Mod.mres, overflowed)

 func diff*(r: var Fp, a, b: Fp) =
@ -132,17 +138,17 @@ func double*(r: var Fp, a: Fp) =
  ## Double ``a`` into ``r``
  ## `r` is initialized/overwritten
  var overflowed = r.mres.double(a.mres)
-  overflowed = overflowed or not csub(r.mres, Fp.C.Mod.mres, CtFalse) # r >= P
+  overflowed = overflowed or not(r.mres < Fp.C.Mod.mres)
  discard csub(r.mres, Fp.C.Mod.mres, overflowed)

 func prod*(r: var Fp, a, b: Fp) =
  ## Store the product of ``a`` by ``b`` modulo p into ``r``
  ## ``r`` is initialized / overwritten
-  r.mres.montyMul(a.mres, b.mres, Fp.C.Mod.mres, Fp.C.getNegInvModWord())
+  r.mres.montyMul(a.mres, b.mres, Fp.C.Mod.mres, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontyMul())

 func square*(r: var Fp, a: Fp) =
  ## Squaring modulo p
-  r.mres.montySquare(a.mres, Fp.C.Mod.mres, Fp.C.getNegInvModWord())
+  r.mres.montySquare(a.mres, Fp.C.Mod.mres, Fp.C.getNegInvModWord(), Fp.C.canUseNoCarryMontyMul())

 func neg*(r: var Fp, a: Fp) =
  ## Negate modulo p
@ -164,7 +170,8 @@ func pow*(a: var Fp, exponent: BigInt) =
  a.mres.montyPow(
    exponent,
    Fp.C.Mod.mres, Fp.C.getMontyOne(),
-    Fp.C.getNegInvModWord(), windowSize
+    Fp.C.getNegInvModWord(), windowSize,
+    Fp.C.canUseNoCarryMontyMul()
  )

 func powUnsafeExponent*(a: var Fp, exponent: BigInt) =
@ -182,7 +189,8 @@ func powUnsafeExponent*(a: var Fp, exponent: BigInt) =
  a.mres.montyPowUnsafeExponent(
    exponent,
    Fp.C.Mod.mres, Fp.C.getMontyOne(),
-    Fp.C.getNegInvModWord(), windowSize
+    Fp.C.getNegInvModWord(), windowSize,
+    Fp.C.canUseNoCarryMontyMul()
  )

 func inv*(a: var Fp) =
@ -193,7 +201,8 @@ func inv*(a: var Fp) =
  a.mres.montyPowUnsafeExponent(
    Fp.C.getInvModExponent(),
    Fp.C.Mod.mres, Fp.C.getMontyOne(),
-    Fp.C.getNegInvModWord(), windowSize
+    Fp.C.getNegInvModWord(), windowSize,
+    Fp.C.canUseNoCarryMontyMul()
  )

 # ############################################################
--- a/constantine/arithmetic/limbs.nim
+++ b/constantine/arithmetic/limbs.nim
@ -0,0 +1,445 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import
+  ../config/common,
+  ../primitives
+
+# ############################################################
+#
+#         Limbs raw representation and operations
+#
+# ############################################################
+#
+# This file holds the raw operations done on big ints
+# The representation is optimized for:
+# - constant-time (not leaking secret data via side-channel)
+# - performance
+# - generated code size, datatype size and stack usage
+# in this order
+#
+# The "limbs" API limits code duplication
+# due to generic/static monomorphization for bit-width
+# that are represented with the same number of words.
+#
+# It also exposes at the number of words to the compiler
+# to allow aggressive unrolling and inlining for example
+# of multi-precision addition which is so small (2 instructions per word)
+# that inlining it improves both performance and code-size
+# even for 2 curves (secp256k1 and BN254) that could share the code.
+#
+# The limb-endianess is little-endian, less significant limb is at index 0.
+# The word-endianness is native-endian.
+
+type Limbs*[N: static int] = array[N, Word]
+  ## Limbs-type
+  ## Should be distinct type to avoid builtins to use non-constant time
+  ## implementation, for example for comparison.
+  ##
+  ## but for unknown reason, it prevents semchecking `bits`
+
+# No exceptions allowed
+{.push raises: [].}
+
+# ############################################################
+#
+#                      Accessors
+#
+# ############################################################
+#
+# Commented out since we don't use a distinct type
+
+# template `[]`[N](v: Limbs[N], idx: int): Word =
+#   (array[N, Word])(v)[idx]
+#
+# template `[]`[N](v: var Limbs[N], idx: int): var Word =
+#   (array[N, Word])(v)[idx]
+#
+# template `[]=`[N](v: Limbs[N], idx: int, val: Word) =
+#   (array[N, Word])(v)[idx] = val
+
+# ############################################################
+#
+#           Checks and debug/test only primitives
+#
+# ############################################################
+
+# ############################################################
+#
+#                   Limbs Primitives
+#
+# ############################################################
+
+{.push inline.}
+# The following primitives are small enough on regular limb sizes
+# (BN254 and secp256k1 -> 4 limbs, BLS12-381 -> 6 limbs)
+# that inline both decreases the code size and increases speed
+# as we avoid the parmeter packing/unpacking ceremony at function entry/exit
+# and unrolling overhead is minimal.
+
+func `==`*(a, b: Limbs): CTBool[Word] =
+  ## Returns true if 2 limbs are equal
+  ## Comparison is constant-time
+  var accum = Zero
+  for i in 0 ..< a.len:
+    accum = accum or (a[i] xor b[i])
+  result = accum.isZero()
+
+func isZero*(a: Limbs): CTBool[Word] =
+  ## Returns true if ``a`` is equal to zero
+  var accum = Zero
+  for i in 0 ..< a.len:
+    accum = accum or a[i]
+  result = accum.isZero()
+
+func setZero*(a: var Limbs) =
+  ## Set ``a`` to 0
+  zeroMem(a[0].addr, sizeof(a))
+
+func setOne*(a: var Limbs) =
+  ## Set ``a`` to 1
+  a[0] = Word(1)
+  when a.len > 1:
+    zeroMem(a[1].addr, (a.len - 1) * sizeof(Word))
+
+func ccopy*(a: var Limbs, b: Limbs, ctl: CTBool[Word]) =
+  ## Constant-time conditional copy
+  ## If ctl is true: b is copied into a
+  ## if ctl is false: b is not copied and a is untouched
+  ## Time and memory accesses are the same whether a copy occurs or not
+  # TODO: on x86, we use inline assembly for CMOV
+  #       the codegen is a bit inefficient as the condition `ctl`
+  #       is tested for each limb.
+  for i in 0 ..< a.len:
+    ctl.ccopy(a[i], b[i])
+
+func add*(a: var Limbs, b: Limbs): Carry =
+  ## Limbs addition
+  ## Returns the carry
+  result = Carry(0)
+  for i in 0 ..< a.len:
+    addC(result, a[i], a[i], b[i], result)
+
+func cadd*(a: var Limbs, b: Limbs, ctl: CTBool[Word]): Carry =
+  ## Limbs conditional addition
+  ## Returns the carry
+  ##
+  ## if ctl is true: a <- a + b
+  ## if ctl is false: a <- a
+  ## The carry is always computed whether ctl is true or false
+  ##
+  ## Time and memory accesses are the same whether a copy occurs or not
+  result = Carry(0)
+  var sum: Word
+  for i in 0 ..< a.len:
+    addC(result, sum, a[i], b[i], result)
+    ctl.ccopy(a[i], sum)
+
+func sum*(r: var Limbs, a, b: Limbs): Carry =
+  ## Sum `a` and `b` into `r`
+  ## `r` is initialized/overwritten
+  ##
+  ## Returns the carry
+  result = Carry(0)
+  for i in 0 ..< a.len:
+    addC(result, r[i], a[i], b[i], result)
+
+func sub*(a: var Limbs, b: Limbs): Borrow =
+  ## Limbs substraction
+  ## Returns the borrow
+  result = Borrow(0)
+  for i in 0 ..< a.len:
+    subB(result, a[i], a[i], b[i], result)
+
+func csub*(a: var Limbs, b: Limbs, ctl: CTBool[Word]): Borrow =
+  ## Limbs conditional substraction
+  ## Returns the borrow
+  ##
+  ## if ctl is true: a <- a - b
+  ## if ctl is false: a <- a
+  ## The borrow is always computed whether ctl is true or false
+  ##
+  ## Time and memory accesses are the same whether a copy occurs or not
+  result = Borrow(0)
+  var diff: Word
+  for i in 0 ..< a.len:
+    subB(result, diff, a[i], b[i], result)
+    ctl.ccopy(a[i], diff)
+
+func diff*(r: var Limbs, a, b: Limbs): Borrow =
+  ## Diff `a` and `b` into `r`
+  ## `r` is initialized/overwritten
+  ##
+  ## Returns the borrow
+  result = Borrow(0)
+  for i in 0 ..< a.len:
+    subB(result, r[i], a[i], b[i], result)
+
+func `<`*(a, b: Limbs): CTBool[Word] =
+  ## Returns true if a < b
+  ## Comparison is constant-time
+  var diff: Word
+  var borrow: Borrow
+  for i in 0 ..< a.len:
+    subB(borrow, diff, a[i], b[i], borrow)
+
+  result = (CTBool[Word])(borrow)
+
+func `<=`*(a, b: Limbs): CTBool[Word] =
+  ## Returns true if a <= b
+  ## Comparison is constant-time
+  not(b < a)
+
+{.pop.} # inline
+
+# ############################################################
+#
+#                   Modular BigInt
+#
+# ############################################################
+#
+# To avoid code-size explosion due to monomorphization
+# and given that reductions are not in hot path in Constantine
+# we use type-erased procedures, instead of instantiating
+# one per number of limbs combination
+
+# Type-erasure
+# ------------------------------------------------------------
+
+type
+  LimbsView = ptr UncheckedArray[Word]
+    ## Type-erased fixed-precision limbs
+    ##
+    ## This type mirrors the Limb type and is used
+    ## for some low-level computation API
+    ## This design
+    ## - avoids code bloat due to generic monomorphization
+    ##   otherwise limbs routines would have an instantiation for
+    ##   each number of words.
+    ##
+    ## Accesses should be done via BigIntViewConst / BigIntViewConst
+    ## to have the compiler check for mutability
+
+  # "Indirection" to enforce pointer types deep immutability
+  LimbsViewConst = distinct LimbsView
+    ## Immutable view into the limbs of a BigInt
+  LimbsViewMut = distinct LimbsView
+    ## Mutable view into a BigInt
+  LimbsViewAny = LimbsViewConst or LimbsViewMut
+
+# Deep Mutability safety
+# ------------------------------------------------------------
+
+template view(a: Limbs): LimbsViewConst =
+  ## Returns a borrowed type-erased immutable view to a bigint
+  LimbsViewConst(cast[LimbsView](a.unsafeAddr))
+
+template view(a: var Limbs): LimbsViewMut =
+  ## Returns a borrowed type-erased mutable view to a mutable bigint
+  LimbsViewMut(cast[LimbsView](a.addr))
+
+template `[]`*(v: LimbsViewConst, limbIdx: int): Word =
+  LimbsView(v)[limbIdx]
+
+template `[]`*(v: LimbsViewMut, limbIdx: int): var Word =
+  LimbsView(v)[limbIdx]
+
+template `[]=`*(v: LimbsViewMut, limbIdx: int, val: Word) =
+  LimbsView(v)[limbIdx] = val
+
+# Type-erased add-sub
+# ------------------------------------------------------------
+
+func cadd(a: LimbsViewMut, b: LimbsViewAny, ctl: CTBool[Word], len: int): Carry =
+  ## Type-erased conditional addition
+  ## Returns the carry
+  ##
+  ## if ctl is true: a <- a + b
+  ## if ctl is false: a <- a
+  ## The carry is always computed whether ctl is true or false
+  ##
+  ## Time and memory accesses are the same whether a copy occurs or not
+  result = Carry(0)
+  var sum: Word
+  for i in 0 ..< len:
+    addC(result, sum, a[i], b[i], result)
+    ctl.ccopy(a[i], sum)
+
+func csub(a: LimbsViewMut, b: LimbsViewAny, ctl: CTBool[Word], len: int): Borrow =
+  ## Type-erased conditional addition
+  ## Returns the borrow
+  ##
+  ## if ctl is true: a <- a - b
+  ## if ctl is false: a <- a
+  ## The borrow is always computed whether ctl is true or false
+  ##
+  ## Time and memory accesses are the same whether a copy occurs or not
+  result = Borrow(0)
+  var diff: Word
+  for i in 0 ..< len:
+    subB(result, diff, a[i], b[i], result)
+    ctl.ccopy(a[i], diff)
+
+# Modular reduction
+# ------------------------------------------------------------
+
+func numWordsFromBits(bits: int): int {.inline.} =
+  const divShiftor = log2(uint32(WordBitWidth))
+  result = (bits + WordBitWidth - 1) shr divShiftor
+
+func shlAddMod_estimate(a: LimbsViewMut, aLen: int,
+                        c: Word, M: LimbsViewConst, mBits: int
+                      ): tuple[neg, tooBig: CTBool[Word]] =
+  ## Estimate a <- a shl 2^w + c (mod M)
+  ##
+  ## with w the base word width, usually 32 on 32-bit platforms and 64 on 64-bit platforms
+  ##
+  ## Updates ``a`` and returns ``neg`` and ``tooBig``
+  ## If ``neg``, the estimate in ``a`` is negative and ``M`` must be added to it.
+  ## If ``tooBig``, the estimate in ``a`` overflowed and ``M`` must be substracted from it.
+
+  # Aliases
+  # ----------------------------------------------------------------------
+  let MLen = numWordsFromBits(mBits)
+
+  # Captures aLen and MLen
+  template `[]`(v: untyped, limbIdxFromEnd: BackwardsIndex): Word {.dirty.}=
+    v[`v Len` - limbIdxFromEnd.int]
+
+  # ----------------------------------------------------------------------
+                                                          # Assuming 64-bit words
+  let hi = a[^1]                                          # Save the high word to detect carries
+  let R = mBits and (WordBitWidth - 1)                    # R = mBits mod 64
+
+  var a0, a1, m0: Word
+  if R == 0:                                              # If the number of mBits is a multiple of 64
+    a0 = a[^1]                                            #
+    moveMem(a[1].addr, a[0].addr, (aLen-1) * Word.sizeof) # we can just shift words
+    a[0] = c                                              # and replace the first one by c
+    a1 = a[^1]
+    m0 = M[^1]
+  else:                                                   # Else: need to deal with partial word shifts at the edge.
+    a0 = (a[^1] shl (WordBitWidth-R)) or (a[^2] shr R)
+    moveMem(a[1].addr, a[0].addr, (aLen-1) * Word.sizeof)
+    a[0] = c
+    a1 = (a[^1] shl (WordBitWidth-R)) or (a[^2] shr R)
+    m0 = (M[^1] shl (WordBitWidth-R)) or (M[^2] shr R)
+
+  # m0 has its high bit set. (a0, a1)/p0 fits in a limb.
+  # Get a quotient q, at most we will be 2 iterations off
+  # from the true quotient
+  var q, r: Word
+  unsafeDiv2n1n(q, r, a0, a1, m0)                # Estimate quotient
+  q = mux(                                       # If n_hi == divisor
+        a0 == m0, MaxWord,                       # Quotient == MaxWord (0b1111...1111)
+        mux(
+          q.isZero, Zero,                        # elif q == 0, true quotient = 0
+          q - One                                # else instead of being of by 0, 1 or 2
+        )                                        # we returning q-1 to be off by -1, 0 or 1
+      )
+
+  # Now substract a*2^64 - q*p
+  var carry = Zero
+  var over_p = CtTrue                            # Track if quotient greater than the modulus
+
+  for i in 0 ..< MLen:
+    var qp_lo: Word
+
+    block: # q*p
+      # q * p + carry (doubleword) carry from previous limb
+      muladd1(carry, qp_lo, q, M[i], Word carry)
+
+    block: # a*2^64 - q*p
+      var borrow: Borrow
+      subB(borrow, a[i], a[i], qp_lo, Borrow(0))
+      carry += Word(borrow) # Adjust if borrow
+
+    over_p = mux(
+              a[i] == M[i], over_p,
+              a[i] > M[i]
+            )
+
+  # Fix quotient, the true quotient is either q-1, q or q+1
+  #
+  # if carry < q or carry == q and over_p we must do "a -= p"
+  # if carry > hi (negative result) we must do "a += p"
+
+  result.neg = Word(carry) > hi
+  result.tooBig = not(result.neg) and (over_p or (Word(carry) < hi))
+
+func shlAddMod(a: LimbsViewMut, aLen: int,
+               c: Word, M: LimbsViewConst, mBits: int) =
+  ## Fused modular left-shift + add
+  ## Shift input `a` by a word and add `c` modulo `M`
+  ##
+  ## With a word W = 2^WordBitSize and a modulus M
+  ## Does a <- a * W + c (mod M)
+  ##
+  ## The modulus `M` most-significant bit at `mBits` MUST be set.
+  if mBits <= WordBitWidth:
+    # If M fits in a single limb
+
+    # We normalize M with R so that the MSB is set
+    # And normalize (a * 2^64 + c) by R as well to maintain the result
+    # This ensures that (a0, a1)/p0 fits in a limb.
+    let R = mBits and (WordBitWidth - 1)
+
+    # (hi, lo) = a * 2^64 + c
+    let hi = (a[0] shl (WordBitWidth-R)) or (c shr R)
+    let lo = c shl (WordBitWidth-R)
+    let m0 = M[0] shl (WordBitWidth-R)
+
+    var q, r: Word
+    unsafeDiv2n1n(q, r, hi, lo, m0)  # (hi, lo) mod M
+
+    a[0] = r shr (WordBitWidth-R)
+
+  else:
+    ## Multiple limbs
+    let (neg, tooBig) = shlAddMod_estimate(a, aLen, c, M, mBits)
+    discard a.cadd(M, ctl = neg, aLen)
+    discard a.csub(M, ctl = tooBig, aLen)
+
+func reduce(r: LimbsViewMut,
+            a: LimbsViewAny, aBits: int,
+            M: LimbsViewConst, mBits: int) =
+  ## Reduce `a` modulo `M` and store the result in `r`
+  let aLen = numWordsFromBits(aBits)
+  let mLen = numWordsFromBits(mBits)
+  let rLen = mLen
+
+  if aBits < mBits:
+    # if a uses less bits than the modulus,
+    # it is guaranteed < modulus.
+    # This relies on the precondition that the modulus uses all declared bits
+    copyMem(r[0].addr, a[0].unsafeAddr, aLen * sizeof(Word))
+    for i in aLen ..< mLen:
+      r[i] = Zero
+  else:
+    # a length i at least equal to the modulus.
+    # we can copy modulus.limbs-1 words
+    # and modular shift-left-add the rest
+    let aOffset = aLen - mLen
+    copyMem(r[0].addr, a[aOffset+1].unsafeAddr, (mLen-1) * sizeof(Word))
+    r[rLen - 1] = Zero
+    # Now shift-left the copied words while adding the new word modulo M
+    for i in countdown(aOffset, 0):
+      shlAddMod(r, rLen, a[i], M, mBits)
+
+func reduce*[aLen, mLen](r: var Limbs[mLen],
+                         a: Limbs[aLen], aBits: static int,
+                         M: Limbs[mLen], mBits: static int
+                        ) {.inline.} =
+  ## Reduce `a` modulo `M` and store the result in `r`
+  ##
+  ## Warning ⚠: At the moment this is NOT constant-time
+  ##            as it relies on hardware division.
+  # This is implemented via type-erased indirection to avoid
+  # a significant amount of code duplication if instantiated for
+  # varying bitwidth.
+  reduce(r.view(), a.view(), aBits, M.view(), mBits)
--- a/constantine/arithmetic/montgomery.nim
+++ b/constantine/arithmetic/montgomery.nim
@ -0,0 +1,473 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import
+  ../config/common,
+  ../primitives,
+  ./limbs,
+  macros
+
+# ############################################################
+#
+#         Multiprecision Montgomery Arithmetic
+#
+# ############################################################
+#
+# Note: Montgomery multiplications and squarings are the biggest bottlenecks
+#       of an elliptic curve library, asymptotically 100% of the costly algorithms:
+#       - field exponentiation
+#       - field inversion
+#       - extension towers multiplication, squarings, inversion
+#       - elliptic curve point addition
+#       - elliptic curve point doubling
+#       - elliptic curve point multiplication
+#       - pairing Miller Loop
+#       - pairing final exponentiation
+#       are bottlenecked by Montgomery multiplications or squarings
+#
+# Unfortunately, the fastest implementation of Montgomery Multiplication
+# on x86 is impossible without resorting to assembly (probably 15~30% faster)
+#
+# It requires implementing 2 parallel pipelines of carry-chains (via instruction-level parallelism)
+# of MULX, ADCX and ADOX instructions, according to Intel paper:
+# https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
+# and the code generation of MCL
+# https://github.com/herumi/mcl
+#
+# A generic implementation would require implementing a mini-compiler as macro
+# significantly sacrificing code readability, portability, auditability and maintainability.
+#
+# This would however save significant hardware or cloud resources.
+# An example inline assembly compiler for add-with-carry is available in
+# primitives/research/addcarry_subborrow_compiler.nim
+#
+# Instead we follow the optimized high-level implementation of Goff
+# which skips a significant amount of additions for moduli
+# that have their the most significant bit unset.
+
+# Loop unroller
+# ------------------------------------------------------------
+
+proc replaceNodes(ast: NimNode, what: NimNode, by: NimNode): NimNode =
+  # Replace "what" ident node by "by"
+  proc inspect(node: NimNode): NimNode =
+    case node.kind:
+    of {nnkIdent, nnkSym}:
+      if node.eqIdent(what):
+        return by
+      return node
+    of nnkEmpty:
+      return node
+    of nnkLiterals:
+      return node
+    else:
+      var rTree = node.kind.newTree()
+      for child in node:
+        rTree.add inspect(child)
+      return rTree
+  result = inspect(ast)
+
+macro staticFor(idx: untyped{nkIdent}, start, stopEx: static int, body: untyped): untyped =
+  result = newStmtList()
+  for i in start ..< stopEx:
+    result.add nnkBlockStmt.newTree(
+      ident("unrolledIter_" & $idx & $i),
+      body.replaceNodes(idx, newLit i)
+    )
+
+# Implementation
+# ------------------------------------------------------------
+# Note: the low-level implementations should not use static parameter
+#       the code generated is already big enough for curve with different
+#       limb sizes, we want to use the same codepath when limbs lenght are compatible.
+
+func montyMul_CIOS_nocarry_unrolled(r: var Limbs, a, b, M: Limbs, m0ninv: BaseType) =
+  ## Montgomery Multiplication using Coarse Grained Operand Scanning (CIOS)
+  ## and no-carry optimization.
+  ## This requires the most significant word of the Modulus
+  ##   M[^1] < high(Word) shr 1 (i.e. less than 0b01111...1111)
+  ## https://hackmd.io/@zkteam/modular_multiplication
+
+  # We want all the computation to be kept in registers
+  # hence we use a temporary `t`, hoping that the compiler does it.
+  var t: typeof(M) # zero-init
+  const N = t.len
+  staticFor i, 0, N:
+    # (A, t[0]) <- a[0] * b[i] + t[0]
+    #  m        <- (t[0] * m0ninv) mod 2^w
+    # (C, _)    <- m * M[0] + t[0]
+    var A: Word
+    muladd1(A, t[0], a[0], b[i], t[0])
+    let m = t[0] * Word(m0ninv)
+    var C, lo: Word
+    muladd1(C, lo, m, M[0], t[0])
+
+    staticFor j, 1, N:
+      # (A, t[j])   <- a[j] * b[i] + A + t[j]
+      # (C, t[j-1]) <- m * M[j] + C + t[j]
+      muladd2(A, t[j], a[j], b[i], A, t[j])
+      muladd2(C, t[j-1], m, M[j], C, t[j])
+
+    t[N-1] = C + A
+
+  discard t.csub(M, not(t < M))
+  r = t
+
+func montyMul_CIOS(r: var Limbs, a, b, M: Limbs, m0ninv: BaseType) =
+  ## Montgomery Multiplication using Coarse Grained Operand Scanning (CIOS)
+  # - Analyzing and Comparing Montgomery Multiplication Algorithms
+  #   Cetin Kaya Koc and Tolga Acar and Burton S. Kaliski Jr.
+  #   http://pdfs.semanticscholar.org/5e39/41ff482ec3ee41dc53c3298f0be085c69483.pdf
+  #
+  # - Montgomery Arithmetic from a Software Perspective\
+  #   Joppe W. Bos and Peter L. Montgomery, 2017\
+  #   https://eprint.iacr.org/2017/1057
+
+  # We want all the computation to be kept in registers
+  # hence we use a temporary `t`, hoping that the compiler does it.
+  var t: typeof(M)   # zero-init
+  const N = t.len
+  # Extra words to handle up to 2 carries t[N] and t[N+1]
+  var tN: Word
+  var tNp1: Carry
+
+  staticFor i, 0, N:
+    var C = Zero
+
+    # Multiplication
+    staticFor j, 0, N:
+      # (C, t[j]) <- a[j] * b[i] + t[j] + C
+      muladd2(C, t[j], a[j], b[i], t[j], C)
+    addC(tNp1, tN, tN, C, Carry(0))
+
+    # Reduction
+    #  m        <- (t[0] * m0ninv) mod 2^w
+    # (C, _)    <- m * M[0] + t[0]
+    var lo: Word
+    C = Zero
+    let m = t[0] * Word(m0ninv)
+    muladd1(C, lo, m, M[0], t[0])
+    staticFor j, 1, N:
+      # (C, t[j]) <- a[j] * b[i] + t[j] + C
+      muladd2(C, t[j-1], m, M[j], t[j], C)
+
+    #  (C,t[N-1]) <- t[N] + C
+    #  (_, t[N])  <- t[N+1] + C
+    var carry: Carry
+    addC(carry, t[N-1], tN, C, Carry(0))
+    addC(carry, tN, Word(tNp1), Zero, carry)
+
+  # t[N+1] can only be non-zero in the intermediate computation
+  # since it is immediately reduce to t[N] at the end of each "i" iteration
+  # However if t[N] is non-zero we have t > M
+  discard t.csub(M, tN.isNonZero() or not(t < M)) # TODO: (t >= M) is unnecessary for prime in the form (2^64)^w
+  r = t
+
+# Exported API
+# ------------------------------------------------------------
+
+func montyMul*(
+        r: var Limbs, a, b, M: Limbs,
+        m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) {.inline.} =
+  ## Compute r <- a*b (mod M) in the Montgomery domain
+  ## `m0ninv` = -1/M (mod Word). Our words are 2^32 or 2^64
+  ##
+  ## This resets r to zero before processing. Use {.noInit.}
+  ## to avoid duplicating with Nim zero-init policy
+  ## The result `r` buffer size MUST be at least the size of `M` buffer
+  ##
+  ##
+  ## Assuming 64-bit words, the magic constant should be:
+  ##
+  ## - µ ≡ -1/M[0] (mod 2^64) for a general multiplication
+  ##   This can be precomputed with `negInvModWord`
+  ## - 1 for conversion from Montgomery to canonical representation
+  ##   The library implements a faster `redc` primitive for that use-case
+  ## - R^2 (mod M) for conversion from canonical to Montgomery representation
+  ##
+  # i.e. c'R <- a'R b'R * R^-1 (mod M) in the natural domain
+  # as in the Montgomery domain all numbers are scaled by R
+
+  # Nim doesn't like static Word, so we pass static BaseType up to here
+  # Then we passe them as Word again for the final processing.
+
+  # Many curve moduli are "Montgomery-friendly" which means that m0inv is 1
+  # This saves N basic type multiplication and potentially many register mov
+  # as well as unless using "mulx" instruction, x86 "mul" requires very specific registers.
+  # Compilers should be able to constant-propagate, but this prevents reusing code
+  # for example between secp256k1 (friendly) and BN254 (non-friendly).
+  # Here, as "montyMul" is inlined at the call site, the compiler shouldn't constant fold, saving size.
+  # Inlining the implementation instead (and no-inline this "montyMul" proc) would allow constant propagation
+  # of Montgomery-friendly m0ninv if the compiler deems it interesting,
+  # or we use `when m0ninv == 1` and enforce the inlining.
+  when canUseNoCarryMontyMul:
+    montyMul_CIOS_nocarry_unrolled(r, a, b, M, m0ninv)
+  else:
+    montyMul_CIOS(r, a, b, M, m0ninv)
+
+func montySquare*(r: var Limbs, a, M: Limbs,
+                  m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) {.inline.} =
+  ## Compute r <- a^2 (mod M) in the Montgomery domain
+  ## `negInvModWord` = -1/M (mod Word). Our words are 2^31 or 2^63
+  montyMul(r, a, a, M, m0ninv, canUseNoCarryMontyMul)
+
+func redc*(r: var Limbs, a, one, M: Limbs,
+           m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) {.inline.} =
+  ## Transform a bigint ``a`` from it's Montgomery N-residue representation (mod N)
+  ## to the regular natural representation (mod N)
+  ##
+  ## with W = M.len
+  ## and R = (2^WordBitSize)^W
+  ##
+  ## Does "a * R^-1 (mod M)"
+  ##
+  ## This is called a Montgomery Reduction
+  ## The Montgomery Magic Constant is µ = -1/N mod M
+  ## is used internally and can be precomputed with negInvModWord(Curve)
+  # References:
+  #   - https://eprint.iacr.org/2017/1057.pdf (Montgomery)
+  #     page: Radix-r interleaved multiplication algorithm
+  #   - https://en.wikipedia.org/wiki/Montgomery_modular_multiplication#Montgomery_arithmetic_on_multiprecision_(variable-radix)_integers
+  #   - http://langevin.univ-tln.fr/cours/MLC/extra/montgomery.pdf
+  #     Montgomery original paper
+  #
+  montyMul(r, a, one, M, m0ninv, canUseNoCarryMontyMul)
+
+func montyResidue*(r: var Limbs, a, M, r2modM: Limbs,
+                   m0ninv: static BaseType, canUseNoCarryMontyMul: static bool) {.inline.} =
+  ## Transform a bigint ``a`` from it's natural representation (mod N)
+  ## to a the Montgomery n-residue representation
+  ##
+  ## Montgomery-Multiplication - based
+  ##
+  ## with W = M.len
+  ## and R = (2^WordBitSize)^W
+  ##
+  ## Does "a * R (mod M)"
+  ##
+  ## `a`: The source BigInt in the natural representation. `a` in [0, N) range
+  ## `M`: The field modulus. M must be odd.
+  ## `r2modM`: 2^WordBitSize mod `M`. Can be precomputed with `r2mod` function
+  ##
+  ## Important: `r` is overwritten
+  ## The result `r` buffer size MUST be at least the size of `M` buffer
+  # Reference: https://eprint.iacr.org/2017/1057.pdf
+  montyMul(r, a, r2ModM, M, m0ninv, canUseNoCarryMontyMul)
+
+# Montgomery Modular Exponentiation
+# ------------------------------------------
+# We use fixed-window based exponentiation
+# that is constant-time: i.e. the number of multiplications
+# does not depend on the number of set bits in the exponents
+# those are always done and conditionally copied.
+#
+# The exponent MUST NOT be private data (until audited otherwise)
+# - Power attack on RSA, https://www.di.ens.fr/~fouque/pub/ches06.pdf
+# - Flush-and-reload on Sliding window exponentiation: https://tutcris.tut.fi/portal/files/8966761/p1639_pereida_garcia.pdf
+# - Sliding right into disaster, https://eprint.iacr.org/2017/627.pdf
+# - Fixed window leak: https://www.scirp.org/pdf/JCC_2019102810331929.pdf
+# - Constructing sliding-windows leak, https://easychair.org/publications/open/fBNC
+#
+# For pairing curves, this is the case since exponentiation is only
+# used for inversion via the Little Fermat theorem.
+# For RSA, some exponentiations uses private exponents.
+#
+# Note:
+# - Implementation closely follows Thomas Pornin's BearSSL
+# - Apache Milagro Crypto has an alternative implementation
+#   that is more straightforward however:
+#   - the exponent hamming weight is used as loop bounds
+#   - the base^k is stored at each index of a temp table of size k
+#   - the base^k to use is indexed by the hamming weight
+#     of the exponent, leaking this to cache attacks
+#   - in contrast BearSSL touches the whole table to
+#     hide the actual selection
+
+template checkPowScratchSpaceLen(len: int) =
+  ## Checks that there is a minimum of scratchspace to hold the temporaries
+  debug:
+    assert len >= 2, "Internal Error: the scratchspace for powmod should be equal or greater than 2"
+
+func getWindowLen(bufLen: int): uint =
+  ## Compute the maximum window size that fits in the scratchspace buffer
+  checkPowScratchSpaceLen(bufLen)
+  result = 5
+  while (1 shl result) + 1 > bufLen:
+    dec result
+
+func montyPowPrologue(
+       a: var Limbs, M, one: Limbs,
+       m0ninv: static BaseType,
+       scratchspace: var openarray[Limbs],
+       canUseNoCarryMontyMul: static bool
+     ): uint =
+  ## Setup the scratchspace
+  ## Returns the fixed-window size for exponentiation with window optimization.
+  result = scratchspace.len.getWindowLen()
+  # Precompute window content, special case for window = 1
+  # (i.e scratchspace has only space for 2 temporaries)
+  # The content scratchspace[2+k] is set at a^k
+  # with scratchspace[0] untouched
+  if result == 1:
+    scratchspace[1] = a
+  else:
+    scratchspace[2] = a
+    for k in 2 ..< 1 shl result:
+      scratchspace[k+1].montyMul(scratchspace[k], a, M, m0ninv, canUseNoCarryMontyMul)
+
+  # Set a to one
+  a = one
+
+func montyPowSquarings(
+        a: var Limbs,
+        exponent: openarray[byte],
+        M: Limbs,
+        negInvModWord: static BaseType,
+        tmp: var Limbs,
+        window: uint,
+        acc, acc_len: var uint,
+        e: var int,
+        canUseNoCarryMontyMul: static bool
+      ): tuple[k, bits: uint] {.inline.}=
+  ## Squaring step of exponentiation by squaring
+  ## Get the next k bits in range [1, window)
+  ## Square k times
+  ## Returns the number of squarings done and the corresponding bits
+  ##
+  ## Updates iteration variables and accumulators
+  # Due to the high number of parameters,
+  # forcing this inline actually reduces the code size
+
+  # Get the next bits
+  var k = window
+  if acc_len < window:
+    if e < exponent.len:
+      acc = (acc shl 8) or exponent[e].uint
+      inc e
+      acc_len += 8
+    else: # Drained all exponent bits
+      k = acc_len
+
+  let bits = (acc shr (acc_len - k)) and ((1'u32 shl k) - 1)
+  acc_len -= k
+
+  # We have k bits and can do k squaring
+  for i in 0 ..< k:
+    tmp.montySquare(a, M, negInvModWord, canUseNoCarryMontyMul)
+    a = tmp
+
+  return (k, bits)
+
+func montyPow*(
+       a: var Limbs,
+       exponent: openarray[byte],
+       M, one: Limbs,
+       negInvModWord: static BaseType,
+       scratchspace: var openarray[Limbs],
+       canUseNoCarryMontyMul: static bool
+      ) =
+  ## Modular exponentiation r = a^exponent mod M
+  ## in the Montgomery domain
+  ##
+  ## This uses fixed-window optimization if possible
+  ##
+  ## - On input ``a`` is the base, on ``output`` a = a^exponent (mod M)
+  ##   ``a`` is in the Montgomery domain
+  ## - ``exponent`` is the exponent in big-endian canonical format (octet-string)
+  ##   Use ``exportRawUint`` for conversion
+  ## - ``M`` is the modulus
+  ## - ``one`` is 1 (mod M) in montgomery representation
+  ## - ``negInvModWord`` is the montgomery magic constant "-1/M[0] mod 2^WordBitSize"
+  ## - ``scratchspace`` with k the window bitsize of size up to 5
+  ##   This is a buffer that can hold between 2^k + 1 big-ints
+  ##   A window of of 1-bit (no window optimization) requires only 2 big-ints
+  ##
+  ## Note that the best window size require benchmarking and is a tradeoff between
+  ## - performance
+  ## - stack usage
+  ## - precomputation
+  ##
+  ## For example BLS12-381 window size of 5 is 30% faster than no window,
+  ## but windows of size 2, 3, 4 bring no performance benefit, only increased stack space.
+  ## A window of size 5 requires (2^5 + 1)*(381 + 7)/8 = 33 * 48 bytes = 1584 bytes
+  ## of scratchspace (on the stack).
+
+  let window = montyPowPrologue(a, M, one, negInvModWord, scratchspace, canUseNoCarryMontyMul)
+
+  # We process bits with from most to least significant.
+  # At each loop iteration with have acc_len bits in acc.
+  # To maintain constant-time the number of iterations
+  # or the number of operations or memory accesses should be the same
+  # regardless of acc & acc_len
+  var
+    acc, acc_len: uint
+    e = 0
+  while acc_len > 0 or e < exponent.len:
+    let (k, bits) = montyPowSquarings(
+      a, exponent, M, negInvModWord,
+      scratchspace[0], window,
+      acc, acc_len, e,
+      canUseNoCarryMontyMul
+    )
+
+    # Window lookup: we set scratchspace[1] to the lookup value.
+    # If the window length is 1, then it's already set.
+    if window > 1:
+      # otherwise we need a constant-time lookup
+      # in particular we need the same memory accesses, we can't
+      # just index the openarray with the bits to avoid cache attacks.
+      for i in 1 ..< 1 shl k:
+        let ctl = Word(i) == Word(bits)
+        scratchspace[1].ccopy(scratchspace[1+i], ctl)
+
+    # Multiply with the looked-up value
+    # we keep the product only if the exponent bits are not all zero
+    scratchspace[0].montyMul(a, scratchspace[1], M, negInvModWord, canUseNoCarryMontyMul)
+    a.ccopy(scratchspace[0], Word(bits).isNonZero())
+
+func montyPowUnsafeExponent*(
+       a: var Limbs,
+       exponent: openarray[byte],
+       M, one: Limbs,
+       negInvModWord: static BaseType,
+       scratchspace: var openarray[Limbs],
+       canUseNoCarryMontyMul: static bool
+      ) =
+  ## Modular exponentiation r = a^exponent mod M
+  ## in the Montgomery domain
+  ##
+  ## Warning ⚠️ :
+  ## This is an optimization for public exponent
+  ## Otherwise bits of the exponent can be retrieved with:
+  ## - memory access analysis
+  ## - power analysis
+  ## - timing analysis
+
+  # TODO: scratchspace[1] is unused when window > 1
+
+  let window = montyPowPrologue(a, M, one, negInvModWord, scratchspace, canUseNoCarryMontyMul)
+
+  var
+    acc, acc_len: uint
+    e = 0
+  while acc_len > 0 or e < exponent.len:
+    let (k, bits) = montyPowSquarings(
+      a, exponent, M, negInvModWord,
+      scratchspace[0], window,
+      acc, acc_len, e,
+      canUseNoCarryMontyMul
+    )
+
+    ## Warning ⚠️: Exposes the exponent bits
+    if bits != 0:
+      if window > 1:
+        scratchspace[0].montyMul(a, scratchspace[1+bits], M, negInvModWord, canUseNoCarryMontyMul)
+      else:
+        # scratchspace[1] holds the original `a`
+        scratchspace[0].montyMul(a, scratchspace[1], M, negInvModWord, canUseNoCarryMontyMul)
+      a = scratchspace[0]
--- a/constantine/arithmetic/precomputed.nim
+++ b/constantine/arithmetic/precomputed.nim
@ -7,7 +7,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import
-  ./bigints_checked,
+  ./bigints,
  ../primitives/constant_time,
  ../config/common,
  ../io/io_bigints
@ -22,37 +22,77 @@ import
 # ############################################################
 #
 # Those primitives are intended to be compile-time only
-# They are generic over the bitsize: enabling them at runtime
-# would create a copy for each bitsize used (monomorphization)
-# leading to code-bloat.
-# Thos are NOT compile-time, using CTBool seems to confuse the VM
+# Those are NOT tagged compile-time, using CTBool seems to confuse the VM

 # We don't use distinct types here, they confuse the VM
-# Similarly, isMsbSet causes trouble with distinct type in the VM
+# Similarly, using addC / subB confuses the VM

-func isMsbSet(x: BaseType): bool =
-  const msb_pos = BaseType.sizeof * 8 - 1
-  bool(x shr msb_pos)
+# As we choose to use the full 32/64 bits of the integers and there is no carry flag
+# in the compile-time VM we need a portable (and slow) "adc" and "sbb".
+# Hopefully compilation time stays decent.
+
+const
+  HalfWidth = WordBitWidth shr 1
+  HalfBase = (BaseType(1) shl HalfWidth)
+  HalfMask = HalfBase - 1
+
+func split(n: BaseType): tuple[hi, lo: BaseType] =
+  result.hi = n shr HalfWidth
+  result.lo = n and HalfMask
+
+func merge(hi, lo: BaseType): BaseType =
+  (hi shl HalfWidth) or lo
+
+func addC(cOut, sum: var BaseType, a, b, cIn: BaseType) =
+  # Add with carry, fallback for the Compile-Time VM
+  # (CarryOut, Sum) <- a + b + CarryIn
+  let (aHi, aLo) = split(a)
+  let (bHi, bLo) = split(b)
+  let tLo = aLo + bLo + cIn
+  let (cLo, rLo) = split(tLo)
+  let tHi = aHi + bHi + cLo
+  let (cHi, rHi) = split(tHi)
+  cOut = cHi
+  sum = merge(rHi, rLo)
+
+func subB(bOut, diff: var BaseType, a, b, bIn: BaseType) =
+  # Substract with borrow, fallback for the Compile-Time VM
+  # (BorrowOut, Sum) <- a - b - BorrowIn
+  let (aHi, aLo) = split(a)
+  let (bHi, bLo) = split(b)
+  let tLo = HalfBase + aLo - bLo - bIn
+  let (noBorrowLo, rLo) = split(tLo)
+  let tHi = HalfBase + aHi - bHi - BaseType(noBorrowLo == 0)
+  let (noBorrowHi, rHi) = split(tHi)
+  bOut = BaseType(noBorrowHi == 0)
+  diff = merge(rHi, rLo)

 func dbl(a: var BigInt): bool =
  ## In-place multiprecision double
  ##   a -> 2a
+  var carry, sum: BaseType
  for i in 0 ..< a.limbs.len:
-    var z = BaseType(a.limbs[i]) * 2 + BaseType(result)
-    result = z.isMsbSet()
-    a.limbs[i] = mask(Word(z))
+    let ai = BaseType(a.limbs[i])
+    addC(carry, sum, ai, ai, carry)
+    a.limbs[i] = Word(sum)

-func sub(a: var BigInt, b: BigInt, ctl: bool): bool =
+  result = bool(carry)
+
+func csub(a: var BigInt, b: BigInt, ctl: bool): bool =
  ## In-place optional substraction
  ##
  ## It is NOT constant-time and is intended
  ## only for compile-time precomputation
  ## of non-secret data.
+  var borrow, diff: BaseType
  for i in 0 ..< a.limbs.len:
-    let new_a = BaseType(a.limbs[i]) - BaseType(b.limbs[i]) - BaseType(result)
-    result = new_a.isMsbSet()
-    a.limbs[i] = if ctl: new_a.Word.mask()
-                 else: a.limbs[i]
+    let ai = BaseType(a.limbs[i])
+    let bi = BaseType(b.limbs[i])
+    subB(borrow, diff, ai, bi, borrow)
+    if ctl:
+      a.limbs[i] = Word(diff)
+
+  result = bool(borrow)

 func doubleMod(a: var BigInt, M: BigInt) =
  ## In-place modular double
@ -62,8 +102,8 @@ func doubleMod(a: var BigInt, M: BigInt) =
  ## only for compile-time precomputation
  ## of non-secret data.
  var ctl = dbl(a)
-  ctl = ctl or not sub(a, M, false)
-  discard sub(a, M, ctl)
+  ctl = ctl or not a.csub(M, false)
+  discard csub(a, M, ctl)

 # ############################################################
 #
@ -75,11 +115,19 @@ func checkOddModulus(M: BigInt) =
  doAssert bool(BaseType(M.limbs[0]) and 1), "Internal Error: the modulus must be odd to use the Montgomery representation."

 func checkValidModulus(M: BigInt) =
-  const expectedMsb = M.bits-1 - WordBitSize * (M.limbs.len - 1)
+  const expectedMsb = M.bits-1 - WordBitWidth * (M.limbs.len - 1)
  let msb = log2(BaseType(M.limbs[^1]))

  doAssert msb == expectedMsb, "Internal Error: the modulus must use all declared bits and only those"

+func useNoCarryMontyMul*(M: BigInt): bool =
+  ## Returns if the modulus is compatible
+  ## with the no-carry Montgomery Multiplication
+  ## from https://hackmd.io/@zkteam/modular_multiplication
+  # Indirection needed because static object are buggy
+  # https://github.com/nim-lang/Nim/issues/9679
+  BaseType(M.limbs[^1]) < high(BaseType) shr 1
+
 func negInvModWord*(M: BigInt): BaseType =
  ## Returns the Montgomery domain magic constant for the input modulus:
  ##
@ -88,9 +136,9 @@ func negInvModWord*(M: BigInt): BaseType =
  ## M[0] is the least significant limb of M
  ## M must be odd and greater than 2.
  ##
-  ## Assuming 63-bit words:
+  ## Assuming 64-bit words:
  ##
-  ## µ ≡ -1/M[0] (mod 2^63)
+  ## µ ≡ -1/M[0] (mod 2^64)

  # We use BaseType for return value because static distinct type
  # confuses Nim semchecks [UPSTREAM BUG]
@ -108,17 +156,17 @@ func negInvModWord*(M: BigInt): BaseType =
  # - http://marc-b-reynolds.github.io/math/2017/09/18/ModInverse.html

  # For Montgomery magic number, we are in a special case
-  # where a = M and m = 2^WordBitsize.
+  # where a = M and m = 2^WordBitWidth.
  # For a and m to be coprimes, a must be odd.

  # We have the following relation
  # ax ≡ 1 (mod 2^k) <=> ax(2 - ax) ≡ 1 (mod 2^(2k))
  #
  # To get  -1/M0 mod LimbSize
-  # we can either negate the resulting x of `ax(2 - ax) ≡ 1 (mod 2^(2k))`
-  # or do ax(2 + ax) ≡ 1 (mod 2^(2k))
+  # we can negate the result x of `ax(2 - ax) ≡ 1 (mod 2^(2k))`
+  # or if k is odd: do ax(2 + ax) ≡ 1 (mod 2^(2k))
  #
-  # To get the the modular inverse of 2^k' with arbitrary k' (like k=63 in our case)
+  # To get the the modular inverse of 2^k' with arbitrary k'
  # we can do modInv(a, 2^64) mod 2^63 as mentionned in Koc paper.

  checkOddModulus(M)
@ -126,21 +174,21 @@ func negInvModWord*(M: BigInt): BaseType =

  let
    M0 = BaseType(M.limbs[0])
-    k = log2(uint32(WordPhysBitSize))
+    k = log2(WordBitWidth.uint32)

  result = M0                 # Start from an inverse of M0 modulo 2, M0 is odd and it's own inverse
  for _ in 0 ..< k:           # at each iteration we get the inverse mod(2^2k)
-    result *= 2 + M0 * result # x' = x(2 + ax) (`+` to avoid negating at the end)
+    result *= 2 - M0 * result # x' = x(2 - ax)

-  # Our actual word size is 2^63 not 2^64
-  result = result and BaseType(MaxWord)
+  # negate to obtain the negative inverse
+  result = not(result) + 1

 func r_powmod(n: static int, M: BigInt): BigInt =
  ## Returns the Montgomery domain magic constant for the input modulus:
  ##
-  ##   R ≡ R (mod M) with R = (2^WordBitSize)^numWords
+  ##   R ≡ R (mod M) with R = (2^WordBitWidth)^numWords
  ##   or
-  ##   R² ≡ R² (mod M) with R = (2^WordBitSize)^numWords
+  ##   R² ≡ R² (mod M) with R = (2^WordBitWidth)^numWords
  ##
  ## Assuming a field modulus of size 256-bit with 63-bit words, we require 5 words
  ##   R² ≡ ((2^63)^5)^2 (mod M) = 2^630 (mod M)
@ -161,22 +209,20 @@ func r_powmod(n: static int, M: BigInt): BigInt =
  checkOddModulus(M)
  checkValidModulus(M)

-  result.setInternalBitLength()
-
  const
    w = M.limbs.len
-    msb = M.bits-1 - WordBitSize * (w - 1)
-    start = (w-1)*WordBitSize + msb
-    stop = n*WordBitSize*w
+    msb = M.bits-1 - WordBitWidth * (w - 1)
+    start = (w-1)*WordBitWidth + msb
+    stop = n*WordBitWidth*w

-  result.limbs[^1] = Word(1 shl msb) # C0 = 2^(wn-1), the power of 2 immediatly less than the modulus
+  result.limbs[^1] = Word(BaseType(1) shl msb) # C0 = 2^(wn-1), the power of 2 immediatly less than the modulus
  for _ in start ..< stop:
    result.doubleMod(M)

 func r2mod*(M: BigInt): BigInt =
  ## Returns the Montgomery domain magic constant for the input modulus:
  ##
-  ##   R² ≡ R² (mod M) with R = (2^WordBitSize)^numWords
+  ##   R² ≡ R² (mod M) with R = (2^WordBitWidth)^numWords
  ##
  ## Assuming a field modulus of size 256-bit with 63-bit words, we require 5 words
  ##   R² ≡ ((2^63)^5)^2 (mod M) = 2^630 (mod M)
@ -195,6 +241,6 @@ func primeMinus2_BE*[bits: static int](
  ## For use to precompute modular inverse exponent.

  var tmp = P
-  discard tmp.sub(BigInt[bits].fromRawUint([byte 2], bigEndian), true)
+  discard tmp.csub(BigInt[bits].fromRawUint([byte 2], bigEndian), true)

  result.exportRawUint(tmp, bigEndian)
--- a/constantine/config/common.nim
+++ b/constantine/config/common.nim
@ -12,9 +12,9 @@
 #
 # ############################################################

-import ../primitives/constant_time
+import ../primitives

-when sizeof(int) == 8:
+when sizeof(int) == 8 and not defined(Constantine32):
  type
    BaseType* = uint64
      ## Physical BigInt for conversion in "normal integers"
@ -29,23 +29,15 @@ type
    ## A logical BigInt word is of size physical MachineWord-1

 const
-  ExcessBits = 1
-  WordPhysBitSize* = sizeof(Word) * 8
-  WordBitSize* = WordPhysBitSize - ExcessBits
+  WordBitWidth* = sizeof(Word) * 8
+    ## Logical word size

  CtTrue* = ctrue(Word)
  CtFalse* = cfalse(Word)

  Zero* = Word(0)
  One* = Word(1)
-  MaxWord* = (not Zero) shr (WordPhysBitSize - WordBitSize)
-    ## This represents 0x7F_FF_FF_FF__FF_FF_FF_FF
-    ## also 0b0111...1111
-    ## This biggest representable number in our limbs.
-    ## i.e. The most significant bit is never set at the end of each function
-
-template mask*(w: Word): Word =
-  w and MaxWord
+  MaxWord* = Word(high(BaseType))

 # ############################################################
 #
--- a/constantine/config/curves.nim
+++ b/constantine/config/curves.nim
@ -11,7 +11,7 @@ import
  macros,
  # Internal
  ./curves_parser, ./common,
-  ../arithmetic/[precomputed, bigints_checked]
+  ../arithmetic/[precomputed, bigints]


 # ############################################################
@ -109,6 +109,17 @@ macro genMontyMagics(T: typed): untyped =
  let E = T.getImpl[2]
  for i in 1 ..< E.len:
    let curve = E[i]
+    # const MyCurve_CanUseNoCarryMontyMul = useNoCarryMontyMul(MyCurve_Modulus)
+    result.add newConstStmt(
+      ident($curve & "_CanUseNoCarryMontyMul"), newCall(
+        bindSym"useNoCarryMontyMul",
+        nnkDotExpr.newTree(
+          bindSym($curve & "_Modulus"),
+          ident"mres"
+        )
+      )
+    )
+
    # const MyCurve_R2modP = r2mod(MyCurve_Modulus)
    result.add newConstStmt(
      ident($curve & "_R2modP"), newCall(
@ -154,6 +165,11 @@ macro genMontyMagics(T: typed): untyped =

 genMontyMagics(Curve)

+macro canUseNoCarryMontyMul*(C: static Curve): untyped =
+  ## Returns true if the Modulus is compatible with a fast
+  ## Montgomery multiplication that avoids many carries
+  result = bindSym($C & "_CanUseNoCarryMontyMul")
+
 macro getR2modP*(C: static Curve): untyped =
  ## Get the Montgomery "R^2 mod P" constant associated to a curve field modulus
  result = bindSym($C & "_R2modP")
@ -192,7 +208,7 @@ macro debugConsts(): untyped =
      echo "Curve ", `curveName`,':'
      echo "  Field Modulus:                 ", `modulus`
      echo "  Montgomery R² (mod P):         ", `r2modp`
-      echo "  Montgomery -1/P[0] (mod 2^", WordBitSize, "): ", `negInvModWord`
+      echo "  Montgomery -1/P[0] (mod 2^", WordBitWidth, "): ", `negInvModWord`
  result.add quote do:
    echo "----------------------------------------------------------------------------"

--- a/constantine/config/curves_parser.nim
+++ b/constantine/config/curves_parser.nim
@ -10,7 +10,7 @@ import
  # Standard library
  macros,
  # Internal
-  ../io/io_bigints, ../arithmetic/bigints_checked
+  ../io/io_bigints, ../arithmetic/bigints

 # Macro to parse declarative curves configuration.

--- a/constantine/io/io_bigints.nim
+++ b/constantine/io/io_bigints.nim
@ -12,7 +12,7 @@

 import
  ../primitives/constant_time,
-  ../arithmetic/bigints_checked,
+  ../arithmetic/bigints,
  ../config/common

 # ############################################################
@ -21,6 +21,10 @@ import
 #
 # ############################################################

+# Note: the parsing/serialization routines were initially developed
+#       with an internal representation that used 31 bits out of a uint32
+#       or 63-bits out of an uint64
+
 # TODO: tag/remove exceptions raised.

 func fromRawUintLE(
@ -49,10 +53,10 @@ func fromRawUintLE(
    acc_len += 8 # We count bit by bit

    # if full, dump
-    if acc_len >= WordBitSize:
-      dst.limbs[dst_idx] = mask(acc)
+    if acc_len >= WordBitWidth:
+      dst.limbs[dst_idx] = acc
      inc dst_idx
-      acc_len -= WordBitSize
+      acc_len -= WordBitWidth
      acc = src_byte shr (8 - acc_len)

  if dst_idx < dst.limbs.len:
@ -86,10 +90,10 @@ func fromRawUintBE(
    acc_len += 8 # We count bit by bit

    # if full, dump
-    if acc_len >= WordBitSize:
-      dst.limbs[dst_idx] = mask(acc)
+    if acc_len >= WordBitWidth:
+      dst.limbs[dst_idx] = acc
      inc dst_idx
-      acc_len -= WordBitSize
+      acc_len -= WordBitWidth
      acc = src_byte shr (8 - acc_len)

  if dst_idx < dst.limbs.len:
@ -113,7 +117,6 @@ func fromRawUint*(
    dst.fromRawUintLE(src)
  else:
    dst.fromRawUintBE(src)
-  dst.setInternalBitLength()

 func fromRawUint*(
        T: type BigInt,
@ -187,14 +190,17 @@ func exportRawUintLE(
    inc src_idx

    if acc_len == 0:
-      # Edge case, we need to refill the buffer to output 64-bit
-      # as we can only read 63-bit per word
+      # We need to refill the buffer to output 64-bit
      acc = w
-      acc_len = WordBitSize
+      acc_len = WordBitWidth
    else:
-      let lo = (w shl acc_len) or acc
-      dec acc_len
-      acc = w shr (WordBitSize - acc_len)
+      when WordBitWidth == sizeof(Word) * 8:
+        let lo = acc
+        acc = w
+      else: # If using 63-bit (or less) out of uint64
+        let lo = (w shl acc_len) or acc
+        dec acc_len
+        acc = w shr (WordBitWidth - acc_len)

      if tail >= sizeof(Word):
        # Unrolled copy
@ -237,14 +243,17 @@ func exportRawUintBE(
    inc src_idx

    if acc_len == 0:
-      # Edge case, we need to refill the buffer to output 64-bit
-      # as we can only read 63-bit per word
+      # We need to refill the buffer to output 64-bit
      acc = w
-      acc_len = WordBitSize
+      acc_len = WordBitWidth
    else:
-      let lo = (w shl acc_len) or acc
-      dec acc_len
-      acc = w shr (WordBitSize - acc_len)
+      when WordBitWidth == sizeof(Word) * 8:
+        let lo = acc
+        acc = w
+      else: # If using 63-bit (or less) out of uint64
+        let lo = (w shl acc_len) or acc
+        dec acc_len
+        acc = w shr (WordBitWidth - acc_len)

      if tail >= sizeof(Word):
        # Unrolled copy
--- a/constantine/io/io_fields.nim
+++ b/constantine/io/io_fields.nim
@ -9,7 +9,7 @@
 import
  ./io_bigints,
  ../config/curves,
-  ../arithmetic/[bigints_checked, finite_fields]
+  ../arithmetic/[bigints, finite_fields]

 # No exceptions allowed
 {.push raises: [].}
--- a/constantine/primitives.nim
+++ b/constantine/primitives.nim
@ -0,0 +1,21 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import
+  primitives/constant_time_types,
+  primitives/constant_time,
+  primitives/multiplexers,
+  primitives/addcarry_subborrow,
+  primitives/extended_precision
+
+export
+  constant_time_types,
+  constant_time,
+  multiplexers,
+  addcarry_subborrow,
+  extended_precision
--- a/constantine/primitives/README.md
+++ b/constantine/primitives/README.md
@ -6,3 +6,86 @@ This folder holds:
  to have the compiler enforce proper usage
 - extended precision multiplication and division primitives
 - assembly primitives
+- intrinsics
+
+## Security
+
+⚠: **Hardware assumptions**
+
+  Constantine assumes that multiplication is implemented
+  constant-time in hardware.
+
+  If this is not the case,
+  you SHOULD **strongly reconsider** your hardware choice or
+  reimplement multiplication with constant-time guarantees
+  (at the cost of speed and code-size)
+
+⚠: Currently division and modulo operations are `unsafe`
+  and uses hardware division.
+  No known CPU implements division in constant-time.
+  A constant-time alternative will be provided.
+
+While extremely slow, division and modulo are only used
+on random or user inputs to constrain them to the prime field
+of the elliptic curves.
+Constantine internals are built to avoid costly constant-time divisions.
+
+## Performance and code size
+
+It is recommended to prefer Clang, MSVC or ICC over GCC if possible.
+GCC code is significantly slower and bigger for multiprecision arithmetic
+even when using dedicated intrinsics.
+
+See https://gcc.godbolt.org/z/2h768y
+```C
+#include <stdint.h>
+#include <x86intrin.h>
+
+void add256(uint64_t a[4], uint64_t b[4]){
+  uint8_t carry = 0;
+  for (int i = 0; i < 4; ++i)
+    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
+}
+```
+
+GCC
+```asm
+add256:
+        movq    (%rsi), %rax
+        addq    (%rdi), %rax
+        setc    %dl
+        movq    %rax, (%rdi)
+        movq    8(%rdi), %rax
+        addb    $-1, %dl
+        adcq    8(%rsi), %rax
+        setc    %dl
+        movq    %rax, 8(%rdi)
+        movq    16(%rdi), %rax
+        addb    $-1, %dl
+        adcq    16(%rsi), %rax
+        setc    %dl
+        movq    %rax, 16(%rdi)
+        movq    24(%rsi), %rax
+        addb    $-1, %dl
+        adcq    %rax, 24(%rdi)
+        ret
+```
+
+Clang
+```asm
+add256:
+        movq    (%rsi), %rax
+        addq    %rax, (%rdi)
+        movq    8(%rsi), %rax
+        adcq    %rax, 8(%rdi)
+        movq    16(%rsi), %rax
+        adcq    %rax, 16(%rdi)
+        movq    24(%rsi), %rax
+        adcq    %rax, 24(%rdi)
+        retq
+```
+
+### Inline assembly
+
+Using inline assembly will sacrifice code readability, portability, auditability and maintainability.
+That said the performance might be worth it.
--- a/constantine/primitives/addcarry_subborrow.nim
+++ b/constantine/primitives/addcarry_subborrow.nim
@ -0,0 +1,168 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import ./constant_time_types
+
+# ############################################################
+#
+#            Add-with-carry and Sub-with-borrow
+#
+# ############################################################
+#
+# This file implements add-with-carry and sub-with-borrow
+#
+# It is currently (Mar 2020) impossible to have the compiler
+# generate optimal code in a generic way.
+#
+# On x86, addcarry_u64 intrinsic will generate optimal code
+# except for GCC.
+#
+# On other CPU architectures inline assembly might be desirable.
+# A compiler proof-of-concept is available in the "research" folder.
+#
+# See https://gcc.godbolt.org/z/2h768y
+# ```C
+# #include <stdint.h>
+# #include <x86intrin.h>
+#
+# void add256(uint64_t a[4], uint64_t b[4]){
+#   uint8_t carry = 0;
+#   for (int i = 0; i < 4; ++i)
+#     carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
+# }
+# ```
+#
+# GCC
+# ```asm
+# add256:
+#         movq    (%rsi), %rax
+#         addq    (%rdi), %rax
+#         setc    %dl
+#         movq    %rax, (%rdi)
+#         movq    8(%rdi), %rax
+#         addb    $-1, %dl
+#         adcq    8(%rsi), %rax
+#         setc    %dl
+#         movq    %rax, 8(%rdi)
+#         movq    16(%rdi), %rax
+#         addb    $-1, %dl
+#         adcq    16(%rsi), %rax
+#         setc    %dl
+#         movq    %rax, 16(%rdi)
+#         movq    24(%rsi), %rax
+#         addb    $-1, %dl
+#         adcq    %rax, 24(%rdi)
+#         ret
+# ```
+#
+# Clang
+# ```asm
+# add256:
+#         movq    (%rsi), %rax
+#         addq    %rax, (%rdi)
+#         movq    8(%rsi), %rax
+#         adcq    %rax, 8(%rdi)
+#         movq    16(%rsi), %rax
+#         adcq    %rax, 16(%rdi)
+#         movq    24(%rsi), %rax
+#         adcq    %rax, 24(%rdi)
+#         retq
+# ```
+
+# ############################################################
+#
+#                     Intrinsics
+#
+# ############################################################
+
+# Note: GCC before 2017 had incorrect codegen in some cases:
+# - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81300
+
+when X86:
+  when defined(windows):
+    {.pragma: intrinsics, header:"<intrin.h>", nodecl.}
+  else:
+    {.pragma: intrinsics, header:"<x86intrin.h>", nodecl.}
+
+  func addcarry_u32(carryIn: Carry, a, b: Ct[uint32], sum: var Ct[uint32]): Carry {.importc: "_addcarry_u32", intrinsics.}
+  func subborrow_u32(borrowIn: Borrow, a, b: Ct[uint32], diff: var Ct[uint32]): Borrow {.importc: "_subborrow_u32", intrinsics.}
+
+  func addcarry_u64(carryIn: Carry, a, b: Ct[uint64], sum: var Ct[uint64]): Carry {.importc: "_addcarry_u64", intrinsics.}
+  func subborrow_u64(borrowIn: Borrow, a, b: Ct[uint64], diff: var Ct[uint64]): Borrow {.importc: "_subborrow_u64", intrinsics.}
+
+# ############################################################
+#
+#                     Public
+#
+# ############################################################
+
+func addC*(cOut: var Carry, sum: var Ct[uint32], a, b: Ct[uint32], cIn: Carry) {.inline.} =
+  ## Addition with carry
+  ## (CarryOut, Sum) <- a + b + CarryIn
+  when X86:
+    cOut = addcarry_u32(cIn, a, b, sum)
+  else:
+    let dblPrec = uint64(cIn) + uint64(a) + uint64(b)
+    sum = (Ct[uint32])(dblPrec)
+    cOut = Carry(dblPrec shr 32)
+
+func subB*(bOut: var Borrow, diff: var Ct[uint32], a, b: Ct[uint32], bIn: Borrow) {.inline.} =
+  ## Substraction with borrow
+  ## (BorrowOut, Diff) <- a - b - borrowIn
+  when X86:
+    bOut = subborrow_u32(bIn, a, b, diff)
+  else:
+    let dblPrec = uint64(a) - uint64(b) - uint64(bIn)
+    diff = (Ct[uint32])(dblPrec)
+    # On borrow the high word will be 0b1111...1111 and needs to be masked
+    bOut = Borrow((dblPrec shr 32) and 1)
+
+func addC*(cOut: var Carry, sum: var Ct[uint64], a, b: Ct[uint64], cIn: Carry) {.inline.} =
+  ## Addition with carry
+  ## (CarryOut, Sum) <- a + b + CarryIn
+  when X86:
+    cOut = addcarry_u64(cIn, a, b, sum)
+  else:
+    block:
+      static:
+        doAssert GCC_Compatible
+        doAssert sizeof(int) == 8
+
+      var dblPrec {.noInit.}: uint128
+      {.emit:[dblPrec, " = (unsigned __int128)", a," + (unsigned __int128)", b, " + (unsigned __int128)",cIn,";"].}
+
+      # Don't forget to dereference the var param in C mode
+      when defined(cpp):
+        {.emit:[cOut, " = (NU64)(", dblPrec," >> ", 64'u64, ");"].}
+        {.emit:[sum, " = (NU64)", dblPrec,";"].}
+      else:
+        {.emit:["*",cOut, " = (NU64)(", dblPrec," >> ", 64'u64, ");"].}
+        {.emit:["*",sum, " = (NU64)", dblPrec,";"].}
+
+func subB*(bOut: var Borrow, diff: var Ct[uint64], a, b: Ct[uint64], bIn: Borrow) {.inline.} =
+  ## Substraction with borrow
+  ## (BorrowOut, Diff) <- a - b - borrowIn
+  when X86:
+    bOut = subborrow_u64(bIn, a, b, diff)
+  else:
+    block:
+      static:
+        doAssert GCC_Compatible
+        doAssert sizeof(int) == 8
+
+      var dblPrec {.noInit.}: uint128
+      {.emit:[dblPrec, " = (unsigned __int128)", a," - (unsigned __int128)", b, " - (unsigned __int128)",bIn,";"].}
+
+      # Don't forget to dereference the var param in C mode
+      # On borrow the high word will be 0b1111...1111 and needs to be masked
+      when defined(cpp):
+        {.emit:[bOut, " = (NU64)(", dblPrec," >> ", 64'u64, ") & 1;"].}
+        {.emit:[diff, " = (NU64)", dblPrec,";"].}
+      else:
+        {.emit:["*",bOut, " = (NU64)(", dblPrec," >> ", 64'u64, ") & 1;"].}
+        {.emit:["*",diff, " = (NU64)", dblPrec,";"].}
--- a/constantine/primitives/constant_time.nim
+++ b/constantine/primitives/constant_time.nim
@ -6,27 +6,7 @@
 #   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

-# ############################################################
-#
-#                Constant-time primitives
-#
-# ############################################################
-type
-  BaseUint* = SomeUnsignedInt or byte
-
-  Ct*[T: BaseUint] = distinct T
-
-  CTBool*[T: Ct] = distinct T # range[T(0)..T(1)]
-    ## To avoid the compiler replacing bitwise boolean operations
-    ## by conditional branches, we don't use booleans.
-    ## We use an int to prevent compiler "optimization" and introduction of branches
-    # Note, we could use "range" but then the codegen
-    # uses machine-sized signed integer types.
-    # signed types and machine-dependent words are undesired
-    # - we don't want compiler optimizing signed "undefined behavior"
-    # - Basic functions like BIgInt add/sub
-    #   return and/or accept CTBool, we don't want them
-    #   to require unnecessarily 8 bytes instead of 4 bytes
+import ./constant_time_types

 # ############################################################
 #
@ -227,79 +207,6 @@ template `<=`*[T: Ct](x, y: T): CTBool[T] =
 template `xor`*[T: Ct](x, y: CTBool[T]): CTBool[T] =
  CTBool[T](noteq(T(x), T(y)))

-func mux*[T](ctl: CTBool[T], x, y: T): T {.inline.}=
-  ## Multiplexer / selector
-  ## Returns x if ctl is true
-  ## else returns y
-  ## So equivalent to ctl? x: y
-  #
-  # TODO verify assembly generated
-  # Alternatives:
-  # - https://cryptocoding.net/index.php/Coding_rules
-  # - https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf
-  when defined(amd64) or defined(i386):
-    when sizeof(T) == 8:
-      var muxed = x
-      asm """
-        testq %[ctl], %[ctl]
-        cmovzq %[y], %[muxed]
-        : [muxed] "+r" (`muxed`)
-        : [ctl] "r" (`ctl`), [y] "r" (`y`)
-        : "cc"
-      """
-      muxed
-    elif sizeof(T) == 4:
-      var muxed = x
-      asm """
-        testl %[ctl], %[ctl]
-        cmovzl %[y], %[muxed]
-        : [muxed] "+r" (`muxed`)
-        : [ctl] "r" (`ctl`), [y] "r" (`y`)
-        : "cc"
-      """
-      muxed
-    else:
-      {.error: "Unsupported word size".}
-  else:
-    let # Templates duplicate input params code
-      x_Mux = x
-      y_Mux = y
-    y_Mux xor (-T(ctl) and (x_Mux xor y_Mux))
-
-func mux*[T: CTBool](ctl: CTBool, x, y: T): T {.inline.}=
-  ## Multiplexer / selector
-  ## Returns x if ctl is true
-  ## else returns y
-  ## So equivalent to ctl? x: y
-  when defined(amd64) or defined(i386):
-    when sizeof(T) == 8:
-      var muxed = x
-      asm """
-        testq %[ctl], %[ctl]
-        cmovzq %[y], %[muxed]
-        : [muxed] "+r" (`muxed`)
-        : [ctl] "r" (`ctl`), [y] "r" (`y`)
-        : "cc"
-      """
-      muxed
-    elif sizeof(T) == 4:
-      var muxed = x
-      asm """
-        testl %[ctl], %[ctl]
-        cmovzl %[y], %[muxed]
-        : [muxed] "+r" (`muxed`)
-        : [ctl] "r" (`ctl`), [y] "r" (`y`)
-        : "cc"
-      """
-      muxed
-    else:
-      {.error: "Unsupported word size".}
-  else:
-    let # Templates duplicate input params code
-      x_Mux = x
-      y_Mux = y
-    T(T.T(y_Mux) xor (-T.T(ctl) and T.T(x_Mux xor y_Mux)))
-
 # ############################################################
 #
 #         Workaround system.nim `!=` template
--- a/constantine/primitives/constant_time_types.nim
+++ b/constantine/primitives/constant_time_types.nim
@ -0,0 +1,41 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+type
+  BaseUint* = SomeUnsignedInt or byte
+
+  Ct*[T: BaseUint] = distinct T
+
+  CTBool*[T: Ct] = distinct T # range[T(0)..T(1)]
+    ## To avoid the compiler replacing bitwise boolean operations
+    ## by conditional branches, we don't use booleans.
+    ## We use an int to prevent compiler "optimization" and introduction of branches
+    # Note, we could use "range" but then the codegen
+    # uses machine-sized signed integer types.
+    # signed types and machine-dependent words are undesired
+    # - we don't want compiler optimizing signed "undefined behavior"
+    # - Basic functions like BigInt add/sub
+    #   return and/or accept CTBool, we don't want them
+    #   to require unnecessarily 8 bytes instead of 4 bytes
+    #
+    # Also Nim adds tests everywhere a range type is used which is great
+    # except in a crypto library:
+    # - We don't want exceptions
+    # - Nim will be helpful and return the offending value, which might be secret data
+    # - This will hint the underlying C compiler about the value range
+    #   and seeing 0/1 it might want to use branches again.
+
+  Carry* = Ct[uint8]  # distinct range[0'u8 .. 1]
+  Borrow* = Ct[uint8] # distinct range[0'u8 .. 1]
+
+const GCC_Compatible* = defined(gcc) or defined(clang) or defined(llvm_gcc)
+const X86* = defined(amd64) or defined(i386)
+
+when sizeof(int) == 8 and GCC_Compatible:
+  type
+    uint128*{.importc: "unsigned __int128".} = object
--- a/constantine/primitives/extended_precision.nim
+++ b/constantine/primitives/extended_precision.nim
@ -8,11 +8,11 @@

 # ############################################################
 #
-#  Unsafe constant-time primitives with specific restrictions
+#               Extended precision primitives
 #
 # ############################################################

-import ./constant_time
+import ./constant_time_types

 # ############################################################
 #
@ -37,37 +37,30 @@ func unsafeDiv2n1n*(q, r: var Ct[uint32], n_hi, n_lo, d: Ct[uint32]) {.inline.}=
  q = (Ct[uint32])(dividend div divisor)
  r = (Ct[uint32])(dividend mod divisor)

-func unsafeFMA*(hi, lo: var Ct[uint32], a, b, c: Ct[uint32]) {.inline.} =
+func muladd1*(hi, lo: var Ct[uint32], a, b, c: Ct[uint32]) {.inline.} =
  ## Extended precision multiplication + addition
-  ## This is constant-time on most hardware except some specific one like Cortex M0
  ## (hi, lo) <- a*b + c
-  block:
-    # Note: since a and b use 31-bit,
-    # the result is 62-bit and carrying cannot overflow
-    let dblPrec = uint64(a) * uint64(b) + uint64(c)
-    hi = Ct[uint32](dblPrec shr 31)
-    lo = Ct[uint32](dblPrec) and Ct[uint32](1'u32 shl 31 - 1)
+  ##
+  ## Note: 0xFFFFFFFF² -> (hi: 0xFFFFFFFE, lo: 0x00000001)
+  ##       so adding any c cannot overflow
+  ##
+  ## This is constant-time on most hardware
+  ## See: https://www.bearssl.org/ctmul.html
+  let dblPrec = uint64(a) * uint64(b) + uint64(c)
+  lo = (Ct[uint32])(dblPrec)
+  hi = (Ct[uint32])(dblPrec shr 32)

-func unsafeFMA2*(hi, lo: var Ct[uint32], a1, b1, a2, b2, c1, c2: Ct[uint32]) {.inline.}=
-  ## (hi, lo) <- a1 * b1 + a2 * b2 + c1 + c2
-  block:
-    # TODO: Can this overflow?
-    let dblPrec = uint64(a1) * uint64(b1) +
-                  uint64(a2) * uint64(b2) +
-                  uint64(c1) +
-                  uint64(c2)
-    hi = Ct[uint32](dblPrec shr 31)
-    lo = Ct[uint32](dblPrec) and Ct[uint32](1'u32 shl 31 - 1)
-
-func unsafeFMA2_hi*(hi: var Ct[uint32], a1, b1, a2, b2, c1: Ct[uint32]) {.inline.}=
-  ## Returns the high word of the sum of extended precision multiply-adds
-  ## (hi, _) <- a1 * b1 + a2 * b2 + c
-  block:
-    # TODO: Can this overflow?
-    let dblPrec = uint64(a1) * uint64(b1) +
-                  uint64(a2) * uint64(b2) +
-                  uint64(c1)
-    hi = Ct[uint32](dblPrec shr 31)
+func muladd2*(hi, lo: var Ct[uint32], a, b, c1, c2: Ct[uint32]) {.inline.}=
+  ## Extended precision multiplication + addition + addition
+  ## This is constant-time on most hardware except some specific one like Cortex M0
+  ## (hi, lo) <- a*b + c1 + c2
+  ##
+  ## Note: 0xFFFFFFFF² -> (hi: 0xFFFFFFFE, lo: 0x00000001)
+  ##       so adding 0xFFFFFFFF leads to (hi: 0xFFFFFFFF, lo: 0x00000000)
+  ##       and we have enough space to add again 0xFFFFFFFF without overflowing
+  let dblPrec = uint64(a) * uint64(b) + uint64(c1) + uint64(c2)
+  lo = (Ct[uint32])(dblPrec)
+  hi = (Ct[uint32])(dblPrec shr 32)

 # ############################################################
 #
@ -76,191 +69,14 @@ func unsafeFMA2_hi*(hi: var Ct[uint32], a1, b1, a2, b2, c1: Ct[uint32]) {.inline
 # ############################################################

 when sizeof(int) == 8:
-  const GccCompatible = defined(gcc) or defined(clang) or defined(llvm_gcc)
+  when defined(vcc):
+    from ./extended_precision_x86_64_msvc import unsafeDiv2n1n, muladd1, muladd2
+  elif GCCCompatible:
+    # TODO: constant-time div2n1n
+    when X86:
+      from ./extended_precision_x86_64_gcc import unsafeDiv2n1n
+      from ./extended_precision_64bit_uint128 import muladd1, muladd2
+    else:
+      from ./extended_precision_64bit_uint128 import unsafeDiv2n1n, muladd1, muladd2

-  when GccCompatible:
-    type
-      uint128*{.importc: "unsigned __int128".} = object
-
-    func unsafeDiv2n1n*(q, r: var Ct[uint64], n_hi, n_lo, d: Ct[uint64]) {.inline.}=
-      ## Division uint128 by uint64
-      ## Warning ⚠️ :
-      ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE
-      ##   - if n_hi > d result is undefined
-      {.warning: "unsafeDiv2n1n is not constant-time at the moment on most hardware".}
-
-      # TODO !!! - Replace by constant-time, portable, non-assembly version
-      #          -> use uint128? Compiler might add unwanted branches
-
-      # DIV r/m64
-      # Divide RDX:RAX (n_hi:n_lo) by r/m64
-      #
-      # Inputs
-      #   - numerator high word in RDX,
-      #   - numerator low word in RAX,
-      #   - divisor as r/m parameter (register or memory at the compiler discretion)
-      # Result
-      #   - Quotient in RAX
-      #   - Remainder in RDX
-
-      # 1. name the register/memory "divisor"
-      # 2. don't forget to dereference the var hidden pointer
-      # 3. -
-      # 4. no clobbered registers beside explectly used RAX and RDX
-      when defined(amd64):
-        when defined(cpp):
-          asm """
-            divq %[divisor]
-            : "=a" (`q`), "=d" (`r`)
-            : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`)
-            :
-          """
-        else:
-          asm """
-            divq %[divisor]
-            : "=a" (`*q`), "=d" (`*r`)
-            : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`)
-            :
-          """
-      else:
-        var dblPrec {.noInit.}: uint128
-        {.emit:[dblPrec, " = (unsigned __int128)", n_hi," << 64 | (unsigned __int128)",n_lo,";"].}
-
-        # Don't forget to dereference the var param in C mode
-        when defined(cpp):
-          {.emit:[q, " = (NU64)(", dblPrec," / ", d, ");"].}
-          {.emit:[r, " = (NU64)(", dblPrec," % ", d, ");"].}
-        else:
-          {.emit:["*",q, " = (NU64)(", dblPrec," / ", d, ");"].}
-          {.emit:["*",r, " = (NU64)(", dblPrec," % ", d, ");"].}
-
-    func unsafeFMA*(hi, lo: var Ct[uint64], a, b, c: Ct[uint64]) {.inline.}=
-      ## Extended precision multiplication + addition
-      ## This is constant-time on most hardware except some specific one like Cortex M0
-      ## (hi, lo) <- a*b + c
-      block:
-        # Note: since a and b use 63-bit,
-        # the result is 126-bit and carrying cannot overflow
-        var dblPrec {.noInit.}: uint128
-        {.emit:[dblPrec, " = (unsigned __int128)", a," * (unsigned __int128)", b, " + (unsigned __int128)",c,";"].}
-
-        # Don't forget to dereference the var param in C mode
-        when defined(cpp):
-          {.emit:[hi, " = (NU64)(", dblPrec," >> ", 63'u64, ");"].}
-          {.emit:[lo, " = (NU64)", dblPrec," & ", 1'u64 shl 63 - 1, ";"].}
-        else:
-          {.emit:["*",hi, " = (NU64)(", dblPrec," >> ", 63'u64, ");"].}
-          {.emit:["*",lo, " = (NU64)", dblPrec," & ", 1'u64 shl 63 - 1, ";"].}
-
-    func unsafeFMA2*(hi, lo: var Ct[uint64], a1, b1, a2, b2, c1, c2: Ct[uint64]) {.inline.}=
-      ## (hi, lo) <- a1 * b1 + a2 * b2 + c1 + c2
-      block:
-        # TODO: Can this overflow?
-        var dblPrec: uint128
-        {.emit:[dblPrec, " = (unsigned __int128)", a1," * (unsigned __int128)", b1,
-                        " + (unsigned __int128)", a2," * (unsigned __int128)", b2,
-                        " + (unsigned __int128)", c1,
-                        " + (unsigned __int128)", c2, ";"].}
-         # Don't forget to dereference the var param in C mode
-        when defined(cpp):
-          {.emit:[hi, " = (NU64)(", dblPrec," >> ", 63'u64, ");"].}
-          {.emit:[lo, " = (NU64)", dblPrec," & ", (1'u64 shl 63 - 1), ";"].}
-        else:
-          {.emit:["*",hi, " = (NU64)(", dblPrec," >> ", 63'u64, ");"].}
-          {.emit:["*",lo, " = (NU64)", dblPrec," & ", (1'u64 shl 63 - 1), ";"].}
-
-    func unsafeFMA2_hi*(hi: var Ct[uint64], a1, b1, a2, b2, c: Ct[uint64]) {.inline.}=
-      ## Returns the high word of the sum of extended precision multiply-adds
-      ## (hi, _) <- a1 * b1 + a2 * b2 + c
-      block:
-        var dblPrec: uint128
-        {.emit:[dblPrec, " = (unsigned __int128)", a1," * (unsigned __int128)", b1,
-                        " + (unsigned __int128)", a2," * (unsigned __int128)", b2,
-                        " + (unsigned __int128)", c, ";"].}
-         # Don't forget to dereference the var param in C mode
-        when defined(cpp):
-          {.emit:[hi, " = (NU64)(", dblPrec," >> ", 63'u64, ");"].}
-        else:
-          {.emit:["*",hi, " = (NU64)(", dblPrec," >> ", 63'u64, ");"].}
-
-  elif defined(vcc):
-    func udiv128(highDividend, lowDividend, divisor: uint64, remainder: var uint64): uint64 {.importc:"_udiv128", header: "<immintrin.h>", nodecl.}
-      ## Division 128 by 64, Microsoft only, 64-bit only,
-      ## returns quotient as return value remainder as var parameter
-      ## Warning ⚠️ :
-      ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE
-      ##   - if n_hi > d result is undefined
-
-    func unsafeDiv2n1n*(q, r: var Ct[uint64], n_hi, n_lo, d: Ct[uint64]) {.inline.}=
-        ## Division uint128 by uint64
-        ## Warning ⚠️ :
-        ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE
-        ##   - if n_hi > d result is undefined
-        {.warning: "unsafeDiv2n1n is not constant-time at the moment on most hardware".}
-
-        # TODO !!! - Replace by constant-time, portable, non-assembly version
-        #          -> use uint128? Compiler might add unwanted branches
-        q = udiv128(n_hi, n_lo, d, r)
-
-    func addcarry_u64(carryIn: cuchar, a, b: uint64, sum: var uint64): cuchar {.importc:"_addcarry_u64", header:"<intrin.h>", nodecl.}
-      ## (CarryOut, Sum) <-- a + b
-      ## Available on MSVC and ICC (Clang and GCC have very bad codegen, use uint128 instead)
-      ## Return value is the carry-out
-
-    func umul128(a, b: uint64, hi: var uint64): uint64 {.importc:"_umul128", header:"<intrin.h>", nodecl.}
-      ## (hi, lo) <-- a * b
-      ## Return value is the low word
-
-    func unsafeFMA*(hi, lo: var Ct[uint64], a, b, c: Ct[uint64]) {.inline.}=
-      ## Extended precision multiplication + addition
-      ## This is constant-time on most hardware except some specific one like Cortex M0
-      ## (hi, lo) <- a*b + c
-      var carry: cuchar
-      var hi, lo: uint64
-      lo = umul128(uint64(a), uint64(b), hi)
-      carry = addcarry_u64(cuchar(0), lo, uint64(c), lo)
-      discard addcarry_u64(carry, hi, 0, hi)
-
-    func unsafeFMA2*(hi, lo: var Ct[uint64], a1, b1, a2, b2, c1, c2: Ct[uint64]) {.inline.}=
-      ## (hi, lo) <- a1 * b1 + a2 * b2 + c1 + c2
-      var f1_lo, f1_hi, f2_lo, f2_hi: uint64
-      var carry: cuchar
-
-      f1_lo = umul128(uint64(a1), uint64(b1), f1_hi)
-      f2_lo = umul128(uint64(a2), uint64(b2), f2_hi)
-
-      # On CPU with ADX: we can use addcarryx_u64 (adcx/adox) to have
-      # separate carry chains that can be processed in parallel by CPU
-
-      # Carry chain 1
-      carry = addcarry_u64(cuchar(0), f1_lo, uint64(c1), f1_lo)
-      discard addcarry_u64(carry, f1_hi, 0, f1_hi)
-
-      # Carry chain 2
-      carry = addcarry_u64(cuchar(0), f2_lo, uint64(c2), f2_lo)
-      discard addcarry_u64(carry, f2_hi, 0, f2_hi)
-
-      # Merge
-      carry = addcarry_u64(cuchar(0), f1_lo, f2_lo, lo)
-      discard addcarry_u64(carry, f1_hi, f2_hi, hi)
-
-    func unsafeFMA2_hi*(hi: var Ct[uint64], a1, b1, a2, b2, c: Ct[uint64]) {.inline.}=
-      ## Returns the high word of the sum of extended precision multiply-adds
-      ## (hi, _) <- a1 * b1 + a2 * b2 + c
-
-      var f1_lo, f1_hi, f2_lo, f2_hi: uint64
-      var carry: cuchar
-
-      f1_lo = umul128(uint64(a1), uint64(b1), f1_hi)
-      f2_lo = umul128(uint64(a2), uint64(b2), f2_hi)
-
-      carry = addcarry_u64(cuchar(0), f1_lo, uint64(c), f1_lo)
-      discard addcarry_u64(carry, f1_hi, 0, f1_hi)
-
-      # Merge
-      var lo: uint64
-      carry = addcarry_u64(cuchar(0), f1_lo, f2_lo, lo)
-      discard addcarry_u64(carry, f1_hi, f2_hi, hi)
-
-  else:
-    {.error: "Compiler not implemented".}
+  export unsafeDiv2n1n, muladd1, muladd2
--- a/constantine/primitives/extended_precision_64bit_uint128.nim
+++ b/constantine/primitives/extended_precision_64bit_uint128.nim
@ -0,0 +1,81 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import ./constant_time_types
+
+# ############################################################
+#
+# Extended precision primitives on GCC & Clang (all CPU archs)
+#
+# ############################################################
+
+static:
+  doAssert GCC_Compatible
+  doAssert sizeof(int) == 8
+
+func unsafeDiv2n1n*(q, r: var Ct[uint64], n_hi, n_lo, d: Ct[uint64]) {.inline.}=
+  ## Division uint128 by uint64
+  ## Warning ⚠️ :
+  ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE on some platforms
+  ##   - if n_hi > d result is undefined
+  {.warning: "unsafeDiv2n1n is not constant-time at the moment on most hardware".}
+
+  var dblPrec {.noInit.}: uint128
+  {.emit:[dblPrec, " = (unsigned __int128)", n_hi," << 64 | (unsigned __int128)",n_lo,";"].}
+
+  # Don't forget to dereference the var param in C mode
+  when defined(cpp):
+    {.emit:[q, " = (NU64)(", dblPrec," / ", d, ");"].}
+    {.emit:[r, " = (NU64)(", dblPrec," % ", d, ");"].}
+  else:
+    {.emit:["*",q, " = (NU64)(", dblPrec," / ", d, ");"].}
+    {.emit:["*",r, " = (NU64)(", dblPrec," % ", d, ");"].}
+
+func muladd1*(hi, lo: var Ct[uint64], a, b, c: Ct[uint64]) {.inline.} =
+  ## Extended precision multiplication + addition
+  ## (hi, lo) <- a*b + c
+  ##
+  ## Note: 0xFFFFFFFF_FFFFFFFF² -> (hi: 0xFFFFFFFFFFFFFFFE, lo: 0x0000000000000001)
+  ##       so adding any c cannot overflow
+  ##
+  ## This is constant-time on most hardware
+  ## See: https://www.bearssl.org/ctmul.html
+  block:
+    var dblPrec {.noInit.}: uint128
+    {.emit:[dblPrec, " = (unsigned __int128)", a," * (unsigned __int128)", b, " + (unsigned __int128)",c,";"].}
+
+    # Don't forget to dereference the var param in C mode
+    when defined(cpp):
+      {.emit:[hi, " = (NU64)(", dblPrec," >> ", 64'u64, ");"].}
+      {.emit:[lo, " = (NU64)", dblPrec,";"].}
+    else:
+      {.emit:["*",hi, " = (NU64)(", dblPrec," >> ", 64'u64, ");"].}
+      {.emit:["*",lo, " = (NU64)", dblPrec,";"].}
+
+func muladd2*(hi, lo: var Ct[uint64], a, b, c1, c2: Ct[uint64]) {.inline.}=
+  ## Extended precision multiplication + addition + addition
+  ## This is constant-time on most hardware except some specific one like Cortex M0
+  ## (hi, lo) <- a*b + c1 + c2
+  ##
+  ## Note: 0xFFFFFFFF_FFFFFFFF² -> (hi: 0xFFFFFFFFFFFFFFFE, lo: 0x0000000000000001)
+  ##       so adding 0xFFFFFFFFFFFFFFFF leads to (hi: 0xFFFFFFFFFFFFFFFF, lo: 0x0000000000000000)
+  ##       and we have enough space to add again 0xFFFFFFFFFFFFFFFF without overflowing
+  block:
+    var dblPrec {.noInit.}: uint128
+    {.emit:[
+      dblPrec, " = (unsigned __int128)", a," * (unsigned __int128)", b,
+               " + (unsigned __int128)",c1," + (unsigned __int128)",c2,";"
+    ].}
+
+    # Don't forget to dereference the var param in C mode
+    when defined(cpp):
+      {.emit:[hi, " = (NU64)(", dblPrec," >> ", 64'u64, ");"].}
+      {.emit:[lo, " = (NU64)", dblPrec,";"].}
+    else:
+      {.emit:["*",hi, " = (NU64)(", dblPrec," >> ", 64'u64, ");"].}
+      {.emit:["*",lo, " = (NU64)", dblPrec,";"].}
--- a/constantine/primitives/extended_precision_x86_64_gcc.nim
+++ b/constantine/primitives/extended_precision_x86_64_gcc.nim
@ -0,0 +1,60 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import ./constant_time_types
+
+# ############################################################
+#
+#   Extended precision primitives for X86-64 on GCC & Clang
+#
+# ############################################################
+
+static:
+  doAssert(defined(gcc) or defined(clang) or defined(llvm_gcc))
+  doAssert sizeof(int) == 8
+  doAssert X86
+
+func unsafeDiv2n1n*(q, r: var Ct[uint64], n_hi, n_lo, d: Ct[uint64]) {.inline.}=
+  ## Division uint128 by uint64
+  ## Warning ⚠️ :
+  ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE
+  ##   - if n_hi > d result is undefined
+  {.warning: "unsafeDiv2n1n is not constant-time at the moment on most hardware".}
+
+  # TODO !!! - Replace by constant-time, portable, non-assembly version
+  #          -> use uint128? Compiler might add unwanted branches
+
+  # DIV r/m64
+  # Divide RDX:RAX (n_hi:n_lo) by r/m64
+  #
+  # Inputs
+  #   - numerator high word in RDX,
+  #   - numerator low word in RAX,
+  #   - divisor as r/m parameter (register or memory at the compiler discretion)
+  # Result
+  #   - Quotient in RAX
+  #   - Remainder in RDX
+
+  # 1. name the register/memory "divisor"
+  # 2. don't forget to dereference the var hidden pointer
+  # 3. -
+  # 4. no clobbered registers beside explicitly used RAX and RDX
+  when defined(cpp):
+    asm """
+      divq %[divisor]
+      : "=a" (`q`), "=d" (`r`)
+      : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`)
+      :
+    """
+  else:
+    asm """
+      divq %[divisor]
+      : "=a" (`*q`), "=d" (`*r`)
+      : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`)
+      :
+    """
--- a/constantine/primitives/extended_precision_x86_64_msvc.nim
+++ b/constantine/primitives/extended_precision_x86_64_msvc.nim
@ -0,0 +1,78 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import
+  ./constant_time_types,
+  ./addcarry_subborrow
+
+# ############################################################
+#
+#      Extended precision primitives for X86-64 on MSVC
+#
+# ############################################################
+
+static:
+  doAssert defined(vcc)
+  doAssert sizeof(int) == 8
+  doAssert X86
+
+func udiv128(highDividend, lowDividend, divisor: Ct[uint64], remainder: var Ct[uint64]): Ct[uint64] {.importc:"_udiv128", header: "<intrin.h>", nodecl.}
+  ## Division 128 by 64, Microsoft only, 64-bit only,
+  ## returns quotient as return value remainder as var parameter
+  ## Warning ⚠️ :
+  ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE
+  ##   - if n_hi > d result is undefined
+
+func umul128(a, b: Ct[uint64], hi: var Ct[uint64]): Ct[uint64] {.importc:"_umul128", header:"<intrin.h>", nodecl.}
+  ## (hi, lo) <-- a * b
+  ## Return value is the low word
+
+func unsafeDiv2n1n*(q, r: var Ct[uint64], n_hi, n_lo, d: Ct[uint64]) {.inline.}=
+    ## Division uint128 by uint64
+    ## Warning ⚠️ :
+    ##   - if n_hi == d, quotient does not fit in an uint64 and will throw SIGFPE
+    ##   - if n_hi > d result is undefined
+    {.warning: "unsafeDiv2n1n is not constant-time at the moment on most hardware".}
+
+    # TODO !!! - Replace by constant-time, portable, non-assembly version
+    #          -> use uint128? Compiler might add unwanted branches
+    q = udiv128(n_hi, n_lo, d, r)
+
+func muladd1*(hi, lo: var Ct[uint64], a, b, c: Ct[uint64]) {.inline.} =
+  ## Extended precision multiplication + addition
+  ## (hi, lo) <- a*b + c
+  ##
+  ## Note: 0xFFFFFFFF_FFFFFFFF² -> (hi: 0xFFFFFFFFFFFFFFFE, lo: 0x0000000000000001)
+  ##       so adding any c cannot overflow
+  ##
+  ## This is constant-time on most hardware
+  ## See: https://www.bearssl.org/ctmul.html
+  var carry: Carry
+  lo = umul128(a, b, hi)
+  addC(carry, lo, lo, c, Carry(0))
+  addC(carry, hi, hi, 0, carry)
+
+func muladd2*(hi, lo: var Ct[uint64], a, b, c1, c2: Ct[uint64]) {.inline.}=
+  ## Extended precision multiplication + addition + addition
+  ## This is constant-time on most hardware except some specific one like Cortex M0
+  ## (hi, lo) <- a*b + c1 + c2
+  ##
+  ## Note: 0xFFFFFFFF_FFFFFFFF² -> (hi: 0xFFFFFFFFFFFFFFFE, lo: 0x0000000000000001)
+  ##       so adding 0xFFFFFFFFFFFFFFFF leads to (hi: 0xFFFFFFFFFFFFFFFF, lo: 0x0000000000000000)
+  ##       and we have enough space to add again 0xFFFFFFFFFFFFFFFF without overflowing
+  # For speed this could be implemented with parallel pipelined carry chains
+  # via MULX + ADCX + ADOX
+  var carry1, carry2: Carry
+
+  lo = umul128(a, b, hi)
+  # Carry chain 1
+  addC(carry1, lo, lo, c1, Carry(0))
+  addC(carry1, hi, hi, 0, carry1)
+  # Carry chain 2
+  addC(carry2, lo, lo, c2, Carry(0))
+  addC(carry2, hi, hi, 0, carry2)
--- a/constantine/primitives/multiplexers.nim
+++ b/constantine/primitives/multiplexers.nim
@ -0,0 +1,161 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+import ./constant_time_types
+
+# ############################################################
+#
+#   Constant-time multiplexers/selectors/conditional moves
+#
+# ############################################################
+
+# For efficiency, those are implemented in inline assembly if possible
+# API:
+# - mux(CTBool, Word, Word)
+# - mux(CTBool, CTBool, CTBool)
+# - ccopy(CTBool, var Word, Word)
+#
+# Those prevents the compiler from introducing branches and leaking secret data:
+# - https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf
+# - https://github.com/veorq/cryptocoding
+
+# Generic implementation
+# ------------------------------------------------------------
+
+func mux_fallback[T](ctl: CTBool[T], x, y: T): T {.inline.}=
+  ## result = if ctl: x else: y
+  ## This is a constant-time operation
+  y xor (-T(ctl) and (x xor y))
+
+func mux_fallback[T: CTBool](ctl: CTBool, x, y: T): T {.inline.}=
+  ## result = if ctl: x else: y
+  ## This is a constant-time operation
+  T(T.T(y) xor (-T.T(ctl) and T.T(x xor y)))
+
+func ccopy_fallback[T](ctl: CTBool[T], x: var T, y: T) {.inline.}=
+  ## Conditional copy
+  ## Copy ``y`` into ``x`` if ``ctl`` is true
+  x = ctl.mux_fallback(y, x)
+
+# x86 and x86-64
+# ------------------------------------------------------------
+
+template mux_x86_impl() {.dirty.} =
+  static: doAssert(X86)
+  static: doAssert(GCC_Compatible)
+
+  when sizeof(T) == 8:
+    var muxed = x
+    asm """
+      testq %[ctl], %[ctl]
+      cmovzq %[y], %[muxed]
+      : [muxed] "+r" (`muxed`)
+      : [ctl] "r" (`ctl`), [y] "r" (`y`)
+      : "cc"
+    """
+    muxed
+  elif sizeof(T) == 4:
+    var muxed = x
+    asm """
+      testl %[ctl], %[ctl]
+      cmovzl %[y], %[muxed]
+      : [muxed] "+r" (`muxed`)
+      : [ctl] "r" (`ctl`), [y] "r" (`y`)
+      : "cc"
+    """
+    muxed
+  else:
+    {.error: "Unsupported word size".}
+
+func mux_x86[T](ctl: CTBool[T], x, y: T): T {.inline.}=
+  ## Multiplexer / selector
+  ## Returns x if ctl is true
+  ## else returns y
+  ## So equivalent to ctl? x: y
+  mux_x86_impl()
+
+func mux_x86[T: CTBool](ctl: CTBool, x, y: T): T {.inline.}=
+  ## Multiplexer / selector
+  ## Returns x if ctl is true
+  ## else returns y
+  ## So equivalent to ctl? x: y
+  mux_x86_impl()
+
+func ccopy_x86[T](ctl: CTBool[T], x: var T, y: T) {.inline.}=
+  ## Conditional copy
+  ## Copy ``y`` into ``x`` if ``ctl`` is true
+  static: doAssert(X86)
+  static: doAssert(GCC_Compatible)
+
+  when sizeof(T) == 8:
+    when defined(cpp):
+      asm """
+        testq %[ctl], %[ctl]
+        cmovnzq %[y], %[x]
+        : [x] "+r" (`x`)
+        : [ctl] "r" (`ctl`), [y] "r" (`y`)
+        : "cc"
+      """
+    else:
+      asm """
+        testq %[ctl], %[ctl]
+        cmovnzq %[y], %[x]
+        : [x] "+r" (`*x`)
+        : [ctl] "r" (`ctl`), [y] "r" (`y`)
+        : "cc"
+      """
+  elif sizeof(T) == 4:
+    when defined(cpp):
+      asm """
+        testl %[ctl], %[ctl]
+        cmovnzl %[y], %[x]
+        : [x] "+r" (`x`)
+        : [ctl] "r" (`ctl`), [y] "r" (`y`)
+        : "cc"
+      """
+    else:
+      asm """
+        testl %[ctl], %[ctl]
+        cmovnzl %[y], %[x]
+        : [x] "+r" (`*x`)
+        : [ctl] "r" (`ctl`), [y] "r" (`y`)
+        : "cc"
+      """
+  else:
+    {.error: "Unsupported word size".}
+
+# Public functions
+# ------------------------------------------------------------
+
+func mux*[T](ctl: CTBool[T], x, y: T): T {.inline.}=
+  ## Multiplexer / selector
+  ## Returns x if ctl is true
+  ## else returns y
+  ## So equivalent to ctl? x: y
+  when X86 and GCC_Compatible:
+    mux_x86(ctl, x, y)
+  else:
+    mux_fallback(ctl, x, y)
+
+func mux*[T: CTBool](ctl: CTBool, x, y: T): T {.inline.}=
+  ## Multiplexer / selector
+  ## Returns x if ctl is true
+  ## else returns y
+  ## So equivalent to ctl? x: y
+  when X86 and GCC_Compatible:
+    mux_x86(ctl, x, y)
+  else:
+    mux_fallback(ctl, x, y)
+
+func ccopy*[T](ctl: CTBool[T], x: var T, y: T) {.inline.}=
+  ## Conditional copy
+  ## Copy ``y`` into ``x`` if ``ctl`` is true
+  when X86 and GCC_Compatible:
+    ccopy_x86(ctl, x, y)
+  else:
+    ccopy_fallback(ctl, x, y)
--- a/constantine/primitives/research/README.md
+++ b/constantine/primitives/research/README.md
@ -0,0 +1,45 @@
+# Compiler for generic inline assembly code-generation
+
+This folder holds alternative implementations of primitives
+that uses inline assembly.
+
+This avoids the pitfalls of traditional compiler bad code generation
+for multiprecision arithmetic (see GCC https://gcc.godbolt.org/z/2h768y)
+or unsupported features like handling 2 carry chains for
+multiplication using MULX/ADOX/ADCX.
+
+To be generic over multiple curves,
+for example BN254 requires 4 words and BLS12-381 requires 6 words of size 64 bits,
+the compilers is implemented as a set of macros that generate inline assembly.
+
+⚠⚠⚠ Warning! Warning! Warning!
+
+This is a significant sacrifice of code readability, portability, auditability and maintainability in favor of performance.
+
+This combines 2 of the most notorious ways to obfuscate your code:
+* metaprogramming and macros
+* inline assembly
+
+Adventurers beware: not for the faint of heart.
+
+This is unfinished, untested, unused, unfuzzed and just a proof-of-concept at the moment.*
+
+_* I take no responsibility if this smashes your stack, eats your cat, hides a skeleton in your closet, warps a pink elephant in the room, summons untold eldritch horrors or causes the heat death of the universe. You have been warned._
+
+_The road to debugging hell is paved with metaprogrammed assembly optimizations._
+
+_For my defence, OpenSSL assembly is generated by a Perl script and neither Perl nor the generated Assembly are type-checked by a dependently-typed compiler._
+
+## References
+
+Multiprecision (Montgomery) Multiplication & Squaring in Assembly
+
+- Intel MULX/ADCX/ADOX Table 2 p13: https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
+- Squaring: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf
+- https://eprint.iacr.org/eprint-bin/getfile.pl?entry=2017/558&version=20170608:200345&file=558.pdf
+- https://github.com/intel/ipp-crypto
+- https://github.com/herumi/mcl
+
+Experimentations in Nim
+
+- https://github.com/mratsim/finite-fields
--- a/constantine/primitives/research/addcarry_subborrow_compiler.nim
+++ b/constantine/primitives/research/addcarry_subborrow_compiler.nim
@ -0,0 +1,133 @@
+# Constantine
+# Copyright (c) 2018-2019    Status Research & Development GmbH
+# Copyright (c) 2020-Present Mamy André-Ratsimbazafy
+# Licensed and distributed under either of
+#   * MIT license (license terms in the root directory or at http://opensource.org/licenses/MIT).
+#   * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
+# at your option. This file may not be copied, modified, or distributed except according to those terms.
+
+# ############################################################
+#
+#            Add-with-carry and Sub-with-borrow
+#
+# ############################################################
+#
+# This is a proof-of-concept optimal add-with-carry
+# compiler implemented as Nim macros.
+#
+# This overcome the bad GCC codegen aven with addcary_u64 intrinsic.
+
+import macros
+
+func wordsRequired(bits: int): int {.compileTime.} =
+  ## Compute the number of limbs required
+  ## from the announced bit length
+  (bits + 64 - 1) div 64
+
+type
+  BigInt[bits: static int] {.byref.} = object
+    ## BigInt
+    ## Enforce-passing by reference otherwise uint128 are passed by stack
+    ## which causes issue with the inline assembly
+    limbs: array[bits.wordsRequired, uint64]
+
+macro addCarryGen_u64(a, b: untyped, bits: static int): untyped =
+  var asmStmt = (block:
+    "      movq %[b], %[tmp]\n" &
+    "      addq %[tmp], %[a]\n"
+  )
+
+  let maxByteOffset = bits div 8
+  const wsize = sizeof(uint64)
+
+  when defined(gcc):
+    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
+      asmStmt.add (block:
+        "\n" &
+        # movq 8+%[b], %[tmp]
+        "      movq " & $byteOffset & "+%[b], %[tmp]\n" &
+        # adcq %[tmp], 8+%[a]
+        "      adcq %[tmp], " & $byteOffset & "+%[a]\n"
+      )
+  elif defined(clang):
+    # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
+    for byteOffset in countup(wsize, maxByteOffset-1, wsize):
+      asmStmt.add (block:
+        "\n" &
+        # movq 8+%[b], %[tmp]
+        "      movq " & $byteOffset & "%[b], %[tmp]\n" &
+        # adcq %[tmp], 8+%[a]
+        "      adcq %[tmp], " & $byteOffset & "%[a]\n"
+      )
+
+  let tmp = ident("tmp")
+  asmStmt.add (block:
+    ": [tmp] \"+r\" (`" & $tmp & "`), [a] \"+m\" (`" & $a & "->limbs[0]`)\n" &
+    ": [b] \"m\"(`" & $b & "->limbs[0]`)\n" &
+    ": \"cc\""
+  )
+
+  result = newStmtList()
+  result.add quote do:
+    var `tmp`{.noinit.}: uint64
+
+  result.add nnkAsmStmt.newTree(
+    newEmptyNode(),
+    newLit asmStmt
+  )
+
+  echo result.toStrLit
+
+func `+=`(a: var BigInt, b: BigInt) {.noinline.}=
+  # Depending on inline or noinline
+  # the generated ASM addressing must be tweaked for Clang
+  # https://lists.llvm.org/pipermail/llvm-dev/2017-August/116202.html
+  addCarryGen_u64(a, b, BigInt.bits)
+
+# #############################################
+when isMainModule:
+  import random
+  proc rand(T: typedesc[BigInt]): T =
+    for i in 0 ..< result.limbs.len:
+      result.limbs[i] = uint64(rand(high(int)))
+
+  proc main() =
+    block:
+      let a = BigInt[128](limbs: [high(uint64), 0])
+      let b = BigInt[128](limbs: [1'u64, 0])
+
+      echo "a:        ", a
+      echo "b:        ", b
+      echo "------------------------------------------------------"
+
+      var a1 = a
+      a1 += b
+      echo a1
+      echo "======================================================"
+
+    block:
+      let a = rand(BigInt[256])
+      let b = rand(BigInt[256])
+
+      echo "a:        ", a
+      echo "b:        ", b
+      echo "------------------------------------------------------"
+
+      var a1 = a
+      a1 += b
+      echo a1
+      echo "======================================================"
+
+    block:
+      let a = rand(BigInt[384])
+      let b = rand(BigInt[384])
+
+      echo "a:        ", a
+      echo "b:        ", b
+      echo "------------------------------------------------------"
+
+      var a1 = a
+      a1 += b
+      echo a1
+
+  main()
--- a/constantine/tower_field_extensions/abelian_groups.nim
+++ b/constantine/tower_field_extensions/abelian_groups.nim
@ -9,7 +9,7 @@
 import
  ../arithmetic/finite_fields,
  ../config/common,
-  ../primitives/constant_time
+  ../primitives

 # ############################################################
 #
--- a/tests/prng.nim
+++ b/tests/prng.nim
@ -7,7 +7,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import
-  ../constantine/arithmetic/bigints_checked,
+  ../constantine/arithmetic/bigints,
  ../constantine/config/[common, curves]

 # ############################################################
@ -86,13 +86,12 @@ func random[T](rng: var RngState, a: var T, C: static Curve) {.noInit.}=
  when T is BigInt:
    var reduced, unreduced{.noInit.}: T

-    unreduced.setInternalBitLength()
    for i in 0 ..< unreduced.limbs.len:
      unreduced.limbs[i] = Word(rng.next())

    # Note: a simple modulo will be biaised but it's simple and "fast"
    reduced.reduce(unreduced, C.Mod.mres)
-    a.montyResidue(reduced, C.Mod.mres, C.getR2modP(), C.getNegInvModWord())
+    a.montyResidue(reduced, C.Mod.mres, C.getR2modP(), C.getNegInvModWord(), C.canUseNoCarryMontyMul())

  else:
    for field in fields(a):
--- a/tests/test_bigints.nim
+++ b/tests/test_bigints.nim
@ -8,9 +8,9 @@

 import  unittest,
        ../constantine/io/io_bigints,
-        ../constantine/arithmetic/bigints_checked,
+        ../constantine/arithmetic/bigints,
        ../constantine/config/common,
-        ../constantine/primitives/constant_time
+        ../constantine/primitives

 proc main() =
  suite "isZero":
--- a/tests/test_bigints_multimod.nim
+++ b/tests/test_bigints_multimod.nim
@ -11,8 +11,8 @@ import
  unittest,
  # Third-party
  ../constantine/io/io_bigints,
-  ../constantine/arithmetic/[bigints_raw, bigints_checked],
-  ../constantine/primitives/constant_time
+  ../constantine/arithmetic/bigints,
+  ../constantine/primitives

 proc main() =
  suite "Bigints - Multiprecision modulo":
@ -89,4 +89,52 @@ proc main() =
        check:
          bool(r == expected)

+    test "bitsize 1882 mod bitsize 312":
+        let a = BigInt[1882].fromHex("0x3feb1d432a950d856fc121c5057671cf81bf9d283a30b69128e84d57900aba486136b9e93f96293dbf7e280b8a641d970748b27ba0986411c7359f32f37447e34ae9e9189336269326fb62fd4d0891bf2383548e8ada92517cf5001e449dd5b4c6501b361636c13f3d5db5ed40f7048f8b1b8db65e9a34a08992e19527ded175fd6b4c4559c25c384691f0567ad27cf5df2b4192d94dc3bf596216067fd02a3790c048bc4bff16e70f84c395ff1243d4b92b514d0c22fc35a82611b77137f09ec8bc31df58fbea2b532ef38ed9078bd2982893326833a20daf2792bdf1ac75ca80e2ffd063f49bb173e7b100")
+        let m = BigInt[312].fromHex("0x8bad37615c65cb40b592525aeb19de0b8a3f9db87f3c77050a77050ebe81712d78253cdc0eafec")
+
+        let expected = BigInt[312].fromHex("0x1d79fa2f576827a70b38b303036884b346fc52941b2df0863e8f635c467ea1aec04520e6feb614")
+
+        var r: BigInt[312]
+        r.reduce(a, m)
+
+        check:
+          bool(r == expected)
+
+    test "bitsize 5276 mod bitsize 337":
+        let a = BigInt[5276].fromHex("0xdcc610304e437c91df568effa736e9ec472d921d2e32f0123f59f8a0e7a639a84db3d6e91c4ce9164e2183aeb9efdfdf5b179b1e5b8074602193b9ba0f5cf547ce31c6c6d33317c40fdcc66090d13034a8ed82b1244cd9e82ec43b08a4a8cd7aaa4937b72b19b01c942427db3e630e70f6823f36a4d0db17b0515ab1582672f613c22f43b2743929d92a924b2d7529a08fa2950ac90fd529207d3dd55a65f80f77715b340755f545424375a1f6dfe3eea1309365036d924226297ecd1296c5938a7b18fe36c3126f54161818ff8e29d69c25b7a47a47061f6e76b6ffbe0c2dbcbf83f49b0bd24cb6f2de460e6c6540e15e23e23573a04dff7d18f88e266a1e36627181dd18a9a182182b1c4e1ec8123b916d18a82139c6b2f5cc7206681b21ec3b14f4da44337892a90db21e070c8799a5cd7e81c03b901ade08021401d6a4cd27bef1e1215c65c2e8abadf44cc455383b37c12fe1f25774bbb0552ca54699c8d38cd88b56ff80c130734dbd231f8f2d15e62effe7bfedde43c4d06f06115befbafabcb1128b3c80f8c6395696f28b6d32c12cc74ba7fcef95e97bd854c98716b6c079d971199a4d3fa4f6d7f901f5370b3f0a4fa6dddff81820ca012bb821560b86701d25a3c99f0daae5824bc5d4731c1e5e879b94bb0a5a862ac79d22fc42d20d3d8963a49997627d4d246088a21531e58174e55eed8007c7e05bece76c64a368c42a7e178b0ba0ce3b54f1d9a568755c71f3518e5d10caa2eda8edd74f13c41b70c6ff0a75f6b821b38cb6148acf6890fc79d508cfa741c8514498b81aaf1698420bf844742d325afe8fce3e85c1d2aefc6bc254e3628f19116643a538c6657a937d62069dfe7217a9e9138e8a12f9857c9eb671c2adb3b3129d0653eb62296bdcfe51335b966e39838a4b18fce380af1f00")
+        let m = BigInt[337].fromHex("0x016255c2e37f1b1405f9f195040e80778b896b23a1487a40ece792894025590800bcadc343fcef4e2d01b8")
+
+        let expected = BigInt[337].fromHex("0x66bb36adf84b9024f97100688cc66be2f412fd91e9bae3623e810dcae86166a52bdb4c889fa0e5d128d8")
+
+        var r: BigInt[337]
+        r.reduce(a, m)
+
+        check:
+          bool(r == expected)
+
+    test "bitsize 4793 mod bitsize 190":
+        let a = BigInt[4793].fromHex("0x20924645cabc04f0213ba42961e10dedd2c6bd0c9625d04949037b15f9546001551651049038285b441824ef5540a174da0d5ab5f6f07750b9d6ea21a8dd127b467cd1ff0d547d7c86705402bfb8efee231c8385d14666d4e5fdd4e4e6c230ed61b631a6387c57823578139db306c1687bb950985e608c2792694e895e97039c0c155c79b3d595b391f5f8217feeebc20b093658d3e7612449ef575da3d0cde0d3726c58ca9302952deaee8b44a31029086db65838767c60b63f68f9c207ca128574ff9023fc29de264c8e4df20b7764064f9228a2481d5936cc840e107f73b04fcf31f8060c38ea5fb9c8f165e4bbdd1c7b8f0cfb950be57d87678a0a3d45eb1ccbef1a977e881de4f4f95ef0e144a0486ca47084a565242a2baab7a5383e85d51c466d7b03e1f06285bfe04cbb4b90e829a50af103ab8a812cfdad100344b3ae0ab3b96e26a0d97cf16d1910212471f9b3f5e3d0133360387ca3a52682d68447e7ac454e321bc5381a24ff5348baad68d3609a7dfa2118275f2620cf30b1ebb21d98b1d783b45c2acf4a9a9b1cfeba21b2fe1d93fda3234ee90bdba1b23e3a514c7e2189f7bf07236397e1efc5cb5b3a3e748ba130272d880b9d74fc6c2386f19c9e51093ce885ad60493a3d4d0c84154e6fb6d4bb222207eb9f3a2136cebe883a5a89b95eba5363c113f330636d00dda40f3445afb651a56a1d00e5d3815b3c06f123e5eb6b8ce5621ab8f05765fe803e94a12998c249cb1e84c9c4785c8631454283e0471149bc541eebc691b3231e4969b433b9c8195db915cb3baef8db7b3ab0dff2aa7f284e5b86e8055ab95bf45086a216138000")
+        let m = BigInt[190].fromHex("0x3cf10d948e00a135ab10a6d073b8289e8465d5798d06891c")
+
+        let expected = BigInt[190].fromHex("0x1154587f8cfac96bc146790bc49262ad32e1560a0bf734a4")
+
+        var r: BigInt[190]
+        r.reduce(a, m)
+
+        check:
+          bool(r == expected)
+
+    test "bitsize 5240 mod bitsize 3072":
+        let a = BigInt[5240].fromHex("0xfd76093ad413df93dddded94a16c17ffbdd1d8ce9377f5940af54293410603fd381822bb9cec0074005f68dd7bc33f879fb0613b9cd0bac50c27e5e40a0c3948e5b4a07e6c9b1795f2f60647d67b9fd5025d82deffcbb62209a921eb766f7a2335d1b6ca0a4877c948d5cf68aaa2d5bdbea3f991eb027b91d03c91712d739cf6522279007add5febb85fe8f5ef661641fc2dc36b37e709d77f3a0f016f1b421527d2a28c9e8734f243a81e985a26e6ec5650ae81f10381869a9a78e5dac5e6d65783c40f8c232398425e96da2c6d94290ee6463580c826c609691a860f8ceb233a072fca384a9a74d15beaad7df8f1dcde437a02db6218b0c0bca43f9f936bdede271b730b098cf4dd97a84a10feb7eb04841ac01e4728a12a1f96b88ac91bef33818095777893813635cc918480f255a45bf9ebc740e3992877879d4ee64b1aad22439dc9872e5ff13a25dcb32669dab77e05982adbfb06073d5b9bd2b0dcdc7a515296b13e7251d3fb6ca132492f66d312e2610011284b6a2f2a1c26a873959ff5935ddd1e229d4ec7cdf3fa1ce1f55bc549481adfee5ec8aa8b4eddad88c74298d50c2d310be6c21e92067bf8b5ae2330f750f60122251d1fcf84c58a3abee3bed8715dda1c016eb58672faed0a5c678806028195586a349702eeff0738d14ac9a2ced66a50e894cf0a3d546608e6666443d5ea4bc6e7078ae356257ce12fc3d6cf84fe52f13fe27ad89038d041698ed615c856d326d2f1fc5cc916a5176d44a965cf5247b81b901212eb3f35912e82d68b28fe438b3f9cb43362794a53976dcc6ec21aa097813e47f7cb02af3e2bdd9f4323e3ffb577a800cdb5fe925b832263db0332a235c6d8de3df97d8a963c8062f1d39aa302f8db2e2ac292fb6f66d4e2e074e8ce4c77ca0a311bcd3455")
+        let m = BigInt[3072].fromHex("0x82a3b927471854fe51e4246c36e83a110ab5fec30a5f26fca0316b8ab42784bc17015ef9d0217769e695b92bfe4f30ffcfef881179edc6623dcaf305ee4424da00d8c49a873535e095ac64a8cc66767c26ffa7f2f1acdaedd82b09b62d297f951cd3af83e7023b4eafea8056fc9e4f53f03eb9ee93613d58219214f8d884f51d4e09b336a4bf53fb29a4394fc9b8d4004f4ab04cdfda43441e63846e3dbd02c46ab521b85a16d4a063c33be63e88c9b3fec486f9eda4958a167cb4dd64dd44c7047e4f1372e6ce6f29bbc4a6cc0f498c0428dbc35daaa81abedd937e602ce3eb38666f0ccd603955949e068dd005e2e2bf6d423fd183fcbf61c504eeffa589c3482251b1191e7d71b8e31fc05979b4ebb6ab57ce810d6e34144a8417ab2ca45709b3841bb08cbf38658d2f4129adee121933369deb238db2f74df4490ea5486685554cc4dac015f4d09ded70a4fc808b080142eb7c865fe8e89046f3c0de448f1442258d2cd565dfd457cfb49ab0c0a735196e6cb06a962f29e53060576327b8")
+
+        let expected = BigInt[3072].fromHex("0x75be9187192dccf08bedcb06c7fba60830840cb8a5c3a5895e63ffd78073f2f7e0ccc72ae2f91c2be9fe51e48373bf4426e6e1babb9bc5374747a0e24b982a27359cf403a6bb900800b6dd52b309788df7f599f3db6f5b5ba5fbe88b8d03ab32fbe8d75dbbad0178f70dc4dfbc39008e5c8a4975f08060f4af1718e1a8811b0b73daabf67bf971c1fa79d678e3e2bf878a844004d1ab5b11a2c3e4fa8abbbe15b75a4a15a4c0eecd128ad7b13571545a967cac88d1b1e88c3b09723849c54adede6b36dd21000f41bc404083bf01902d2d3591c2e51fe0cc26d691cbc9ba6ea3137bd977745cc8761c828f7d54899841701faeca7ff5fc975968693284c2dcaf68e9852a67b5782810834f2eed0ba8e69d18c2a9d8aa1d81528110f0156610febe5ee2db65add65006a9f91370828e356c7751fa50bb49f43b408cd2f4767a43bc57888afe01d2a85d457c68a3eb60de713b79c318b92cb1b2837cf78f9e6e5ec0091d2810a34a1c75400190f8582a8b42f436b799db088689f8187b6db8530d")
+
+        var r: BigInt[3072]
+        r.reduce(a, m)
+
+        check:
+          bool(r == expected)
+
 main()
--- a/tests/test_bigints_vs_gmp.nim
+++ b/tests/test_bigints_vs_gmp.nim
@ -13,7 +13,7 @@ import
  gmp, stew/byteutils,
  # Internal
  ../constantine/io/io_bigints,
-  ../constantine/arithmetic/[bigints_raw, bigints_checked],
+  ../constantine/arithmetic/bigints,
  ../constantine/primitives/constant_time

 # We test up to 1024-bit, more is really slow
@ -113,16 +113,16 @@ proc main() =

    var aW, mW: csize # Word written by GMP

-    discard mpz_export(aBuf[0].addr, aW.addr, GMP_LeastSignificantWordFirst, 1, GMP_WordNativeEndian, 0, a)
-    discard mpz_export(mBuf[0].addr, mW.addr, GMP_LeastSignificantWordFirst, 1, GMP_WordNativeEndian, 0, m)
+    discard mpz_export(aBuf[0].addr, aW.addr, GMP_MostSignificantWordFirst, 1, GMP_WordNativeEndian, 0, a)
+    discard mpz_export(mBuf[0].addr, mW.addr, GMP_MostSignificantWordFirst, 1, GMP_WordNativeEndian, 0, m)

    # Since the modulus is using all bits, it's we can test for exact amount copy
-    doAssert aLen >= aW, "Expected at most " & $aLen & " bytes but wrote " & $aW & " for " & toHex(aBuf) & " (little-endian)"
-    doAssert mLen == mW, "Expected " & $mLen & " bytes but wrote " & $mW & " for " & toHex(mBuf) & " (little-endian)"
+    doAssert aLen >= aW, "Expected at most " & $aLen & " bytes but wrote " & $aW & " for " & toHex(aBuf) & " (big-endian)"
+    doAssert mLen == mW, "Expected " & $mLen & " bytes but wrote " & $mW & " for " & toHex(mBuf) & " (big-endian)"

    # Build the bigint
-    let aTest = BigInt[aBits].fromRawUint(aBuf, littleEndian)
-    let mTest = BigInt[mBits].fromRawUint(mBuf, littleEndian)
+    let aTest = BigInt[aBits].fromRawUint(aBuf.toOpenArray(0, aW-1), bigEndian)
+    let mTest = BigInt[mBits].fromRawUint(mBuf.toOpenArray(0, mW-1), bigEndian)

    #########################################################
    # Modulus
@ -135,15 +135,16 @@ proc main() =
    # Check
    var rGMP: array[mLen, byte]
    var rW: csize # Word written by GMP
-    discard mpz_export(rGMP[0].addr, rW.addr, GMP_LeastSignificantWordFirst, 1, GMP_WordNativeEndian, 0, r)
+    discard mpz_export(rGMP[0].addr, rW.addr, GMP_MostSignificantWordFirst, 1, GMP_WordNativeEndian, 0, r)

    var rConstantine: array[mLen, byte]
-    exportRawUint(rConstantine, rTest, littleEndian)
+    exportRawUint(rConstantine, rTest, bigEndian)

    # echo "rGMP: ", rGMP.toHex()
    # echo "rConstantine: ", rConstantine.toHex()

-    doAssert rGMP == rConstantine, block:
+    # Note: in bigEndian, GMP aligns left while constantine aligns right
+    doAssert rGMP.toOpenArray(0, rW-1) == rConstantine.toOpenArray(mLen-rW, mLen-1), block:
      # Reexport as bigEndian for debugging
      discard mpz_export(aBuf[0].addr, aW.addr, GMP_MostSignificantWordFirst, 1, GMP_WordNativeEndian, 0, a)
      discard mpz_export(mBuf[0].addr, mW.addr, GMP_MostSignificantWordFirst, 1, GMP_WordNativeEndian, 0, m)
@ -152,6 +153,7 @@ proc main() =
      "  m (" & align($mBits, 4) & "-bit):   " & mBuf.toHex & "\n" &
      "failed:" & "\n" &
      "  GMP:            " & rGMP.toHex() & "\n" &
-      "  Constantine:    " & rConstantine.toHex()
+      "  Constantine:    " & rConstantine.toHex() & "\n" &
+      "(Note that GMP aligns bytes left while constantine aligns bytes right)"

 main()
--- a/tests/test_finite_fields_powinv.nim
+++ b/tests/test_finite_fields_powinv.nim
@ -7,7 +7,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import  unittest,
-        ../constantine/arithmetic/[bigints_checked, finite_fields],
+        ../constantine/arithmetic/[bigints, finite_fields],
        ../constantine/io/io_fields,
        ../constantine/config/curves

--- a/tests/test_finite_fields_vs_gmp.nim
+++ b/tests/test_finite_fields_vs_gmp.nim
@ -13,7 +13,7 @@ import
  gmp, stew/byteutils,
  # Internal
  ../constantine/io/[io_bigints, io_fields],
-  ../constantine/arithmetic/[finite_fields, bigints_checked],
+  ../constantine/arithmetic/[finite_fields, bigints],
  ../constantine/primitives/constant_time,
  ../constantine/config/curves

@ -100,7 +100,7 @@ proc binary_epilogue[C: static Curve, N: static int](
    "  b:   " & bBuf.toHex & "\n" &
    "failed:" & "\n" &
    "  GMP:            " & rGMP.toHex() & "\n" &
-    "  Constantine:    " & rConstantine.toHex() &
+    "  Constantine:    " & rConstantine.toHex() & "\n" &
    "(Note that GMP aligns bytes left while constantine aligns bytes right)"

 # ############################################################
--- a/tests/test_fp2.nim
+++ b/tests/test_fp2.nim
@ -12,7 +12,7 @@ import
  # Internals
  ../constantine/tower_field_extensions/[abelian_groups, fp2_complex],
  ../constantine/config/[common, curves],
-  ../constantine/arithmetic/bigints_checked,
+  ../constantine/arithmetic/bigints,
  # Test utilities
  ./prng

@ -45,7 +45,7 @@ suite "𝔽p2 = 𝔽p[𝑖] (irreducible polynomial x²+1)":
            O

          var r: typeof(C.Mod.mres)
-          r.redc(oneFp2.c0.mres, C.Mod.mres, C.getNegInvModWord())
+          r.redc(oneFp2.c0.mres, C.Mod.mres, C.getNegInvModWord(), canUseNoCarryMontyMul = false)

          check:
            bool(r == oneBig)
--- a/tests/test_io_bigints.nim
+++ b/tests/test_io_bigints.nim
@ -9,7 +9,7 @@
 import  unittest, random,
        ../constantine/io/io_bigints,
        ../constantine/config/common,
-        ../constantine/arithmetic/bigints_checked
+        ../constantine/arithmetic/bigints

 randomize(0xDEADBEEF) # Random seed for reproducibility
 type T = BaseType
@ -24,7 +24,6 @@ proc main() =

        check:
          T(big.limbs[0]) == 0
-          T(big.limbs[1]) == 0

    test "Parsing and dumping round-trip on uint64":
      block:
@ -85,4 +84,11 @@ proc main() =

        check: p == hex

+    test "Round trip on 3072-bit integer":
+      const n = "0x75be9187192dccf08bedcb06c7fba60830840cb8a5c3a5895e63ffd78073f2f7e0ccc72ae2f91c2be9fe51e48373bf4426e6e1babb9bc5374747a0e24b982a27359cf403a6bb900800b6dd52b309788df7f599f3db6f5b5ba5fbe88b8d03ab32fbe8d75dbbad0178f70dc4dfbc39008e5c8a4975f08060f4af1718e1a8811b0b73daabf67bf971c1fa79d678e3e2bf878a844004d1ab5b11a2c3e4fa8abbbe15b75a4a15a4c0eecd128ad7b13571545a967cac88d1b1e88c3b09723849c54adede6b36dd21000f41bc404083bf01902d2d3591c2e51fe0cc26d691cbc9ba6ea3137bd977745cc8761c828f7d54899841701faeca7ff5fc975968693284c2dcaf68e9852a67b5782810834f2eed0ba8e69d18c2a9d8aa1d81528110f0156610febe5ee2db65add65006a9f91370828e356c7751fa50bb49f43b408cd2f4767a43bc57888afe01d2a85d457c68a3eb60de713b79c318b92cb1b2837cf78f9e6e5ec0091d2810a34a1c75400190f8582a8b42f436b799db088689f8187b6db8530d"
+      let x = BigInt[3072].fromHex(n)
+      let h = x.toHex(bigEndian)
+
+      check: n == h
+
 main()
--- a/tests/test_io_fields.nim
+++ b/tests/test_io_fields.nim
@ -10,7 +10,7 @@ import  unittest, random,
        ../constantine/io/[io_bigints, io_fields],
        ../constantine/config/curves,
        ../constantine/config/common,
-        ../constantine/arithmetic/[bigints_checked, finite_fields]
+        ../constantine/arithmetic/[bigints, finite_fields]

 randomize(0xDEADBEEF) # Random seed for reproducibility
 type T = BaseType
@ -18,6 +18,30 @@ type T = BaseType
 proc main() =
  suite "IO - Finite fields":
    test "Parsing and serializing round-trip on uint64":
+      # 101 ---------------------------------
+      block:
+        # "Little-endian" - 0
+        let x = BaseType(0)
+        let x_bytes = cast[array[sizeof(BaseType), byte]](x)
+        var f: Fp[Fake101]
+        f.fromUint(x)
+
+        var r_bytes: array[sizeof(BaseType), byte]
+        exportRawUint(r_bytes, f, littleEndian)
+        check: x_bytes == r_bytes
+
+      block:
+        # "Little-endian" - 1
+        let x = BaseType(1)
+        let x_bytes = cast[array[sizeof(BaseType), byte]](x)
+        var f: Fp[Fake101]
+        f.fromUint(x)
+
+        var r_bytes: array[sizeof(BaseType), byte]
+        exportRawUint(r_bytes, f, littleEndian)
+        check: x_bytes == r_bytes
+
+      # Mersenne 61 ---------------------------------
      block:
        # "Little-endian" - 0
        let x = 0'u64
@ -103,4 +127,20 @@ proc main() =

        check: p == hex

+    test "Round trip on prime field of NIST P256 (secp256r1) curve":
+      block: # 2^126
+        const p = "0x0000000000000000000000000000000040000000000000000000000000000000"
+        let x = Fp[P256].fromBig BigInt[256].fromHex(p)
+        let hex = x.toHex(bigEndian)
+
+        check: p == hex
+
+    test "Round trip on prime field of BLS12_381 curve":
+      block: # 2^126
+        const p = "0x000000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000"
+        let x = Fp[BLS12_381].fromBig BigInt[381].fromHex(p)
+        let hex = x.toHex(bigEndian)
+
+        check: p == hex
+
 main()
--- a/tests/test_primitives.nim
+++ b/tests/test_primitives.nim
@ -7,7 +7,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import  unittest, random, math,
-        ../constantine/primitives/constant_time
+        ../constantine/primitives

 # Random seed for reproducibility
 randomize(0xDEADBEEF)