This document describes the CPU-specific features and compilation flags that significantly improves Nimbus performance.
We focus on x86-64 and ARMv8 (64 bits).
Given that the major bottleneck of Nimbus is big integer for cryptography, 64-bit architecture improves elliptic curve cryptography processing by ~2x over 32 bits since we can divide the number of low-level assembly operations by half.
_Note: SHA256 isn't improved by 64-bit since it uses 32-bit operations by design_
The major bottlenecks that can be improved by CPU specific instructions are:
Thanks to caching, SHA256 computation speed is mostly relevant only when receiving new blocks and attestations from the network, but state transitions do not depend on it (unlike a naive spec implementation).
**SSSE3 must not be confused with SSE3 from Pentium 3 (2004) and Athlon 64 (2005)**
and verifying a block requires verifying up to 6 signatures (block proposer, RANDAO, aggregate verification of attestations, proposer slashings, attester slashings, voluntary exits).
32-bit ARM (ARMv6) has a multiplication instruction 32x32 -> 64 called UMULL.
Unfortunately, 64-bit ARM (ARMv8) unlike x86-64 doesn't have a single 64x64 -> 128 multiplication instruction. MUL and UMULH instruction needs to be used for extended precision multiplication.
- Multiprecision Multiplication on ARMv8\
Zhe Liu, Kimmo Jarvinenadl, Weiqiang Liu, Hwajeong Seo\
Concretely, this means that ARMv8 CPUs are impaired compared to x86-64 at equivalent frequency for big integers and cryptography (for example Apple M1).
### Cryptographic extensions
Except for Raspberry Pi, ARMv8 processors support the crypto extensions which include hardware implementation of SHA256.