From b6a3c892d7714dd7c0755d6fd3dd2ff3e76a57b3 Mon Sep 17 00:00:00 2001
From: Mamy Ratsimbazafy <mamy_github@numforge.co>
Date: Thu, 8 Apr 2021 15:48:43 +0200
Subject: [PATCH] Cpu architecture optimization documentation (#2483)

* x86 features

* Update docs/cpu_features.md [skip CI]

Co-authored-by: tersec <tersec@users.noreply.github.com>

* Update docs/cpu_features.md [skip CI]

Co-authored-by: tersec <tersec@users.noreply.github.com>

* less space [skip CI]

* Add ARMv8 [skip ci]

Co-authored-by: tersec <tersec@users.noreply.github.com>
---
 docs/cpu_features.md | 149 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 149 insertions(+)
 create mode 100644 docs/cpu_features.md

diff --git a/docs/cpu_features.md b/docs/cpu_features.md
new file mode 100644
index 000000000..6061e878d
--- /dev/null
+++ b/docs/cpu_features.md
@@ -0,0 +1,149 @@
+# CPU Features for Nimbus
+
+This document describes the CPU-specific features and compilation flags that significantly improve Nimbus performance.
+
+We focus on x86-64 and ARMv8 (64 bits).
+Given that the major bottleneck of Nimbus is big-integer arithmetic for cryptography, a 64-bit architecture speeds up elliptic curve cryptography by ~2x over 32-bit, since it halves the number of low-level assembly operations.
+
+_Note: SHA256 isn't improved by 64-bit since it uses 32-bit operations by design_
+
+The major bottlenecks that can be improved by CPU-specific instructions are:
+- Elliptic curve cryptography for BLS12-381
+- SHA256 hashing
+
+## x86
+
+### SSSE3 (Supplemental SSE3)
+
+Intel: Core 2, 2006\
+AMD: Bulldozer, 2011\
+Flag: `-mssse3`\
+Configuration: https://github.com/supranational/blst/blob/v0.3.4/build/assembly.S#L3-L6
+
+SSSE3 improves SHA256 computations. SHA256 is used **recursively** to hash all consensus objects and to build a Merkle tree.
+Thanks to caching, SHA256 speed matters mostly when receiving new blocks and attestations from the network; state transitions do not depend on it (unlike in a naive spec implementation).
+
+**SSSE3 must not be confused with SSE3 from Pentium 4 (2004) and Athlon 64 (2005)**
+
+```
+git clone https://github.com/status-im/nim-blscurve
+cd nim-blscurve
+git submodule update --init
+nim c -r -d:danger --passC:"-D__BLST_PORTABLE__" --outdir:build benchmarks/bench_sha256.nim
+nim c -r -d:danger --outdir:build benchmarks/bench_sha256.nim
+```
+
+Due to tree hashing, hashing 32 bytes is the most important benchmark.
+
+**Without SSSE3**
+```
+Backend: BLST, mode: 64-bit
+==================================================================================
+
+SHA256 - 32B - BLST       4524886.878 ops/s          221 ns/op          660 cycles
+SHA256 - 128B - BLST      1776198.934 ops/s          563 ns/op         1689 cycles
+SHA256 - 5MB - BLST            70.723 ops/s     14139678 ns/op     42419720 cycles
+```
+**With SSSE3**
+
+```
+Backend: BLST, mode: 64-bit
+==================================================================================
+
+SHA256 - 32B - BLST       5376344.086 ops/s          186 ns/op          555 cycles
+SHA256 - 128B - BLST      2183406.114 ops/s          458 ns/op         1376 cycles
+SHA256 - 5MB - BLST            87.142 ops/s     11475557 ns/op     34427254 cycles
+```
+
+### BMI2 & ADX
+
+Intel: Broadwell, 2015\
+AMD: Ryzen, 2017\
+Configuration: https://github.com/supranational/blst/blob/v0.3.4/build/assembly.S#L18
+
+The MULX (BMI2), ADCX and ADOX (ADX) instructions significantly improve big integer multiplication and squaring.
+The speedup is about 20~25% depending on the custom assembly implementation.
+
+All CPUs that support ADX support BMI2.
+
+```
+git clone https://github.com/status-im/nim-blscurve
+cd nim-blscurve
+git submodule update --init
+nim c -r -d:danger --hints:off --warnings:off --verbosity:0 --outdir:build benchmarks/bls_signature.nim
+nim c -r -d:danger --passC:"-mbmi2 -madx" --hints:off --warnings:off --verbosity:0 --outdir:build benchmarks/bls_signature.nim
+```
+
+**Verification** is the bottleneck, as it must be done for each block, attestation or aggregate received.
+Verifying a block requires verifying up to 6 signatures (block proposer, RANDAO, aggregate verification of attestations, proposer slashings, attester slashings, voluntary exits).
+**Signing** can become a bottleneck when a node has many validators.
+
+**Without BMI2 & ADX**
+```
+Backend: BLST, mode: 64-bit
+=============================================================================================================
+
+BLS signature                                           1960.023 ops/s       510198 ns/op      1530624 cycles
+BLS verification                                         743.122 ops/s      1345674 ns/op      4037105 cycles
+BLS agg verif of 1 msg by 128 pubkeys                    704.634 ops/s      1419176 ns/op      4257591 cycles
+BLS verif of 6 msgs by 6 pubkeys                         120.588 ops/s      8292683 ns/op     24878257 cycles
+Serial batch verify 6 msgs by 6 pubkeys (with blinding)  218.027 ops/s      4586595 ns/op     13759932 cycles
+```
+
+**With BMI2 & ADX**
+```
+Backend: BLST, mode: 64-bit
+=============================================================================================================
+
+BLS signature                                           2550.084 ops/s       392144 ns/op      1176454 cycles
+BLS verification                                         930.081 ops/s      1075175 ns/op      3225589 cycles
+BLS agg verif of 1 msg by 128 pubkeys                    878.672 ops/s      1138081 ns/op      3414286 cycles
+BLS verif of 6 msgs by 6 pubkeys                         154.833 ops/s      6458588 ns/op     19376076 cycles
+Serial batch verify 6 msgs by 6 pubkeys (with blinding)  282.562 ops/s      3539046 ns/op     10617328 cycles
+```
+
+### SHA-NI
+
+Hardware SHA instructions were NOT available in mainstream Intel desktop CPUs until 2021; earlier support was limited to the low-power parts listed below.
+AMD has supported them since the Zen architecture (2017).
+
+Intel:
+- Rocket Lake (2021)
+- Ice Lake (low-power laptops 2019)
+- Goldmont (Apollo Lake Pentiums & Celerons 2016, Denverton Atoms 2017)
+
+AMD: Ryzen, 2017\
+Flag: `-msha`
+Configuration: https://github.com/supranational/blst/blob/v0.3.4/src/sha256.h#L11-L12
+
+On Ryzen, **hardware SHA is 4x faster** than SHA256 computed with SIMD instructions (Table 1, p. 14 of the paper below).
+
+- SoK: A Performance Evaluation of Cryptographic Instruction Sets on Modern Architectures\
+  Armando Faz-Hernández, Julio López, Ana Karina D. S. de Oliveira, 2018\
+  https://www.lasca.ic.unicamp.br/media/publications/p9-faz-hernandez.pdf
+
+## ARM
+
+32-bit ARM (ARMv6) has a multiplication instruction 32x32 -> 64 called UMULL.
+
+Unfortunately, 64-bit ARM (ARMv8), unlike x86-64, doesn't have a single 64x64 -> 128 multiplication instruction. Extended precision multiplication needs two instructions: MUL for the low 64 bits of the product and UMULH for the high 64 bits.
+
+- Multiprecision Multiplication on ARMv8\
+  Zhe Liu, Kimmo Järvinen, Weiqiang Liu, Hwajeong Seo\
+  http://arith24.arithsymposium.org/slides/s2-liu.pdf
+
+Concretely, this means that ARMv8 CPUs (for example Apple M1) are at a disadvantage compared to x86-64 at equivalent frequency for big integers and cryptography.
+
+### Cryptographic extensions
+
+Except on the Raspberry Pi, ARMv8 processors support the cryptographic extensions, which include a hardware implementation of SHA256.
+
+This is detected via
+- `__ARM_FEATURE_CRYPTO` https://github.com/supranational/blst/blob/v0.3.4/src/sha256.h#L14-L15
+
+The compilation flag should be either
+- `-mfpu=crypto-neon-fp-armv8` (for 32-bit ARM targets)
+- or `-march=armv8-a+crypto`
+
+The expected speedup is about 2x compared to a build without the extensions.\
+https://patchwork.kernel.org/project/linux-arm-kernel/patch/20150316154835.GA31336@google.com/