nim-codex

Table of Contents

ECC libraries
Computations in GF
CPU arithmetics
GF(2^64-2^32+1)

ECC libraries

FastECC can use GF(p) and GF(p^2), p may be 64-bit
Leopard uses GF(2^n), supports N<=65536 and M>=K
AFAIR, FastECC is 2-3 times faster than Leopard for same M,K setting. But it may be correct only for large N. Also, FastECC starts to beat fastest N^2 RS codecs around N=100.

FastECC uses real FFT, so it can be adapted to any field having roots of unity. But since fields are non-binary, it requires recoding either input or output data and storing those extra bits somewhere. F.e. recommended field is GF(0xFFF00001) and input 4KB data are recoded into 1024 values of this field plus one bit.

Leopard uses so-called "additive FFT", which isn't FFT, but another, pretty similar transformation. The reason is no unity roots in GF(2^n), making FFT impossible.

Computations in GF

In GF(p^2), ADD/SUB/MUL are implemented via operations on two GF(p) values by rules similar to Complex field.

In GF(p), computations done modulo p:

ADD/SUB are implemented by a single '?:' operation
MUL requires full-scale mod p computations, and for a constant p it requires 4 multiplications and a few add/sub operations

In GF(2^n):

ADD/SUB is just XOR
MUL is usually fetched from multiplication tables. When tables become too large, numbers are split into multiple parts and partial result combined using property (a+b)*c = ab+ac

DIV is the most complex operation for any GF. Fortunately, iFFT requires only one DIV operation per input value (i.e. O(n) divisions total compared to O(n*log(n)) multiplications), and FFT doesn't need them at all.

DIV for any GF can be implemented via:

DivTable[a,b]
ReciprocalTable[a] storing 1/a values, with following multiplication by b
By computing 1/a: For any GF(p^n) and any x, x^(p^n-1) == 1, so 1/x = x^(p^n-2). This requires O(log(p^n)) MUL operations over field

CPU arithmetics

Modern CPUs are mostly 64-bit (although some are 32-bit). This means that they perform all computations on 64-bit values, thus computing on smaller values just wastes time, while MUL/DIV on larger values is slower (in terms of bits processed per second).

With SIMD, they essentially become 8/16/32/64-bit processors for ADD/SUB/XOR. On x86, MUL is best implemented for 32-bit values, and DIV isn't implemented at all.

For GF(2^n), MUL is implemented via tables. For x86 SIMD, it uses PSHUFB which is essentially lookup with 4-bit index. Number of 4-bit multiplications required to implement GF(2^n) MUL, is an order of (n/4)^2 (both input values are split into 4-bit segments and each input-1 segment is multiplied to each input-2 segment).

This means that we need 4x more time for GF(2^32) MUL than for GF(2^16) MUL, making computations in GF(2^32) about 2x slower (and much slower for division since we can't keep table of reciprocals).

GPUs are better to consider as 32-bit processors.

GF(2^64-2^32+1)

Code: https://github.com/pornin/ecgfp5

It's much faster than fields implemented in FastECC - but only on CPUs supporting 64*64=128 multiplication.

For x86 SIMD, GPUs and 32-bit CPUs, this multiplication will be implemented via four 32*32=64 multiplications - exactly like a*b mod p for 32-bit values. But since it processes 64 bits of data, it still be 2x faster in terms of GB/s processed.

Probably, on x64 scalar implementation would be faster than anything less than AVX-512, allowing much simpler code.

Moreover, GF(2^64-2^32+1) field is more dense than GF(0xFFF0001) - i.e. we can recode about gigabyte of data into values of this field plus a single bit (compared to only 4 KB for my field)