# Leopard-RS
## MDS Reed-Solomon Erasure Correction Codes for Large Data in C

Leopard-RS is a fast library for Forward Error Correction.
From a block of equally sized original data pieces, it generates recovery
symbols that can be used to recover lost original data.


#### Motivation:

Its running time grows as O(N Log N) in the input data size, and its inner loops are
vectorized using the best approaches available on modern processors, using the
fastest finite fields (8-bit or 16-bit Galois fields with Cantor basis {2}).

It sets new speed records for MDS encoding and decoding of large data,
achieving over 1.2 GB/s to encode with the AVX2 instruction set on a single core.

There is another library `FastECC` by Bulat-Ziganshin that should have similar performance:
[https://github.com/Bulat-Ziganshin/FastECC](https://github.com/Bulat-Ziganshin/FastECC).
Both libraries implement the same high-level algorithm in {3}, while Leopard implements the
newer polynomial basis GF(2^r) approach outlined in {1}, and FastECC uses complex finite fields
modulo special primes.  There are trade-offs that may make either approach preferable based
on the application:
+ Older processors do not support SSSE3 and FastECC supports these processors better.
+ FastECC supports data sets above 65,536 pieces as it uses 32-bit finite field math.  
+ Leopard does not require expanding the input or output data to make it fit in the field, so it can be more space efficient.

Example applications are data recovery software and data center replication.


#### Encoder API:

Preconditions:

* The original and recovery data must not exceed 65536 pieces.
* The `recovery_count` must not exceed the `original_count`.
* The `buffer_bytes` must be a multiple of 64.
* Each buffer must have the same number of bytes.
* Even the last piece must be padded out to the full `buffer_bytes`.
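
The preconditions above can be checked before calling the encoder. The following is a minimal sketch, not part of `leopard.h`: the helper names `round_up_to_64()` and `params_ok()` are hypothetical, the 65536 limit is assumed to apply to the original and recovery pieces combined, and rejecting zero counts is an extra assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; they are not part of leopard.h. */

/* Assumed limit on original + recovery pieces combined. */
#define MAX_PIECES 65536u

/* Round a piece size up to the next multiple of 64 bytes. */
static uint64_t round_up_to_64(uint64_t bytes)
{
    assert(bytes <= UINT64_MAX - 63u); /* avoid wrap-around */
    return (bytes + 63u) & ~(uint64_t)63u;
}

/* Return true when the listed preconditions hold. */
static bool params_ok(unsigned original_count,
                      unsigned recovery_count,
                      uint64_t buffer_bytes)
{
    /* Assumption: empty sets are rejected. */
    if (original_count == 0 || recovery_count == 0)
        return false;
    if (recovery_count > original_count)
        return false;
    if ((uint64_t)original_count + recovery_count > MAX_PIECES)
        return false;
    if (buffer_bytes == 0 || buffer_bytes % 64u != 0)
        return false;
    return true;
}
```

For example, a 1300-byte piece would be padded to `round_up_to_64(1300)` = 1344 bytes, and every other piece would use the same padded size.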

```
#include "leopard.h"
```

For full documentation please read `leopard.h`.

+ `leo_init()` : Initialize library.
+ `leo_encode_work_count()` : Calculate the number of work_data buffers to provide to `leo_encode()`.
+ `leo_encode()` : Generate recovery data.


#### Decoder API:

For full documentation please read `leopard.h`.

+ `leo_init()` : Initialize library.
+ `leo_decode_work_count()` : Calculate the number of work_data buffers to provide to `leo_decode()`.
+ `leo_decode()` : Recover original data.


#### Benchmarks:

Benchmark results are available here:
[https://raw.githubusercontent.com/catid/leopard/master/Benchmarks.md](https://raw.githubusercontent.com/catid/leopard/master/Benchmarks.md)


#### FFT Data Layout:

We pack the data into memory in this order:

~~~
[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]
~~~

For encoding, this placement is implied rather than being the actual memory layout.
For decoding, the layout is explicitly used.


#### Encoder algorithm:

The encoder is described in {3}.  It performs O(K Log M) operations,
where K is the number of original data pieces, and M is up to twice the
size of the recovery set.

In brief:

~~~
Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )
~~~

It walks the original data M chunks at a time performing the IFFT.
Each IFFT intermediate result is XORed together into the first M chunks of
the data layout.  Finally the FFT is performed.


Encoder optimizations:
* The first IFFT can be performed directly in the first M chunks.
* The zero padding can be skipped while performing the final IFFT.
* The final FFT can also be truncated if the recovery set is not a power of 2.
  It is easy to truncate the FFT by ending the inner loop early.

Unrolling is used in the code to accomplish the first two optimizations.


#### Decoder algorithm:

The decoder is described in {1}.  It performs O(N Log N) operations, where N is up
to twice the size of the original data as described below.

In brief:

~~~
Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )
~~~


#### Precalculations:

At startup initialization, FFTInitialize() precalculates FWT(L) as
described by equation (92) in {1}, where L = Log[i] for i = 0..Order, and
Order = 256 or 65536 for FF8 or FF16 respectively.  This is stored in the LogWalsh vector.

It also precalculates the FFT skew factors (s_i) as described by
equation (28).  This is stored in the FFTSkew vector.

The memory workspace needs N data chunks, where N is a power of two
at or above M + K.  K is the number of original data pieces and M is the
recovery data size rounded up to a power of two.  For example for K = 200
pieces of data and 10% redundancy, there are 20 redundant pieces, which
rounds up to 32 = M.  M + K = 232 pieces, so N rounds up to 256.
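
The workspace arithmetic above can be sketched in a few lines of C. The function names `next_pow2()` and `workspace_chunks()` are hypothetical and chosen for illustration, and the sketch assumes that a recovery count that is already a power of two is left unchanged.

```c
#include <assert.h>

/* Smallest power of two that is >= n. */
static unsigned next_pow2(unsigned n)
{
    assert(n >= 1 && n <= 0x80000000u); /* keep the shift from overflowing */
    unsigned p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Workspace chunks N for k original pieces and r recovery pieces. */
static unsigned workspace_chunks(unsigned k, unsigned r)
{
    unsigned m = next_pow2(r);  /* recovery region padded to a power of two */
    return next_pow2(m + k);    /* whole workspace padded to a power of two */
}
```

With the numbers from the example, `next_pow2(20)` gives M = 32 and `workspace_chunks(200, 20)` gives N = 256.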


#### Online calculations:

At runtime, the error locator polynomial is evaluated using the
Fast Walsh-Hadamard transform as described in {1} equation (92).
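
As a rough illustration of the transform itself, here is a minimal scalar sketch of an in-place Fast Walsh-Hadamard transform with modular arithmetic. It is not the library's code: the name `fwht_mod()` is hypothetical, and the modulus 255 is an assumption, chosen because it is the order of the multiplicative group of an 8-bit field, where logarithms of field elements live.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed modulus: 2^8 - 1, the multiplicative group order of an 8-bit field. */
#define MODULUS 255u

static unsigned add_mod(unsigned a, unsigned b) { return (a + b) % MODULUS; }
static unsigned sub_mod(unsigned a, unsigned b) { return (a + MODULUS - b) % MODULUS; }

/* In-place Walsh-Hadamard transform modulo MODULUS.
 * n must be a power of two and every entry must be below MODULUS. */
static void fwht_mod(unsigned *data, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t width = 1; width < n; width <<= 1) {
        for (size_t i = 0; i < n; i += 2 * width) {
            for (size_t j = i; j < i + width; ++j) {
                /* Butterfly: replace (a, b) with (a + b, a - b). */
                unsigned a = data[j];
                unsigned b = data[j + width];
                data[j] = add_mod(a, b);
                data[j + width] = sub_mod(a, b);
            }
        }
    }
}
```

Applying the transform twice multiplies every entry by n. For n = 256 that factor is congruent to 1 modulo 255, so in this setting the transform is its own inverse.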

At runtime the data is explicitly laid out in workspace memory like this:
~~~
[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]
~~~

Data that was lost is replaced with zeroes.
Data that was received, including recovery data, is multiplied by the error
locator polynomial as it is copied into the workspace.

The IFFT is applied to the entire workspace of N chunks.
Since the IFFT starts with pairs of inputs and doubles in width at each
iteration, the IFFT is optimized by skipping zero padding at the end until
it starts mixing with non-zero data.

The formal derivative is applied to the entire workspace of N chunks.

The FFT is applied to the entire workspace of N chunks.
The FFT is optimized by only performing intermediate calculations required
to recover lost data.  Since it starts wide and ends up working on adjacent
pairs, at some point the intermediate results are not needed for data that
will not be read by the application.  This optimization is implemented by
the ErrorBitfield class.

Finally, only recovered data is multiplied by the negative of the
error locator polynomial as it is copied into the front of the
workspace for the application to retrieve.


#### Finite field arithmetic optimizations:

For faster finite field multiplication, large tables are precomputed and
applied during encoding/decoding on 64 bytes of data at a time using
SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure {4}.

Addition in this finite field is XOR, and a vectorized memory XOR routine
is also used.
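
The table idea can be illustrated without SIMD. The sketch below is a scalar stand-in, not the library's implementation: it splits each byte into two 4-bit halves and looks each half up in a 16-entry table, which is the lookup that a byte-shuffle instruction performs on many bytes at once. The reduction polynomial 0x11D and all names (`gf_mul()`, `mul_table`, `mul_add_buffer()`) are assumptions for illustration, and the ALTMAP layout for 16-bit fields is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed reduction polynomial x^8 + x^4 + x^3 + x^2 + 1; illustrative only. */
#define GF_POLY 0x11Du

/* Bit-by-bit reference multiply in GF(2^8). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    unsigned acc = 0, x = a;
    for (unsigned y = b; y != 0; y >>= 1) {
        if (y & 1u)
            acc ^= x;          /* field addition is XOR */
        x <<= 1;
        if (x & 0x100u)
            x ^= GF_POLY;      /* reduce back to 8 bits */
    }
    return (uint8_t)acc;
}

/* Two 16-entry tables for multiplying by a fixed constant c. */
typedef struct {
    uint8_t lo[16];  /* c * v        for v in 0..15 */
    uint8_t hi[16];  /* c * (v << 4) for v in 0..15 */
} mul_table;

static void mul_table_init(mul_table *t, uint8_t c)
{
    assert(t != NULL);
    for (unsigned v = 0; v < 16; ++v) {
        t->lo[v] = gf_mul(c, (uint8_t)v);
        t->hi[v] = gf_mul(c, (uint8_t)(v << 4));
    }
}

/* dst[i] ^= c * src[i], using one lookup per 4-bit half of each byte. */
static void mul_add_buffer(uint8_t *dst, const uint8_t *src, size_t bytes,
                           const mul_table *t)
{
    assert(dst != NULL && src != NULL && t != NULL);
    for (size_t i = 0; i < bytes; ++i)
        dst[i] ^= (uint8_t)(t->lo[src[i] & 15u] ^ t->hi[src[i] >> 4]);
}
```

The split works because multiplication by a constant is linear over XOR: the product of c with a byte is the XOR of the products of c with its low and high halves.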


#### References:

This library implements an MDS erasure code introduced in this paper:

~~~
    {1} S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
    "Novel Polynomial Basis with Fast Fourier Transform
    and Its Application to Reed-Solomon Erasure Codes"
    IEEE Trans. on Information Theory, pp. 6284-6299, November, 2016.
~~~

~~~
    {2} D. G. Cantor, "On arithmetical algorithms over finite fields",
    Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.
~~~

~~~
    {3} Sian-Jheng Lin, Wei-Ho Chung, "An Efficient (n, k) Information
    Dispersal Algorithm for High Code Rate System over Fermat Fields,"
    IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
~~~

~~~
    {4} Plank, J. S., Greenan, K. M., Miller, E. L., "Screaming fast Galois Field
    arithmetic using Intel SIMD instructions."  In: FAST-2013: 11th Usenix
    Conference on File and Storage Technologies, San Jose, 2013
~~~
	
Some papers are mirrored in the /docs/ folder.


#### Credits

Inspired by discussion with:

+ Sian-Jheng Lin <sjhenglin@gmail.com> : Author of {1} {3}, basis for Leopard
+ Bulat Ziganshin <bulat.ziganshin@gmail.com> : Author of FastECC
+ Yutaka Sawada <tenfon@outlook.jp> : Author of MultiPar

Software by Christopher A. Taylor <mrcatid@gmail.com>