Improve docs

This commit is contained in:
Christopher Taylor 2017-05-28 15:15:39 -07:00
parent 3f9b00e8ed
commit 5e61c917f7
7 changed files with 427 additions and 57 deletions

View File

@ -29,19 +29,116 @@
#pragma once
/*
TODO:
+ Fixes for all different input sizes
+ New 16-bit Muladd inner loops
+ Benchmarks for large data!
+ Add multi-threading to split up long parallelizable calculations
+ Write detailed comments for all the routines
+ Final benchmarks!
+ Release version 1
+ Finish up documentation
FFT Data Layout:
TBD:
+ Look into getting EncodeL working so we can support smaller data (Ask Lin)
+ Look into using FFT_m instead of FFT_n for decoder
We pack the data into memory in this order:
[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]
For encoding, the placement is implied instead of actual memory layout.
For decoding, the layout is explicitly used.
*/
/*
Encoder algorithm:
The encoder is described in {3}. Operations are done O(K Log M),
where K is the original data size, and M is up to twice the
size of the recovery set.
Roughly in brief:
Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )
It walks the original data M chunks at a time performing the IFFT.
Each IFFT intermediate result is XORed together into the first M chunks of
the data layout. Finally the FFT is performed.
Encoder optimizations:
* The first IFFT can be performed directly in the first M chunks.
* The zero padding can be skipped while performing the final IFFT.
Unrolling is used in the code to accomplish both these optimizations.
* The final FFT can be truncated also if recovery set is not a power of 2.
It is easy to truncate the FFT by ending the inner loop early.
*/
/*
Decoder algorithm:
The decoder is described in {1}. Operations are done O(N Log N), where N is up
to twice the size of the original data as described below.
Roughly in brief:
Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )
Precalculations:
---------------
At startup initialization, FFTInitialize() precalculates FWT(L) as
described by equation (92) in {1}, where L = Log[i] for i = 0..Order,
Order = 256 or 65536 for FF8/16. This is stored in the LogWalsh vector.
It also precalculates the FFT skew factors (s_i) as described by
equation (28). This is stored in the FFTSkew vector.
For memory workspace N data chunks are needed, where N is a power of two
at or above M + K. K is the original data size and M is the next power
of two above the recovery data size. For example for K = 200 pieces of
data and 10% redundancy, there are 20 redundant pieces, which rounds up
to 32 = M. M + K = 232 pieces, so N rounds up to 256.
Online calculations:
-------------------
At runtime, the error locator polynomial is evaluated using the
Fast Walsh-Hadamard transform as described in {1} equation (92).
At runtime the data is explicit laid out in workspace memory like this:
[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]
Data that was lost is replaced with zeroes.
Data that was received, including recovery data, is multiplied by the error
locator polynomial as it is copied into the workspace.
The IFFT is applied to the entire workspace of N chunks.
Since the IFFT starts with pairs of inputs and doubles in width at each
iteration, the IFFT is optimized by skipping zero padding at the end until
it starts mixing with non-zero data.
The formal derivative is applied to the entire workspace of N chunks.
The FFT is applied to the entire workspace of N chunks.
The FFT is optimized by only performing intermediate calculations required
to recover lost data. Since it starts wide and ends up working on adjacent
pairs, at some point the intermediate results are not needed for data that
will not be read by the application. This optimization is implemented by
the ErrorBitfield class.
Finally, only recovered data is multiplied by the negative of the
error locator polynomial as it is copied into the front of the
workspace for the application to retrieve.
Future directions:
-----------------
Note that a faster decoder is described in {3} that is O(K Log M) instead,
which should be 2x faster than the current one. However I do not fully
understand how to implement it for this field and could use some help.
*/
/*
Finite field arithmetic optimizations:
For faster finite field multiplication, large tables are precomputed and
applied during encoding/decoding on 64 bytes of data at a time using
SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure.
Addition in this finite field is XOR, and a vectorized memory XOR routine
is also used.
*/
#include "leopard.h"

View File

@ -37,6 +37,8 @@
This finite field contains 65536 elements and so each element is one byte.
This library is designed for data that is a multiple of 64 bytes in size.
Algorithms are described in LeopardCommon.h
*/
namespace leopard { namespace ff16 {
@ -115,9 +117,9 @@ void ifft_butterfly4(
//------------------------------------------------------------------------------
// Encode
// Reed-Solomon Encode
void Encode(
void ReedSolomonEncode(
uint64_t buffer_bytes,
unsigned original_count,
unsigned recovery_count,
@ -127,9 +129,9 @@ void Encode(
//------------------------------------------------------------------------------
// Decode
// Reed-Solomon Decode
void Decode(
void ReedSolomonDecode(
uint64_t buffer_bytes,
unsigned original_count,
unsigned recovery_count,

View File

@ -778,6 +778,7 @@ static void FFTInitialize()
for (unsigned i = 0; i < kOrder; ++i)
LogWalsh[i] = LogLUT[i];
LogWalsh[0] = 0;
FWHT(LogWalsh, kBits);
}
@ -845,9 +846,9 @@ void VectorIFFTButterfly(
//------------------------------------------------------------------------------
// Encode
// Reed-Solomon Encode
void Encode(
void ReedSolomonEncode(
uint64_t buffer_bytes,
unsigned original_count,
unsigned recovery_count,
@ -1076,9 +1077,9 @@ void ErrorBitfield::Prepare()
//------------------------------------------------------------------------------
// Decode
// Reed-Solomon Decode
void Decode(
void ReedSolomonDecode(
uint64_t buffer_bytes,
unsigned original_count,
unsigned recovery_count,

View File

@ -37,6 +37,8 @@
This finite field contains 256 elements and so each element is one byte.
This library is designed for data that is a multiple of 64 bytes in size.
Algorithms are described in LeopardCommon.h
*/
namespace leopard { namespace ff8 {
@ -161,9 +163,9 @@ void VectorIFFTButterfly(
//------------------------------------------------------------------------------
// Encode
// Reed-Solomon Encode
void Encode(
void ReedSolomonEncode(
uint64_t buffer_bytes,
unsigned original_count,
unsigned recovery_count,
@ -173,9 +175,9 @@ void Encode(
//------------------------------------------------------------------------------
// Decode
// Reed-Solomon Decode
void Decode(
void ReedSolomonDecode(
uint64_t buffer_bytes,
unsigned original_count,
unsigned recovery_count,

294
README.md
View File

@ -1,31 +1,38 @@
# Leopard-RS
## Leopard Reed-Solomon Error Correction Codes in C
## Reed-Solomon Error Correction Codes for Large Data in C
#### This software is still under active development. It may or may not work right now. I'm trying to get it done ASAP. Current latest result is that K=128 code rate 1/2 is working and benchmarks are posted here: [http://catid.mechafetus.com/news/news.php?view=399](http://catid.mechafetus.com/news/news.php?view=399)
Leopard-RS is a portable, fast library for Forward Error Correction.
Leopard-RS is a fast library for Forward Error Correction.
From a block of equally sized original data pieces, it generates recovery
symbols that can be used to recover lost original data.
* It requires that data pieces are all a fixed size, a multiple of 64 bytes.
* The original and recovery data must not exceed 65536 pieces.
#### Motivation:
It gets slower as O(N Log N) in the input data size, and its inner loops are
vectorized using the best approaches available on modern processors, using the
fastest finite fields (8-bit or 16-bit Galois fields) for bulk data.
fastest finite fields (8-bit or 16-bit Galois fields with Cantor basis {2}).
It sets new speed records for MDS encoding and decoding of large data.
It is also the only open-source production ready software for this purpose
available today.
It sets new speed records for MDS encoding and decoding of large data,
achieving over 1.2 GB/s to encode with the AVX2 instruction set on a single core.
There is another library `FastECC` by Bulat-Ziganshin that should have similar performance:
[https://github.com/Bulat-Ziganshin/FastECC](https://github.com/Bulat-Ziganshin/FastECC)
Example applications are data recovery software and data center replication.
#### Encoder API:
Preconditions:
* The original and recovery data must not exceed 65536 pieces.
* The recovery_count <= original_count.
* The buffer_bytes must be a multiple of 64.
* Each buffer should have the same number of bytes.
* Even the last piece must be rounded up to the block size.
```
#include "leopard.h"
```
@ -39,55 +46,286 @@ For full documentation please read `leopard.h`.
#### Decoder API:
```
#include "leopard.h"
```
For full documentation please read `leopard.h`.
+ `leo_init()` : Initialize library.
+ `leo_decode_work_count()` : Calculate the number of work_data buffers to provide to leo_decode().
+ `leo_decode()` : Generate recovery data.
+ `leo_decode()` : Recover original data.
#### Benchmarks:
On my laptop:
```
TODO
Leopard Encoder(8.192 MB in 128 pieces, 128 losses): Input=1242.62 MB/s, Output=1242.62 MB/s
Leopard Decoder(8.192 MB in 128 pieces, 128 losses): Input=482.53 MB/s, Output=482.53 MB/s
```
#### Comparisons:
Comparing performance from all my error correction code libraries, on my laptop:
To summarize, a set of 128 of 64 KB data files are supplemented by about 128 redundant code pieces (encoded) meaning a code rate of 1/2. From those redundant code pieces the original set is recovered (decoded).
The results are all from libraries I've written over the past few years. They all have the same vector-optimized inner loops, but the types of error correction codes are different.
```
TODO
For 64KB data chunks:
CM256 Encoder: 64000 bytes k = 128 m = 128 : 82194.7 usec, 99.6658 MBps
CM256 Decoder: 64000 bytes k = 128 m = 128 : 78279.5 usec, 104.651 MBps
Longhair Encoded k=128 data blocks with m=128 recovery blocks in 81641.2 usec : 100.342 MB/s
Longhair Decoded 128 erasures in 85000.7 usec : 96.3757 MB/s
WH256 wirehair_encode(N = 128) in 12381.3 usec, 661.644 MB/s after 127.385 avg losses
WH256 wirehair_decode(N = 128) average overhead = 0.025 blocks, average reconstruct time = 9868.65 usec, 830.103 MB/s
FEC-AL Encoder(8.192 MB in 128 pieces, 128 losses): Input=518.545 MB/s, Output=518.545 MB/s, (Encode create: 3762.73 MB/s)
FEC-AL Decoder(8.192 MB in 128 pieces, 128 losses): Input=121.093 MB/s, Output=121.093 MB/s, (Overhead = 0 pieces)
Leopard Encoder(8.192 MB in 128 pieces, 128 losses): Input=1242.62 MB/s, Output=1242.62 MB/s
Leopard Decoder(8.192 MB in 128 pieces, 128 losses): Input=482.53 MB/s, Output=482.53 MB/s
```
For 128 data pieces of input and 128 data pieces of redundancy:
#### Background
Fastest to encode: Leopard (1.2 GB/s)
Distant second-place: WH256 (660 MB/s), FEC-AL (515 MB/s)
Slowest encoders: Longhair, CM256
Fastest to decode: WH256 (830 MB/s)
Distant second-place: Leopard (482 MB/s)
Slowest decoders: FEC-AL, CM256, Longhair
There are a lot of variables that affect when each of these libraries should be used.
Each one is ideal in a different situation, and no one library can be called the best overall.
The situation tested mainly helps explore the trade-offs of WH256, FEC-AL and Leopard for code rate 1/2.
##### CM256: Traditional O(N^2) Cauchy matrix MDS Reed-Solomon codec
Runs at about 100 MB/s encode and decode for this case.
This is an MDS code that uses a Cauchy matrix for structure.
Other examples of this type would be most MDS Reed-Solomon codecs online: Jerasure, Zfec, ISA-L, etc.
It requires SSSE3 or newer Intel instruction sets for this speed. Otherwise it runs much slower.
This type of software gets slower as O(K*M) where K = input count and M = recovery count.
It is practical for either small data or small recovery set up to 255 pieces.
It is available for production use under BSD license here:
http://github.com/catid/cm256
(Note that the inner loops can be optimized more by applying the GF256 library.)
##### Longhair: Binary O(N^2) Cauchy matrix MDS Reed-Solomon codec
Runs at about 100 MB/s encode and decode for this case.
This is an MDS code that uses a Cauchy matrix for structure.
This one only requires XOR operations so it can run fast on low-end processors.
Requires data is a multiple of 8 bytes.
This type of software gets slower as O(K*M) where K = input count and M = recovery count.
It is practical for either small data or small recovery set up to 255 pieces.
There is no other optimized software available online for this type of error correction code. There is a slow version available in the Jerasure software library.
It is available for production use under BSD license here:
http://github.com/catid/longhair
(Note that the inner loops can be optimized more by applying the GF256 library.)
##### Wirehair: O(N) Hybrid LDPC Erasure Code
Encodes at 660 MB/s, and decodes at 830 MB/s for ALL cases.
This is not an MDS code. It has about a 3% chance of failing to recover and requiring one extra block of data.
It uses mostly XOR so it only gets a little slower on lower-end processors.
This type of software gets slower as O(K) where K = input count.
This library incorporates some novel ideas that are unpublished. The new ideas are described in the source code.
It is practical for data up to 64,000 pieces and can be used as a "fountain" code.
There is no other optimized software available online for this type of error correction code. I believe there are some public (slow) implementations of Raptor codes available online for study.
It is available for production use under BSD license here:
http://github.com/catid/wirehair
There's a pre-production version that needs more work here using GF256 for more speed,
which is what I used for the benchmark:
http://github.com/catid/wh256
##### FEC-AL *new*: O(N^2/8) XOR Structured Convolutional Matrix Code
Encodes at 510 MB/s. Decodes at 121 MB/s.
This is not an MDS code. It has about a 1% chance of failing to recover and requiring one extra block of data.
This library incorporates some novel ideas that are unpublished. The new ideas are described in the README.
It uses mostly XOR operations so only gets about 2-4x slower on lower-end processors.
It gets slower as O(K*M/8) for larger data, bounded by the speed of XOR.
This new approach is ideal for streaming erasure codes; two implementations are offered one for files and another for real-time streaming reliable data.
It is practical for data up to about 4,000 pieces and can be used as a "fountain" code.
There is no other software available online for this type of error correction code.
It is available for production use under BSD license here:
http://github.com/catid/fecal
It can also be used as a convolutional streaming code here for e.g. rUDP:
http://github.com/catid/siamese
##### Leopard-RS *new*: O(K Log M) FFT MDS Reed-Solomon codec
Encodes at 1.2 GB/s, and decodes at 480 MB/s for this case.
12x faster than existing MDS approaches to encode, and almost 5x faster to decode.
This uses a recent result from 2014 introducing a novel polynomial basis permitting FFT over fast Galois fields.
This is an MDS Reed-Solomon similar to Jerasure, Zfec, ISA-L, etc, but much faster.
It requires SSSE3 or newer Intel instruction sets for this speed. Otherwise it runs much slower.
Requires data is a multiple of 64 bytes.
This type of software gets slower as O(K Log M) where K = input count, M = recovery count.
It is practical for extremely large data.
There is no other software available online for this type of error correction code.
#### FFT Data Layout:
We pack the data into memory in this order:
~~~
[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]
~~~
For encoding, the placement is implied instead of actual memory layout.
For decoding, the layout is explicitly used.
#### Encoder algorithm:
The encoder is described in {3}. Operations are done O(K Log M),
where K is the original data size, and M is up to twice the
size of the recovery set.
Roughly in brief:
~~~
Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )
~~~
It walks the original data M chunks at a time performing the IFFT.
Each IFFT intermediate result is XORed together into the first M chunks of
the data layout. Finally the FFT is performed.
Encoder optimizations:
* The first IFFT can be performed directly in the first M chunks.
* The zero padding can be skipped while performing the final IFFT.
Unrolling is used in the code to accomplish both these optimizations.
* The final FFT can be truncated also if recovery set is not a power of 2.
It is easy to truncate the FFT by ending the inner loop early.
#### Decoder algorithm:
The decoder is described in {1}. Operations are done O(N Log N), where N is up
to twice the size of the original data as described below.
Roughly in brief:
~~~
Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )
~~~
#### Precalculations:
At startup initialization, FFTInitialize() precalculates FWT(L) as
described by equation (92) in {1}, where L = Log[i] for i = 0..Order,
Order = 256 or 65536 for FF8/16. This is stored in the LogWalsh vector.
It also precalculates the FFT skew factors (s_i) as described by
equation (28). This is stored in the FFTSkew vector.
For memory workspace N data chunks are needed, where N is a power of two
at or above M + K. K is the original data size and M is the next power
of two above the recovery data size. For example for K = 200 pieces of
data and 10% redundancy, there are 20 redundant pieces, which rounds up
to 32 = M. M + K = 232 pieces, so N rounds up to 256.
#### Online calculations:
At runtime, the error locator polynomial is evaluated using the
Fast Walsh-Hadamard transform as described in {1} equation (92).
At runtime the data is explicit laid out in workspace memory like this:
[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]
Data that was lost is replaced with zeroes.
Data that was received, including recovery data, is multiplied by the error
locator polynomial as it is copied into the workspace.
The IFFT is applied to the entire workspace of N chunks.
Since the IFFT starts with pairs of inputs and doubles in width at each
iteration, the IFFT is optimized by skipping zero padding at the end until
it starts mixing with non-zero data.
The formal derivative is applied to the entire workspace of N chunks.
The FFT is applied to the entire workspace of N chunks.
The FFT is optimized by only performing intermediate calculations required
to recover lost data. Since it starts wide and ends up working on adjacent
pairs, at some point the intermediate results are not needed for data that
will not be read by the application. This optimization is implemented by
the ErrorBitfield class.
Finally, only recovered data is multiplied by the negative of the
error locator polynomial as it is copied into the front of the
workspace for the application to retrieve.
#### Future directions:
Note that a faster decoder is described in {3} that is O(K Log M) instead,
which should be 2x faster than the current one. However I do not fully
understand how to implement it for this field and could use some help.
#### Finite field arithmetic optimizations:
For faster finite field multiplication, large tables are precomputed and
applied during encoding/decoding on 64 bytes of data at a time using
SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure.
Addition in this finite field is XOR, and a vectorized memory XOR routine
is also used.
#### References:
This library implements an MDS erasure code introduced in this paper:
~~~
S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
{1} S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
"Novel Polynomial Basis with Fast Fourier Transform and Its Application to Reed-Solomon Erasure Codes"
IEEE Trans. on Information Theory, pp. 6284-6299, November, 2016.
~~~
The paper is available here: [http://ct.ee.ntust.edu.tw/it2016-2.pdf](http://ct.ee.ntust.edu.tw/it2016-2.pdf)
And also mirrored in the /docs/ folder.
~~~
{2} D. G. Cantor, "On arithmetical algorithms over finite fields",
Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.
~~~
The high-level summary is that instead of using complicated fields,
an additive FFT was introduced that works with familiar Galois fields for the first time.
This is actually a huge new result that will change how Reed-Solomon codecs will be written.
My contribution is extending the ALTMAP approach from Jerasure
for 16-bit Galois fields out to 64 bytes to enable AVX2 speedups,
and marry it with the row parallelism introduced by ISA-L.
~~~
{3} Sian-Jheng Lin, Wei-Ho Chung, “An Efficient (n, k) Information
Dispersal Algorithm for High Code Rate System over Fermat Fields,”
IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
~~~
Some papers are mirrored in the /docs/ folder.
#### Credits
The idea is the brain-child of S.-J. Lin. He is a super bright guy who should be recognized more widely!
Inspired by discussion with:
Sian-Jhen Lin <sjhenglin@gmail.com> : Author of {1} {3}, basis for Leopard
Bulat Ziganshin <bulat.ziganshin@gmail.com> : Author of FastECC
Yutaka Sawada <tenfon@outlook.jp> : Author of MultiPar
This software was written entirely by myself ( Christopher A. Taylor mrcatid@gmail.com ). If you find it useful and would like to buy me a coffee, consider tipping.

View File

@ -107,7 +107,7 @@ LEO_EXPORT LeopardResult leo_encode(
#ifdef LEO_HAS_FF8
if (n <= leopard::ff8::kOrder)
{
leopard::ff8::Encode(
leopard::ff8::ReedSolomonEncode(
buffer_bytes,
original_count,
recovery_count,
@ -120,7 +120,7 @@ LEO_EXPORT LeopardResult leo_encode(
#ifdef LEO_HAS_FF16
if (n <= leopard::ff16::kOrder)
{
leopard::ff16::Encode(
leopard::ff16::ReedSolomonEncode(
buffer_bytes,
original_count,
recovery_count,
@ -181,7 +181,7 @@ LEO_EXPORT LeopardResult leo_decode(
#ifdef LEO_HAS_FF8
if (n <= leopard::ff8::kOrder)
{
leopard::ff8::Decode(
leopard::ff8::ReedSolomonDecode(
buffer_bytes,
original_count,
recovery_count,
@ -196,7 +196,7 @@ LEO_EXPORT LeopardResult leo_decode(
#ifdef LEO_HAS_FF16
if (n <= leopard::ff16::kOrder)
{
leopard::ff16::Decode(
leopard::ff16::ReedSolomonDecode(
buffer_bytes,
original_count,
recovery_count,

View File

@ -30,8 +30,18 @@
#define CAT_LEOPARD_RS_H
/*
Leopard-RS: Reed-Solomon Error Correction Coding for Extremely Large Data
Leopard-RS: Reed-Solomon Error Correction Codes for Large Data in C
WORK IN PROGRESS - NON FUNCTIONAL
Algorithms are described in LeopardCommon.h
Inspired by discussion with:
Sian-Jhen Lin <sjhenglin@gmail.com> : Author of {1} {3}, basis for Leopard
Bulat Ziganshin <bulat.ziganshin@gmail.com> : Author of FastECC
Yutaka Sawada <tenfon@outlook.jp> : Author of MultiPar
References:
@ -42,6 +52,26 @@
{2} D. G. Cantor, "On arithmetical algorithms over finite fields",
Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.
{3} Sian-Jheng Lin, Wei-Ho Chung, An Efficient (n, k) Information
Dispersal Algorithm for High Code Rate System over Fermat Fields,
IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
*/
/*
TODO:
+ Fixes for all different input sizes
+ New 16-bit Muladd inner loops
+ Benchmarks for large data!
+ Add multi-threading to split up long parallelizable calculations
+ Write detailed comments for all the routines
+ Final benchmarks!
+ Release version 1
+ Finish up documentation
TBD:
+ Look into getting EncodeL working so we can support smaller data (Ask Lin)
+ Look into using FFT_m instead of FFT_n for decoder
*/
// Library version