mirror of
https://github.com/status-im/leopard.git
synced 2025-02-16 16:07:36 +00:00
Improve docs
parent 3f9b00e8ed
commit 5e61c917f7
121
LeopardCommon.h
@@ -29,19 +29,116 @@
#pragma once

/*
TODO:
+ Fixes for all different input sizes
+ New 16-bit Muladd inner loops
+ Benchmarks for large data!
+ Add multi-threading to split up long parallelizable calculations
+ Write detailed comments for all the routines
+ Final benchmarks!
+ Release version 1
+ Finish up documentation
FFT Data Layout:

TBD:
+ Look into getting EncodeL working so we can support smaller data (Ask Lin)
+ Look into using FFT_m instead of FFT_n for decoder
We pack the data into memory in this order:

[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]

For encoding, the placement is implied rather than an actual memory layout.
For decoding, the layout is explicitly used.
*/

/*
Encoder algorithm:

The encoder is described in {3}. It runs in O(K Log M) operations,
where K is the original data size, and M is up to twice the
size of the recovery set.

In brief:

Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )

It walks the original data M chunks at a time, performing the IFFT on each group.
Each IFFT intermediate result is XORed together into the first M chunks of
the data layout. Finally the FFT is performed.

Encoder optimizations:
* The first IFFT can be performed directly in the first M chunks.
* The zero padding can be skipped while performing the final IFFT.
Unrolling is used in the code to accomplish both these optimizations.
* The final FFT can also be truncated if the recovery set is not a power of 2.
It is easy to truncate the FFT by ending the inner loop early.
*/

/*
Decoder algorithm:

The decoder is described in {1}. It runs in O(N Log N) operations, where N is up
to twice the size of the original data as described below.

In brief:

Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )


Precalculations:
---------------

At startup initialization, FFTInitialize() precalculates FWT(L) as
described by equation (92) in {1}, where L = Log[i] for i = 0..Order,
Order = 256 or 65536 for FF8/16. This is stored in the LogWalsh vector.

It also precalculates the FFT skew factors (s_i) as described by
equation (28). This is stored in the FFTSkew vector.

For the memory workspace, N data chunks are needed, where N is a power of two
at or above M + K. K is the original data size and M is the next power
of two at or above the recovery data size. For example, for K = 200 pieces of
data and 10% redundancy, there are 20 redundant pieces, which rounds up
to 32 = M. M + K = 232 pieces, so N rounds up to 256.


Online calculations:
-------------------

At runtime, the error locator polynomial is evaluated using the
Fast Walsh-Hadamard transform as described in {1} equation (92).

At runtime the data is explicitly laid out in workspace memory like this:
[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]

Data that was lost is replaced with zeroes.
Data that was received, including recovery data, is multiplied by the error
locator polynomial as it is copied into the workspace.

The IFFT is applied to the entire workspace of N chunks.
Since the IFFT starts with pairs of inputs and doubles in width at each
iteration, the IFFT is optimized by skipping zero padding at the end until
it starts mixing with non-zero data.

The formal derivative is applied to the entire workspace of N chunks.

The FFT is applied to the entire workspace of N chunks.
The FFT is optimized by only performing intermediate calculations required
to recover lost data. Since it starts wide and ends up working on adjacent
pairs, at some point the intermediate results are not needed for data that
will not be read by the application. This optimization is implemented by
the ErrorBitfield class.

Finally, only recovered data is multiplied by the negative of the
error locator polynomial as it is copied into the front of the
workspace for the application to retrieve.


Future directions:
-----------------

Note that a faster decoder is described in {3} that is O(K Log M) instead,
which should be 2x faster than the current one. However, I do not fully
understand how to implement it for this field and could use some help.
*/

/*
Finite field arithmetic optimizations:

For faster finite field multiplication, large tables are precomputed and
applied during encoding/decoding on 64 bytes of data at a time using
SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure.

Addition in this finite field is XOR, and a vectorized memory XOR routine
is also used.
*/

#include "leopard.h"

@@ -37,6 +37,8 @@

This finite field contains 65536 elements and so each element is two bytes.
This library is designed for data that is a multiple of 64 bytes in size.

Algorithms are described in LeopardCommon.h
*/

namespace leopard { namespace ff16 {
@@ -115,9 +117,9 @@ void ifft_butterfly4(


//------------------------------------------------------------------------------
// Encode
// Reed-Solomon Encode

void Encode(
void ReedSolomonEncode(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count,
@@ -127,9 +129,9 @@ void Encode(


//------------------------------------------------------------------------------
// Decode
// Reed-Solomon Decode

void Decode(
void ReedSolomonDecode(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count,

@@ -778,6 +778,7 @@ static void FFTInitialize()
    for (unsigned i = 0; i < kOrder; ++i)
        LogWalsh[i] = LogLUT[i];
    LogWalsh[0] = 0;

    FWHT(LogWalsh, kBits);
}

@@ -845,9 +846,9 @@ void VectorIFFTButterfly(


//------------------------------------------------------------------------------
// Encode
// Reed-Solomon Encode

void Encode(
void ReedSolomonEncode(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count,
@@ -1076,9 +1077,9 @@ void ErrorBitfield::Prepare()


//------------------------------------------------------------------------------
// Decode
// Reed-Solomon Decode

void Decode(
void ReedSolomonDecode(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count,

10
LeopardFF8.h
@@ -37,6 +37,8 @@

This finite field contains 256 elements and so each element is one byte.
This library is designed for data that is a multiple of 64 bytes in size.

Algorithms are described in LeopardCommon.h
*/

namespace leopard { namespace ff8 {
@@ -161,9 +163,9 @@ void VectorIFFTButterfly(


//------------------------------------------------------------------------------
// Encode
// Reed-Solomon Encode

void Encode(
void ReedSolomonEncode(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count,
@@ -173,9 +175,9 @@ void Encode(


//------------------------------------------------------------------------------
// Decode
// Reed-Solomon Decode

void Decode(
void ReedSolomonDecode(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count,

294
README.md
@@ -1,31 +1,38 @@
# Leopard-RS
## Leopard Reed-Solomon Error Correction Codes in C
## Reed-Solomon Error Correction Codes for Large Data in C

#### This software is still under active development. It may or may not work right now. I'm trying to get it done ASAP. Current latest result is that K=128 code rate 1/2 is working and benchmarks are posted here: [http://catid.mechafetus.com/news/news.php?view=399](http://catid.mechafetus.com/news/news.php?view=399)

Leopard-RS is a portable, fast library for Forward Error Correction.
Leopard-RS is a fast library for Forward Error Correction.
From a block of equally sized original data pieces, it generates recovery
symbols that can be used to recover lost original data.

* It requires that data pieces are all a fixed size, a multiple of 64 bytes.
* The original and recovery data must not exceed 65536 pieces.


#### Motivation:

It gets slower as O(N Log N) in the input data size, and its inner loops are
vectorized using the best approaches available on modern processors, using the
fastest finite fields (8-bit or 16-bit Galois fields) for bulk data.
fastest finite fields (8-bit or 16-bit Galois fields with Cantor basis {2}).

It sets new speed records for MDS encoding and decoding of large data.
It is also the only open-source production ready software for this purpose
available today.
It sets new speed records for MDS encoding and decoding of large data,
achieving over 1.2 GB/s to encode with the AVX2 instruction set on a single core.

There is another library `FastECC` by Bulat-Ziganshin that should have similar performance:
[https://github.com/Bulat-Ziganshin/FastECC](https://github.com/Bulat-Ziganshin/FastECC)

Example applications are data recovery software and data center replication.


#### Encoder API:

Preconditions:

* The original and recovery data must not exceed 65536 pieces.
* The recovery_count <= original_count.
* The buffer_bytes must be a multiple of 64.
* Each buffer should have the same number of bytes.
* Even the last piece must be rounded up to the block size.
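
As a rough illustration of the preconditions above, the checks could be collected into a small helper that runs before the encoder is called. This is only a sketch: `CheckEncodeParams` and `RoundUpToBlock` are hypothetical names that are not part of `leopard.h`, and rejecting zero counts is an assumption.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of leopard.h.
static const uint64_t kBlockBytes = 64;    // piece size must be a multiple of this
static const unsigned kMaxPieces = 65536;  // limit on original + recovery pieces

// Round a piece size up to the next multiple of 64 bytes.
inline uint64_t RoundUpToBlock(uint64_t bytes)
{
    return (bytes + kBlockBytes - 1) / kBlockBytes * kBlockBytes;
}

// Returns true if the parameters satisfy the documented preconditions.
inline bool CheckEncodeParams(
    uint64_t buffer_bytes,
    unsigned original_count,
    unsigned recovery_count)
{
    // Assumption: empty inputs are treated as invalid
    if (original_count == 0 || recovery_count == 0)
        return false;
    // buffer_bytes must be a non-zero multiple of 64
    if (buffer_bytes == 0 || buffer_bytes % kBlockBytes != 0)
        return false;
    // recovery_count <= original_count
    if (recovery_count > original_count)
        return false;
    // original + recovery must not exceed 65536 pieces
    if ((uint64_t)original_count + recovery_count > kMaxPieces)
        return false;
    return true;
}
```

For example, a final piece of 1000 bytes would be padded out to `RoundUpToBlock(1000)`, which is 1024 bytes, before encoding.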

```
#include "leopard.h"
```
@@ -39,55 +46,286 @@ For full documentation please read `leopard.h`.

#### Decoder API:

```
#include "leopard.h"
```

For full documentation please read `leopard.h`.

+ `leo_init()` : Initialize library.
+ `leo_decode_work_count()` : Calculate the number of work_data buffers to provide to leo_decode().
+ `leo_decode()` : Generate recovery data.
+ `leo_decode()` : Recover original data.


#### Benchmarks:

On my laptop:

```
TODO
Leopard Encoder(8.192 MB in 128 pieces, 128 losses): Input=1242.62 MB/s, Output=1242.62 MB/s
Leopard Decoder(8.192 MB in 128 pieces, 128 losses): Input=482.53 MB/s, Output=482.53 MB/s
```


#### Comparisons:

Comparing performance from all my error correction code libraries, on my laptop:

To summarize, a set of 128 data files of 64 KB each is supplemented by about 128 redundant code pieces (encoded), giving a code rate of 1/2. From those redundant code pieces the original set is recovered (decoded).

The results are all from libraries I've written over the past few years. They all have the same vector-optimized inner loops, but the types of error correction codes are different.

```
TODO
For 64KB data chunks:

CM256 Encoder: 64000 bytes k = 128 m = 128 : 82194.7 usec, 99.6658 MBps
CM256 Decoder: 64000 bytes k = 128 m = 128 : 78279.5 usec, 104.651 MBps

Longhair Encoded k=128 data blocks with m=128 recovery blocks in 81641.2 usec : 100.342 MB/s
Longhair Decoded 128 erasures in 85000.7 usec : 96.3757 MB/s

WH256 wirehair_encode(N = 128) in 12381.3 usec, 661.644 MB/s after 127.385 avg losses
WH256 wirehair_decode(N = 128) average overhead = 0.025 blocks, average reconstruct time = 9868.65 usec, 830.103 MB/s

FEC-AL Encoder(8.192 MB in 128 pieces, 128 losses): Input=518.545 MB/s, Output=518.545 MB/s, (Encode create: 3762.73 MB/s)
FEC-AL Decoder(8.192 MB in 128 pieces, 128 losses): Input=121.093 MB/s, Output=121.093 MB/s, (Overhead = 0 pieces)

Leopard Encoder(8.192 MB in 128 pieces, 128 losses): Input=1242.62 MB/s, Output=1242.62 MB/s
Leopard Decoder(8.192 MB in 128 pieces, 128 losses): Input=482.53 MB/s, Output=482.53 MB/s
```
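
The MB/s figures in the listing above appear to follow from piece size, piece count and elapsed time, with 1 MB taken as 1,000,000 bytes (so bytes per microsecond equals MB/s). A minimal sketch of that arithmetic, where the helper name `ThroughputMBps` is made up for illustration:

```cpp
// Hypothetical helper: throughput in MB/s, taking 1 MB = 1,000,000 bytes.
// Bytes per microsecond is numerically the same as megabytes per second.
inline double ThroughputMBps(double piece_bytes, unsigned piece_count, double elapsed_usec)
{
    return piece_bytes * piece_count / elapsed_usec;
}
```

With the CM256 encoder line, 64000 bytes x 128 pieces in 82194.7 usec works out to about 99.67 MB/s, which agrees with the reported 99.6658 MBps.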

For 128 data pieces of input and 128 data pieces of redundancy:

#### Background
Fastest to encode: Leopard (1.2 GB/s)
Distant second-place: WH256 (660 MB/s), FEC-AL (515 MB/s)
Slowest encoders: Longhair, CM256

Fastest to decode: WH256 (830 MB/s)
Distant second-place: Leopard (482 MB/s)
Slowest decoders: FEC-AL, CM256, Longhair

There are a lot of variables that affect when each of these libraries should be used.
Each one is ideal in a different situation, and no one library can be called the best overall.
The situation tested mainly helps explore the trade-offs of WH256, FEC-AL and Leopard for code rate 1/2.


##### CM256: Traditional O(N^2) Cauchy matrix MDS Reed-Solomon codec

Runs at about 100 MB/s encode and decode for this case.
This is an MDS code that uses a Cauchy matrix for structure.
Other examples of this type would be most MDS Reed-Solomon codecs online: Jerasure, Zfec, ISA-L, etc.
It requires SSSE3 or newer Intel instruction sets for this speed. Otherwise it runs much slower.
This type of software gets slower as O(K*M) where K = input count and M = recovery count.
It is practical for either small data or a small recovery set, up to 255 pieces.

It is available for production use under BSD license here:
http://github.com/catid/cm256
(Note that the inner loops can be optimized more by applying the GF256 library.)


##### Longhair: Binary O(N^2) Cauchy matrix MDS Reed-Solomon codec

Runs at about 100 MB/s encode and decode for this case.
This is an MDS code that uses a Cauchy matrix for structure.
This one only requires XOR operations so it can run fast on low-end processors.
Requires that data be a multiple of 8 bytes.
This type of software gets slower as O(K*M) where K = input count and M = recovery count.
It is practical for either small data or a small recovery set, up to 255 pieces.
There is no other optimized software available online for this type of error correction code. There is a slow version available in the Jerasure software library.

It is available for production use under BSD license here:
http://github.com/catid/longhair
(Note that the inner loops can be optimized more by applying the GF256 library.)


##### Wirehair: O(N) Hybrid LDPC Erasure Code

Encodes at 660 MB/s, and decodes at 830 MB/s for ALL cases.
This is not an MDS code. It has about a 3% chance of failing to recover and requiring one extra block of data.
It uses mostly XOR so it only gets a little slower on lower-end processors.
This type of software gets slower as O(K) where K = input count.
This library incorporates some novel ideas that are unpublished. The new ideas are described in the source code.
It is practical for data up to 64,000 pieces and can be used as a "fountain" code.
There is no other optimized software available online for this type of error correction code. I believe there are some public (slow) implementations of Raptor codes available online for study.

It is available for production use under BSD license here:
http://github.com/catid/wirehair

There's a pre-production version here that uses GF256 for more speed but needs more work;
it is what I used for the benchmark:
http://github.com/catid/wh256


##### FEC-AL *new*: O(N^2/8) XOR Structured Convolutional Matrix Code

Encodes at 510 MB/s. Decodes at 121 MB/s.
This is not an MDS code. It has about a 1% chance of failing to recover and requiring one extra block of data.
This library incorporates some novel ideas that are unpublished. The new ideas are described in the README.
It uses mostly XOR operations so it only gets about 2-4x slower on lower-end processors.
It gets slower as O(K*M/8) for larger data, bounded by the speed of XOR.
This new approach is ideal for streaming erasure codes. Two implementations are offered: one for files and another for real-time streaming of reliable data.
It is practical for data up to about 4,000 pieces and can be used as a "fountain" code.
There is no other software available online for this type of error correction code.

It is available for production use under BSD license here:
http://github.com/catid/fecal

It can also be used as a convolutional streaming code here for e.g. rUDP:
http://github.com/catid/siamese


##### Leopard-RS *new*: O(K Log M) FFT MDS Reed-Solomon codec

Encodes at 1.2 GB/s, and decodes at 480 MB/s for this case.
12x faster than existing MDS approaches to encode, and almost 5x faster to decode.
This uses a recent result from 2014 introducing a novel polynomial basis permitting FFT over fast Galois fields.
This is an MDS Reed-Solomon codec similar to Jerasure, Zfec, ISA-L, etc, but much faster.
It requires SSSE3 or newer Intel instruction sets for this speed. Otherwise it runs much slower.
Requires that data be a multiple of 64 bytes.
This type of software gets slower as O(K Log M) where K = input count, M = recovery count.
It is practical for extremely large data.
There is no other software available online for this type of error correction code.


#### FFT Data Layout:

We pack the data into memory in this order:

~~~
[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]
~~~

For encoding, the placement is implied rather than an actual memory layout.
For decoding, the layout is explicitly used.


#### Encoder algorithm:

The encoder is described in {3}. It runs in O(K Log M) operations,
where K is the original data size, and M is up to twice the
size of the recovery set.

In brief:

~~~
Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )
~~~

It walks the original data M chunks at a time, performing the IFFT on each group.
Each IFFT intermediate result is XORed together into the first M chunks of
the data layout. Finally the FFT is performed.
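
The walk-and-XOR structure described above can be sketched as follows. This is a structural illustration only: each chunk is reduced to a single 16-bit symbol, and the IFFT and FFT are caller-supplied callbacks, so the block shows the accumulation loop but not the transforms themselves. `GroupTransform` and `EncodeSketch` are hypothetical names, and passing each group's start index to the IFFT callback is an assumption, on the idea that every group sits at a different position in the data layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback type: transforms m symbols in place.
// The second argument is the index of the first original piece in the group.
typedef std::function<void(std::vector<uint16_t>&, size_t)> GroupTransform;

// Structure of: Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )
// original holds K symbols; m is a power of two at or above the recovery count.
inline std::vector<uint16_t> EncodeSketch(
    const std::vector<uint16_t>& original,
    size_t m,
    const GroupTransform& ifft,
    const GroupTransform& fft)
{
    std::vector<uint16_t> work(m, 0);
    if (m == 0)
        return work;

    // Walk the original data m symbols at a time
    for (size_t start = 0; start < original.size(); start += m)
    {
        // Copy one group, zero padding the final partial group
        std::vector<uint16_t> group(m, 0);
        for (size_t i = 0; i < m && start + i < original.size(); ++i)
            group[i] = original[start + i];

        ifft(group, start);

        // Addition in the field is XOR, so results accumulate with XOR
        for (size_t i = 0; i < m; ++i)
            work[i] ^= group[i];
    }

    // One FFT over the accumulated m symbols
    fft(work, 0);
    return work;
}
```

With do-nothing callbacks the function simply XORs the groups together, which is a convenient way to check the loop bounds and the zero padding of the last group.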


Encoder optimizations:
* The first IFFT can be performed directly in the first M chunks.
* The zero padding can be skipped while performing the final IFFT.
Unrolling is used in the code to accomplish both these optimizations.
* The final FFT can also be truncated if the recovery set is not a power of 2.
It is easy to truncate the FFT by ending the inner loop early.


#### Decoder algorithm:

The decoder is described in {1}. It runs in O(N Log N) operations, where N is up
to twice the size of the original data as described below.

In brief:

~~~
Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )
~~~


#### Precalculations:

At startup initialization, FFTInitialize() precalculates FWT(L) as
described by equation (92) in {1}, where L = Log[i] for i = 0..Order,
Order = 256 or 65536 for FF8/16. This is stored in the LogWalsh vector.

It also precalculates the FFT skew factors (s_i) as described by
equation (28). This is stored in the FFTSkew vector.

For the memory workspace, N data chunks are needed, where N is a power of two
at or above M + K. K is the original data size and M is the next power
of two at or above the recovery data size. For example, for K = 200 pieces of
data and 10% redundancy, there are 20 redundant pieces, which rounds up
to 32 = M. M + K = 232 pieces, so N rounds up to 256.
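
The sizing rule above can be worked through in a few lines. A minimal sketch, assuming hypothetical helper names (`NextPow2Sketch`, `DecodeWorkspaceSketch`) and treating a recovery count that is already a power of two as its own M:

```cpp
#include <cstdint>

// Smallest power of two that is >= n, for 1 <= n <= 2^31.
inline uint32_t NextPow2Sketch(uint32_t n)
{
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

struct WorkspaceSize
{
    uint32_t m;  // recovery count rounded up to a power of two
    uint32_t n;  // workspace chunk count: power of two at or above m + k
};

// Hypothetical helper; k = original_count.
inline WorkspaceSize DecodeWorkspaceSketch(uint32_t original_count, uint32_t recovery_count)
{
    WorkspaceSize w;
    w.m = NextPow2Sketch(recovery_count);
    w.n = NextPow2Sketch(w.m + original_count);
    return w;
}
```

Calling `DecodeWorkspaceSketch(200, 20)` reproduces the example in the text: M = 32 and N = 256.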


#### Online calculations:

At runtime, the error locator polynomial is evaluated using the
Fast Walsh-Hadamard transform as described in {1} equation (92).
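
For readers unfamiliar with the transform, here is a generic in-place fast Walsh-Hadamard transform over plain integers. It is a textbook sketch rather than the library's routine: the `FWHT()` call in this codebase works on log-domain field values, presumably with modular arithmetic, which this example leaves out. The name `FwhtSketch` is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// In-place fast Walsh-Hadamard transform (unnormalized).
// data.size() must be a power of two.
inline void FwhtSketch(std::vector<int64_t>& data)
{
    const size_t n = data.size();

    // Butterfly width doubles each pass: 1, 2, 4, ...
    for (size_t width = 1; width < n; width *= 2)
    {
        for (size_t i = 0; i < n; i += 2 * width)
        {
            for (size_t j = i; j < i + width; ++j)
            {
                const int64_t a = data[j];
                const int64_t b = data[j + width];
                data[j] = a + b;          // sum goes to the lower half
                data[j + width] = a - b;  // difference goes to the upper half
            }
        }
    }
}
```

The transform takes n log2(n) additions and subtractions, and applying it twice returns the input scaled by n.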

At runtime the data is explicitly laid out in workspace memory like this:
[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]

Data that was lost is replaced with zeroes.
Data that was received, including recovery data, is multiplied by the error
locator polynomial as it is copied into the workspace.

The IFFT is applied to the entire workspace of N chunks.
Since the IFFT starts with pairs of inputs and doubles in width at each
iteration, the IFFT is optimized by skipping zero padding at the end until
it starts mixing with non-zero data.

The formal derivative is applied to the entire workspace of N chunks.

The FFT is applied to the entire workspace of N chunks.
The FFT is optimized by only performing intermediate calculations required
to recover lost data. Since it starts wide and ends up working on adjacent
pairs, at some point the intermediate results are not needed for data that
will not be read by the application. This optimization is implemented by
the ErrorBitfield class.

Finally, only recovered data is multiplied by the negative of the
error locator polynomial as it is copied into the front of the
workspace for the application to retrieve.


#### Future directions:

Note that a faster decoder is described in {3} that is O(K Log M) instead,
which should be 2x faster than the current one. However, I do not fully
understand how to implement it for this field and could use some help.


#### Finite field arithmetic optimizations:

For faster finite field multiplication, large tables are precomputed and
applied during encoding/decoding on 64 bytes of data at a time using
SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure.

Addition in this finite field is XOR, and a vectorized memory XOR routine
is also used.
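
To make the table idea concrete, here is a scalar sketch of multiplying a buffer by a constant and XORing the product into a destination, using two 16-entry tables indexed by the low and high nibble of each byte. Splitting by nibble is what makes byte-shuffle instructions such as PSHUFB applicable, since one shuffle performs 16 or more such lookups at once. The sketch assumes GF(2^8) with polynomial 0x11D in the standard basis; the names (`GfMulSlow`, `MulTables`, `MulAddMem`) are hypothetical, and the library's actual field representation and vectorized 64-byte loops are not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: GF(2^8) reduced by x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
static const unsigned kSketchPolynomial = 0x11D;

// Reference bit-by-bit multiply in GF(2^8).
inline uint8_t GfMulSlow(uint8_t a, uint8_t b)
{
    unsigned product = 0;
    unsigned x = a;
    for (unsigned bit = 0; bit < 8; ++bit)
    {
        if (b & (1u << bit))
            product ^= x;  // add a * 2^bit
        x <<= 1;           // multiply x by 2
        if (x & 0x100)
            x ^= kSketchPolynomial;  // reduce back to 8 bits
    }
    return (uint8_t)product;
}

// Two 16-entry tables for multiplying by one fixed constant.
struct MulTables
{
    uint8_t lo[16];  // c * (low nibble)
    uint8_t hi[16];  // c * (high nibble << 4)
};

inline MulTables MakeMulTables(uint8_t c)
{
    MulTables t;
    for (unsigned i = 0; i < 16; ++i)
    {
        t.lo[i] = GfMulSlow(c, (uint8_t)i);
        t.hi[i] = GfMulSlow(c, (uint8_t)(i << 4));
    }
    return t;
}

// dst[i] ^= c * src[i], where the tables were built for constant c.
// Multiplication distributes over XOR, so c*x = c*lo(x) ^ c*hi(x).
inline void MulAddMem(uint8_t* dst, const uint8_t* src, const MulTables& t, size_t bytes)
{
    for (size_t i = 0; i < bytes; ++i)
        dst[i] ^= (uint8_t)(t.lo[src[i] & 15] ^ t.hi[src[i] >> 4]);
}
```

A vectorized version would replace the inner loop body with shuffle and XOR instructions operating on 16 or 32 bytes per step, keeping the same two-table decomposition.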


#### References:

This library implements an MDS erasure code introduced in this paper:

~~~
S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
{1} S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
"Novel Polynomial Basis with Fast Fourier Transform and Its Application to Reed-Solomon Erasure Codes"
IEEE Trans. on Information Theory, pp. 6284-6299, November, 2016.
~~~

The paper is available here: [http://ct.ee.ntust.edu.tw/it2016-2.pdf](http://ct.ee.ntust.edu.tw/it2016-2.pdf)
And also mirrored in the /docs/ folder.
~~~
{2} D. G. Cantor, "On arithmetical algorithms over finite fields",
Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.
~~~

The high-level summary is that instead of using complicated fields,
an additive FFT was introduced that works with familiar Galois fields for the first time.
This is actually a huge new result that will change how Reed-Solomon codecs are written.

My contribution is extending the ALTMAP approach from Jerasure
for 16-bit Galois fields out to 64 bytes to enable AVX2 speedups,
and marrying it with the row parallelism introduced by ISA-L.
~~~
{3} Sian-Jheng Lin, Wei-Ho Chung, “An Efficient (n, k) Information
Dispersal Algorithm for High Code Rate System over Fermat Fields,”
IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
~~~

Some papers are mirrored in the /docs/ folder.


#### Credits

The idea is the brain-child of S.-J. Lin. He is a super bright guy who should be recognized more widely!
Inspired by discussion with:

Sian-Jheng Lin <sjhenglin@gmail.com> : Author of {1} {3}, basis for Leopard
Bulat Ziganshin <bulat.ziganshin@gmail.com> : Author of FastECC
Yutaka Sawada <tenfon@outlook.jp> : Author of MultiPar

This software was written entirely by me (Christopher A. Taylor, mrcatid@gmail.com). If you find it useful and would like to buy me a coffee, consider tipping.

@@ -107,7 +107,7 @@ LEO_EXPORT LeopardResult leo_encode(
#ifdef LEO_HAS_FF8
    if (n <= leopard::ff8::kOrder)
    {
        leopard::ff8::Encode(
        leopard::ff8::ReedSolomonEncode(
            buffer_bytes,
            original_count,
            recovery_count,
@@ -120,7 +120,7 @@ LEO_EXPORT LeopardResult leo_encode(
#ifdef LEO_HAS_FF16
    if (n <= leopard::ff16::kOrder)
    {
        leopard::ff16::Encode(
        leopard::ff16::ReedSolomonEncode(
            buffer_bytes,
            original_count,
            recovery_count,
@@ -181,7 +181,7 @@ LEO_EXPORT LeopardResult leo_decode(
#ifdef LEO_HAS_FF8
    if (n <= leopard::ff8::kOrder)
    {
        leopard::ff8::Decode(
        leopard::ff8::ReedSolomonDecode(
            buffer_bytes,
            original_count,
            recovery_count,
@@ -196,7 +196,7 @@ LEO_EXPORT LeopardResult leo_decode(
#ifdef LEO_HAS_FF16
    if (n <= leopard::ff16::kOrder)
    {
        leopard::ff16::Decode(
        leopard::ff16::ReedSolomonDecode(
            buffer_bytes,
            original_count,
            recovery_count,

32
leopard.h
@@ -30,8 +30,18 @@
#define CAT_LEOPARD_RS_H

/*
Leopard-RS: Reed-Solomon Error Correction Coding for Extremely Large Data
Leopard-RS: Reed-Solomon Error Correction Codes for Large Data in C

WORK IN PROGRESS - NON FUNCTIONAL

Algorithms are described in LeopardCommon.h


Inspired by discussion with:

Sian-Jheng Lin <sjhenglin@gmail.com> : Author of {1} {3}, basis for Leopard
Bulat Ziganshin <bulat.ziganshin@gmail.com> : Author of FastECC
Yutaka Sawada <tenfon@outlook.jp> : Author of MultiPar

References:

@@ -42,6 +52,26 @@

{2} D. G. Cantor, "On arithmetical algorithms over finite fields",
Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.

{3} Sian-Jheng Lin, Wei-Ho Chung, “An Efficient (n, k) Information
Dispersal Algorithm for High Code Rate System over Fermat Fields,”
IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
*/

/*
TODO:
+ Fixes for all different input sizes
+ New 16-bit Muladd inner loops
+ Benchmarks for large data!
+ Add multi-threading to split up long parallelizable calculations
+ Write detailed comments for all the routines
+ Final benchmarks!
+ Release version 1
+ Finish up documentation

TBD:
+ Look into getting EncodeL working so we can support smaller data (Ask Lin)
+ Look into using FFT_m instead of FFT_n for decoder
*/

// Library version
