basic functionality seems to work

2026-07-29 20:33:18 +00:00 · 2026-05-03 20:29:09 +02:00 · 2026-05-03 20:29:09 +02:00 · cd6b61fab8
commit cd6b61fab8
parent a72ce0e474
25 changed files with 15428 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,7 @@
+.DS_Store
+dist
+dist-newstyle
+tmp
+*.a
+*.o
+*.hi
--- a/29
+++ b/29
@@ -0,0 +1,29 @@
+BSD 3-Clause License
+
+Copyright (c) 2017 Christopher A. Taylor, and 2026 Logos
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,54 @@
+Leopard fast erasure coding library
+-----------------------------------
+
+This is a Haskell binding to the ["Leopard" erasure coding library](https://github.com/catid/leopard)
+by Christopher A. Taylor.
+
+### What's this about?
+
+Erasure coding allows you to reconstruct redundantly encoded data even if some
+pieces are missing. For example, if you encode a piece of data with 10-out-of-15
+encoding (usually denoted by `K=10` and `N=15`), then the data is encoded into 15
+pieces, and any 10 pieces (together with their indices in 1..15) can reconstruct
+the original data.
+
+This is very useful for example when dealing with unreliable networks.
+
+Leopard uses [Reed-Solomon code](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) 
+over binary fields `GF(2^8)` or `GF(2^16)` and low-level optimizations to achieve 
+high performance.
+
+Reed-Solomon codes also guarantee that any `K` out of `N` pieces can recover the
+data, where `K` pieces together have exactly the size of the original data (however,
+you also need to know which of the `N` pieces each available piece is).
+
+### Standard notations
+
+The encoding algorithm is called the "code". The original data is chunked into 
+`K >= 1` pieces. This is then encoded into `N > K` redundant pieces. The ratio 
+`rho = K / N < 1` is called the "rate" of the code. The expansion factor `1 / rho = N / K`
+is the redundancy overhead. Leopard only supports `1/2 <= rho < 1`, that is,
+the encoded data is at most twice the size of the original data.
+
+Leopard uses a so-called "systematic code", which means that the first `K` pieces
+are simply the original data. The notation `M = N - K` for the number of remaining
+"parity" pieces is also standard.
+
+Internally, Leopard encodes `K` 8-bit or 16-bit words ("symbols") into `N` words. By
+partitioning the original dataset into sets of `K` bytes (or 16-bit words), we can
+trivially recover the above semantics.
+
+### Limitations
+
+Leopard itself has some limitations on the parameters:
+
+- `K >= 2`
+- `M <= K`
+- `N = K + M <= 65536`
+- the chunk size (in bytes) must be divisible by 64.
+
+### Compatibility
+
+I do not have much experience with linking C++ and Haskell. This was tested only
+on a single ARM-based computer running macOS.
+
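The limits listed in the README above can be checked before any data is handed to the library. The following is a minimal sketch under stated assumptions: `CodeParams`, `check_params` and `rate` are hypothetical names used only for illustration, and are not part of Leopard or of the Haskell binding. Rejecting a chunk size of zero is also an assumption; the README only requires divisibility by 64.

```cpp
#include <cstdint>

// Hypothetical helper type (not part of Leopard or of this binding).
struct CodeParams {
    std::uint32_t k;           // number of original pieces (K)
    std::uint32_t m;           // number of parity pieces (M = N - K)
    std::uint64_t chunk_bytes; // size of each piece in bytes
};

// Returns nullptr if the parameters satisfy the limits listed in the README,
// otherwise a short description of the first violated limit.
inline const char* check_params(const CodeParams& p) {
    if (p.k < 2) return "K must be at least 2";
    if (p.m < 1) return "M must be at least 1 (N > K)";
    if (p.m > p.k) return "M must not exceed K";
    if (static_cast<std::uint64_t>(p.k) + p.m > 65536)
        return "N = K + M must not exceed 65536";
    // Assumption: an empty chunk is rejected as well.
    if (p.chunk_bytes == 0 || p.chunk_bytes % 64 != 0)
        return "chunk size must be a non-zero multiple of 64 bytes";
    return nullptr;
}

// Rate of the code, rho = K / N with N = K + M.
inline double rate(const CodeParams& p) {
    return static_cast<double>(p.k) / static_cast<double>(p.k + p.m);
}
```

For the 10-out-of-15 example from the README, `check_params({10, 5, 64})` returns `nullptr` and `rate` gives `10 / 15`, which lies inside the supported range `1/2 <= rho < 1`.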
--- a/cpp/LICENSE
+++ b/cpp/LICENSE
@@ -0,0 +1,29 @@
+BSD 3-Clause License
+
+Copyright (c) 2017, Christopher A. Taylor
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--- a/cpp/LeopardCommon.cpp
+++ b/cpp/LeopardCommon.cpp
@@ -0,0 +1,472 @@
+/*
+    Copyright (c) 2017 Christopher A. Taylor.  All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice,
+      this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice,
+      this list of conditions and the following disclaimer in the documentation
+      and/or other materials provided with the distribution.
+    * Neither the name of Leopard-RS nor the names of its contributors may be
+      used to endorse or promote products derived from this software without
+      specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+    ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+    POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#include "LeopardCommon.h"
+
+#include <thread>
+
+namespace leopard {
+
+
+//------------------------------------------------------------------------------
+// Runtime CPU Architecture Check
+//
+// Feature checks stolen shamelessly from
+// https://github.com/jedisct1/libsodium/blob/master/src/libsodium/sodium/runtime.c
+
+#if defined(HAVE_ANDROID_GETCPUFEATURES)
+    #include <cpu-features.h>
+#endif
+
+#if defined(LEO_TRY_NEON)
+# if defined(IOS) && defined(__ARM_NEON__)
+    // Requires iPhone 5S or newer
+# else
+    // Remember to add LOCAL_STATIC_LIBRARIES := cpufeatures
+    bool CpuHasNeon = false; // V6 / V7
+    bool CpuHasNeon64 = false; // 64-bit
+# endif
+#endif
+
+
+#if !defined(LEO_TARGET_MOBILE)
+
+#ifdef _MSC_VER
+    #include <intrin.h> // __cpuid
+    #pragma warning(disable: 4752) // found Intel(R) Advanced Vector Extensions; consider using /arch:AVX
+#endif
+
+#ifdef LEO_TRY_AVX2
+    bool CpuHasAVX2 = false;
+#endif
+
+bool CpuHasSSSE3 = false;
+
+#define CPUID_EBX_AVX2    0x00000020
+#define CPUID_ECX_SSSE3   0x00000200
+
+static void _cpuid(unsigned int cpu_info[4U], const unsigned int cpu_info_type)
+{
+#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_AMD64) || defined(_M_IX86))
+    __cpuid((int *) cpu_info, cpu_info_type);
+#else //if defined(HAVE_CPUID)
+    cpu_info[0] = cpu_info[1] = cpu_info[2] = cpu_info[3] = 0;
+# ifdef __i386__
+    __asm__ __volatile__ ("pushfl; pushfl; "
+                          "popl %0; "
+                          "movl %0, %1; xorl %2, %0; "
+                          "pushl %0; "
+                          "popfl; pushfl; popl %0; popfl" :
+                          "=&r" (cpu_info[0]), "=&r" (cpu_info[1]) :
+                          "i" (0x200000));
+    if (((cpu_info[0] ^ cpu_info[1]) & 0x200000) == 0) {
+        return; /* LCOV_EXCL_LINE */
+    }
+# endif
+# ifdef __i386__
+    __asm__ __volatile__ ("xchgl %%ebx, %k1; cpuid; xchgl %%ebx, %k1" :
+                          "=a" (cpu_info[0]), "=&r" (cpu_info[1]),
+                          "=c" (cpu_info[2]), "=d" (cpu_info[3]) :
+                          "0" (cpu_info_type), "2" (0U));
+# elif defined(__x86_64__)
+    __asm__ __volatile__ ("xchgq %%rbx, %q1; cpuid; xchgq %%rbx, %q1" :
+                          "=a" (cpu_info[0]), "=&r" (cpu_info[1]),
+                          "=c" (cpu_info[2]), "=d" (cpu_info[3]) :
+                          "0" (cpu_info_type), "2" (0U));
+# else
+    __asm__ __volatile__ ("cpuid" :
+                          "=a" (cpu_info[0]), "=b" (cpu_info[1]),
+                          "=c" (cpu_info[2]), "=d" (cpu_info[3]) :
+                          "0" (cpu_info_type), "2" (0U));
+# endif
+#endif
+}
+
+#elif defined(LEO_USE_SSE2NEON)
+bool CpuHasSSSE3 = true;
+#endif // defined(LEO_TARGET_MOBILE)
+
+
+void InitializeCPUArch()
+{
+#if defined(LEO_TRY_NEON) && defined(HAVE_ANDROID_GETCPUFEATURES)
+    AndroidCpuFamily family = android_getCpuFamily();
+    if (family == ANDROID_CPU_FAMILY_ARM)
+    {
+        if (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON)
+            CpuHasNeon = true;
+    }
+    else if (family == ANDROID_CPU_FAMILY_ARM64)
+    {
+        CpuHasNeon = true;
+        if (android_getCpuFeatures() & ANDROID_CPU_ARM64_FEATURE_ASIMD)
+            CpuHasNeon64 = true;
+    }
+#endif
+
+#if !defined(LEO_TARGET_MOBILE)
+    unsigned int cpu_info[4];
+
+    _cpuid(cpu_info, 1);
+    CpuHasSSSE3 = ((cpu_info[2] & CPUID_ECX_SSSE3) != 0);
+
+#if defined(LEO_TRY_AVX2)
+    _cpuid(cpu_info, 7);
+    CpuHasAVX2 = ((cpu_info[1] & CPUID_EBX_AVX2) != 0);
+#endif // LEO_TRY_AVX2
+
+#ifndef LEO_USE_SSSE3_OPT
+    CpuHasSSSE3 = false;
+#endif // LEO_USE_SSSE3_OPT
+#ifndef LEO_USE_AVX2_OPT
+    CpuHasAVX2 = false;
+#endif // LEO_USE_AVX2_OPT
+
+#endif // LEO_TARGET_MOBILE
+}
+
+
+//------------------------------------------------------------------------------
+// XOR Memory
+
+void xor_mem(
+    void * LEO_RESTRICT vx, const void * LEO_RESTRICT vy,
+    uint64_t bytes)
+{
+#if defined(LEO_TRY_AVX2)
+    if (CpuHasAVX2)
+    {
+        LEO_M256 * LEO_RESTRICT x32 = reinterpret_cast<LEO_M256 *>(vx);
+        const LEO_M256 * LEO_RESTRICT y32 = reinterpret_cast<const LEO_M256 *>(vy);
+        while (bytes >= 128)
+        {
+            const LEO_M256 x0 = _mm256_xor_si256(_mm256_loadu_si256(x32),     _mm256_loadu_si256(y32));
+            const LEO_M256 x1 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 1), _mm256_loadu_si256(y32 + 1));
+            const LEO_M256 x2 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 2), _mm256_loadu_si256(y32 + 2));
+            const LEO_M256 x3 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 3), _mm256_loadu_si256(y32 + 3));
+            _mm256_storeu_si256(x32, x0);
+            _mm256_storeu_si256(x32 + 1, x1);
+            _mm256_storeu_si256(x32 + 2, x2);
+            _mm256_storeu_si256(x32 + 3, x3);
+            x32 += 4, y32 += 4;
+            bytes -= 128;
+        };
+        if (bytes > 0)
+        {
+            const LEO_M256 x0 = _mm256_xor_si256(_mm256_loadu_si256(x32),     _mm256_loadu_si256(y32));
+            const LEO_M256 x1 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 1), _mm256_loadu_si256(y32 + 1));
+            _mm256_storeu_si256(x32, x0);
+            _mm256_storeu_si256(x32 + 1, x1);
+        }
+        return;
+    }
+#endif // LEO_TRY_AVX2
+
+    LEO_M128 * LEO_RESTRICT x16 = reinterpret_cast<LEO_M128 *>(vx);
+    const LEO_M128 * LEO_RESTRICT y16 = reinterpret_cast<const LEO_M128 *>(vy);
+    do
+    {
+        const LEO_M128 x0 = _mm_xor_si128(_mm_loadu_si128(x16),     _mm_loadu_si128(y16));
+        const LEO_M128 x1 = _mm_xor_si128(_mm_loadu_si128(x16 + 1), _mm_loadu_si128(y16 + 1));
+        const LEO_M128 x2 = _mm_xor_si128(_mm_loadu_si128(x16 + 2), _mm_loadu_si128(y16 + 2));
+        const LEO_M128 x3 = _mm_xor_si128(_mm_loadu_si128(x16 + 3), _mm_loadu_si128(y16 + 3));
+        _mm_storeu_si128(x16, x0);
+        _mm_storeu_si128(x16 + 1, x1);
+        _mm_storeu_si128(x16 + 2, x2);
+        _mm_storeu_si128(x16 + 3, x3);
+        x16 += 4, y16 += 4;
+        bytes -= 64;
+    } while (bytes > 0);
+}
+
+#ifdef LEO_M1_OPT
+
+void xor_mem_2to1(
+    void * LEO_RESTRICT x,
+    const void * LEO_RESTRICT y,
+    const void * LEO_RESTRICT z,
+    uint64_t bytes)
+{
+#if defined(LEO_TRY_AVX2)
+    if (CpuHasAVX2)
+    {
+        LEO_M256 * LEO_RESTRICT x32 = reinterpret_cast<LEO_M256 *>(x);
+        const LEO_M256 * LEO_RESTRICT y32 = reinterpret_cast<const LEO_M256 *>(y);
+        const LEO_M256 * LEO_RESTRICT z32 = reinterpret_cast<const LEO_M256 *>(z);
+        while (bytes >= 128)
+        {
+            LEO_M256 x0 = _mm256_xor_si256(_mm256_loadu_si256(x32), _mm256_loadu_si256(y32));
+            x0 = _mm256_xor_si256(x0, _mm256_loadu_si256(z32));
+            LEO_M256 x1 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 1), _mm256_loadu_si256(y32 + 1));
+            x1 = _mm256_xor_si256(x1, _mm256_loadu_si256(z32 + 1));
+            LEO_M256 x2 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 2), _mm256_loadu_si256(y32 + 2));
+            x2 = _mm256_xor_si256(x2, _mm256_loadu_si256(z32 + 2));
+            LEO_M256 x3 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 3), _mm256_loadu_si256(y32 + 3));
+            x3 = _mm256_xor_si256(x3, _mm256_loadu_si256(z32 + 3));
+            _mm256_storeu_si256(x32, x0);
+            _mm256_storeu_si256(x32 + 1, x1);
+            _mm256_storeu_si256(x32 + 2, x2);
+            _mm256_storeu_si256(x32 + 3, x3);
+            x32 += 4, y32 += 4, z32 += 4;
+            bytes -= 128;
+        };
+
+        if (bytes > 0)
+        {
+            LEO_M256 x0 = _mm256_xor_si256(_mm256_loadu_si256(x32),     _mm256_loadu_si256(y32));
+            x0 = _mm256_xor_si256(x0, _mm256_loadu_si256(z32));
+            LEO_M256 x1 = _mm256_xor_si256(_mm256_loadu_si256(x32 + 1), _mm256_loadu_si256(y32 + 1));
+            x1 = _mm256_xor_si256(x1, _mm256_loadu_si256(z32 + 1));
+            _mm256_storeu_si256(x32, x0);
+            _mm256_storeu_si256(x32 + 1, x1);
+        }
+
+        return;
+    }
+#endif // LEO_TRY_AVX2
+
+    LEO_M128 * LEO_RESTRICT x16 = reinterpret_cast<LEO_M128 *>(x);
+    const LEO_M128 * LEO_RESTRICT y16 = reinterpret_cast<const LEO_M128 *>(y);
+    const LEO_M128 * LEO_RESTRICT z16 = reinterpret_cast<const LEO_M128 *>(z);
+    do
+    {
+        LEO_M128 x0 = _mm_xor_si128(_mm_loadu_si128(x16), _mm_loadu_si128(y16));
+        x0 = _mm_xor_si128(x0, _mm_loadu_si128(z16));
+        LEO_M128 x1 = _mm_xor_si128(_mm_loadu_si128(x16 + 1), _mm_loadu_si128(y16 + 1));
+        x1 = _mm_xor_si128(x1, _mm_loadu_si128(z16 + 1));
+        LEO_M128 x2 = _mm_xor_si128(_mm_loadu_si128(x16 + 2), _mm_loadu_si128(y16 + 2));
+        x2 = _mm_xor_si128(x2, _mm_loadu_si128(z16 + 2));
+        LEO_M128 x3 = _mm_xor_si128(_mm_loadu_si128(x16 + 3), _mm_loadu_si128(y16 + 3));
+        x3 = _mm_xor_si128(x3, _mm_loadu_si128(z16 + 3));
+        _mm_storeu_si128(x16, x0);
+        _mm_storeu_si128(x16 + 1, x1);
+        _mm_storeu_si128(x16 + 2, x2);
+        _mm_storeu_si128(x16 + 3, x3);
+        x16 += 4, y16 += 4, z16 += 4;
+        bytes -= 64;
+    } while (bytes > 0);
+}
+
+#endif // LEO_M1_OPT
+
+#ifdef LEO_USE_VECTOR4_OPT
+
+void xor_mem4(
+    void * LEO_RESTRICT vx_0, const void * LEO_RESTRICT vy_0,
+    void * LEO_RESTRICT vx_1, const void * LEO_RESTRICT vy_1,
+    void * LEO_RESTRICT vx_2, const void * LEO_RESTRICT vy_2,
+    void * LEO_RESTRICT vx_3, const void * LEO_RESTRICT vy_3,
+    uint64_t bytes)
+{
+#if defined(LEO_TRY_AVX2)
+    if (CpuHasAVX2)
+    {
+        LEO_M256 * LEO_RESTRICT       x32_0 = reinterpret_cast<LEO_M256 *>      (vx_0);
+        const LEO_M256 * LEO_RESTRICT y32_0 = reinterpret_cast<const LEO_M256 *>(vy_0);
+        LEO_M256 * LEO_RESTRICT       x32_1 = reinterpret_cast<LEO_M256 *>      (vx_1);
+        const LEO_M256 * LEO_RESTRICT y32_1 = reinterpret_cast<const LEO_M256 *>(vy_1);
+        LEO_M256 * LEO_RESTRICT       x32_2 = reinterpret_cast<LEO_M256 *>      (vx_2);
+        const LEO_M256 * LEO_RESTRICT y32_2 = reinterpret_cast<const LEO_M256 *>(vy_2);
+        LEO_M256 * LEO_RESTRICT       x32_3 = reinterpret_cast<LEO_M256 *>      (vx_3);
+        const LEO_M256 * LEO_RESTRICT y32_3 = reinterpret_cast<const LEO_M256 *>(vy_3);
+        while (bytes >= 128)
+        {
+            const LEO_M256 x0_0 = _mm256_xor_si256(_mm256_loadu_si256(x32_0),     _mm256_loadu_si256(y32_0));
+            const LEO_M256 x1_0 = _mm256_xor_si256(_mm256_loadu_si256(x32_0 + 1), _mm256_loadu_si256(y32_0 + 1));
+            const LEO_M256 x2_0 = _mm256_xor_si256(_mm256_loadu_si256(x32_0 + 2), _mm256_loadu_si256(y32_0 + 2));
+            const LEO_M256 x3_0 = _mm256_xor_si256(_mm256_loadu_si256(x32_0 + 3), _mm256_loadu_si256(y32_0 + 3));
+            _mm256_storeu_si256(x32_0, x0_0);
+            _mm256_storeu_si256(x32_0 + 1, x1_0);
+            _mm256_storeu_si256(x32_0 + 2, x2_0);
+            _mm256_storeu_si256(x32_0 + 3, x3_0);
+            x32_0 += 4, y32_0 += 4;
+            const LEO_M256 x0_1 = _mm256_xor_si256(_mm256_loadu_si256(x32_1),     _mm256_loadu_si256(y32_1));
+            const LEO_M256 x1_1 = _mm256_xor_si256(_mm256_loadu_si256(x32_1 + 1), _mm256_loadu_si256(y32_1 + 1));
+            const LEO_M256 x2_1 = _mm256_xor_si256(_mm256_loadu_si256(x32_1 + 2), _mm256_loadu_si256(y32_1 + 2));
+            const LEO_M256 x3_1 = _mm256_xor_si256(_mm256_loadu_si256(x32_1 + 3), _mm256_loadu_si256(y32_1 + 3));
+            _mm256_storeu_si256(x32_1, x0_1);
+            _mm256_storeu_si256(x32_1 + 1, x1_1);
+            _mm256_storeu_si256(x32_1 + 2, x2_1);
+            _mm256_storeu_si256(x32_1 + 3, x3_1);
+            x32_1 += 4, y32_1 += 4;
+            const LEO_M256 x0_2 = _mm256_xor_si256(_mm256_loadu_si256(x32_2),     _mm256_loadu_si256(y32_2));
+            const LEO_M256 x1_2 = _mm256_xor_si256(_mm256_loadu_si256(x32_2 + 1), _mm256_loadu_si256(y32_2 + 1));
+            const LEO_M256 x2_2 = _mm256_xor_si256(_mm256_loadu_si256(x32_2 + 2), _mm256_loadu_si256(y32_2 + 2));
+            const LEO_M256 x3_2 = _mm256_xor_si256(_mm256_loadu_si256(x32_2 + 3), _mm256_loadu_si256(y32_2 + 3));
+            _mm256_storeu_si256(x32_2, x0_2);
+            _mm256_storeu_si256(x32_2 + 1, x1_2);
+            _mm256_storeu_si256(x32_2 + 2, x2_2);
+            _mm256_storeu_si256(x32_2 + 3, x3_2);
+            x32_2 += 4, y32_2 += 4;
+            const LEO_M256 x0_3 = _mm256_xor_si256(_mm256_loadu_si256(x32_3),     _mm256_loadu_si256(y32_3));
+            const LEO_M256 x1_3 = _mm256_xor_si256(_mm256_loadu_si256(x32_3 + 1), _mm256_loadu_si256(y32_3 + 1));
+            const LEO_M256 x2_3 = _mm256_xor_si256(_mm256_loadu_si256(x32_3 + 2), _mm256_loadu_si256(y32_3 + 2));
+            const LEO_M256 x3_3 = _mm256_xor_si256(_mm256_loadu_si256(x32_3 + 3), _mm256_loadu_si256(y32_3 + 3));
+            _mm256_storeu_si256(x32_3,     x0_3);
+            _mm256_storeu_si256(x32_3 + 1, x1_3);
+            _mm256_storeu_si256(x32_3 + 2, x2_3);
+            _mm256_storeu_si256(x32_3 + 3, x3_3);
+            x32_3 += 4, y32_3 += 4;
+            bytes -= 128;
+        }
+        if (bytes > 0)
+        {
+            const LEO_M256 x0_0 = _mm256_xor_si256(_mm256_loadu_si256(x32_0),     _mm256_loadu_si256(y32_0));
+            const LEO_M256 x1_0 = _mm256_xor_si256(_mm256_loadu_si256(x32_0 + 1), _mm256_loadu_si256(y32_0 + 1));
+            const LEO_M256 x0_1 = _mm256_xor_si256(_mm256_loadu_si256(x32_1),     _mm256_loadu_si256(y32_1));
+            const LEO_M256 x1_1 = _mm256_xor_si256(_mm256_loadu_si256(x32_1 + 1), _mm256_loadu_si256(y32_1 + 1));
+            _mm256_storeu_si256(x32_0, x0_0);
+            _mm256_storeu_si256(x32_0 + 1, x1_0);
+            _mm256_storeu_si256(x32_1, x0_1);
+            _mm256_storeu_si256(x32_1 + 1, x1_1);
+            const LEO_M256 x0_2 = _mm256_xor_si256(_mm256_loadu_si256(x32_2),     _mm256_loadu_si256(y32_2));
+            const LEO_M256 x1_2 = _mm256_xor_si256(_mm256_loadu_si256(x32_2 + 1), _mm256_loadu_si256(y32_2 + 1));
+            const LEO_M256 x0_3 = _mm256_xor_si256(_mm256_loadu_si256(x32_3),     _mm256_loadu_si256(y32_3));
+            const LEO_M256 x1_3 = _mm256_xor_si256(_mm256_loadu_si256(x32_3 + 1), _mm256_loadu_si256(y32_3 + 1));
+            _mm256_storeu_si256(x32_2,     x0_2);
+            _mm256_storeu_si256(x32_2 + 1, x1_2);
+            _mm256_storeu_si256(x32_3,     x0_3);
+            _mm256_storeu_si256(x32_3 + 1, x1_3);
+        }
+        return;
+    }
+#endif // LEO_TRY_AVX2
+    LEO_M128 * LEO_RESTRICT       x16_0 = reinterpret_cast<LEO_M128 *>      (vx_0);
+    const LEO_M128 * LEO_RESTRICT y16_0 = reinterpret_cast<const LEO_M128 *>(vy_0);
+    LEO_M128 * LEO_RESTRICT       x16_1 = reinterpret_cast<LEO_M128 *>      (vx_1);
+    const LEO_M128 * LEO_RESTRICT y16_1 = reinterpret_cast<const LEO_M128 *>(vy_1);
+    LEO_M128 * LEO_RESTRICT       x16_2 = reinterpret_cast<LEO_M128 *>      (vx_2);
+    const LEO_M128 * LEO_RESTRICT y16_2 = reinterpret_cast<const LEO_M128 *>(vy_2);
+    LEO_M128 * LEO_RESTRICT       x16_3 = reinterpret_cast<LEO_M128 *>      (vx_3);
+    const LEO_M128 * LEO_RESTRICT y16_3 = reinterpret_cast<const LEO_M128 *>(vy_3);
+    do
+    {
+        const LEO_M128 x0_0 = _mm_xor_si128(_mm_loadu_si128(x16_0),     _mm_loadu_si128(y16_0));
+        const LEO_M128 x1_0 = _mm_xor_si128(_mm_loadu_si128(x16_0 + 1), _mm_loadu_si128(y16_0 + 1));
+        const LEO_M128 x2_0 = _mm_xor_si128(_mm_loadu_si128(x16_0 + 2), _mm_loadu_si128(y16_0 + 2));
+        const LEO_M128 x3_0 = _mm_xor_si128(_mm_loadu_si128(x16_0 + 3), _mm_loadu_si128(y16_0 + 3));
+        _mm_storeu_si128(x16_0, x0_0);
+        _mm_storeu_si128(x16_0 + 1, x1_0);
+        _mm_storeu_si128(x16_0 + 2, x2_0);
+        _mm_storeu_si128(x16_0 + 3, x3_0);
+        x16_0 += 4, y16_0 += 4;
+        const LEO_M128 x0_1 = _mm_xor_si128(_mm_loadu_si128(x16_1),     _mm_loadu_si128(y16_1));
+        const LEO_M128 x1_1 = _mm_xor_si128(_mm_loadu_si128(x16_1 + 1), _mm_loadu_si128(y16_1 + 1));
+        const LEO_M128 x2_1 = _mm_xor_si128(_mm_loadu_si128(x16_1 + 2), _mm_loadu_si128(y16_1 + 2));
+        const LEO_M128 x3_1 = _mm_xor_si128(_mm_loadu_si128(x16_1 + 3), _mm_loadu_si128(y16_1 + 3));
+        _mm_storeu_si128(x16_1, x0_1);
+        _mm_storeu_si128(x16_1 + 1, x1_1);
+        _mm_storeu_si128(x16_1 + 2, x2_1);
+        _mm_storeu_si128(x16_1 + 3, x3_1);
+        x16_1 += 4, y16_1 += 4;
+        const LEO_M128 x0_2 = _mm_xor_si128(_mm_loadu_si128(x16_2),     _mm_loadu_si128(y16_2));
+        const LEO_M128 x1_2 = _mm_xor_si128(_mm_loadu_si128(x16_2 + 1), _mm_loadu_si128(y16_2 + 1));
+        const LEO_M128 x2_2 = _mm_xor_si128(_mm_loadu_si128(x16_2 + 2), _mm_loadu_si128(y16_2 + 2));
+        const LEO_M128 x3_2 = _mm_xor_si128(_mm_loadu_si128(x16_2 + 3), _mm_loadu_si128(y16_2 + 3));
+        _mm_storeu_si128(x16_2, x0_2);
+        _mm_storeu_si128(x16_2 + 1, x1_2);
+        _mm_storeu_si128(x16_2 + 2, x2_2);
+        _mm_storeu_si128(x16_2 + 3, x3_2);
+        x16_2 += 4, y16_2 += 4;
+        const LEO_M128 x0_3 = _mm_xor_si128(_mm_loadu_si128(x16_3),     _mm_loadu_si128(y16_3));
+        const LEO_M128 x1_3 = _mm_xor_si128(_mm_loadu_si128(x16_3 + 1), _mm_loadu_si128(y16_3 + 1));
+        const LEO_M128 x2_3 = _mm_xor_si128(_mm_loadu_si128(x16_3 + 2), _mm_loadu_si128(y16_3 + 2));
+        const LEO_M128 x3_3 = _mm_xor_si128(_mm_loadu_si128(x16_3 + 3), _mm_loadu_si128(y16_3 + 3));
+        _mm_storeu_si128(x16_3,     x0_3);
+        _mm_storeu_si128(x16_3 + 1, x1_3);
+        _mm_storeu_si128(x16_3 + 2, x2_3);
+        _mm_storeu_si128(x16_3 + 3, x3_3);
+        x16_3 += 4, y16_3 += 4;
+        bytes -= 64;
+    } while (bytes > 0);
+}
+
+#endif // LEO_USE_VECTOR4_OPT
+
+void VectorXOR_Threads(
+    const uint64_t bytes,
+    unsigned count,
+    void** x,
+    void** y)
+{
+#ifdef LEO_USE_VECTOR4_OPT
+    if (count >= 4)
+    {
+        int i_end = count - 4;
+#pragma omp parallel for
+        for (int i = 0; i <= i_end; i += 4)
+        {
+            xor_mem4(
+                x[i + 0], y[i + 0],
+                x[i + 1], y[i + 1],
+                x[i + 2], y[i + 2],
+                x[i + 3], y[i + 3],
+                bytes);
+        }
+        count %= 4;
+        i_end += 4 - static_cast<int>(count); // i_end is now the number of buffers already processed
+        x += i_end;
+        y += i_end;
+    }
+#endif // LEO_USE_VECTOR4_OPT
+
+    for (unsigned i = 0; i < count; ++i)
+        xor_mem(x[i], y[i], bytes);
+}
+void VectorXOR(
+    const uint64_t bytes,
+    unsigned count,
+    void** x,
+    void** y)
+{
+#ifdef LEO_USE_VECTOR4_OPT
+    if (count >= 4)
+    {
+        int i_end = count - 4;
+        for (int i = 0; i <= i_end; i += 4)
+        {
+            xor_mem4(
+                x[i + 0], y[i + 0],
+                x[i + 1], y[i + 1],
+                x[i + 2], y[i + 2],
+                x[i + 3], y[i + 3],
+                bytes);
+        }
+        count %= 4;
+        i_end += 4 - static_cast<int>(count); // i_end is now the number of buffers already processed
+        x += i_end;
+        y += i_end;
+    }
+#endif // LEO_USE_VECTOR4_OPT
+
+    for (unsigned i = 0; i < count; ++i)
+        xor_mem(x[i], y[i], bytes);
+}
+
+
+} // namespace leopard
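The batching done by `VectorXOR` in the file above can be illustrated without any SIMD intrinsics. The following is a portable sketch, not the library's implementation: `xor_mem_scalar` and `vector_xor_scalar` are hypothetical names, and a plain byte loop stands in for the SSE/AVX2 loads and stores. It processes the buffer pairs in groups of four and then handles the leftover pairs starting at index `count - count % 4`.

```cpp
#include <cstdint>

// Portable stand-in for xor_mem: x[i] ^= y[i] for `bytes` bytes.
// The SIMD routines require `bytes` to be a multiple of 64; this byte
// loop works for any length.
inline void xor_mem_scalar(void* vx, const void* vy, std::uint64_t bytes) {
    auto* x = static_cast<std::uint8_t*>(vx);
    const auto* y = static_cast<const std::uint8_t*>(vy);
    for (std::uint64_t i = 0; i < bytes; ++i) {
        x[i] ^= y[i];
    }
}

// XORs y[i] into x[i] for `count` buffer pairs of `bytes` bytes each.
inline void vector_xor_scalar(std::uint64_t bytes, unsigned count,
                              void** x, void** y) {
    // Number of buffer pairs covered by complete groups of four.
    const unsigned full = count - count % 4;

    // Groups of four (the role played by xor_mem4 in the SIMD code).
    for (unsigned i = 0; i < full; i += 4) {
        for (unsigned j = 0; j < 4; ++j) {
            xor_mem_scalar(x[i + j], y[i + j], bytes);
        }
    }

    // Leftover pairs (the role played by the trailing xor_mem loop).
    for (unsigned i = full; i < count; ++i) {
        xor_mem_scalar(x[i], y[i], bytes);
    }
}
```

With six buffer pairs, for example, the first loop covers indices 0 to 3 and the second loop covers indices 4 and 5, so every pair is XORed exactly once.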
--- a/cpp/LeopardCommon.h
+++ b/cpp/LeopardCommon.h
@@ -0,0 +1,502 @@
+/*
+    Copyright (c) 2017 Christopher A. Taylor.  All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice,
+      this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice,
+      this list of conditions and the following disclaimer in the documentation
+      and/or other materials provided with the distribution.
+    * Neither the name of Leopard-RS nor the names of its contributors may be
+      used to endorse or promote products derived from this software without
+      specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+    ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+    POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#pragma once
+
+/*
+    TODO:
+
+    Mid-term:
+    + Add compile-time selectable XOR-only rowops instead of MULADD
+    + Look into 12-bit fields as a performance optimization
+
+    Long-term:
+    + Evaluate the error locator polynomial based on fast polynomial interpolations in O(k log^2 k)
+    + Look into getting EncodeL working so we can support larger recovery sets
+    + Implement the decoder algorithm from {3} based on the Forney algorithm
+*/
+
+/*
+    FFT Data Layout:
+
+    We pack the data into memory in this order:
+
+    [Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]
+
+    For encoding, the placement is implied instead of actual memory layout.
+    For decoding, the layout is explicitly used.
+*/
+
+/*
+    Encoder algorithm:
+
+    The encoder is described in {3}.  Operations are done in O(K Log M),
+    where K is the original data size, and M is up to twice the
+    size of the recovery set.
+
+    Roughly in brief:
+
+        Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )
+
+    It walks the original data M chunks at a time performing the IFFT.
+    Each IFFT intermediate result is XORed together into the first M chunks of
+    the data layout.  Finally the FFT is performed.
+
+    Encoder optimizations:
+    * The first IFFT can be performed directly in the first M chunks.
+    * The zero padding can be skipped while performing the final IFFT.
+    Unrolling is used in the code to accomplish both these optimizations.
+    * The final FFT can be truncated also if recovery set is not a power of 2.
+    It is easy to truncate the FFT by ending the inner loop early.
+    * The FFT operations can be unrolled two layers at a time so that instead
+    of writing the result of the first layer out and reading it back in for
+    the second layer, those interactions can happen in registers immediately.
+*/
+
+/*
+    Decoder algorithm:
+
+    The decoder is described in {1}.  Operations are done in O(N Log N), where N is up
+    to twice the size of the original data as described below.
+
+    Roughly in brief:
+
+        Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )
+
+
+    Precalculations:
+    ---------------
+
+    At startup initialization, FFTInitialize() precalculates FWT(L) as
+    described by equation (92) in {1}, where L = Log[i] for i = 0..Order,
+    Order = 256 or 65536 for FF8/16.  This is stored in the LogWalsh vector.
+
+    It also precalculates the FFT skew factors (s_i) as described by
+    equation (28).  This is stored in the FFTSkew vector.
+
+    For memory workspace N data chunks are needed, where N is a power of two
+    at or above M + K.  K is the original data size and M is the next power
+    of two above the recovery data size.  For example for K = 200 pieces of
+    data and 10% redundancy, there are 20 redundant pieces, which rounds up
+    to 32 = M.  M + K = 232 pieces, so N rounds up to 256.
+
+
+    Online calculations:
+    -------------------
+
+    At runtime, the error locator polynomial is evaluated using the
+    Fast Walsh-Hadamard transform as described in {1} equation (92).
+
+    At runtime the data is explicitly laid out in workspace memory like this:
+    [Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]
+
+    Data that was lost is replaced with zeroes.
+    Data that was received, including recovery data, is multiplied by the error
+    locator polynomial as it is copied into the workspace.
+
+    The IFFT is applied to the entire workspace of N chunks.
+    Since the IFFT starts with pairs of inputs and doubles in width at each
+    iteration, the IFFT is optimized by skipping zero padding at the end until
+    it starts mixing with non-zero data.
+
+    The formal derivative is applied to the entire workspace of N chunks.
+    This is a massive XOR loop that runs 4 columns in parallel for speed.
+
+    The FFT is applied to the entire workspace of N chunks.
+    The FFT is optimized by only performing intermediate calculations required
+    to recover lost data.  Since it starts wide and ends up working on adjacent
+    pairs, at some point the intermediate results are not needed for data that
+    will not be read by the application.  This optimization is implemented by
+    the ErrorBitfield class.
+
+    Finally, only recovered data is multiplied by the negative of the
+    error locator polynomial as it is copied into the front of the
+    workspace for the application to retrieve.
+*/
+
+/*
+    Finite field arithmetic optimizations:
+
+    For faster finite field multiplication, large tables are precomputed and
+    applied during encoding/decoding on 64 bytes of data at a time using
+    SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure.
+
+    Addition in this finite field is XOR, and a vectorized memory XOR routine
+    is also used.
+*/
+
+#include "leopard.h"
+
+#include <stdint.h>
+#include <stdlib.h> // calloc, free (used by SIMDSafeAllocate/SIMDSafeFree)
+#ifdef _WIN32
+#include <malloc.h>
+#endif //_WIN32
+#include <vector>
+#include <atomic>
+#include <memory>
+#include <mutex>
+#include <condition_variable>
+
+
+//------------------------------------------------------------------------------
+// Constants
+
+// Enable 8-bit or 16-bit fields
+#define LEO_HAS_FF8
+#define LEO_HAS_FF16
+
+// Enable using SIMD instructions
+#define LEO_USE_SSSE3_OPT
+#define LEO_USE_AVX2_OPT
+
+// Avoid calculating final FFT values in decoder using bitfield
+#define LEO_ERROR_BITFIELD_OPT
+
+// Interleave butterfly operations between layer pairs in FFT
+#define LEO_INTERLEAVE_BUTTERFLY4_OPT
+
+// Optimize M=1 case
+#define LEO_M1_OPT
+
+// Unroll inner loops 4 times
+#define LEO_USE_VECTOR4_OPT
+
+// AArch64 targets (e.g. Apple Silicon Macs): use sse2neon and the mobile code path
+#if defined(__aarch64__)
+  #define LEO_USE_SSE2NEON
+  #define LEO_TARGET_MOBILE
+#endif
+
+//------------------------------------------------------------------------------
+// Debug
+
+// Some bugs only repro in release mode, so this can be helpful
+//#define LEO_DEBUG_IN_RELEASE
+
+#if defined(_DEBUG) || defined(DEBUG) || defined(LEO_DEBUG_IN_RELEASE)
+    #define LEO_DEBUG
+    #ifdef _WIN32
+        #define LEO_DEBUG_BREAK __debugbreak()
+    #else
+        #define LEO_DEBUG_BREAK __builtin_trap()
+    #endif
+    #define LEO_DEBUG_ASSERT(cond) { if (!(cond)) { LEO_DEBUG_BREAK; } }
+#else
+    #define LEO_DEBUG_BREAK ;
+    #define LEO_DEBUG_ASSERT(cond) ;
+#endif
+
+
+//------------------------------------------------------------------------------
+// Windows Header
+
+#ifdef _WIN32
+    #define WIN32_LEAN_AND_MEAN
+
+    #ifndef _WINSOCKAPI_
+        #define DID_DEFINE_WINSOCKAPI
+        #define _WINSOCKAPI_
+    #endif
+    #ifndef NOMINMAX
+        #define NOMINMAX
+    #endif
+    #ifndef _WIN32_WINNT
+        #define _WIN32_WINNT 0x0601 /* Windows 7+ */
+    #endif
+
+    #include <windows.h>
+#endif
+
+#ifdef DID_DEFINE_WINSOCKAPI
+    #undef _WINSOCKAPI_
+    #undef DID_DEFINE_WINSOCKAPI
+#endif
+
+
+//------------------------------------------------------------------------------
+// Platform/Architecture
+
+#ifdef _MSC_VER
+    #include <intrin.h>
+#endif
+
+#if defined(ANDROID) || defined(IOS)
+    #define LEO_TARGET_MOBILE
+#endif // ANDROID
+
+#if defined(__AVX2__) || (defined (_MSC_VER) && _MSC_VER >= 1900)
+    #define LEO_TRY_AVX2 /* 256-bit */
+    #include <immintrin.h>
+    #define LEO_ALIGN_BYTES 32
+#else // __AVX2__
+    #define LEO_ALIGN_BYTES 16
+#endif // __AVX2__
+
+#if !defined(LEO_TARGET_MOBILE)
+    // Note: MSVC currently only supports SSSE3 but not AVX2
+    #include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8
+    #include <emmintrin.h> // SSE2
+#elif defined(LEO_USE_SSE2NEON)
+    #include "sse2neon/sse2neon.h"
+#endif // LEO_TARGET_MOBILE
+
+#if defined(HAVE_ARM_NEON_H)
+    #include <arm_neon.h>
+#endif // HAVE_ARM_NEON_H
+
+#if defined(LEO_TARGET_MOBILE)
+
+    #define LEO_ALIGNED_ACCESSES /* Inputs must be aligned to LEO_ALIGN_BYTES */
+
+# if defined(HAVE_ARM_NEON_H)
+    // Compiler-specific 128-bit SIMD register keyword
+    #define LEO_M128 uint8x16_t
+    #define LEO_TRY_NEON
+#elif defined(LEO_USE_SSE2NEON)
+    #define LEO_M128 __m128i
+#else
+    #define LEO_M128 uint64_t
+# endif
+
+#else // LEO_TARGET_MOBILE
+
+    // Compiler-specific 128-bit SIMD register keyword
+    #define LEO_M128 __m128i
+
+#endif // LEO_TARGET_MOBILE
+
+#ifdef LEO_TRY_AVX2
+    // Compiler-specific 256-bit SIMD register keyword
+    #define LEO_M256 __m256i
+#endif
+
+// Compiler-specific restrict keyword (restrict is not part of standard C++)
+#define LEO_RESTRICT __restrict
+
+// Compiler-specific force inline keyword
+#ifdef _MSC_VER
+    #define LEO_FORCE_INLINE inline __forceinline
+#else
+    #define LEO_FORCE_INLINE inline __attribute__((always_inline))
+#endif
+
+// Compiler-specific alignment keyword
+// Note: Alignment only matters for ARM NEON where it should be 16
+#ifdef _MSC_VER
+    #define LEO_ALIGNED __declspec(align(LEO_ALIGN_BYTES))
+#else // _MSC_VER
+    #define LEO_ALIGNED __attribute__((aligned(LEO_ALIGN_BYTES)))
+#endif // _MSC_VER
+
+
+namespace leopard {
+
+
+//------------------------------------------------------------------------------
+// Runtime CPU Architecture Check
+
+// Initialize CPU architecture flags
+void InitializeCPUArch();
+
+
+#if defined(LEO_TRY_NEON)
+# if defined(IOS) && defined(__ARM_NEON__)
+    // Does device support NEON?
+    static const bool CpuHasNeon = true;
+    static const bool CpuHasNeon64 = true;
+# else
+    // Does device support NEON?
+    // Remember to add LOCAL_STATIC_LIBRARIES := cpufeatures
+    extern bool CpuHasNeon; // V6 / V7
+    extern bool CpuHasNeon64; // 64-bit
+# endif
+#endif
+
+#if !defined(LEO_TARGET_MOBILE)
+# if defined(LEO_TRY_AVX2)
+    // Does CPU support AVX2?
+    extern bool CpuHasAVX2;
+# endif
+    // Does CPU support SSSE3?
+    extern bool CpuHasSSSE3;
+#elif defined(LEO_USE_SSE2NEON)
+    extern bool CpuHasSSSE3;
+#endif // LEO_TARGET_MOBILE
+
+
+//------------------------------------------------------------------------------
+// Portable Intrinsics
+
+// Returns highest bit index 0..31 where the first non-zero bit is found
+// Precondition: x != 0
+LEO_FORCE_INLINE unsigned LastNonzeroBit32(unsigned x)
+{
+#ifdef _MSC_VER
+    unsigned long index;
+    // Note: Ignoring result because x != 0
+    _BitScanReverse(&index, (uint32_t)x);
+    return (unsigned)index;
+#else
+    // Note: Ignoring return value of 0 because x != 0
+    static_assert(sizeof(unsigned) == 4, "Assuming 32 bit unsigneds in LastNonzeroBit32");
+    return 31 - (unsigned)__builtin_clz(x);
+#endif
+}
+
+// Returns next power of two at or above given value
+// Precondition: n >= 2, since LastNonzeroBit32() requires n - 1 != 0
+LEO_FORCE_INLINE unsigned NextPow2(unsigned n)
+{
+    return 2UL << LastNonzeroBit32(n - 1);
+}
+
+
+//------------------------------------------------------------------------------
+// XOR Memory
+//
+// This works for both 8-bit and 16-bit finite fields
+
+// x[] ^= y[]
+void xor_mem(
+    void * LEO_RESTRICT x, const void * LEO_RESTRICT y,
+    uint64_t bytes);
+
+#ifdef LEO_M1_OPT
+
+// x[] ^= y[] ^ z[]
+void xor_mem_2to1(
+    void * LEO_RESTRICT x,
+    const void * LEO_RESTRICT y,
+    const void * LEO_RESTRICT z,
+    uint64_t bytes);
+
+#endif // LEO_M1_OPT
+
+#ifdef LEO_USE_VECTOR4_OPT
+
+// For i = {0, 1, 2, 3}: x_i[] ^= y_i[]
+void xor_mem4(
+    void * LEO_RESTRICT x_0, const void * LEO_RESTRICT y_0,
+    void * LEO_RESTRICT x_1, const void * LEO_RESTRICT y_1,
+    void * LEO_RESTRICT x_2, const void * LEO_RESTRICT y_2,
+    void * LEO_RESTRICT x_3, const void * LEO_RESTRICT y_3,
+    uint64_t bytes);
+
+#endif // LEO_USE_VECTOR4_OPT
+
+// x[] ^= y[]
+void VectorXOR(
+    const uint64_t bytes,
+    unsigned count,
+    void** x,
+    void** y);
+
+// x[] ^= y[] (Multithreaded)
+void VectorXOR_Threads(
+    const uint64_t bytes,
+    unsigned count,
+    void** x,
+    void** y);
+
+
+//------------------------------------------------------------------------------
+// XORSummer
+
+class XORSummer
+{
+public:
+    // Set the addition destination and clear any pending source buffer
+    LEO_FORCE_INLINE void Initialize(void* dest)
+    {
+        DestBuffer = dest;
+        Waiting = nullptr;
+    }
+
+    // Accumulate some source data
+    LEO_FORCE_INLINE void Add(const void* src, const uint64_t bytes)
+    {
+#ifdef LEO_M1_OPT
+        if (Waiting)
+        {
+            xor_mem_2to1(DestBuffer, src, Waiting, bytes);
+            Waiting = nullptr;
+        }
+        else
+            Waiting = src;
+#else // LEO_M1_OPT
+        xor_mem(DestBuffer, src, bytes);
+#endif // LEO_M1_OPT
+    }
+
+    // Finalize in the destination buffer
+    LEO_FORCE_INLINE void Finalize(const uint64_t bytes)
+    {
+#ifdef LEO_M1_OPT
+        if (Waiting)
+            xor_mem(DestBuffer, Waiting, bytes);
+#endif // LEO_M1_OPT
+    }
+
+protected:
+    void* DestBuffer;
+    const void* Waiting;
+};
+
+
+//------------------------------------------------------------------------------
+// SIMD-Safe Aligned Memory Allocations
+
+static const unsigned kAlignmentBytes = LEO_ALIGN_BYTES;
+
+static LEO_FORCE_INLINE uint8_t* SIMDSafeAllocate(size_t size)
+{
+    uint8_t* data = (uint8_t*)calloc(1, kAlignmentBytes + size);
+    if (!data)
+        return nullptr;
+    unsigned offset = (unsigned)((uintptr_t)data % kAlignmentBytes);
+    data += kAlignmentBytes - offset;
+    data[-1] = (uint8_t)offset;
+    return data;
+}
+
+static LEO_FORCE_INLINE void SIMDSafeFree(void* ptr)
+{
+    if (!ptr)
+        return;
+    uint8_t* data = (uint8_t*)ptr;
+    unsigned offset = data[-1];
+    if (offset >= kAlignmentBytes)
+    {
+        LEO_DEBUG_BREAK; // Should never happen
+        return;
+    }
+    data -= kAlignmentBytes - offset;
+    free(data);
+}
+
+
+} // namespace leopard
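The workspace sizing described in the decoder notes of LeopardCommon.h (M rounded up from the recovery count, N rounded up from M + K, and the layout `[Recovery (M)] [Original (K)] [Zero padding to N]`) can be sketched as below. This is an illustrative sketch only: `next_pow2`, `WorkspaceLayout`, and `decoder_layout` are hypothetical names that do not exist in the library, and the loop-based rounding stands in for the library's bit-scan based `NextPow2()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest power of two at or above n.
// A portable loop is used here instead of a bit-scan intrinsic.
static unsigned next_pow2(unsigned n)
{
    assert(n >= 1 && n <= 0x80000000u);
    unsigned p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// Hypothetical description of the decoder workspace, counted in chunks.
struct WorkspaceLayout
{
    unsigned m;               // size of the recovery region (power of two)
    unsigned n;               // total workspace size (power of two)
    unsigned original_offset; // first chunk of original data (= m)
    unsigned padding_offset;  // first chunk of zero padding (= m + k)
};

// Compute the layout for k original pieces and the given recovery count.
static WorkspaceLayout decoder_layout(unsigned original_count,
                                      unsigned recovery_count)
{
    WorkspaceLayout w;
    w.m = next_pow2(recovery_count);
    w.n = next_pow2(w.m + original_count);
    w.original_offset = w.m;
    w.padding_offset = w.m + original_count;
    return w;
}
```

Calling `decoder_layout(200, 20)` reproduces the worked example from the comment: M = 32, N = 256, with the original data starting at chunk 32 and the zero padding at chunk 232.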
--- a/cpp/LeopardFF16.cpp
+++ b/cpp/LeopardFF16.cpp
--- a/cpp/LeopardFF16.h
+++ b/cpp/LeopardFF16.h
@ -0,0 +1,93 @@
+/*
+    Copyright (c) 2017 Christopher A. Taylor.  All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice,
+      this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice,
+      this list of conditions and the following disclaimer in the documentation
+      and/or other materials provided with the distribution.
+    * Neither the name of Leopard-RS nor the names of its contributors may be
+      used to endorse or promote products derived from this software without
+      specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+    ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+    POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#pragma once
+
+#include "LeopardCommon.h"
+
+#ifdef LEO_HAS_FF16
+
+/*
+    16-bit Finite Field Math
+
+    This finite field contains 65536 elements and so each element is two bytes.
+    This library is designed for data that is a multiple of 64 bytes in size.
+
+    Algorithms are described in LeopardCommon.h
+*/
+
+namespace leopard { namespace ff16 {
+
+
+//------------------------------------------------------------------------------
+// Datatypes and Constants
+
+// Finite field element type
+typedef uint16_t ffe_t;
+
+// Number of bits per element
+static const unsigned kBits = 16;
+
+// Finite field order: Number of elements in the field
+static const unsigned kOrder = 65536;
+
+// Modulus for field operations
+static const ffe_t kModulus = 65535;
+
+// LFSR Polynomial that generates the field elements
+static const unsigned kPolynomial = 0x1002D;
+
+
+//------------------------------------------------------------------------------
+// API
+
+// Returns false if the self-test fails
+bool Initialize();
+
+void ReedSolomonEncode(
+    uint64_t buffer_bytes,
+    unsigned original_count,
+    unsigned recovery_count,
+    unsigned m, // = NextPow2(recovery_count)
+    const void* const * const data,
+    void** work); // m * 2 elements
+
+void ReedSolomonDecode(
+    uint64_t buffer_bytes,
+    unsigned original_count,
+    unsigned recovery_count,
+    unsigned m, // = NextPow2(recovery_count)
+    unsigned n, // = NextPow2(m + original_count)
+    const void* const * const original, // original_count elements
+    const void* const * const recovery, // recovery_count elements
+    void** work); // n elements
+
+
+}} // namespace leopard::ff16
+
+#endif // LEO_HAS_FF16
--- a/cpp/LeopardFF8.cpp
+++ b/cpp/LeopardFF8.cpp
--- a/cpp/LeopardFF8.h
+++ b/cpp/LeopardFF8.h
@ -0,0 +1,93 @@
+/*
+    Copyright (c) 2017 Christopher A. Taylor.  All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice,
+      this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice,
+      this list of conditions and the following disclaimer in the documentation
+      and/or other materials provided with the distribution.
+    * Neither the name of Leopard-RS nor the names of its contributors may be
+      used to endorse or promote products derived from this software without
+      specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+    ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+    POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#pragma once
+
+#include "LeopardCommon.h"
+
+#ifdef LEO_HAS_FF8
+
+/*
+    8-bit Finite Field Math
+
+    This finite field contains 256 elements and so each element is one byte.
+    This library is designed for data that is a multiple of 64 bytes in size.
+
+    Algorithms are described in LeopardCommon.h
+*/
+
+namespace leopard { namespace ff8 {
+
+
+//------------------------------------------------------------------------------
+// Datatypes and Constants
+
+// Finite field element type
+typedef uint8_t ffe_t;
+
+// Number of bits per element
+static const unsigned kBits = 8;
+
+// Finite field order: Number of elements in the field
+static const unsigned kOrder = 256;
+
+// Modulus for field operations
+static const ffe_t kModulus = 255;
+
+// LFSR Polynomial that generates the field elements
+static const unsigned kPolynomial = 0x11D;
+
+
+//------------------------------------------------------------------------------
+// API
+
+// Returns false if the self-test fails
+bool Initialize();
+
+void ReedSolomonEncode(
+    uint64_t buffer_bytes,
+    unsigned original_count,
+    unsigned recovery_count,
+    unsigned m, // = NextPow2(recovery_count)
+    const void* const * const data,
+    void** work); // m * 2 elements
+
+void ReedSolomonDecode(
+    uint64_t buffer_bytes,
+    unsigned original_count,
+    unsigned recovery_count,
+    unsigned m, // = NextPow2(recovery_count)
+    unsigned n, // = NextPow2(m + original_count)
+    const void* const * const original, // original_count elements
+    const void* const * const recovery, // recovery_count elements
+    void** work); // n elements
+
+
+}} // namespace leopard::ff8
+
+#endif // LEO_HAS_FF8
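To make the roles of `kOrder`, `kModulus`, and `kPolynomial` in LeopardFF8.h concrete, here is a minimal sketch of how an LFSR polynomial such as 0x11D generates the non-zero elements of the 8-bit field, and how multiplication then reduces to adding logarithms modulo 255. The namespace `gf256_sketch` and everything in it are hypothetical. The library itself goes on to convert its tables to a different basis as described in {1}, so these plain tables are an assumption-laden illustration, not the tables Leopard actually uses.

```cpp
#include <cassert>
#include <cstdint>

namespace gf256_sketch {

static const unsigned kOrder = 256;        // number of field elements
static const unsigned kModulus = 255;      // number of non-zero elements
static const unsigned kPolynomial = 0x11D; // x^8 + x^4 + x^3 + x^2 + 1

struct Tables
{
    uint8_t exp[kModulus]; // exp[i] = x^i reduced by the polynomial
    uint8_t log[kOrder];   // log[exp[i]] = i; log[0] is unused
};

// Step the LFSR through all 255 non-zero states, recording each one.
static Tables build_tables()
{
    Tables t = {};
    unsigned state = 1;
    for (unsigned i = 0; i < kModulus; ++i)
    {
        t.exp[i] = static_cast<uint8_t>(state);
        t.log[state] = static_cast<uint8_t>(i);
        state <<= 1;              // multiply by x
        if (state >= kOrder)
            state ^= kPolynomial; // reduce back into 8 bits
    }
    assert(state == 1); // the generator has cycled back to its start
    return t;
}

// Multiply two field elements by adding their logarithms.
static uint8_t multiply(const Tables& t, uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    const unsigned sum = static_cast<unsigned>(t.log[a]) + t.log[b];
    return t.exp[sum % kModulus];
}

} // namespace gf256_sketch
```

Addition in the field is plain XOR, so only multiplication needs tables. A 16-bit field follows the same pattern with 65536 elements and the polynomial declared in LeopardFF16.h.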
--- a/cpp/build.sh
+++ b/cpp/build.sh
@ -0,0 +1,4 @@
+#!/bin/bash
+
+g++ -O3 -std=c++11 -c *.cpp
+ar rcs libleopard.a *.o
--- a/cpp/leopard.cpp
+++ b/cpp/leopard.cpp
@ -0,0 +1,347 @@
+/*
+    Copyright (c) 2017 Christopher A. Taylor.  All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice,
+      this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice,
+      this list of conditions and the following disclaimer in the documentation
+      and/or other materials provided with the distribution.
+    * Neither the name of Leopard-RS nor the names of its contributors may be
+      used to endorse or promote products derived from this software without
+      specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+    ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+    POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#include "leopard.h"
+#include "LeopardCommon.h"
+
+#ifdef LEO_HAS_FF8
+    #include "LeopardFF8.h"
+#endif // LEO_HAS_FF8
+#ifdef LEO_HAS_FF16
+    #include "LeopardFF16.h"
+#endif // LEO_HAS_FF16
+
+#include <string.h>
+
+extern "C" {
+
+
+//------------------------------------------------------------------------------
+// Initialization API
+
+static bool m_Initialized = false;
+
+LEO_EXPORT int leo_init_(int version)
+{
+    if (version != LEO_VERSION)
+        return Leopard_InvalidInput;
+
+    leopard::InitializeCPUArch();
+
+#ifdef LEO_HAS_FF8
+    if (!leopard::ff8::Initialize())
+        return Leopard_Platform;
+#endif // LEO_HAS_FF8
+
+#ifdef LEO_HAS_FF16
+    if (!leopard::ff16::Initialize())
+        return Leopard_Platform;
+#endif // LEO_HAS_FF16
+
+
+    m_Initialized = true;
+    return Leopard_Success;
+}
+
+//------------------------------------------------------------------------------
+// Result
+
+LEO_EXPORT const char* leo_result_string(LeopardResult result)
+{
+    switch (result)
+    {
+    case Leopard_Success: return "Operation succeeded";
+    case Leopard_NeedMoreData: return "Not enough recovery data received";
+    case Leopard_TooMuchData: return "Buffer counts are too high";
+    case Leopard_InvalidSize: return "Buffer size must be a multiple of 64 bytes";
+    case Leopard_InvalidCounts: return "Invalid counts provided";
+    case Leopard_InvalidInput: return "A function parameter was invalid";
+    case Leopard_Platform: return "Platform is unsupported";
+    case Leopard_CallInitialize: return "Call leo_init() first";
+    }
+    return "Unknown";
+}
+
+
+//------------------------------------------------------------------------------
+// Encoder API
+
+LEO_EXPORT unsigned leo_encode_work_count(
+    unsigned original_count,
+    unsigned recovery_count)
+{
+    if (original_count == 1)
+        return recovery_count;
+    if (recovery_count == 1)
+        return 1;
+    return leopard::NextPow2(recovery_count) * 2;
+}
+
+// recovery_data = parity of original_data (xor sum)
+static void EncodeM1(
+    uint64_t buffer_bytes,
+    unsigned original_count,
+    const void* const * const original_data,
+    void* recovery_data)
+{
+    memcpy(recovery_data, original_data[0], buffer_bytes);
+
+    leopard::XORSummer summer;
+    summer.Initialize(recovery_data);
+
+    for (unsigned i = 1; i < original_count; ++i)
+        summer.Add(original_data[i], buffer_bytes);
+
+    summer.Finalize(buffer_bytes);
+}
+
+LEO_EXPORT LeopardResult leo_encode(
+    uint64_t buffer_bytes,                    // Number of bytes in each data buffer
+    unsigned original_count,                  // Number of original_data[] buffer pointers
+    unsigned recovery_count,                  // Number of recovery_data[] buffer pointers
+    unsigned work_count,                      // Number of work_data[] buffer pointers, from leo_encode_work_count()
+    const void* const * const original_data,  // Array of pointers to original data buffers
+    void** work_data)                         // Array of work buffers
+{
+    if (buffer_bytes <= 0 || buffer_bytes % 64 != 0)
+        return Leopard_InvalidSize;
+
+    if (recovery_count <= 0 || recovery_count > original_count)
+        return Leopard_InvalidCounts;
+
+    if (!original_data || !work_data)
+        return Leopard_InvalidInput;
+
+    if (!m_Initialized)
+        return Leopard_CallInitialize;
+
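+    // Handle k = 1 case: each recovery buffer is a copy of the single original.
+    // recovery_count <= original_count was checked above, so this loop runs once.
+    if (original_count == 1)
+    {
+        for (unsigned i = 0; i < recovery_count; ++i)
+            memcpy(work_data[i], original_data[0], buffer_bytes);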
+        return Leopard_Success;
+    }
+
+    // Handle m = 1 case
+    if (recovery_count == 1)
+    {
+        EncodeM1(
+            buffer_bytes,
+            original_count,
+            original_data,
+            work_data[0]);
+        return Leopard_Success;
+    }
+
+    const unsigned m = leopard::NextPow2(recovery_count);
+    const unsigned n = leopard::NextPow2(m + original_count);
+
+    if (work_count != m * 2)
+        return Leopard_InvalidCounts;
+
+#ifdef LEO_HAS_FF8
+    if (n <= leopard::ff8::kOrder)
+    {
+        leopard::ff8::ReedSolomonEncode(
+            buffer_bytes,
+            original_count,
+            recovery_count,
+            m,
+            original_data,
+            work_data);
+    }
+    else
+#endif // LEO_HAS_FF8
+#ifdef LEO_HAS_FF16
+    if (n <= leopard::ff16::kOrder)
+    {
+        leopard::ff16::ReedSolomonEncode(
+            buffer_bytes,
+            original_count,
+            recovery_count,
+            m,
+            original_data,
+            work_data);
+    }
+    else
+#endif // LEO_HAS_FF16
+        return Leopard_TooMuchData;
+
+    return Leopard_Success;
+}
+
+
+//------------------------------------------------------------------------------
+// Decoder API
+
+LEO_EXPORT unsigned leo_decode_work_count(
+    unsigned original_count,
+    unsigned recovery_count)
+{
+    if (original_count == 1 || recovery_count == 1)
+        return original_count;
+    const unsigned m = leopard::NextPow2(recovery_count);
+    const unsigned n = leopard::NextPow2(m + original_count);
+    return n;
+}
+
+static void DecodeM1(
+    uint64_t buffer_bytes,
+    unsigned original_count,
+    const void* const * original_data,
+    const void* recovery_data,
+    void* work_data)
+{
+    memcpy(work_data, recovery_data, buffer_bytes);
+
+    leopard::XORSummer summer;
+    summer.Initialize(work_data);
+
+    for (unsigned i = 0; i < original_count; ++i)
+        if (original_data[i])
+            summer.Add(original_data[i], buffer_bytes);
+
+    summer.Finalize(buffer_bytes);
+}
+
+LEO_EXPORT LeopardResult leo_decode(
+    uint64_t buffer_bytes,                    // Number of bytes in each data buffer
+    unsigned original_count,                  // Number of original_data[] buffer pointers
+    unsigned recovery_count,                  // Number of recovery_data[] buffer pointers
+    unsigned work_count,                      // Number of buffer pointers in work_data[]
+    const void* const * const original_data,  // Array of original data buffers
+    const void* const * const recovery_data,  // Array of recovery data buffers
+    void** work_data)                         // Array of work data buffers
+{
+    if (buffer_bytes <= 0 || buffer_bytes % 64 != 0)
+        return Leopard_InvalidSize;
+
+    if (recovery_count <= 0 || recovery_count > original_count)
+        return Leopard_InvalidCounts;
+
+    if (!original_data || !recovery_data || !work_data)
+        return Leopard_InvalidInput;
+
+    if (!m_Initialized)
+        return Leopard_CallInitialize;
+
+    // Check if not enough recovery data arrived
+    unsigned original_loss_count = 0;
+    unsigned original_loss_i = 0;
+    for (unsigned i = 0; i < original_count; ++i)
+    {
+        if (!original_data[i])
+        {
+            ++original_loss_count;
+            original_loss_i = i;
+        }
+    }
+    unsigned recovery_got_count = 0;
+    unsigned recovery_got_i = 0;
+    for (unsigned i = 0; i < recovery_count; ++i)
+    {
+        if (recovery_data[i])
+        {
+            ++recovery_got_count;
+            recovery_got_i = i;
+        }
+    }
+    if (recovery_got_count < original_loss_count)
+        return Leopard_NeedMoreData;
+
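+    // Handle case original_loss_count = 0: nothing to recover, so copy the
+    // originals through.  This must come before the k = 1 case, which reads
+    // recovery data that may not have arrived.
+    if (original_loss_count == 0)
+    {
+        for (unsigned i = 0; i < original_count; ++i)
+            memcpy(work_data[i], original_data[i], buffer_bytes);
+        return Leopard_Success;
+    }
+
+    // Handle k = 1 case: the single original was lost, and any received
+    // recovery buffer is a copy of it
+    if (original_count == 1)
+    {
+        memcpy(work_data[0], recovery_data[recovery_got_i], buffer_bytes);
+        return Leopard_Success;
+    }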
+
+    // Handle m = 1 case
+    if (recovery_count == 1)
+    {
+        DecodeM1(
+            buffer_bytes,
+            original_count,
+            original_data,
+            recovery_data[0],
+            work_data[original_loss_i]);
+        return Leopard_Success;
+    }
+
+    const unsigned m = leopard::NextPow2(recovery_count);
+    const unsigned n = leopard::NextPow2(m + original_count);
+
+    if (work_count != n)
+        return Leopard_InvalidCounts;
+
+#ifdef LEO_HAS_FF8
+    if (n <= leopard::ff8::kOrder)
+    {
+        leopard::ff8::ReedSolomonDecode(
+            buffer_bytes,
+            original_count,
+            recovery_count,
+            m,
+            n,
+            original_data,
+            recovery_data,
+            work_data);
+    }
+    else
+#endif // LEO_HAS_FF8
+#ifdef LEO_HAS_FF16
+    if (n <= leopard::ff16::kOrder)
+    {
+        leopard::ff16::ReedSolomonDecode(
+            buffer_bytes,
+            original_count,
+            recovery_count,
+            m,
+            n,
+            original_data,
+            recovery_data,
+            work_data);
+    }
+    else
+#endif // LEO_HAS_FF16
+        return Leopard_TooMuchData;
+
+    return Leopard_Success;
+}
+
+
+} // extern "C"
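The m = 1 path in leopard.cpp (`EncodeM1` / `DecodeM1`) is ordinary XOR parity: the single recovery buffer is the XOR of all originals, and one lost original is rebuilt by XOR-ing the recovery buffer with every original that did arrive. The standalone sketch below shows that idea with `std::vector` buffers. `Buffer`, `parity_encode`, and `parity_decode` are hypothetical names, and the byte-at-a-time loops stand in for the library's vectorized `xor_mem` routines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<uint8_t> Buffer;

// Recovery buffer = XOR of all original buffers (all must be the same size).
static Buffer parity_encode(const std::vector<Buffer>& originals)
{
    assert(!originals.empty());
    Buffer recovery(originals[0]);
    for (size_t i = 1; i < originals.size(); ++i)
    {
        assert(originals[i].size() == recovery.size());
        for (size_t j = 0; j < recovery.size(); ++j)
            recovery[j] ^= originals[i][j];
    }
    return recovery;
}

// Rebuild the one missing original.  A null pointer marks the lost entry,
// mirroring how leo_decode() treats null entries in original_data[].
static Buffer parity_decode(const std::vector<const Buffer*>& originals,
                            const Buffer& recovery)
{
    Buffer out(recovery);
    for (const Buffer* b : originals)
    {
        if (!b)
            continue; // skip the lost buffer
        assert(b->size() == out.size());
        for (size_t j = 0; j < out.size(); ++j)
            out[j] ^= (*b)[j];
    }
    return out;
}
```

Because XOR is its own inverse, every surviving original cancels out of the parity and only the lost buffer remains. This holds only when exactly one original is missing, which is why larger recovery counts go through the full Reed-Solomon path.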
--- a/cpp/leopard.h
+++ b/cpp/leopard.h
@ -0,0 +1,242 @@
+/*
+    Copyright (c) 2017 Christopher A. Taylor.  All rights reserved.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice,
+      this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice,
+      this list of conditions and the following disclaimer in the documentation
+      and/or other materials provided with the distribution.
+    * Neither the name of Leopard-RS nor the names of its contributors may be
+      used to endorse or promote products derived from this software without
+      specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+    AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+    ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+    CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+    POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef CAT_LEOPARD_RS_H
+#define CAT_LEOPARD_RS_H
+
+/*
+    Leopard-RS
+    MDS Reed-Solomon Erasure Correction Codes for Large Data in C
+
+    Algorithms are described in LeopardCommon.h
+
+
+    Inspired by discussion with:
+
+    Sian-Jheng Lin <sjhenglin@gmail.com> : Author of {1} {3}, basis for Leopard
+    Bulat Ziganshin <bulat.ziganshin@gmail.com> : Author of FastECC
+    Yutaka Sawada <tenfon@outlook.jp> : Author of MultiPar
+
+
+    References:
+
+    {1} S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
+    "Novel Polynomial Basis with Fast Fourier Transform
+    and Its Application to Reed-Solomon Erasure Codes"
+    IEEE Trans. on Information Theory, pp. 6284-6299, November, 2016.
+
+    {2} D. G. Cantor, "On arithmetical algorithms over finite fields",
+    Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.
+
+    {3} Sian-Jheng Lin, Wei-Ho Chung, "An Efficient (n, k) Information
+    Dispersal Algorithm for High Code Rate System over Fermat Fields,"
+    IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
+
+    {4} Plank, J. S., Greenan, K. M., Miller, E. L., "Screaming fast Galois Field
+    arithmetic using Intel SIMD instructions."  In: FAST-2013: 11th Usenix
+    Conference on File and Storage Technologies, San Jose, 2013
+*/
+
+// Library version
+#define LEO_VERSION 2
+
+// Tweak if the functions are exported or statically linked
+//#define LEO_DLL /* Defined when building/linking as DLL */
+//#define LEO_BUILDING /* Defined by the library makefile */
+
+#if defined(LEO_BUILDING)
+# if defined(LEO_DLL)
+    #define LEO_EXPORT __declspec(dllexport)
+# else
+    #define LEO_EXPORT
+# endif
+#else
+# if defined(LEO_DLL)
+    #define LEO_EXPORT __declspec(dllimport)
+# else
+    #define LEO_EXPORT extern
+# endif
+#endif
+
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+//------------------------------------------------------------------------------
+// Initialization API
+
+/*
+    leo_init()
+
+    Perform static initialization for the library, verifying that the platform
+    is supported.
+
+    Returns 0 on success and other values on failure.
+*/
+
+LEO_EXPORT int leo_init_(int version);
+#define leo_init() leo_init_(LEO_VERSION)
+
+
+//------------------------------------------------------------------------------
+// Shared Constants / Datatypes
+
+// Results
+typedef enum LeopardResultT
+{
+    Leopard_Success           =  0, // Operation succeeded
+
+    Leopard_NeedMoreData      = -1, // Not enough recovery data received
+    Leopard_TooMuchData       = -2, // Buffer counts are too high
+    Leopard_InvalidSize       = -3, // Buffer size must be a multiple of 64 bytes
+    Leopard_InvalidCounts     = -4, // Invalid counts provided
+    Leopard_InvalidInput      = -5, // A function parameter was invalid
+    Leopard_Platform          = -6, // Platform is unsupported
+    Leopard_CallInitialize    = -7, // Call leo_init() first
+} LeopardResult;
+
+// Convert Leopard result to string
+LEO_EXPORT const char* leo_result_string(LeopardResult result);
+
+
+//------------------------------------------------------------------------------
+// Encoder API
+
+/*
+    leo_encode_work_count()
+
+    Calculate the number of work_data buffers to provide to leo_encode().
+
+    The sum of original_count + recovery_count must not exceed 65536.
+
+    Returns the work_count value to pass into leo_encode().
+    Returns 0 on invalid input.
+*/
+LEO_EXPORT unsigned leo_encode_work_count(
+    unsigned original_count,
+    unsigned recovery_count);
+
+/*
+    leo_encode()
+
+    Generate recovery data.
+
+    original_count: Number of original_data[] buffers provided.
+    recovery_count: Number of desired recovery data buffers.
+    buffer_bytes:   Number of bytes in each data buffer.
+    original_data:  Array of pointers to original data buffers.
+    work_count:     Number of work_data[] buffers, from leo_encode_work_count().
+    work_data:      Array of pointers to work data buffers.
+
+    The sum of original_count + recovery_count must not exceed 65536.
+    The recovery_count must not exceed original_count.
+
+    The buffer_bytes must be a multiple of 64.
+    Each buffer should have the same number of bytes.
+    Even the last piece must be rounded up to the block size.
+
+    Let buffer_bytes = The number of bytes in each buffer:
+
+        original_count = static_cast<unsigned>(
+            ((uint64_t)total_bytes + buffer_bytes - 1) / buffer_bytes);
+
+    Or if the number of pieces is known (then round buffer_bytes up to a multiple of 64):
+
+        buffer_bytes = static_cast<unsigned>(
+            ((uint64_t)total_bytes + original_count - 1) / original_count);
+
+    Returns Leopard_Success on success.
+    * The first set of recovery_count buffers in work_data will be the result.
+    Returns other values on errors.
+*/
+LEO_EXPORT LeopardResult leo_encode(
+    uint64_t buffer_bytes,                    // Number of bytes in each data buffer
+    unsigned original_count,                  // Number of original_data[] buffer pointers
+    unsigned recovery_count,                  // Number of recovery data buffers to generate
+    unsigned work_count,                      // Number of work_data[] buffer pointers, from leo_encode_work_count()
+    const void* const * const original_data,  // Array of pointers to original data buffers
+    void** work_data);                        // Array of work buffers
+
+
+//------------------------------------------------------------------------------
+// Decoder API
+
+/*
+    leo_decode_work_count()
+
+    Calculate the number of work_data buffers to provide to leo_decode().
+
+    The sum of original_count + recovery_count must not exceed 65536.
+
+    Returns the work_count value to pass into leo_decode().
+    Returns 0 on invalid input.
+*/
+LEO_EXPORT unsigned leo_decode_work_count(
+    unsigned original_count,
+    unsigned recovery_count);
+
+/*
+    leo_decode()
+
+    Decode original data from recovery data.
+
+    buffer_bytes:   Number of bytes in each data buffer.
+    original_count: Number of original_data[] buffers provided.
+    original_data:  Array of pointers to original data buffers.
+    recovery_count: Number of recovery_data[] buffers provided.
+    recovery_data:  Array of pointers to recovery data buffers.
+    work_count:     Number of work_data[] buffers, from leo_decode_work_count().
+    work_data:      Array of pointers to work data buffers.
+
+    Lost original/recovery data should be set to NULL.
+
+    The sum of recovery_count + the number of non-NULL original data must be at
+    least original_count in order to perform recovery.
+
+    Returns Leopard_Success on success.
+    Returns other values on errors.
+*/
+LEO_EXPORT LeopardResult leo_decode(
+    uint64_t buffer_bytes,                    // Number of bytes in each data buffer
+    unsigned original_count,                  // Number of original_data[] buffer pointers
+    unsigned recovery_count,                  // Number of recovery_data[] buffer pointers
+    unsigned work_count,                      // Number of buffer pointers in work_data[]
+    const void* const * const original_data,  // Array of original data buffers
+    const void* const * const recovery_data,  // Array of recovery data buffers
+    void** work_data);                        // Array of work data buffers
+
+
+#ifdef __cplusplus
+}
+#endif
+
+
+#endif // CAT_LEOPARD_RS_H
--- a/cpp/sse2neon/LICENSE
+++ b/cpp/sse2neon/LICENSE
@ -0,0 +1,19 @@
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/cpp/sse2neon/Makefile
+++ b/cpp/sse2neon/Makefile
@ -0,0 +1,67 @@
+ifndef CXX
+override CXX = g++
+endif
+
+ifndef CROSS_COMPILE
+    processor := $(shell uname -m)
+else # CROSS_COMPILE was set
+    CXX = $(CROSS_COMPILE)g++
+    CXXFLAGS += -static
+    LDFLAGS += -static
+    check_arm := $(shell echo | $(CROSS_COMPILE)cpp -dM - | grep " __ARM_ARCH " | cut -c20-)
+    ifeq ($(check_arm),8)
+        processor = aarch64
+    else ifeq ($(check_arm),7) # detect ARMv7-A only
+        processor = arm
+    else
+        $(error Unsupported cross-compiler)
+    endif
+endif
+
+EXEC_WRAPPER =
+ifdef CROSS_COMPILE
+EXEC_WRAPPER = qemu-$(processor)
+endif
+
+# Follow platform-specific configurations
+ifeq ($(processor),$(filter $(processor),aarch64 arm64))
+    ARCH_CFLAGS = -march=armv8-a+fp+simd+crc
+else ifeq ($(processor),$(filter $(processor),i386 x86_64))
+    ARCH_CFLAGS = -maes -mpclmul -mssse3 -msse4.2
+else ifeq ($(processor),$(filter $(processor),arm armv7l))
+    ARCH_CFLAGS = -mfpu=neon
+else
+    $(error Unsupported architecture)
+endif
+
+CXXFLAGS += -Wall -Wcast-qual -I. $(ARCH_CFLAGS) -std=gnu++14
+LDFLAGS	+= -lm
+OBJS = \
+    tests/binding.o \
+    tests/common.o \
+    tests/impl.o \
+    tests/main.o
+deps := $(OBJS:%.o=%.o.d)
+
+.SUFFIXES: .o .cpp
+.cpp.o:
+	$(CXX) -o $@ $(CXXFLAGS) -c -MMD -MF $@.d $<
+
+EXEC = tests/main
+
+$(EXEC): $(OBJS)
+	$(CXX) $(LDFLAGS) -o $@ $^
+
+check: tests/main
+	$(EXEC_WRAPPER) $^
+
+indent:
+	@echo "Formatting files with clang-format.."
+	@if ! hash clang-format-12; then echo "clang-format-12 is required to indent"; fi
+	clang-format-12 -i sse2neon.h tests/*.cpp tests/*.h
+
+.PHONY: clean check indent
+clean:
+	$(RM) $(OBJS) $(EXEC) $(deps)
+
+-include $(deps)
--- a/cpp/sse2neon/README.md
+++ b/cpp/sse2neon/README.md
@ -0,0 +1,190 @@
+# sse2neon
+![Github Actions](https://github.com/DLTcollab/sse2neon/workflows/Github%20Actions/badge.svg?branch=master)
+
+A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.
+
+## Introduction
+
+`sse2neon` is a translator of Intel SSE (Streaming SIMD Extensions) intrinsics
+to [Arm NEON](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon),
+shortening the time needed to get a working Arm program that can then be used to
+extract profiles and identify hot paths in the code.
+The header file `sse2neon.h` contains several of the functions provided by Intel
+intrinsic headers such as `<xmmintrin.h>`, only implemented with NEON-based counterparts
+to produce the exact semantics of the intrinsics.
+
+## Mapping and Coverage
+
+Header file | Extension |
+---|---|
+`<mmintrin.h>` | MMX |
+`<xmmintrin.h>` | SSE |
+`<emmintrin.h>` | SSE2 |
+`<pmmintrin.h>` | SSE3 |
+`<tmmintrin.h>` | SSSE3 |
+`<smmintrin.h>` | SSE4.1 |
+`<nmmintrin.h>` | SSE4.2 |
+`<wmmintrin.h>` | AES  |
+
+`sse2neon` aims to support the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extensions.
+
+The goal is to deliver NEON-equivalent intrinsics for all widely used SSE intrinsics.
+Some SSE intrinsics map directly to a single NEON intrinsic. Others lack a
+1-to-1 mapping, which means their equivalents are implemented using several
+NEON intrinsics.
+
+For example, SSE intrinsic `_mm_loadu_si128` has a direct NEON mapping (`vld1q_s32`),
+but SSE intrinsic `_mm_maddubs_epi16` has to be implemented with 13+ NEON instructions.
+
+## Usage
+
+- Put the file `sse2neon.h` into your source code directory.
+
+- Locate the following SSE header files included in the code:
+```C
+#include <xmmintrin.h>
+#include <emmintrin.h>
+```
+  `{p,t,s,n,w}mmintrin.h` should be replaceable, though coverage of these extensions may be limited.
+
+- Replace them with:
+```C
+#include "sse2neon.h"
+```
+
+- Explicitly specify platform-specific options to gcc/clang compilers.
+  * On ARMv8-A 64-bit targets, you should specify the following compiler option: (Remove `crypto` and/or `crc` if your architecture does not support cryptographic and/or CRC32 extensions)
+  ```shell
+  -march=armv8-a+fp+simd+crypto+crc
+  ```
+  * On ARMv8-A 32-bit targets, you should specify the following compiler option:
+  ```shell
+  -mfpu=neon-fp-armv8
+  ```
+  * On ARMv7-A targets, you need to append the following compiler option:
+  ```shell
+  -mfpu=neon
+  ```
+
+## Compile-time Configurations
+
+Considering the balance between correctness and performance, `sse2neon` recognizes the following compile-time configurations:
+* `SSE2NEON_PRECISE_MINMAX`: Enable precise implementation of `_mm_min_ps` and `_mm_max_ps`. If you need consistent results such as NaN special cases, enable it.
+* `SSE2NEON_PRECISE_DIV`: Enable precise implementation of `_mm_rcp_ps` and `_mm_div_ps` by additional Newton-Raphson iteration for accuracy.
+* `SSE2NEON_PRECISE_SQRT`: Enable precise implementation of `_mm_sqrt_ps` and `_mm_rsqrt_ps` by additional Newton-Raphson iteration for accuracy.
+* `SSE2NEON_PRECISE_DP`: Enable precise implementation of `_mm_dp_pd`. When the conditional bit is not set, the corresponding multiplication would not be executed.
+
+The above are turned off by default, and you should define the corresponding macro(s) as `1` before including `sse2neon.h` if you need the precise implementations.
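+
+As a minimal sketch (the choice of `SSE2NEON_PRECISE_MINMAX` and `SSE2NEON_PRECISE_SQRT` here is only an illustration), opting in from a source file could look like this:
+```C
+/* Illustrative only: enable the precise variants this file relies on.
+ * The macros must be defined as 1 before sse2neon.h is included. */
+#define SSE2NEON_PRECISE_MINMAX 1
+#define SSE2NEON_PRECISE_SQRT 1
+#include "sse2neon.h"
+```
+Defining the macros on the compiler command line instead (for example `-DSSE2NEON_PRECISE_MINMAX=1`) should have the same effect, since they are ordinary preprocessor definitions.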
+
+## Run Built-in Test Suite
+
+`sse2neon` provides a unified interface for developing test cases. These test
+cases are located in `tests` directory, and the input data is specified at
+runtime. Use the following command to run the test cases:
+```shell
+$ make check
+```
+
+You can specify a GNU toolchain for cross compilation as well.
+[QEMU](https://www.qemu.org/) should be installed in advance.
+```shell
+$ make CROSS_COMPILE=aarch64-linux-gnu- check # ARMv8-A
+```
+or
+```shell
+$ make CROSS_COMPILE=arm-linux-gnueabihf- check # ARMv7-A
+```
+
+Check the details via [Test Suite for SSE2NEON](tests/README.md).
+
+## Adoptions
+Here is a partial list of open source projects that have adopted `sse2neon` for Arm/Aarch64 support.
+* [Aaru Data Preservation Suite](https://www.aaru.app/) is a fully-featured software package to preserve all storage media from the very old to the cutting edge, as well as to give detailed information about any supported image file (whether from Aaru or not) and to extract the files from those images.
+* [aether-game-utils](https://github.com/johnhues/aether-game-utils) is a collection of cross platform utilities for quickly creating small game prototypes in C++.
+* [ALE](https://github.com/sc932/ALE), aka Assembly Likelihood Evaluation, is a tool for evaluating accuracy of assemblies without the need of a reference genome.
+* [Apache Doris](https://doris.apache.org/) is a Massively Parallel Processing (MPP) based interactive SQL data warehousing for reporting and analysis.
+* [Apache Impala](https://impala.apache.org/) provides lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
+* [Apache Kudu](https://kudu.apache.org/) completes Hadoop's storage layer to enable fast analytics on fast data.
+* [ART](https://github.com/dinosaure/art) is an implementation in OCaml of [Adaptive Radix Tree](https://db.in.tum.de/~leis/papers/ART.pdf) (ART).
+* [Async](https://github.com/romange/async) is a set of C++ primitives that allow efficient and rapid development in C++17 on GNU/Linux systems.
+* [avec](https://github.com/unevens/avec) is a little library for using SIMD instructions on both x86 and Arm.
+* [BEAGLE](https://github.com/beagle-dev/beagle-lib) is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages.
+* [BitMagic](https://github.com/tlk00/BitMagic) implements compressed bit-vectors and containers (vectors) based on ideas of bit-slicing transform and Rank-Select compression, offering sets of methods to architect your applications to use HPC techniques to save memory (thus be able to fit more data in one compute unit) and improve storage and traffic patterns when storing data vectors and models in files or object stores.
+* [bipartite_motif_finder](https://github.com/soedinglab/bipartite_motif_finder), also known as BMF (Bipartite Motif Finder), is an open source tool for finding co-occurrences of sequence motifs in genomic sequences.
+* [Blender](https://www.blender.org/) is the free and open source 3D creation suite, supporting the entirety of the 3D pipeline.
+* [Boo](https://github.com/AxioDL/boo) is a cross-platform windowing and event manager similar to SDL or SFML, with additional 3D rendering functionality.
+* [CARTA](https://github.com/CARTAvis/carta-backend) is a new visualization tool designed for viewing radio astronomy images in CASA, FITS, MIRIAD, and HDF5 formats (using the IDIA custom schema for HDF5).
+* [Catcoon](https://github.com/i-evi/catcoon) is a [feedforward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) implementation in C.
+* [compute-runtime](https://github.com/intel/compute-runtime), the Intel Graphics Compute Runtime for oneAPI Level Zero and OpenCL Driver, provides compute API support (Level Zero, OpenCL) for Intel graphics hardware architectures (HD Graphics, Xe).
+* [Cog](https://github.com/losnoco/Cog) is a free and open source audio player for macOS.
+* [dab-cmdline](https://github.com/JvanKatwijk/dab-cmdline) provides entries for the functionality to handle Digital audio broadcasting (DAB)/DAB+ through some simple calls.
+* [DISTRHO](https://distrho.sourceforge.io/) is an open-source project for Cross-Platform Audio Plugins.
+* [EDGE](https://github.com/3dfxdev/EDGE) is an advanced OpenGL source port spawned from the DOOM engine, with focus on easy development and expansion for modders and end-users.
+* [Embree](https://github.com/embree/embree) is a collection of high-performance ray tracing kernels. Its target users are graphics application engineers who want to improve the performance of their photo-realistic rendering application by leveraging Embree's performance-optimized ray tracing kernels.
+* [emp-tool](https://github.com/emp-toolkit/emp-tool) aims to provide a benchmark for secure computation and allowing other researchers to experiment and extend.
+* [Exudyn](https://github.com/jgerstmayr/EXUDYN) is a C++ based Python library for efficient simulation of flexible multibody dynamics systems.
+* [FoundationDB](https://www.foundationdb.org) is a distributed database designed to handle large volumes of structured data across clusters of commodity servers.
+* [gmmlib](https://github.com/intel/gmmlib) is the Intel Graphics Memory Management Library that provides device specific and buffer management for the Intel Graphics Compute Runtime for OpenCL and the Intel Media Driver for VAAPI.
+* [iqtree2](https://github.com/iqtree/iqtree2) is an efficient and versatile stochastic implementation to infer phylogenetic trees by maximum likelihood.
+* [IResearch](https://github.com/iresearch-toolkit/iresearch) is a cross-platform, high-performance document oriented search engine library written entirely in C++ with the focus on a pluggability of different ranking/similarity models.
+* [kram](https://github.com/alecazam/kram) is a wrapper to several popular encoders to and from PNG/[KTX](https://www.khronos.org/opengles/sdk/tools/KTX/file_format_spec/) files with [LDR/HDR and BC/ASTC/ETC2](https://developer.arm.com/solutions/graphics-and-gaming/developer-guides/learn-the-basics/adaptive-scalable-texture-compression/single-page).
+* [libCML](https://github.com/belosthomas/libCML) is a SLAM library and scientific tool, which includes a novel fast thread-safe graph map implementation.
+* [libscapi](https://github.com/cryptobiu/libscapi) stands for the "Secure Computation API", providing reliable, efficient, and highly flexible cryptographic infrastructure.
+* [libmatoya](https://github.com/matoya/libmatoya) is a cross-platform application development library, providing various features such as common cryptography tasks.
+* [Loosejaw](https://github.com/TheHolyDiver/Loosejaw) provides deep hybrid CPU/GPU digital signal processing.
+* [Madronalib](https://github.com/madronalabs/madronalib) enables efficient audio DSP on SIMD processors with readable and brief C++ code.
+* [minimap2](https://github.com/lh3/minimap2) is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
+* [MMseqs2](https://github.com/soedinglab/MMseqs2) (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets.
+* [MRIcroGL](https://github.com/rordenlab/MRIcroGL) is a cross-platform tool for viewing NIfTI, DICOM, MGH, MHD, NRRD, AFNI format medical images.
+* [N2](https://github.com/oddconcepts/n2o) is an approximate nearest neighborhoods algorithm library written in C++, providing a much faster search speed than other implementations when modeling large datasets.
+* [nanors](https://github.com/sleepybishop/nanors) is a tiny, performant implementation of [Reed-Solomon codes](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction), capable of reaching multi-gigabit speeds on a single core.
+* [niimath](https://github.com/rordenlab/niimath) is a general image calculator with superior performance.
+* [NVIDIA GameWorks](https://developer.nvidia.com/gameworks-source-github) has already been used in many games. Its repositories are public on GitHub.
+* [ofxNDI](https://github.com/leadedge/ofxNDI) is an [openFrameworks](https://openframeworks.cc/) addon to allow sending and receiving images over a network using the [NewTek](https://en.wikipedia.org/wiki/NewTek) Network Device Protocol.
+* [OGRE](https://github.com/OGRECave/ogre) is a scene-oriented, flexible 3D engine written in C++ designed to make it easier and more intuitive for developers to produce games and demos utilising 3D hardware.
+* [Olive](https://github.com/olive-editor/olive) is a free non-linear video editor for Windows, macOS, and Linux.
+* [OpenXRay](https://github.com/OpenXRay/xray-16) is an improved version of the X-Ray engine, used in world famous S.T.A.L.K.E.R. game series by GSC Game World.
+* [parallel-n64](https://github.com/libretro/parallel-n64) is an optimized/rewritten Nintendo 64 emulator made specifically for [Libretro](https://www.libretro.com/).
+* [PFFFT](https://github.com/marton78/pffft) does 1D Fast Fourier Transforms, of single precision real and complex vectors.
+* [pixaccess](https://github.com/oliverue/pixaccess) provides the abstractions for integer and float bitmaps, pixels, and aliased (nearest neighbor) and anti-aliased (bi-linearly interpolated) pixel access.
+* [PlutoSDR Firmware](https://github.com/seanstone/plutosdr-fw) is the customized firmware for the [PlutoSDR](https://wiki.analog.com/university/tools/pluto) that can be used to introduce fundamentals of Software Defined Radio (SDR) or Radio Frequency (RF) or Communications as advanced topics in electrical engineering in a self- or instructor-led setting.
+* [Pygame](https://www.pygame.org) is cross-platform and designed to make it easy to write multimedia software, such as games, in Python.
+* [R:RandomFieldsUtils](https://cran.r-project.org/web/packages/RandomFieldsUtils) provides various utilities that might be used in spatial statistics and elsewhere. (CRAN)
+* [rkcommon](https://github.com/ospray/rkcommon) represents a common set of C++ infrastructure and CMake utilities used by various components of [Intel oneAPI Rendering Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/rendering-toolkit.html).
+* [RPCS3](https://github.com/RPCS3/rpcs3) is the world's first free and open-source PlayStation 3 emulator/debugger, written in C++.
+* [simd_utils](https://github.com/JishinMaster/simd_utils) is a header-only library implementing common mathematical functions using SIMD intrinsics.
+* [SMhasher](https://github.com/rurban/smhasher) provides comprehensive Hash function quality and speed tests.
+* [Spack](https://github.com/spack/spack) is a multi-platform package manager that builds and installs multiple versions and configurations of software.
+* [srsLTE](https://github.com/srsLTE/srsLTE) is an open source SDR LTE software suite.
+* [SSW](https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library) is a fast implementation of the [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm), which uses the SIMD instructions to parallelize the algorithm at the instruction level.
+* [Surge](https://github.com/surge-synthesizer/surge) is an open source digital synthesizer.
+* [XEVE](https://github.com/mpeg5/xeve) (eXtra-fast Essential Video Encoder) is an open sourced and fast MPEG-5 EVC encoder.
+* [XMRig](https://github.com/xmrig/xmrig) is an open source CPU miner for [Monero](https://web.getmonero.org/) cryptocurrency.
+
+## Related Projects
+* [SIMDe](https://github.com/simd-everywhere/simde): fast and portable implementations of SIMD
+  intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM.
+* [CatBoost's sse2neon](https://github.com/catboost/catboost/blob/master/library/cpp/sse/sse2neon.h)
+* [ARM\_NEON\_2\_x86\_SSE](https://github.com/intel/ARM_NEON_2_x86_SSE)
+* [AvxToNeon](https://github.com/kunpengcompute/AvxToNeon)
+* [sse2rvv](https://github.com/FeddrickAquino/sse2rvv): A C header file that converts Intel SSE intrinsics to RISC-V Vector intrinsics.
+* [sse2msa](https://github.com/i-evi/sse2msa): A C/C++ header file that converts Intel SSE intrinsics to MIPS/MIPS64 MSA intrinsics.
+* [POWER/PowerPC support for GCC](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000) contains a series of headers simplifying porting x86\_64 code that makes explicit use of Intel intrinsics to powerpc64le (pure little-endian mode that has been introduced with the [POWER8](https://en.wikipedia.org/wiki/POWER8)).
+    - implementation: [xmmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/xmmintrin.h), [emmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/emmintrin.h), [pmmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/pmmintrin.h), [tmmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/tmmintrin.h), [smmintrin.h](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/rs6000/smmintrin.h)
+
+## Reference
+* [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
+* [Arm Neon Intrinsics Reference](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics)
+* [Neon Programmer's Guide for Armv8-A](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a)
+* [NEON Programmer's Guide](https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf)
+* [qemu/target/i386/ops_sse.h](https://github.com/qemu/qemu/blob/master/target/i386/ops_sse.h): Comprehensive SSE instruction emulation in C. Ideal for semantic checks.
+* [Porting Takua Renderer to 64-bit ARM- Part 1](https://blog.yiningkarlli.com/2021/05/porting-takua-to-arm-pt1.html)
+* [Porting Takua Renderer to 64-bit ARM- Part 2](https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-pt2.html)
+* [Comparing SIMD on x86-64 and arm64](https://blog.yiningkarlli.com/2021/09/neon-vs-sse.html)
+* [Getting started with AWS Graviton](https://github.com/aws/aws-graviton-getting-started)
+* [Port with SSE2Neon and SIMDe](https://developer.arm.com/documentation/102581/0200/Port-with-SSE2Neon-and-SIMDe)
+* [Genomics: Optimizing the BWA aligner for Arm Servers](https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/optimizing-genomics-and-the-bwa-aligner-for-arm-servers)
+
+## Licensing
+
+`sse2neon` is freely redistributable under the MIT License.
--- a/cpp/sse2neon/sse2neon.h
+++ b/cpp/sse2neon/sse2neon.h
--- a/hs-leopard.cabal
+++ b/hs-leopard.cabal
@ -0,0 +1,84 @@
+Cabal-Version:        2.4
+Name:                 hs-leopard
+Version:              0.0.1
+Synopsis:             Haskell bindings to the Leopard fast erasure coding library
+
+Description:          Haskell bindings to the Leopard fast erasure coding library.
+
+License:              BSD-3-Clause
+License-files:        LICENSE
+
+Author:               Balazs Komuves
+Copyright:            (c) 2026 Logos
+Maintainer:           balazs (at) free (dot) technology
+
+Stability:            Experimental
+Category:             Cryptography
+Tested-With:          GHC == 9.12.1
+Build-Type:           Simple
+
+--------------------------------------------------------------------------------
+
+extra-source-files:   cpp/LeopardCommon.h
+                      cpp/LeopardFF16.h
+                      cpp/LeopardFF8.h
+                      cpp/leopard.h
+                      cpp/sse2neon/sse2neon.h
+                      cpp/LICENSE
+                      README.md
+                      LICENSE
+
+--------------------------------------------------------------------------------
+
+source-repository head
+  type:                 git 
+  location:             https://github.com/logos-storage/hs-leopard
+
+--------------------------------------------------------------------------------
+
+Library
+
+  Build-Depends:        base >= 4 && <5, 
+                        array >= 0.5 && < 0.6,
+                        random >= 1.3 && < 1.4,
+                        bytestring >= 0.12 && < 0.14
+
+  Exposed-Modules:      Leopard
+                        Leopard.Codec
+                        Leopard.Example
+                        Leopard.Binding
+                        Leopard.Types
+                        Leopard.Misc
+
+  Default-Language:     Haskell2010
+  Default-Extensions:   BangPatterns
+
+  Hs-Source-Dirs:       src
+  Include-Dirs:         cpp
+
+  CXX-Sources:          cpp/LeopardCommon.cpp
+                        cpp/LeopardFF16.cpp
+                        cpp/LeopardFF8.cpp
+                        cpp/leopard.cpp
+
+  Default-Extensions:   ForeignFunctionInterface, CPP
+
+  ghc-options:          -fwarn-tabs -fno-warn-unused-matches -fno-warn-name-shadowing -fno-warn-unused-imports                        
+
+  cc-options:           -x c++
+  cxx-options:          -O3 -std=c++11 -lm
+  extra-libraries:      stdc++
+
+--------------------------------------------------------------------------------
+
+Executable testMain
+
+  build-depends:       base >= 4 && < 5, 
+                       bytestring >= 0.12 && < 0.14,
+                       hs-leopard
+
+  hs-source-dirs:      test
+  main-is:             testMain.hs
+  Default-Language:    Haskell2010
+
+--------------------------------------------------------------------------------
--- a/src/Leopard.hs
+++ b/src/Leopard.hs
@ -0,0 +1,16 @@
+
+module Leopard where
+
+--------------------------------------------------------------------------------
+
+import Data.Bits
+import Data.Word
+
+import Data.ByteString (ByteString)
+import qualified Data.ByteString as B
+
+import Leopard.Codec
+import Leopard.Types
+
+--------------------------------------------------------------------------------
+
--- a/src/Leopard/Binding.hs
+++ b/src/Leopard/Binding.hs
@ -0,0 +1,258 @@
+
+-- | Note: This is an internal module; use @Leopard.Codec@ instead
+
+{-# LANGUAGE ForeignFunctionInterface, CPP, Strict, ScopedTypeVariables #-}
+module Leopard.Binding where
+
+--------------------------------------------------------------------------------
+
+import Data.Word
+import Data.Array
+import Data.Maybe
+
+import Control.Monad
+
+import Foreign.C
+import Foreign.C.Types
+import Foreign.Ptr
+import Foreign.Storable
+import Foreign.Marshal
+
+import Data.ByteString (ByteString)
+import qualified Data.ByteString as B
+
+import Leopard.Types
+import Leopard.Misc
+
+--------------------------------------------------------------------------------
+-- * error handling
+
+data LeopardResult
+  = Success                           -- ^ Operation succeeded
+  | NeedMoreData                      -- ^ Not enough recovery data received
+  | TooMuchData                       -- ^ Buffer counts are too high
+  | InvalidSize                       -- ^ Buffer size must be a multiple of 64 bytes
+  | InvalidCounts                     -- ^ Invalid counts provided
+  | InvalidInput                      -- ^ A function parameter was invalid
+  | Platform                          -- ^ Platform is unsupported
+  | CallInitialize                    -- ^ Call leo_init() first
+  deriving (Eq,Show)
+
+instance Enum LeopardResult where
+
+  toEnum ( 0)  = Success             -- Operation succeeded
+  toEnum (-1)  = NeedMoreData        -- Not enough recovery data received
+  toEnum (-2)  = TooMuchData         -- Buffer counts are too high
+  toEnum (-3)  = InvalidSize         -- Buffer size must be a multiple of 64 bytes
+  toEnum (-4)  = InvalidCounts       -- Invalid counts provided
+  toEnum (-5)  = InvalidInput        -- A function parameter was invalid
+  toEnum (-6)  = Platform            -- Platform is unsupported
+  toEnum (-7)  = CallInitialize      -- Call leo_init() first 
+
+  toEnum _     = error "invalid leopard error code"
+
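+  -- Inverse of 'toEnum': map each result back to its Leopard C error code
+  fromEnum Success        =  0
+  fromEnum NeedMoreData   = -1
+  fromEnum TooMuchData    = -2
+  fromEnum InvalidSize    = -3
+  fromEnum InvalidCounts  = -4
+  fromEnum InvalidInput   = -5
+  fromEnum Platform       = -6
+  fromEnum CallInitialize = -7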
+
+decodeLeopardResult :: LeopardResult -> Maybe String
+decodeLeopardResult result = case result of
+  Success        -> Nothing  -- "Operation succeeded"
+  NeedMoreData   -> Just "Not enough recovery data received"
+  TooMuchData    -> Just "Buffer counts are too high"
+  InvalidSize    -> Just "Buffer size must be a multiple of 64 bytes"
+  InvalidCounts  -> Just "Invalid counts provided"
+  InvalidInput   -> Just "A function parameter was invalid"
+  Platform       -> Just "Platform is unsupported"
+  CallInitialize -> Just "Call leo_init() first"
+
+--------------------------------------------------------------------------------
+-- * C++ bindings
+
+{-# NOINLINE initLeopard #-}
+initLeopard :: IO ()
+initLeopard = do
+  res <- cpp_leo_init leo_VERSION
+  if (res == 0)
+    then return ()
+    else fail "Leopard initialization failed"
+
+withLeopard :: IO a -> IO a
+withLeopard action = do
+  initLeopard
+  action
+
+unsafeEncodeIOList :: ECParams -> [ByteString] -> IO (Either LeopardResult [ByteString])
+unsafeEncodeIOList ecParams inputChunks = do
+  ei <- unsafeEncodeIO ecParams (arrayFromList inputChunks)
+  return $ case ei of
+    Left  err -> Left err
+    Right arr -> Right (elems arr) 
+
+--------------------------------------------------------------------------------
+
+-- | Takes @K@ input chunks, and returns @M@ parity chunks.
+--
+-- All chunks must have the same size, which must be a multiple of 64 bytes,
+-- as required by the underlying `leopard` library; otherwise the call fails.
+--
+{-# NOINLINE unsafeEncodeIO #-}
+unsafeEncodeIO :: ECParams -> Array Int ByteString -> IO (Either LeopardResult (Array Int ByteString))
+unsafeEncodeIO ecParams@(ECParams k n) inputChunks = do
+  let m = n - k
+  work_cnt <- cpp_leo_encode_work_count (fromIntegral k) (fromIntegral m)
+  when (work_cnt == 0) $ fail "encode: `leo_encode_work_count` claims invalid input"
+  let work_cnt_int = fromIntegral work_cnt :: Int
+
+  let nchunks       = arrayLength inputChunks
+  let sizes         = map B.length (elems inputChunks)
+  let mb_chunk_size = isUniformList sizes
+
+  unless (k == nchunks)        $ fail "encode: we need exactly K input chunks"  
+  unless (isJust mb_chunk_size) $ fail "encode: chunk size must be uniform"
+
+  let chunk_size = fromJust mb_chunk_size
+  unless (isDivisibleBy64 chunk_size) $ fail "encode: chunk size should be divisible by 64"
+
+  allocaArray nchunks $ \(porigs :: Ptr PtrWord8) -> do
+    flipZipWithM_ [0..] (elems inputChunks) $ \idx bs -> withByteString bs $ \len ptr -> pokeElemOff porigs idx ptr
+
+    allocaArrays (replicate work_cnt_int chunk_size) $ \(ptrs :: [PtrWord8]) -> do 
+      allocaArray work_cnt_int $ \(pworks :: Ptr PtrWord8) -> do
+        flipZipWithM_ [0..] ptrs $ \idx ptr -> pokeElemOff pworks idx ptr
+          
+        res <- cpp_leo_encode 
+          (fromIntegral chunk_size)     -- Number of bytes in each data buffer                                                    
+          (fromIntegral k)              -- Number of original_data[] buffer pointers                                     
+          (fromIntegral m)              -- Number of recovery_data[] buffer pointers                                     
+          (fromIntegral work_cnt)       -- Number of work_data[] buffer pointers, from leo_encode_work_count()                  
+          porigs                        -- Array of pointers to original data buffers                          
+          pworks                        -- Array of work buffers                                               
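+
+        -- keep the input buffers alive until the foreign call has returned
+        mapM_ (\bs -> withByteString bs $ \_ _ -> return ()) (elems inputChunks)
+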
+        if res /= 0 
+          then return (Left $ toEnum $ fromIntegral res)
+          else do
+            parityChunks <- forM [0..m-1] $ \j -> do
+              ptr <- peekElemOff pworks j
+              createByteString chunk_size ptr
+             
+            return $ Right $ listArray (0,m-1) parityChunks
+
+--------------------------------------------------------------------------------
+
+unsafeDecodeIOList :: ECParams -> [Maybe ByteString] -> IO (Either LeopardResult [ByteString])
+unsafeDecodeIOList ecParams mbChunks = do
+  ei <- unsafeDecodeIO ecParams (arrayFromList mbChunks)
+  return $ case ei of
+    Left  err -> Left err
+    Right arr -> Right (elems arr) 
+
+{-# NOINLINE unsafeDecodeIO #-}
+unsafeDecodeIO :: ECParams -> Array Int (Maybe ByteString) -> IO (Either LeopardResult (Array Int ByteString))
+unsafeDecodeIO ecParams@(ECParams k n) mbChunks = do
+  let m = n - k
+  work_cnt <- cpp_leo_decode_work_count (fromIntegral k) (fromIntegral m)
+  when (work_cnt == 0) $ fail "decode: `leo_decode_work_count` claims invalid input"
+  let work_cnt_int = fromIntegral work_cnt :: Int
+
+  let nchunks       = arrayLength mbChunks
+  let sizes         = map B.length (catMaybes $ elems mbChunks)
+  let mb_chunk_size = isUniformList sizes
+
+  unless (n == nchunks)         $ fail "decode: we need exactly N encoded chunks"
+  unless (isJust mb_chunk_size) $ fail "decode: chunk size must be uniform"
+
+  let chunk_size = fromJust mb_chunk_size
+  unless (isDivisibleBy64 chunk_size) $ fail "decode: chunk size should be divisible by 64"
+
+  let (origChunks,parityChunks) = splitAt k (elems mbChunks)
+
+  allocaArray k $ \(porigs :: Ptr PtrWord8) -> do
+    flipZipWithM_ [0..] origChunks $ \idx mb -> case mb of
+      Just bs  ->  withByteString bs $ \len ptr -> pokeElemOff porigs idx ptr
+      Nothing  ->  pokeElemOff porigs idx nullPtr
+
+    allocaArray m $ \(pparity :: Ptr PtrWord8) -> do
+      flipZipWithM_ [0..] parityChunks $ \idx mb -> case mb of
+        Just bs  ->  withByteString bs $ \len ptr -> pokeElemOff pparity idx ptr
+        Nothing  ->  pokeElemOff pparity idx nullPtr
+
+      allocaArrays (replicate work_cnt_int chunk_size) $ \(ptrs :: [PtrWord8]) -> do 
+        allocaArray work_cnt_int $ \(pworks :: Ptr PtrWord8) -> do
+          flipZipWithM_ [0..] ptrs $ \idx ptr -> pokeElemOff pworks idx ptr
+            
+          res <- cpp_leo_decode 
+            (fromIntegral chunk_size)     -- Number of bytes in each data buffer                                                    
+            (fromIntegral k)              -- Number of original_data[] buffer pointers                                     
+            (fromIntegral m)              -- Number of recovery_data[] buffer pointers                                     
+            (fromIntegral work_cnt)       -- Number of work_data[] buffer pointers, from leo_decode_work_count()
+            porigs                        -- Array of pointers to original data buffers                          
+            pparity                       -- Array of recovery data buffers
+            pworks                        -- Array of work buffers                                               
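+
+          -- keep the input buffers alive until the foreign call has returned
+          mapM_ (\bs -> withByteString bs $ \_ _ -> return ()) (catMaybes (elems mbChunks))
+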
+          if res /= 0 
+            then return (Left $ toEnum $ fromIntegral res)
+            else do
+              finalChunks <- forM [0..k-1] $ \j -> case origChunks!!j of
+                Just orig -> return orig
+                Nothing   -> do
+                  ptr <- peekElemOff pworks j
+                  createByteString chunk_size ptr
+               
+              return $ Right $ listArray (0,k-1) finalChunks
+
+--------------------------------------------------------------------------------
+
+type PtrWord8 = Ptr Word8
+
+leo_VERSION :: CInt
+leo_VERSION = 2
+
+foreign import ccall "leo_init_"  cpp_leo_init :: CInt -> IO CInt
+
+foreign import ccall "leo_result_string" cpp_leo_result_string :: CInt -> IO CString
+
+----------------------------------------
+
+{-
+    LEO_EXPORT unsigned leo_encode_work_count(
+        unsigned original_count,
+        unsigned recovery_count);
+-}
+
+foreign import ccall "leo_encode_work_count" cpp_leo_encode_work_count :: CUInt -> CUInt -> IO CUInt
+
+foreign import ccall "leo_decode_work_count" cpp_leo_decode_work_count :: CUInt -> CUInt -> IO CUInt
+
+----------------------------------------
+
+{-
+    LEO_EXPORT LeopardResult leo_encode(
+        uint64_t buffer_bytes,                    // Number of bytes in each data buffer
+        unsigned original_count,                  // Number of original_data[] buffer pointers
+        unsigned recovery_count,                  // Number of recovery_data[] buffer pointers
+        unsigned work_count,                      // Number of work_data[] buffer pointers, from leo_encode_work_count()
+        const void* const * const original_data,  // Array of pointers to original data buffers
+        void** work_data);                        // Array of work buffers
+-}
+
+--
+-- * `buffer_bytes` must be a multiple of 64
+-- * Each buffer should have the same number of bytes.
+-- * Even the last piece must be rounded up to the block size.
+-- * The first set of recovery_count buffers in work_data will be the result.
+--
+foreign import ccall "leo_encode" cpp_leo_encode :: Word64 -> CUInt -> CUInt -> CUInt -> Ptr (Ptr a) -> Ptr (Ptr a) -> IO CInt
+
+{-
+    LEO_EXPORT LeopardResult leo_decode(
+        uint64_t buffer_bytes,                    // Number of bytes in each data buffer
+        unsigned original_count,                  // Number of original_data[] buffer pointers
+        unsigned recovery_count,                  // Number of recovery_data[] buffer pointers
+        unsigned work_count,                      // Number of buffer pointers in work_data[]
+        const void* const * const original_data,  // Array of original data buffers
+        const void* const * const recovery_data,  // Array of recovery data buffers
+        void** work_data);                        // Array of work buffers
+-}
+
+foreign import ccall "leo_decode" cpp_leo_decode :: Word64 -> CUInt -> CUInt -> CUInt -> Ptr (Ptr a) -> Ptr (Ptr a) -> Ptr (Ptr a) -> IO CInt
+
+--------------------------------------------------------------------------------
--- a/src/Leopard/Codec.hs
+++ b/src/Leopard/Codec.hs
@ -0,0 +1,36 @@
+
+{-# LANGUAGE Strict #-}
+module Leopard.Codec
+  ( LeopardResult
+  ,
+  ) 
+  where
+
+--------------------------------------------------------------------------------
+
+import Data.Bits
+import Data.Word
+import Data.Array
+
+import Data.ByteString (ByteString)
+import qualified Data.ByteString as B
+
+import Leopard.Binding
+import Leopard.Types
+import Leopard.Misc
+
+--------------------------------------------------------------------------------
+
+{-
+{-# NOINLINE #-}
+encodeIO :: ECParams -> ByteString -> IO EncodedData
+encodeIO ecParams@(ECParams k n) input 
+
+  let m = n - k
+
+  let orig_size    = B.length input
+  let chunk_size_0 = ceilDiv orig_size k
+  let chunk_size   = roundUpToMultipleOf 64 chunk_size_0 
+-}
+
+--------------------------------------------------------------------------------
--- a/src/Leopard/Example.hs
+++ b/src/Leopard/Example.hs
@ -0,0 +1,113 @@
+
+module Leopard.Example where
+
+--------------------------------------------------------------------------------
+
+import Data.Word
+import Data.Array
+import Data.Maybe
+
+import Control.Monad
+import System.Random
+
+import Data.ByteString (ByteString)
+import qualified Data.ByteString as B
+
+import Leopard.Codec
+import Leopard.Binding
+import Leopard.Types
+import Leopard.Misc
+
+--------------------------------------------------------------------------------
+
+init_ :: IO ()
+init_ = initLeopard 
+
+--------------------------------------------------------------------------------
+
+maxChunks :: Int
+maxChunks = 20
+
+exampleLowLevel :: IO ()
+exampleLowLevel = void (exampleLowLevel' True)
+
+testLowLevel :: Int -> IO Bool
+testLowLevel howMany = do
+  oks <- replicateM howMany (exampleLowLevel' False)
+  return (and oks)
+
+exampleLowLevel' :: Bool -> IO Bool
+exampleLowLevel' doPrint = withLeopard $ do
+
+  k <- randomRIO (2,maxChunks)
+  m <- randomRIO (1,k)
+  let n = k + m
+  let ecp = ECParams
+        { _ecK = k
+        , _ecN = n
+        }
+
+  -- let chunkSize = 64
+  chunkSize <- (* 64) <$> randomRIO (1,100)
+
+  exampleLowLevel'' ecp chunkSize doPrint
+
+--------------------------------------------------------------------------------
+
+exampleLowLevel'' :: ECParams -> Int -> Bool -> IO Bool
+exampleLowLevel'' ecp@(ECParams k n) chunkSize doPrint = do
+
+  let m = n - k
+
+  when doPrint $ do
+    putStrLn "Leopard example (low level)"
+    putStrLn "---------------------------"
+    putStrLn $ "K = " ++ show k
+    putStrLn $ "N = " ++ show n
+    putStrLn $ "M = " ++ show m
+    putStrLn $ "chunk size = " ++ show chunkSize ++ " bytes"
+
+  origs  <- replicateM k (randomByteString chunkSize)
+  parity <- failIfLeft =<< unsafeEncodeIOList ecp origs
+
+  let encoded = arrayFromList (origs ++ parity)
+  nbad <- randomRIO (0,m)
+  when doPrint $ putStrLn $ "#lost chunks = " ++ show nbad
+
+  partial <- elems <$> maskRandomly nbad encoded
+  let ngood = sum [ 1 | Just _ <- partial ]
+  unless (nbad + ngood == n) $ error "fatal: nbad + ngood /= N"
+
+  -- when doPrint $ print $ map isJust partial
+
+  decoded <- failIfLeft =<< unsafeDecodeIOList ecp partial
+
+  let ok = (origs == decoded)
+  when doPrint $ putStrLn $ "reconstruction successful = " ++ show ok
+
+{-
+  when doPrint $ do
+    printChunks "original"      origs
+    printChunks "parity"        parity
+    printChunks "reconstructed" decoded
+-}
+
+  return ok
+
+--------------------------------------------------------------------------------
+
+failIfLeft :: Either LeopardResult a -> IO a
+failIfLeft (Left  err) = fail (fromMaybe (show err) (decodeLeopardResult err))
+failIfLeft (Right res) = return res
+
+--------------------------------------------------------------------------------
+
+printChunks :: String -> [ByteString] -> IO ()
+printChunks title bss = do
+  putStrLn ""
+  putStrLn title
+  putStrLn (replicate (length title) '-')
+  flipZipWithM_ [0..] bss $ \idx bs -> do
+    putStrLn $ " - " ++ show idx ++ ": " ++ byteStringToHexString bs
+
+--------------------------------------------------------------------------------
--- a/src/Leopard/Misc.hs
+++ b/src/Leopard/Misc.hs
@ -0,0 +1,155 @@
+
+{-# LANGUAGE Strict #-}
+module Leopard.Misc where
+
+--------------------------------------------------------------------------------
+
+import Data.Bits
+import Data.Word
+import Data.Array
+
+import Control.Monad
+import System.Random
+
+import Foreign.Ptr
+import Foreign.ForeignPtr
+import Foreign.Marshal
+import Foreign.Storable
+
+import Text.Printf
+
+import Data.ByteString (ByteString)
+import qualified Data.ByteString          as B
+import qualified Data.ByteString.Internal as BI
+
+--------------------------------------------------------------------------------
+-- * Integer logarithm
+
+-- | Largest integer @k@ such that @2^k@ is less than or equal to @n@ (returns @-1@ for @n = 0@)
+integerLog2' :: Integer -> Int
+integerLog2' n = go n where
+  go 0 = -1
+  go k = 1 + go (shiftR k 1)
+
+-- | Smallest integer @k@ such that @2^k@ is greater than or equal to @n@
+ceilingLog2' :: Integer -> Int
+ceilingLog2' 0 = 0
+ceilingLog2' n = 1 + go (n-1) where
+  go 0 = -1
+  go k = 1 + go (shiftR k 1)
+
+integerLog2 :: Int -> Int
+integerLog2 = integerLog2' . fromIntegral
+
+ceilingLog2 :: Int -> Int
+ceilingLog2 = ceilingLog2' . fromIntegral
+  
+--------------------------------------------------------------------------------
+-- * Division
+
+-- | @ceil( a / b )@
+ceilDiv :: Int -> Int -> Int
+ceilDiv a b = div (a+b-1) b
+
+isDivisibleBy64 :: Int -> Bool
+isDivisibleBy64 n = (mod n 64 == 0)
+
+-- | @roundUpToMultipleOf size x@ rounds @x@ up to the nearest multiple of @size@
+roundUpToMultipleOf :: Int -> Int -> Int
+roundUpToMultipleOf size x = size * (ceilDiv x size)
+
+--------------------------------------------------------------------------------
+-- * Bytestrings
+
+partitionBS :: Int -> ByteString -> [ByteString]
+partitionBS len = go where
+  go :: ByteString -> [ByteString]
+  go bs = if B.null bs
+    then []
+    else B.take len bs : go (B.drop len bs)
+
+withByteString :: ByteString -> (Int -> Ptr Word8 -> IO a) -> IO a
+withByteString bs@(BI.BS fptr len) action = 
+  withForeignPtr fptr $ \ptr -> action len ptr
+
+createByteString :: Int -> Ptr Word8 -> IO ByteString
+createByteString len src = BI.create len $ \tgt -> copyBytes tgt src len
+
+randomByteString :: Int -> IO ByteString
+randomByteString len = do
+  xs <- replicateM len randomIO :: IO [Word8]
+  return (B.pack xs)
+
+byteStringToHexString :: ByteString -> String
+byteStringToHexString = concatMap f . B.unpack where
+  f :: Word8 -> String
+  f = printf "%02x"
+
+--------------------------------------------------------------------------------
+-- * Arrays
+
+arrayLength :: Array Int a -> Int
+arrayLength arr = let (u,v) = bounds arr in v - u + 1 
+
+arrayFromList :: [a] -> Array Int a
+arrayFromList xs = listArray (0,length xs - 1) xs
+
+--------------------------------------------------------------------------------
+-- * Random masks
+
+-- | There will be @k@ @Nothing@-s in the resulting array
+maskRandomly :: Int -> Array Int a -> IO (Array Int (Maybe a))
+maskRandomly k arr = do
+  mask <- randomBoolMask (arrayLength arr) k
+  let (u,v) = bounds arr
+  return $ listArray (u,v) 
+    [ if b then Just x else Nothing | (x,b) <- zip (elems arr) (elems mask) ]
+
+-- | @randomBoolMask n k@ returns an array with @k@ @False@-s and @(n-k)@ @True@-s.
+-- Requires @0 <= k <= n@, otherwise it does not terminate.
+randomBoolMask :: Int -> Int -> IO (Array Int Bool)
+randomBoolMask n k = go k trues where
+
+  trues :: Array Int Bool
+  trues = listArray (0,n-1) (replicate n True)
+
+  go :: Int -> Array Int Bool -> IO (Array Int Bool)
+  go 0 arr = return arr
+  go k arr = do
+    j <- randomRIO (0,n-1)
+    case arr!j of 
+      True  -> go (k-1) (arr // [(j,False)])
+      False -> go  k     arr
+
+--------------------------------------------------------------------------------
+-- * Marshal
+
+allocaArrays :: Storable a => [Int] -> ([Ptr a] -> IO b) -> IO b
+allocaArrays sizes action = go sizes [] where
+  go []     ptrs = action (reverse ptrs)
+  go (k:ks) ptrs = allocaArray k $ \ptr -> go ks (ptr : ptrs)
+
+--------------------------------------------------------------------------------
+-- * Monad
+
+flipZipWithM_ :: Monad m => [a] -> [b] -> (a -> b -> m ()) -> m ()
+flipZipWithM_ xs ys action = zipWithM_ action xs ys
+
+--------------------------------------------------------------------------------
+-- * Misc
+
+-- | If all the elements of the input list are the same, returns that element,
+-- otherwise @Nothing@. Calling it on an empty list is an error.
+isUniformList :: Eq a => [a] -> Maybe a
+isUniformList [] = error "isUniformList: empty input"
+isUniformList (x0:x0s) = go x0s where
+  go []     = Just x0
+  go (u:us) = if u == x0 
+    then go us
+    else Nothing
+
+isUniformList_ :: Eq a => [a] -> a
+isUniformList_ xs = case isUniformList xs of
+  Just x  -> x
+  Nothing -> error "isUniformList_: not a uniform list"
+
+--------------------------------------------------------------------------------
+
--- a/src/Leopard/Types.hs
+++ b/src/Leopard/Types.hs
@ -0,0 +1,62 @@
+
+{-# LANGUAGE Strict #-}
+module Leopard.Types where
+
+--------------------------------------------------------------------------------
+
+import Data.Bits
+import Data.Word
+import Data.Array
+
+import Data.ByteString (ByteString)
+import qualified Data.ByteString as B
+
+import Leopard.Misc
+
+--------------------------------------------------------------------------------
+
+-- | Note: Because of a restriction of the underlying Leopard library, you should have
+-- @K >= 2@, @N <= 2*K@ and @N <= 65536@.
+data ECParams = ECParams
+  { _ecK :: Int             -- ^ @K@ is the number of original chunks
+  , _ecN :: Int             -- ^ @N@ is the number of chunks after encoding
+  }
+  deriving (Eq,Show)
+
+-- | Number of \"parity\" chunks
+ecM :: ECParams -> Int           
+ecM params = _ecN params - _ecK params
+
+isValidECParams :: ECParams -> Bool
+isValidECParams (ECParams k n) = and
+  [ k >  1
+  , k <= 32768
+  , k <  n 
+  , n <= 2 * k
+  ]
+
+--------------------------------------------------------------------------------
+
+data Encoding = Encoding
+  { _ecParams     :: ECParams        -- ^ the erasure coding parameters
+  , _chunkSize    :: Int             -- ^ size of an EC chunk
+  , _origDataSize :: Int             -- ^ if not divisible by @K@, it can be smaller than @K x chunkSize@
+  }
+  deriving (Eq,Show)
+
+isValidEncoding :: Encoding -> Bool
+isValidEncoding (Encoding params@(ECParams k n) chunkSize dataSize) = and
+  [ isValidECParams params
+  , chunkSize == roundUpToMultipleOf 64 (ceilDiv dataSize k)
+  , isDivisibleBy64 chunkSize
+  ]
+
+--------------------------------------------------------------------------------
+
+data EncodedData = EncodedData 
+  { _encoding :: Encoding
+  , _chunks   :: Array Int ByteString
+  }
+  deriving (Eq,Show)
+
+--------------------------------------------------------------------------------
--- a/test/testMain.hs
+++ b/test/testMain.hs
@ -0,0 +1,16 @@
+
+module Main where
+
+--------------------------------------------------------------------------------
+
+import Leopard.Codec
+import Leopard.Example
+
+--------------------------------------------------------------------------------
+
+main :: IO ()
+main = do
+  exampleLowLevel 
+
+--------------------------------------------------------------------------------
+