BLAKE2b-SIMD
============

Pure Go implementation of BLAKE2b using SIMD optimizations.

Introduction
------------

This package was initially based on the pure Go [BLAKE2b](https://github.com/dchest/blake2b) implementation of Dmitry Chestnykh and merged with the (`cgo` dependent) AVX optimized [BLAKE2](https://github.com/codahale/blake2) implementation (which in turn is based on the [official implementation](https://github.com/BLAKE2/BLAKE2)). It does so by using [Go's Assembler](https://golang.org/doc/asm) for amd64 architectures, with a Go-only fallback for other architectures.

In addition to AVX there is also support for AVX2 as well as SSE. Best performance is obtained with AVX2, which gives roughly a **4X** performance increase, approaching hashing speeds of **1 GB/sec** on a single core.

Benchmarks
----------

This is a summary of the performance improvements. Full details are shown below.

| Technology | 128K  |
| ---------- |:-----:|
| AVX2       | 3.94x |
| AVX        | 3.28x |
| SSE        | 2.85x |

asm2plan9s
----------

In order to work more easily with AVX2/AVX instructions, a separate tool was developed to convert AVX2/AVX instructions into the corresponding BYTE sequence as accepted by Go assembly. See [asm2plan9s](https://github.com/minio/asm2plan9s) for more information.

bt2sum
------

[bt2sum](https://github.com/s3git/bt2sum) is a utility that takes advantage of the BLAKE2b SIMD optimizations to compute checksums using the BLAKE2 Tree hashing mode in so-called 'unlimited fanout' mode.

Technical details
-----------------

BLAKE2b is a hashing algorithm that operates on 64-bit integer values. The AVX2 version uses the 256-bit wide YMM registers to process four 64-bit operations in parallel. The AVX and SSE versions use the 128-bit wide XMM registers and process two operations in parallel. Below are excerpts from `compressAvx2_amd64.s`, `compressAvx_amd64.s`, and `compress_generic.go` respectively.

```
VPADDQ YMM0,YMM0,YMM1   /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */
```

```
VPADDQ XMM0,XMM0,XMM2   /* v0 += v4, v1 += v5 */
VPADDQ XMM1,XMM1,XMM3   /* v2 += v6, v3 += v7 */
```

```
v0 += v4
v1 += v5
v2 += v6
v3 += v7
```

Detailed benchmarks
-------------------

Example performance metrics were generated on an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (6 physical cores, 12 logical cores) running Ubuntu GNU/Linux with kernel version 4.4.0-24-generic (vanilla with no optimizations).

### AVX2

```
$ benchcmp go.txt avx2.txt
benchmark                old ns/op     new ns/op     delta
BenchmarkHash64-12       1481          849           -42.67%
BenchmarkHash128-12      1428          746           -47.76%
BenchmarkHash1K-12       6379          2227          -65.09%
BenchmarkHash8K-12       37219         11714         -68.53%
BenchmarkHash32K-12      140716        35935         -74.46%
BenchmarkHash128K-12     561656        142634        -74.60%

benchmark                old MB/s     new MB/s     speedup
BenchmarkHash64-12       43.20        75.37        1.74x
BenchmarkHash128-12      89.64        171.35       1.91x
BenchmarkHash1K-12       160.52       459.69       2.86x
BenchmarkHash8K-12       220.10       699.32       3.18x
BenchmarkHash32K-12      232.87       911.85       3.92x
BenchmarkHash128K-12     233.37       918.93       3.94x
```

### AVX2: Comparison to other hashing techniques

```
$ go test -bench=Comparison
BenchmarkComparisonMD5-12        1000    1726121 ns/op    607.48 MB/s
BenchmarkComparisonSHA1-12        500    2005164 ns/op    522.94 MB/s
BenchmarkComparisonSHA256-12      300    5531036 ns/op    189.58 MB/s
BenchmarkComparisonSHA512-12      500    3423030 ns/op    306.33 MB/s
BenchmarkComparisonBlake2B-12    1000    1232690 ns/op    850.64 MB/s
```

Benchmarks below were generated on a MacBook Pro with a 2.7 GHz Intel Core i7.

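As a cross-check on the `benchcmp` tables in this section, the MB/s and speedup columns can be recomputed from the ns/op columns. The sketch below is not part of this package; the helper names `throughputMBps` and `speedup` are hypothetical, and it assumes that the 128K benchmark hashes 128 * 1024 bytes per operation and that 1 MB means 1,000,000 bytes, which is consistent with the figures reported above.

```go
package main

import "fmt"

// throughputMBps converts bytes hashed per operation and the time per
// operation (in nanoseconds) into MB/s, with 1 MB = 1,000,000 bytes.
// (Hypothetical helper, for illustration only.)
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	// bytes/ns * 1e9 ns/s / 1e6 bytes/MB == bytes/ns * 1e3
	return float64(bytesPerOp) / nsPerOp * 1e3
}

// speedup is the ratio of the old time per operation to the new one.
// (Hypothetical helper, for illustration only.)
func speedup(oldNsPerOp, newNsPerOp float64) float64 {
	return oldNsPerOp / newNsPerOp
}

func main() {
	// Assumption: BenchmarkHash128K processes 128 * 1024 bytes per op.
	const size = 128 * 1024

	// ns/op values taken from the AVX2 BenchmarkHash128K-12 row.
	const oldNs, newNs = 561656.0, 142634.0

	fmt.Printf("pure Go: %.2f MB/s\n", throughputMBps(size, oldNs))
	fmt.Printf("AVX2:    %.2f MB/s\n", throughputMBps(size, newNs))
	fmt.Printf("speedup: %.2fx\n", speedup(oldNs, newNs))
}
```

The results should agree with the reported columns to within rounding, since the reported ns/op values are themselves rounded.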
### AVX

```
$ benchcmp go.txt avx.txt
benchmark               old ns/op     new ns/op     delta
BenchmarkHash64-8       813           458           -43.67%
BenchmarkHash128-8      766           401           -47.65%
BenchmarkHash1K-8       4881          1763          -63.88%
BenchmarkHash8K-8       36127         12273         -66.03%
BenchmarkHash32K-8      140582        43155         -69.30%
BenchmarkHash128K-8     567850        173246        -69.49%

benchmark               old MB/s     new MB/s     speedup
BenchmarkHash64-8       78.63        139.57       1.78x
BenchmarkHash128-8      166.98       318.73       1.91x
BenchmarkHash1K-8       209.76       580.68       2.77x
BenchmarkHash8K-8       226.76       667.46       2.94x
BenchmarkHash32K-8      233.09       759.29       3.26x
BenchmarkHash128K-8     230.82       756.56       3.28x
```

### SSE

```
$ benchcmp go.txt sse.txt
benchmark               old ns/op     new ns/op     delta
BenchmarkHash64-8       813           478           -41.21%
BenchmarkHash128-8      766           411           -46.34%
BenchmarkHash1K-8       4881          1870          -61.69%
BenchmarkHash8K-8       36127         12427         -65.60%
BenchmarkHash32K-8      140582        49512         -64.78%
BenchmarkHash128K-8     567850        199040        -64.95%

benchmark               old MB/s     new MB/s     speedup
BenchmarkHash64-8       78.63        133.78       1.70x
BenchmarkHash128-8      166.98       311.23       1.86x
BenchmarkHash1K-8       209.76       547.37       2.61x
BenchmarkHash8K-8       226.76       659.20       2.91x
BenchmarkHash32K-8      233.09       661.81       2.84x
BenchmarkHash128K-8     230.82       658.52       2.85x
```

License
-------

Released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing
------------

Contributions are welcome, please send PRs for any enhancements.