BLAKE2b-SIMD
============

Pure Go implementation of BLAKE2b using SIMD optimizations.

Introduction
------------

This package was initially based on the pure Go [BLAKE2b](https://github.com/dchest/blake2b) implementation by Dmitry Chestnykh, merged with the (`cgo`-dependent) AVX-optimized [BLAKE2](https://github.com/codahale/blake2) implementation, which in turn is based on the [official implementation](https://github.com/BLAKE2/BLAKE2). The merge was done using [Go's assembler](https://golang.org/doc/asm) for amd64 architectures, with a pure Go fallback for other architectures.

In addition to AVX, there is also support for AVX2 and SSE. AVX2 gives the best performance: roughly a **4x** increase over the pure Go version, approaching hashing speeds of **1 GB/sec** on a single core.

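The arrangement described above (assembly on amd64, pure Go elsewhere, with AVX2 preferred over AVX and SSE) can be pictured as a one-time choice of compression routine. The following is a minimal sketch under stated assumptions: `cpuFeatures` and `selectImpl` are hypothetical names invented for illustration, not this package's API, and the feature flags are hard-coded rather than detected.

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuFeatures is a hypothetical stand-in for real CPU feature detection.
type cpuFeatures struct {
	AVX2, AVX, SSE bool
}

// selectImpl names the compression routine a dispatcher might pick.
// Anything other than amd64 gets the pure Go fallback.
func selectImpl(goarch string, f cpuFeatures) string {
	if goarch != "amd64" {
		return "generic"
	}
	switch {
	case f.AVX2:
		return "avx2"
	case f.AVX:
		return "avx"
	case f.SSE:
		return "sse"
	}
	return "generic"
}

func main() {
	// Flags are hard-coded here; a real package would query the CPU.
	f := cpuFeatures{AVX: true, SSE: true}
	fmt.Println(runtime.GOARCH, "->", selectImpl(runtime.GOARCH, f))
}
```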
Benchmarks
----------

This is a summary of the speedups over the pure Go implementation when hashing 128K inputs. Full details are given under Detailed benchmarks.

| Technology | Speedup at 128K |
| ---------- |:---------------:|
| AVX2       | 3.94x           |
| AVX        | 3.28x           |
| SSE        | 2.85x           |

asm2plan9s
----------

To make it easier to work with AVX2/AVX instructions, a separate tool was developed that converts them into the corresponding BYTE sequences accepted by the Go assembler. See [asm2plan9s](https://github.com/minio/asm2plan9s) for more information.

bt2sum
------

[bt2sum](https://github.com/s3git/bt2sum) is a utility that takes advantage of the BLAKE2b SIMD optimizations to compute checksums using the BLAKE2 tree hashing mode in so-called 'unlimited fanout' mode.

Technical details
-----------------

BLAKE2b is a hashing algorithm that operates on 64-bit integer values. The AVX2 version uses the 256-bit wide YMM registers to process four operations in parallel. The AVX and SSE versions use the 128-bit wide XMM registers and process two operations in parallel. Below are excerpts from `compressAvx2_amd64.s`, `compressAvx_amd64.s`, and `compress_generic.go` respectively.

```
VPADDQ  YMM0,YMM0,YMM1   /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */
```

```
VPADDQ  XMM0,XMM0,XMM2   /* v0 += v4, v1 += v5 */
VPADDQ  XMM1,XMM1,XMM3   /* v2 += v6, v3 += v7 */
```

```
v0 += v4
v1 += v5
v2 += v6
v3 += v7
```

Detailed benchmarks
-------------------

Example performance metrics were generated on an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (6 physical cores, 12 logical cores) running Ubuntu GNU/Linux with kernel version 4.4.0-24-generic (vanilla with no optimizations).

### AVX2

```
$ benchcmp go.txt avx2.txt
benchmark                old ns/op     new ns/op     delta
BenchmarkHash64-12       1481          849           -42.67%
BenchmarkHash128-12      1428          746           -47.76%
BenchmarkHash1K-12       6379          2227          -65.09%
BenchmarkHash8K-12       37219         11714         -68.53%
BenchmarkHash32K-12      140716        35935         -74.46%
BenchmarkHash128K-12     561656        142634        -74.60%

benchmark                old MB/s     new MB/s     speedup
BenchmarkHash64-12       43.20        75.37        1.74x
BenchmarkHash128-12      89.64        171.35       1.91x
BenchmarkHash1K-12       160.52       459.69       2.86x
BenchmarkHash8K-12       220.10       699.32       3.18x
BenchmarkHash32K-12      232.87       911.85       3.92x
BenchmarkHash128K-12     233.37       918.93       3.94x
```

### AVX2: Comparison to other hashing techniques

```
$ go test -bench=Comparison
BenchmarkComparisonMD5-12        1000    1726121 ns/op    607.48 MB/s
BenchmarkComparisonSHA1-12        500    2005164 ns/op    522.94 MB/s
BenchmarkComparisonSHA256-12      300    5531036 ns/op    189.58 MB/s
BenchmarkComparisonSHA512-12      500    3423030 ns/op    306.33 MB/s
BenchmarkComparisonBlake2B-12    1000    1232690 ns/op    850.64 MB/s
```

The AVX and SSE benchmarks below were generated on a MacBook Pro with a 2.7 GHz Intel Core i7.

### AVX

```
$ benchcmp go.txt avx.txt
benchmark               old ns/op     new ns/op     delta
BenchmarkHash64-8       813           458           -43.67%
BenchmarkHash128-8      766           401           -47.65%
BenchmarkHash1K-8       4881          1763          -63.88%
BenchmarkHash8K-8       36127         12273         -66.03%
BenchmarkHash32K-8      140582        43155         -69.30%
BenchmarkHash128K-8     567850        173246        -69.49%

benchmark               old MB/s     new MB/s     speedup
BenchmarkHash64-8       78.63        139.57       1.78x
BenchmarkHash128-8      166.98       318.73       1.91x
BenchmarkHash1K-8       209.76       580.68       2.77x
BenchmarkHash8K-8       226.76       667.46       2.94x
BenchmarkHash32K-8      233.09       759.29       3.26x
BenchmarkHash128K-8     230.82       756.56       3.28x
```

### SSE

```
$ benchcmp go.txt sse.txt
benchmark               old ns/op     new ns/op     delta
BenchmarkHash64-8       813           478           -41.21%
BenchmarkHash128-8      766           411           -46.34%
BenchmarkHash1K-8       4881          1870          -61.69%
BenchmarkHash8K-8       36127         12427         -65.60%
BenchmarkHash32K-8      140582        49512         -64.78%
BenchmarkHash128K-8     567850        199040        -64.95%

benchmark               old MB/s     new MB/s     speedup
BenchmarkHash64-8       78.63        133.78       1.70x
BenchmarkHash128-8      166.98       311.23       1.86x
BenchmarkHash1K-8       209.76       547.37       2.61x
BenchmarkHash8K-8       226.76       659.20       2.91x
BenchmarkHash32K-8      233.09       661.81       2.84x
BenchmarkHash128K-8     230.82       658.52       2.85x
```

License
-------

Released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing
------------

Contributions are welcome; please send PRs for any enhancements.