Update eip-1057.md with code walk through. (#1073)

* Update eip-1057.md with code walkthrough

merge in code walkthrough from ProgPoW readme.md

* Update eip-1057.md to summarize algorithm

Summarized key change for algorithm vs ethash

* Update eip-1057.md to clarify keccak details

* Update eip-1057.md to correct typo intro

* Update eip-1057.md to add hashrate units (MH/s)
ifdefelse 2018-05-18 18:13:50 +08:00 committed by Nick Johnson
parent e73e5e41aa
commit a09553faca
1 changed files with 238 additions and 13 deletions


@ -33,11 +33,15 @@ As miner rewards are reduced with Casper FFG, it will remain profitable to mine
## Specification
ProgPoW is based on Ethash. The algorithm has five main elements, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC.
ProgPoW is based on Ethash and follows the same general structure. The algorithm has five main changes from Ethash, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC.
The name of the algorithm comes from the fact that the inner loop between global memory accesses is a randomly generated program based on the block number. The random program is designed to both run efficiently on commodity GPUs and also cover most of the GPU's functionality. The random program sequence prevents the creation of a fixed pipeline implementation as seen in a specialized ASIC. The access size has also been tweaked to match contemporary GPUs.
In contrast to Ethash, the changes detailed below make ProgPoW dependent on the core compute capabilities in addition to memory bandwidth and size.
**Changes keccak_f1600 (with 64-bit words) to keccak_f800 (with 32-bit words).**
*On 64-bit architectures f1600 processes twice as many bits as f800 in roughly the same time. As GPUs are natively 32-bit architectures, f1600 takes twice as long as f800. ProgPoW doesn't require all the bits f1600 can consume, so reducing the size reduces the optimization opportunity for a specialized ASIC.*
**Increases mix state.**
@ -70,6 +74,227 @@ ProgPoW uses **FNV1a** for merging data. The existing Ethash uses FNV1 for mergi
ProgPoW uses [KISS99](https://en.wikipedia.org/wiki/KISS_(algorithm)) for random number generation. This is the simplest (fewest instructions) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.
```cpp
uint32_t fnv1a(uint32_t &h, uint32_t d)
{
return h = (h ^ d) * 0x1000193;
}
typedef struct {
uint32_t z, w, jsr, jcong;
} kiss99_t;
// KISS99 is simple, fast, and passes the TestU01 suite
// https://en.wikipedia.org/wiki/KISS_(algorithm)
// http://www.cse.yorku.ca/~oz/marsaglia-rng.html
uint32_t kiss99(kiss99_t &st)
{
uint32_t znew = (st.z = 36969 * (st.z & 65535) + (st.z >> 16));
uint32_t wnew = (st.w = 18000 * (st.w & 65535) + (st.w >> 16));
uint32_t MWC = ((znew << 16) + wnew);
uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5));
uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567);
return ((MWC^CONG) + SHR3);
}
```
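As a usage sketch, the two primitives can be combined the way `progPowInit` later in this section seeds its generator: FNV1a is chained over the two halves of a 64-bit seed, and each intermediate hash becomes one word of KISS99 state. The helper name `seed_kiss99` is an assumption made for this example and is not part of the reference code.

```cpp
#include <cassert>
#include <cstdint>

// Same definitions as above, repeated so the sketch compiles on its own
uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * 0x1000193;
}

typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

uint32_t kiss99(kiss99_t &st)
{
    uint32_t znew = (st.z = 36969 * (st.z & 65535) + (st.z >> 16));
    uint32_t wnew = (st.w = 18000 * (st.w & 65535) + (st.w >> 16));
    uint32_t MWC = ((znew << 16) + wnew);
    uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5));
    uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567);
    return ((MWC ^ CONG) + SHR3);
}

// Hypothetical helper: derive a KISS99 state from a 64-bit seed.
// fnv1a updates h in place, so each call also depends on all earlier inputs.
kiss99_t seed_kiss99(uint64_t seed)
{
    uint32_t h = 0x811c9dc5; // FNV offset basis
    kiss99_t st;
    st.z = fnv1a(h, (uint32_t)seed);
    st.w = fnv1a(h, (uint32_t)(seed >> 32));
    st.jsr = fnv1a(h, (uint32_t)seed);
    st.jcong = fnv1a(h, (uint32_t)(seed >> 32));
    return st;
}
```

Because `fnv1a` here mixes a whole 32-bit word per step, it matches the byte-oriented FNV-1a only when the input word fits in one byte. Two generators seeded with the same value produce the same sequence, which is what lets every miner derive the same random program.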
The `LANES*REGS` of mix data is initialized from the hash's seed.
```cpp
void fill_mix(
uint64_t hash_seed,
uint32_t lane_id,
uint32_t mix[PROGPOW_REGS]
)
{
// Use FNV to expand the per-warp seed to per-lane
// Use KISS to expand the per-lane seed to fill mix
uint32_t fnv_hash = 0x811c9dc5;
kiss99_t st;
st.z = fnv1a(fnv_hash, hash_seed);
st.w = fnv1a(fnv_hash, hash_seed >> 32);
st.jsr = fnv1a(fnv_hash, lane_id);
st.jcong = fnv1a(fnv_hash, lane_id);
for (int i = 0; i < PROGPOW_REGS; i++)
mix[i] = kiss99(st);
}
```
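A minimal host-side sketch of filling the whole `LANES*REGS` array, one row per lane. The values `PROGPOW_LANES = 32` and `PROGPOW_REGS = 16` are illustrative assumptions, and the wrapper name `fill_all_lanes` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative values only; assumptions for this sketch
constexpr uint32_t PROGPOW_LANES = 32;
constexpr uint32_t PROGPOW_REGS = 16;

uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * 0x1000193;
}

typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

uint32_t kiss99(kiss99_t &st)
{
    uint32_t znew = (st.z = 36969 * (st.z & 65535) + (st.z >> 16));
    uint32_t wnew = (st.w = 18000 * (st.w & 65535) + (st.w >> 16));
    uint32_t MWC = ((znew << 16) + wnew);
    uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5));
    uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567);
    return ((MWC ^ CONG) + SHR3);
}

// Fill one lane's registers: FNV1a expands the shared seed plus the lane id
// into a KISS99 state, and KISS99 then produces the register values
void fill_mix(uint64_t hash_seed, uint32_t lane_id, uint32_t mix[PROGPOW_REGS])
{
    uint32_t fnv_hash = 0x811c9dc5;
    kiss99_t st;
    st.z = fnv1a(fnv_hash, (uint32_t)hash_seed);
    st.w = fnv1a(fnv_hash, (uint32_t)(hash_seed >> 32));
    st.jsr = fnv1a(fnv_hash, lane_id);
    st.jcong = fnv1a(fnv_hash, lane_id);
    for (uint32_t i = 0; i < PROGPOW_REGS; i++)
        mix[i] = kiss99(st);
}

// Hypothetical wrapper: every lane shares the seed but gets its own row
void fill_all_lanes(uint64_t hash_seed, uint32_t mix[PROGPOW_LANES][PROGPOW_REGS])
{
    for (uint32_t l = 0; l < PROGPOW_LANES; l++)
        fill_mix(hash_seed, l, mix[l]);
}
```

Only `jsr` and `jcong` depend on the lane id, so lanes share the `z` and `w` words of the generator state and diverge through the other two.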
The main search algorithm uses the Keccak sponge function (a width of 800 bits, with a bitrate of 448, and a capacity of 352) to generate a seed, expands the seed, does a sequence of loads and random math on the mix data, and then compresses the result into a final Keccak permutation (with the same parameters as the first) for target comparison.
```cpp
bool progpow_search(
const uint64_t prog_seed,
const uint64_t nonce,
const hash32_t header,
const uint64_t target,
const uint64_t *g_dag, // gigabyte DAG located in framebuffer
const uint32_t *c_dag // kilobyte DAG located in l1 cache
)
{
uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
uint32_t result[4];
for (int i = 0; i < 4; i++)
result[i] = 0;
// keccak(header..nonce)
uint64_t seed = keccak_f800(header, nonce, result);
// initialize mix for all lanes
for (int l = 0; l < PROGPOW_LANES; l++)
fill_mix(seed, l, mix[l]);
// execute the randomly generated inner loop
for (int i = 0; i < PROGPOW_CNT_MEM; i++)
progPowLoop(prog_seed, i, mix, g_dag, c_dag);
// Reduce mix data to a single per-lane result
uint32_t lane_hash[PROGPOW_LANES];
for (int l = 0; l < PROGPOW_LANES; l++)
{
lane_hash[l] = 0x811c9dc5;
for (int i = 0; i < PROGPOW_REGS; i++)
fnv1a(lane_hash[l], mix[l][i]);
}
// Reduce all lanes to a single 128-bit result
for (int i = 0; i < 4; i++)
result[i] = 0x811c9dc5;
for (int l = 0; l < PROGPOW_LANES; l++)
fnv1a(result[l%4], lane_hash[l]);
// keccak(header .. keccak(header..nonce) .. result);
return (keccak_f800(header, seed, result) <= target);
}
```
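The reduction at the end of `progpow_search` can be isolated as a small host-side sketch. The helper name `reduce_mix` and the values `PROGPOW_LANES = 32` and `PROGPOW_REGS = 16` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t PROGPOW_LANES = 32; // illustrative assumption
constexpr uint32_t PROGPOW_REGS = 16;  // illustrative assumption
constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * 0x1000193;
}

// Hypothetical helper: fold LANES*REGS words of mix state into 128 bits
void reduce_mix(const uint32_t mix[PROGPOW_LANES][PROGPOW_REGS], uint32_t result[4])
{
    static_assert(PROGPOW_LANES % 4 == 0,
                  "each result word should absorb the same number of lanes");

    // Step 1: reduce each lane's registers to a single word
    uint32_t lane_hash[PROGPOW_LANES];
    for (uint32_t l = 0; l < PROGPOW_LANES; l++)
    {
        lane_hash[l] = FNV_OFFSET_BASIS;
        for (uint32_t i = 0; i < PROGPOW_REGS; i++)
            fnv1a(lane_hash[l], mix[l][i]);
    }

    // Step 2: fold the lane hashes into four words; lane l feeds word l % 4
    for (uint32_t i = 0; i < 4; i++)
        result[i] = FNV_OFFSET_BASIS;
    for (uint32_t l = 0; l < PROGPOW_LANES; l++)
        fnv1a(result[l % 4], lane_hash[l]);
}
```

Each FNV1a step is invertible in both of its arguments (XOR followed by multiplication by an odd constant), so changing a single register in lane `l` always changes `result[l % 4]` and leaves the other three words untouched.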
The inner loop uses FNV and KISS99 to generate a random sequence from the `prog_seed`. This random sequence determines which mix state is accessed and what random math is performed. Since the `prog_seed` changes relatively infrequently it is expected that `progPowLoop` will be compiled while mining instead of interpreted on the fly.
```cpp
kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS])
{
kiss99_t prog_rnd;
uint32_t fnv_hash = 0x811c9dc5;
prog_rnd.z = fnv1a(fnv_hash, prog_seed);
prog_rnd.w = fnv1a(fnv_hash, prog_seed >> 32);
prog_rnd.jsr = fnv1a(fnv_hash, prog_seed);
prog_rnd.jcong = fnv1a(fnv_hash, prog_seed >> 32);
// Create a random sequence of mix destinations for merge()
// guaranteeing every location is touched once
// Uses Fisher-Yates shuffle
for (int i = 0; i < PROGPOW_REGS; i++)
mix_seq[i] = i;
for (int i = PROGPOW_REGS - 1; i > 0; i--)
{
int j = kiss99(prog_rnd) % (i + 1);
swap(mix_seq[i], mix_seq[j]);
}
return prog_rnd;
}
```
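The claim that every mix location is touched once can be checked with a small sketch. `PROGPOW_REGS = 16` is an illustrative assumption, and `is_permutation_of_regs` is a hypothetical helper added for the check.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

constexpr int PROGPOW_REGS = 16; // illustrative assumption

uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * 0x1000193;
}

typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

uint32_t kiss99(kiss99_t &st)
{
    uint32_t znew = (st.z = 36969 * (st.z & 65535) + (st.z >> 16));
    uint32_t wnew = (st.w = 18000 * (st.w & 65535) + (st.w >> 16));
    uint32_t MWC = ((znew << 16) + wnew);
    uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5));
    uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567);
    return ((MWC ^ CONG) + SHR3);
}

kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS])
{
    kiss99_t prog_rnd;
    uint32_t fnv_hash = 0x811c9dc5;
    prog_rnd.z = fnv1a(fnv_hash, (uint32_t)prog_seed);
    prog_rnd.w = fnv1a(fnv_hash, (uint32_t)(prog_seed >> 32));
    prog_rnd.jsr = fnv1a(fnv_hash, (uint32_t)prog_seed);
    prog_rnd.jcong = fnv1a(fnv_hash, (uint32_t)(prog_seed >> 32));
    // Fisher-Yates: start from the identity sequence and swap each position
    // with a randomly chosen position at or before it
    for (int i = 0; i < PROGPOW_REGS; i++)
        mix_seq[i] = i;
    for (int i = PROGPOW_REGS - 1; i > 0; i--)
    {
        int j = kiss99(prog_rnd) % (i + 1);
        std::swap(mix_seq[i], mix_seq[j]);
    }
    return prog_rnd;
}

// Hypothetical check: true when mix_seq holds every index in
// [0, PROGPOW_REGS) exactly once
bool is_permutation_of_regs(const int mix_seq[PROGPOW_REGS])
{
    bool seen[PROGPOW_REGS] = {};
    for (int i = 0; i < PROGPOW_REGS; i++)
    {
        int v = mix_seq[i];
        if (v < 0 || v >= PROGPOW_REGS || seen[v])
            return false;
        seen[v] = true;
    }
    return true;
}
```

Since the shuffle only swaps elements of the identity sequence, the result is a permutation for any `prog_seed`; the seed only decides which permutation.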
The math operations that merge values into the mix data are ones chosen to maintain entropy.
```cpp
// Merge new data from b into the value in a
// Assuming A has high entropy only do ops that retain entropy
// even if B is low entropy
// (IE don't do A&B)
void merge(uint32_t &a, uint32_t b, uint32_t r)
{
switch (r % 4)
{
case 0: a = (a * 33) + b; break;
case 1: a = (a ^ b) * 33; break;
case 2: a = ROTL32(a, ((r >> 16) % 32)) ^ b; break;
case 3: a = ROTR32(a, ((r >> 16) % 32)) ^ b; break;
}
}
```
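To run `merge` on a host CPU, the rotate macros need definitions. The sketch below supplies assumed `ROTL32` and `ROTR32` helpers (the source does not show how they are defined) and masks the shift so that a rotate by 0 stays well defined in C++.

```cpp
#include <cassert>
#include <cstdint>

// Assumed host-side rotate helpers; masking keeps every shift in [0, 31]
static inline uint32_t ROTL32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

static inline uint32_t ROTR32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

// Merge new data from b into the value in a.
// The low bits of r select the operation; bits 16 and up give the rotate amount.
void merge(uint32_t &a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: a = (a * 33) + b; break;
    case 1: a = (a ^ b) * 33; break;
    case 2: a = ROTL32(a, ((r >> 16) % 32)) ^ b; break;
    case 3: a = ROTR32(a, ((r >> 16) % 32)) ^ b; break;
    }
}
```

For fixed `b` and `r`, each of the four cases is an invertible function of `a`: multiplying by the odd constant 33, adding, XORing, and rotating are all reversible modulo 2^32. That is one way to read the "retain entropy" comment, whereas `a & b` can collapse many values of `a` onto the same output.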
The math operations chosen for the random math are ones that are easy to implement in CUDA and OpenCL, the two main programming languages for commodity GPUs.
```cpp
// Random math between two input values
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
switch (r % 11)
{
case 0: return a + b;
case 1: return a * b;
case 2: return mul_hi(a, b);
case 3: return min(a, b);
case 4: return ROTL32(a, b);
case 5: return ROTR32(a, b);
case 6: return a & b;
case 7: return a | b;
case 8: return a ^ b;
case 9: return clz(a) + clz(b);
case 10: return popcount(a) + popcount(b);
}
}
```
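`mul_hi`, `clz`, and `popcount` are GPU built-ins, so a host-side version of `math` needs stand-ins. The portable definitions below are assumptions for illustration: `clz(0)` is taken to be 32, and rotate amounts are reduced modulo 32.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed host-side stand-ins for the GPU built-ins used by math()
static inline uint32_t ROTL32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

static inline uint32_t ROTR32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

// Upper 32 bits of the 64-bit product
static inline uint32_t mul_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

// Count leading zero bits; returns 32 for x == 0
static inline uint32_t clz(uint32_t x)
{
    uint32_t n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}

// Count set bits by clearing the lowest set bit until none remain
static inline uint32_t popcount(uint32_t x)
{
    uint32_t n = 0;
    while (x != 0)
    {
        x &= x - 1;
        n++;
    }
    return n;
}

// Random math between two input values
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0: return a + b;
    case 1: return a * b;
    case 2: return mul_hi(a, b);
    case 3: return std::min(a, b);
    case 4: return ROTL32(a, b);
    case 5: return ROTR32(a, b);
    case 6: return a & b;
    case 7: return a | b;
    case 8: return a ^ b;
    case 9: return clz(a) + clz(b);
    case 10: return popcount(a) + popcount(b);
    }
    return 0; // unreachable: r % 11 is always in [0, 10]
}
```

Unlike `merge`, several of these operations discard information (`min`, `&`, `clz`), which fits their role: the result of `math` is only ever merged into a mix register, never written over it directly.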
The main loop:
```cpp
// Helper to get the next value in the per-program random sequence
#define rnd() (kiss99(prog_rnd))
// Helper to pick a random mix location
#define mix_src() (rnd() % PROGPOW_REGS)
// Helper to access the sequence of mix destinations
#define mix_dst() (mix_seq[(mix_seq_cnt++)%PROGPOW_REGS])
void progPowLoop(
const uint64_t prog_seed,
const uint32_t loop,
uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
const uint64_t *g_dag,
const uint32_t *c_dag)
{
// All lanes share a base address for the global load
// Global offset uses mix[0] to guarantee it depends on the load result
uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % DAG_SIZE;
// Lanes can execute in parallel and will be convergent
for (int l = 0; l < PROGPOW_LANES; l++)
{
// global load to sequential locations
uint64_t data64 = g_dag[offset_g + l];
// initialize the seed and mix destination sequence
int mix_seq[PROGPOW_REGS];
int mix_seq_cnt = 0;
kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq);
uint32_t offset, data32;
int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
for (int i = 0; i < max_i; i++)
{
if (i < PROGPOW_CNT_CACHE)
{
// Cached memory access
// lanes access random location
offset = mix[l][mix_src()] % PROGPOW_CACHE_WORDS;
data32 = c_dag[offset];
merge(mix[l][mix_dst()], data32, rnd());
}
if (i < PROGPOW_CNT_MATH)
{
// Random Math
data32 = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
merge(mix[l][mix_dst()], data32, rnd());
}
}
// Consume the global load data at the very end of the loop
// Allows full latency hiding
merge(mix[l][0], data64, rnd());
merge(mix[l][mix_dst()], data64>>32, rnd());
}
}
```
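The order of operations inside one lane's pass through `progPowLoop` can be listed with a short sketch. The function name `loop_schedule` and the operation labels are hypothetical; the counts passed in stand for `PROGPOW_CNT_CACHE` and `PROGPOW_CNT_MATH`, whose actual values are not given in this section.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: list the merges one lane performs in one loop pass,
// following the same control flow as progPowLoop
std::vector<std::string> loop_schedule(int cnt_cache, int cnt_math)
{
    assert(cnt_cache >= 0 && cnt_math >= 0);
    std::vector<std::string> ops;
    int max_i = std::max(cnt_cache, cnt_math);
    for (int i = 0; i < max_i; i++)
    {
        if (i < cnt_cache)
            ops.push_back("cache"); // cached load merged into a mix destination
        if (i < cnt_math)
            ops.push_back("math");  // random math merged into a mix destination
    }
    // The two halves of the 64-bit global load are merged last
    ops.push_back("global_lo");
    ops.push_back("global_hi");
    return ops;
}
```

With counts of 2 and 3 the schedule is cache, math, cache, math, math, global_lo, global_hi: cache accesses and math alternate until the smaller count runs out, and the global data is consumed only at the end, which gives the global load the whole inner loop to complete.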
## Rationale
ProgPoW utilizes almost all parts of a commodity GPU, excluding:
@ -79,7 +304,7 @@ ProgPoW utilizes almost all parts of a commodity GPU, excluding:
Making use of either of these would have significant portability issues between commodity hardware vendors, and across programming languages.
Since the GPU is almost fully utilized, theres little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficency, compared to the 2x gains possible in Ethash, and 50x gains possible for CryptoNight.
Since the GPU is almost fully utilized, there's little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficiency, compared to the 2x gains possible in Ethash, and 50x gains possible for CryptoNight.
## Backwards Compatibility
@ -91,16 +316,16 @@ This PoW algorithm was tested against six different models from two different ma
As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.
-----------------------
| Model | Hashrate |
| --------- | -------- |
| RX580 | 9.4 |
| Vega56 | 16.6 |
| Vega64   | 18.7 |
| GTX1070Ti | 13.1 |
| GTX1080   | 14.9 |
| GTX1080Ti | 21.8 |
------------------------
-------------------------------
| Model | Hashrate (MH/s) |
| --------- | --------------- |
| RX580 | 9.4 |
| Vega56 | 16.6 |
| Vega64   | 18.7 |
| GTX1070Ti | 13.1 |
| GTX1080   | 14.9 |
| GTX1080Ti | 21.8 |
-------------------------------
## Implementation