Update EIP-1057 to match current ProgPoW spec

2025-03-01 07:00:36 +00:00 · 2018-11-15 23:54:04 -08:00 · 2018-11-15 23:54:04 -08:00 · bf4566e3e2
commit bf4566e3e2
parent 7c2253ee9a
1 changed file with 122 additions and 90 deletions
--- a/EIPS/eip-1057.md
+++ b/EIPS/eip-1057.md
@@ -1,7 +1,7 @@
 ---
 eip: 1057
 title: ProgPoW, a Programmatic Proof-of-Work
-author: Radix Pi <radix.pi.314@gmail.com>, Ifdef Else <ifdefelse@protonmail.com>   
+author: IfDefElse <ifdefelse@protonmail.com>
 discussions-to: https://ethereum-magicians.org/t/eip-progpow-a-programmatic-proof-of-work/272
 status: Draft
 type: Standards Track
@@ -59,16 +59,20 @@ In contrast to Ethash, the changes detailed below make ProgPoW dependent on the
 *The DRAM read from the DAG is the same as Ethash’s, but with the size increased to `256 bytes`. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.*
-The DAG file is generated according to traditional Ethash specifications, with an additional `PROGPOW_SIZE_CACHE` bytes generated that will be cached in the L1.
+The DAG file is generated according to traditional Ethash specifications.
 ProgPoW can be tuned using the following parameters.  The proposed settings have been tuned for a range of existing, commodity GPUs:
-* `PROGPOW_LANES:` The number of parallel lanes that coordinate to calculate a single hash instance; default is `32.`
+* `PROGPOW_PERIOD`: Number of blocks before changing the random program; default is `50`.
-* `PROGPOW_REGS:` The register file usage size; default is `16.` 
+* `PROGPOW_LANES`: The number of parallel lanes that coordinate to calculate a single hash instance; default is `16`.
-* `PROGPOW_CACHE_BYTES:` The size of the cache; default is `16 x 1024.`
+* `PROGPOW_REGS`: The register file usage size; default is `32`.
-* `PROGPOW_CNT_MEM:` The number of frame buffer accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
+* `PROGPOW_DAG_LOADS`: Number of uint32 loads from the DAG per lane; default is `4`;
-* `PROGPOW_CNT_CACHE:` The number of cache accesses per loop; default is `8.`
+* `PROGPOW_CACHE_BYTES`: The size of the cache; default is `16 x 1024`.
-* `PROGPOW_CNT_MATH:` The number of math operations per loop; default is `8.`
+* `PROGPOW_CNT_DAG`: The number of DAG accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
+* `PROGPOW_CNT_CACHE`: The number of cache accesses per loop; default is `12`.
+* `PROGPOW_CNT_MATH`: The number of math operations per loop; default is `20`.
 The random program changes every `PROGPOW_PERIOD` blocks (default `50`, roughly 12.5 minutes) to ensure the hardware executing the algorithm is fully programmable.  If the program only changed every DAG epoch (roughly 5 days) certain miners could have time to develop hand-optimized versions of the random sequence, giving them an undue advantage.
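As a rough worked example of the period arithmetic: the 12.5 minute figure follows if one assumes an average block time of about 15 seconds. That block time is an assumption inferred from the numbers in the text, not something the text states. The helper names `prog_seed_for_block` and `period_seconds` are hypothetical; the division mirrors the `block_number/PROGPOW_PERIOD` comment on `prog_seed` in `progpow_search`.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t PROGPOW_PERIOD = 50;        // blocks per random program (default from the list)
constexpr uint64_t ASSUMED_BLOCK_SECONDS = 15; // assumption: average block time, not given in the text

// Hypothetical helper: the program seed is the block number divided by the period.
constexpr uint64_t prog_seed_for_block(uint64_t block_number)
{
    return block_number / PROGPOW_PERIOD;
}

// Hypothetical helper: wall-clock length of one period under the assumed block time.
constexpr uint64_t period_seconds()
{
    return PROGPOW_PERIOD * ASSUMED_BLOCK_SECONDS;
}

// 50 blocks * 15 s = 750 s = 12.5 minutes
static_assert(period_seconds() == 750, "period length under the assumed block time");

// Runtime spot check: blocks 0..49 share one program, block 50 starts the next.
inline void check_period_boundaries()
{
    assert(prog_seed_for_block(0) == prog_seed_for_block(49));
    assert(prog_seed_for_block(50) == prog_seed_for_block(49) + 1);
}
```

Under the same assumption, a DAG epoch of roughly 5 days spans several hundred program periods, which is the gap the paragraph is pointing at.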
 ProgPoW uses **FNV1a** for merging data. The existing Ethash uses FNV1 for merging, but FNV1a provides better distribution properties.
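The difference between the two variants is only the order of the XOR and the multiply. The sketch below is illustrative rather than the normative definition: the by-reference signature of `fnv1a` is assumed from how it is called elsewhere in this document, `fnv1` is a hypothetical name for the Ethash-style step, and both operate on whole 32-bit words rather than bytes.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5; // initial value used throughout the spec
constexpr uint32_t FNV_PRIME = 0x01000193;        // standard 32-bit FNV prime

// FNV1 step: multiply first, then XOR in the data (the order Ethash uses).
inline uint32_t fnv1(uint32_t h, uint32_t d)
{
    return (h * FNV_PRIME) ^ d;
}

// FNV1a step: XOR in the data first, then multiply. Updates h in place
// and also returns the new value.
inline uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = (h ^ d) * FNV_PRIME;
}
```

Starting from the offset basis and merging the single word `0x61`, the FNV1a step yields `0xe40c292c` and the FNV1 step yields `0x050c5d7e`, which agree with the published 32-bit FNV values for the one-byte input "a".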
@@ -90,12 +94,14 @@ typedef struct {
 // http://www.cse.yorku.ca/~oz/marsaglia-rng.html
 uint32_t kiss99(kiss99_t &st)
 {
-    uint32_t znew = (st.z = 36969 * (st.z & 65535) + (st.z >> 16));
+    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
-    uint32_t wnew = (st.w = 18000 * (st.w & 65535) + (st.w >> 16));
+    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
-    uint32_t MWC = ((znew << 16) + wnew);
+    uint32_t MWC = ((st.z << 16) + st.w);
-    uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5));
+    st.jsr ^= (st.jsr << 17);
-    uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567);
+    st.jsr ^= (st.jsr >> 13);
-    return ((MWC^CONG) + SHR3);
+    st.jsr ^= (st.jsr << 5);
+    st.jcong = 69069 * st.jcong + 1234567;
+    return ((MWC^st.jcong) + st.jsr);
 }
 ```
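One place this generator is used later in the spec is a Fisher-Yates shuffle of register indices. The following self-contained sketch repeats the new form of `kiss99` and drives a shuffle with it. The `kiss99_t` field layout is assumed from the fields the function touches, and `shuffle_indices` and `is_index_permutation` are hypothetical helpers that are not part of the spec.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Assumed layout: the four state words kiss99 reads and writes.
typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t MWC = ((st.z << 16) + st.w);
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return ((MWC^st.jcong) + st.jsr);
}

constexpr int N = 32; // same as the PROGPOW_REGS default

// Hypothetical helper: fill seq with a permutation of 0..N-1 (Fisher-Yates),
// drawing one value from the generator per swap.
void shuffle_indices(kiss99_t &st, int seq[N])
{
    for (int i = 0; i < N; i++)
        seq[i] = i;
    for (int i = N - 1; i > 0; i--)
    {
        int j = kiss99(st) % (i + 1);
        std::swap(seq[i], seq[j]);
    }
}

// Hypothetical helper: true if seq holds every index 0..N-1 exactly once.
bool is_index_permutation(const int seq[N])
{
    bool seen[N] = {};
    for (int i = 0; i < N; i++)
    {
        if (seq[i] < 0 || seq[i] >= N || seen[seq[i]])
            return false;
        seen[seq[i]] = true;
    }
    return true;
}
```

Because the generator is fully determined by its four state words, two miners that start from the same state obtain the same permutation, which is what lets every node derive an identical random program.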
@@ -121,59 +127,89 @@ void fill_mix(
 }
 ```
-The main search algorithm uses the Keccak sponge function (a width of 800 bits, with a bitrate of 448, and a capacity of 352) to generate a seed, expands the seed, does a sequence of loads and random math on the mix data, and then compresses the result into a final Keccak permutation (with the same parameters as the first) for target comparison.
+Like ethash Keccak is used to seed the sequence per-nonce and to produce the final result.  The keccak-f800 variant is used as the 32-bit word size matches the native word size of modern GPUs.  The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding.  The result of keccak is treated as a 256-bit big-endian number - that is result byte 0 is the MSB of the value.
 ```cpp
 hash32_t keccak_f800_progpow(hash32_t header, uint64_t seed, hash32_t digest)
 {
    uint32_t st[25];
    for (int i = 0; i < 25; i++)
        st[i] = 0;
    for (int i = 0; i < 8; i++)
        st[i] = header.uint32s[i];
    st[8] = seed;
    st[9] = seed >> 32;
    for (int i = 0; i < 8; i++)
        st[10+i] = digest.uint32s[i];
    for (int r = 0; r < 22; r++)
        keccak_f800_round(st, r);
    hash32_t ret;
    for (int i=0; i<8; i++)
        ret.uint32s[i] = st[i];
    return ret;
 }
 ```
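The sponge parameters quoted above can be cross-checked against the word counts in `keccak_f800_progpow`. This is only a consistency sketch: the constant names are hypothetical, and the round-count formula `12 + 2*l` (with `2^l` equal to the word size in bits) is the standard Keccak-f definition rather than something stated in this document.

```cpp
#include <cassert>
#include <cstdint>

constexpr int WORD_BITS    = 32; // keccak-f800 works on 32-bit words
constexpr int STATE_WORDS  = 25; // uint32_t st[25]
constexpr int HEADER_WORDS = 8;  // st[0..7]
constexpr int SEED_WORDS   = 2;  // st[8..9], the 64-bit seed
constexpr int DIGEST_WORDS = 8;  // st[10..17]
constexpr int OUTPUT_WORDS = 8;  // ret.uint32s[0..7]
constexpr int LOG2_WORD    = 5;  // 2^5 = 32

constexpr int width_bits()    { return STATE_WORDS * WORD_BITS; }
constexpr int bitrate_bits()  { return (HEADER_WORDS + SEED_WORDS + DIGEST_WORDS) * WORD_BITS; }
constexpr int capacity_bits() { return width_bits() - bitrate_bits(); }
constexpr int output_bits()   { return OUTPUT_WORDS * WORD_BITS; }
constexpr int round_count()   { return 12 + 2 * LOG2_WORD; }

static_assert(width_bits() == 800, "25 words of 32 bits");
static_assert(bitrate_bits() == 576, "18 absorbed words of 32 bits");
static_assert(capacity_bits() == 224, "remaining 7 words stay zero on input");
static_assert(output_bits() == 256, "8 output words");
static_assert(round_count() == 22, "matches the loop bound r < 22");
```

The absorbed input is exactly 18 words, so a single permutation call consumes the whole message, which is consistent with the "no padding" remark.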
 The flow of the overall algorithm is:
 * A keccak hash of the header + nonce to create a seed
 * Use the seed to generate initial mix data
 * Loop multiple times, each time hashing random loads and random math into the mix data
 * Hash all the mix data into a single 256-bit value
 * A final keccak hash that is compared against the target
 ```cpp
 bool progpow_search(
-    const uint64_t prog_seed,
+    const uint64_t prog_seed, // value is (block_number/PROGPOW_PERIOD)
    const uint64_t nonce,
    const hash32_t header,
-    const uint64_t target,
+    const hash32_t target, // miner can use a uint64_t target, doesn't need the full 256 bit target
-    const uint64_t *g_dag, // gigabyte DAG located in framebuffer
+    const uint32_t *dag // gigabyte DAG located in framebuffer - the first portion gets cached
-    const uint64_t *c_dag  // kilobyte DAG located in l1 cache
 )
 {
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
-    uint32_t result[4];
+    hash32_t digest;
-    for (int i = 0; i < 4; i++)
+    for (int i = 0; i < 8; i++)
-        result[i] = 0;
+        digest.uint32s[i] = 0;
    // keccak(header..nonce)
-    uint64_t seed = keccak_f800(header, nonce, result);
+    hash32_t seed_256 = keccak_f800_progpow(header, nonce, digest);
+    // endian swap so byte 0 of the hash is the MSB of the value
+    uint64_t seed = bswap(seed_256[0]) << 32 | bswap(seed_256[1]);
    // initialize mix for all lanes
    for (int l = 0; l < PROGPOW_LANES; l++)
-        fill_mix(seed, l, mix);
+        fill_mix(seed, l, mix[l]);
    // execute the randomly generated inner loop
-    for (int i = 0; i < PROGPOW_CNT_MEM; i++)
+    for (int i = 0; i < PROGPOW_CNT_DAG; i++)
-        progPowLoop(prog_seed, i, mix, g_dag, c_dag);
+        progPowLoop(prog_seed, i, mix, dag);
-    // Reduce mix data to a single per-lane result
+    // Reduce mix data to a per-lane 32-bit digest
-    uint32_t lane_hash[PROGPOW_LANES];
+    uint32_t digest_lane[PROGPOW_LANES];
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
-        lane_hash[l] = 0x811c9dc5
+        digest_lane[l] = 0x811c9dc5
        for (int i = 0; i < PROGPOW_REGS; i++)
-            fnv1a(lane_hash[l], mix[l][i]);
+            fnv1a(digest_lane[l], mix[l][i]);
    }
-    // Reduce all lanes to a single 128-bit result
+    // Reduce all lanes to a single 256-bit digest
-    for (int i = 0; i < 4; i++)
+    for (int i = 0; i < 8; i++)
-        result[i] = 0x811c9dc5;
+        digest.uint32s[i] = 0x811c9dc5;
    for (int l = 0; l < PROGPOW_LANES; l++)
-        fnv1a(result[l%4], lane_hash[l])
+        fnv1a(digest.uint32s[l%8], digest_lane[l])
-    // keccak(header .. keccak(header..nonce) .. result);
+    // keccak(header .. keccak(header..nonce) .. digest);
-    return (keccak_f800(header, seed, result) <= target);
+    return (keccak_f800_progpow(header, seed, digest) <= target);
 }
 ```
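The final `<= target` comparison in `progpow_search` treats both values as 256-bit big-endian numbers, with byte 0 as the most significant byte. A minimal sketch of that comparison on a plain byte view is shown here. `bytes32`, `less_or_equal_be`, and `top64_be` are hypothetical names, and the sketch sidesteps the word-level `bswap` by working on bytes directly, on the assumption that `hash32_t` is a little-endian word view of the same 32 bytes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for hash32_t viewed as raw bytes.
using bytes32 = std::array<uint8_t, 32>;

// True if a <= b when both are read as 256-bit big-endian integers.
// Lexicographic byte order is the same as big-endian numeric order.
inline bool less_or_equal_be(const bytes32 &a, const bytes32 &b)
{
    return std::memcmp(a.data(), b.data(), a.size()) <= 0;
}

// The most significant 64 bits of the value, i.e. bytes 0..7 with byte 0 on top.
// This is the quantity a miner could compare against a 64-bit target.
inline uint64_t top64_be(const bytes32 &a)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | a[i];
    return v;
}
```

A value whose only nonzero byte is byte 0 is therefore larger than one whose only nonzero byte is byte 31, regardless of the byte values themselves.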
 The inner loop uses FNV and KISS99 to generate a random sequence from the `prog_seed`.  This random sequence determines which mix state is accessed and what random math is performed. Since the `prog_seed` changes relatively infrequently it is expected that `progPowLoop` will be compiled while mining instead of interpreted on the fly.
 ```cpp
-kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS])
+kiss99_t progPowInit(uint64_t prog_seed, int mix_seq_dst[PROGPOW_REGS], int mix_seq_cache[PROGPOW_REGS])
 {
    kiss99_t prog_rnd;
    uint32_t fnv_hash = 0x811c9dc5;
@@ -181,15 +217,22 @@ kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS])
    prog_rnd.w = fnv1a(fnv_hash, prog_seed >> 32);
    prog_rnd.jsr = fnv1a(fnv_hash, prog_seed);
    prog_rnd.jcong = fnv1a(fnv_hash, prog_seed >> 32);
-    // Create a random sequence of mix destinations for merge()
+    // Create a random sequence of mix destinations for merge() and mix sources for cache reads
-    // guaranteeing every location is touched once
+    // guarantees every destination merged once
-    // Uses Fisher–Yates shuffle
+    // guarantees no duplicate cache reads, which could be optimized away
+    // Uses Fisher-Yates shuffle
    for (int i = 0; i < PROGPOW_REGS; i++)
-        mix_seq[i] = i;
+    {
+        mix_seq_dst[i] = i;
+        mix_seq_cache[i] = i;
+    }
    for (int i = PROGPOW_REGS - 1; i > 0; i--)
    {
-        int j = kiss99(prog_rnd) % (i + 1);
+        int j;
-        swap(mix_seq[i], mix_seq[j]);
+        j = kiss99(prog_rnd) % (i + 1);
+        swap(mix_seq_dst[i], mix_seq_dst[j]);
+        j = kiss99(prog_rnd) % (i + 1);
+        swap(mix_seq_cache[i], mix_seq_cache[j]);
    }
    return prog_rnd;
 }
@@ -241,60 +284,66 @@ The main loop:
 ```cpp
 // Helper to get the next value in the per-program random sequence
-#define rnd()    (kiss99(prog_rnd))
+#define rnd()       (kiss99(prog_rnd))
 // Helper to pick a random mix location
-#define mix_src() (rnd() % PROGPOW_REGS)
+#define mix_src()   (rnd() % PROGPOW_REGS)
 // Helper to access the sequence of mix destinations
-#define mix_dst() (mix_seq[(mix_seq_cnt++)%PROGPOW_REGS])
+#define mix_dst()   (mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS])
+// Helper to access the sequence of cache sources
+#define mix_cache() (mix_seq_cache[(mix_seq_cache_cnt++)%PROGPOW_REGS])
 void progPowLoop(
    const uint64_t prog_seed,
    const uint32_t loop,
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
-    const uint64_t *g_dag,
+    const uint32_t *dag)
-    const uint32_t *c_dag)
 {
    // All lanes share a base address for the global load
    // Global offset uses mix[0] to guarantee it depends on the load result
-    uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % DAG_SIZE;
+    uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % (DAG_BYTES / (PROGPOW_LANES*PROGPOW_DAG_LOADS*sizeof(uint32_t)));
    // Lanes can execute in parallel and will be convergent
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
-        // global load to sequential locations
+        // global load to the 256 byte DAG entry
-        uint64_t data64 = g_dag[offset_g + l];
+        // every lane can access every part of the entry
+        uint32_t data_g[PROGPOW_DAG_LOADS];
+        uint32_t offset_l = offset_g * PROGPOW_LANES + (l ^ loop) % PROGPOW_LANES;
+        for (int i = 0; i < PROGPOW_DAG_LOADS; i++)
+            data_g[i] = dag[offset_l * PROGPOW_DAG_LOADS + i];
        // initialize the seed and mix destination sequence
-        int mix_seq[PROGPOW_REGS];
+        int mix_seq_dst[PROGPOW_REGS];
-        int mix_seq_cnt = 0;
+        int mix_seq_cache[PROGPOW_REGS];
-        kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq);
+        int mix_seq_dst_cnt = 0;
+        int mix_seq_cache_cnt = 0;
+        kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq_dst, mix_seq_cache);
        uint32_t offset, data32;
        int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
        for (int i = 0; i < max_i; i++)
        {
            if (i < PROGPOW_CNT_CACHE)
            {
                // Cached memory access
-                // lanes access random location
+                // lanes access random 32-bit locations within the first portion of the DAG
-                offset = mix[l][mix_src()] % PROGPOW_CACHE_WORDS;
+                uint32_t offset = mix[l][mix_cache()] % (PROGPOW_CACHE_BYTES/sizeof(uint32_t));
-                data32 = c_dag[offset];
+                uint32_t data = dag[offset];
-                merge(mix[l][mix_dst()], data32, rnd());
+                merge(mix[l][mix_dst()], data, rnd());
            }
            if (i < PROGPOW_CNT_MATH)
            {
                // Random Math
-                data32 = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
+                uint32_t data = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
-                merge(mix[l][mix_dst()], data32, rnd());
+                merge(mix[l][mix_dst()], data, rnd());
            }
        }
-        // Consume the global load data at the very end of the loop
+        // Consume the global load data at the very end of the loop to allow full latency hiding
-        // Allows full latency hiding
+        // Always merge into mix[0] to feed the offset calculation
-        merge(mix[l][0], data64, rnd());
+        merge(mix[l][0], data_g[0], rnd());
-        merge(mix[l][mix_dst()], data64>>32, rnd());
+        for (int i = 1; i < PROGPOW_DAG_LOADS; i++)
+            merge(mix[l][mix_dst()], data_g[i], rnd());
    }
 }
 ```
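The index arithmetic in `progPowLoop` can be checked in isolation. The sketch below uses the default parameter values and two hypothetical helpers, `lane_word_index` and `lanes_cover_entry`, to show that one DAG entry is 256 bytes and that, for any loop iteration, the lanes together read each 16-byte slot of that entry exactly once. It models only the addressing, not the loads or merges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t PROGPOW_LANES = 16;
constexpr uint32_t PROGPOW_DAG_LOADS = 4;
constexpr uint32_t PROGPOW_CACHE_BYTES = 16 * 1024;

// One DAG entry is shared by all lanes: 16 lanes * 4 loads * 4 bytes.
constexpr size_t DAG_ENTRY_BYTES = PROGPOW_LANES * PROGPOW_DAG_LOADS * sizeof(uint32_t);
static_assert(DAG_ENTRY_BYTES == 256, "matches the 256 byte DAG read");

// The cached region is addressed in 32-bit words.
static_assert(PROGPOW_CACHE_BYTES / sizeof(uint32_t) == 4096, "cache words");

// Hypothetical helper: index (in uint32 words) of the first word lane l reads,
// mirroring offset_l * PROGPOW_DAG_LOADS in progPowLoop.
constexpr uint32_t lane_word_index(uint32_t offset_g, uint32_t l, uint32_t loop)
{
    return (offset_g * PROGPOW_LANES + (l ^ loop) % PROGPOW_LANES) * PROGPOW_DAG_LOADS;
}

// Hypothetical helper: true if the lanes hit every slot of entry offset_g exactly once.
bool lanes_cover_entry(uint32_t offset_g, uint32_t loop)
{
    bool seen[PROGPOW_LANES] = {};
    for (uint32_t l = 0; l < PROGPOW_LANES; l++)
    {
        uint32_t slot = lane_word_index(offset_g, l, loop) / PROGPOW_DAG_LOADS
                        - offset_g * PROGPOW_LANES;
        if (slot >= PROGPOW_LANES || seen[slot])
            return false;
        seen[slot] = true;
    }
    return true;
}
```

The XOR with `loop` permutes which lane reads which slot from one iteration to the next, so over the outer loop every lane sees different parts of the entries, as the "every lane can access every part of the entry" comment indicates.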
 ## Rationale
 ProgPoW utilizes almost all parts of a commodity GPU, excluding:
@@ -308,24 +357,7 @@ Since the GPU is almost fully utilized, there’s little opportunity  for specia
 ## Backwards Compatibility
-This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve as the time spent in the core is now balanced with time spent in memory.
+This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve since twice as much memory is loaded per hash.
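The "twice as much memory" claim can be made concrete with a small calculation. The Ethash figures used here (64 accesses of 128 bytes) come from the Ethash specification, not from this document, and `bandwidth_bound_hashrate` is a hypothetical model that assumes a miner limited purely by memory bandwidth.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t ETHASH_ACCESSES = 64;       // from the Ethash specification
constexpr uint64_t ETHASH_ACCESS_BYTES = 128;  // from the Ethash specification
constexpr uint64_t PROGPOW_CNT_DAG = 64;       // default in this document
constexpr uint64_t PROGPOW_ACCESS_BYTES = 256; // DAG read size in this document

constexpr uint64_t ethash_bytes_per_hash()  { return ETHASH_ACCESSES * ETHASH_ACCESS_BYTES; }
constexpr uint64_t progpow_bytes_per_hash() { return PROGPOW_CNT_DAG * PROGPOW_ACCESS_BYTES; }

static_assert(ethash_bytes_per_hash() == 8192, "8 KiB of DAG traffic per Ethash hash");
static_assert(progpow_bytes_per_hash() == 16384, "16 KiB of DAG traffic per ProgPoW hash");

// Hypothetical model: hashes per second when memory bandwidth is the only limit.
constexpr uint64_t bandwidth_bound_hashrate(uint64_t bytes_per_second, uint64_t bytes_per_hash)
{
    return bytes_per_second / bytes_per_hash;
}
```

For an illustrative card with 256 GB/s of memory bandwidth, this model gives about 31.25 MH/s for Ethash and about 15.6 MH/s for ProgPoW, which is the halving described above.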
 ## Test Cases
 This PoW algorithm was tested against six different models from two different manufacturers. Selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing.
 As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.
 -------------------------------
 | Model     | Hashrate (MH/s) |
 | --------- | --------------- |
 | RX580     |      9.4        |
 | Vega56    |      16.6       |
 | Vega64    |      18.7       |
 | GTX1070Ti |      13.1       |
 | GTX1080   |      14.9       |
 | GTX1080Ti |      21.8       |
 -------------------------------
 ## Implementation