Update EIP-1057 to match current ProgPoW spec

2025-03-01 07:00:36 +00:00 · 2018-11-15 23:54:04 -08:00 · 2018-11-15 23:54:04 -08:00 · bf4566e3e2
commit bf4566e3e2
parent 7c2253ee9a
1 changed file with 122 additions and 90 deletions
--- a/EIPS/eip-1057.md
+++ b/EIPS/eip-1057.md
@@ -1,7 +1,7 @@
 ---
 eip: 1057
 title: ProgPoW, a Programmatic Proof-of-Work
-author: Radix Pi <radix.pi.314@gmail.com>, Ifdef Else <ifdefelse@protonmail.com>   
+author: IfDefElse <ifdefelse@protonmail.com>
 discussions-to: https://ethereum-magicians.org/t/eip-progpow-a-programmatic-proof-of-work/272
 status: Draft
 type: Standards Track
@@ -15,11 +15,11 @@ The following is a proposal for an alternate proof-of-work algorithm - **“Prog

 ## Abstract

-The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.  
+The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.

 For Ethereum - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block, and undermine the distributed security.

-ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?” 
+ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”

 EIP<NaN> presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization.  This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.

@@ -29,7 +29,7 @@ Until Ethereum transitions to a pure proof-of-stake model, proof-of-work will co

 Ethash allows for the creation of an ASIC that is roughly twice as efficient as a commodity GPU.  Ethash’s memory accesses are paired with a very small amount of fixed compute.  Most of a GPU’s capacity and complexity sits idle, wasting power, while waiting for DRAM accesses. A specialized ASIC can implement a much smaller (and cheaper) compute engine that burns much less power.

-As miner rewards are reduced with Casper FFG, it will remain profitable to mine on a specialized ASIC long after GPUs have exited the network. This will make it easier for an entity that has access to private ASICs to stage a 51% attack on the Ethereum network. 
+As miner rewards are reduced with Casper FFG, it will remain profitable to mine on a specialized ASIC long after GPUs have exited the network. This will make it easier for an entity that has access to private ASICs to stage a 51% attack on the Ethereum network.

 ## Specification

@@ -57,18 +57,22 @@ In contrast to Ethash, the changes detailed below make ProgPoW dependent on the

 **Increases the DRAM read from 128 bytes to 256 bytes.**

-*The DRAM read from the DAG is the same as Ethash’s, but with the size increased to `256 bytes`. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.* 
+*The DRAM read from the DAG is the same as Ethash’s, but with the size increased to `256 bytes`. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.*

-The DAG file is generated according to traditional Ethash specifications, with an additional `PROGPOW_SIZE_CACHE` bytes generated that will be cached in the L1.
+The DAG file is generated according to traditional Ethash specifications.

 ProgPoW can be tuned using the following parameters.  The proposed settings have been tuned for a range of existing, commodity GPUs:

-* `PROGPOW_LANES:` The number of parallel lanes that coordinate to calculate a single hash instance; default is `32.`
-* `PROGPOW_REGS:` The register file usage size; default is `16.` 
-* `PROGPOW_CACHE_BYTES:` The size of the cache; default is `16 x 1024.`
-* `PROGPOW_CNT_MEM:` The number of frame buffer accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
-* `PROGPOW_CNT_CACHE:` The number of cache accesses per loop; default is `8.`
-* `PROGPOW_CNT_MATH:` The number of math operations per loop; default is `8.`
+* `PROGPOW_PERIOD`: Number of blocks before changing the random program; default is `50`.
+* `PROGPOW_LANES`: The number of parallel lanes that coordinate to calculate a single hash instance; default is `16`.
+* `PROGPOW_REGS`: The register file usage size; default is `32`.
+* `PROGPOW_DAG_LOADS`: Number of uint32 loads from the DAG per lane; default is `4`.
+* `PROGPOW_CACHE_BYTES`: The size of the cache; default is `16 x 1024`.
+* `PROGPOW_CNT_DAG`: The number of DAG accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
+* `PROGPOW_CNT_CACHE`: The number of cache accesses per loop; default is `12`.
+* `PROGPOW_CNT_MATH`: The number of math operations per loop; default is `20`.
+
+The random program changes every `PROGPOW_PERIOD` blocks (default `50`, roughly 12.5 minutes) to ensure the hardware executing the algorithm is fully programmable.  If the program only changed every DAG epoch (roughly 5 days) certain miners could have time to develop hand-optimized versions of the random sequence, giving them an undue advantage.
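
The period arithmetic can be sketched as follows. This is a minimal illustration, not part of the specification: the 15-second block time is an assumption inferred from "50 blocks, roughly 12.5 minutes", the 30000-block epoch length is the standard Ethash value, and the helper names are hypothetical. The `block_number / PROGPOW_PERIOD` rule mirrors the comment on `prog_seed` in `progpow_search`.

```cpp
#include <cstdint>

constexpr uint64_t PROGPOW_PERIOD = 50;         // blocks per random program
constexpr uint64_t ETHASH_EPOCH_LENGTH = 30000; // blocks per DAG epoch (Ethash)

// Assumption: ~15 s average block time, implied by "50 blocks ~ 12.5 minutes".
constexpr double ASSUMED_BLOCK_SECONDS = 15.0;

// Hypothetical helper: the program seed is the block number divided by the period,
// so all blocks within one period share the same random program.
constexpr uint64_t prog_seed_for_block(uint64_t block_number)
{
    return block_number / PROGPOW_PERIOD;
}

// Hypothetical helper: wall-clock length of one program period, in minutes.
constexpr double period_minutes()
{
    return PROGPOW_PERIOD * ASSUMED_BLOCK_SECONDS / 60.0;
}

// Hypothetical helper: wall-clock length of one DAG epoch, in days.
constexpr double epoch_days()
{
    return ETHASH_EPOCH_LENGTH * ASSUMED_BLOCK_SECONDS / 86400.0;
}

// Number of distinct random programs generated during one DAG epoch.
constexpr uint64_t programs_per_epoch()
{
    return ETHASH_EPOCH_LENGTH / PROGPOW_PERIOD;
}
```

Under these assumptions a miner sees 600 different programs per DAG epoch, which is why hand-optimizing any single program is not worthwhile.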

 ProgPoW uses **FNV1a** for merging data. The existing Ethash uses FNV1 for merging, but FNV1a provides better distribution properties.
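
As a rough illustration of the difference between the two merge functions, the sketch below uses the standard 32-bit FNV prime and offset basis. The by-value helper `fnv1a_value` and the `fnv1` comparison function are named here for illustration only; the in-place `fnv1a(h, d)` form matches how the function is called elsewhere in this document.

```cpp
#include <cstdint>

// Standard 32-bit FNV parameters.
constexpr uint32_t FNV_PRIME = 0x01000193;
constexpr uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

// FNV1a step (by value): XOR the data in first, then multiply by the prime.
constexpr uint32_t fnv1a_value(uint32_t h, uint32_t d)
{
    return (h ^ d) * FNV_PRIME;
}

// FNV1a step (in place): updates h and also returns the new value.
constexpr uint32_t fnv1a(uint32_t &h, uint32_t d)
{
    return h = fnv1a_value(h, d);
}

// FNV1 step, shown for comparison: multiply first, then XOR the data in.
// The last word merged is never multiplied, so it only affects the low bits
// it overlaps; FNV1a always multiplies after the XOR and spreads it further.
constexpr uint32_t fnv1(uint32_t h, uint32_t d)
{
    return (h * FNV_PRIME) ^ d;
}
```

With `h` set to the offset basis, merging `d = 1` gives `0x050c5d1e` under FNV1 (only the lowest bit differs from merging `0`), while FNV1a gives `0x040c5b8c`.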

@@ -90,12 +94,14 @@ typedef struct {
 // http://www.cse.yorku.ca/~oz/marsaglia-rng.html
 uint32_t kiss99(kiss99_t &st)
 {
-    uint32_t znew = (st.z = 36969 * (st.z & 65535) + (st.z >> 16));
-    uint32_t wnew = (st.w = 18000 * (st.w & 65535) + (st.w >> 16));
-    uint32_t MWC = ((znew << 16) + wnew);
-    uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5));
-    uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567);
-    return ((MWC^CONG) + SHR3);
+    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
+    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
+    uint32_t MWC = ((st.z << 16) + st.w);
+    st.jsr ^= (st.jsr << 17);
+    st.jsr ^= (st.jsr >> 13);
+    st.jsr ^= (st.jsr << 5);
+    st.jcong = 69069 * st.jcong + 1234567;
+    return ((MWC^st.jcong) + st.jsr);
 }
 ```
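
For checking an implementation, a self-contained sketch of the rewritten generator follows. The struct layout is assumed from the field names used above, the starting state is the default seed set from Marsaglia's original post (chosen here only for illustration), and `kiss99_nth` is a hypothetical helper.

```cpp
#include <cstdint>

// Assumed layout, based on the fields kiss99() touches.
struct kiss99_t
{
    uint32_t z, w, jsr, jcong;
};

// Same update order as the rewritten function above:
// two multiply-with-carry halves, a 3-shift register, and a congruential step.
constexpr uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t MWC = ((st.z << 16) + st.w);
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return ((MWC ^ st.jcong) + st.jsr);
}

// Hypothetical helper: the n-th output (1-based) from a given starting state.
// Takes the state by value so the caller's copy is left untouched.
constexpr uint32_t kiss99_nth(kiss99_t st, int n)
{
    uint32_t r = 0;
    for (int i = 0; i < n; i++)
        r = kiss99(st);
    return r;
}
```

Starting from `z=362436069, w=521288629, jsr=123456789, jcong=380116160`, the first output is `769445856`. Every field is updated before it is read, so the refactored statements compute the same value as the original comma-expression form.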

@@ -121,59 +127,89 @@ void fill_mix(
 }
 ```

-The main search algorithm uses the Keccak sponge function (a width of 800 bits, with a bitrate of 448, and a capacity of 352) to generate a seed, expands the seed, does a sequence of loads and random math on the mix data, and then compresses the result into a final Keccak permutation (with the same parameters as the first) for target comparison.
+Like Ethash, Keccak is used to seed the sequence per nonce and to produce the final result.  The keccak-f800 variant is used because its 32-bit word size matches the native word size of modern GPUs.  The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding.  The result of Keccak is treated as a 256-bit big-endian number; that is, result byte 0 is the MSB of the value.

 ```cpp
+hash32_t keccak_f800_progpow(hash32_t header, uint64_t seed, hash32_t digest)
+{
+    uint32_t st[25];

+    for (int i = 0; i < 25; i++)
+        st[i] = 0;
+    for (int i = 0; i < 8; i++)
+        st[i] = header.uint32s[i];
+    st[8] = seed;
+    st[9] = seed >> 32;
+    for (int i = 0; i < 8; i++)
+        st[10+i] = digest.uint32s[i];
+
+    for (int r = 0; r < 22; r++)
+        keccak_f800_round(st, r);
+
+    hash32_t ret;
+    for (int i=0; i<8; i++)
+        ret.uint32s[i] = st[i];
+    return ret;
+}
+```
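
The big-endian interpretation affects both the seed extraction and the final target comparison. The sketch below is one way to express it, assuming `hash32_t` is eight `uint32_t` words whose bytes are stored in little-endian order (as on typical GPU and x86 hosts); `hash_le` and `top64` are hypothetical helper names, not part of the reference code.

```cpp
#include <cstdint>

// Assumed layout: 32 bytes viewed as eight 32-bit words.
struct hash32_t
{
    uint32_t uint32s[8];
};

// Reverse the byte order of a 32-bit word.
constexpr uint32_t bswap(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

// Hypothetical helper: a <= b when both are read as 256-bit big-endian numbers.
// Word 0 holds bytes 0..3, so it is the most significant word once byte-swapped.
constexpr bool hash_le(const hash32_t &a, const hash32_t &b)
{
    for (int i = 0; i < 8; i++)
    {
        uint32_t wa = bswap(a.uint32s[i]);
        uint32_t wb = bswap(b.uint32s[i]);
        if (wa != wb)
            return wa < wb;
    }
    return true; // equal
}

// Hypothetical helper: the most significant 64 bits of the big-endian value,
// which is what a miner needs when comparing against a 64-bit target.
constexpr uint64_t top64(const hash32_t &h)
{
    return ((uint64_t)bswap(h.uint32s[0]) << 32) | bswap(h.uint32s[1]);
}
```

For example, a hash whose first four bytes are `01 02 03 04` and next four are `05 06 07 08` has `top64` equal to `0x0102030405060708`.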
+
+The flow of the overall algorithm is:
+* A keccak hash of the header + nonce to create a seed
+* Use the seed to generate initial mix data
+* Loop multiple times, each time hashing random loads and random math into the mix data
+* Hash all the mix data into a single 256-bit value
+* A final keccak hash that is compared against the target
+
+```cpp
 bool progpow_search(
-    const uint64_t prog_seed,
+    const uint64_t prog_seed, // value is (block_number/PROGPOW_PERIOD)
    const uint64_t nonce,
    const hash32_t header,
-    const uint64_t target,
-    const uint64_t *g_dag, // gigabyte DAG located in framebuffer
-    const uint64_t *c_dag  // kilobyte DAG located in l1 cache
+    const hash32_t target, // miner can use a uint64_t target, doesn't need the full 256 bit target
+    const uint32_t *dag // gigabyte DAG located in framebuffer - the first portion gets cached
 )
 {
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
-    uint32_t result[4];
-    for (int i = 0; i < 4; i++)
-        result[i] = 0;
+    hash32_t digest;
+    for (int i = 0; i < 8; i++)
+        digest.uint32s[i] = 0;

    // keccak(header..nonce)
-    uint64_t seed = keccak_f800(header, nonce, result);
+    hash32_t seed_256 = keccak_f800_progpow(header, nonce, digest);
+    // endian swap so byte 0 of the hash is the MSB of the value
+    uint64_t seed = ((uint64_t)bswap(seed_256.uint32s[0]) << 32) | bswap(seed_256.uint32s[1]);

    // initialize mix for all lanes
    for (int l = 0; l < PROGPOW_LANES; l++)
-        fill_mix(seed, l, mix);
+        fill_mix(seed, l, mix[l]);

    // execute the randomly generated inner loop
-    for (int i = 0; i < PROGPOW_CNT_MEM; i++)
-        progPowLoop(prog_seed, i, mix, g_dag, c_dag);
+    for (int i = 0; i < PROGPOW_CNT_DAG; i++)
+        progPowLoop(prog_seed, i, mix, dag);

-    // Reduce mix data to a single per-lane result
-    uint32_t lane_hash[PROGPOW_LANES];
+    // Reduce mix data to a per-lane 32-bit digest
+    uint32_t digest_lane[PROGPOW_LANES];
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
-        lane_hash[l] = 0x811c9dc5
+        digest_lane[l] = 0x811c9dc5;
        for (int i = 0; i < PROGPOW_REGS; i++)
-            fnv1a(lane_hash[l], mix[l][i]);
+            fnv1a(digest_lane[l], mix[l][i]);
    }
-    // Reduce all lanes to a single 128-bit result
-    for (int i = 0; i < 4; i++)
-        result[i] = 0x811c9dc5;
+    // Reduce all lanes to a single 256-bit digest
+    for (int i = 0; i < 8; i++)
+        digest.uint32s[i] = 0x811c9dc5;
    for (int l = 0; l < PROGPOW_LANES; l++)
-        fnv1a(result[l%4], lane_hash[l])
+        fnv1a(digest.uint32s[l%8], digest_lane[l]);

-    // keccak(header .. keccak(header..nonce) .. result);
-    return (keccak_f800(header, seed, result) <= target);
+    // keccak(header .. keccak(header..nonce) .. digest);
+    return (keccak_f800_progpow(header, seed, digest) <= target);
 }
 ```

 The inner loop uses FNV and KISS99 to generate a random sequence from the `prog_seed`.  This random sequence determines which mix state is accessed and what random math is performed. Since the `prog_seed` changes relatively infrequently it is expected that `progPowLoop` will be compiled while mining instead of interpreted on the fly.

 ```cpp
-
-kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS])
+kiss99_t progPowInit(uint64_t prog_seed, int mix_seq_dst[PROGPOW_REGS], int mix_seq_cache[PROGPOW_REGS])
 {
    kiss99_t prog_rnd;
    uint32_t fnv_hash = 0x811c9dc5;
@@ -181,15 +217,22 @@ kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS])
    prog_rnd.w = fnv1a(fnv_hash, prog_seed >> 32);
    prog_rnd.jsr = fnv1a(fnv_hash, prog_seed);
    prog_rnd.jcong = fnv1a(fnv_hash, prog_seed >> 32);
-    // Create a random sequence of mix destinations for merge()
-    // guaranteeing every location is touched once
-    // Uses Fisher–Yates shuffle
+    // Create a random sequence of mix destinations for merge() and mix sources for cache reads
+    // guarantees every destination merged once
+    // guarantees no duplicate cache reads, which could be optimized away
+    // Uses Fisher-Yates shuffle
    for (int i = 0; i < PROGPOW_REGS; i++)
-        mix_seq[i] = i;
+    {
+        mix_seq_dst[i] = i;
+        mix_seq_cache[i] = i;
+    }
    for (int i = PROGPOW_REGS - 1; i > 0; i--)
    {
-        int j = kiss99(prog_rnd) % (i + 1);
-        swap(mix_seq[i], mix_seq[j]);
+        int j;
+        j = kiss99(prog_rnd) % (i + 1);
+        swap(mix_seq_dst[i], mix_seq_dst[j]);
+        j = kiss99(prog_rnd) % (i + 1);
+        swap(mix_seq_cache[i], mix_seq_cache[j]);
    }
    return prog_rnd;
 }
@@ -241,60 +284,66 @@ The main loop:

 ```cpp
 // Helper to get the next value in the per-program random sequence
-#define rnd()    (kiss99(prog_rnd))
+#define rnd()       (kiss99(prog_rnd))
 // Helper to pick a random mix location
-#define mix_src() (rnd() % PROGPOW_REGS)
+#define mix_src()   (rnd() % PROGPOW_REGS)
 // Helper to access the sequence of mix destinations
-#define mix_dst() (mix_seq[(mix_seq_cnt++)%PROGPOW_REGS])
+#define mix_dst()   (mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS])
+// Helper to access the sequence of cache sources
+#define mix_cache() (mix_seq_cache[(mix_seq_cache_cnt++)%PROGPOW_REGS])

 void progPowLoop(
    const uint64_t prog_seed,
    const uint32_t loop,
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
-    const uint64_t *g_dag,
-    const uint32_t *c_dag)
+    const uint32_t *dag)
 {
    // All lanes share a base address for the global load
    // Global offset uses mix[0] to guarantee it depends on the load result
-    uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % DAG_SIZE;
+    uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % (DAG_BYTES / (PROGPOW_LANES*PROGPOW_DAG_LOADS*sizeof(uint32_t)));
    // Lanes can execute in parallel and will be convergent
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
-        // global load to sequential locations
-        uint64_t data64 = g_dag[offset_g + l];
+        // global load to the 256 byte DAG entry
+        // every lane can access every part of the entry
+        uint32_t data_g[PROGPOW_DAG_LOADS];
+        uint32_t offset_l = offset_g * PROGPOW_LANES + (l ^ loop) % PROGPOW_LANES;
+        for (int i = 0; i < PROGPOW_DAG_LOADS; i++)
+            data_g[i] = dag[offset_l * PROGPOW_DAG_LOADS + i];

        // initialize the seed and mix destination sequence
-        int mix_seq[PROGPOW_REGS];
-        int mix_seq_cnt = 0;
-        kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq);
+        int mix_seq_dst[PROGPOW_REGS];
+        int mix_seq_cache[PROGPOW_REGS];
+        int mix_seq_dst_cnt = 0;
+        int mix_seq_cache_cnt = 0;
+        kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq_dst, mix_seq_cache);

-        uint32_t offset, data32;
        int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
        for (int i = 0; i < max_i; i++)
        {
            if (i < PROGPOW_CNT_CACHE)
            {
                // Cached memory access
-                // lanes access random location
-                offset = mix[l][mix_src()] % PROGPOW_CACHE_WORDS;
-                data32 = c_dag[offset];
-                merge(mix[l][mix_dst()], data32, rnd());
+                // lanes access random 32-bit locations within the first portion of the DAG
+                uint32_t offset = mix[l][mix_cache()] % (PROGPOW_CACHE_BYTES/sizeof(uint32_t));
+                uint32_t data = dag[offset];
+                merge(mix[l][mix_dst()], data, rnd());
            }
            if (i < PROGPOW_CNT_MATH)
            {
                // Random Math
-                data32 = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
-                merge(mix[l][mix_dst()], data32, rnd());
+                uint32_t data = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
+                merge(mix[l][mix_dst()], data, rnd());
            }
        }
-        // Consume the global load data at the very end of the loop
-        // Allows full latency hiding
-        merge(mix[l][0], data64, rnd());
-        merge(mix[l][mix_dst()], data64>>32, rnd());
+        // Consume the global load data at the very end of the loop to allow full latency hiding
+        // Always merge into mix[0] to feed the offset calculation
+        merge(mix[l][0], data_g[0], rnd());
+        for (int i = 1; i < PROGPOW_DAG_LOADS; i++)
+            merge(mix[l][mix_dst()], data_g[i], rnd());
    }
 }
 ```
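
The DAG addressing in the new loop can be checked with a little arithmetic. The sketch below uses the default parameter values listed earlier; `lane_slot`, `dag_word_index`, and `slots_covered` are hypothetical helpers that restate the `offset_l` expression from the loop so that its properties can be tested.

```cpp
#include <cstdint>

constexpr uint32_t PROGPOW_LANES = 16;
constexpr uint32_t PROGPOW_DAG_LOADS = 4;

// One DAG entry is what all lanes load together in one loop iteration:
// 16 lanes * 4 loads * 4 bytes = 256 bytes.
constexpr uint32_t DAG_ENTRY_BYTES = PROGPOW_LANES * PROGPOW_DAG_LOADS * sizeof(uint32_t);

// Which slice of the entry lane l reads on a given loop iteration.
// XOR with the loop counter rotates the assignment so that, over time,
// every lane touches every part of the entry.
constexpr uint32_t lane_slot(uint32_t l, uint32_t loop)
{
    return (l ^ loop) % PROGPOW_LANES;
}

// Index (in 32-bit words) into the DAG for the i-th load of lane l.
constexpr uint32_t dag_word_index(uint32_t offset_g, uint32_t l, uint32_t loop, uint32_t i)
{
    return (offset_g * PROGPOW_LANES + lane_slot(l, loop)) * PROGPOW_DAG_LOADS + i;
}

// Bitmask of the slots read by all lanes in one iteration.
// A full mask means the lanes together read the whole entry exactly once.
constexpr uint32_t slots_covered(uint32_t loop)
{
    uint32_t mask = 0;
    for (uint32_t l = 0; l < PROGPOW_LANES; l++)
        mask |= 1u << lane_slot(l, loop);
    return mask;
}
```

Because XOR with a fixed value is a permutation of the lane indices, `slots_covered` is `0xFFFF` for every `loop` value, and consecutive values of `offset_g` are 64 words (256 bytes) apart.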
-
 ## Rationale

 ProgPoW utilizes almost all parts of a commodity GPU, excluding:
@@ -308,28 +357,11 @@ Since the GPU is almost fully utilized, there’s little opportunity  for specia

 ## Backwards Compatibility

-This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve as the time spent in the core is now balanced with time spent in memory.
-
-## Test Cases
-
-This PoW algorithm was tested against six different models from two different manufacturers. Selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing.
-
-As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.
-
-------------------------------
-| Model     | Hashrate (MH/s) |
-| --------- | --------------- |
-| RX580     |      9.4        |
-| Vega56    |      16.6       |
-| Vega64    |      18.7       |
-| GTX1070Ti |      13.1       |
-| GTX1080   |      14.9       |
-| GTX1080Ti |      21.8       |
-------------------------------
+This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve since twice as much memory is loaded per hash.

 ## Implementation

-Please refer to the official code located at [ProgPOW](https://github.com/ifdefelse/ProgPOW) for the full code, implemented in the standard ethminer. 
+Please refer to the official code located at [ProgPOW](https://github.com/ifdefelse/ProgPOW) for the full code, implemented in the standard ethminer.

 ## Copyright