{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "meNf3zqVKBXu"
},
"source": [
"# PoR test analysis\n",
"\n",
"PoR tests are mostly defined over the single server scenario and assuming an idealized large block code, i.e. a code where a dataset is split into a large number of chunks (`K`) and encoded to an even larger number of chunks (`N`). Most theoretic results on the PoR tests are derived with these assumptions.\n",
"\n",
"In our use case, however, we should analyze it under more realistic assumptions, meaning:\n",
" - the erasure code is limited to smaller block sizes (i.e. `N` and `K` are limited and the number of chunks can be higher than these\n",
" - the dataset is spread over mulitiple nodes.\n",
"\n",
"The coding and the distribution of chunks over multiple nodes might be orchestrated in many ways. Factors influencing the true/false positive/negative outcome of the test include:\n",
"- the selection of erasure coding parameters,\n",
"- chunk placement strategy,\n",
"- test query size,\n",
"- the selection of chunks for the tests,\n",
"- etc.\n",
"\n",
"In what follows we analyze the test outcome under various assumptions. 3 types of analysis will be provided (WIP)\n",
"- static analysis under random erasure,\n",
"- dynamic analysis under random erasure,\n",
"- analysis under adversarial erasure.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nziy-LYp5PuA"
},
"source": [
"# Reference Model\n",
"\n",
"Our reference model assumes the upload of a dataset as follows. The dataset is first split to blocks and each block is erasure coded. Assuming a systematic code, data chunks and parity chunks are the outcome of the erasure coding. Each of these chunks are then tagged using PoR tagging.\n",
"\n",
"Assuming a general structure that can accomodate various dataset sizes, block sizes, erasure code limitations and chunk size constraints, we can assume the following structure.\n",
"\n",
"A dataset of size `F` bytes is split into `B` blocks. Each block is individually erasure coded using an `[N,K]` erasure code, using a symbols size of `F/B/K`.\n",
"\n",
"*(Note: more complex multi-layer coding schemes are also possible, but left for later discussion)*\n",
"\n",
"Resulting erasure symbols are then indivually PoR tagged.\n",
"PoR tagging, in it's base form using the BLS12-381 curve, operates on `Z=31` byte symbols, also called the sectors in PoR literature. However, this can be extended to larger sizes in two ways.\n",
"- First, the PoR scheme has its own blocks, governed by the parameter `S`. `S`, the PoR block size, determines how large is the tagging overhead (inversely proportional to `S`, and also the size of the proofs (approx. `S+1` symbols),\n",
"- Second, nothing prevents us from splitting the chunk into multiple (`T`) PoR units and apply tagging on each of these `T` units individually, handling the resulting tag as a \"vector tag\".\n",
"\n",
"Thus, while the erasure coded symbol contains `F/B/K/Z` PoR symbols, the erasure code symbol size and the PoR block size does not have to be the same. To give an example, say `F/B/K = 3100` bytes is our chunk size, which is 100 PoR symbols. Resulting tag and proof sizes are shown below:\n",
"\n",
"| S | T | tag size (48T) bytes | tag overhead | proof size (32*S+48) bytes |\n",
"| --- | --- | --- | --- | --- |\n",
"|100| 1| 48 | 1.5%| 3248 |\n",
"| 10| 10| 480 | 15% | 368 |\n",
"| 5| 20| 960 | 30% | 208 |\n",
"| 1|100| 4800 |150% | 80 |\n",
"- tags are on the G_1 curve, thus take 48 bytes in compressed form per symbol\n",
"- proofs contain a linear combination of data blocks and a tag"
]
},
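{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check of the size accounting, the short sketch below (plain Python, independent of the rest of the notebook) reproduces the table above for the 3100-byte chunk example; the table rounds the overhead percentages. The constants follow the text: 48 bytes per compressed G_1 tag, 32 bytes per serialized proof symbol, and `Z=31` byte PoR symbols."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: tag and proof sizes for a 3100-byte chunk (cf. table above)\n",
"Z = 31 # PoR symbol (sector) size in bytes\n",
"chunk_size = 3100 # F/B/K in the example above\n",
"symbols = chunk_size // Z # 100 PoR symbols per chunk\n",
"\n",
"print(\"  S    T  tag(B)  overhead  proof(B)\")\n",
"for S in (100, 10, 5, 1):\n",
"    T = symbols // S # number of PoR units (the \"vector tag\" length)\n",
"    tag = 48 * T # one compressed G_1 point per unit\n",
"    proof = 32 * S + 48 # S symbols plus one aggregated tag\n",
"    print(f\"{S:3} {T:4} {tag:7} {tag/chunk_size:9.1%} {proof:9}\")"
]
},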
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q8_rGNQWxQY-"
},
"outputs": [],
"source": [
"from scipy.stats import hypergeom, binom\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams['figure.figsize'] = [15, 5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VXLaDHQTl-QO"
},
"source": [
"# Non adversarial model (static)\n",
"\n",
"We assume \n",
"- random chunk losses, and \n",
"- random node outages.\n",
"\n",
"We evaluate \n",
"- whether the file is recoverable\n",
"- the probability of a test passing\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qqq5fj0aIbq4"
},
"source": [
"## Random chunk erasure\n",
"\n",
"Let's first start with independent random chunk losses and random queries.\n",
"We study the probabilities of\n",
"- still having the data recoverable\n",
"- passing or failing various tests\n",
"\n",
"as a function of system parameters, dataset size, and the assumed chunk loss probability. \n",
"\n",
"We define two types of tests:\n",
"- a simple random query, verifying $l$ chunks, passing only if all of these are available;\n",
"- a multiquery, composed of $q$ simple queries, passing if at most *maxfail* of the $q$ queries fail."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YKr9k_yfqKVy"
},
"outputs": [],
"source": [
"def P_recoverable(p_chunk_error, n, k, b):\n",
" ## Calculate_probability of recoverable file\n",
" ## assuming independent chunk failures in any given ECC block\n",
" e = n-k+1 # erased chunks meaning loss\n",
" P_block_pass = binom(n,p_chunk_error).cdf(e-1) #prob. less than e chunks fail in a block\n",
" return P_block_pass ** b\n",
"\n",
"def P_pass_random_query(p_chunk_error, l):\n",
" ## Calculate_probability of a random query failing\n",
" ## assuming independent chunk losses\n",
" P_chunk_pass = 1 - p_chunk_error #prob. less than e chunks fail in a block\n",
" return P_chunk_pass ** l\n",
"\n",
"def P_pass_random_multiquery(p_chunk_error, l, q=1, maxfail=0):\n",
" ## Calculate_probability of at most maxfail of q random queries failing\n",
" ## The test is passed if more than q-maxfail individual queries pass.\n",
" ## assuming independent chunk losses\n",
" P_chunk_pass = 1 - p_chunk_error #prob. less than e chunks fail in a block\n",
" P_query_pass = P_chunk_pass ** l\n",
" P_multiquery_pass = binom(q, 1-P_query_pass).cdf(maxfail)\n",
" return P_multiquery_pass\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tHlX0Ch8Dafz"
},
"source": [
"To verify the above functions we set up a small simulation, and evaluate the same probabilities empirically as well.\n",
"In this case we also allow for an optional additional layer of erasure coding \"horizontally\" on the row, i.e. among chunks on the\n",
"same node."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gRzeSr0BN93m"
},
"outputs": [],
"source": [
"def random_chunks(p_chunk_error):\n",
" #chunks = np.full((N, B), True)\n",
" chunks = np.random.rand(N, B)\n",
" return chunks > p_chunk_error\n",
"\n",
"def recoverable(chunks, ecol, erow=0):\n",
" # ecol: maximum number of recoverable errors in a column\n",
" # erow: maximum number of recoverable errors in a row\n",
" badcol = chunks.shape[0] - np.count_nonzero(chunks, axis=0) # bad chunks per column\n",
" badrow = chunks.shape[1] - np.count_nonzero(chunks, axis=1)\n",
" return not (chunks[badrow > erow][:, badcol > ecol]).any()\n",
"\n",
"def Px_recoverable(p_chunk_error, n, k, mcol=0, tries=1000, random_chunks=random_chunks):\n",
" ## evaluate probability of dataset recoverable empirically\n",
" e = n-k # erased chunks meaning loss\n",
" count = 0 \n",
" for i in range(tries):\n",
" count += recoverable(random_chunks(p_chunk_error), e, mcol)\n",
" return count/tries\n",
"\n",
"def random_query(chunks, l):\n",
" rng = np.random.default_rng()\n",
" return rng.choice(chunks.flatten(), l, axis=0, replace=False).all()\n",
"\n",
"def Px_pass_random_query_on_chunks(chunks, l, tries = 100):\n",
" ## evaluate probability of passing test empirically\n",
" count = 0 \n",
" for i in range(tries):\n",
" count += random_query(chunks,l)\n",
" return count/tries\n",
"\n",
"def Px_pass_random_query(p_chunk_error, l, tries = 1000):\n",
" ## evaluate probability of passing test empirically\n",
" count = 0 \n",
" for i in range(tries):\n",
" count += random_query(random_chunks(p_chunk_error),l)\n",
" return count/tries\n",
"\n",
"def random_multiquery(chunks, l, q=1, maxfail=0):\n",
" count = 0 # number of queries passed\n",
" for i in range(q):\n",
" count += random_query(chunks, l)\n",
" return q-count <= maxfail\n",
"\n",
"def Px_pass_random_multiquery(p_chunk_error, l, q=1, maxfail=0, tries = 1000):\n",
" ## evaluate probability of passing test empirically\n",
" count = 0 \n",
" for i in range(tries):\n",
" count += random_multiquery(random_chunks(p_chunk_error), l, q, maxfail)\n",
" return count/tries\n",
"\n",
"def Stats_random_query_on_chunks(p_chunk_error, n, k, l, tries = 1000):\n",
" ## TP: recoverable and test for rec. passed\n",
" e = n-k # maximum number of errors in a column\n",
" tp, fp, tn, fn = 0, 0, 0, 0 \n",
" for i in range(tries):\n",
" chunks = random_chunks(p_chunk_error)\n",
" rec = recoverable(chunks, e)\n",
" testrec = random_query(chunks,l)\n",
" tp += rec and testrec\n",
" fp += rec and not testrec\n",
" tn += not rec and not testrec\n",
" fn += not rec and testrec\n",
" return (tp/tries, fn/tries, fp/tries, tn/tries)\n",
"\n",
"def Stats_random_multiquery_on_chunks(p_chunk_error, n, k, l, q=1, maxfail=0, tries = 1000):\n",
" ## TP: recoverable and test for rec. passed\n",
" e = n-k # maximum number of errors in a column\n",
" tp, fp, tn, fn = 0, 0, 0, 0 \n",
" for i in range(tries):\n",
" chunks = random_chunks(p_chunk_error)\n",
" rec = recoverable(chunks, e)\n",
" testrec = random_multiquery(chunks, l, q, maxfail)\n",
" tp += rec and testrec\n",
" fn += rec and not testrec\n",
" tn += not rec and not testrec\n",
" fp += not rec and testrec\n",
" return (tp/tries, fn/tries, fp/tries, tn/tries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TuX7JlWn9S66",
"outputId": "e10a98c4-205f-43da-8c70-7738bd74915e"
},
"outputs": [],
"source": [
"#@title Encoding and Query Parameters\n",
"\n",
"#@markdown block code K\n",
"K = 50 #@param {type:\"slider\", min:1, max:100, step:1}\n",
"#@markdown block code N\n",
"N = 100 #@param {type:\"slider\", min:1, max:300, step:1}\n",
"#@markdown query length\n",
"L = 10 #@param {type:\"slider\", min:1, max:100, step:1}\n",
"#@markdown number of blocks (chunks per node)\n",
"B = 100 #@param {type:\"slider\", min:1, max:100, step:1}\n",
"\n",
"F = K*B # chunks \n",
"\n",
"p_chunk_error = .34\n",
"print(\"# Evaluating probabilities both analytical and empirical\")\n",
"print(\"Prob file recoverable:\",\n",
" P_recoverable(p_chunk_error, N, K, B),\n",
" Px_recoverable(p_chunk_error, N, K))\n",
"print(\"Prob random query passed:\",\n",
" P_pass_random_query(p_chunk_error, L),\n",
" Px_pass_random_query(p_chunk_error, L))\n",
"print(\"Stats (TP, FN, FP, TN):\",\n",
" Stats_random_query_on_chunks(p_chunk_error, N, K, L))\n",
"\n",
"p_chunk_error = .14\n",
"Q=10\n",
"MAXFAIL=9\n",
"print(\"Prob random multiquery passed:\",\n",
" P_pass_random_multiquery(p_chunk_error, L, Q, MAXFAIL),\n",
" Px_pass_random_multiquery(p_chunk_error, L, Q, MAXFAIL))\n",
"print(\"Stats (TP, FN, FP, TN):\",\n",
" Stats_random_multiquery_on_chunks(p_chunk_error, N, K, L, Q, MAXFAIL))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o6D-ljF_AD5y"
},
"source": [
"\n",
"Next, we plot these with a given set of system parameters as a funtion of chunk loss probability. This on one hand verifies that the analytic and empirical results match, on the other hand shows what to expect from tests in different system states."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 350
},
"id": "JBXQLupyIJ7G",
"outputId": "4a8a9d0b-2f39-4c44-9b5c-77fa2061359b"
},
"outputs": [],
"source": [
"def plotall(X, xlabel, N, K, B, L, P_chunk_error):\n",
" fig = plt.figure()\n",
" ax = fig.add_subplot(111)\n",
" ax.plot(X, np.vectorize(Px_recoverable)(P_chunk_error, N, K, B), label=\"Emp. Prob file recoverable\")\n",
" ax.plot(X, np.vectorize(P_recoverable)(P_chunk_error, N, K, B), label=\"Prob file recoverable\")\n",
" ax.plot(X, np.vectorize(Px_pass_random_query)(P_chunk_error, L), label=\"Emp. Prob random query passed\")\n",
" ax.plot(X, np.vectorize(P_pass_random_query)(P_chunk_error, L), label=\"Prob random query passed\")\n",
" ax.plot(X, np.vectorize(Px_pass_random_multiquery)(P_chunk_error, L, Q, MAXFAIL), label=\"Emp. Prob random multiquery (9/10) passed\")\n",
" ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, L, Q, MAXFAIL), label=\"Prob random multiquery (9/10) passed\")\n",
" ax.set_xlabel(xlabel)\n",
" ax.set_ylabel('probability')\n",
" ax.legend()\n",
" plt.show()\n",
"\n",
"def printplotall(X, xlabel, N, K, B, L, P_chunk_error):\n",
" def checkprint(v, label):\n",
" if not isinstance(v, np.ndarray):\n",
" print(label, v)\n",
" checkprint(N, \"block code N:\")\n",
" checkprint(K, \"block code K:\")\n",
" checkprint(B, \"chunks per node:\")\n",
" checkprint(L, \"query length:\")\n",
" plotall(X, xlabel, N, K, B, L, P_chunk_error)\n",
"\n",
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 5 # query size in chunks\n",
"F = 5000 # file size in chunks\n",
"P_chunk_error = np.arange(0,1,.01) \n",
"Q=10\n",
"MAXFAIL=9\n",
"\n",
"B = int(F/K) # chunks per node\n",
"\n",
"printplotall(P_chunk_error, \"chunk failure probability\", N, K, B, L, P_chunk_error)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RynKTvu0Bw_L"
},
"source": [
"As expected, empirical and analytical curves overlap.\n",
"\n",
"Assuming random losses, whether the file is recoverable is close to a step function of chunk loss probability, which is a behavior we prefer as it simplifies estimating system state. \n",
"\n",
"A simple random query as a test (at least with these parameters) is clearly not a good indicator of dataset loss.\n",
"\n",
"The multi-query is much more promising, although clearly there are problems of accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mEw6Ou9dDN8K"
},
"source": [
"In the next plot, we look more into the details of the information contained in the outputs of a multi-query. `Q=10` queries (each challenging `L=5` chunks) are performed, and the test is passed if at most `maxfail` of these fails."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 547
},
"id": "A9mhMKZlJ3tH",
"outputId": "194bdf4b-d81f-4da7-cc83-1fdc687189d6"
},
"outputs": [],
"source": [
"X, xlabel = P_chunk_error, \"chunk failure probability\"\n",
"fig = plt.figure()\n",
"ax = fig.add_subplot(111)\n",
"ax.plot(X, np.vectorize(P_recoverable)(P_chunk_error, N, K, B), label=\"Prob file recoverable\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, L, 10, 1), label=\"Prob random multiquery (1/10) passed\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, L, 10, 6), label=\"Prob random multiquery (6/10) passed\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, L, 10, 7), label=\"Prob random multiquery (7/10) passed\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, L, 10, 8), label=\"Prob random multiquery (8/10) passed\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, L, 10, 9), label=\"Prob random multiquery (9/10) passed\")\n",
"ax.set_xlabel(\"chunk failure probability\")\n",
"ax.set_ylabel('probability')\n",
"ax.legend()\n",
"plt.show()\n",
"\n",
"ax.set_yscale('log')\n",
"ax.set_ylim(1e-4, 1)\n",
"fig\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 547
},
"id": "FQ20rapiEzmp",
"outputId": "d2e771ad-42c2-4a20-eecf-41fc72fbca45"
},
"outputs": [],
"source": [
"X, xlabel = P_chunk_error, \"chunk failure probability\"\n",
"fig = plt.figure()\n",
"ax = fig.add_subplot(111)\n",
"ax.plot(X, np.vectorize(P_recoverable)(P_chunk_error, N, K, B), label=\"Prob file recoverable\")\n",
"ax.plot(X, np.vectorize(P_pass_random_query)(P_chunk_error, 50), label=\"Prob random query L=50 passed\")\n",
"#ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, 50, 1, 0), label=\"Prob random multiquery L=50 (1/0) passed\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, 5, 10, 9), label=\"Prob random multiquery L=5 (5/10) passed\")\n",
"ax.plot(X, np.vectorize(P_pass_random_multiquery)(P_chunk_error, 1, 50, 19), label=\"Prob random multiquery L=1 (25/50) passed\")\n",
"ax.set_xlabel(\"chunk failure probability\")\n",
"ax.set_ylabel('probability')\n",
"ax.legend()\n",
"plt.show()\n",
"\n",
"ax.set_yscale('log')\n",
"ax.set_ylim(1e-4, 1)\n",
"fig\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XSUbFZN3EHmD"
},
"source": [
"### Test Accuracy\n",
"\n",
"Time to analyze test outcomes from the perspective of True/False Positive/Negative results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pFzlmGfYGHU-"
},
"outputs": [],
"source": [
"from functools import partial\n",
"\n",
"def plotall(X, xlabel, N, K, B, L, P_chunk_error):\n",
" fig = plt.figure()\n",
" ax = fig.add_subplot(111)\n",
" #stats = np.vectorize(Stats_random_multiquery_on_chunks)(p_chunk_error, N, K, L, Q, MAXFAIL)\n",
" ax.plot(X, np.vectorize(P_recoverable)(P_chunk_error, N, K, B), label=\"Prob file recoverable\")\n",
" stats = np.vectorize(partial(Stats_random_multiquery_on_chunks, n=N, k=K, l=L, q=Q, maxfail=MAXFAIL))(P_chunk_error)\n",
" ax.stackplot(X, stats, labels=['TP', 'FN', 'FP', 'TN'])\n",
" ax.set_xlabel(xlabel)\n",
" ax.set_ylabel('rate')\n",
" ax.legend()\n",
" plt.show()\n",
"\n",
"def printplotall(X, xlabel, N, K, B, L, P_chunk_error):\n",
" def checkprint(v, label):\n",
" if not isinstance(v, np.ndarray):\n",
" print(label, v)\n",
" checkprint(N, \"block code N:\")\n",
" checkprint(K, \"block code K:\")\n",
" checkprint(B, \"chunks per node:\")\n",
" checkprint(L, \"query length:\")\n",
" checkprint(Q, \"multi-query length:\")\n",
" checkprint(MAXFAIL, \"multi-query max fail:\")\n",
" plotall(X, xlabel, N, K, B, L, P_chunk_error)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "cP1zaInKfUzz",
"outputId": "3fc22cd2-c50c-4739-d923-e25a590e4403"
},
"outputs": [],
"source": [
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 5 # query size in chunks\n",
"F = 5000 # file size in chunks\n",
"P_chunk_error = np.arange(0,1,.05) \n",
"Q=10\n",
"MAXFAIL=9\n",
"B = int(F/K) # chunks per node\n",
"\n",
"printplotall(P_chunk_error, \"chunk failure probability\", N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kTcsyKTIOrQb"
},
"source": [
"As a reminder, in the above plot:\n",
"- TP: recoverable and test passed\n",
"- FN: recoverable, but test failed. So we think it is lost, while it is not \n",
"- FP: not recoverable, but test passed. So we don't know it is already lost\n",
"- TN: not recoverable, and test showed it\n"
]
},
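{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small convenience (a sketch; `confusion_summary` is a hypothetical helper, not used by the plots below), the usual confusion-matrix summaries can be derived from these four rates:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def confusion_summary(tp, fn, fp, tn):\n",
"    # \"positive\" means the test passed; the four inputs are rates summing\n",
"    # to 1, so accuracy is simply tp + tn\n",
"    precision = tp / (tp + fp) if tp + fp > 0 else float('nan')\n",
"    recall = tp / (tp + fn) if tp + fn > 0 else float('nan')\n",
"    return precision, recall, tp + tn\n",
"\n",
"confusion_summary(*Stats_random_multiquery_on_chunks(.2, N, K, L, Q, MAXFAIL))"
]
},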
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "PO3dNOO3PuDw",
"outputId": "2dc6bbf8-da6a-4616-ed92-95f55b7c5327"
},
"outputs": [],
"source": [
"MAXFAIL=7\n",
"printplotall(P_chunk_error, \"chunk failure probability\", N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "j30udthI-dfb",
"outputId": "a78be517-49b9-453d-9a31-1d2baf47dcf1"
},
"outputs": [],
"source": [
"F = 500000 # file size in chunks\n",
"B = int(F/K) # chunks per node\n",
"MAXFAIL=9\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aYZkKpBvDXtG",
"outputId": "2a30ccde-ee93-46a1-88ce-59c8b69c328e"
},
"outputs": [],
"source": [
"F = 250 # file size in chunks\n",
"B = int(F/K) # chunks per node\n",
"MAXFAIL=9\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u8caH7ICHEEd"
},
"source": [
"## Random node failures\n",
"\n",
"Let's focus our attention on node erasures and outages. There are two types of node issues:\n",
" - node erasures, which are permanent losses\n",
" - node outages, which are temporary\n",
"\n",
"It is the usualy assumption that a test cannot differentate between the two. Yet, both exist and their effects are different. Outages add noise to the system, but do not provoke a real loss of the data. Durability is not affected by outages. Node erasures, instead, provoke data loss just as individual chunk losses.\n",
"\n",
"Model:\n",
" - p_node_outage\n",
" - p_node_erasure"
]
},
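{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of this two-parameter model (`random_chunks_nodes` is a hypothetical helper; the plots below collapse both effects into a single node failure probability): an erased or out node contributes no chunks to a test, but only erasures destroy data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def random_chunks_nodes(p_node_erasure, p_node_outage):\n",
"    # sample node states; a node answers queries only if it is neither\n",
"    # erased nor currently out\n",
"    erased = np.random.rand(N) < p_node_erasure\n",
"    out = np.random.rand(N) < p_node_outage\n",
"    visible = ~(erased | out)\n",
"    # chunk availability as seen by a test (rows = nodes, columns = blocks);\n",
"    # durability, in contrast, would be judged on ~erased alone\n",
"    return np.outer(visible, np.full(B, True))"
]
},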
{
"cell_type": "markdown",
"metadata": {
"id": "aCPdkhspbU42"
},
"source": [
"Naive approach: just redefine the random_chunks function to remove entire rows"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7bYPbZtYSqkm"
},
"outputs": [],
"source": [
"def random_chunks(p_node_error):\n",
" #chunks = np.full((N, B), True)\n",
" #chunks = np.random.rand(N, B)\n",
" nodes = np.random.rand(N)\n",
" nodes = nodes > p_node_error\n",
" chunks = np.outer(nodes, np.full((B), True))\n",
" return chunks\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "IiVrcWbychiU",
"outputId": "3cf625c8-6cd7-440a-a306-44f3f8adf2b4"
},
"outputs": [],
"source": [
"N=10\n",
"B=6\n",
"random_chunks(.3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "bEvxCW6hc3k9",
"outputId": "e1074596-c40a-4973-b75e-e711db5135df"
},
"outputs": [],
"source": [
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 1 # query size in chunks\n",
"F = 50 # file size in chunks\n",
"P_chunk_error = np.arange(0,1,.05) \n",
"Q=10\n",
"MAXFAIL=9\n",
"B = int(F/K) # chunks per node\n",
"\n",
"printplotall(P_chunk_error, \"chunk failure probability\", N, K, B, L, P_chunk_error)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "u9FvNVbffNsP",
"outputId": "215b5dd1-3ef6-4c3b-e451-ea356d7e0c1e"
},
"outputs": [],
"source": [
"MAXFAIL=7\n",
"printplotall(P_chunk_error, \"chunk failure probability\", N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UgHKKoobp1r4"
},
"outputs": [],
"source": [
"# F = 50 # file size in chunks\n",
"# B = int(F/K) # chunks per node\n",
"# MAXFAIL=9\n",
"# printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gWmSG00Sp3oE"
},
"outputs": [],
"source": [
"# L = 1 # query size in chunks\n",
"# printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "cz6GhAsHsJiD",
"outputId": "40357a56-c604-426a-bd2e-4d64671bdfaf"
},
"outputs": [],
"source": [
"MAXFAIL=6\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k9icyECAsuIu"
},
"outputs": [],
"source": [
"def random_chunks(p_node_error):\n",
" #chunks = np.full((N, B), True)\n",
" #chunks = np.random.rand(N, B)\n",
" nodesfail = np.random.choice(N, round(p_node_error*N),replace=False )\n",
" nodes = np.full((N), True)\n",
" for i in nodesfail:\n",
" nodes[i] = False\n",
" chunks = np.outer(nodes, np.full((B), True))\n",
" return chunks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "24p9fwnE0yqa",
"outputId": "4f70c9d8-a6e1-49c9-f401-9266173baba4"
},
"outputs": [],
"source": [
"N=10\n",
"B=6\n",
"random_chunks(.3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "NPycd_9ouLDB",
"outputId": "82a917c8-8334-464b-f14e-d51af49a3d1c"
},
"outputs": [],
"source": [
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 1 # query size in chunks\n",
"F = 50 # file size in chunks\n",
"P_chunk_error = np.arange(0,1,.05) \n",
"Q=10\n",
"MAXFAIL=6\n",
"B = int(F/K) # chunks per node\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "UY48YsxW8pMi",
"outputId": "4d12472b-4313-44ba-ad3a-9981c68584bd"
},
"outputs": [],
"source": [
"MAXFAIL=5\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "WPNS2rD-83_5",
"outputId": "e93d4dcb-3df0-4864-b2bf-7e860be6aadf"
},
"outputs": [],
"source": [
"Q=20\n",
"MAXFAIL=10\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 386
},
"id": "BGs6mEtYNEFf",
"outputId": "d2322faa-1e7a-4ad8-b3ca-b9268a6fe198"
},
"outputs": [],
"source": [
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 1 # query size in chunks\n",
"F = 50 # file size in chunks\n",
"P_chunk_error = np.arange(0,1,.05) \n",
"Q=10\n",
"MAXFAIL=6\n",
"B = int(F/K) # chunks per node\n",
"printplotall(P_chunk_error, f'chunk failure probability (B={B})', N, K, B, L, P_chunk_error)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_GIAFijBS-02"
},
"source": [
"# Adversarial erasure model\n",
"\n",
"A peculiarity of our trustless systems compared to the application of erasure codes and PoR in a datacenter scenario is that erasures could be adversarial. An adversary (or adversaries) can use their knowledge about the encoding and verification to cherry-pick chunks to minimize the chances of getting discovered while maximizing harm.\n",
"\n",
"Assuming an [N,K] Reed-Solomon erasure coding, the erasure of `N-K+1` chunks from a block makes that block unrecoverable. In the adversarial scenario, each other block can remain intact, i.e. the removal of `N-K-1` selected chunks from the overall `N*B` chunks makes the file unrecoverable."
]
},
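{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the analytic formulas, a small empirical sketch of this adversary (for illustration only; the cells below are purely analytic and do not use it): erase the `N-K+1` chunks of a single block, one chunk on each of `N-K+1` nodes, and reuse `Px_pass_random_query_on_chunks` from above on the resulting chunk map."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def adversarial_chunks():\n",
"    # all chunks intact except block 0, which loses N-K+1 chunks,\n",
"    # one on each of N-K+1 randomly picked nodes\n",
"    chunks = np.full((N, int(B)), True)\n",
"    bad_nodes = np.random.choice(N, N - K + 1, replace=False)\n",
"    chunks[bad_nodes, 0] = False\n",
"    return chunks\n",
"\n",
"# example (after running the parameter cell below):\n",
"# Px_pass_random_query_on_chunks(adversarial_chunks(), L)"
]
},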
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "fj3jOcZtS-03",
"outputId": "6e603b1d-72d7-46d2-96e2-768596c72f31"
},
"outputs": [],
"source": [
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 100 # query size in chunks\n",
"F = 5000 # file size in chunks\n",
"B = F/K # chunks per node\n",
"C = N*B # total chunks\n",
"\n",
"# ADVERSARIAL assumption: n-k+1 chunks erased from one ECC block\n",
"E = N-K+1 # erased chunks\n",
"\n",
"# random query over all n*b chunks WITH replacement (assume l << n*b)\n",
"def P_pass_random(e, c, l):\n",
" return (1 - e/c) ** l \n",
"print(\"random:\", P_pass_random(E, C, L))\n",
"\n",
"# random query over all n*b chunks WITH replacement, node first (assume l << n*b)\n",
"# probability of selecting node without lost chunk: ((n-e)/n)\n",
"# probability of selecting node with lost chunk: (e/n)\n",
"def P_pass_random2(n, e, b, l):\n",
" return ((n-e)/n + e/n * (b-1)/b) ** l\n",
"print(\"random:\", P_pass_random2(N, E ,B ,L))\n",
"\n",
"# random query over all n*b chunks WITHOUT replacement\n",
"def P_pass_random_wor(e, c, l):\n",
" return hypergeom(c, e, l).pmf(0) \n",
"print(\"random w/o replacement:\", P_pass_random_wor(E, C, L))\n",
"\n",
"def P_pass_random_wor2(n, e, b, l):\n",
" c = n*b\n",
" return P_pass_random_wor(e, c, l)\n",
"\n",
"# hit L nodes, query 1 random chunk from each (assume l <= n)\n",
"#sum p(k lines with error selected) * (b-1)/b ** k\n",
"def P_pass_pernode_random(n, e, b, l):\n",
" ret = 0\n",
" for i in np.arange(0, e+1):\n",
" ret += hypergeom(n, e, l).pmf(i) * ((b-1)/b) ** i\n",
" return ret\n",
"print(\"L nodes, 1 random chunk each:\", P_pass_pernode_random(N, E, B, L))\n",
"\n",
"#P_pass_pernode_random = (n-e)/n + e/n * (b-1)/b\n",
"def P_pass_pernode_random(n, e, b, l):\n",
" return ((n-l)/n + l/n*(1 - 1/b)) ** e # for nodes with no error, pass. For the e nodes with single error, pass with 1-1/b, independent per node\n",
"print(\"L nodes, 1 random chunk each (approx):\", P_pass_pernode_random(N, E, B, L))\n",
"\n",
"# hit L nodes, query same chunk (assume l <= n)\n",
"# P(good chunk) + P(bad chunk) * P(least 1 bad node | bad chunk)\n",
"def P_pass_pernode_same(n, e, b, l):\n",
" return np.where(\n",
" l <= n,\n",
" (b-1)/b + 1/b * ((n-l)/n) ** l,\n",
" np.NaN)\n",
"print(\"L nodes, same chunk each:\", P_pass_pernode_same(N, E, B, L))\n",
"\n",
"# select random node, query random chunks from it (assume b >= l)\n",
"def P_pass_samenode_random(n, e, b, l):\n",
" return np.where(\n",
" b >= l,\n",
" (n-e)/n + (e/n) * (1-l/b) ** l,\n",
" np.NaN)\n",
"print(\"random node, L chunks:\",P_pass_samenode_random(N, E, B, L))\n",
"\n",
"# select random node, query random chunks from it (assume b >= l)\n",
"def P_pass_samenode_random_wor(n, e, b, l):\n",
" return np.where(\n",
" b >= l,\n",
" (n-e)/n + (e/n) * hypergeom(b, 1, l).pmf(0),\n",
" np.NaN)\n",
"print(\"random node, L chunks w/o repl:\",P_pass_samenode_random_wor(N, E, B, L))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "785xmXP1S-04",
"outputId": "c02c0e74-e9fc-472c-9f91-91426e1e0fd9"
},
"outputs": [],
"source": [
"def plotall(title, X, xlabel, N, E, B, L):\n",
" fig = plt.figure()\n",
" ax = fig.add_subplot(111)\n",
" #ax.plot(X, 1 - P_pass_random2(N, E, B, L), 'm', label=\"random2\")\n",
" ax.plot(X, 1 - P_pass_random_wor2(N, E, B, L), 'c', label=\"random_wor\")\n",
" ax.plot(X, 1 - P_pass_pernode_random(N, E, B, L), 'k', label=\"pernode_random\")\n",
" ax.plot(X, 1 - P_pass_pernode_same(N, E, B, L), 'g', label=\"pernode_same\")\n",
" #ax.plot(X, 1 - P_pass_samenode_random(N, E, B, L), 'b', label=\"samenode_random\")\n",
" ax.plot(X, 1 - P_pass_samenode_random_wor(N, E, B, L), 'r', label=\"samenode_random_wor\")\n",
" ax.set_xlabel(xlabel)\n",
" ax.set_ylabel('probability of detecting adv. erasure')\n",
" ax.legend()\n",
" fig.suptitle(title)\n",
" plt.show()\n",
"\n",
"def printplotall(title, X, xlabel, N, E, B, L):\n",
" def checkprint(v, label):\n",
" if not isinstance(v, np.ndarray):\n",
" print(label, v)\n",
" checkprint(N, \"block code N:\")\n",
" checkprint(E, \"block code N-K+1:\")\n",
" checkprint(B, \"chunks per node:\")\n",
" checkprint(L, \"query length:\")\n",
" plotall(title, X, xlabel, N, E, B, L)\n",
"\n",
"K = 50 # block code k\n",
"N = 100 # block code n\n",
"L = 40 # query size in chunks\n",
"F = 500000 # file size in chunks\n",
"B = F/K # chunks per node\n",
"# ADVERSARIAL assumption: n-k+1 chunks erased from one ECC block\n",
"E = N-K+1 # erased chunks\n",
"\n",
"L = np.arange(1,200) # query size in chunks\n",
"printplotall(\"Test against adv. erasure, as a function of query size\",\n",
" L, \"query size in chunks\", N, E, B, L)\n",
"L = 10 # query size in chunks\n",
"\n",
"F = np.arange(50, 50000, 100) # file size in chunks\n",
"B = F/K # chunks per node\n",
"printplotall(\"Test against adv. erasure, as a function of number of erasure blocks\",\n",
" B, \"chunks per node\", N, E, B, L)\n",
"F = 5000 # file size in chunks\n",
"B = F/K # chunks per node\n",
"\n",
"N = np.arange(50, 500) # block code n\n",
"E = N-K+1 # erased chunks\n",
"printplotall(\"Test against adv. erasure, as a function of code rate\",\n",
" N, \"block code 'n'\", N, E, B, L)\n",
"N = 10 # block code n\n",
"E = N-K+1 # erased chunks"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "PoR test analysis with multiple storage nodes.ipynb",
"provenance": []
},
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
},
"kernelspec": {
"display_name": "Python 3.9.12 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}