Block discovery simulator and analysis (#175)

* add block discovery simulator

* add analysis document for simpler cases of block discovery
Giuliano Mega 2023-09-28 02:45:58 -03:00 committed by GitHub
parent 7ace179f2f
commit 1d23b31461
20 changed files with 3782 additions and 0 deletions


@ -0,0 +1,4 @@
^renv$
^renv\.lock$
^.*\.Rproj$
^\.Rproj\.user$


@ -0,0 +1 @@
source("renv/activate.R")


@ -0,0 +1,5 @@
.Rproj.user
.RData
.Rhistory
*nb.html
rsconnect


@ -0,0 +1,18 @@
Package: blockdiscoverysim
Title: Block Discovery Simulator
Version: 0.0.0.9000
Description: Simple Simulation for Block Discovery
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Depends:
    shiny (>= 1.7.4.1),
    tidyverse (>= 2.0.0),
    purrr (>= 1.0.1),
    VGAM (>= 1.1-8),
    R6 (>= 2.2.2),
    plotly (>= 4.10.2)
Suggests:
    devtools,
    testthat (>= 3.0.0)
Config/testthat/edition: 3


@ -0,0 +1,19 @@
# We do this hack because rsconnect doesn't seem to like us bundling the app
# as a package.
order <- c(
'R/partition.R',
'R/stats.R',
'R/node.R',
'R/sim.R'
)
library(R6)
library(purrr)
library(tidyverse)
lapply(order, source)
run <- function() {
rmarkdown::run('./block-discovery-sim.Rmd')
}


@ -0,0 +1,14 @@
Node <- R6Class(
'Node',
public = list(
node_id = NULL,
storage = NULL,
initialize = function(node_id, storage) {
self$node_id = node_id
self$storage = storage
},
name = function() paste0('node ', self$node_id)
)
)


@ -0,0 +1,18 @@
#' Generates a random partition of a block array among a set of nodes. The
#' partitioning follows the supplied distribution.
#'
#' @param block_array a vector containing blocks
#' @param network_size the number of nodes in the network
#' @param distribution a sample generator which generates a vector of n
#' samples when called as distribution(n).
#'
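#' @examples
#' # Illustrative usage: split 100 blocks among 5 nodes using uniform weights.
#' partition(1:100, network_size = 5, distribution = runif)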
partition <- function(block_array, network_size, distribution) {
buckets <- distribution(length(block_array))
# We won't attempt to shift the data, instead just checking that it is
# positive.
stopifnot(all(buckets >= 0))
buckets <- trunc(buckets * (network_size - 1) / max(buckets)) + 1
sapply(1:network_size, function(i) which(buckets == i))
}


@ -0,0 +1,30 @@
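# Simulates a download over `swarm`: at each step the downloader samples a random
# neighbor and merges that neighbor's block list into its own, stopping once a
# `coding_rate` fraction of all blocks has been discovered or `max_steps` is
# reached. Returns the per-step statistics as a tibble.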
run_download_simulation <- function(swarm, max_steps, coding_rate) {
total_blocks <- sum(sapply(swarm, function(node) length(node$storage)))
required_blocks <- round(total_blocks * coding_rate)
completed_blocks <- 0
storage <- c()
step <- 1
stats <- Stats$new()
while ((step < max_steps) && (completed_blocks < required_blocks)){
neighbor <- swarm |> select_neighbor()
storage <- neighbor |> download_blocks(storage)
completed_blocks <- length(storage)
stats$add_stat(
step = step,
selected_neighbor = neighbor$node_id,
total_blocks = total_blocks,
required_blocks = required_blocks,
completed_blocks = completed_blocks
)
step <- step + 1
}
stats$as_tibble()
}
select_neighbor <- function(neighborhood) neighborhood[[sample(1:length(neighborhood), size = 1)]]
download_blocks <- function(neighbor, storage) unique(c(neighbor$storage, storage))


@ -0,0 +1,17 @@
Stats <- R6Class(
'Stats',
public = list(
stats = NULL,
initialize = function() {
self$stats = list(list())
},
add_stat = function(...) {
self$stats <- c(self$stats, list(rlang::dots_list(...)))
self
},
as_tibble = function() purrr::map_df(self$stats, as_tibble)
)
)


@ -0,0 +1,37 @@
Simple Block Discovery Simulator
================================
A simple simulator for understanding block discovery dynamics.
## Hosted Version
You can access the block discovery simulator on [shinyapps](https://gmega.shinyapps.io/block-discovery-sim/).
## Running
You will need R 4.1.2 with [renv](https://rstudio.github.io/renv/) installed. I also strongly recommend you run this
from [RStudio](https://posit.co/products/open-source/rstudio/) as you will otherwise need to [install pandoc and set it up manually before running](https://stackoverflow.com/questions/28432607/pandoc-version-1-12-3-or-higher-is-required-and-was-not-found-r-shiny).
Once that's taken care of and you are in the R terminal (the Console, in RStudio), you will first need to install the dependencies:
```R
> renv::install()
```
If you are outside RStudio, then you will need to restart your R session. After that, you should load the package:
```R
devtools::load_all()
```
run the tests:
```R
testthat::test_package('blockdiscoverysim')
```
and, if all goes well, launch the simulator:
```R
run()
```


@ -0,0 +1,202 @@
---
title: "Block Discovery Sim"
output: html_document
runtime: shiny
# rsconnect uses this
resource_files:
- R/node.R
- R/partition.R
- R/sim.R
- R/stats.R
---
## Goal
The goal of this experiment is to understand -- under different assumptions about how blocks are partitioned among nodes -- how long a hypothetical downloader would take to discover enough blocks to make a successful download from storage nodes by randomly sampling the swarm. We therefore do not account for download times or network latency - we just measure how many times the node randomly samples the swarm before figuring out where enough of the blocks are.
```{r echo = FALSE, message = FALSE}
library(shiny)
library(plotly)
source('R/collate.R')
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
```
```{r}
runs <- 10
max_steps <- Inf
```
```{r}
DISTRIBUTIONS <- list(
'uniform' = runif,
'exponential' = rexp,
'pareto' = VGAM::rparetoI
)
```
## Network
* Select the parameters of the network you would like to use in the experiments.
* Preview the shape of the partitions by looking at the chart.
* Generate more random partitions by clicking "Generate Another".
```{r}
fluidPage(
sidebarPanel(
numericInput(
'swarm_size',
label = 'size of the swarm',
value = 20,
min = 1,
max = 10000
),
numericInput(
'file_size',
label = 'number of blocks in the file',
value = 1000,
min = 1,
max = 1e6
),
selectInput(
'partition_distribution',
label = 'shape of the distribution for the partitions',
choices = names(DISTRIBUTIONS)
),
actionButton(
'generate_network',
label = 'Generate Another'
)
),
mainPanel(
plotOutput('network_sample')
)
)
```
```{r}
observe({
input$generate_network
output$network_sample <- renderPlot({
purrr::map_dfr(
generate_network(
number_of_blocks = input$file_size,
network_size = input$swarm_size,
partition_distribution = input$partition_distribution
),
function(node) tibble(node_id = node$node_id, blocks = length(node$storage))
) %>%
ggplot() +
geom_bar(
aes(x = node_id, y = blocks),
stat = 'identity',
col = 'black',
fill = 'lightgray'
) +
labs(x = 'node') +
theme_minimal()
})}
)
```
## Experiment
Select the number of experiment runs. Each experiment will generate a network and then simulate a download operation where a hypothetical node:
1. joins the swarm;
2. samples one neighbor per round in a round-based download protocol and asks for its block list.
The experiment ends when the downloading node recovers "enough" blocks. If we let the total number of blocks in the file be $n$ and the coding rate $r$, then the simulation ends when the set of blocks $D$ discovered by the downloading node satisfies $\left|D\right| \geq n\times r$.
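For example, with $n = 1000$ blocks and a coding rate of $r = 0.5$, the simulation stops once the downloader has discovered $1000 \times 0.5 = 500$ distinct blocks.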
We then show a "discovery curve": a curve that emerges as we look at the percentage of blocks the downloader has discovered so far as a function of the number of contacts it made.
The curve is actually an average of all experiments, meaning that a point $(5, 10\%)$ should be interpreted as: "on average, after $5$ contacts, a downloader will have discovered $10\%$ of the blocks it needs to get a successful download". We show the $5^{th}$ and $95^{th}$ percentiles of the experiments as error bands around the average.
```{r}
fluidPage(
fluidRow(
class='well',
column(
width = 6,
sliderInput('runs', 'How many experiments to run', min = 10, max = 10000, value = 10),
actionButton('do_run', 'Run')
),
column(
width = 6,
numericInput('coding_rate', 'Coding rate (percentage of blocks required for a successful download)',
min = 0.1, max = 1.0, step = 0.05, value = 0.5)
)
)
)
```
```{r}
experiment_results <- reactive({
lapply(1:input$runs, function(i) {
generate_network(
number_of_blocks = input$file_size,
network_size = input$swarm_size,
partition_distribution = input$partition_distribution
) |> run_experiment(run_id = i, coding_rate = input$coding_rate)
})
}) |> bindEvent(
input$do_run,
ignoreNULL = TRUE,
ignoreInit = TRUE
)
```
```{r}
renderPlotly({
plot_results(do.call(rbind, experiment_results()))
})
```
```{r}
generate_network <- function(number_of_blocks, network_size, partition_distribution) {
block_array <- sample(1:number_of_blocks, replace = FALSE)
partitions <- partition(block_array, network_size, DISTRIBUTIONS[[partition_distribution]])
sapply(1:network_size, function(i) Node$new(
node_id = i,
storage = partitions[[i]])
)
}
```
```{r}
run_experiment <- function(network, coding_rate, run_id = 0) {
run_download_simulation(
swarm = network,
coding_rate = coding_rate,
max_steps = max_steps
) |> mutate(
run = run_id
)
}
```
```{r}
plot_results <- function(results) {
stats <- results |>
mutate(completion = pmin(1.0, completed_blocks / required_blocks)) |>
group_by(step) |>
summarise(
average = mean(completion),
p_95 = quantile(completion, 0.95),
p_05 = quantile(completion, 0.05),
.groups = 'drop'
)
plotly::ggplotly(ggplot(stats, aes(x = step)) +
geom_line(aes(y = average), col = 'black', lwd = 1) +
geom_ribbon(aes(ymin = p_05, ymax = p_95), fill = 'grey80', alpha = 0.5) +
labs(x = 'contacts', y = 'blocks discovered (%)') +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal())
}
```


@ -0,0 +1,17 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source

File diff suppressed because it is too large


@ -0,0 +1,7 @@
library/
local/
cellar/
lock/
python/
sandbox/
staging/

File diff suppressed because it is too large


@ -0,0 +1,17 @@
{
"bioconductor.version": null,
"external.libraries": [],
"ignored.packages": [],
"package.dependency.fields": [
"Imports",
"Depends",
"LinkingTo"
],
"r.version": null,
"snapshot.type": "implicit",
"use.cache": true,
"vcs.ignore.cellar": true,
"vcs.ignore.library": true,
"vcs.ignore.local": true,
"vcs.manage.ignores": true
}


@ -0,0 +1,11 @@
# This file is part of the standard setup for testthat.
# It is recommended that you do not modify it.
#
# Where should you do additional test configuration?
# Learn more about the roles of various files in:
# * https://r-pkgs.org/tests.html
# * https://testthat.r-lib.org/reference/test_package.html#special-files
library(testthat)
test_check("blockdiscoverysim")


@ -0,0 +1,18 @@
test_that(
"should partition into linearly scaled buckets", {
samples <- c(1, 100, 500, 800, 850)
partitions <- partition(
block_array = 1:5,
network_size = 4,
distribution = function(n) samples[1:n]
)
expect_equal(partitions, list(
c(1, 2),
c(3),
c(4),
c(5))
)
}
)


@ -0,0 +1,17 @@
test_that(
"should collect stats as they are input", {
stats <- Stats$new()
stats$add_stat(a = 1, b = 2, name = 'hello')
stats$add_stat(a = 1, b = 3, name = 'world')
expect_equal(
stats$as_tibble(),
tribble(
~a, ~b, ~name,
1, 2, 'hello',
1, 3, 'world',
)
)
}
)


@ -0,0 +1,104 @@
---
title: "Block Discovery Problem"
output:
  bookdown::gitbook:
    number_sections: false
---
$$
\newcommand{\rv}[1]{\textbf{#1}}
\newcommand{\imin}{\rv{I}_{\text{min}}}
$$
## Problem Statement
Let $F = \left\{b_1, \cdots, b_m\right\}$ be an erasure-coded file, and let $O = \left\{o_1, \cdots, o_n\right\}$ be a set of nodes storing that file. We define a _storage function_ $s : O \longrightarrow 2^F$ as a function mapping each node in $O$ to the subset of $F$ it stores.
In the simplified block discovery problem, we have a _downloader node_ which is attempting to construct a subset $D \subseteq F$ of blocks by repeatedly sampling nodes from $O$. "Discovery", in this context, can be seen as the downloader node running a round-based protocol where, at round $i$, it samples a random contact $o_i$ and learns about $s(o_i)$.
To make this slightly more formal, we denote by $D_i \subseteq F$ the set of blocks that the downloader has learned after the $i^{th}$ contact. Given the way the protocol works, we have that:
$$
\begin{equation}
D_i = D_{i - 1} \cup s(o_i)
(\#eq:discovery)
\end{equation}
$$
Since the file is erasure coded, the goal of the downloader is to learn some $D_i$ such that:
$$
\begin{equation}
\left|D_i\right| \geq c \times \left|F\right|
(\#eq:complete)
\end{equation}
$$
When $D_i$ satisfies Eq. \@ref(eq:complete), we say that $D_i$ is _complete_. We can then state the problem as follows.
**Statement.** Let $\imin$ be a random variable representing the first round at which $D_i$ is complete. We want to estimate $F(i) = \mathbb{P}(\imin \leq i)$; namely, the probability that the downloader has discovered enough blocks for a complete download by round $i$.
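When no closed form is available, $F(i)$ can always be approximated empirically. Below is a minimal Monte Carlo sketch (not part of the simulator; `estimate_completion_cdf` and its arguments are illustrative names), assuming the storage function is represented as a list of integer vectors where element $k$ holds the blocks stored by node $o_k$:

```r
# Monte Carlo estimate of F(i) = P(I_min <= i) for a fixed storage function.
# `partitions` is a list of integer vectors: partitions[[k]] = s(o_k).
estimate_completion_cdf <- function(partitions, c, max_rounds, n_sims = 1000) {
  m <- length(unique(unlist(partitions)))  # |F|: number of distinct blocks
  target <- ceiling(c * m)
  i_min <- replicate(n_sims, {
    discovered <- integer(0)
    i <- 0
    while (length(discovered) < target && i < max_rounds) {
      i <- i + 1
      contact <- sample(length(partitions), size = 1)         # random contact o_i
      discovered <- union(discovered, partitions[[contact]])  # D_i = D_{i-1} U s(o_i)
    }
    if (length(discovered) >= target) i else NA_integer_
  })
  # Empirical CDF of I_min evaluated at rounds 1..max_rounds (NA = never completed)
  vapply(seq_len(max_rounds), function(i) mean(!is.na(i_min) & i_min <= i), numeric(1))
}
```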
## Case (1) - Erasure Coding but no Replication
If we assume there is no replication then, unless we contact the same node twice, every node we contact contributes new information. Indeed, the absence of replication necessarily implies that:
$$
\begin{equation}
\bigcap_{o \in O} s(o) = \emptyset
(\#eq:disjoint)
\end{equation}
$$
So that if we are contacting a new node at round $i$, we must necessarily have that:
$$
\begin{equation}
\left|D_{i}\right| \stackrel{1}{=} \left|D_{i - 1} \cup s(o_i)\right| \stackrel{2}{=} \left|D_{i - 1}\right| + \left|s(o_i)\right|
(\#eq:monotonic)
\end{equation}
$$
Where (1) follows from Eq. \@ref(eq:discovery), and (2) follows from the $s(o_i)$ being disjoint (Eq. \@ref(eq:disjoint)). This leads to the corollary:
**Corollary 1.** After $\lceil c \times n\rceil$ rounds, the downloader will necessarily have learned enough blocks to download $F$.
which follows trivially from Eq. \@ref(eq:monotonic) and the implication that $D_{\lceil c \times n\rceil}$ must be complete. $\blacksquare$
As for $F(i)$, note that we can estimate the probability of completion by estimating the probability that $|D_i|$ exceeds the completion threshold (Eq. \@ref(eq:complete)). What exactly that looks like, and how tractable it is, however, depends on the formulation we give it.
### Independent Partition Sizes
Suppose we knew the distribution for partition sizes in $O$, i.e., we knew that the number of blocks assigned to a node in $O$ follows some distribution $\mathcal{B}$ (e.g., a truncated Gaussian).
If we have a "large enough" network, this means we can approximate the number of blocks assigned to each node as $n$ independent random variables $\rv{Y}_j$, where $\rv{Y}_j \sim \mathcal{B}$. In that case, we can express the total number of blocks learned by the downloader by round $i$ as a random variable $\rv{L}_i$ representing the sum of $i$ iid random variables $\rv{Y}_j \sim \mathcal{B}$:
$$
\begin{equation}
\rv{L}_i \sim \sum_{j = 1}^{i} \rv{Y}_j
(\#eq:learning-sum)
\end{equation}
$$
The shape of the distribution would be the $i$-fold convolution of $\mathcal{B}$ with itself, which can be tractable for some distributions.
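As a purely illustrative example, suppose $\mathcal{B}$ were Poisson with mean $\lambda$ (an assumption made only for this snippet): the sum of $i$ iid Poisson($\lambda$) variables is Poisson($i\lambda$), so the probability of having discovered at least a given number of blocks after $i$ contacts can be computed directly:

```r
# Illustrative only: assume partition sizes follow a Poisson(lambda) distribution.
lambda <- 50      # hypothetical mean partition size
i <- 7            # number of contacts
required <- 300   # c * |F| in this hypothetical scenario

# Closed form: L_i ~ Poisson(i * lambda), so P(L_i >= required) is
1 - ppois(required - 1, lambda = i * lambda)

# Monte Carlo check of the same probability
mean(replicate(10000, sum(rpois(i, lambda))) >= required)
```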
More interestingly, though, Eq. \@ref(eq:learning-sum) allows us to estimate the average number of rounds a downloader will undergo before completing a download in a way that depends on $\mathcal{B}$ only through its mean. We have that:
$$
\mathbb{E}(\rv{L}_i) = \sum_{j = 1}^i \mathbb{E}(\rv{Y}_j) = i\,\mathbb{E}(\rv{Y}) = i \times \mu_{\rv{Y}}
$$
We can then plug this into the completion condition (Eq. \@ref(eq:complete)) and solve for $i$ to get:
$$
\begin{equation}
i \times \mu_{\rv{Y}} \geq c \times |F| \iff i \geq \frac{c \times |F|}{\mu_{\rv{Y}}}
(\#eq:average-completion)
\end{equation}
$$
Note that this is intuitive to the point of being trivial. If we let $c = 1$, we get $i \geq |F|/\mu_{\rv{Y}}$, which just means that, on average, the node will have to sample a number of nodes equal to the number of blocks divided by the average partition size. In practice, we can use the sample mean $\overline{\mu_{\rv{Y}}} = \frac{1}{n}\sum_{i=1}^{n} \left|s(o_i)\right|$ instead of $\mu_{\rv{Y}}$ to estimate what $i$ can look like.
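In code, this plug-in estimate is a one-liner over the observed partition sizes (a sketch; `partitions` is assumed to be a list of integer vectors, one per node, as elsewhere in this repository):

```r
# Plug-in estimate of the number of contacts needed for completion, on average.
expected_contacts <- function(partitions, c) {
  sizes <- lengths(partitions)   # |s(o_i)| for each node
  m <- sum(sizes)                # |F|, assuming disjoint partitions
  ceiling(c * m / mean(sizes))   # i >= c * |F| / mu_Y, rounded up
}
```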
### Non-Independent Partition Sizes
If we cannot approximate partition sizes as independent random variables, then the problem changes. Stripping it down, we can cast it as follows. We have a set of integers $P = \{p_1, \cdots, p_n\}$ representing the sizes of each partition. We then want to understand the distribution of the partial sums for random permutations of $P$.
As I understand it, there is no good way of addressing this without running simulations. If we assume disjoint partitions, however, the simulations become a lot simpler, as we only need to track partition sizes rather than the full contents of $D_i$.
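A minimal version of that simulation only needs the partition sizes. The sketch below (`simulate_completion_rounds` and the example numbers are illustrative) draws random permutations of $P$ and records the first round at which the partial sum reaches $c \times |F|$:

```r
# Distribution of the completion round over random permutations of the
# partition sizes P (distinct contacts, disjoint partitions).
simulate_completion_rounds <- function(P, c, n_sims = 10000) {
  target <- c * sum(P)
  replicate(n_sims, {
    partial_sums <- cumsum(sample(P))  # partial sums of a random permutation of P
    which(partial_sums >= target)[1]   # first round at which the download completes
  })
}

# Hypothetical example: six nodes with very uneven partitions, coding rate 0.5.
rounds <- simulate_completion_rounds(P = c(10, 200, 50, 5, 300, 80), c = 0.5)
mean(rounds)                     # average completion round
quantile(rounds, c(0.05, 0.95))  # spread across permutations
```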