---
title: "Static Network Experiment"
subtitle: "Download Times -- Single Experiment Exploratory Analysis"
output:
  bookdown::html_notebook2:
    number_sections: TRUE
    toc: TRUE
---
## Goals
The goal of this notebook is to provide a simple analysis of download time distributions across repeated runs of our [static dissemination experiment](https://github.com/codex-storage/bittorrent-benchmarks/blob/95651ad9d7e5ac4fb7050767cbac94ea75c8c07b/benchmarks/core/experiments/static_experiment.py#L22) under a _single parameter set_; i.e., a set of experiments for which:

* network size;
* number of seeders;
* number of leechers;
* file size;

remain constant.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(jsonlite)
```
## Experiment Parameters
```{r}
EXPERIMENT_ROOT <- 'data/codex/parsed'
experiment_file <- function(filename) file.path(EXPERIMENT_ROOT, filename)
```
```{r}
# Assumes the experiment root contains exactly one JSONL metadata file.
experiment_meta <- jsonlite::read_json(fs::dir_ls(EXPERIMENT_ROOT, glob='*.jsonl'))
```
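Before using the metadata it is worth eyeballing its structure; the fields this notebook relies on are `nodes$nodes`, `seeders`, `seeder_sets`, `repetitions`, `file_size`, and `download_metric_unit_bytes`:
```{r}
# Inspect the parsed metadata, truncating nested structures for readability.
str(experiment_meta, max.level = 2)
```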
```{r}
n_leechers <- length(experiment_meta$nodes$nodes) - experiment_meta$seeders
tribble(
  ~parameter, ~value,
  'network size', length(experiment_meta$nodes$nodes),
  'number of seeders', experiment_meta$seeders,
  'number of leechers', n_leechers,
  'file size (bytes)', experiment_meta$file_size,
)
```
## Download Logs
```{r}
downloads <- read_csv(
  fs::dir_ls(EXPERIMENT_ROOT, glob='*download*'),
  show_col_types = FALSE,
) |>
  mutate(
    # dataset_name encodes the seed set (leading digits after the prefix)
    # and the run (trailing digits).
    temp = str_remove(dataset_name, '^dataset-'),
    seed_set = as.numeric(str_extract(temp, '^\\d+')),
    run = as.numeric(str_extract(temp, '\\d+$')),
    # Convert the download metric to bytes.
    value = value * experiment_meta$download_metric_unit_bytes
  ) |>
  select(-temp, -name)
```
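To make the parsing above concrete: it assumes dataset names shaped like `dataset-<seed_set>-<label>-<run>`, where only the leading and trailing digit groups matter (the `<label>` in the toy name below is made up for illustration):
```{r}
# Toy example of the assumed dataset_name format.
toy <- str_remove('dataset-2-static-dissemination-5', '^dataset-')
as.numeric(str_extract(toy, '^\\d+'))  # seed set: 2
as.numeric(str_extract(toy, '\\d+$'))  # run: 5
```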
```{r}
downloads <- downloads |>
  group_by(node, seed_set, run) |>
  arrange(timestamp) |>
  mutate(
    # Number each node's download events in arrival order.
    piece_count = seq_along(timestamp)
  ) |>
  ungroup() |>
  mutate(completed = value / experiment_meta$file_size)
```
## Results
### Sanity Checks and Loss Statistics
```{r fig.cap='Raw data plot, per experiment.', fig.width=10, fig.height=10}
ggplot(downloads |>
         filter(seed_set < 3) |>
         group_by(seed_set, run) |>
         mutate(timestamp = as.numeric(timestamp - min(timestamp))) |>
         ungroup()) +
  geom_line(aes(x = timestamp, y = completed, col = node), lwd = 0.7) +
  scale_y_continuous(labels = scales::percent_format()) +
  facet_grid(run ~ seed_set, labeller = labeller(
    run = as_labeller(function(x) paste0("run: ", x)),
    seed_set = as_labeller(function(x) paste0("seed set: ", x)))) +
  xlab('elapsed time (seconds)') +
  ylab('download completion (%)') +
  theme_bw(base_size = 15)
```
Have any nodes failed to download the entire file?
```{r}
downloads |>
  group_by(node, seed_set, run) |>
  summarise(completed = max(completed), .groups = 'drop') |>
  filter(completed < 1.0)
```
Do we have as many runs and seed sets as we expect?
```{r}
downloads |>
  select(seed_set, node, run) |>
  distinct() |>
  group_by(seed_set, node) |>
  count() |>
  filter(n != experiment_meta$repetitions)
```
### Computing Download Times
We define the _download time_ for a node $d$ as the time elapsed between the client's response to a [request to download a dataset](https://github.com/codex-storage/bittorrent-benchmarks/blob/95651ad9d7e5ac4fb7050767cbac94ea75c8c07b/benchmarks/core/network.py#L58) and the time at which the client reports having received the last piece of the file. The form of this request, as well as how it is made, is client-specific, but typically involves an RPC or REST API call of one flavour or another. Since seeders are already in possession of the file by construction, we only measure download times at _leechers_.
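In symbols, and purely for exposition, write $t^{\mathrm{req}}_d$ for the timestamp at which the download request to leecher $d$ completes, and $t_d(k)$ for the timestamp at which $d$ reports receipt of the $k$-th of $n$ pieces. The download time is then:
$$T_d = t_d(n) - t^{\mathrm{req}}_d$$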
```{r}
download_requests <- read_csv(
  experiment_file('request_event.csv'), show_col_types = FALSE) |>
  # Strip stray quotes from the destination field.
  mutate(destination = gsub("\"", "", destination))
```
```{r}
download_start <- download_requests |>
  select(-request_id) |>
  filter(name == 'leech', type == 'RequestEventType.end') |>
  mutate(
    # We didn't log run and seed set on the runner side, so we have to
    # reconstruct them from row order: rows come grouped by seed set,
    # then run, then leecher. Labels are zero-based.
    run = rep(
      rep(seq_len(experiment_meta$repetitions) - 1, each = n_leechers),
      times = experiment_meta$seeder_sets
    ),
    seed_set = rep(
      seq_len(experiment_meta$seeder_sets) - 1,
      each = n_leechers * experiment_meta$repetitions
    ),
  ) |>
  transmute(node = destination, run, seed_set, seed_request_time = timestamp)
```
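Since the run and seed set labels are reconstructed purely from row order, a defensive check that the row count matches the assumed layout (one `end` event per leecher, per run, per seed set) is cheap insurance:
```{r}
stopifnot(
  nrow(download_start) ==
    n_leechers * experiment_meta$repetitions * experiment_meta$seeder_sets
)
```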
```{r}
download_times <- downloads |>
  left_join(download_start, by = c('node', 'run', 'seed_set')) |>
  mutate(
    elapsed_download_time = as.numeric(timestamp - seed_request_time)
  ) |>
  group_by(node, run, seed_set) |>
  # Lookup time: from the download request until the first piece arrives.
  mutate(lookup_time = as.numeric(min(timestamp) - seed_request_time)) |>
  ungroup()
```
If we did this right, the elapsed download time can never be negative, and neither can the lookup time.
```{r}
download_times |> filter(elapsed_download_time < 0 | lookup_time < 0)
```
We can now actually compute statistics on the download times.
```{r}
download_time_stats <- download_times |>
  filter(!is.na(elapsed_download_time)) |>
  group_by(piece_count, completed) |>
  summarise(
    mean = mean(elapsed_download_time),
    median = median(elapsed_download_time),
    max = max(elapsed_download_time),
    min = min(elapsed_download_time),
    p90 = quantile(elapsed_download_time, probs = 0.9),
    p10 = quantile(elapsed_download_time, probs = 0.1),
    .groups = 'drop'
  )
```
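The row at the highest completion level summarises full-file download times across all nodes, runs, and seed sets:
```{r}
# Download time distribution for the last piece; i.e., the full file.
download_time_stats |> slice_max(completed, n = 1)
```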
```{r}
ggplot(download_time_stats) +
  geom_ribbon(aes(xmin = p10, xmax = p90, y = completed),
              fill = scales::alpha('blue', 0.5), col = 'lightgray') +
  geom_line(aes(x = median, y = completed)) +
  theme_minimal() +
  ylab("completion") +
  xlab("time (seconds)") +
  scale_y_continuous(labels = scales::percent) +
  # fs::as_fs_bytes renders the file size in human-readable form.
  ggtitle(paste0('download time (Codex, ',
                 format(fs::as_fs_bytes(experiment_meta$file_size)),
                 ' file)'))
```