
# Notes

Bloom filter and probability of false positives


## Questions

### How big is an envelope?

`[ Expiry, TTL, Topic, Data, Nonce ]`, i.e. 4+4+4+arb+8 bytes, where arb is the
data field, padded to a multiple of 256 bytes (minus salt?) according to
EIP-627. Can't find a definitive size.

512 bytes seems like minimum, and 4 kB max. Let's assume 1 kB for now.

**Question: How big is an envelope in Status? Does it change?**

### How many envelopes over some time period for a full node?

Running `time ./shh_basic_client --mainnet --watch --post --port=30000`, connected to all full nodes with a full bloom filter we get incoming messages:

`WRN 2019-10-14 13:05:55+09:00 Message Ingress Stats                      tid=9527 added=6 allowed=58 disallowed=281 disallowedBloom=0 disallowedPow=281 disallowedSize=0 duplicate=27 invalid=23 noHandshake=0`

Note that the PoW setting is incorrect here, so the 281 `disallowedPow` envelopes count too: 281+58 = 339, i.e. ~340 envelopes in total per minute.

Assuming this is representative of whole-network load, that per-minute rate
extrapolates to ~20k envelopes per hour and roughly 0.5M envelopes per 24h. At
the 1 kB per envelope assumed above, this corresponds to ~20 MB/h and
~500 MB/day.

Note that in https://discuss.status.im/t/performance-of-mailservers/1215 we see
1467846 envelopes for 30 days, which is ~50k/day. While this was in May, this is
another data point.
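
The conversion from the mailserver figure to daily and hourly load can be checked with a few lines of Python. This is only a sketch: the 1 kB envelope size is the assumption made earlier in these notes, not a measurement, and `daily_load` is a hypothetical helper name.

```python
ENVELOPE_BYTES = 1000  # assumed average envelope size (1 kB), not a measured value


def daily_load(envelopes: int, days: float) -> dict:
    """Convert an envelope count over a period into per-day and per-hour rates."""
    per_day = envelopes / days
    per_hour = per_day / 24
    return {
        "envelopes_per_day": per_day,
        "envelopes_per_hour": per_hour,
        "mb_per_day": per_day * ENVELOPE_BYTES / 1e6,
        "mb_per_hour": per_hour * ENVELOPE_BYTES / 1e6,
    }


# Mailserver data point: 1467846 envelopes over 30 days.
stats = daily_load(1467846, 30)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

For the mailserver data point this comes out at about 49k envelopes (~49 MB) per day, or about 2k envelopes (~2 MB) per hour.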

### How many per topic?

`time ./shh_basic_client --mainnet --watch --post --port=30000 | grep topicStr > foo.log` with same settings as above for 1 minute:

`wc -l foo.log` gives 159 entries, so about half of the above. Not sure why.

```
cat foo.log | grep 5C6C9B56 | wc -l
159
```

All are from that weird topic. Hypothesis: no one was using Status during this
minute. Let's run again. Sent 4 messages from mobile, in public and private chat
(1), and launched the Status app, so the discovery topic should show up. Indeed:

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> wc -l foo.log
664 foo.log
```

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep 5C6C9B56 | wc -l
186
```

Constant at roughly 3 per second.

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep F8946AAC | wc -l
432
```

Discovery topic, um, that's a lot! Does this imply duplicates?

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep 9C22FF5F | wc -l
36
```

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep CD423760 | wc -l
10
```

Not quite sure what to make of this, tbh.
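
To put the four counts in proportion, here is a quick tally in Python. The numbers are copied from the grep output of the second run; what the 9C22FF5F and CD423760 topics are is not known here.

```python
# Per-topic envelope counts from the second run (copied from the grep output).
topic_counts = {
    "5C6C9B56": 186,  # mystery topic
    "F8946AAC": 432,  # discovery topic
    "9C22FF5F": 36,
    "CD423760": 10,
}

total = sum(topic_counts.values())  # should match `wc -l foo.log`
shares = {topic: count / total for topic, count in topic_counts.items()}

# Print topics from largest to smallest share of the log.
for topic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{topic}: {topic_counts[topic]:4d} ({share:.0%})")
print("total:", total)
```

The four topics add up to the 664 lines in `foo.log`, with the discovery topic at about 65% and the mystery topic at about 28% of all entries.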

Let's run again, just launching the app (30s). Everything is in either the
5C6C9B56 topic (mystery), 126 entries in 30s, or the discovery topic F8946AAC,
911 entries in 30s!

The discovery topic is going crazy.

Hypothesis: a lot of duplicate envelopes getting through here.

### How many envelopes have already been received?

Checking hashes during a 30s run, while also sending one message from the app. 159 hash entries in total, of which this one shows up 4 times:

`0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089`

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | awk '{print $8}' | sort | uniq -c
      1
      8 hash=0002B6BA794DD457D55D67AD68BB8C49C98791AFEF466534FC97E28062F763FB
      8 hash=00637F3882A2EFEC89CF40249DC59FDC9A049B78D42EDB6E116B0D76BE0AA523
      4 hash=0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089
     24 hash=2D567B7E97FA2510A1299EA84F140B19DA0B2012BE431845B433BA238A22282C
     22 hash=40D7D3BCC784CC9D663446A9DFB06D55533F80799261FDD30E30CC70853572CE
     21 hash=707DA8C56605C57C930CE910F2700E379C350C845B5DAE20A9F8A6DBA4F59B2B
     24 hash=AC8C3ABE198ABE3BF286E80E25B0CFF08B3573F42F61499FB258F519C1CF9F18
     24 hash=C4C3D64886ED31A387B7AE57C904D702AEE78036E9446B30E964A149134B0D56
     24 hash=D4A1D17641BD08E58589B120E7F8F399D23DA1AF1BA5BD3FED295CD852BC17DA
```

How many connected nodes is this for? WhisperNodes is 24, so assuming that's the
duplication factor. Urf. But for Status nodes this should be lower?
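
A rough way to quantify the duplication is to compare received envelopes against unique hashes. The sketch below uses the per-hash counts copied from the `uniq -c` output above and ignores the single blank entry.

```python
# Copies received per envelope hash, copied from the `uniq -c` output
# (the single blank entry is ignored).
hash_counts = [8, 8, 4, 24, 22, 21, 24, 24, 24]

total_received = sum(hash_counts)
unique_envelopes = len(hash_counts)
mean_duplication = total_received / unique_envelopes

print("received:", total_received)
print("unique:", unique_envelopes)
print(f"mean copies per envelope: {mean_duplication:.1f}")
print("max copies:", max(hash_counts))
```

So the 159 received envelopes are only 9 unique ones, about 18 copies each on average, and the maximum of 24 copies matches the WhisperNodes count.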

### How does this behave with light users?

### How does this vary with Bloom filters?

Scalability trades off privacy vs traffic, i.e. false positives = privacy =
bandwidth.

Bloom size is m = 512 bits, and we have n topics, which leads to some false
positive rate p.

Assuming the optimal number of hash functions, `k = (m/n) ln 2`. Note that k is
set to 3 in Whisper afaict.

https://hur.st/bloomfilter/?n=1000&p=&m=512&k=3

Whoa, so at 512 bits and k=3 the probability of false positives is roughly 1 to
2% at ~50 topics, ~10% at 100 topics and essentially 1 at 1000 topics.

Which makes sense, since the bloom would be full by then. Question: is n the
number of items actually in the filter, or the universe? n is what has been
inserted; p applies to every other item the filter is tested against.

For a 1% false positive rate you need about 10 bits per element, a bit over
1 byte. 512/10 is ~51, so 50 topics checks out. Um... sigh.
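
The figures from the calculator can be reproduced with the standard Bloom filter approximation `p = (1 - e^(-kn/m))^k`. This is a minimal sketch, assuming Whisper's filter behaves like a textbook Bloom filter with m = 512 and k = 3; the function names are hypothetical.

```python
import math


def false_positive_rate(n: int, m: int = 512, k: int = 3) -> float:
    """Standard Bloom filter approximation: p = (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(n: int, m: int = 512) -> float:
    """Number of hash functions that minimises p for n items in m bits."""
    return (m / n) * math.log(2)


for n in (10, 50, 100, 1000):
    print(f"n={n:5d}  p={false_positive_rate(n):.4f}  optimal k={optimal_k(n):.2f}")
```

With k fixed at 3 this gives about 1.6% at 50 topics and about 9% at 100 topics. The optimal k for 50 topics would be around 7, which is where the "10 bits per element for 1%" rule of thumb comes from.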

Is it accurate that this means 1% traffic overhead if you listen to 50 topics?
How many topics does a normal app listen to? It quickly explodes! Actually, is
this accurate? Because a 100% false positive rate isn't 100% overhead on your
own traffic, it means receiving _all_ network traffic.
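
One way to read the false positive rate is as the fraction of unrelated network traffic that still gets through the filter. The sketch below makes that concrete under loud assumptions: the ~50 MB/day network load estimated earlier, a textbook Bloom filter with m = 512 and k = 3, and the simplification that every envelope on the network is unrelated to our topics. `false_positive_mb_per_day` is a hypothetical helper.

```python
import math


def false_positive_rate(n: int, m: int = 512, k: int = 3) -> float:
    """Standard Bloom filter approximation: p = (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


def false_positive_mb_per_day(n_topics: int, network_mb_per_day: float = 50.0) -> float:
    """Traffic received purely through false positives.

    Simplification: every envelope on the network is treated as unrelated to
    our topics, so each one passes the filter with probability p.
    """
    return false_positive_rate(n_topics) * network_mb_per_day


for n in (10, 50, 100, 200, 1000):
    print(f"{n:5d} topics -> {false_positive_mb_per_day(n):5.1f} MB/day of unwanted traffic")
```

Under these assumptions the overhead is a fraction of total network traffic, not of the node's own traffic, so it approaches the full ~50 MB/day as the filter fills up.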

topicMatch vs bloomMatch.

OK, 4 main factors:

1. big topics
  - discovery, then 5k one
    - what happens
2. duplicate messages
  - number of peers
  - mailservers
3. bloom filter
  - false positive
  - direct api call
4. disconnect bad peers

offline case dominating over online