# Notes

Bloom filter and probability of false positives

## Questions

### How big is an envelope?

`[ Expiry, TTL, Topic, Data, Nonce ]` is 4+4+4+arb+8 bytes, where arb is the
data field, padded to a multiple of 256 bytes (minus salt?) according to
EIP-627. Can't find a definitive number.

512 bytes seems like a minimum and 4 kB a maximum. Let's assume 1 kB for now.

**Question: How big is an envelope in Status? Does it change?**

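A minimal Python sketch of the size arithmetic above. The field sizes and the
256-byte granularity are taken from the note; the padding scheme (rounding the
data field up to a multiple of 256 bytes) is an assumption, not a measured
Status envelope.

```
# Back-of-envelope estimate of a Whisper envelope size.
# Field sizes from the note: Expiry (4) + TTL (4) + Topic (4) + Nonce (8),
# plus a data field assumed to be padded up to a multiple of 256 bytes.
def envelope_size(payload_bytes: int, pad_to: int = 256) -> int:
    fixed = 4 + 4 + 4 + 8
    padded_data = ((payload_bytes + pad_to - 1) // pad_to) * pad_to
    return fixed + padded_data

for payload in (100, 500, 1000, 4000):
    print(payload, "->", envelope_size(payload), "bytes")
```
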
### How many envelopes over some time period for a full node?

Running `time ./shh_basic_client --mainnet --watch --post --port=30000`,
connected to all full nodes with a full bloom filter, we get incoming messages:

`WRN 2019-10-14 13:05:55+09:00 Message Ingress Stats tid=9527 added=6 allowed=58 disallowed=281 disallowedBloom=0 disallowedPow=281 disallowedSize=0 duplicate=27 invalid=23 noHandshake=0`

Note that the PoW check is incorrect here, so that's 281 + 58 ≈ 340 envelopes
in total per minute.

Assuming this is representative of the whole network load, we get ~2k envelopes
per hour and ~50k envelopes per 24 h. This corresponds to roughly ~2 MB/h and
~50 MB/day, assuming ~1 kB per envelope and that the figures below are accurate.

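A quick Python sketch of the extrapolation, using the note's working figures
(~2k envelopes per hour, ~1 kB per envelope) rather than measured values:

```
# Rough bandwidth extrapolation. Both inputs are the note's working
# assumptions, not measurements.
envelopes_per_hour = 2_000
envelope_size_kb = 1          # ~1 kB per envelope, assumed above

mb_per_hour = envelopes_per_hour * envelope_size_kb / 1_000
envelopes_per_day = envelopes_per_hour * 24
mb_per_day = mb_per_hour * 24

print(f"~{envelopes_per_day // 1000}k envelopes/day, "
      f"~{mb_per_hour:.0f} MB/h, ~{mb_per_day:.0f} MB/day")
```
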
Note that in https://discuss.status.im/t/performance-of-mailservers/1215 we see
1,467,846 envelopes over 30 days, which is ~50k/day. While that was back in May,
it is another data point.

### How many per topic?

`time ./shh_basic_client --mainnet --watch --post --port=30000 | grep topicStr > foo.log`
with the same settings as above, for 1 minute:

`wc -l foo.log` gives 159 entries, so about half of the above. Not sure why.

```
cat foo.log | grep 5C6C9B56 | wc -l
159
```

All of them are from that one weird topic. Hypothesis: no one was using Status
during this minute. Let's run again. Sent 4 messages from mobile, public and
private chat (1). Plus launched the Status app, so the discovery topic too.
Indeed:

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> wc -l foo.log
664 foo.log
```

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep 5C6C9B56 | wc -l
186
```

So that topic is roughly constant at ~3 envelopes a second.

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep F8946AAC | wc -l
432
```

Discovery topic, um, that's a lot! Does this imply duplicates?

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep 9C22FF5F | wc -l
36
```

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep CD423760 | wc -l
10
```

Not quite sure what to make of this, tbh.

Let's run again, just launching the app (30 s): it's all in either the 5C6C9B56
topic (mystery, 126 in 30 s) or the discovery topic F8946AAC (911 in 30 s)!

The discovery topic is going crazy.

Hypothesis: a lot of duplicate envelopes are getting through here.

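Instead of grepping each topic separately, the tally could be done in one pass.
This is a hypothetical Python sketch: it assumes each `foo.log` line carries a
`topicStr=<hex>` field, which may not match the actual nim-eth log format.

```
import re
from collections import Counter

# Hypothetical log format: assumes each line contains "topicStr=<8 hex chars>".
topic_re = re.compile(r"topicStr=(?:0x)?([0-9A-Fa-f]{8})")

counts = Counter()
with open("foo.log") as f:
    for line in f:
        match = topic_re.search(line)
        if match:
            counts[match.group(1).upper()] += 1

for topic, n in counts.most_common():
    print(f"{topic}: {n}")
```
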
### How many envelopes have already been received?

Checking hashes during a 30 s run; also go to the app and send one message. 159
total hashes, and
`0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089` appears 4
times:

```
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | awk '{print $8}' | sort | uniq -c
1
8 hash=0002B6BA794DD457D55D67AD68BB8C49C98791AFEF466534FC97E28062F763FB
8 hash=00637F3882A2EFEC89CF40249DC59FDC9A049B78D42EDB6E116B0D76BE0AA523
4 hash=0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089
24 hash=2D567B7E97FA2510A1299EA84F140B19DA0B2012BE431845B433BA238A22282C
22 hash=40D7D3BCC784CC9D663446A9DFB06D55533F80799261FDD30E30CC70853572CE
21 hash=707DA8C56605C57C930CE910F2700E379C350C845B5DAE20A9F8A6DBA4F59B2B
24 hash=AC8C3ABE198ABE3BF286E80E25B0CFF08B3573F42F61499FB258F519C1CF9F18
24 hash=C4C3D64886ED31A387B7AE57C904D702AEE78036E9446B30E964A149134B0D56
24 hash=D4A1D17641BD08E58589B120E7F8F399D23DA1AF1BA5BD3FED295CD852BC17DA
```

For how many connected nodes is this? WhisperNodes is 24, so assume that's the
duplication factor. Urf. But for Status nodes this should be lower?

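A small Python sketch of the duplication arithmetic implied above. The per-hash
counts are copied from the `uniq -c` output; reading the maximum as a per-peer
duplication factor is the note's assumption.

```
# Per-hash receive counts, copied from the uniq -c output above.
counts = [8, 8, 4, 24, 22, 21, 24, 24, 24]

total = sum(counts)        # 159 received envelopes carrying a hash
unique = len(counts)       # 9 unique envelopes
avg_dup = total / unique   # average duplication factor
max_dup = max(counts)      # worst case; matches the 24 connected WhisperNodes

print(f"total={total} unique={unique} avg_dup={avg_dup:.1f} max_dup={max_dup}")
```
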
### How does this behave with light users?

### How does this vary with Bloom filters?

Scalability trades off privacy against traffic, i.e. more false positives means
more privacy but also more bandwidth.

The bloom filter is m = 512 bits and we insert n topics, which leads to some
false positive rate p.

Assuming the optimal number of hash functions, k = (m/n) ln 2. Note that k is
set to 3 in Whisper, afaict.

https://hur.st/bloomfilter/?n=1000&p=&m=512&k=3

Whoa, so at 512 bits and k = 3 the probability of false positives is about 1%
at ~50 topics, ~10% at 100 topics, and essentially 1 at 1000 topics.

Which makes sense, since the bloom filter is more or less full by then.
Question: is n the number of items actually inserted into the filter, or the
universe of topics? It is the items inserted; the resulting false positive rate
then applies to every topic the filter is tested against that was not inserted.

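A minimal Python sketch that reproduces these numbers from the standard bloom
filter approximation p = (1 - e^(-k*n/m))^k, with m = 512 bits and k = 3 as in
Whisper:

```
import math

def false_positive_rate(n_topics: int, m_bits: int = 512, k: int = 3) -> float:
    # Standard bloom filter approximation: p = (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k * n_topics / m_bits)) ** k

for n in (10, 50, 100, 500, 1000):
    print(f"n={n:4d} topics -> p ~ {false_positive_rate(n):.3f}")
```
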
Rule of thumb: you need about 10 bits (~1 byte) per element for ~1% false
positives, so 512 / 10 ≈ 50 topics checks out. Um... sigh.

Is it accurate that this means 1% traffic overhead if you listen to 50 topics?
How many topics does a normal app listen to? It quickly explodes! Actually, is
this even accurate? Because if the false positive rate reaches 100%, it isn't
100% extra traffic, it is _all_ of the network's traffic.

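A rough Python sketch of that overhead question, under assumed numbers: if the
network carries T envelopes per day and a fraction f of them genuinely match
your topics, a false positive rate p adds roughly p * (1 - f) * T unwanted
envelopes, so at p = 1 you receive all traffic rather than "100% extra". Both T
and f below are illustrative assumptions.

```
import math

def bloom_fp(n_topics: int, m_bits: int = 512, k: int = 3) -> float:
    # Same approximation as above: p = (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k * n_topics / m_bits)) ** k

# Illustrative assumptions only.
total_envelopes_per_day = 50_000   # network-wide figure from the notes
own_fraction = 0.01                # hypothetical: 1% of traffic is actually yours

for n_topics in (10, 50, 100, 500):
    p = bloom_fp(n_topics)
    wanted = own_fraction * total_envelopes_per_day
    extra = p * (1 - own_fraction) * total_envelopes_per_day
    print(f"{n_topics:4d} topics: p={p:.3f}, wanted~{wanted:.0f}, "
          f"extra~{extra:.0f} envelopes/day")
```
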
topicMatch vs bloomMatch.

OK, the main factors:

1. Big topics
   - discovery, then the 5k one
   - what happens
2. Duplicate messages
   - number of peers
   - mailservers
3. Bloom filter
   - false positives
   - direct API call
4. Disconnect bad peers

The offline case dominates over the online one.
|