Notes
Bloom filter and probability of false positives
Questions
How big is an envelope?
[ Expiry, TTL, Topic, Data, Nonce ]
4+4+4+arb+8 bytes, where arb is the data field, sized as a multiple of 256 bytes (minus salt?) according to EIP-627. Can't find where this is specified.
512 bytes seems to be the minimum and 4 kB the max. Let's assume 1 kB for now.
Question: How big is an envelope in Status? Does it change?
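A minimal sketch of the size arithmetic above, for playing with different data sizes. It only adds up the field widths listed in these notes; the padding to a multiple of 256 bytes is the unconfirmed reading of EIP-627 from above, and RLP/framing overhead is ignored. The names `envelope_size` and `padded_data_size` are made up for illustration.

```python
# Fixed-width envelope fields as listed in the notes (bytes).
EXPIRY_BYTES = 4
TTL_BYTES = 4
TOPIC_BYTES = 4
NONCE_BYTES = 8
FIXED_OVERHEAD = EXPIRY_BYTES + TTL_BYTES + TOPIC_BYTES + NONCE_BYTES  # 20 bytes


def padded_data_size(payload_bytes: int, block: int = 256) -> int:
    """Round the payload up to a multiple of `block` bytes (assumed padding rule)."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    return -(-payload_bytes // block) * block  # ceiling division


def envelope_size(data_bytes: int) -> int:
    """Approximate envelope size: fixed fields plus data, no RLP overhead (assumption)."""
    if data_bytes < 0:
        raise ValueError("data_bytes must be non-negative")
    return FIXED_OVERHEAD + data_bytes


if __name__ == "__main__":
    for payload in (100, 300, 1000):
        data = padded_data_size(payload)
        print(f"payload={payload} data={data} envelope={envelope_size(data)}")
```

With these assumptions the fixed fields add only 20 bytes, so the data field dominates and the 1 kB guess is mostly a guess about payload size.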
How many envelopes over some time period for a full node?
Running
time ./shh_basic_client --mainnet --watch --post --port=30000
while connected to all full nodes with a full bloom filter, we get these incoming messages:
WRN 2019-10-14 13:05:55+09:00 Message Ingress Stats tid=9527 added=6 allowed=58 disallowed=281 disallowedBloom=0 disallowedPow=281 disallowedSize=0 duplicate=27 invalid=23 noHandshake=0
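Note that the PoW setting is incorrect, so the PoW-disallowed envelopes count too: that's 281+58 ~ 340 envelopes in total per minute.
Assuming this is representative of the whole network load, we get ~2k envelopes per hour and ~50k envelopes per 24h. (Check the arithmetic: ~340/min is ~20k/hour, so these figures are ~10x lower than a straight extrapolation of a one-minute window.) This corresponds to roughly ~2 MB/h and ~50 MB/day, assuming the ~1 kB envelope estimate is accurate.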
Note that in https://discuss.status.im/t/performance-of-mailservers/1215 we see 1467846 envelopes for 30days, which is ~50k/day. While this was in May, this is another data point.
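To make the extrapolation repeatable, here is a rough sketch that parses the stats line above and scales it up. It assumes the counters cover a one-minute window and that an envelope is ~1 kB (both assumptions from these notes, not verified); `parse_stats` is a hypothetical helper, not part of the client.

```python
import re

STATS_LINE = (
    "WRN 2019-10-14 13:05:55+09:00 Message Ingress Stats tid=9527 added=6 "
    "allowed=58 disallowed=281 disallowedBloom=0 disallowedPow=281 "
    "disallowedSize=0 duplicate=27 invalid=23 noHandshake=0"
)

ENVELOPE_BYTES = 1000  # assumed ~1 kB per envelope


def parse_stats(line: str) -> dict:
    """Extract every key=<integer> pair from a stats log line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


stats = parse_stats(STATS_LINE)

# Count PoW-disallowed envelopes too, since the PoW setting was wrong in this run.
per_minute = stats["allowed"] + stats["disallowed"]
per_hour = per_minute * 60
per_day = per_hour * 24

# Second data point: mailserver count over 30 days from the forum post.
mailserver_per_day = 1467846 / 30

if __name__ == "__main__":
    print(f"sample: {per_minute}/min, {per_hour}/hour, {per_day}/day")
    print(f"mailserver: {mailserver_per_day:.0f}/day")
    print(f"mailserver volume: {mailserver_per_day * ENVELOPE_BYTES / 1e6:.1f} MB/day")
```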
How many per topic?
time ./shh_basic_client --mainnet --watch --post --port=30000 | grep topicStr > foo.log
with same settings as above for 1 minute:
wc -l foo.log
159 entries, so half of above. Not sure why.
cat foo.log | grep 5C6C9B56 | wc -l
159
All are from that weird topic. Hypothesis: no one was using Status during this minute. Let's run again. Sent 4 messages from mobile, in public and private chat (1). Also launched the Status app, so the discovery topic should show up. Indeed:
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> wc -l foo.log
664 foo.log
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep 5C6C9B56 | wc -l
186
Constant at roughly 3 per second.
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep F8946AAC | wc -l
432
Discovery topic, um, that's a lot! Does this imply duplicates?
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep 9C22FF5F | wc -l
36
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | grep CD423760 | wc -l
10
Not quite sure what to make of this, tbh.
Let's run again, just launching the app (30s): everything is in either the 5C6C9B56 topic (mystery), 126 envelopes in 30s, or the discovery topic F8946AAC, 911 envelopes in 30s!
The discovery topic is going crazy.
Hypothesis: a lot of duplicate envelopes getting through here.
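The repeated grep/wc calls above can be folded into one pass. A small sketch, assuming only that the topic id appears somewhere in each log line (same as the grep); the sample line layout in the demo is hypothetical, and the counts are the ones from the one-minute run above.

```python
from collections import Counter

TOPICS = ("5C6C9B56", "F8946AAC", "9C22FF5F", "CD423760")


def count_topics(lines, topics=TOPICS):
    """Count lines per topic id using substring matching, like grep | wc -l."""
    counts = Counter()
    for line in lines:
        for topic in topics:
            if topic in line:
                counts[topic] += 1
    return counts


def topic_shares(counts):
    """Fraction of all counted lines that each topic accounts for."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()} if total else {}


# Counts observed in the one-minute run in these notes (664 lines in total).
observed = Counter({"5C6C9B56": 186, "F8946AAC": 432, "9C22FF5F": 36, "CD423760": 10})

if __name__ == "__main__":
    # Hypothetical log lines; the real layout of foo.log is not shown in the notes.
    sample = ["x topicStr=5C6C9B56 y", "x topicStr=F8946AAC y", "x topicStr=F8946AAC y"]
    print(count_topics(sample))
    for topic, share in sorted(topic_shares(observed).items(), key=lambda kv: -kv[1]):
        print(f"{topic}: {share:.1%}")
```

For a real run, something like `count_topics(open("foo.log"))` should give the same numbers as the grep pipeline.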
How many envelopes have already been received?
Checking hashes during a 30s run, while also going to the app and sending one message. 159 total hashes, and 0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089 appears 4 times:
oskarth@localhost /home/oskarth/git/nim-eth/tests/p2p> cat foo.log | awk '{print $8}' | sort | uniq -c
1
8 hash=0002B6BA794DD457D55D67AD68BB8C49C98791AFEF466534FC97E28062F763FB
8 hash=00637F3882A2EFEC89CF40249DC59FDC9A049B78D42EDB6E116B0D76BE0AA523
4 hash=0102C0C42044FDCAC2D64CAF7EF35AA759BEA21703EE8BA7AEFFD176E2280089
24 hash=2D567B7E97FA2510A1299EA84F140B19DA0B2012BE431845B433BA238A22282C
22 hash=40D7D3BCC784CC9D663446A9DFB06D55533F80799261FDD30E30CC70853572CE
21 hash=707DA8C56605C57C930CE910F2700E379C350C845B5DAE20A9F8A6DBA4F59B2B
24 hash=AC8C3ABE198ABE3BF286E80E25B0CFF08B3573F42F61499FB258F519C1CF9F18
24 hash=C4C3D64886ED31A387B7AE57C904D702AEE78036E9446B30E964A149134B0D56
24 hash=D4A1D17641BD08E58589B120E7F8F399D23DA1AF1BA5BD3FED295CD852BC17DA
How many connected nodes is this for? WhisperNodes is 24, so assuming that's the duplication factor. Urf. But for Status nodes this should be lower?
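A quick sketch of the same duplicate count in Python, mirroring the awk/sort/uniq pipeline (8th whitespace-separated field, kept only if it starts with `hash=`). The per-hash counts below are copied from the output above; the sample line layout is hypothetical.

```python
from collections import Counter


def count_hashes(lines, field=8):
    """Count occurrences of the `field`-th whitespace-separated token (1-based)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= field and parts[field - 1].startswith("hash="):
            counts[parts[field - 1]] += 1
    return counts


def duplication_summary(per_hash_counts):
    """Return (total envelopes seen, unique hashes, mean copies per hash, max copies)."""
    total = sum(per_hash_counts)
    unique = len(per_hash_counts)
    mean = total / unique if unique else 0.0
    return total, unique, mean, max(per_hash_counts, default=0)


# Copies per hash from the 30s run in these notes.
OBSERVED = [8, 8, 4, 24, 22, 21, 24, 24, 24]

if __name__ == "__main__":
    total, unique, mean, worst = duplication_summary(OBSERVED)
    print(f"total={total} unique={unique} mean_copies={mean:.1f} max_copies={worst}")
```

So only 9 distinct envelopes arrived in that run, and the worst case matches the 24 connected nodes, which fits the duplication hypothesis.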
How does this behave with light users?
How does this vary with Bloom filters?
Scalability trades off privacy vs traffic, i.e. more false positives = more privacy = more bandwidth.
Bloom size is m = 512 bits, and we have n topics, which leads to some false positive rate p.
Assuming the optimal number of hash functions, k = (m/n) ln 2. Note that k is set to 3 in Whisper afaict.
https://hur.st/bloomfilter/?n=1000&p=&m=512&k=3
Whoa, so at 512 bits and k=3 the probability of false positives is ~1% at ~50 topics, ~10% at 100 topics and essentially 1 at 1000 topics.
Which makes sense, since the bloom would be full by then. Question: is n the actual items in the filter or the universe? n is the items inserted; p then applies to every other item the filter is tested against.
You need about 10 bits (roughly 1 byte) per element for a ~1% rate, and 512/10 is ~51, so 50 topics checks out. Um... sigh.
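The calculator numbers can be reproduced with the standard Bloom filter approximation p = (1 - e^(-kn/m))^k. A minimal sketch, assuming m = 512 bits and k = 3 as in these notes, and assuming Whisper's topic bloom behaves like a textbook Bloom filter (not checked against the implementation):

```python
import math


def false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Standard approximation: p = (1 - exp(-k*n/m))**k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_k(m_bits: int, n_items: int) -> float:
    """Number of hash functions that minimises p: k = (m/n) * ln 2."""
    return (m_bits / n_items) * math.log(2)


if __name__ == "__main__":
    for n in (50, 100, 1000):
        p = false_positive_rate(512, 3, n)
        print(f"n={n:5d} p={p:.4f} optimal_k={optimal_k(512, n):.2f}")
```

This gives roughly 1.6% at 50 topics, 8.7% at 100 and 99% at 1000, in line with the calculator. Note that the optimal k at 50 topics would be about 7, so k = 3 is not optimal there.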
Is it accurate that a 1% false positive rate means 1% traffic overhead if you listen to 50 topics? How many topics does a normal app listen to? It quickly explodes! Actually, is this accurate? Because at a 100% false positive rate the overhead isn't 100% extra traffic; you simply receive all traffic on the network.
topicMatch vs bloomMatch.
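One way to think about the overhead question: envelopes on your topics always pass the bloom, and every other envelope passes with probability p. A rough sketch under that assumption; the 50k/day total is the estimate from these notes, and the 500 matching envelopes is a made-up number purely for illustration.

```python
def expected_received(total: float, matching: float, fp_rate: float) -> float:
    """Expected envelopes passing the bloom: true matches plus false positives."""
    if not 0.0 <= fp_rate <= 1.0:
        raise ValueError("fp_rate must be between 0 and 1")
    if matching > total:
        raise ValueError("matching cannot exceed total")
    non_matching = total - matching
    return matching + fp_rate * non_matching


if __name__ == "__main__":
    total_per_day = 50_000   # network-wide estimate from the notes
    matching_per_day = 500   # hypothetical: envelopes on topics we care about
    for p in (0.01, 0.1, 1.0):
        received = expected_received(total_per_day, matching_per_day, p)
        print(f"p={p:.2f} received={received:.0f} overhead={received / matching_per_day:.1f}x")
```

So the overhead is p times the rest of the network's traffic, not p times your own traffic, which is why p = 1 means receiving everything.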
ok, 3 main factors:
1 big topics
  - discovery, then 5k one
  - what happens
2 duplicate messages
  - number of peers
  - mailservers
3 bloom filter
  - false positive
  - direct api call
  - disconnect bad peers
offline case dominating over online