To break a potential read/write deadlock, gossipsub uses an unbounded
queue for writes - when peers are too slow to process this queue, it may
end up growing without bounds causing high memory usage.
Here, we introduce a maximum write queue length after which the peer is
disconnected - the queue is generous enough that any "normal" usage
should be fine - writes that are `await`:ed are not affected, only
writes that are launched in an `asyncSpawn` task or similar.
* avoid unnecessary copy of message when there are no send observers
* release message memory earlier in gossipsub
* simplify pubsubpeer logging
* add helper to read EOF marker after closing stream (else stream stay
alive until timeout/reset)
* don't assert on empty channel message
* don't loop when writing to chronos (no need)
When messages can't be sent to peer, we try to establish a send
connection - this causes messages to stack up as more and more unsent
messages are blocked on the dial lock.
* remove dial lock
* run reconnection loop in background task
* add peer lifecycle events
* rework peer events to not use connection events
* don't use result in pubsub and switch init
* wip
* use ordered hashes and remove logscope
* logging
* add missing test
* small fixes
This change modifies how the backpressure algorithm in bufferstream
works - in particular, instead of working byte-by-byte, it will now work
seq-by-seq.
When data arrives, it usually does so in packets - in the current
bufferstream, the packet is read then split into bytes which are fed one
by one to the bufferstream. On the reading side, the bytes are popped of
the bufferstream, again byte by byte, to satisfy `readOnce` requests -
this introduces a lot of synchronization traffic because the checks for
full buffer and for async event handling must be done for every byte.
In this PR, a queue of length 1 is used instead - this means there will
at most exist one "packet" in `pushTo`, one in the queue and one in the
slush buffer that is used to store incomplete reads.
* avoid byte-by-byte copy to buffer, with synchronization in-between
* reuse AsyncQueue synchronization logic instead of rolling own
* avoid writeHandler callback - implement `write` method instead
* simplify EOF signalling by only setting EOF flag in queue reader (and
reset)
* remove BufferStream pipes (unused)
* fixes drainBuffer deadlock when drain is called from within read loop
and thus blocks draining
* fix lpchannel init order
if the connection is already closed (because the remote closes during
identfiy for example), an exception would be raised which would leave
the connection in limbo, beacuse it would not go through the rest of
internalConnect.
Also, if the connection is already closed, the disconnect event would be
scheduled before the connect event :/
* mcache fixes
* remove timed cache - the window shifting already removes old messages
* ref -> object
* avoid unnecessary allocations with `[]` operator
* simplify init
* fix several gossipsub/floodsub issues
* floodsub, gossipsub: don't rebroadcast messages that fail validation
(!)
* floodsub, gossipsub: don't crash when unsubscribing from unknown
topics (!)
* gossipsub: don't send message to peers that are not interested in the
topic, when messages don't share topic list
* floodsub: don't repeat all messages for each message when
rebroadcasting
* floodsub: allow sending empty data
* floodsub: fix inefficient unsubscribe
* sync floodsub/gossipsub logging
* gossipsub: include incoming messages in mcache (!)
* gossipsub: don't rebroadcast already-seen messages (!)
* pubsubpeer: remove incoming/outgoing seen caches - these are already
handled in gossipsub, floodsub and will cause trouble when peers try to
resubscribe / regraft topics (because control messages will have same
digest)
* timedcache: reimplement without timers (fixes timer leaks and extreme
inefficiency due to per-message closures, futures etc)
* timedcache: ref -> obj
* remove send lock
When mplex receives data it will block until a reader has processed the
data. Thus, when a large message is received, such as a gossipsub
subscription table, all of mplex will be blocked until all reading is
finished.
However, if at the same time a `dial` to establish a gossipsub send
connection is ongoing, that `dial` will be blocked because mplex is no
longer reading data - specifically, it might indeed be the connection
that's processing the previous data that is waiting for a send
connection.
There are other problems with the current code:
* If an exception is raised, it is not necessarily raised for the same
connection as `p.sendConn`, so resetting `p.sendConn` in the exception
handling is wrong
* `p.isConnected` is checked before taking the lock - thus, if it
returns false, a new dial will be started. If a new task enters `send`
before dial is finished, it will also determine `p.isConnected` is
false, then get stuck on the lock - when the previous task finishes and
releases the lock, the new task will _also_ dial and thus reset
`p.sendConn` causing a leak.
* prefer existing connection
simplifies flow
* move pubsub of off switch, pass switch into pubsub
* use join on lpstreams
* properly cleanup up failed peers
* fix tests
* fix peertable hasPeerId
* fix tests
* rework sending, remove helpers from pubsubpeer, unify in broadcast
* further split broadcast into send
* use send where appropriate
* use formatIt
* improve trace
Co-authored-by: Giovanni Petrantoni <giovanni@fragcolor.xyz>
Every Chronicles log record has an existing `msg` property matching
the static string supplied in the log statement. Thus, it's currently
not possible to use `msg` as the name of a user property:
https://github.com/status-im/nim-chronicles/issues/86
* prefer PeerID in switch api
This avoids ref issues like ref identity and nil
* use existing peerinfo instance if possible
* remove secureCodec
there may be multiple connections per peerinfo with different codecs
* avoid some extra async::
* add finegrained timeouts to pubsub
* use 10 millis timeout in tests
* finalization
* revert timeouts
* use `atEof` for reads
* adjust timeouts and use atEof for reads
* use atEof for reads
* set isEof flag
* no backoff for pubsub streams
* temp timer increase, make macos finalize
* don't call `subscribePeer` in libp2p anymore
* more traces
* leak tests
* lower timeouts
* handle exceptions in control message
* don't use `cancelAndWait`
* handle exceptions in helpers
* wip
* don't send empty messages
* check for leaks properly
* don't use cancelAndWait
* don't await subscribption sends
* remove subscrivePeer calls from switch
* trying without the hooks again
* Fix gossip messages seqno according to spec
* Add peers back to gossipsub table, slow down heartbeat
* Revert "Add peers back to gossipsub table, slow down heartbeat"
This reverts commit 01e2e62172.
* make seqno a threadvar, remove from peerinfo
* seqno refactor, into pubsub
* Minprotobuf initial commit
* Fix noise.
* Add signed integers support.
Add checks for field number value.
Remove some casts.
* Fix compile errors.
* Fix comments and constants.
* more cleanup
* fix tests
* merging master
* remove `withLock` as it conflicts with stdlib
* wip
* more fanout ttl
Co-authored-by: Giovanni Petrantoni <giovanni@fragcolor.xyz>
* gossipsub is a function of subscription messages only
* graft/prune work with mesh, get filled up from gossipsub
* fix race conditions with await
* fix exception unsafety when grafting/pruning
* fix allowing up to DHi peers in mesh on incoming graft
* fix metrics in several places
* Peer resultification and defect only
* Fixing some tests
* test fixes
* Rename peer into peerid
* better result error message in identify
* further merge fixes
* don't send public key in message when not signing (information leak)
* don't run rebalance if there are peers in gossip (see #242)
* don't crash randomly on bad peer id from remote
* count published messages
* don't call `switch.dial` in `subscribeToPeer`
* add secureconn constructor
* close in the correct order
* concurent dial lock and track in/out conns better
* make tests pass
* add todo comment
* disconect peers that open too many connections
* wip
* do connection and muxer tracking in one place
* prevent nil pointer in observers
* drop connections when peers is over max
* prevent channel leaks
* don't use closure to handle channel
* Remove noise padding payload (spec removed it)
* add log scope in secure
* avoid defect array out of range in switch secure when "na"
* improve identify traces
* wip noise fixes
* noise protobuf adjustments (trying)
* add more debugging messages/traces, improve their actual contents
* re-enable ID check in noise
* bump go daemon tag version
* bump go daemon tag version
* enable noise in daemonapi
* interop testing, (both secio and noise will be tested)
* azure cache bump (p2pd)
* CI changes
- Travis: use Go 1.14
- azure-pipelines.yml: big cleanup
- Azure: bump cache keys
- build 64-bit p2pd on 32-bit Windows
- install both Mingw-w64 architectures
* noise logging fixes
* alternate testing between noise and secio
* increase timeout to avoid VM errors in CI (multistream tests)
* refactor heartbeat management in gossipsub
* remove locking within heartbeat
* refactor heartbeat management in gossipsub
* remove locking within heartbeat
Co-authored-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com>
* count published messages
* don't call `switch.dial` in `subscribeToPeer`
* don't use delegation in connection
* move connection out to own file
* don't breakout on reset
* make sure to call close on secured conn
* add lpstream tracing
* don't breackdown by conn id
* fix import
* remove unused lable
* reset connection on exception
* add additional metrics for skipped messages
* check for nil in secure.close
* Start adding some metrics to pubsub
In order to visualize it's functionality
Still WIP
* more metrics
* add per topic metrics
* finishup with requested metrics
* add a metrisServer define to start local server
* PR fixes and cleanup
This means we can use it from other protocols that inherit GossipSub. Otherwise,
a lot of internal state (heartbeat lock etc) doesn't get initialized properly.
* call write until all is written out
* wip: rework with proper half-closed
* add eof and closed handling
* wip
* close connection on chronos close
* don't use read
* make noise work again
* don't reraise just yet
* fixes after backporting
* remove on transport close cleanup
* revert back allread
* rust interop fixes
* read from stream
* inc count before closing
* rebasing master
* store incomming connections
* fix merge
* remove unneeded changes
* use internal close flag to indicate disposal
* call write until all is written out
* add comments to lpchannel fields
* add an eof flag to signal which end closed
* wip: rework with proper half-closed
* add eof and closed handling
* propagate closes to piped
* call parent close
* moving bufferstream trackers out
* move writeLock to bufferstream
* move writeLock out
* remove unused call
* wip
* rebasing master
* fix mplex tests
* wip
* fix bufferstream after backport
* wip
* rename to differentiate from chronos tracker
* close connection on chronos close
* make reset request asyncCheck
* fix channel cleanup
* misc
* don't use read
* fix backports
* make noise work again
* proper exception handling
* don't reraise just yet
* add convenience templates
* dont double wrap
* use async pragma
* fixes after backporting
* muxer owns connection
* remove on transport close cleanup
* revert back allread
* adding some todos
* read from stream
* inc count before closing
* rebasing master
* rebase master
* use correct exception type
* use try/finally insted of defer
* fix compile in trace mode
* reset channels on mplex close
* handle a few exceptions
Some of these are maybe too aggressive, but in return, they'll log
their exception - more refactoring needed to sort this out - right now
we get crashes on unhandled exceptions of unknown origin
* during connection setup
* while closing channels
* while processing pubsubs
* catch exceptions that are raised and don't try to catch exceptions that are not raised
* propagate cancellederror
* one more
* more
* more
* make interop tests less fragile
* Raise expiration time in gossipsub fanout test for slow CI
Co-authored-by: Dmitriy Ryajov <dryajov@gmail.com>
Co-authored-by: Giovanni Petrantoni <giovanni@fragcolor.xyz>
* add verify signature flag
* add sign flag to enable/disable msg signing
* moving internal tests out to their own file
* cleanup nimble file
* remove unneeded tests
* move pubsub tests out
* fix tests
* Add chronos trackers and used them to sanitize resource disposal
* Chronos trackers for transport tests wip
* No more chronos leaks in testtransport
* Make tcp transport and test more robust when closing
* Test async leaking tracking wip
* Fix a regression in wire connect
* Add chronos trackers to more tests and sanitize resource closure
* Wip fixing floodsub tests
* Floodsub wip
* Made floodsub basically deterministic, hit a nim bug with captures tho
* Wrap up floodsub tests refactor
* Wrapping up
* Add allFuturesThrowing utility
* Fix missing allFuturesThrowing in noise tests!
* Make tests green
* attempt fixing gossipsub failing cases
* Make sure to check also fanout in waitSub
* More verbose traces
* Gossipsub test improvments
* Refactor TcpTransport remove asyncCheck
* Add Connection trackers
* Add stricter connection tracking, wip mplex fix
* More asynccheck removal, in order to avoid connection leaks
* bump chronicles requirement
* Enable tracker dump to check CI output
* Wait for more futures in testmplex
* Remove tracker dump messages
* add tryAndWarn utility, fix mplex issue with go interop
* All allFuturesThrowing to directchat too
* make sure to cleanup on transport close
* Start removing allFutures
* More allfutures removal
* Complete allFutures removal except legacy and tests
* Introduce table values copies to prevent error
* Switch to allFinished
* Resolve TODOs in flood/gossip
* muxer handler, log and re-raise
* Add a common and flexible way to check multiple futures
* Make traces less verbose with shortHexDump utility
* Rename shortHexDump into shortLog
* Improve shortLog, add shortLog for crypto keys
* Add proper shortLog implementations in messages
* Fix signed varints.
Add tests for signed varints.
Remove some casts to allow usage at compile time.
* Fix vsizeof() on 32bit platforms.
* Add `hint` and `zint` types for proper signed integer encoding.
* Fix varint related bugs.
* Update requirements.
* Fix interop tests because of fixed readLine.
* Add putVarint, getVarint and tests.
* fix: don't allow replacing pubkey
* fix: several small improvements
* removing pubkey setter
* improove error handling
* remove the use of Option[T] if not needed
* don't use optional
* fix-ci: temporarily pin p2pd to a working tag
* fix example to comply with latest changes
* bumping p2pd again to a higher version
* Bugfix: Dialing an already connected peer may lead to crash
* Introduced a standard_setup module allowing to instantiate
the `Switch` object in an easier manner.
* Added `Switch.disconnect(peer)`
* Trailing space removed (sorry about polluting the diff)
* extract public and private keys fields from peerid
* allow assigning a public key
* cleaned up TODOs
* make pubsub prefix a const
* public key should be an `Option`
This adds gossipsub and floodsub, as well as basic interop testing with the go libp2p daemon.
* add close event
* wip: gossipsub
* splitting rpc message
* making message handling more consistent
* initial gossipsub implementation
* feat: nim 1.0 cleanup
* wip: gossipsub protobuf
* adding encoding/decoding of gossipsub messages
* add disconnect handler
* add proper gossipsub msg handling
* misc: cleanup for nim 1.0
* splitting floodsub and gossipsub tests
* feat: add mesh rebalansing
* test pubsub
* add mesh rebalansing tests
* testing mesh maintenance
* finishing mcache implementatin
* wip: commenting out broken tests
* wip: don't run heartbeat for now
* switchout debug for trace logging
* testing gossip peer selection algorithm
* test stream piping
* more work around message amplification
* get the peerid from message
* use timed cache as backing store
* allow setting timeout in constructor
* several changes to improve performance
* more through testing of msg amplification
* prevent gc issues
* allow piping to self and prevent deadlocks
* improove floodsub
* allow running hook on cache eviction
* prevent race conditions
* prevent race conditions and improove tests
* use hashes as cache keys
* removing useless file
* don't create a new seq
* re-enable pubsub tests
* fix imports
* reduce number of runs to speed up tests
* break out control message processing
* normalize sleeps between steps
* implement proper transport filtering
* initial interop testing
* clean up floodsub publish logic
* allow dialing without a protocol
* adding multiple reads/writes
* use protobuf varint in mplex
* don't loose conn's peerInfo
* initial interop pubsub tests
* don't duplicate connections/peers
* bring back interop tests
* wip: interop
* re-enable interop and daemon tests
* add multiple read write tests from handlers
* don't cleanup channel prematurely
* use correct channel to send/receive msgs
* adjust tests with latest changes
* include interop tests
* remove temp logging output
* fix ci
* use correct public key serialization
* additional tests for pubsub interop