When the sync queue processes results for a blocks by range request,
and the requested range contained some slots that are already finalized,
`BlockError.MissingParent` currently leads to `PeerScoreBadBlocks` even
when the error occurs on a non-finalized slot in the requested range.
This patch changes the scoring in that case to `PeerScoreMissingBlocks`
for consistency with range requests solely covering non-finalized slots,
and, likewise, rewinds the sync queue to the next `rewindSlot`.
Follows up on https://github.com/status-im/nimbus-eth2/pull/3461 which
ensured that repeated `beaconBlocksByRange` requests get shrinked to
account for potential out-of-band advancements to `safeSlot`, with
similar logic for the initial request.
When a `beaconBlocksByRange` response advances the `safeSlot`, but later
has errors, the sync queue keeps repeating that same request until it is
fulfilled without errors. Data up through `safeSlot` is considered to be
immutable, i.e., finalized, so re-requesting that data is not useful.
By advancing the sync progress in that scenario, those redundant query
portions can be avoided. Note, the finalized block _itself_ is always
requested, even in the initial request. This behaviour is kept same.
* Refactor and optimize logs.
* Introduce shortLog(SyncRequest).
* Address review comment.
* make sync queue logs more consistent
Adds a few minor logging improvements:
- Fixes a typo (`was happened` -> `has happened`)
- Avoids passing `reset_slot` argument to log statement multiple times
- Uses same `rewind_to_slot` label when logging in both sync directions
- Consistent rewind point logging
Co-authored-by: cheatfate <eugene.kabanov@status.im>
* harden and speed up block sync
The `GetBlockBy*` server implementation currently reads SSZ bytes from
database, deserializes them into a Nim object then serializes them right
back to SSZ - here, we eliminate the deser/ser steps and send the bytes
straight to the network. Unfortunately, the snappy recoding must still
be done because of differences in framing.
Also, the quota system makes one giant request for quota right before
sending all blocks - this means that a 1024 block request will be
"paused" for a long time, then all blocks will be sent at once causing a
spike in database reads which potentially will see the reading client
time out before any block is sent.
Finally, on the reading side we make several copies of blocks as they
travel through various queues - this was not noticeable before but
becomes a problem in two cases: bellatrix blocks are up to 10mb (instead
of .. 30-40kb) and when backfilling, we process a lot more of them a lot
faster.
* fix status comparisons for nodes syncing from genesis (#3327 was a bit
too hard)
* don't hit database at all for post-altair slots in GetBlock v1
requests
* Harden handling of unviable forks
In our current handling of unviable forks, we allow peers to send us
blocks that come from a different fork - this is not necessarily an
error as it can happen naturally, but it does open up the client to a
case where the same unviable fork keeps getting requested - rather than
allowing this to happen, we'll now give these peers a small negative
score - if it keeps happening, we'll disconnect them.
* keep track of unviable forks in quarantine, to avoid filling it with
known junk
* collect peer scores in single module
* descore peers when they send unviable blocks during sync
* don't give score for duplicate blocks
* increase quarantine size to a level that allows finality to happen
under optimal conditions - this helps avoid downloading the same blocks
over and over in case of an unviable fork
* increase initial score for new peers to make room for one more failure
before disconnection
* log and score invalid/unviable blocks in requestmanager too
* avoid ChainDAG dependency in quarantine
* reject gossip blocks with unviable parent
* continue processing unviable sync blocks in order to build unviable
dag
* docs
* Update beacon_chain/consensus_object_pools/block_pools_types.nim
* add unviable queue test
Time in the beacon chain is expressed relative to the genesis time -
this PR creates a `beacon_time` module that collects helpers and
utilities for dealing the time units - the new module does not deal with
actual wall time (that's remains in `beacon_clock`).
Collecting the time related stuff in one place makes it easier to find,
avoids some circular imports and allows more easily identifying the code
actually needs wall time to operate.
* move genesis-time-related functionality into `spec/beacon_time`
* avoid using `chronos.Duration` for time differences - it does not
support negative values (such as when something happens earlier than it
should)
* saturate conversions between `FAR_FUTURE_XXX`, so as to avoid
overflows
* fix delay reporting in validator client so it uses the expected
deadline of the slot, not "closest wall slot"
* simplify looping over the slots of an epoch
* `compute_start_slot_at_epoch` -> `start_slot`
* `compute_epoch_at_slot` -> `epoch`
A follow-up PR will (likely) introduce saturating arithmetic for the
time units - this is merely code moves, renames and fixing of small
bugs.
* SyncManager cleanups for backfill support
Cleanups, fixes and simplifications, in anticipation of backfill support
for the `SyncManager`:
* reformat sync progress indicator to show time left and % done more
prominently:
* old: `sync="sPssPsssss:2:2.4229:00h57m (2706898)"`
* new: `sync="14d12h31m (0.52%) 1.1378slots/s (wQQQQQDDQQ:1287520)"`
* reset average speed when going out of sync
* pass all block errors to sync manager, including duplicate/unviable
* penalize peers for reporting a head block that is outside of our
expected wall clock time (they're likely on a different network or
trying to disrupt sync)
* remove `SyncFailureKind` (unused)
* remove `inRange` (unused)
* add `Q` for sync queue requests that are in the `SyncQueue` but not
yet in the `BlockProcessor` queue
* update last slot in `SyncQueue` after getting peer status
* fix race condition between `wakeupWaiters` and `resetWait`, where
workers would not be correctly reset if block verification returned a
completed future without event loop
* log syncmanager direction
* Fix ordering issue.
Some of the requests size of which are not equal to `chunkSize` could be processed in wrong order which could lead to sync process freezes.
Co-authored-by: cheatfate <eugene.kabanov@status.im>
* move quarantine outside of chaindag
The quarantine has been part of the ChainDAG for the longest time, but
this design has a few issues:
* the function in which blocks are verified and added to the dag becomes
reentrant and therefore difficult to reason about - we're currently
using a stateful flag to work around it
* quarantined blocks bypass the processing queue leading to a processing
stampede
* the quarantine flow is unsuitable for orphaned attestations - these
should also should be quarantined eventually
Instead of processing the quarantine inside ChainDAG, this PR moves
re-queueing to `block_processor` which already is responsible for
dealing with follow-up work when a block is added to the dag
This sets the stage for keeping attestations in the quarantine as well.
Also:
* make `BlockError` `{.pure.}`
* avoid use of `ValidationResult` in block clearance (that's for gossip)
Simpler module name for stuff that covers forks
* check that runtime config matches database state
* also include some assorted altair cleanups
* use "standard" genesis fork in local testnet to work around missing
runtime config support
* bump libp2p
* altair sync v2
Use V2 sync requests after the altair fork has happened, according to
the wall clock
* Fix the behavior of the v1 req/resp calls after Altair
Co-authored-by: Zahary Karadjov <zahary@gmail.com>
* gossip_to_consensus -> block_processor (it's processing only blocks,
but not only from gossip)
* measure queue and validation time for blocks
* measure assignment and state loading times for updateStateData
* avoid some unnecessary block copies in block sync
* warn that database is corrupt if we hit tail without a state
* Introduce unittest2 and junit reports
* fix XML path
* don't combine multiple CI runs
* fixup
* public combined report also
Co-authored-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com>
When blocks and attestations arrive, they are SSZ-decoded twice: once
for validation and once for processing. This branch enqueues the decoded
block directly for processing, avoiding the second, slow
deserialization.
* move processing of blocks and attestations to queue
* ...and out from beacon_node
* split attestation processing into attestations and aggregates
* also updates metrics
* clean up logging to better follow the lifetime of gossip: arrival,
validation and processing
* drop attestations and aggregates if there are too many
* try to prioritise blocks and aggregates before single-validator
attestations
* Allow sync manager process blocks one by one.
* Log storeBlock() and updateHead() duration.
* Calculate duration only for blocks added without any error.
* Fix float compilation error.
* Fix duration.
* Fix SyncQueue tests.
Add SeenTable to avoid continuous attempts to dead peers.
Refactor onSecond.
Block backward sync while forward sync is working.
SyncManager now checks responses according corresponding requests + tests.
SyncManager now watching for not progressing local_head_slot and resets SyncQueue.
* Add ability to reset state of sync manager.
Fix bug when sync got stuck on `zero-point` reset.
Fix bug when sync got stuck when some of the workers waiting for failing one.
* Remove debugging comments and imports.
* Remove not used pendingLock.