When initializing from a state that's not aligned to an epoch boundary,
an earlier state is loaded that's epoch aligned, and subsequently topped
up with the missing blocks. `dag.headSyncCommittee` is initialized prior
to topping up the missing blocks, though. If the sync committee changes
while applying the blocks (e.g., a sync committee period boundary hits),
the cached information becomes unlinked from `dag.head`, leading to
valid blocks based on that chain being rejected. To fix this, move cache
initialization after the top up with blocks. This has been observed on
Goerli by initializing from 7919502 and attempting to top up 7920111.
The block gets rejected with an invalid state root on nodes that have
restarted after setting 7920111 as head, while it gets accepted by all
other nodes. Error message is `block: state root verification failed`.
The incorrect initialization behaviour was introduced in #4592, before
which the sync committee cache was initialized after applying blocks.
`batchVerify`'s precondition is a non-empty signature list:
```nim
if input.len == 0:
# Spec precondition
return false
```
This means that in eras without any blocks (as has happened on Goerli),
calling it leads to era files being reported as invalid.
Using a dedicated branch for researching the effectiveness of split view
scenario handling simplifies testing and avoids having partial work on
`unstable`. If we want, we can reintroduce it under a `--debug` flag at
a later time. But for now, Goerli is a rare opoprtunity to test this,
maybe just for another week or so.
- https://github.com/status-im/infra-nimbus/pull/179
In split view situation, the canonical chain may only be served by a
tiny amount of peers, and branches may span long durations. Minority
branches may still have a large weight from attestations and should
be discovered. To assist with that, add a branch discovery module that
assists in such a situation by specifically targeting peers with unknown
histories and downloading from them, in addition to sync manager work
which handles popular branches.
There are situations where all states in the `blockchain_dag` are
occupied and cannot be borrowed.
- headState: Many assumptions in the code that it cannot be advanced
- clearanceState: Resets every time a new block gets imported, including
blocks from non-canonical branches
- epochRefState: Used even more frequently than clearanceState
This means that during the catch-up mechanic where the head state is
slowly advanced to wall clock to catch up on validator duties in the
situation where the canonical head is way behind non-canonical heads,
we cannot use any of the three existing states. In that situation,
Nimbus already consumes an increased amount of memory due to all the
`BlockRef`, fork choice states and so on, so experience is degraded.
It seems reasonable to allocate a fourth state temporarily during that
mechanic, until a new proposal could be made on the canonical chain.
Note that currently, on `unstable`, proposals _do_ happen every couple
hours because sync manager doesn't manage to discover additional heads
in a split-view scenario on Goerli. However, with the branch discovery
module, new blocks are discovered all the time, and the clearanceState
may no longer be borrowed as it is reset to different branch too often.
The extra state could also find other uses in the future, e.g., for
incremental computations as in reindexing the database, or online
collection of historical light client data.
Use the same eviction policy for blocks as already the case for blobs.
FIFO makes more sense, because it favors keeping ancestors of blocks
which need to be applied to the DAG before their children get eligible.
Blobs are cached from gossip and other sources for all orphans, not just
those specifically tagged as `blobless`. `blobless` only means that they
are actively fetched from the network. The `MaxBlobs` should be aligned
to match `MaxOrphans`. Note that blobs are tiny compared to blocks, so
this isn't a huge memory hog.
* handle case of unreachable block in `is_optimstic` helper
When a non-canonical block is still in the DB, it can be accessed via
`BlockId`, but `BlockRef` may be unavailable if the block was not
properly cleaned when it got orphaned. Report it as optimistic.
* `template` -> `func`
During sync, sometimes the same block gets encountered and added to
quarantine multiple times. If its parent is already known, quarantine
incorrectly registers it as missing, leading to re-download. This can
be fixed by registering the parent's deepest missing parent recursively.
Also increase the stickiness of `missing`. We only perform 4 attempts
within ~16 seconds before giving up. Very frequently, this is not enough
and there is no progress until sync manager kicks in even on holesky.
When restarting beacon node, orphaned blocks remain in the database but
on startup, only the canonical chain as selected by fork choice loads.
When a new block is discovered that builds on top of an orphaned block,
the orphaned block is re-downloaded using sync/request manager, despite
it already being present on disk. Such queries can be answered locally
to improve discovery speed of alternate forks.
The `clearanceState` points to the latest resolved block, regardless of
whether that block is canonical according to fork choice. If chain is
stalled and we want to prepare for resuming validator duties, we need
a recent state according to fork choice to avoid lag spikes and missing
slot timings.
Nimbus currently stops performing validator duties if the blockchain
does not progress for `node.config.syncHorizon` slots. This means that
the chain won't recover because no new blocks are proposed. To fix that,
continue performing validator duties if no progress is registered for a
long time, and none of our peers is indicating any progress.
On Goerli there are some instances of long streaks of empty epochs due
to different branches being built in parallel. They sometimes lead to
`Request for pruned historical state` logs requiring a BN restart to
resolve. Avoid that by trying to restore states from the entire non-
finalized history, to avoid losing sync in such situtions.
In `block_dag` there is a max depth of 100 years configured to detect
internal inconsistencies, e.g., circular references. As `BlockRef` was
changed long ago to only reflect the non-finalized chain segment, the
theoretically supported max depth can be reduced and simplified.
The `syncHorizon` describes the number of empty slots before the beacon
node considers itself to be out of sync. There are two places where we
currently set this to 50 slots, but it makes more sense to base it on
wall time, e.g., the 10 minutes that the default 50 are derived from.
In #5120, EIP-7044 support got added to the state transition function to
force `CAPELLA_FORK_VERSION` to be used when validiting `VoluntaryExit`
messages, irrespective of their `epoch`.
In #5637, similar logic was added when batch verifying BLS signatures,
which is used during gossip validation (libp2p gossipsub, and req/resp).
However, that logic did not match the one introduced in #5120, and only
uses `CAPELLA_FORK_VERSION` when a `VoluntaryExit`'s `epoch` was set to
a value `>= CAPELLA_FORK_EPOCH`. Otherwise, `BELLATRIX_FORK_VERSION`
would still be used when validating `VoluntaryExit`, e.g., with `epoch`
set to `0`, as is the case in this Holesky block:
- https://holesky.beaconcha.in/slot/1076985#voluntary-exits
Extracting the correct logic from #5120 into a function, and reusing it
when verifying BLS signatures fixes this issue, and also leverages the
exhaustive EF test suite that covers the (correct) #5120 logic.
This fix only affects networks that have EIP-7044 applied (post-Deneb).
Without the fix, Deneb blocks with a `VoluntaryExit` with `epoch` set to
`< CAPELLA_FORK_EPOCH` incorrectly fail to validate despite being valid.
Incorrect blocks that contain a malicious `VoluntaryExit` with `epoch`
set to `< CAPELLA_FORK_EPOCH` and signed using `BELLATRIX_FORK_VERSION`
_would_ pass the BLS verification stage, but subsequently fail the state
transition logic. Such blocks would still correctly be labeled invalid.
* track latest duration instead of total in new timing metrics
Change `db_checkpoint_seconds` and `state_replay_seconds` metrics to
record the latest duration instead of the total. `nim-metrics` already
synthesizes a `_total` metric from these implicitly.
* still have to use inc, metrics only synthesizes the name not the sum
* prefix with `beacon_dag`
Validator monitoring gained 2 new metrics for tracking when blocks are
included or not on the head chain.
Similar to attestations, if the block is produced in epoch N, reporting
will use the state when switching to epoch N+2 to do the reporting (so
as to reasonably stabilise the block inclusion in the face of reorgs).
Database checkpointing can take seconds, e.g., while Geth is syncing.
Add a debug log + metric for it, and also info log if it takes longer
than 250ms, same as for the existing `State replayed` log. If the log
shows up for a user while the system is not overloaded, it may point
to slow disk speed or thermal issue.
* compute post-merge randao mix without loading state
* avoid copying state on shuffling computation and compute epochref
* speed up state copy for block production
With checkpoint sync, the checkpoint block is typically unavailable at
the start, and only backfilled later. To avoid treating it as having
zero hash, execution disabled in some contexts, wrap the result of
`loadExecutionBlockHash` in `Opt` and handle block hash being unknown.
---------
Co-authored-by: Jacek Sieka <jacek@status.im>
When syncing, we log a notice each time someone asks us for a block that
we haven't backfilled yet. This is quite verbose and not unexpected,
because the status message does not allow indicating backfill progress.
When using checkpoint sync, only checkpoint state is available, block is
not downloaded and backfilled later.
`dag.backfill` tracks latest filled `slot`, and latest `parent_root` for
which no block has been synced yet.
In checkpoint sync, this assumption is broken, because there, the start
`dag.backfill.slot` is set based on checkpoint state slot, and the block
is also not available.
However, sync manager in backward mode also requests `dag.backfill.slot`
and `block_clearance` then backfills the checkpoint block once it is
synced. But, there is no guarantee that a peer ever sends us that block.
They could send us all parent blocks and solely omit the checkpoint
block itself. In that situation, we would accept the parent blocks and
advance `dag.backfill`, and subsequently never request the checkpoint
block again, resulting in gap inside blocks DB that is never filled.
To mitigate that, the assumption is restored that `dag.backfill.slot`
is the latest filled `slot`, and `dag.backfill.parent_root` is the next
block that needs to be synced. By setting `slot` to `tail.slot + 1` and
`parent_root` to `tail.root`, we put a fake summary into `dag.backfill`
so that `block_clearance` only proceeds once checkpoint block exists.
After checkpoint sync, historical block IDs cannot yet be queried.
However, they are needed to compute dependent roots of `ShufflingRef`.
To allow lookup, enable `getBlockIdAtSlot` to answer from compatible
states in memory; as long as they descend from the finalized checkpoint
and the requested slot is sufficiently recent, `block_roots` contains
everything to recover `BlockSlotId` up to `SLOTS_PER_HISTORICAL_ROOT`.
This is similar to how `attester_dependent_root` etc. are computed.
This accelerates the first couple minutes of checkpoint sync on Mainnet,
especially the time until finality advances past the synced checkpoint.
Finish the rename started in #4809 to have a consistent naming.
`ExecutionPayloadHash` suggests hash over payload instead of block.
`BlockHash` is also the canonical name in engine API.
When using checkpoint sync, the initial block is missing in the DB.
Update the LC data collector initialization to account for that,
avoiding a spurious error message when it is incorrectly accessed:
```
ERR 2024-02-07 11:21:55.416+01:00 Block failed to load unexpectedly topics="chaindag_lc" bid=d30517a7:8257504 tail=8257504
```
Also fixes a regression from #5691 that resulted in similar messages
while importing the first few blocks after checkpoint sync.
Thanks to @arnetheduck for reporting this.
Full caches should not be used to mark blocks as unviable. The unviable
status is quite persistent and a block marked as such won't be processed
again once the cache empties. Problem originally introduced in #4808.
If the initial state replays cover the finalized head, import matching
`LightClientBootstrap` into database.
This also addresses this error when light client requests bootstrap from
the genesis slot on networks that launch with Altair enabled.
```
{"lvl":"DBG","ts":"2023-10-04 11:17:49.665+00:00","msg":"LC bootstrap unavailable: Sync committee branch not cached","topics":"chaindag_lc","slot":0}
```