Commit Graph

591 Commits

Etan Kissling df0ff5f0fb
fix initialization of sync committee cache after loading non-epoch state (#6160)
When initializing from a state that's not aligned to an epoch boundary,
an earlier state is loaded that's epoch aligned, and subsequently topped
up with the missing blocks. `dag.headSyncCommittee` is initialized prior
to topping up the missing blocks, though. If the sync committee changes
while applying the blocks (e.g., a sync committee period boundary hits),
the cached information becomes unlinked from `dag.head`, leading to
valid blocks based on that chain being rejected. To fix this, move the
cache initialization to after the missing blocks have been applied. This
has been observed on Goerli by initializing from slot 7919502 and
attempting to top up to slot 7920111.
The block gets rejected with an invalid state root on nodes that have
restarted after setting 7920111 as head, while it gets accepted by all
other nodes. Error message is `block: state root verification failed`.
The incorrect initialization behaviour was introduced in #4592, before
which the sync committee cache was initialized after applying blocks.
2024-04-03 23:03:06 +02:00
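
A minimal sketch of the corrected ordering, with illustrative types standing in for the real `blockchain_dag` structures:

```nim
type
  State = object
    slot: int
  Dag = object
    head: State
    headSyncCommitteePeriod: int  # stand-in for the cached data

proc applyMissingBlocks(s: var State, targetSlot: int) =
  # top up the epoch-aligned state with the missing blocks
  s.slot = targetSlot

proc initSyncCommitteeCache(dag: var Dag) =
  # derive the cache from the fully topped-up head
  dag.headSyncCommitteePeriod = dag.head.slot div 8192

var dag = Dag(head: State(slot: 7919502))
applyMissingBlocks(dag.head, 7920111)  # first, apply the blocks...
initSyncCommitteeCache(dag)            # ...then initialize the cache
```
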
tersec 7fa32b7f02
add Electra to ConsensusFork enum (#6169)
* add Electra to ConsensusFork enum

* fix gnosis check
2024-04-03 16:43:43 +02:00
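
Illustratively, the shape of the enum extension (simplified; the real `ConsensusFork` lives in the nimbus-eth2 spec modules):

```nim
type ConsensusFork {.pure.} = enum
  Phase0, Altair, Bellatrix, Capella, Deneb,
  Electra  # newly added fork

doAssert ConsensusFork.high == ConsensusFork.Electra
```
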
Etan Kissling 6a318d0f1a
avoid rejecting empty era file in verification (#6163)
`batchVerify`'s precondition is a non-empty signature list:

```nim
  if input.len == 0:
    # Spec precondition
    return false
```

This means that in eras without any blocks (as has happened on Goerli),
calling it leads to era files being reported as invalid.
2024-04-03 10:06:21 +02:00
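
A hedged sketch of the fix, with `batchVerify` reduced to a stub: an era without blocks is treated as vacuously valid instead of being handed to batch verification:

```nim
proc batchVerify(sigs: seq[string]): bool =
  if sigs.len == 0:
    return false  # spec precondition: input must be non-empty
  true            # stand-in for the actual BLS batch verification

proc verifyEraSignatures(sigs: seq[string]): bool =
  if sigs.len == 0:
    return true   # an era without blocks has nothing to verify
  batchVerify(sigs)
```
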
Etan Kissling 0000f81df0
remove unused and redundant `PayloadID` type definition (#6165)
`PayloadID` is defined in `nim-web3` and our own Bellatrix definition
can be removed.
2024-04-03 07:27:00 +02:00
Etan Kissling 2dbe24c740
move split view catchup to research branch (#6133)
Using a dedicated branch for researching the effectiveness of split view
scenario handling simplifies testing and avoids having partial work on
`unstable`. If we want, we can reintroduce it under a `--debug` flag at
a later time. But for now, Goerli is a rare opportunity to test this,
maybe just for another week or so.

- https://github.com/status-im/infra-nimbus/pull/179
2024-03-25 19:09:31 +01:00
Etan Kissling fc9bc1da3a
add branch discovery module for supporting chain stall situation (#6125)
In a split view situation, the canonical chain may only be served by a
tiny number of peers, and branches may span long durations. Minority
branches may still have a large weight from attestations and should
be discovered. To assist with that, add a branch discovery module that
assists in such a situation by specifically targeting peers with unknown
histories and downloading from them, in addition to sync manager work
which handles popular branches.
2024-03-24 08:41:47 +00:00
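
A rough sketch of the targeting idea, with simplified peer bookkeeping (the names are assumptions, not the actual module API): prefer peers whose advertised head cannot be connected to any known branch:

```nim
import std/sets

type Peer = object
  id: int
  headRoot: string  # advertised head block root

proc branchDiscoveryTargets(peers: seq[Peer],
                            knownRoots: HashSet[string]): seq[Peer] =
  # peers with unknown histories are worth downloading from directly
  for peer in peers:
    if peer.headRoot notin knownRoots:
      result.add peer
```
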
Etan Kissling 66a9304fea
use separate state when catching up to perform validator duties (#6131)
There are situations where all states in the `blockchain_dag` are
occupied and cannot be borrowed.

- headState: Many assumptions in the code that it cannot be advanced
- clearanceState: Resets every time a new block gets imported, including
  blocks from non-canonical branches
- epochRefState: Used even more frequently than clearanceState

This means that during the catch-up mechanic, where the head state is
slowly advanced to the wall clock to catch up on validator duties while
the canonical head is way behind non-canonical heads, we cannot use any
of the three existing states. In that situation,
Nimbus already consumes an increased amount of memory due to all the
`BlockRef`, fork choice states and so on, so experience is degraded.
It seems reasonable to allocate a fourth state temporarily during that
mechanic, until a new proposal could be made on the canonical chain.

Note that currently, on `unstable`, proposals _do_ happen every couple
of hours because the sync manager doesn't manage to discover additional
heads in a split-view scenario on Goerli. However, with the branch
discovery module, new blocks are discovered all the time, and the
clearanceState may no longer be borrowed, as it is reset to a different
branch too often.

The extra state could also find other uses in the future, e.g., for
incremental computations as in reindexing the database, or online
collection of historical light client data.
2024-03-24 07:18:33 +01:00
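
A simplified sketch of the approach, using hypothetical `StateRef`/`Dag` types: the fourth state is allocated lazily and advanced without touching the other three:

```nim
type
  StateRef = ref object
    slot: int
  Dag = object
    headState, clearanceState, epochRefState: StateRef
    dutyState: StateRef  # fourth state, allocated only while catching up

proc advanceForDuties(dag: var Dag, wallSlot: int) =
  if dag.dutyState.isNil:
    dag.dutyState = StateRef(slot: dag.headState.slot)  # copy of head
  while dag.dutyState.slot < wallSlot:
    inc dag.dutyState.slot  # stand-in for the per-slot state transition
```
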
Etan Kissling c4a5bca629
update block quarantine eviction order to FIFO (#6129)
Use the same eviction policy for blocks as already the case for blobs.
FIFO makes more sense, because it favors keeping ancestors of blocks
which need to be applied to the DAG before their children become eligible.
2024-03-24 06:03:51 +01:00
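
A minimal FIFO-eviction sketch (illustrative types, not the actual quarantine implementation):

```nim
import std/deques

type Quarantine = object
  order: Deque[string]  # block roots in insertion order
  capacity: int

proc initQuarantine(capacity: int): Quarantine =
  Quarantine(order: initDeque[string](), capacity: capacity)

proc add(q: var Quarantine, root: string) =
  if q.order.len >= q.capacity:
    discard q.order.popFirst()  # FIFO: evict the oldest entry first
  q.order.addLast root
```
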
Etan Kissling bedc601903
increase blob quarantine capacity to match block quarantine capacity (#6128)
Blobs are cached from gossip and other sources for all orphans, not just
those specifically tagged as `blobless`. `blobless` only means that they
are actively fetched from the network. The `MaxBlobs` should be aligned
to match `MaxOrphans`. Note that blobs are tiny compared to blocks, so
this isn't a huge memory hog.
2024-03-24 04:29:44 +01:00
Etan Kissling 33e34ee8bd
handle case of unreachable block in `is_optimistic` helper (#6124)
* handle case of unreachable block in `is_optimistic` helper

When a non-canonical block is still in the DB, it can be accessed via
`BlockId`, but `BlockRef` may be unavailable if the block was not
properly cleaned when it got orphaned. Report it as optimistic.

* `template` -> `func`
2024-03-22 22:50:21 +00:00
Etan Kissling 12a2f8c026
when adding duplicates to quarantine, schedule deepest missing parent (#6112)
During sync, sometimes the same block gets encountered and added to
quarantine multiple times. If its parent is already known, the
quarantine incorrectly registers it as missing, leading to re-downloads.
This can be fixed by recursively registering the parent's deepest
missing parent.

Also increase the stickiness of `missing`. We only perform 4 attempts
within ~16 seconds before giving up. Very frequently, this is not
enough, and there is no progress until the sync manager kicks in, even
on Holesky.
2024-03-21 18:41:05 +00:00
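
The recursion amounts to a walk up the quarantined ancestry; a sketch with a plain `Table` standing in for the quarantine:

```nim
import std/tables

proc deepestMissingParent(quarantined: Table[string, string],
                          parentRoot: string): string =
  ## `quarantined` maps a block root to its parent root
  var root = parentRoot
  while root in quarantined:
    root = quarantined[root]  # parent already held; look further back
  root  # first ancestor we do not have: schedule this download
```
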
Etan Kissling 6f466894ab
answer `RequestManager` queries from disk if possible (#6109)
When restarting the beacon node, orphaned blocks remain in the database,
but on startup only the canonical chain as selected by fork choice is
loaded.
When a new block is discovered that builds on top of an orphaned block,
the orphaned block is re-downloaded using sync/request manager, despite
it already being present on disk. Such queries can be answered locally
to improve discovery speed of alternate forks.
2024-03-21 18:37:31 +01:00
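
A sketch of the local-first lookup, assuming a simple root-to-block table in place of the real database API:

```nim
import std/[options, tables]

type BlockDb = Table[string, string]  # block root -> serialized block

proc resolveBlock(db: BlockDb, root: string): Option[string] =
  if root in db:
    return some(db[root])  # serve locally, no network round-trip
  none(string)             # fall back to sync/request manager
```
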
Etan Kissling eb5acdb7dd
make sure `clearanceState` builds on top of `headState` in chain stall (#6108)
The `clearanceState` points to the latest resolved block, regardless of
whether that block is canonical according to fork choice. If the chain
is stalled and we want to prepare for resuming validator duties, we need
a recent state according to fork choice to avoid lag spikes and missing
slot timings.
2024-03-20 16:41:56 +01:00
Etan Kissling 035ca015e6
continue validator duties if chain does not progress for a long time (#6101)
Nimbus currently stops performing validator duties if the blockchain
does not progress for `node.config.syncHorizon` slots. This means that
the chain won't recover because no new blocks are proposed. To fix that,
continue performing validator duties if no progress is registered for a
long time, and none of our peers is indicating any progress.
2024-03-20 03:23:53 +01:00
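
The resulting heuristic, as an illustrative condition (the names are assumptions, not the actual config):

```nim
proc shouldPerformDuties(slotsSinceProgress, syncHorizon: int,
                         anyPeerAdvancing: bool): bool =
  if slotsSinceProgress <= syncHorizon:
    return true          # within horizon: business as usual
  not anyPeerAdvancing   # globally stalled chain: keep proposing anyway
```
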
Etan Kissling 595d110b37
avoid blocking deep reorgs > 64 epochs (#6099)
On Goerli there are some instances of long streaks of empty epochs due
to different branches being built in parallel. They sometimes lead to
`Request for pruned historical state` logs that require a BN restart to
resolve. Avoid that by trying to restore states from the entire
non-finalized history, so that sync is not lost in such situations.
2024-03-19 14:21:25 +01:00
tersec 0a6d189161
automated consensus spec URL updating to v1.4.0 (#6074) 2024-03-14 07:26:36 +01:00
Eugene Kabanov 72c844534f
Add Keymanager API graffiti endpoints. (#6054)
* Initial commit.

* Add more tests.

* Fix API mistypes.

* Fix mistypes in tests.

* Fix one more mistype.

* Fix affected tests because of error code 401.

* Add GetGraffitiResponse object.

* Add more tests.

* Fix compilation errors.

* Recover old behavior.

* Recover old behavior.

* Fix mistype.

* Test could not know default graffiti value.

* Make VC use adopted graffiti settings.

* Make BN use adopted graffiti settings.

* Update Alltests.

* Fix test.

* Revert "Fix test."

This reverts commit c735f855d3cb9c4a1c8e8af29d3f4438d068e31f.

* Workaround {.push raises.} requirement.

* Fix comment.

* Update Alltests.
2024-03-14 03:44:00 +00:00
Etan Kissling 4e2ffca44a
use fixed max depth for `BlockRef` (#6070)
In `block_dag` there is a max depth of 100 years configured to detect
internal inconsistencies, e.g., circular references. As `BlockRef` was
changed long ago to only reflect the non-finalized chain segment, the
theoretically supported max depth can be reduced and simplified.
2024-03-13 13:01:51 +01:00
Etan Kissling 8bd8ffe2bb
align default `syncHorizon` computation logic across networks (#6066)
The `syncHorizon` describes the number of empty slots before the beacon
node considers itself to be out of sync. There are two places where we
currently set this to 50 slots, but it makes more sense to base it on
wall time, e.g., the 10 minutes that the default of 50 is derived from.
2024-03-12 21:51:18 +01:00
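
A worked sketch of deriving the horizon from wall time; the non-mainnet slot time is an example only:

```nim
const TargetHorizonSeconds = 10 * 60  # the wall time behind the old 50

func defaultSyncHorizon(secondsPerSlot: int): int =
  # round up so the horizon always covers the full target duration
  (TargetHorizonSeconds + secondsPerSlot - 1) div secondsPerSlot

doAssert defaultSyncHorizon(12) == 50   # mainnet: 12-second slots
doAssert defaultSyncHorizon(5) == 120   # e.g., a 5-second-slot network
```
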
tersec 2a13c09615
add proposer reward accounting to block transitions (#6022)
* add proposer reward accounting to block transitions

* Update beacon_chain/spec/state_transition_block.nim

Co-authored-by: Etan Kissling <etan@status.im>

---------

Co-authored-by: Etan Kissling <etan@status.im>
2024-03-04 17:00:46 +00:00
Etan Kissling d8b8aee7b7
avoid style check issue with `syncAggregate` (#6013)
Style check confuses `func syncAggregate` because it accesses some other
object's `sync_aggregate` member in the body. Rename the func to avoid this.
2024-03-02 02:54:37 +01:00
tersec fef831d92a
rm unused ForkedTrustedBeaconBlock; add some Electra overloads to consensus_object_pools; Electra BeaconBlock gossip support (#5965) 2024-02-26 06:49:12 +00:00
tersec a4f4a35845
Revert "initial Electra support skeleton" (#5955)
* Revert "initial Electra support skeleton (#5946)"

This reverts commit d09bf3b587.

* Update test_signing_node.nim
2024-02-25 19:42:44 +00:00
Etan Kissling f54fa083b4
fix EIP-7044 implementation when using batch verification (#5953)
In #5120, EIP-7044 support got added to the state transition function to
force `CAPELLA_FORK_VERSION` to be used when validating `VoluntaryExit`
messages, irrespective of their `epoch`.

In #5637, similar logic was added when batch verifying BLS signatures,
which is used during gossip validation (libp2p gossipsub, and req/resp).
However, that logic did not match the one introduced in #5120, and only
uses `CAPELLA_FORK_VERSION` when a `VoluntaryExit`'s `epoch` was set to
a value `>= CAPELLA_FORK_EPOCH`. Otherwise, `BELLATRIX_FORK_VERSION`
would still be used when validating `VoluntaryExit`, e.g., with `epoch`
set to `0`, as is the case in this Holesky block:

- https://holesky.beaconcha.in/slot/1076985#voluntary-exits

Extracting the correct logic from #5120 into a function, and reusing it
when verifying BLS signatures fixes this issue, and also leverages the
exhaustive EF test suite that covers the (correct) #5120 logic.

This fix only affects networks that have EIP-7044 applied (post-Deneb).

Without the fix, Deneb blocks with a `VoluntaryExit` with `epoch` set to
`< CAPELLA_FORK_EPOCH` incorrectly fail to validate despite being valid.

Incorrect blocks that contain a malicious `VoluntaryExit` with `epoch`
set to `< CAPELLA_FORK_EPOCH` and signed using `BELLATRIX_FORK_VERSION`
_would_ pass the BLS verification stage, but subsequently fail the state
transition logic. Such blocks would still correctly be labeled invalid.
2024-02-25 15:25:26 +01:00
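
A condensed sketch of the extracted rule, with placeholder constants instead of the real network configuration:

```nim
type ForkVersion = array[4, byte]

const  # placeholder values, not the real network constants
  BELLATRIX_FORK_VERSION: ForkVersion = [2'u8, 0, 0, 0]
  CAPELLA_FORK_VERSION: ForkVersion = [3'u8, 0, 0, 0]
  CAPELLA_FORK_EPOCH = 1000
  DENEB_FORK_EPOCH = 2000

func forkVersionAtEpoch(epoch: int): ForkVersion =
  if epoch >= CAPELLA_FORK_EPOCH: CAPELLA_FORK_VERSION
  else: BELLATRIX_FORK_VERSION

func voluntaryExitSigningFork(exitEpoch, currentEpoch: int): ForkVersion =
  if currentEpoch >= DENEB_FORK_EPOCH:
    CAPELLA_FORK_VERSION           # EIP-7044: fixed, ignores exit.epoch
  else:
    forkVersionAtEpoch(exitEpoch)  # pre-Deneb: the exit epoch's fork

doAssert voluntaryExitSigningFork(0, 2500) == CAPELLA_FORK_VERSION
```
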
tersec d09bf3b587
initial Electra support skeleton (#5946) 2024-02-24 13:44:15 +00:00
tersec 0f155ebf95
some consensus spec v1.4.0-beta.7 spec URL updates (#5945) 2024-02-22 02:42:57 +00:00
tersec c73d7c6f6f
automated consensus spec URL updating to v1.4.0-beta.7 (#5942) 2024-02-21 19:44:48 +00:00
Etan Kissling 88045a91cd
rename new timing metrics, as `_total` suffix is implicit (#5917)
* track latest duration instead of total in new timing metrics

Change `db_checkpoint_seconds` and `state_replay_seconds` metrics to
record the latest duration instead of the total. `nim-metrics` already
synthesizes a `_total` metric from these implicitly.

* still have to use `inc`; metrics only synthesizes the name, not the sum

* prefix with `beacon_dag`
2024-02-20 20:34:41 +01:00
Jacek Sieka 8d465a7d8c
vmon: Missed block metric (#5913)
Validator monitoring gained two new metrics for tracking whether blocks
are included on the head chain or not.

Similar to attestations, if the block is produced in epoch N, the
report uses the state at the transition to epoch N+2 (so as to
reasonably stabilise block inclusion in the face of reorgs).
2024-02-20 06:40:18 +02:00
Etan Kissling 92197ce690
add metric for database checkpoint duration (#5897)
Database checkpointing can take seconds, e.g., while Geth is syncing.
Add a debug log + metric for it, and also an info log if it takes
longer than 250 ms, same as for the existing `State replayed` log. If
the log shows up for a user while the system is not overloaded, it may
point to slow disk speed or a thermal issue.
2024-02-19 11:00:11 +01:00
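
A sketch of the timing wrapper, assuming the 250 ms threshold described above; the metric update and logger are stand-ins:

```nim
import std/[monotimes, times]

proc checkpointWithTiming(doCheckpoint: proc ()) =
  let start = getMonoTime()
  doCheckpoint()
  let elapsedMs = (getMonoTime() - start).inMilliseconds
  # a metric update would go here; log when suspiciously slow
  if elapsedMs > 250:
    echo "Database checkpointed slowly: ", elapsedMs, " ms"
```
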
Jacek Sieka afdfe302f3
state loading optimizations (#5881)
* compute post-merge randao mix without loading state
* avoid copying state on shuffling computation and compute epochref
* speed up state copy for block production
2024-02-12 15:58:55 +01:00
tersec a4680cb7fa
refactor addHeadBlock() to research/ and tests/ helper (#5874)
* refactor addHeadBlock() to research/ and tests/ helper

* rm now-dead code
2024-02-09 23:46:51 +00:00
Etan Kissling 9593ef74b8
do not cache zero block hash if block unavailable (#5865)
With checkpoint sync, the checkpoint block is typically unavailable at
the start and only backfilled later. To avoid treating it as having a
zero hash (which would leave execution disabled in some contexts), wrap
the result of `loadExecutionBlockHash` in `Opt` and handle the block
hash being unknown.

---------

Co-authored-by: Jacek Sieka <jacek@status.im>
2024-02-09 22:10:38 +00:00
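
A sketch of the wrapping, here using `std/options` as a stand-in for the `Opt` type actually used in nimbus-eth2:

```nim
import std/options

type Eth2Digest = array[32, byte]

proc loadExecutionBlockHash(blockAvailable: bool,
                            storedHash: Eth2Digest): Option[Eth2Digest] =
  if not blockAvailable:
    return none(Eth2Digest)  # e.g., checkpoint block not yet backfilled
  some(storedHash)  # known hash; all-zero is valid for pre-merge blocks
```
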
Etan Kissling 7c53841cd8
Revert "Revert "fix checkpoint block potentially not getting backfilled into DB (#5863)" (#5871)" (#5875)
This reverts commit 1575478b72.
2024-02-09 20:44:54 +01:00
Etan Kissling f2d92729a2
reduce verbosity of `Got request for pre-backfill slot` (#5876)
When syncing, we log a notice each time someone asks us for a block that
we haven't backfilled yet. This is quite verbose and not unexpected,
because the status message does not allow indicating backfill progress.
2024-02-09 20:32:31 +01:00
tersec 1575478b72
Revert "fix checkpoint block potentially not getting backfilled into DB (#5863)" (#5871)
This reverts commit 65e6f892de.
2024-02-09 12:49:07 +00:00
Etan Kissling 65e6f892de
fix checkpoint block potentially not getting backfilled into DB (#5863)
When using checkpoint sync, only the checkpoint state is available; the
checkpoint block is not downloaded and is only backfilled later.

`dag.backfill` tracks latest filled `slot`, and latest `parent_root` for
which no block has been synced yet.

In checkpoint sync, this assumption is broken: the starting
`dag.backfill.slot` is set based on the checkpoint state's slot, and the
corresponding block is not available.

However, the sync manager in backward mode also requests
`dag.backfill.slot`, and `block_clearance` then backfills the checkpoint
block once it is synced. But there is no guarantee that a peer ever
sends us that block. They could send us all parent blocks and omit
solely the checkpoint block itself. In that situation, we would accept
the parent blocks, advance `dag.backfill`, and subsequently never
request the checkpoint block again, resulting in a gap in the block DB
that is never filled.

To mitigate that, the assumption is restored that `dag.backfill.slot`
is the latest filled `slot`, and `dag.backfill.parent_root` is the next
block that needs to be synced. By setting `slot` to `tail.slot + 1` and
`parent_root` to `tail.root`, we put a fake summary into `dag.backfill`
so that `block_clearance` only proceeds once the checkpoint block exists.
2024-02-09 11:20:36 +01:00
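
The fake summary then looks roughly like this (simplified types):

```nim
type BeaconBlockSummary = object
  slot: uint64
  parentRoot: string

proc initBackfill(tailSlot: uint64, tailRoot: string): BeaconBlockSummary =
  BeaconBlockSummary(
    slot: tailSlot + 1,    # restores the "latest filled slot" invariant
    parentRoot: tailRoot)  # next needed block: the checkpoint block
```
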
Etan Kissling 4266e16835
allow `getBlockIdAtSlot` to answer queries from available states (#5869)
After checkpoint sync, historical block IDs cannot yet be queried.
However, they are needed to compute dependent roots of `ShufflingRef`.
To allow lookup, enable `getBlockIdAtSlot` to answer from compatible
states in memory; as long as they descend from the finalized checkpoint
and the requested slot is sufficiently recent, `block_roots` contains
everything to recover `BlockSlotId` up to `SLOTS_PER_HISTORICAL_ROOT`.
This is similar to how `attester_dependent_root` etc. are computed.

This accelerates the first couple minutes of checkpoint sync on Mainnet,
especially the time until finality advances past the synced checkpoint.
2024-02-09 11:13:00 +01:00
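
A sketch of the state-based lookup, mirroring the spec's `get_block_root_at_slot` (constant from the mainnet preset):

```nim
const SLOTS_PER_HISTORICAL_ROOT = 8192  # mainnet preset

proc blockRootAtSlot(stateSlot: int,
                     blockRoots: array[SLOTS_PER_HISTORICAL_ROOT, string],
                     slot: int): string =
  # only valid for sufficiently recent slots
  doAssert slot < stateSlot and
    stateSlot <= slot + SLOTS_PER_HISTORICAL_ROOT
  blockRoots[slot mod SLOTS_PER_HISTORICAL_ROOT]
```
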
Etan Kissling e398078abc
`...ExecutionPayloadHash` --> `...ExecutionBlockHash` (#5864)
Finish the rename started in #4809 to have a consistent naming.
`ExecutionPayloadHash` suggests hash over payload instead of block.
`BlockHash` is also the canonical name in engine API.
2024-02-08 01:24:55 +01:00
Etan Kissling 94ba0a9bd1
consider block availability when initializing LC data collector (#5860)
When using checkpoint sync, the initial block is missing from the DB.
Update the LC data collector initialization to account for that,
avoiding a spurious error message when the block is incorrectly accessed:

```
ERR 2024-02-07 11:21:55.416+01:00 Block failed to load unexpectedly          topics="chaindag_lc" bid=d30517a7:8257504 tail=8257504
```

Also fixes a regression from #5691 that resulted in similar messages
while importing the first few blocks after checkpoint sync.

Thanks to @arnetheduck for reporting this.
2024-02-07 18:03:19 +00:00
Etan Kissling b7026a683a
avoid marking blocks as unviable if `blobless` quarantine is full (#5858)
Full caches should not be used to mark blocks as unviable. The unviable
status is quite persistent and a block marked as such won't be processed
again once the cache empties. The problem was originally introduced in
#4808.
2024-02-07 13:38:20 +00:00
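
Illustratively, the distinction between transient and persistent rejection reasons (names are hypothetical):

```nim
type Verdict = enum
  vAccept, vRetryLater, vUnviable

proc classifyOrphan(parentKnownInvalid, quarantineFull: bool): Verdict =
  if parentKnownInvalid:
    vUnviable     # persistent: the block can never become valid
  elif quarantineFull:
    vRetryLater   # transient: drop for now, it may arrive again later
  else:
    vAccept
```
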
Jacek Sieka 6328c77778
raises for gossip (#5808)
* raises for gossip

* fix light client
2024-01-22 17:34:54 +01:00
tersec 6c53dc1e11
automated consensus spec URL updating to v1.4.0-beta.6 (#5804) 2024-01-20 11:19:47 +00:00
Eugene Kabanov 3648df7d4c
Fix VC not always being able to obtain the feeRecipient value. (#5781)
Use the state's validator value to obtain the feeRecipient value.
Make the feeRecipient and gasLimit calculation equal for BN and VC.
2024-01-19 14:36:04 +00:00
Etan Kissling be73ce2e9a
import finalized head LC bootstrap on launch (#5775)
If the initial state replays cover the finalized head, import the
matching `LightClientBootstrap` into the database.

This also addresses the following error, seen when a light client
requests bootstrap from the genesis slot on networks that launch with
Altair enabled:

```
{"lvl":"DBG","ts":"2023-10-04 11:17:49.665+00:00","msg":"LC bootstrap unavailable: Sync committee branch not cached","topics":"chaindag_lc","slot":0}
```
2024-01-18 22:51:26 +00:00
tersec cf1bec7670
update some deprecated stew/results to results imports (#5743) 2024-01-16 22:37:14 +00:00
tersec 69af8f943e
implement blob_sidecar Beacon API streaming (#5728) 2024-01-13 11:52:13 +02:00
Jacek Sieka 62cbdeefc5
verify `genesis_time` more strictly (fixes #1667) (#5694)
Bogus values lead to crashes down the line when timers overflow
2024-01-06 15:26:56 +01:00
Etan Kissling 7db95f047b
track latest `LightClientUpdate` only once fork choice selects it (#5691)
Instead of tracking the latest `LightClientUpdate` across all branches,
track the latest one on the current branch as selected by fork choice.
2024-01-03 23:36:05 +01:00
Etan Kissling 030226148d
rename `exit_pool` > `validator_change_pool` (#5679)
The `ExitPool` was renamed to `ValidatorChangePool` with Capella, but
the files were still using the previous name. Rename for consistency.
2023-12-23 06:55:47 +01:00