Since the sync committee duties are no longer updated on every slot
and previously the sync committee aggregators selection proofs were
generated during the duties update, this now resulted in the client
using stale selection proofs (they must be generated at each slot).
The fix consists of moving the selection proof generation logic in
a different function which is properly executed on each slot.
Other changes:
* The logtrace tool has been enhanced with a framework for adding
new simpler log aggregation and analysis algorithms.
The default CI testnet simulation will now ensure that the blocks
in the network have reasonable sync committee participation.
Adds a "Slot start" log to the LC that behaves similar to BN to inform
the user that the light client is doing something, and to indicate the
latest view of the network (finalized / optimistic).
* implement several capellaImplementationMissing points
* don't register validator activity for not-active validators
* don't check validator indices already coming out of committees which exist; must be active validators, or else other deeper bugs
* Initial commit.
* NextAttestationEntry type.
* Add doppelgangerCheck and actual check.
* Recover deleted check.
* Remove NextAttestainEntry changes.
* More cleanups for NextAttestationEntry.
* Address review comments.
* Remove GENESIS_EPOCH specific check branch.
* Decrease number of full epochs for doppelganger check in VC.
Co-authored-by: zah <zahary@status.im>
The various `PeerScore` constants are used for both beacon blocks and
LC objects, and will likely also find use for EIP4844 blob sidecars.
Renaming them to use more generically applicable names not referring
to blocks explicitly aymore.
We currently use `BlockError` for both beacon blocks and LC objects.
In light of EIP4844, we will likely also use it for blob sidecars.
To avoid confusion, renaming it to a more generic `VerifierError`,
and update its documentation to be more generic.
To avoid long lines as a followup, also renaming the `block_processor`'s
`BlockProcessingCompleted.completed`->`ProcessingStatus.completed` and
`BlockProcessingCompleted.notCompleted`->`ProcessingStatus.notCompleted`
This PR removes a bunch of code to make TNS aware of era files, avoiding
a duplicated backfill when era files are available.
* reuse chaindag for loading backfill state, replacing the TNS homebrew
* fix era block iteration to skip empty slots
* add tests for `can_advance_slots`
Without this, tools like `ncli_db` require command line parameter to
process mainnet blocks, which is a regression since the merge - this is
a subset of #4147.
When the EL/Builder fails to produce an execution payload, we fall back
to an empty `ExecutionPayload`. Even though it contains no transactions
it should refer to the configured fee recipient. This is useful for
privacy reasons (do not reveal the reason for the empty payload) and for
compliance with additional fee recipient rules by staking pools.
* move duty tracking code to `ActionTracker`
* fix earlier duties overwriting later ones
* re-run subnet selection when new duty appears
* log upcoming duties as soon as they're known (vs 4 epochs before)
To further tighten Nimbus against spam, this PR introduces a global
quota for block requests (shared between peers) as well as a general
per-peer request limit that applies to all libp2p requests.
* apply request quota before decoding message
* for high-bandwidth requests (blocks), apply a shared global quota
which helps manage bandwidth for high-peer setups
* add metrics
Currently, we require genesis and a checkpoint block and state to start
from an arbitrary slot - this PR relaxes this requirement so that we can
start with a state alone.
The current trusted-node-sync algorithm works by first downloading
blocks until we find an epoch aligned non-empty slot, then downloads the
state via slot.
However, current
[proposals](https://github.com/ethereum/beacon-APIs/pull/226) for
checkpointing prefer finalized state as
the main reference - this allows more simple access control and caching
on the server side - in particular, this should help checkpoint-syncing
from sources that have a fast `finalized` state download (like infura
and teku) but are slow when accessing state via slot.
Earlier versions of Nimbus will not be able to read databases created
without a checkpoint block and genesis. In most cases, backfilling makes
the database compatible except where genesis is also missing (custom
networks).
* backfill checkpoint block from libp2p instead of checkpoint source,
when doing trusted node sync
* allow starting the client without genesis / checkpoint block
* perform epoch start slot lookahead when loading tail state, so as to
deal with the case where the epoch start slot does not have a block
* replace `--blockId` with `--state-id` in TNS command line
* when replaying, also look at the parent of the last-known-block (even
if we don't have the parent block data, we can still replay from a
"parent" state) - in particular, this clears the way for implementing
state pruning
* deprecate `--finalized-checkpoint-block` option (no longer needed)
* cap maximum number of chunks to download from peer (fixes#1620)
* drop support for requesting blocks via v1 / phase0 protocol
* tighten bounds checking of fixed-size messages
In optimistic mode, Nimbus will sync optimistically even when the
execution client is offline / not available.
An optimistic node is less secure because it has not validated block
transactions via the execution client and can thus not be used for
validation duties.
* Allow chain dag without genesis / block
This PR enables the initialization of the dag without access to blocks
or genesis state - it is a prerequisite for implementing a number of
interesting features:
* checkpoint sync without any block download
* pruning of blocks and states
* backfill checkpoint block
* Fix doppelganger protection reorders validator indices in response issue.
* Add chronos metrics endpoint to nimbus REST API.
* Doppelganger protection now works on duties not on attestations.
Improve logging for doppelganger and indices.
* Improve doppelganger and indices logging.
* Add number of validators to logs.
* Move logging dumps from `debug` to `trace` level.
The LC REST API has been merged into the ethereum/beacon-APIs specs:
- https://github.com/ethereum/beacon-APIs/pull/247
Update URLs to v1 and update REST tests. Note that REST tests do not
start with Altair, so the tested BN will return empty / error responses.
Implements the latest proposal for providing LC data via REST, as of
https://github.com/ethereum/beacon-APIs/pull/247 with a v0 suffix.
Requests:
- `/eth/v0/beacon/light_client/bootstrap/{block_root}`
- `/eth/v0/beacon/light_client/updates?start_period={start_period}&count={count}`
- `/eth/v0/beacon/light_client/finality_update`
- `/eth/v0/beacon/light_client/optimistic_update`
HTTP Server-Sent Events (SSE):
- `light_client_finality_update_v0`
- `light_client_optimistic_update_v0`
Allow config of deployment phase via config instead of attempting to
derive from genesis content (when running relevant testnets), so that
we don't have to keep maintaining the list inside the binary.
The LC REST API has been merged into the ethereum/beacon-APIs specs:
- https://github.com/ethereum/beacon-APIs/pull/247
Update URLs to v1 and update REST tests. Note that REST tests do not
start with Altair, so the tested BN will return empty / error responses.
Allow config of deployment phase via config instead of attempting to
derive from genesis content (when running relevant testnets), so that
we don't have to keep maintaining the list inside the binary.
Implements the latest proposal for providing LC data via REST, as of
https://github.com/ethereum/beacon-APIs/pull/247 with a v0 suffix.
Requests:
- `/eth/v0/beacon/light_client/bootstrap/{block_root}`
- `/eth/v0/beacon/light_client/updates?start_period={start_period}&count={count}`
- `/eth/v0/beacon/light_client/finality_update`
- `/eth/v0/beacon/light_client/optimistic_update`
HTTP Server-Sent Events (SSE):
- `light_client_finality_update_v0`
- `light_client_optimistic_update_v0`
For JSON responses, "eth-consensus-version" header is handled in
`eth2_rest_serialization` for states and `rest_beacon_api` for blocks.
Align them to also be handled in `eth2_rest_serialization` for blocks.
The `ContentNotAcceptableError` is triggered when client either requests
an unsupported media type, or has form errors such as sending multiples.
Updating the description to also indicate non-supported Accept headers.
When EL `newPayload` is slow (e.g., Raspberry Pi with Besu), the epoch
and shuffling caches tend to fill up with multiple copies per epoch when
processing gossip and performing validator duties close to wall slot.
The old strategy of evicting oldest epoch led to the same item being
evicted over and over, leading to blocking of over 5 minutes in extreme
cases where alternate epochs/shuffling got loaded repeatedly.
Changing the cache eviction strategy to least-recently-used seems to
improve the situation drastically. A simple implementation was selected
based on single linked-list without a hashtable.
When backfilling LC updates (`--light-client-data-import-mode=full`),
the highest participation update is computed without ensuring that the
finalized header is in the same period. Updates sharing same period for
both finalized and attested headers should be preferred.
Fixes a bug leading to suboptimal update selection.
* avoid database race-condition inconsistency after fcU `INVALID` then crash
* ensure head doesn't fall behind finalized; add more tests for head movement/reloading DAG
In `eth2_rest_serialization` there was a `dump` function for
`KeystoresAndSlashingProtection` that does not seem to be used.
Removes that unused function.
For pre-encoded JSON REST responses we have `jsonResponsePlain`.
Adds a `sszResponsePlain` function to serve similar purpose for SSZ.
This avoids caller having to explicitly specify Http200 and media type.
When the BN's head is reorged while shut down, reloading the BN will not
assign `BlockRef` to alternate branches. However, blocks from other
branches are still present in the database, leading to their descendants
incorrectly marked as `UnviableFork`. By restricting the check to blocks
that have been finalized, they should be reported as `MissingParent`
instead, eventually re-assigning a `BlockRef` to them.
The `eth1_monitor` check to require engine API from bellatrix onward
has issues in setups where the EL and CL are started simultaneously
because the EL may not be ready to answer requests by the time that the
check is performed. This can be observed, e.g., on Raspberry Pi 4 when
using Besu as the EL client. Now that the merge transition happened, the
check is also not that useful anymore, as users have other ways to know
that their setup is not working correctly (e.g., repeated exchange logs)
When running as Gnosis-chain binary the config was no longer adjustable.
Restores loading custom configs when running as Gnosis-chain binary,
as long as the following keys remain same:
- SLOTS_PER_EPOCH=16
- SECONDS_PER_SLOT=5
- BASE_REWARD_FACTOR=25
- EPOCHS_PER_SYNC_COMMITTEE_PERIOD=512
This allows running the Gnosis-chain binary on custom test networks.
* detect mismatch of config and binary
When loading configuration that sets keys that Nimbus bakes into the
binary at compile-time, raise an error if the config is incompatible
instead of ignoring the conflicting value.
When TTD hits before Bellatrix, avoid waiting for new blocks and detect
the TTD block as the terminal block hash even before Bellatrix hits.
Also allow detecting EL genesis block as merge transition block.
This fixes the local testnet simulation with Geth to actually merge.
Since these files may have been created in a previous run or manually,
we want to keep loading them even on nodes that don't enable the
keystore API (for example static setups)
Other changes:
* log keystore loading progressively (#3699)
* print initial fee recipient when loading validators
* log dynamic fee recipient updates
The optimistic candidate block check that only imports a new block into
the EL client if its parent block also had execution enabled is not
needed anymore, as mainnet has merged and the attack period is over.
Port changes to `nextExchangeTransitionConfiguration` from BN to LC:
- 60 seconds delay before initial exchange
- 45 seconds interval between followup exchanges
- Only exchange post Bellatrix
When the sync queue processes results for a blocks by range request,
and the requested range contained some slots that are already finalized,
`BlockError.MissingParent` currently leads to `PeerScoreBadBlocks` even
when the error occurs on a non-finalized slot in the requested range.
This patch changes the scoring in that case to `PeerScoreMissingBlocks`
for consistency with range requests solely covering non-finalized slots,
and, likewise, rewinds the sync queue to the next `rewindSlot`.
Per spec, we should not be sending our detected terminal block to EL -
the EL configuration exchange should only look at values from
configuration and report mismatches.
`p.dataProvider` may become `nil` between individual attempts to
exchange transition configuration with the EL. Harden by capturing
the data provider on function start.
Note that other functions are already hardened, or are unaffected.
Only `close` transitions `p.dataProvider` to `nil`, and `close` is
only called by the main deposits import sequence. During the deposits
import, `close` is not called, so extra checks are not needed.
When there are a lot of deposits, we decompress the public key into a
crypto cache. To avoid having those caches grow unreasonably big,
make sure to operate on the decompressed pubkey instead.
* more efficient forkchoiceUpdated usage
* await rather than asyncSpawn; ensure head update before dag.updateHead
* use action tracker rather than attached validators to check for next slot proposal; use wall slot + 1 rather than state slot + 1 to correctly check when missing blocks
* re-add two-fcU case for when newPayload not VALID
* check dynamicFeeRecipientsStore for potential proposal
* remove duplicate checks for whether next proposer
`news` has a few open issues that are not present in `nim-websock`:
1. There is a 1 second delay between each MB of sent data.
2. Cancelling an ongoing `send` makes the entire WebSocket unusable.
3. Control packets do not have priority over ongoing message frames.
Using `news`, there are quite a few of these messages in Geth:
```
Previously seen beacon client is offline. Please ensure it is
operational to follow the chain!
```
It may take quite some time to reconnect when this happens.
Using `nim-websock`, this message still occurs because `eth1_monitor`
reconnects the EL connection when no new blocks occurred for 5 minutes,
but reconnecting is quick and the message is rarer.
The sync protocol does not distinguish between:
- All requested slots are empty
- Peer does not have data available about requested range
Therefore, we treat EOF for `beacon_blocks_by_range` and for
`beacon_blocks_by_range` as valid responses, as if the entire epoch
really contained no single block for any slot. Once a followup response
provides new blocks, we detect that some blocks were missing and rewind.
During backfill, we also request the known-to-exist `backfill.slot`,
so we can actually detect whether an epoch really does not have blocks
or whether a response is incomplete (`PeerScoreNoBlocks`).
When the BN-embedded LC makes sync progress, pass the corresponding
execution block hash to the EL via `engine_forkchoiceUpdatedV1`.
This allows the EL to sync to wall slot while the chain DAG is behind.
Renamed `--light-client` to `--sync-light-client` for clarity, and
`--light-client-trusted-block-root` to `--trusted-block-root` for
consistency with `nimbus_light_client`.
Note that this does not work well in practice at this time:
- Geth sticks to the optimistic sync:
"Ignoring payload while snap syncing" (when passing the LC head)
"Forkchoice requested unknown head" (when updating to LC head)
- Nethermind syncs to LC head but does not report ancestors as VALID,
so the main forward sync is still stuck in optimistic mode:
"Pre-pivot block, ignored and returned Syncing"
To aid EL client teams in fixing those issues, having this available
as a hidden option is still useful.
When issuing an engine API call while the EL is disconnected, a `nil`
pointer is dereferenced. Fixed by correctly initializing futures.
```
Traceback (most recent call last, using override)
vendor/nim-libp2p/libp2p/protocols/pubsub/pubsub.nim(890) main
beacon_chain/nimbus_beacon_node.nim(2139) main
beacon_chain/nimbus_beacon_node.nim(0) handleStartUpCmd
beacon_chain/nimbus_beacon_node.nim(0) doRunBeaconNode
beacon_chain/nimbus_beacon_node.nim(0) start
beacon_chain/nimbus_beacon_node.nim(1589) run
vendor/nimbus-build-system/vendor/Nim/lib/system/iterators_1.nim(107) poll
vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
beacon_chain/consensus_object_pools/consensus_manager.nim(297) updateHeadWithExecution
vendor/nim-chronos/chronos/asyncmacro2.nim(213) runProposalForkchoiceUpdated
vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
beacon_chain/consensus_object_pools/consensus_manager.nim(259) runProposalForkchoiceUpdated
beacon_chain/eth1/eth1_monitor.nim(0) forkchoiceUpdated
vendor/nim-chronos/chronos/asyncfutures2.nim(219) complete
vendor/nim-chronos/chronos/asyncfutures2.nim(149) cancelled
vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(610) signalHandler
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
```
The optimistic sync spec was updated since the LC based optsync module
was introduced. It is no longer necessary to wait for the justified
checkpoint to have execution enabled; instead, any block is okay to be
optimistically imported to the EL client, as long as its parent block
has execution enabled. Complex syncing logic has been removed, and the
LC optsync module will now follow gossip directly, reducing the latency
when using this module. Note that because this is now based on gossip
instead of using sync manager / request manager, that individual blocks
may be missed. However, EL clients should recover from this by fetching
missing blocks themselves.
When the EL fails to respond to `newPayload`, e.g., because connection
to the EL got interrupted, or due to misconfiguration, optimistic blocks
cannot be imported according to spec. This condition is treated the same
as if the peer returned a block with missing parent which gets the block
out of our processing queue, but can have nasty side effects.
For example, if sync manager asks for validation of a block known to be
in the finalized range, if it receives a `MissingParent` verdict, the
peer is immediately removed from the peer pool.
```
DBG 2022-08-24 11:45:26.874+02:00 newPayload: inserting block into execution engine parentHash=e4ca7424 blockHash=36cdc198 stateRoot=cf3902c1 receiptsRoot=56e81f17 prevRandao=0b49a172 blockNumber=1518089 gasLimit=30000000 gasUsed=0 timestamp=1657980396 extraDataLen=0 baseFeePerGas=7 numTransactions=0
ERR 2022-08-24 11:45:26.875+02:00 newPayload failed msg="Transport is not initialised (missing a call to connect?)"
DBG 2022-08-24 11:45:26.875+02:00 Block pool rejected peer's response topics="syncman" request=187232:32@1475 peer=16U*MsCJdx direction=forward blocks_map=xxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxx blocks_count=31 ok=false unviable=false missing_parent=true sync_ident=main
ERR 2022-08-24 11:45:26.875+02:00 Unexpected missing parent at finalized epoch slot topics="syncman" request=187232:32@1475 peer=16U*MsCJdx direction=forward rewind_to_slot=187232 blocks_count=31 blocks_map=xxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxx sync_ident=main
DBG 2022-08-24 11:45:26.875+02:00 Peer was removed from PeerPool due to low score topics="beacnde" peer=16U*MsCJdx peer_score=-1000 score_low_limit=0 score_high_limit=1000
DBG 2022-08-24 11:45:26.875+02:00 Lost connection to peer topics="networking" peer=16U*MsCJdx connections=0
```
By delaying issuing a verdict until the EL connection is restored and
`newPayload` successfully ran, the problem should be fixed. This also
induces back pressure to the sync manager by stopping download of new
blocks (or re-downloading the same block over and over again).
* Harden block proposal against expired slashings/exits
When a message is signed in a phase0 domain, it can no longer be
validated under bellatrix due to the correct fork no longer being
available in the `BeaconState`.
To ensure that all slashing/exits are still valid, in this PR we re-run
the checks in the state that we're proposing for, thus hardening against
both signatures and other changes in the state that might have
invalidated the message.
* fix same message added multiple times
in case of attestation slashing of multiple validators in one go
* support connecting to peers without bellatrix
Make discovery fork ID aware of scheduled Bellatrix fork to enable
connections to peers that don't have Bellatrix scheduled yet.
Without this, has peering issues with peers on older SW version.
* expand tests with compatibility checks
* more exhaustive compatibility checks