During sync, sometimes the same block gets encountered and added to
quarantine multiple times. If its parent is already known, quarantine
incorrectly registers it as missing, leading to re-download. This can
be fixed by registering the parent's deepest missing parent recursively.
Also increase the stickiness of `missing`. We only perform 4 attempts
within ~16 seconds before giving up. Very frequently, this is not enough
and there is no progress until sync manager kicks in even on holesky.
#6087 introduced a subtle change to `nim-web3` resulting in `Gwei` to be
serialized differently than before. Using a `distinct` type for `Gwei`
improves type safety and avoids such problems in the future.
Fix the `/eth/v1/beacon/deposit_snapshot` API to produce proper EIP-4881
compatible `DepositTreeSnapshot` responses. The endpoint used to expose
a Nimbus-specific database internal format.
Also fix trusted node sync to consume properly formatted EIP-4881 data
with `--with-deposit-snapshot`, and `--finalized-deposit-tree-snapshot`
beacon node launch option to use the EIP-4881 data. Further ensure that
`ncli_testnet` produces EIP-4881 formatted data for interoperability.
EIP-4881 was never correctly implemented, the `DepositTreeSnapshot`
structure has nothing to do with its actual definition. Reflect that
by renaming the type to a Nimbus-specific `DepositContractSnapshot`,
so that an actual EIP-4881 implementation can use the correct names.
- https://eips.ethereum.org/EIPS/eip-4881#specification
Notably, `DepositTreeSnapshot` contains a compressed sequence in
`finalized`, only containing the minimally required intermediate roots.
That also explains the incorrect REST response reported in #5508.
The non-canonical representation was introduced in #4303 and is also
persisted in the database. We'll have to maintain it for a while.
Annotate the `research` and `test` files for which no further changes
are needed to successfully compile them, to not interfere with periodic
tasks such as spec reference bumps.
`scanf` apparently has both a `bool` return as well as raising random
exceptions depending on what functions get called by the `macro`.
To make this explicit, catch the `ValueError` from the generated
`parseInt` call, to separate `scanf` behaviour from the actual SSZ
test logic. In the end, it mostly doesn't matter as there are some
`doAssert wasMatched` on the next line (not everywhere though).
But it still makes the `scanf` internals explicit, so is clearer.
Add `{.raises.}` annotations to `tests` files where needed to enable
`{.push raises: [].}`. Avoids interfering with periodic changes such as
spec version bumps, and avoids special casing folders when editing.
The effort to maintain `{.raises.}` is trivial after the initial round.
`test_peer_pool` is a bit different from the other tests as it uses
`closureScope` which doesn't play well with `{.push raises: [].}`.
Define an overload instead that allows passing `{.raises.}` to the
`template`. This allows using `unittest2`'s exception handler without
having to refactor the test.
`stderr.write` may fail, e.g., if no tty is connected, which may happen
in some CI configurations. Discard such failures and continue quitting
instead of raising the error.
This PR allows sharing the pubkey data between validators by using a
thread-local cache for pubkey data, netting about a 400mb mem usage
reduction on holesky due to us keeping 3 permanent + several ephemeral
state copies in memory at all times and each state copy holding a full
validator.
The PR also introduces a hash cache for the key which gives ~14% speedup
for a full state `hash_tree_root` - the key makes up for a large part of
the `Validator` htr time.
Finally, the time it takes to copy a state goes down as well from ~80m
ms to ~60, for reasons similar to htr.
We use a `ptr` even if a `ref` could in theory have been used - there is
not much practical benefit to a `ref` (given it's mutable) while a `ptr`
is cheaper and easier to copy (when copying temporary states).
We could go further and cache a cooked pubkey but it turns out this is
quite intrusive - in all the relevant places, we're already using a
cooked key from the immutable validator data so there are no immediate
performance gains of doing so while managing the compressed -> cooked
key mapping would become more difficult - something for a future PR
perhaps.
Co-authored-by: Etan Kissling <etan@status.im>
* compute post-merge randao mix without loading state
* avoid copying state on shuffling computation and compute epochref
* speed up state copy for block production
When using checkpoint sync, only checkpoint state is available, block is
not downloaded and backfilled later.
`dag.backfill` tracks latest filled `slot`, and latest `parent_root` for
which no block has been synced yet.
In checkpoint sync, this assumption is broken, because there, the start
`dag.backfill.slot` is set based on checkpoint state slot, and the block
is also not available.
However, sync manager in backward mode also requests `dag.backfill.slot`
and `block_clearance` then backfills the checkpoint block once it is
synced. But, there is no guarantee that a peer ever sends us that block.
They could send us all parent blocks and solely omit the checkpoint
block itself. In that situation, we would accept the parent blocks and
advance `dag.backfill`, and subsequently never request the checkpoint
block again, resulting in gap inside blocks DB that is never filled.
To mitigate that, the assumption is restored that `dag.backfill.slot`
is the latest filled `slot`, and `dag.backfill.parent_root` is the next
block that needs to be synced. By setting `slot` to `tail.slot + 1` and
`parent_root` to `tail.root`, we put a fake summary into `dag.backfill`
so that `block_clearance` only proceeds once checkpoint block exists.
After checkpoint sync, historical block IDs cannot yet be queried.
However, they are needed to compute dependent roots of `ShufflingRef`.
To allow lookup, enable `getBlockIdAtSlot` to answer from compatible
states in memory; as long as they descend from the finalized checkpoint
and the requested slot is sufficiently recent, `block_roots` contains
everything to recover `BlockSlotId` up to `SLOTS_PER_HISTORICAL_ROOT`.
This is similar to how `attester_dependent_root` etc. are computed.
This accelerates the first couple minutes of checkpoint sync on Mainnet,
especially the time until finality advances past the synced checkpoint.
Finish the rename started in #4809 to have a consistent naming.
`ExecutionPayloadHash` suggests hash over payload instead of block.
`BlockHash` is also the canonical name in engine API.
Avoid marking blocks invalid when corresponding `blobSidecarsByRange`
returns an incomplete / incorrect response while syncing. The block
itself may still be valid in that scenario.
This PR causes a few new warnings to appear - these are harmless but
will need addressing separately as they span several libraries.
* new asyncraises syntax
* asyncraises support in several modules
* `sink` usage reduces spurious copying
* `?` compatiblity for `async` + `results`
* remove `-d:chronosStrictException` (obsolete)
This requires all object types to be explicitly white-listed for
default serialization. The PR makes the minimal changes, although
a number of similar mechanisms in eth2_rest_serialization can now
be removed.
When the BN exits after writing new `head` to database, but before
completing the `updateFinalizedBlocks` call, the database is slightly
inconsistent due to the partial write. We currently fail to start up
after that. Fix that by catching up on partial `updateFinalizedBlocks`
tasks on start up, and add a test for this edge case.
This PR brings down the time to send 100 attestations from ~1s to
~100ms, making it feasible to run 10k validators on a single node (which
regularly send 300 attestations / slot).
This is done by batching the slashing protection database write in a
single transaction thus avoiding a slow fsync for every signature -
effects will be more pronounced on slow drives.
The benefit applies both to beacon and client validators.
* Initial commit.
* Fix issues and tests.
* Fix test compilation issue.
* Update AllTests.
* Change the most poor score name from <lowest> to <bad>.
Split sync committee message score in range, so lexicographic scores will not intersect with normal one.
Lexicographic scores should be below to normal scores.
* Address review comments.
Fix aggregated attestation scoring to use MAX_VALIDATORS_PER_COMMITTEE.
Fix sync committee contributions to use SYNC_SUBCOMMITTEE_SIZE.
Add getUniqueVotes test vectors.
* Post-rebase fixes.
* Address review comments.
* Return back score calculation based on actual bits length.
* AllTests modification.
Gnosis uses `MIN_EPOCHS_FOR_BLOCK_REQUESTS` = 33024, but the computed
safe minimum (that Nimbus was using) is 2304. Relax the compatibility
check to allow `MIN_EPOCHS_FOR_BLOCK_REQUESTS` above the safe minimum
and honor `config.yaml` preferences for `MIN_EPOCHS_FOR_BLOCK_REQUESTS`.
* ShufflingRef approach to next-epoch validator duty calculation/prediction
* refactor action_tracker.updateActions to take ShufflingRef + beacon_proposers; refactor maybeUpdateActionTrackerNextEpoch to be separate and reused function; add actual fallback logic
* document one possible set of conditions
* check epoch participation flags and inactivity scores to ensure no penalties and MAX_EFFECTIVE_BALANCE to ensure rewards don't matter
* correctly (un)shuffle each proposer index
* remove debugging assertion
The templates for `BeaconBlock`, `BeaconBlockBody` and `BeaconState`
are the only ones using a `macro` mechanism for code generation.
This prevents using the dot-syntax style `consensusFork.BeaconFoo`
in some situations, and also tends to trigger naming conflicts,
requiring the `Type` suffix. Furthermore, the `macro` only works
for types that are re-defined in every single `ConsensusFork`.
Replacing with the simpler but more verbose approach used for other
types for consistency and to avoid the downsides of the `macro`.
Furthermore, simplify `test_fixture_sanity_blocks` to use `forks` sugar.
Directly initialize `ForkedLightClientObj` instead of separately first
setting the `kind` (initializing everything to zero) and then assigning
the forky data after that.
`Eth2NetworkMetadata` has an `incompatible` case to hold an error string
in case the loaded file is not compatible with the compile-time config.
The same can be modeled with a `Result[Eth2NetworkMetadata, string]` and
avoids followup checks for the `incompatible` case.
For symmetry with `forkyState` when using `withState`, and to avoid
problems with shadowing of `blck` when using `withBlck` in `template`,
also rename the injected `blck` to `forkyBlck`.
- https://github.com/nim-lang/Nim/issues/22698
To allow testing https://github.com/ethereum/consensus-specs/issues/3466
add support for selecting fork choice version at launch. This means we
can deploy a different logic when `DENEB_FORK_EPOCH != FAR_FUTURE_EPOCH`
that won't be used on Mainnet.
* Add metadata for the Holesky network
* Add copyright banner to the new Nim module
* Working version
* Bump Chronos to fix downloading from Github
* Add checksum check of the downloaded file
* Clean up debugging code and obsolete imports
In Nim 2.0, the `test_fixture_operations` files fail to compile with:
```
Error: 'result' is of type <Result[system.void, system.cstring]> which cannot be captured as it would violate memory safety, declared here: /nimbus-eth2/tests/consensus_spec/bellatrix/test_fixture_operations.nim(130, 5); using '-d:nimNoLentIterators' helps in some cases. Consider using a <ref Result[system.void, system.cstring]> which can be captured.
```
Wrapping the `applyExecutionPayload` in a factory that takes `path`
avoids this problem.
Suite names were not being used because `test` has to have access to it
during instantiation - this PR cleans things up a little while at the
same time upgrading unittest2.
From old interop tests, a mock `eth1BlockHash` was defined in `base`.
To avoid accidental use by Nimbus, move to `tests` and rename it to
`mockEth1BlockHash`.
When a block is introduced to the system both via REST and gossip at the
same time, we will call `storeBlock` from two locations leading to a
dupliace check race condition as we wait for the EL.
This issue may manifest in particular when using an external block
builder that itself publishes the block onto the gossip network.
* refactor enqueue flow
* simplify calling `addBlock`
* complete request manager verifier future for blobless blocks
* re-verify parent conditions before adding block
among other things, it might have gone stale or finalized between one
call and the other
* async batch verification
When batch verification is done, the main thread is blocked reducing
concurrency.
With this PR, the new thread signalling primitive in chronos is used to
offload the full batch verification process to a separate thread
allowing the main threads to continue async operations while the other
threads verify signatures.
Similar to previous behavior, the number of ongoing batch verifications
is capped to prevent runaway resource usage.
In addition to the asynchronous processing, 3 addition changes help
drive throughput:
* A loop is used for batch accumulation: this prevents a stampede of
small batches in eager mode where both the eager and the scheduled batch
runner would pick batches off the queue, prematurely picking "fresh"
batches off the queue
* An additional small wait is introduced for small batches - this helps
create slightly larger batches which make better used of the increased
concurrency
* Up to 2 batches are scheduled to the threadpool during high pressure,
reducing startup latency for the threads
Together, these changes increase attestation verification throughput
under load up to 30%.
* fixup
* Update submodules
* fix blst build issues (and a PIC warning)
* bump
---------
Co-authored-by: Zahary Karadjov <zahary@gmail.com>
This PR removes a few hundred thousand temporary seq allocations during
state transition - in particular, the flag seq was allocated per
validator while committees are computed per attestation.
* Add new REST endpoints to monitor REST server connections and new chronos metrics.
* Bump head versions of chronos and presto.
* Bump chronos with regression fix.
* Remove outdated tests which was supposed to test pipeline mode.
* Disable pipeline mode in resttest.
* Update copyright year.
* Upgrade test_signing_node to start use AsyncProcess instead of std library's osproc.
Bump chronos to check graceful shutdown.
* Update AllTests.
* Bump chronos.
Split up the `ShufflingRef` acceleration logic into generically usable
parts and attester shuffling specific parts. The generic parts could be
used to accelerate other purposes, e.g., REST `/states/xxx/randao` API.
To enable additional use cases, e.g., `/states/###/randao` beacon API,
`ShufflingRef` acceleration logic needs to be able to operate on parts
of the DAG that do not have `BlockRef`. Changing `commonAncestor` to
act on `BlockId` instead of `BlockRef` is a step toward that and also
simplifies the logic some more.
Post-merge blocks contain all information to directly obtain RANDAO
without having to load any additional info. Take advantage of that to
further accelerate `ShufflingRef` computation. Note that it is still
necessary to verify that `blck` / `state` share a sufficiently recent
ancestor for the purpose of computing attester shufflings.
- new: 243.71s, 239.67s, 237.32s, 238.36s, 239.57s
- old: 251.33s, 234.29s, 249.28s, 237.03s, 236.78s
Running `makeTestDB` in tests currently always initializes DB with a
`phase0` state, preventing tests that configure a fork schedule that
starts in a different fork from working properly. Fix that by upgrading
the genesis state to whatever fork the fork schedule starts with.
Current RANDAO recovery logic is quite complex as it optimizes for the
minimum amount of database reads. Loading blocks isn't the bottleneck
though, so rather make the implementation more concise by avoiding the
complex strategy planning step. Note that this also prepares for an even
faster implementation for post-merge blocks in the future that extracts
RANDAO from `ExecutionPayload` directly if available, so even in cases
where efficiency is slightly lower, only historical data is affected.
`time nim c -r tests/test_blockchain_dag` (cached binary):
- new: 145.45s, 133.59s, 144.65s, 127.69s, 136.14s
- old: 149.15s, 150.84s, 135.77s, 137.49s, 133.89s
When the requestmanager is busy fetching blocks, the queue might get
filled with multiple entries of the same root - since there is no
deduplication, requests containing the same root multiple times will be
sent out.
Also, because the items sit in the queue for a long time potentially,
the request might be stale by the time that the manager is ready with
the previous request.
This PR removes the queue and directly fetches the blocks to download
from the quarantine which solves both problems (the quarantine already
de-duplicates and is clean of stale information).
Removing the queue for blobs is left for a future PR.
Co-authored-by: tersec <tersec@users.noreply.github.com>
* early exit `commonAncestor` when comparing with `finalizedHead`
As all `BlockRef` lead to `finalizedHead` (`parent == nil`),
can shortcut in that situation and immediately return `finalizedHead`
if passed as one of the arguments.
* typo in comment
* add test from #5152
Co-authored-by: tersec <tersec@users.noreply.github.com>
* add note about test complexity
* regenerate test summary
---------
Co-authored-by: tersec <tersec@users.noreply.github.com>
We have several modules that import `nim-eth` for the sole purpose of
its `keys.newRng` function. This function is meanwhile a simple wrapper
around `nim-bearssl`'s `HmacDrbgContext.new()`, so the import doesn't
really serve a use anymore. Replace `keys.newRng` with the direct call
to reduce `nim-eth` imports.
* require sync committee supermajority in CI
To better catch problems with sync committee messages in CI, extend
local testnet simulation to also verify that each block is signed
by a supermajority of the sync committee.
Requires #5083 and #5084
* lint
`produceSyncAggregate` is called in new slot when block is produced,
while the other functions in `sync_committee_msg_pool` are called in
previous slot. So, need to subtract 1 slot when producing sync aggregate
to accept the signatures using the old digest during fork transition.
* Refactor api.nim to provide more informative failure reasons.
Distinct between unexpected data and unexpected code.
Deprecate Option[T] usage.
* Fix 400 for produceBlindedBlock().
Get proper string conversion for strategy.
* Fix SSZ encoded versions of ProduceBlockResponseV2, ProduceBlockResponseV2 can be received and decoded.
Fix done() warnings.
Bump presto.
* Fix compilation error with new presto.
Use TcpNoDelay option for Web3Signer.
* Fix produceBlockV2() should provide SSZ responses too.
* Address block encoding issue.
* Fix signing test.
* Bump presto.
* Address review comments.
* also pack attestations where LMD vote is orphaned
When `attestation.data.beacon_block_root` gets orphaned, attestations
with a good `attestation.data.target.root` may still be valuable.
The LMD GHOST vote is not relevant for attestation rewards.
Switch to use the FFG vote (`attestation.data.target.root`) instead,
gossip validation ensures it is an ancestor of `beacon_block_root`.
* lint
* Clarify addOrphan error/logging
addOrphan returned a bool to indicate success. Change this to a Result
so that different errors can be distinguished.
* Update beacon_chain/consensus_object_pools/block_quarantine.nim
Co-authored-by: tersec <tersec@users.noreply.github.com>
* Update beacon_chain/gossip_processing/gossip_validation.nim
---------
Co-authored-by: tersec <tersec@users.noreply.github.com>
When doing sync for blocks older than
MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS, we skip the blobs by range
request, but we then pass en empty blob sequence to
validation, which then fails.
To fix this: Use an Option[Blobsidecars] to allow expressing the
distinction between "empty blob sequence" and "blobs unavailable". Use
the latter for "old" blocks, and don't attempt to run blob validation.
`SyncCommitteeMsgPool` grouped messages by their `beacon_block_root`.
This is problematic around sync committee period boundaries and forks.
Around sync committee period boundaries, members from both the current
and next sync committee may sign the same `beacon_block_root`; mixing
the signatures from both committees together is a mistake. Likewise,
around fork transitions, the `signing_root` changes, so those messages
also need to be segregated.
* don't run flaky CI in Windows
* reduce stack size in `test_light_client`
* regen tests
* re-enable maybe non-problematic test module on Windows
---------
Co-authored-by: Etan Kissling <etan@status.im>
When an uncached `ShufflingRef` is requested, we currently replay state
which can take several seconds. Acceleration is possible by:
1. Start from any state with locked-in `get_active_validator_indices`.
Any blocks / slots applied to such a state can only affect that
result for future epochs, so are viable for querying target epoch.
`compute_activation_exit_epoch(state.slot.epoch) > target.epoch`
2. Determine highest common ancestor among `state` and `target.blck`.
At the ancestor slot, same rules re `get_active_validator_indices`.
`compute_activation_exit_epoch(ancestorSlot.epoch) > target.epoch`
3. We now have a `state` that shares history with `target.blck` up
through a common ancestor slot. Any blocks / slots that the `state`
contains, which are not part of the `target.blck` history, affect
`get_active_validator_indices` at epochs _after_ `target.epoch`.
4. Select `state.randao_mixes[N]` that is closest to common ancestor.
Either direction is fine (above / below ancestor).
5. From that RANDAO mix, mix in / out all RANDAO reveals from blocks
in-between. This is just an XOR operation, so fully reversible.
`mix = mix xor SHA256(blck.message.body.randao_reveal)`
6. Compute the attester dependent slot from `target.epoch`.
`if epoch >= 2: (target.epoch - 1).start_slot - 1 else: GENESIS_SLOT`
7. Trace back from `target.blck` to the attester dependent slot.
We now have the destination for which we want to obtain RANDAO.
8. Mix in all RANDAO reveals from blocks up through the `dependentBlck`.
Same method, no special handling necessary for epoch transitions.
9. Combine `get_active_validator_indices` from `state` at `target.epoch`
with the recovered RANDAO value at `dependentBlck` to obtain the
requested shuffling, and construct the `ShufflingRef` without replay.
* more tests and simplify logic
* test with different number of deposits per branch
* Update beacon_chain/consensus_object_pools/blockchain_dag.nim
Co-authored-by: Jacek Sieka <jacek@status.im>
* `commonAncestor` tests
* lint
---------
Co-authored-by: Jacek Sieka <jacek@status.im>
`attachMerkleProofs` is used by `mockUpdateStateForNewDeposit` to create
a single deposit. The function doesn't work correctly when trying with
with multiple deposits, though. Fix this to enable more complex tests,
and also return the `deposit_root` for forming matching `Eth1Data`.