We generally send out attestations 250 ms after the block arrives.
Recent efficiency improvements have led to a slightly increased
incidence of "slot 0" issues where attestations are dropped by other
nodes because they have not yet had time to process the block due to
epoch processing taking time.
This PR mitigates the problem by increasing the window between receiving
the block and sending out attestations.
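For illustration only, the timing rule above can be sketched in Python. The names `attestation_send_time` and `post_block_delay` are hypothetical, the 12 s slot length is the mainnet value, and the one-third-slot cap follows the consensus spec's attestation deadline rather than anything stated in this PR. The delay is left as a parameter because the new value is not given here.

```python
SECONDS_PER_SLOT = 12.0  # mainnet slot length (assumption for this sketch)
ATTESTATION_DEADLINE = SECONDS_PER_SLOT / 3  # attest one third into the slot at the latest


def attestation_send_time(block_arrival, post_block_delay):
    """Return the offset (seconds into the slot) at which to attest.

    block_arrival is the offset at which the block was received,
    or None if no block has been seen for this slot.
    """
    if block_arrival is None:
        # No block seen: fall back to the deadline.
        return ATTESTATION_DEADLINE
    # Wait a little after the block so peers can finish processing it,
    # but never go past the deadline.
    return min(block_arrival + post_block_delay, ATTESTATION_DEADLINE)
```

A larger `post_block_delay` gives other nodes more time to finish epoch processing at slot 0, at the cost of attesting slightly later in every slot.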
* Implement the new Altair req/resp protocols
Also fixes the altair message-id computation by providing the correct
forkdigest prefix in `isAltairTopic`.
Co-authored-by: Tanguy Cizain <tanguycizain@gmail.com>
* add blockchain_dag altair database reading; add rollback tests; fix some unnecessary type conversions
* remove debugging scaffolding
* proposeSignedBlock() will need to be async for merge; introduce altair types to VC
* introduce immutable Altair BeaconState
* add database support for Altair blocks and states
* add tests for Altair get/put/contains/delete state
* enable blockchain_dag Altair state database storing
* properly return error on getting missing altair block
* add Altair support to the block quarantine
* switch some spec/datatypes imports to spec/datatypes/base
* add Altair support to block_clearance
* allow runtime configuration of Altair transition slot
* enable Altair in block_sim, including in CI
* remove false OnBlockAdded dependency on phase.HashedBeaconState
* introduce altair data types into block_clearance; update some alpha.6 spec refs to alpha.7; add get_active_validator_indices_len ForkedHashedBeaconState wrapper
* switch many modules from using datatypes (with phase0 states/blocks) to datatypes/base (fork-independent); update spec refs from alpha.6 to alpha.7 and remove rm'd G2_POINT_AT_INFINITY
* switch more modules from using datatypes (with phase0 states/blocks) to datatypes/base (fork-independent); update spec refs from alpha.6 to alpha.7
* remove unnecessary phase0-only wrapper of get_attesting_indices(); allow signatures_batch to process either fork; remove O(n^2) nested loop in process_inactivity_updates(); add altair support to getAttestationsforTestBlock()
* add Altair versions of asSigVerified(), asTrusted(), and makeBeaconBlock()
* fix spec URL to be Altair for Altair makeBeaconBlock()
* strawman doppelganger detection walltime refactor
* move DoppelgangerProtection to Eth2Processor
* increase comment precision
* document difference between broadcastStartEpoch and nodeLaunchSlot, and allow for one-slot overlap to avoid false positives on intra-slot restarts
* update to Altair as of v1.1.0-alpha.7
* introduce Altair types into attestation pool
* avoid allocating/copying pubkeys excessively in get_next_sync_committee()
* use ForkedHashedBeaconState in StateData
* fix FAR_FUTURE_EPOCH -> slot overflow; almost always use assign()
* avoid stack allocation in maybeUpgradeStateToAltair()
* create and use dispatch functions for check_attester_slashing(), check_proposer_slashing(), and check_voluntary_exit()
* use getStateRoot() instead of various state.data.hbsPhase0.root
* remove withStateVars.hashedState(), which doesn't work as a design anymore
* introduce spec/datatypes/altair into beacon_chain_db
* fix inefficient codegen for getStateField(largeStateField)
* state_transition_slots() doesn't need or use either blocks or runtime presets
* combine process_slots(HBS)/state_transition_slots(HBS) which differ only in last-slot htr optimization
* getStateField(StateData, ...) was replaced by getStateField(ForkedHashedBeaconState, ...)
* fix rollback
* switch some state_transition(), process_slots, makeTestBlocks(), etc to use ForkedHashedBeaconState
* remove state_transition(phase0.HashedBeaconState)
* remove process_slots(phase0.HashedBeaconState)
* remove state_transition_block(phase0.HashedBeaconState)
* remove unused callWithBS(); separate case expression from if statement
* switch back from nested-ref-object construction to (ref Foo)(Bar())
* write uncompressed validator keys to database
Loading 150k+ validator keys on startup in compressed format takes a lot
of time - better store them in uncompressed format which makes behaviour
just after startup faster / more predictable.
* refactor cached validator key access
* fix isomorphic cast to work with non-var instances
* remove cooked pubkey cache - directly use database cache in chaindag
as well (one less cache to keep in sync)
* bump blscurve, introduce loadValid for known-to-be-valid keys
* load the cache of the current state epoch instead of the target state
epoch, when applying states and slots
* load state cache for each slot/block (for longer slot jumps)
* load state cache after full updateStateData
* look up two state cache epochs, instead of the same epoch twice :)
Instead of keeping a validator key list per EpochRef, this PR introduces
a single shared validator key list in ChainDAG, and cleans up some other
ChainDAG and key-related issues.
The PR does not introduce the validator key list in the state transition
- this is because we batch-check all signatures before entering the spec
code, thus the spec code never hits the cache.
A future refactor should _probably_ remove the threadvar altogether.
There are a few other small fixes in here that make the flow easier to
read:
* fix `var ChainDAGRef` -> `ChainDAGRef`
* fix `var QuarantineRef` -> `QuarantineRef`
* consistent `dag` variable name
* avoid using threadvar pubkey cache in most cases
* better error messages in batch signature checking
* update ChainDAG.effective_balance() to use StateData; rm unused ChainDAG.getBlockByPreciseSlot()
* update get_effective_balances to avoid god object; avoid most memory allocation in Altair epoch reward and penalty processing
* don't consider legacy database when writing state - this read is slow
on kvstore
* avoid epoch transition when there's an exact match in cache already
* simplify init to only consider checkpoint states
This way we perform the expensive epoch processing before the block
arrives.
Of course, this may lead to speculative misses which in turn lead to
replays - it's likely that in the case of a miss, we'll see a replay
regardless.
* gossip_to_consensus -> block_processor (it's processing only blocks,
but not only from gossip)
* measure queue and validation time for blocks
* measure assignment and state loading times for updateStateData
* avoid some unnecessary block copies in block sync
* warn that database is corrupt if we hit tail without a state
* proposed structure for hf1
* refactor datatypes.nim into datatypes/{base, phase0, hf1}.nim
* hf1 is Altair
* some syncing with alpha 2
* adjust epoch processing to disambiguate access to RewardFlags
* relocate StateData to stay consistent with meaning phase 0 StateData
* passes v1.1.0 alpha 5 SSZ consensus object tests
* Altair block header test fixtures work
* fix slash_validator() so that Altair attester slashing, proposer slashing, and voluntary exit test fixtures work
* deposit operation Altair test fixtures work
* slot sanity and all but a couple epoch transition tests switched to Altair
* attestation Altair test fixtures work
* Altair block sanity test fixtures work
* add working altair sync committee tests
* improve workarounds for sum-types-across-modules Nim bug; incorporate SignedBeaconBlock root reconstruction into SSZ byte reader
This reverts commit eebc828778.
Adding a separate file turns out not to be enough. This PR reverts the
separate file change.
Another theory is that the large kvstore table causes cache thrashing -
all database connections share a common page cache which would explain
the poor performance of the separate file solution.
The V1 table structure shows great improvements in performance, but if
there's an old `kvstore` without rowid:s, these benefits are nullified:
reorgs during writes and deletes remain expensive (even if the
degradation is reduced somewhat).
This PR creates the tables in a new file instead, and uses the old file
as a read-only store - this has several interesting properties:
* the old database is left completely untouched - this guarantees that
downgrades work smoothly (they'll only need to resync their missing
portions)
* starting sync after this PR means only a v1 database is created
* v0 databases stick around - no migration is performed (for now)
Future PR:s can introduce migration of the data from one database to
another - a simple copy will take hours, which is downtime we want to
avoid - at that point, it might make sense to migrate straight to era
files instead.
* use StateData in place of BeaconState outside state transition code
* propagate more StateData usage
* remove withStateVars().state
* wrap get_beacon_committee(BeaconState, ...) as gbc(StateData, ...)
* switch makeAttestation() to use StateData
* use StateData wrapper/dispatcher for get_committee_count_per_slot()
* convert AttestationCache.init(), weak subjectivity functions, and updateValidatorMetrics()
* add get_shuffled_active_validator_indices(StateData) and get_block_root_at_slot(StateData)
* switch makeAttestationData() to StateData
* sync AllTests-mainnet.md after rebase
* Error when specifying an invalid --data-dir (or --validator-dir)
* Error when entering an invalid validator public key (e.g. invalid hex value)
* Warning when attempting to export a validator not present in the local database
Some unnecessary remains of the v1 mode have been removed as well
* Revert "Revert "Upgrade database schema" (#2570)"
This reverts commit 6057c2ffb4.
* ssz: fix loading empty lists into existing instances
Not a problem earlier because we didn't reuse instances
* bump nim-eth
* bump nim-web3
The `kvstore` design we're using now turns out to not be the best way to
use `sqlite` - in particular, there are some significant benefits to
using rowid in certain situations and to keep data in separate tables.
With this branch, there are massive improvements in startup time
(seconds instead of minutes) and state/block storage and pruning times
(milliseconds instead of seconds) - these improvements can in particular
be seen on slow drives and translate directly into better attestation
performance.
* update kvstore to new keyspace design
* remove `DirStoreRef` and the hidden `--state-db-kind` option - this
was an experiment to store large blobs in files, but with the new
kvstore, there's no compelling reason to do so
* remove `DbMap` - unused and would need updating for new keyspace
design
* introduce separate tables for each data type (blocks, states etc)
* remove "WITHOUT ROWID" pessimization for tables with large blobs
* close DbSeq statements explicitly (and earlier)
* store beacon block summaries in separate table, without SSZ
compression and load them all with single query on startup
* stop storing backwards compat full states
* mark genesis beacon block as trusted
* avoid faststreams when loading SSZ data
* remove `DisagreementBehavior` (unused)
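A minimal sketch of the table layout idea, using Python's `sqlite3` module. The table and column names are hypothetical and the real schema may differ. It only shows one ordinary (rowid) table per data type in place of a single `WITHOUT ROWID` key-value table. The SQLite documentation notes that `WITHOUT ROWID` tables work best when rows are small, which is not the case for state and block blobs.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database with one rowid table per data type (names are illustrative)."""
    db = sqlite3.connect(path)
    for table in ("blocks", "states", "summaries"):
        # No WITHOUT ROWID clause: large blobs stay out of the primary-key b-tree.
        db.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(key BLOB PRIMARY KEY, value BLOB)"
        )
    return db


def put(db, table, key, value):
    """Insert or overwrite a blob in the given table."""
    db.execute(f"REPLACE INTO {table} (key, value) VALUES (?, ?)", (key, value))


def get(db, table, key):
    """Return the stored blob, or None if the key is missing."""
    row = db.execute(
        f"SELECT value FROM {table} WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```

Keeping each data type in its own table also means that pruning states never has to touch the pages that hold blocks.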
This PR decreases the lead subscription time, which should help
decrease bandwidth and CPU usage by making the subscription for future
aggregation happen a bit later. There's probably room for more tuning
here.
* fix missing negation from #2550
* fix silly bitarray issues
* decrease subnet lead subscription time
* log all subnet switching source data
* rename subnet trackers to refer to stability and aggregate subnets
* more tests
Currently, we have a somewhat convoluted flow: when sending
attestations, we start broadcasting them over gossip and then pass them
to attestation validation to include them in the local attestation pool.
It should be the other way around: we should be checking attestations
_before_ gossiping them. This serves as an additional safety net to
ensure that we don't publish junk, and it becomes more important when
publishing attestations from the API.
Also, the REST API was performing its own validation meaning
attestations coming from REST would be validated twice - finally, the
JSON RPC wasn't pre-validating and would happily broadcast invalid
attestations.
* Unified attestation production pipeline with the same flow for gossip,
locally and API-produced attestations: all are now validated and entered
into the pool, then broadcast/republished
* Refactor subnet handling with specific SubnetId alias, streamlining
where subnets are computed, avoiding the need to pass around the number
of active validators
* Move some of the subnet handling code to eth2_network
* Use BitArray throughout for subnet handling
This also makes future efforts to provide metrics and logs for
attestation efficiency easier
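The validate-then-pool-then-broadcast order described above can be sketched as follows. This is a simplified Python illustration, not the actual code: `submit_attestation` and its callback parameters are hypothetical names.

```python
def submit_attestation(attestation, validate, pool, broadcast):
    """Validate first, then add to the local pool, then publish.

    validate returns None when the attestation is acceptable, or an
    error description otherwise. Returns that same value.
    """
    error = validate(attestation)
    if error is not None:
        # Invalid attestations reach neither the pool nor the network.
        return error
    pool.append(attestation)
    broadcast(attestation)
    return None
```

Because every source goes through the same function, an attestation is validated exactly once regardless of whether it came from gossip, a local validator or the API.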
* Export rewards from epoch transition
* Use less memory for reward calculation (bool -> set[enum], field
alignment)
* Reuse reward memory when replaying, avoiding spike
* Allow replaying any range in ncli_db benchmark
When applying a block, we'll currently compute a state root for the
state after slot processing but before block processing - this is
unnecessary when a block is being applied because the intermediate state
root is never observed.
This PR reduces the number of database queries for slashing protection
from 5 reads and 1 write to 2 reads and 1 write in the optimistic case.
In the process, it removes user-level support for writing the database
in the version 1 format in order to simplify the code flow, and prevent
code rot. In particular, the v1 format was not covered by any unit tests
and has no advantages over v2. The concrete code to read and write it
remains for now, in particular to support upgrades from v1 to v2.
The branch also removes the use of concepts, which don't work with
checked exceptions - in particular, this highlights code that both
raises exceptions and returns error codes, which could be cleaned up in
the future.
* Cache internal validator ID
* Rely on unique index to check for trivial duplicate votes
* Combine two surround vote queries into one
* Combine API for checking and registering slashing into single function
The slashing DB is normally not a bottleneck, but may become one with
high attached validator counts.
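As a rough sketch of the combined check-and-register idea, here is a Python `sqlite3` example. The schema and names are hypothetical and much simpler than the real slashing protection database, and signing roots are omitted, so a repeated target epoch is always reported. One query covers both directions of the surround check, and a unique index catches repeated target epochs on insert.

```python
import sqlite3


def open_db():
    """Create an in-memory table of signed attestations (illustrative schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE attestations ("
        " validator_id INTEGER NOT NULL,"
        " source_epoch INTEGER NOT NULL,"
        " target_epoch INTEGER NOT NULL,"
        " UNIQUE (validator_id, target_epoch))"
    )
    return db


def check_and_register(db, validator_id, source, target):
    """Return None and record the vote if it is safe, else name the conflict."""
    row = db.execute(
        "SELECT 1 FROM attestations WHERE validator_id = ? AND ("
        " (source_epoch < ? AND target_epoch > ?) OR"  # existing vote surrounds new
        " (source_epoch > ? AND target_epoch < ?)"  # new vote surrounds existing
        ") LIMIT 1",
        (validator_id, source, target, source, target),
    ).fetchone()
    if row is not None:
        return "surround vote"
    try:
        db.execute(
            "INSERT INTO attestations VALUES (?, ?, ?)",
            (validator_id, source, target),
        )
    except sqlite3.IntegrityError:
        # The unique index rejected a second vote for the same target epoch.
        return "duplicate target"
    return None
```

In the optimistic case this costs one read for the surround check plus one write, with the duplicate check folded into the write itself.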
* Introduce unittest2 and junit reports
* fix XML path
* don't combine multiple CI runs
* fixup
* public combined report also
Co-authored-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com>
* REST API test framework and tests.
* Fix ValidatorIndex tests to properly handle int32, but not uint32 values.
* Fix tests to follow latest REST fixes.
* refactor restapi.sh
and add it to the test suite
* Fix issues.
Add delay timeout which is required.
* Fix restapi.sh script for Windows.
Co-authored-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com>
With the introduction of batching and lazy attestation aggregation, it
no longer makes sense to enqueue attestations between the signature
check and adding them to the attestation pool - this only takes up
valuable CPU without any real benefit.
* add successfully validated attestations to attestation pool directly
* avoid copying participant list around for single-vote attestations,
pass single validator index instead
* release decompressed gossip memory earlier, especially during async
message validation
* use cooked signatures in a few more places to avoid reloads and errors
* remove some Defect-raising versions of signature-loading
* release decompressed data memory before validating message
This patch writes a full genesis state to `kvstore` if one was missing,
which fixes 1.2.0 restarting sync when upgrading from 1.1.0, or when
downgrading to a pre-1.1.0 release.
* avoid creating indexed attestation just to check signatures - above
all, don't create it when not checking signatures ;)
* avoid pointer op when adding attestation to pool
* better iterator for yielding attestations
* add metric / log for attestation packing time
This is a revamp of the attestation pool that cleans up several aspects
of attestation processing as the network grows larger and block space
becomes more precious.
The aim is to better exploit the divide between attestation subnets and
aggregations by keeping the two kinds separate until it's time to either
produce a block or aggregate. This means we're no longer eagerly
combining single-vote attestations, but rather wait until the last
moment, and then try to add singles to all aggregates, including those
coming from the network.
Importantly, the branch improves on poor aggregate quality and poor
attestation packing in cases where block space is running out.
A basic greedy scoring mechanism is used to select attestations for
blocks - attestations are added based on how many new votes they
bring to the table.
* Collect single-vote attestations separately and store these until it's
time to make aggregates
* Create aggregates based on single-vote attestations
* Select _best_ aggregate rather than _first_ aggregate when on
aggregation duty
* Top up all aggregates with singles when it's time to make the attestation
cut, thus improving the chances of grabbing the best aggregates out
there
* Improve aggregation test coverage
* Improve bitseq operations
* Simplify aggregate signature creation
* Make attestation cache temporary instead of storing it in attestation
pool - most of the time, blocks are not being produced, no need to keep
the data around
* Remove redundant aggregate storage that was used only for RPC
* Use tables to avoid some linear seeks when looking up attestation data
* Fix long cleanup on large slot jumps
* Avoid some pointers
* Speed up iterating all attestations for a slot (fixes #2490)