Instead of keeping a validator key list per EpochRef, this PR introduces
a single shared validator key list in ChainDAG, and cleans up some other
ChainDAG and key-related issues.
The PR does not introduce the validator key list in the state transition
- this is because we batch-check all signatures before entering the spec
code, thus the spec code never hits the cache.
A future refactor should _probably_ remove the threadvar altogether.
There's a few other small fixes in here that make the flow easier to
read:
* fix `var ChainDAGRef` -> `ChainDAGRef`
* fix `var QuarantineRef` -> `QuarantineRef`
* consistent `dag` variable name
* avoid using threadvar pubkey cache in most cases
* better error messages in batch signature checking
This way we perform the expensive epoch processing before the block
arrives.
Of course, this may lead to speculative misses which in turn lead to
replays - it's likely that in the case of a miss, we'll see a replay
regardless.
* gossip_to_consensus -> block_processor (it's processing only blocks,
but not only from gossip)
* measure queue and validation time for blocks
* measure assignment and state loading times for updateStateData
* avoid some unnecessary block copies in block sync
* warn that database is corrupt if we hit tail without a state
This reverts commit eebc828778.
Adding a separate file turns out not to be enough. This PR reverts the
separate file change.
Another theory is that the large kvstore table causes cache thrashing -
all database connections share a common page cache which would explain
the poor performance of the separate file solution.
The V1 table structure shows great improvements in performance, but if
there's an old `kvstore` without rowid:s, these benefits are nullified:
reorgs during writes and deletes remain expensive (even if the
degradation is reduced somewhat).
This PR creates the tables in a new file instead, and uses the old file
as a read-only store - this has several interesting properties:
* the old database is left completely untouched - this guarantees that
downgrades work smooth (they'll only need to resync their missing
portions)
* starting sync after this PR means only a v1 database is created
* v0 databases stick around - no migration is performed (for now)
Future PR:s can introduce migration of the data from one database to
another - a simply copy will take hours which is downtime we want to
avoid - at that point, it might make sense to migrate straight to era
files instead.
* use StateData in place of BeaconState outside state transition code
* propagate more StateData usage
* remove withStateVars().state
* wrap get_beacon_committee(BeaconState, ...) as gbc(StateData, ...)
* switch makeAttestation() to use StateData
* use StateData wrapper/dispatcher for get_committee_count_per_slot()
* convert AttestationCache.init(), weak subjectivity functions, and updateValidatorMetrics()
* add get_shuffled_active_validator_indices(StateData) and get_block_root_at_slot(StateData)
* switch makeAttestationData() to StateData
* sync AllTests-mainnet.md after rebase
* Revert "Revert "Upgrade database schema" (#2570)"
This reverts commit 6057c2ffb4.
* ssz: fix loading empty lists into existing instances
Not a problem earlier because we didn't reuse instances
* bump nim-eth
* bump nim-web3
The `kvstore` design we're using now turns out to not be the best way to
use `sqlite` - in particular, there are some significant benefits to
using rowid in certain situations and to keep data in separate tables.
With this branch, there are massive improvements in startup time
(seconds instead of minutes) and state/block storage and pruning times
(milliseconds instead of seconds) - these improvements can in particular
be seen on slow drives and translate directly into better attestation
performance.
* update kvstore to new keyspace design
* remove `DirStoreRef` and the hidden `--state-db-kind` option - this
was an experiment to store large blobs in files, but with the new
kvstore, there's no compelling reason to do so
* remove `DbMap` - unused and would need updating for new keyspace
design
* introduce separate tables for each data type (blocks, states etc)
* remove "WITHOUT ROWID" pessimization for tables with large blobs
* close DbSeq statements explicitly (and earlier)
* store beacon block summaries in separate table, without SSZ
compression and load them all with single query on startup
* stop storing backwards compat full states
* mark genesis beacon block as trusted
* avoid faststreams when loading SSZ data
* remove `DisagreementBehavior` (unused)
This PR decreases the lead subscription time which should help
decrease bandwidth usage and CPU making the subscription for future
aggregation happen a bit later. There's room for more tuning here,
probably.
* fix missing negation from in #2550
* fix silly bitarray issues
* decrease subnet lead subscription time
* log all subnet switching source data
* rename subnet trackers to refer to stability and aggregate subnets
* more tests
Currently, we have a bit of a convoluted flow where when sending
attestations, we start broadcasting them over gossip then pass them to
the attestation validation to include them in the local attestation pool
- it should be the other way around: we should be checking attestations
_before_ gossipping them - this serves as an additional safety net to
ensure that we don't publish junk - this becomes more important when
publishing attestations from the API.
Also, the REST API was performing its own validation meaning
attestations coming from REST would be validated twice - finally, the
JSON RPC wasn't pre-validating and would happily broadcast invalid
attestations.
* Unified attestation production pipeline with the same flow for gossip,
locally and API-produced attestations: all are now validated and entered
into the pool, then broadcast/republished
* Refactor subnet handling with specific SubnetId alias, streamlining
where subnets are computed, avoiding the need to pass around the number
of active validators
* Move some of the subnet handling code to eth2_network
* Use BitArray throughout for subnet handling
This PR reduces the number of database queries for slashing protection
from 5 reads and 1 write to 2 reads and 1 write in the optimistic case.
In the process, it removes user-level support for writing the database
in the version 1 format in order to simplify the code flow, and prevent
code rot. In particular, the v1 format was not covered by any unit tests
and has no advantages over v2. The concrete code to read and write it
remains for now, in particular to support upgrades from v1 to v2.
The branch also removes the use of concepts which doesn't work with
checked exceptions - in particular, this highlights code that both
raises exceptions and returns error codes, which could be cleaned up in
the future.
* Cache internal validator ID
* Rely on unique index to check for trivial duplicate votes
* Combine two surround vote queries into one
* Combine API for checking and registering slashing into single function
The slashing DB is normally not a bottleneck, but may become one with
high attached validator counts.
With the introduction of batching and lazy attestation aggregation, it
no longer makes sense to enqueue attestations between the signature
check and adding them to the attestation pool - this only takes up
valuable CPU without any real benefit.
* add successfully validated attestations to attestion pool directly
* avoid copying participant list around for single-vote attestations,
pass single validator index instead
* release decompressed gossip memory earlier, specially during async
message validation
* use cooked signatures in a few more places to avoid reloads and errors
* remove some Defect-raising versions of signature-loading
* release decompressed data memory before validating message
* batch attestations
* Fixes (but now need to investigate the chronos 0 .. 4095 crash similar to https://github.com/status-im/nimbus-eth2/issues/1518
* Try to remove the processing loop to no avail :/
* batch aggregates
* use resultsBuffer size for triggering deadline schedule
* pass attestation pool tests
* Introduce async gossip validators. May fix the 4096 bug (reentrancy issue?) (similar to sync unknown blocks #1518)
* Put logging at debug level, add speed info
* remove unnecessary batch info when it is known to be one
* downgrade some logs to trace level
* better comments [skip ci]
* Address most review comments
* only use ref for async proc
* fix exceptions in eth2_network
* update async exceptions in gossip_validation
* eth2_network 2nd pass
* change to sleepAsync
* Update beacon_chain/gossip_processing/batch_validation.nim
Co-authored-by: Jacek Sieka <jacek@status.im>
Co-authored-by: Jacek Sieka <jacek@status.im>
* Deferred DAG and fork choice pruning
* fixup
* Address https://github.com/status-im/nimbus-eth2/pull/2384/files#r589448448, rely only on onSLotEnd for state pruning
* no need to store needPruning in the data structure
* lastPrunePoint is updated in pruning proc
* Split eager and LazyPruning
* enforce pruning in updateHead
* refactor slot loop
* fix attestations being sent out early when _any_ block arrives (as
opposed to the block for the "correct" slot)
* fix attestations being sent out late when block already arrived
* refactor slot processing loop
* shutdown if clock moves backwards significantly
* fix docs
* notify caller whether the block actually arrived
* fix several memory leaks due to temporaries not being reset during
init
* avoid massive main() function with lots of stuff in it
* disable nim-prompt (unused)
* reuse validator pool instance in eth2_processor
* style cleanup
* database state storage benchmarking via ncli_db
* more cleanups from immutable validator state branch
* unexport some eth2_network constants and remove unused variables/templates
* make two PeerScore constants public
* Create CLI tool for slashing export
* Use SQLite as a DB instead of a KV-store
* Keeps v1 and v2 DBs around
* Uses the same schema as Lighthouse v1.1.0
* Passes all interchange tests + skeleton of finalization pruning
* Removes tests that would violate v5 / minimal slashing DB and MinSlot rules
* Migration tool added using low-watermark scheme for faster migration of large number of validators
* force pushing to fix unstable base
* increase attestation/aggregate queue sizes
when there are many validators, many aggregates and attestations arrive
every slot - increase the queue size a bit - also do batches on each
idle loop iteration since it's fairly quick
* don't score subnets for now
* wrapping up
* refactor and cleanups
* gossip parameters fixes
* comment fix
Co-authored-by: Jacek Sieka <jacek@status.im>
* use IntSet rather than HashSet[ValidatorIndex]
* add bounds check before uint64 -> int conversion
* use intsets in block transitions
* remove superfluous Nim issue explanation/reference
* allow always-on subscription to all attestation subnets when gossiping
* in subscribe-all-subnets mode, consider all subnets to be stability subnets for ENR purposes
* remove await/async from sub/unsub
* fix unsubscribe wrong key (missed _snappy)
* use the right libp2p commit hash
* remove unused async
* fix inspector
* fix subnet calculation in RPC and insert broadcast attestations into node's pool
* unify codepaths to ensure only mostly-checked-to-be-valid attestations enter the pool, even from node's own broadcasts
* update attestation pool tests for new validateAttestation param
Co-authored-by: Dustin Brody <tersec@users.noreply.github.com>
* make subnet cycling more robust; use one stability subnet/validator; explicitly represent gossip enabled/disabled
* fix asymmetry in _snappy being used for subscriptions but not unsubscriptions
* remove redundant comment
* minimal RPC and VC support for infoming BN of subnets
* create and verify slot signatures in RPC interface and VC
* loosen old slot check
* because Slot + uint64 works but uint64 + Slot doesn't
* document assumptions for head state use; don't clear stability subnets; guard against VC not having checked an epoch ahead, fixing a crash; clarify unsigned comparison
* revert unsub fix
* checkpoint database at end of each slot
To avoid spending time on synchronizing with the file system while doing
processing, the manual checkpointing mode turns off fsync during
processing and instead checkpoints the database when the slot has ended.
From an sqlite perspecitve, in WAL mode this guaranees database
consistency but may lead to data loss which is fine - anything missing
from the beacon chain database can be recovered on the next startup.
* log sync status and delay in slot start message
* bump
* Handle some web3 timeouts better
* Add support for developer .env files
* Eth1 improvements; Mainnet genesis state
Notable changes:
* The deposits table have been removed from the database. The client
will no longer process all deposits on start-up.
* The network metadata now includes a "state snapshot" of the deposit
contract. This allows the client to skip syncing deposits made prior
to the snapshot (i.e. genesis). Suitable metadata added for Pyrmont
and Mainnet.
* The Eth1 monitor won't be started unless there are validators attached
to the node.
* The genesis detection code is now optional and disabled by default
* Bugfix: The client should not produce blocks that will fail validation
when it hasn't downloaded the latest deposits yet
* Bugfix: Work around the database corruption affecting Pyrmont nodes
* Remove metadata for Toledo and Medalla
* log when database is loading (to avoid confusion)
* generate network keys later during startup
* fix quarantine not scheduling chain of parents for download and
increase size to one epoch
* log validator count, enr and peerid more clearly on startup
* Avoid hangs when wss:// is specified for a non-secure HTTP server
* Produce an ERROR when the web3 provider is unsupported, but still launch the node