Calculating rewards/penalties is slow due to how we compute sets of
attestations validators then use the sets for inclusion checks, to see
who attested. The dominant function during validated block processing /
epoch processing is hash set building and lookup.
This PR inverts the flow by removing the sets and creating a single
large validator status list, then applying all relevant state
attestations, then updating rewards and penalties.
This provides a 10x speedup to epoch processing which in turn speeds up
both empty slot and block processing - for example, on startup, we
replay all non-finalized blocks to prime fork choice - the same when
validating attestations or replaying states on reorg.
* update ve1.0.0-rc.0 preset spec references
* remove runtime preset ETH1_FOLLOW_DISTANCE from preset files; remove two CI build items to try to keep Travis from timing out
This addresses the issues by detecting and rejecting keystores with
incorrect PBKDF2 and SCrypt params. It also bumps the version of
nim-json-serialization to include a bugfix for incorrect parsing
of json files featuring comments.
It turns out that we often save lots of states in the database that are
the result of empty slot processing only - here, we make sure to only
save a state if a block follows - this fixes several issues:
* empty slot states are not always pruned leading to state database size
explosion
* storing states is (very) slow which slows down processing in general,
so we should only do it when it's likely to be useful
* attestation processing doesn't get stuck on saving random states that
won't appear in the chain history
* in exit pool, filter out already-packaged messages; bundle remaining messages into beaconblocks
* filter messages at block construction time
* allow adding up to intended capacity of buffers, beyond per-block limits
* document rationale/design for filtering mechanism
* addPeer() and addPeerNoWait() now returns PeerStatus, not bool.
Minor refactoring of PeerPool.
Fix tests.
* Refactor PeerPool.
Add lenSpace.
Add tests for lenSpace.
PeerPool.add procedures now return different error codes.
Fix SyncManager break/continue problem.
Fix connectWorker break/continue problem.
Refactor connectWorker and discoveryLoop.
Fix incoming/outgoing blocking problem.
* Refactor discovery loop.
Add checkPeer.
* Fix logic and compilation bugs.
* Adjust position of debugging log.
* Fix issue with maximum peers in PeerPool.
Optimize node record decoding.
* fix discoveryLoop.
* Remove aliases and fix tests using aliases.
about 40% better slot processing times (with LTO enabled) - these don't
do BLS but are used
heavily during replay (state transition = slot + block transition)
tests using a recent medalla state and advancing it 1000 slots:
```
./ncli slots --preState2:state-302271-3c1dbf19-c1f944bf.ssz --slot:1000
--postState2:xx.ssz
```
pre:
```
All time are ms
Average, StdDev, Min, Max, Samples,
Test
Validation is turned off meaning that no BLS operations are performed
39.236, 0.000, 39.236, 39.236, 1,
Load state from file
0.049, 0.002, 0.046, 0.063, 968,
Apply slot
256.504, 81.008, 213.471, 591.902, 32,
Apply epoch slot
28.597, 0.000, 28.597, 28.597, 1,
Save state to file
```
cast:
```
All time are ms
Average, StdDev, Min, Max, Samples,
Test
Validation is turned off meaning that no BLS operations are performed
37.079, 0.000, 37.079, 37.079, 1,
Load state from file
0.042, 0.002, 0.040, 0.090, 968,
Apply slot
215.552, 68.763, 180.155, 500.103, 32,
Apply epoch slot
25.106, 0.000, 25.106, 25.106, 1,
Save state to file
```
cast+rewards:
```
All time are ms
Average, StdDev, Min, Max, Samples,
Test
Validation is turned off meaning that no BLS operations are performed
40.049, 0.000, 40.049, 40.049, 1,
Load state from file
0.048, 0.001, 0.045, 0.060, 968,
Apply slot
164.981, 76.273, 142.099, 477.868, 32,
Apply epoch slot
28.498, 0.000, 28.498, 28.498, 1,
Save state to file
```
cast+rewards+shr
```
All time are ms
Average, StdDev, Min, Max, Samples,
Test
Validation is turned off meaning that no BLS operations are performed
12.898, 0.000, 12.898, 12.898, 1,
Load state from file
0.039, 0.002, 0.038, 0.054, 968,
Apply slot
139.971, 68.797, 120.088, 428.844, 32,
Apply epoch slot
24.761, 0.000, 24.761, 24.761, 1,
Save state to file
```
* Slashing protection + interchange initial commit
* Restrict the when UseSlashingProtection dance in other modules
* Integrate slashing tests in other all_tests
* Add attestation slashing protection support
* Add a message that mention if built with/without slashing protection
* no op the initialization proc
* test slashing protection in Jenkins (temp)
* where to configure NIMFLAGS in Jenkins ...
* Jenkins -> ensure Built with slashing protection
* Add slashing protection complete import
* use Opt.get(otherwise)
* Don't use negation in proc name
* Turn slashing protection on by default
* Refactor peer_pool.
Fix eth2_network peer counters.
Fix PeerPool do not allow to add more peers when empty space available.
* Remove unused imports.
* Add test for a bug.
* Fix eth2_network disconnect should deletePeer not release.
More PeerPool refactoring.
* remove some superfluous gcsafes
* remove getTailState (unused)
* don't store old epochrefs in blocks
* document attestation pool a bit
* remove `pcs =` cruft from log
* Quick fix to prune some states, pending smarter state storage
Adverse effects might include slow rewinds - typically the protocol
doesn't ask for pre-finalized states but RPC might
* document issue, add test
* fix cache miss log
* stop discarding non-existent future epochs during epoch state transitions; remove a pointless StateCache() construction in advance_slots()
* update nbench to pass StateCache to process_slots()
This implements disparity, resolving a part of
https://github.com/status-im/nim-beacon-chain/issues/1367
* make BeaconTime a duration for fractional seconds
* factor out attestation/aggregate validation
* simplify recording of queued attestations
* simplify attestation signature check
* fix blocks_received metric
* add some trivial validation tests
* remove unresolved attestation table - attestations for unknown blocks
are dropped instead (cannot verify their signature)
* initial - cheaper pruning - addresses #1534
* Pass tests: update offset when pruning, proper handling of pruned parents
* Use options instead of nil for nilable newHead (finalization passing but rootcause not solved)
* First line of defense against stackoverflow in tests
* Fix compute_delta offset after pruning
* Rebase fix - medalla ready
* Remove Option[BlockRef]
Replace shuffling function with zrnt version - `get_shuffled_seq` in
particular puts more strain on the GC by allocating superfluous seq's
which turns out to have a significant impact on block processing (when
replaying blocks for example) - 4x improvement on non-epoch, 1.5x on
epoch blocks (replay is done without signature checking)
Medalla, first 10k slots - pre:
```
Loaded 68973 blocks, head slot 117077
All time are ms
Average, StdDev, Min, Max, Samples,
Test
Validation is turned off meaning that no BLS operations are performed
76855.848, 0.000, 76855.848, 76855.848, 1,
Initialize DB
1.073, 0.914, 0.071, 12.454, 7831,
Load block from database
31.382, 0.000, 31.382, 31.382, 1,
Load state from database
85.644, 30.350, 3.056, 466.136, 7519,
Apply block
506.569, 91.129, 130.654, 874.786, 312,
Apply epoch block
```
post:
```
Loaded 68973 blocks, head slot 117077
All time are ms
Average, StdDev, Min, Max, Samples,
Test
Validation is turned off meaning that no BLS operations are performed
72457.303, 0.000, 72457.303, 72457.303, 1,
Initialize DB
1.015, 0.858, 0.070, 11.231, 7831,
Load block from database
28.983, 0.000, 28.983, 28.983, 1,
Load state from database
21.725, 17.461, 2.659, 393.217, 7519,
Apply block
324.012, 33.954, 45.452, 440.532, 312,
Apply epoch block
```
When blocks and attestations arrive, they are SSZ-decoded twice: once
for validation and once for processing. This branch enqueues the decoded
block directly for processing, avoiding the second, slow
deserialization.
* move processing of blocks and attestations to queue
* ...and out from beacon_node
* split attestation processing into attestations and aggregates
* also updates metrics
* clean up logging to better follow the lifetime of gossip: arrival,
validation and processing
* drop attestations and aggregates if there are too many
* try to prioritise blocks and aggregates before single-validator
attestations
* collect all epochrefs in specific blocks to make them easier to find
and to avoid lots of small seqs
* reuse validator key databases more aggressively by comparing keys
* make state cache available from within `withState`
* make epochRef available from within onBlockAdded callback
* integrate getEpochInfo into block resolution and epoch ref logic such
that epochrefs are created when blocks are added to pool or lazily when
needed by a getEpochRef
* fill state cache better from EpochRef, speeding up replay and
validation
* store epochRef in specific blocks to make them easier to find and
reuse
* fix database corruption when state is saved while replaying quarantine
* replay slots fully from block pool before processing state
* compare bls values more smartly
* store epoch state without block applied in database - it's recommended
to resync the node!
this branch will drastically speed up processing in times of long
non-finality, as well as cut memory usage by 10x during the recent
medalla madness.
* add attestation processing queue so attestations don't get processed
too early
* rework justified slot delay to match spec / lighthouse better
* keep less state in fork choice
* request epochref less
* Bump BLSCurve
* Use unified aggregation API
* use new blscurve with unified aggregate API
* bump
* fix toRaw
* replace state_sim combine with AggregateSignature
* Fix 32-bit
* Fix 32-bit for real and test deactivating ccache for fno-tree-lopp-vectorize flag
* change compilation switches to narrow down Linux issue
* Use -fno-tree-vectorize to disable both tree-loop-vectorize and tree-slp-vectorize
* blscurve now disables both Loop and SLP vectorization
* Add tests for the miracl/milagro fallback
* Travis has max log size of 4MB
* Test with Miracl in the finalization test
* fix state_sim log level
* Coment out the slow fallback tests
* fix invalid state root being written to database
When rewinding state data, the wrong block reference would be used when
saving the state root - this would cause state loading to fail by
loading a different state than expected, preventing blocks to be
applied.
* refactor state loading and saving to consistently use and set
StateData block
* avoid rollback when state is missing from database (as opposed to
being partially overwritten and therefore in need of rollback)
* don't store state roots for empty slots - previously, these were used
as a cache to avoid recalculating them in state transition, but this has
been superceded by hash tree root caching
* don't attempt loading states / state roots for non-epoch slots, these
are not saved to the database
* simplify rewinder and clean up funcitions after caches have been
reworked
* fix chaindag logscope
* add database reload metric
* re-enable clearance epoch tests
* names
* Eliminate possibilities for range errors and overflows
* Handle more properly invalid requests for furute slots
* Eliminate the confusing surrounding the MAX_REQUEST_BLOCKS constant
Addresses https://github.com/status-im/nim-beacon-chain/issues/1366
* Allow sync manager process blocks one by one.
* Log storeBlock() and updateHead() duration.
* Calculate duration only for blocks added without any error.
* Fix float compilation error.
* Fix duration.
* Fix SyncQueue tests.
* removed the BlockPool type and all of the proxy functions around it - passing the chain DAG and the quarantine explicitly where appropriately - they don't need to be bundled in a type
* fixed the build after the rebase
* more fork-choice fixes
* use target block/epoch to validate attestations
* make addLocalValidators sync
* add current and previous epoch to cache before doing state transition
* update head state using clearance state as a shortcut, when possible
* use blockslot for fork choice balances
* send attestations using epochref cache
* fix invalid finalized parent being used
also simplify epoch block traversal
* single error handling style in fork choice
* import fix, remove unused async
* Lazy loading of crypto objects
* Try to fix incorrect field access by hiding fields but no luck. SSZ/Chronicles/macro bug?
* Fix incorrect blsValue access. was "aggregate" not "chronicles"
* Fix tests that rely on the internal BLSValue representation
* limit attestations kept in attestation pool
With fork choice updated, the attestation pool only needs to keep track
of attestations that will eventually end up in blocks - we can thus
limit the horizon of attestations that we keep more aggressively.
To get here, we expose getEpochRef which gets metadata about a
particular epochref, and make sure to populate it when a block is added
- this ensures that state rewinds during block addition are minimized.
In addition, we'll use the target root/epoch when validating
attestations - this helps minimize the number of different states that
we need to rewind to, in general.
* remove CandidateChains.justifiedState
unused
* remove BlockPools.Head object
* avoid quadratic quarantine loop
* fix
* add helper for beacon committee length (used for quickly validating
attestations)
* refactor some attestation checks to do cheap checks early
* validate attestation epoch before computing active validator set
* clean up documentation / comments
* fill state cache on demand
* avoid converting from uint64 to int, and where most feasible, int type conversion at all
* .len.uint64 -> .len64
* fix 32-bit compilation
* try keeping state_sim loop variable/bounds as int for 32-bit Azure
* len64 -> lenu64
* fork choice fixes, round 3
* introduce checkpoint tracker
* split out fork choice backend that is independent of dag
* correctly update best checkpoint to use for head selection
* correctly consider wall clock when processing attestations
* preload head history only (only one history is loaded from database
anyway)
* love the DAG
* switch to fork choice v2
also remove BlockRef.children
* fix
* cache beacon committee size calculation
this fixes a bug in get_validator_churn_limit as well
* fix
* make committee counts consistently uint64
mixing feels like the worst of the two worlds
- delete "tests/simulation/{data,validators}" by default, because old
validator keys can lead to a crash with a cryptic error message
- actually start the Prometheus daemon on `make eth2_network_simulation`
- kill any running Prometheus daemon on exit
- fix some shell script syntax incompatible with Bash
* remove cruft
* reenable fork choice and fix several issues
* in addForkChoice_v2, the `.error` field would be accessed even when
Result is ok
* remove workaround for invalid block structure in fork choice
* fix `tmpState` being used recursively in callback, causing state
corruption while processing attestation
* fix block callback being called twice per block
* pass state to callback to avoid unnecessary rewinding
* enable head select, fix another bug
* never use `get` without `isOk`
* log nil blockref in case blockref is nil
* add missing error checking
* use correct epoch when updating attestation message
hash_tree_root was turning up when running beacon_node, turns out to be
repeated hash_tree_root invocations - this pr brings them back down to
normal.
this PR caches the root of a block in the SignedBeaconBlock object -
this has the potential downside that even invalid blocks will be hashed
(as part of deserialization) - later, one could imagine delaying this
until checks have passed
there's also some cleanup of the `cat=` logs which were applied randomly
and haphazardly, and to a large degree are duplicated by other
information in the log statements - in particular, topics fulfill the
same role
* restore EpochRef and flush statecaches on epoch transitions
* more targeted cache invalidation
* remove get_empty_per_epoch_cache(); implement simpler but still faster get_beacon_proposer_index()/compute_proposer_index() approach; add some abstraction layer for accessing the shuffled validator indices cache
* reduce integer type conversions
* remove most of rest of integer type conversion in compute_proposer_index()
This also assigns precise types to the constants in the minimal
and mainnet presets in order to reduce the chance of compilation
errors when custom presets are used (previously, only the custom
presets have precisely assigned types for the constants).
htr is fast now, so hitting the database to load the state root is no
longer motivated - the more simple code is less vulnerable to database
corruption.
* update most of the remaining non-fork-choice spec refs, updating code where necessary
* revert presumably harmless compute_signing_root() change, but this way, keep things really unchanged outside inspector
* Dual headed fork choice
* fix finalizedEpoch not moving
* reduce fork choice verbosity
* Add failing tests due to pruning
* Properly handle duplicate blocks in sync
* test_block_pool also add a test for duplicate blocks
* comments addressing review
* Fix fork choice v2, was missing integrating block proposed
* remove a spurious debug writeStackTrace
* update block_sim
* Use OrderedTable to ensure that we always load parents before children in fork choice
* Load the DAG data in fork choice at init if there is some (can sync witti)
* Cluster of quarantined blocks were not properly added to the fork choice
* Workaround async gcsafe warnings
* Update blockpoool tests
* Do the callback before clearing the quarantine
* Revert OrderedTable, implement topological sort of DAG, allow forkChoice to be initialized from arbitrary finalized heads
* Make it work with latest devel - Altona readyness
* Add a recovery mechanism when forkchoice desyncs with blockpool
* add the current problematic node to the stack
* Fix rebase indentation bug (but still producing invalid block)
* Fix cache at epoch boundaries and lateBlock addition
* adopt Result[void, string] in place of some bool return signatures
* string -> cstring to reduce memory allocations; ensure all err() strings are constants, with contextual information from higher-level callers
* logScope usage fixes
* homogenize err() reporting convention
* invalid signature in deposit isn't an error
* add tests for unviable blocks
also enable finalization tests in all test configs - they're plenty fast
now
also fix newClone for non-rvo cases. sigh.
* fixes
* cleanups
* fix ncli state root check flag
* add block dump to ncli_db
* limit ncli_db benchmark length
* tone down finalization logs
* introduce trusted blocks
We only store blocks whose signature we've verified in the database - as
such, there's no need to check it again, and most importantly, no need
to deserialize the signature when loading from database.
50x startup time improvement, 200x block load time improvement.
* fix rewinding when deposits have invalid signature
* speed up ancestor iteration by avoiding copy
* avoid deserializing signatures for trusted data
* load blocks lazily when rewinding (less memory used)
* chronicles workarounds
* document trustedbeaconblock
* re-add minimal preset constant checking; organize presets to better support multiple spec versions
* bump spec ref
* increase Azure timeout to 90 minutes to accomodate Nim compiler building
- better logging & retrying requests on the VC side if the BN fails for some reason
- VC now fetches the attestation duties 1 epoch in advance - in the future it will tell the BN to subscribe to the appropriate attestation topics in advance based on that info
- a bunch of other code cleanup & fixes such as better naming for consoles when using multitail, etc.
reviewed in PR #1184 - proper review of the API & VC are pending
* collect signature production and verificaiton in one place
Signatures are made over data and domain - here we collect all such
activities in one place.
Also:
* security: fix cast-before-range-check
* log block/attestation verification consistently
* run block verification based on `getProposer` in its own history
* clean up some unused stuff
* import
* missing raises
* random fixes
* create dump dir on startup
* don't crash on failure to write dump
* fix a few `uint64` instances being used when indexing arrays - this
should be a compile error but isn't due to compiler bugs
* fix standalone test_block_pool compilation
* add signed block processing in ncli
* reuse cache entry instead of allocating a new one
* allow for small clock disparities when validating blocks