With these changes, we can backfill about 400-500 slots/sec, which means
a full backfill of mainnet takes about 2-3h.
However, the CPU is not saturated - neither in server nor in client
meaning that somewhere, there's an artificial inefficiency in the
communication - 16 parallel downloads *should* saturate the CPU.
One plasible cause would be "too many async event loop iterations" per
block request, which would introduce multiple "sleep-like" delays along
the way.
I can push the speed up to 800 slots/sec by increasing parallel
downloads even further, but going after the root cause of the slowness
would be better.
* avoid some unnecessary block copies
* double parallel requests
When node is restarted before backfill has started but after some blocks
have finalized with forward sync, we would not start the backfill.
* also clean up one last `SomeSome`
* Initial commit.
* Fix current test suite.
* Fix keymanager api test.
* Fix wss_sim.
* Add more keystore_management tests.
* Recover deleted isEmptyDir().
* Add `HttpHostUri` distinct type.
Move keymanager calls away from rest_beacon_calls to rest_keymanager_calls.
Add REST serialization of RemoteKeystore and Keystore object.
Add tests for Remote Keystore management API.
Add tests for Keystore management API (Add keystore).
Fix serialzation issues.
* Fix test to use HttpHostUri instead of Uri.
* Add links to specification in comments.
* Remove debugging echoes.
* harden and speed up block sync
The `GetBlockBy*` server implementation currently reads SSZ bytes from
database, deserializes them into a Nim object then serializes them right
back to SSZ - here, we eliminate the deser/ser steps and send the bytes
straight to the network. Unfortunately, the snappy recoding must still
be done because of differences in framing.
Also, the quota system makes one giant request for quota right before
sending all blocks - this means that a 1024 block request will be
"paused" for a long time, then all blocks will be sent at once causing a
spike in database reads which potentially will see the reading client
time out before any block is sent.
Finally, on the reading side we make several copies of blocks as they
travel through various queues - this was not noticeable before but
becomes a problem in two cases: bellatrix blocks are up to 10mb (instead
of .. 30-40kb) and when backfilling, we process a lot more of them a lot
faster.
* fix status comparisons for nodes syncing from genesis (#3327 was a bit
too hard)
* don't hit database at all for post-altair slots in GetBlock v1
requests
* deactivate Doppelganger Protection during genesis
* also don't actually flag supposed-doppelgangers (because they're before broadcastStartEpoch) on GENESIS_SLOT start
* update action tracker on dependent-root-changing reorg (instead of
epoch change)
* don't try to log duties while syncing - we're not tracking actions yet
* fix slot used for doppelganger loss detection
These use a separate flow, and were previously only registered from the
network
* don't log successes in totals mode (TMI)
* remove `attestation-sent` event which is unused
* Fix a resource leak introduced in https://github.com/status-im/nimbus-eth2/pull/3279
* Don't restart the Eth1 syncing proggress from scratch in case of
monitor failures during Eth2 syncing.
* Switch to the primary operator as soon as it is back online.
* Log the web3 credentials in fewer places
Other changes:
The 'web3 test' command has been enhanced to obtain and print more
data regarding the selected provider.
- Request metadata_v2 (altair) by default instead of the v1
- Change the metadata pinger to a 3 failure-then-kick, instead of being time based
- Update kicker scorer to take into account topics which we're not subscribed to, to be sure that we will be able to publish correctly
- Add some metrics to give "fanout" health (in the same spirit of mesh health)
Notable improvements:
* A separate aggregation pass is no longer required.
* The user can opt to produce only aggregated data
(resuing in a much smaller data set).
* Large portion of the number cruching in Jupyter is now done in C
through the rich DataFrames API.
* Added support for comparisons against the "median" validator
performance in the network.
Make `validator exit command` work both with `JSON-RPC` and `REST` APIs
Fix problem with specifying rest-url using `localhost`
Change back exit error messages in `state_transition_block`
The current counters set gauges etc to the value of the _last_ validator
to be processed - as the name of the feature implies, we should be using
sums instead.
* fix missing beacon state metrics on startup, pre-first-head-selection
* fix epoch metrics not being updated on cross-epoch reorg
* Store finalized block roots in database (3s startup)
When the chain has finalized a checkpoint, the history from that point
onwards becomes linear - this is exploited in `.era` files to allow
constant-time by-slot lookups.
In the database, we can do the same by storing finalized block roots in
a simple sparse table indexed by slot, bringing the two representations
closer to each other in terms of conceptual layout and performance.
Doing so has a number of interesting effects:
* mainnet startup time is improved 3-5x (3s on my laptop)
* the _first_ startup might take slightly longer as the new index is
being built - ~10s on the same laptop
* we no longer rely on the beacon block summaries to load the full dag -
this is a lot faster because we no longer have to look up each block by
parent root
* a collateral benefit is that we no longer need to load the full
summaries table into memory - we get the RSS benefits of #3164 without
the CPU hit.
Other random stuff:
* simplify forky block generics
* fix withManyWrites multiple evaluation
* fix validator key cache not being updated properly in chaindag
read-only mode
* drop pre-altair summaries from `kvstore`
* recreate missing summaries from altair+ blocks as well (in case
database has lost some to an involuntary restart)
* print database startup timings in chaindag load log
* avoid allocating superfluos state at startup
* use a recursive sql query to load the summaries of the unfinalized
blocks
Additional sanity checking of the status message exchanged during a
fresh connection:
* check that head and finalized make sense, slot-wise
* verify that finalized root lies on the canonical chain, when possible
* re-check these things for every status message during sync