We have been seeing some port conflicts like:
```
[2023-08-15T00:31:47.625Z] Geth 0 failed to start
```
```
$ tail -n1 local-testnet-mainnet/logs/geth.?.txt
==> local-testnet-mainnet/logs/geth.0.txt <==
Fatal: Error starting protocol stack: listen tcp :6801: bind: address already in use
==> local-testnet-mainnet/logs/geth.1.txt <==
Fatal: Error starting protocol stack: listen tcp :6806: bind: address already in use
==> local-testnet-mainnet/logs/geth.2.txt <==
Fatal: Error starting protocol stack: listen tcp :6811: bind: address already in use
```
To debug this, we need to print some extra information, and the change
has to land in `unstable` so that feature branches include it.
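A minimal sketch of the kind of extra info that could help, not the change actually made here. Both function names are hypothetical, and `show_port_owner` assumes `lsof` is installed on the CI host:

```sh
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print the TCP port from
# every "bind: address already in use" line, one port per line.
port_from_bind_error() {
  sed -n 's/.*listen tcp [^:]*:\([0-9][0-9]*\): bind: address already in use.*/\1/p'
}

# Hypothetical helper: show which process, if any, listens on the given port.
# Assumes `lsof` is available; prints a note instead of failing when it is not.
show_port_owner() {
  if command -v lsof >/dev/null 2>&1; then
    lsof -nP -iTCP:"$1" -sTCP:LISTEN || echo "no listener found on port $1"
  else
    echo "lsof not available; cannot inspect port $1"
  fi
}
```

For the log above, `tail -n1 local-testnet-mainnet/logs/geth.0.txt | port_from_bind_error` would print `6801`, which can then be passed to `show_port_owner`.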
Related: https://github.com/status-im/nimbus-eth2/issues/4575
Signed-off-by: Jakub Sokołowski <jakub@status.im>
* Add support for using custom remote signers in local sim
Other changes:
* Enable the Nimbus remote signer in the minimal simulation
* Move all log files into the `logs` folder of the simulation
* Create PID files for all processes and use them during the clean-up
phase instead of the previous more fragile methods for killing the
remaining processes.
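The PID-file based clean-up could look roughly like the sketch below. The function name and the one-`.pid`-file-per-process layout are assumptions made for illustration, not the script's actual implementation:

```sh
#!/bin/sh
# Hypothetical helper: stop every process recorded in "<dir>/*.pid".
# Assumes each file holds exactly one PID written when the process started.
stop_from_pid_files() {
  pid_dir=$1
  for pid_file in "$pid_dir"/*.pid; do
    [ -f "$pid_file" ] || continue   # the glob matched nothing
    pid=$(cat "$pid_file")
    # Send a signal only if the recorded process still exists.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
      kill "$pid" 2>/dev/null || true
    fi
    rm -f "$pid_file"
  done
}
```

Only processes started by the simulation are signalled, so unrelated processes on the same host are left alone.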
* Kzg: Load trusted setup
* scripts/launch_local_testnet.sh: set FIELD_ELEMENTS_PER_BLOB
* Use right setup file for mainnet/minimal
* Force rebuild
* Add comment explaining why build with -f
* fix false positive getopt failure with multiple getopt matches in searched path
* also get launch_local_testnet
* also make_prometheus_config, called from launch_local_testnet
* Support for driving multiple EL nodes from a single Nimbus BN
Full list of changes:
* Eth1Monitor has been renamed to ELManager to match its current
responsibilities better.
* The ELManager is no longer optional in the code (it won't have
a nil value under any circumstances).
* The support for subscribing for headers was removed as it only
worked with WebSockets and contributed significant complexity
while bringing only a very minor advantage.
* The `--web3-url` parameter has been deprecated in favor of a
new `--el` parameter. The new parameter has a reasonable default
value and supports specifying a different JWT for each connection.
Each connection can also be configured with a different set of
responsibilities (e.g. download deposits, validate blocks and/or
produce blocks). On the command-line, these properties can be
configured through URL properties stored in the #anchor part of
the URL. In TOML files, they come with a very natural syntax
(although the URL scheme is also supported).
* The previously scattered EL-related state and logic is now moved
to `eth1_monitor.nim` (this module will be renamed to `el_manager.nim`
in a follow-up commit). State is assigned properly either to the
`ELManager` or to the individual `ELConnection` objects where
appropriate.
The ELManager executes all Engine API requests against all attached
EL nodes, in parallel. It compares their results and if there is a
disagreement regarding the validity of a certain payload, this is
detected and the beacon node is protected from publishing a block
with a potential execution layer consensus bug in it.
The BN provides metrics per EL node for the number of successful or
failed requests for each type of Engine API request. If an EL node
goes offline and connectivity is restored later, we report the
problem and the remedy in an edge-triggered fashion.
* More progress towards implementing Deneb block production in the VC
and comparing the value of blocks produced by the EL and the builder
API.
* Adds a Makefile target for the zhejiang testnet
First step in debugging an issue most probably re-introduced by:
https://github.com/status-im/nimbus-eth2/pull/4551
The issue causes the finalization test script to kill other processes
unrelated to the given CI job.
Signed-off-by: Jakub Sokołowski <jakub@status.im>
* Local sim improvements
* Added support for running Capella and EIP-4844 simulations
by downloading the correct version of Geth.
* Added support for using the Nimbus remote signer and Web3Signer.
Use a 2-out-of-3 threshold signing configuration in the mainnet
configuration and regular remote signing in the minimal one.
* The local testnet simulation can now use a payload builder.
This is currently not activated in CI due to lack of automated
procedures for installing third-party relays or builders.
You are advised to use mergemock for now, but for the most realistic
results, we can create a simple builder based on the nimbus-eth1
codebase that will be able to propose transactions from the regular
network mempool.
* Start the simulation from a merged state. This allows us
to start removing pre-merge functionality such as the gossip
subscription logic. The commit also removes the merge-forcing
hack installed after the TTD removal.
* Consolidate all the tools used in the local simulation into a
single `ncli_testnet` binary.
Another dumb mistake when using Bourne shell:
```
/var/lib/dpkg/info/nimbus-beacon-node.postinst: 23: source: not found
```
Signed-off-by: Jakub Sokołowski <jakub@status.im>
The `postinst` wrapper script, into which these scripts are embedded as
`after_upgrade` and `after_install` functions, is executed using Bourne
shell (`sh`), so we cannot use the Bash-specific `[[ ]]` test or it fails:
```
/var/lib/dpkg/info/nimbus-beacon-node.postinst: 22: [[: not found
```
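A minimal sketch of the portable alternative, assuming the check compares a distribution ID. Both function names are hypothetical and do not come from the packaging scripts:

```sh
#!/bin/sh
# Hypothetical helper: succeed when the given distro ID is Debian or Ubuntu.
# `case` is plain POSIX, so it works in `sh` where `[[ ]]` does not exist.
is_debian_like() {
  case "$1" in
    debian|ubuntu) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: POSIX `[ ]` with `=` instead of Bash `[[ ... == ... ]]`.
is_exact_match() {
  [ "$1" = "$2" ]
}
```

Quoting both operands inside `[ ]` matters, since unlike `[[ ]]` it performs word splitting on unquoted variables.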
Signed-off-by: Jakub Sokołowski <jakub@status.im>
The `/etc/os-release` file exists in most distributions and can be
easily read in Bash by sourcing it:
```
> docker run --rm -it debian:bullseye
root@2f5d6e038738:/# grep '^ID=' /etc/os-release
ID=debian
```
```
> docker run --rm -it ubuntu:22.04
root@316b572b6e4d:/# grep '^ID=' /etc/os-release
ID=ubuntu
```
The dependency on the `lsb-release` tool is unnecessary
and pulls in additional large dependencies like `python3`:
```
# apt show lsb-release | grep Depends
Depends: python3:any, distro-info-data
```
In a Docker container, this would make the image unnecessarily big.
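A minimal sketch of reading the distribution ID without `lsb-release`, assuming the script runs under plain `sh`, where `.` has to be used instead of `source`. The function name `os_release_id` is hypothetical:

```sh
#!/bin/sh
# Hypothetical helper: print the ID field of an os-release style file.
# The path is a parameter (default /etc/os-release) so it can be tried
# against a sample file.
os_release_id() {
  file=${1:-/etc/os-release}
  # `.` searches PATH for names without a slash, so force a path.
  case "$file" in
    */*) ;;
    *) file=./$file ;;
  esac
  [ -r "$file" ] || return 1
  # Source in a subshell so the file's variables do not leak into the caller.
  ( . "$file" && printf '%s\n' "$ID" )
}
```

On the two images shown above, `os_release_id` would be expected to print `debian` and `ubuntu` respectively.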
Signed-off-by: Jakub Sokołowski <jakub@status.im>
Otherwise installation in Docker containers fails with:
```
...
Adding new user `nimbus' (UID 101) with group `nimbus' ...
Not creating home directory `/home/nimbus'.
/var/lib/dpkg/info/nimbus-beacon-node.postinst: 39: systemctl: not found
dpkg: error processing package nimbus-beacon-node (--configure):
installed nimbus-beacon-node package post-installation script subprocess returned error exit status 127
Errors were encountered while processing:
nimbus-beacon-node
E: Sub-process /usr/bin/dpkg returned an error code (1)
```
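One way to avoid this failure, sketched under the assumption that service management should simply be skipped when systemd is absent. The wrapper name is hypothetical and the actual fix may differ:

```sh
#!/bin/sh
# Hypothetical helper: run a systemctl subcommand only when systemctl exists,
# e.g. on a regular host but not inside a minimal Docker image.
systemctl_if_available() {
  if command -v systemctl >/dev/null 2>&1; then
    systemctl "$@"
  else
    echo "systemctl not found, skipping: systemctl $*"
  fi
}
```

A call such as `systemctl_if_available daemon-reload` then no longer makes the `postinst` script exit with status 127 when the command is missing.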
Signed-off-by: Jakub Sokołowski <jakub@status.im>
* Working Makefile targets for Capella devnet2
make capella-devnet-2
make clean-capella-devnet-2
You'll need to have https://github.com/tmuxinator/tmuxinator installed.
It's available as a regular package in most Linux distributions or through
Nix or Brew on macOS.
This commit also fixes the initial hang in the Eth1 monitor in the "find
TTD block" procedure through a fix to the network metadata files which
hasn't been upstreamed yet.
Other changes:
* Disabled Geth snap sync in the simulation
When all Geth nodes are configured to run with snap sync enabled, they all
start snap sync after the first forkchoiceUpdated which causes the BNs to
skip validator duties because the EL is syncing. The snap sync never completes
due to poor connectivity between the Geth nodes in the simulation.
libp2p issues related to operation cancellations have been addressed in
https://github.com/status-im/nim-libp2p/pull/816
This means we can once more enable `--sync-light-client` in CI, without
having to deal with spurious CI failures due to the cancellation issues.
Other changes:
* More optimal search for TTD block.
* Add timeouts to all REST requests during trusted node sync.
Fixes #4037
* Removed support for storing a deposit snapshot in the network
metadata.
Since the sync committee duties are no longer updated on every slot
and previously the sync committee aggregators selection proofs were
generated during the duties update, this now resulted in the client
using stale selection proofs (they must be generated at each slot).
The fix consists of moving the selection proof generation logic into
a separate function which is properly executed on each slot.
Other changes:
* The logtrace tool has been enhanced with a framework for adding
new simpler log aggregation and analysis algorithms.
The default CI testnet simulation will now ensure that the blocks
in the network have reasonable sync committee participation.
The local testnet simulation currently waits 5 seconds when starting each
individual Geth instance. Waiting a shorter amount saves almost a minute
per minimal + mainnet CI finalization job.
Measured startup times per Geth: Linux ~100ms, macOS Intel ~300ms.
Launching multiple local testnet simulations sequentially can lead to
existing EL processes from prior failed/aborted runs not being stopped
properly, subsequently leading to hard-to-debug CI test failures.
Fixing the cleanup logic addresses this problem.