nimbus-eth1/nimbus/p2p
Jamie Lokier 40fbed49cf
Sync fix: `GetBlockBodies` logic preventing sync, dropping peers
Fixes #864 "Sync progress stops at Goerli block 4494913", and equivalent on
other networks.

The block body fetcher in `blockchain_sync.nim` had an incorrect assumption
about how peers respond to `GetBlockBodies`.  It was issuing requests for N
block bodies and incorrectly handling replies which contained fewer than N
bodies.

Having received up to 192 headers in a batch, it split the range into smaller
`GetBlockBodies` requests, fetched each reply, then combined replies.  The
effect was Nimbus requested batches of 128+64 block bodies, received gaps in
the reply sequence, then aborted.

That meant it repeatedly fetched data, then discarded it, and fetched it again,
dropping good peers in the process.

Aborted and restarted batches occurred with earlier blocks too, but this became
more pronounced until there were no suitable peers at batch 4494913..4495104.

Here's a trace:

```
TRC 2021-09-29 02:40:24.977+01:00 Requesting block headers                   file=blockchain_sync.nim:224 start=4494913 count=192 peer=<ENODE>
TRC 2021-09-29 02:40:24.977+01:00 >> Sending eth.GetBlockHeaders (0x03)      file=protocol_eth65.nim:51 peer=<PEER> startBlock=4494913 max=192
TRC 2021-09-29 02:40:25.005+01:00 << Got reply eth.BlockHeaders (0x04)       file=protocol_eth65.nim:51 peer=<PEER> count=192
TRC 2021-09-29 02:40:25.007+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=128
TRC 2021-09-29 02:40:25.209+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=13
TRC 2021-09-29 02:40:25.210+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=64
TRC 2021-09-29 02:40:25.290+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=64
WRN 2021-09-29 02:40:25.306+01:00 Bodies len != headers.len                  file=blockchain_sync.nim:276 bodies=77 headers=192
TRC 2021-09-29 02:40:25.306+01:00 peer disconnected                          file=blockchain_sync.nim:403 peer=<PEER>
TRC 2021-09-29 02:40:25.306+01:00 Finished obtaining blocks                  file=blockchain_sync.nim:303 peer=<PEER>
```

In practice, for modern peers, Nimbus received shorter replies than it assumed
depending on the block sizes on the chain.  Geth/Erigon has 2MiB `BlockBodies`
response size soft limit.  OpenEthereum has 4MiB.

Up to Berlin (EIP-2929), Nimbus's fetcher failed often, but there were still
some peers serving what Nimbus needed.

Just after the start of Berlin, at batch 4494913..4495104 on Goerli, zero peers
responded with full size replies for the whole batch, so Nimbus couldn't
progress past that point.  But there was already a problem happening before
that for large blocks, dropping good peers and repeatedly fetching the same
block data.

Signed-off-by: Jamie Lokier <jamie@shareable.org>
2021-10-19 10:20:26 +01:00
..
chain config: fix new config based on input from jamie and zahary 2021-09-18 17:34:51 +07:00
clique fixes comments in clique.seal func 2021-08-24 16:14:17 +07:00
executor config: replace stdlib parseOpt with nim-confutils 2021-09-18 17:34:46 +07:00
validate Feature/implement poa processing (#748) 2021-07-14 16:13:27 +01:00
blockchain_sync.nim Sync fix: `GetBlockBodies` logic preventing sync, dropping peers 2021-10-19 10:20:26 +01:00
chain.nim Feature/implement poa processing (#748) 2021-07-14 16:13:27 +01:00
clique.nim clique: connect period and epoch from chain_config to engine 2021-08-11 17:42:41 +07:00
dao.nim replace state_db with accounts_cache 2020-05-30 10:14:59 +07:00
executor.nim Feature/implement poa processing (#748) 2021-07-14 16:13:27 +01:00
gaslimit.nim config: replace stdlib parseOpt with nim-confutils 2021-09-18 17:34:46 +07:00
validate.nim fix missing EIP-799 extra range check in header validation 2021-10-12 11:06:39 +07:00