Sync fix: `GetBlockBodies` logic preventing sync, dropping peers

Fixes #864 "Sync progress stops at Goerli block 4494913", and equivalent on
other networks.

The block body fetcher in `blockchain_sync.nim` made an incorrect assumption
about how peers respond to `GetBlockBodies`: it issued requests for N block
bodies and mishandled replies containing fewer than N bodies, even though the
`eth` protocol allows peers to return fewer entries than were requested.

Having received up to 192 headers in a batch, it split the range into smaller
`GetBlockBodies` requests, fetched each reply, then combined the replies.  In
effect, Nimbus requested batches of 128+64 block bodies, received partial
replies, and then aborted the whole batch because the combined body count no
longer matched the header count.

That meant it repeatedly fetched data, then discarded it, and fetched it again,
dropping good peers in the process.

Aborted and restarted batches occurred with earlier blocks too, but the problem
became more pronounced until, at batch 4494913..4495104, there were no suitable
peers left.

Here's a trace:

```
TRC 2021-09-29 02:40:24.977+01:00 Requesting block headers                   file=blockchain_sync.nim:224 start=4494913 count=192 peer=<ENODE>
TRC 2021-09-29 02:40:24.977+01:00 >> Sending eth.GetBlockHeaders (0x03)      file=protocol_eth65.nim:51 peer=<PEER> startBlock=4494913 max=192
TRC 2021-09-29 02:40:25.005+01:00 << Got reply eth.BlockHeaders (0x04)       file=protocol_eth65.nim:51 peer=<PEER> count=192
TRC 2021-09-29 02:40:25.007+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=128
TRC 2021-09-29 02:40:25.209+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=13
TRC 2021-09-29 02:40:25.210+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=64
TRC 2021-09-29 02:40:25.290+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=64
WRN 2021-09-29 02:40:25.306+01:00 Bodies len != headers.len                  file=blockchain_sync.nim:276 bodies=77 headers=192
TRC 2021-09-29 02:40:25.306+01:00 peer disconnected                          file=blockchain_sync.nim:403 peer=<PEER>
TRC 2021-09-29 02:40:25.306+01:00 Finished obtaining blocks                  file=blockchain_sync.nim:303 peer=<PEER>
```

In practice, with modern peers, Nimbus received shorter replies than it assumed,
depending on the block sizes on the chain: Geth and Erigon apply a 2 MiB soft
limit to the `BlockBodies` response size, and OpenEthereum applies a 4 MiB limit.
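
As a back-of-envelope sketch of why such a limit yields short replies (the
~160 KiB body size below is a hypothetical figure chosen only to land near the
count=13 reply in the trace; real clients measure the encoded response size and
their exact cutoff rules differ):

```nim
import std/sequtils

# Illustrative only: how a serving peer's response soft limit truncates a
# 128-body request.  2 MiB is the Geth/Erigon soft limit mentioned above.
const softLimitBytes = 2 * 1024 * 1024

proc bodiesServed(bodySizes: openArray[int]): int =
  ## Count bodies included before the accumulated size passes the soft limit.
  ## (Exact cutoff behaviour varies between clients; this is a sketch.)
  var total = 0
  for size in bodySizes:
    total += size
    inc result
    if total >= softLimitBytes:
      break

# With hypothetical ~160 KiB bodies, only 13 of 128 requested bodies fit,
# the same order as the count=13 reply in the trace above.
echo bodiesServed(newSeqWith(128, 160 * 1024))
```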

Up to Berlin (EIP-2929), Nimbus's fetcher failed often, but there were still
some peers serving what Nimbus needed.

Just after the start of Berlin, at batch 4494913..4495104 on Goerli, no peers
responded with full-size replies for the whole batch, so Nimbus couldn't
progress past that point.  But the problem was already occurring before then
with large blocks, dropping good peers and repeatedly fetching the same block
data.

Signed-off-by: Jamie Lokier <jamie@shareable.org>

```diff
@@ -6,7 +6,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.
 import
-  std/[sets, options, random, hashes],
+  std/[sets, options, random, hashes, sequtils],
   chronos, chronicles,
   eth/common/eth_types,
   eth/[p2p, p2p/private/p2p_types, p2p/rlpx, p2p/peer_pool],
@@ -239,8 +239,27 @@ proc obtainBlocksFromPeer(syncCtx: SyncContext, peer: Peer) {.async.} =
       count=results.get.headers.len
     shallowCopy(workItem.headers, results.get.headers)
 
-    var bodies = newSeq[BlockBody]()
-    var hashes = newSeq[KeccakHash]()
+    var bodies = newSeqOfCap[BlockBody](workItem.headers.len)
+    var hashes = newSeqOfCap[KeccakHash](maxBodiesFetch)
+    template fetchBodies() =
+      tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
+        count=hashes.len
+      let b = await peer.getBlockBodies(hashes)
+      if b.isNone:
+        raise newException(CatchableError, "Was not able to get the block bodies")
+      let bodiesLen = b.get.blocks.len
+      tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
+        count=bodiesLen
+      if bodiesLen == 0:
+        raise newException(CatchableError, "Zero block bodies received for request")
+      elif bodiesLen < hashes.len:
+        hashes.delete(0, bodiesLen - 1)
+      elif bodiesLen == hashes.len:
+        hashes.setLen(0)
+      else:
+        raise newException(CatchableError, "Too many block bodies received for request")
+      bodies.add(b.get.blocks)
+
     var nextIndex = workItem.startIndex
     for i in workItem.headers:
       if i.blockNumber != nextIndex:
@@ -249,25 +268,10 @@ proc obtainBlocksFromPeer(syncCtx: SyncContext, peer: Peer) {.async.} =
       nextIndex = nextIndex + 1
       hashes.add(blockHash(i))
       if hashes.len == maxBodiesFetch:
-        tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
-          count=hashes.len
-        let b = await peer.getBlockBodies(hashes)
-        if b.isNone:
-          raise newException(CatchableError, "Was not able to get the block bodies.")
-        tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
-          count=b.get.blocks.len
-        hashes.setLen(0)
-        bodies.add(b.get.blocks)
+        fetchBodies()
 
-    if hashes.len != 0:
-      tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
-        count=hashes.len
-      let b = await peer.getBlockBodies(hashes)
-      if b.isNone:
-        raise newException(CatchableError, "Was not able to get the block bodies.")
-      tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
-        count=b.get.blocks.len
-      bodies.add(b.get.blocks)
+    while hashes.len != 0:
+      fetchBodies()
 
     if bodies.len == workItem.headers.len:
       shallowCopy(workItem.bodies, bodies)
```
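
For reference, a standalone sketch (separate from the real networking code
above) of how the reworked loop consumes partial replies: each reply drops the
served prefix from `hashes`, and the `while` loop keeps requesting whatever
remains instead of aborting the batch.  The fake peer here serves at most two
bodies per request, as a stand-in for a size-limited reply.

```nim
import std/sequtils   # range form of `delete`, as in the patch's new import

# Standalone illustration of the new loop's behaviour, not the real fetcher.
var hashes = @["h1", "h2", "h3", "h4", "h5"]

proc fakeReplyLen(requested: int): int =
  ## Stand-in for a size-limited peer: serves at most 2 bodies per request.
  min(requested, 2)

while hashes.len != 0:
  let served = fakeReplyLen(hashes.len)
  if served == hashes.len:
    hashes.setLen(0)               # whole request served
  else:
    hashes.delete(0, served - 1)   # drop served prefix, keep the unserved tail
  echo "served ", served, ", remaining ", hashes.len
```

This is also why the patch adds `sequtils` to the imports: the range form of
`delete` used on `hashes` comes from there.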