Sync fix: `GetBlockBodies` logic preventing sync, dropping peers

Fixes #864 "Sync progress stops at Goerli block 4494913", and equivalent on
other networks.

The block body fetcher in `blockchain_sync.nim` made an incorrect assumption
about how peers respond to `GetBlockBodies`: it issued requests for N block
bodies and treated any reply containing fewer than N bodies as a failure, even
though such short replies are valid.

Having received up to 192 headers in a batch, it split the range into smaller
`GetBlockBodies` requests (128 + 64), fetched each reply, then concatenated
them.  When a peer answered with fewer bodies than requested, the concatenated
sequence was left with gaps against the header list, and Nimbus aborted the
whole batch.
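
For comparison, here is a minimal sketch, in the spirit of the fix but not the
patch itself, of a fetcher that consumes short replies instead of aborting:
hold the unanswered hashes and re-request them until none remain.
`requestBodies`, the `Hash`/`Body` stand-ins, and the 3-body cap are all
invented for this sketch.

```nim
import std/sequtils

type
  Hash = int          # stand-in for KeccakHash
  Body = string       # stand-in for BlockBody

proc requestBodies(hashes: seq[Hash]): seq[Body] =
  # Simulated peer: answers at most 3 bodies per request, the way
  # a real peer stops once its response size limit is reached.
  result = hashes[0 ..< min(hashes.len, 3)].mapIt("body" & $it)

proc fetchAll(hashes: seq[Hash]): seq[Body] =
  var remaining = hashes
  while remaining.len > 0:
    let got = requestBodies(remaining)
    # Empty and oversized replies are protocol errors; a short reply
    # is normal and answers a prefix of the request.
    doAssert got.len in 1 .. remaining.len
    result.add got
    remaining.delete(0, got.len - 1)   # keep only the unanswered tail

echo fetchAll(toSeq(1 .. 8)).len   # 8: every body arrives, over 3 round trips
```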

That meant it repeatedly fetched data, then discarded it, and fetched it again,
dropping good peers in the process.

Aborted and restarted batches occurred with earlier blocks too, but the effect
grew more pronounced until, at batch 4494913..4495104, no suitable peers
remained.

Here's a trace:

```
TRC 2021-09-29 02:40:24.977+01:00 Requesting block headers                   file=blockchain_sync.nim:224 start=4494913 count=192 peer=<ENODE>
TRC 2021-09-29 02:40:24.977+01:00 >> Sending eth.GetBlockHeaders (0x03)      file=protocol_eth65.nim:51 peer=<PEER> startBlock=4494913 max=192
TRC 2021-09-29 02:40:25.005+01:00 << Got reply eth.BlockHeaders (0x04)       file=protocol_eth65.nim:51 peer=<PEER> count=192
TRC 2021-09-29 02:40:25.007+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=128
TRC 2021-09-29 02:40:25.209+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=13
TRC 2021-09-29 02:40:25.210+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=64
TRC 2021-09-29 02:40:25.290+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=64
WRN 2021-09-29 02:40:25.306+01:00 Bodies len != headers.len                  file=blockchain_sync.nim:276 bodies=77 headers=192
TRC 2021-09-29 02:40:25.306+01:00 peer disconnected                          file=blockchain_sync.nim:403 peer=<PEER>
TRC 2021-09-29 02:40:25.306+01:00 Finished obtaining blocks                  file=blockchain_sync.nim:303 peer=<PEER>
```

In practice, modern peers send shorter replies than Nimbus assumed, depending
on the sizes of the blocks on the chain: Geth and Erigon apply a 2 MiB soft
limit to `BlockBodies` responses, and OpenEthereum a 4 MiB one.  In the trace
above, the first reply stopped at 13 of the 128 requested bodies; the second
sub-request for 64 was answered in full, leaving 77 bodies against 192
headers, so the batch was aborted and the peer dropped.
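
A hedged sketch of the serving side, under a deliberately simplified model of
those limits (`respondBodies` and the numbers are illustrative, not Geth's
actual code): the responder fills the reply until the size budget is spent, so
the larger the blocks, the fewer bodies per reply.

```nim
import std/sequtils

proc respondBodies(bodySizes: seq[int], softLimit: int): int =
  # How many of the requested bodies fit within the response budget.
  # At least one body is always returned, mirroring real servers.
  var used = 0
  for size in bodySizes:
    if result > 0 and used + size > softLimit:
      break
    used += size
    inc result

# 128 bodies of ~170 KiB each against a 2 MiB budget: about 12 fit,
# the same kind of short reply (count=13) as in the trace above.
echo respondBodies(newSeqWith(128, 170 * 1024), 2 * 1024 * 1024)
```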

Up to Berlin (EIP-2929), Nimbus's fetcher failed often, but there were still
some peers serving what Nimbus needed.

Just after the start of Berlin, at batch 4494913..4495104 on Goerli, no peer
responded with full-size replies for the whole batch, so Nimbus couldn't
progress past that point.  Even before then, though, large blocks triggered
the same failure mode: good peers were dropped and the same block data fetched
repeatedly.

Signed-off-by: Jamie Lokier <jamie@shareable.org>

The change, in `blockchain_sync.nim`:

```diff
@@ -6,7 +6,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.
 
 import
-  std/[sets, options, random, hashes],
+  std/[sets, options, random, hashes, sequtils],
   chronos, chronicles,
   eth/common/eth_types,
   eth/[p2p, p2p/private/p2p_types, p2p/rlpx, p2p/peer_pool],
@@ -239,8 +239,27 @@ proc obtainBlocksFromPeer(syncCtx: SyncContext, peer: Peer) {.async.} =
           count=results.get.headers.len
         shallowCopy(workItem.headers, results.get.headers)
-        var bodies = newSeq[BlockBody]()
-        var hashes = newSeq[KeccakHash]()
+        var bodies = newSeqOfCap[BlockBody](workItem.headers.len)
+        var hashes = newSeqOfCap[KeccakHash](maxBodiesFetch)
+
+        template fetchBodies() =
+          tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
+            count=hashes.len
+          let b = await peer.getBlockBodies(hashes)
+          if b.isNone:
+            raise newException(CatchableError, "Was not able to get the block bodies")
+          let bodiesLen = b.get.blocks.len
+          tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
+            count=bodiesLen
+          if bodiesLen == 0:
+            raise newException(CatchableError, "Zero block bodies received for request")
+          elif bodiesLen < hashes.len:
+            hashes.delete(0, bodiesLen - 1)
+          elif bodiesLen == hashes.len:
+            hashes.setLen(0)
+          else:
+            raise newException(CatchableError, "Too many block bodies received for request")
+          bodies.add(b.get.blocks)
 
         var nextIndex = workItem.startIndex
         for i in workItem.headers:
           if i.blockNumber != nextIndex:
@@ -249,25 +268,10 @@ proc obtainBlocksFromPeer(syncCtx: SyncContext, peer: Peer) {.async.} =
             nextIndex = nextIndex + 1
           hashes.add(blockHash(i))
           if hashes.len == maxBodiesFetch:
-            tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
-              count=hashes.len
-            let b = await peer.getBlockBodies(hashes)
-            if b.isNone:
-              raise newException(CatchableError, "Was not able to get the block bodies.")
-            tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
-              count=b.get.blocks.len
-            hashes.setLen(0)
-            bodies.add(b.get.blocks)
+            fetchBodies()
 
-        if hashes.len != 0:
-          tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
-            count=hashes.len
-          let b = await peer.getBlockBodies(hashes)
-          if b.isNone:
-            raise newException(CatchableError, "Was not able to get the block bodies.")
-          tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
-            count=b.get.blocks.len
-          bodies.add(b.get.blocks)
+        while hashes.len != 0:
+          fetchBodies()
 
         if bodies.len == workItem.headers.len:
           shallowCopy(workItem.bodies, bodies)
```
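
The extracted `fetchBodies` is a Nim `template` rather than a `proc`: its body
is expanded at each call site, so it can `await` inside the enclosing async
proc and mutate the surrounding `bodies` and `hashes` without parameter
plumbing.  On a short reply, `hashes.delete(0, bodiesLen - 1)` keeps only the
unanswered tail, and the closing `while hashes.len != 0` loop re-requests
exactly what is still missing.  A minimal illustration of the capture
behaviour (names invented for the example):

```nim
var counter = 0

template bump() =
  # A template body is inlined at the call site, so `counter`
  # here is the surrounding variable, not a copy or a parameter.
  counter.inc

bump()
bump()
echo counter   # 2
```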