From e4b4b7f4af40b482774e4c284ab1bc0ad600cd2b Mon Sep 17 00:00:00 2001
From: Jamie Lokier
Date: Fri, 2 Apr 2021 22:29:02 +0100
Subject: [PATCH] discv4: Fix Kademlia crash when trying to sync (#342)

Fixes status-im/nim-eth#341, status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a
few minutes. It occurs for most of us on Eth1 mainnet, and for everyone
on Ropsten.

The cause is `findNode` being called twice in succession to the same
peer, within about 5 seconds of each other. ("About" 5 seconds, because
Chronos does not guarantee to run the timeout branch at a particular
time, due to queuing and clock-reading delays.)

The second call sends a duplicate `FindNode` message to the peer and
calls `waitNeighbours` to listen for the reply. There is already a
`waitNeighbours` callback for that peer in a shared table, so that
function hits an assert failure.

Ignoring the assert would be wrong, as it would break the timeout
logic, and sending `FindNode` twice in rapid succession would also make
us a bad peer.

As a simple workaround, skip `findNode` in this state and return a fake
empty `Neighbours` reply. This is a bit of a hack, as `findNode` should
not be called like this; there is a logic error at a higher level. But
it works.

Tested for about 4 days of constant operation on Ropsten. The crash
which occurred every few minutes no longer occurs, and discv4 keeps
working.

Signed-off-by: Jamie Lokier
---
 eth/p2p/kademlia.nim | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/eth/p2p/kademlia.nim b/eth/p2p/kademlia.nim
index bd3c70f..7ef2b92 100644
--- a/eth/p2p/kademlia.nim
+++ b/eth/p2p/kademlia.nim
@@ -371,6 +371,13 @@ proc lookup*(k: KademliaProtocol, nodeId: NodeId): Future[seq[Node]] {.async.} =
   var nodesSeen = initHashSet[Node]()
 
   proc findNode(nodeId: NodeId, remote: Node): Future[seq[Node]] {.async.} =
+    if remote in k.neighboursCallbacks:
+      # Sometimes findNode is called while another findNode is already in flight.
+      # It's a bug when this happens, and the logic should probably be fixed
+      # elsewhere. However, this small fix has been tested and proven adequate.
+      debug "Peer in k.neighboursCallbacks already", peer = remote
+      result = newSeqOfCap[Node](BUCKET_SIZE)
+      return
     k.wire.sendFindNode(remote, nodeId)
     var candidates = await k.waitNeighbours(remote)
     if candidates.len == 0:
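
For illustration, here is a minimal, self-contained sketch of the pattern
the commit message describes. Only the names `neighboursCallbacks`,
`waitNeighbours`, and `findNode` are taken from kademlia.nim; everything
else (the `uint64` stand-in for NodeId, the proc bodies, the bare
`doAssert`) is assumed for the sketch and is not the real nim-eth code:

import std/tables

type
  NodeId = uint64   # stand-in: the real discv4 NodeId is a 256-bit integer
  NeighboursCallback = proc (neighbours: seq[NodeId])

var neighboursCallbacks = initTable[NodeId, NeighboursCallback]()

proc waitNeighbours(remote: NodeId, cb: NeighboursCallback) =
  # The shared-table invariant from the commit message: at most one
  # outstanding FindNode per peer, so a duplicate registration asserts.
  doAssert remote notin neighboursCallbacks
  neighboursCallbacks[remote] = cb

proc findNode(remote: NodeId) =
  # The workaround from the patch: if a callback is already pending for
  # this peer, skip the duplicate FindNode (the caller then sees an empty
  # Neighbours reply) instead of reaching the doAssert above.
  if remote in neighboursCallbacks:
    return
  waitNeighbours(remote, proc (neighbours: seq[NodeId]) = discard)
  # ... the real code sends the FindNode packet to the peer here ...

when isMainModule:
  findNode(42)   # registers the callback for peer 42
  findNode(42)   # without the guard, this second call would assert

Running the sketch, the second `findNode(42)` returns immediately instead
of tripping the `doAssert`, which is the behaviour the seven-line patch
above adds.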