discv4: Fix Kademlia crash when trying to sync (#342)

Fixes status-im/nim-eth#341, status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a few
minutes.  It occurs for most of us on Eth1 mainnet, and for everyone on Ropsten.

The cause is `findNode` being called twice in succession to the same peer,
within about 5 seconds of each other.  ("About" 5 seconds, because Chronos does
not guarantee to run the timeout branch at a particular time, due to queuing
and clock reading delays.)

Then `findNode` sends a duplicate message to the peer and calls
`waitNeighbours` to listen for the reply.  There is already a `waitNeighbours`
callback for that peer in a shared table, so that function hits an assert
failure.

Ignoring the assert would be wrong, as it would break the timeout logic, and
sending `FindNode` twice in rapid succession also makes us a bad peer.

As a simple workaround, just skip `findNode` in this state and return a fake
empty `Neighbours` reply.  This is a bit of a hack, as `findNode` should not be
called like this; there is a logic error at a higher level.  But it works.
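The guard pattern behind the workaround can be sketched in isolation: a shared
table of in-flight callbacks where a duplicate registration is detected and
refused (returning an empty result) instead of asserting.  This is a minimal
Python sketch, not the nim-eth code; the names (`RequestTable`, `find_node`,
`query_peer`) are illustrative only.

```python
# Sketch of the duplicate-request guard, assuming one in-flight
# request per peer.  All names here are hypothetical, not nim-eth APIs.

class RequestTable:
    def __init__(self):
        self._in_flight = {}  # peer id -> callback

    def start(self, peer, callback):
        # The original code asserted no callback existed for this peer;
        # the workaround detects the duplicate and bails out instead.
        if peer in self._in_flight:
            return False  # duplicate in-flight request
        self._in_flight[peer] = callback
        return True

    def finish(self, peer):
        self._in_flight.pop(peer, None)

def find_node(table, peer):
    # Mirrors the workaround: a duplicate call yields a fake empty
    # "Neighbours" reply rather than a second wire request.
    if not table.start(peer, callback=lambda neighbours: neighbours):
        return []
    try:
        # Stand-in for sendFindNode + waitNeighbours.
        neighbours = query_peer(peer)
    finally:
        table.finish(peer)
    return neighbours

def query_peer(peer):
    # Dummy network reply for demonstration.
    return ["node-1", "node-2"]
```

The key design point is that `finish` must run even on timeout or error
(hence the `try`/`finally`), so the table entry is always cleared and a later
retry is not permanently blocked.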

Tested for about 4 days of constant operation on Ropsten.  The crash which
occurred every few minutes no longer occurs, and discv4 keeps working.

Signed-off-by: Jamie Lokier <jamie@shareable.org>
Jamie Lokier, 2021-04-02 22:29:02 +01:00, committed via GitHub
parent c5dace27ca
commit e4b4b7f4af
1 changed file with 7 additions and 0 deletions


@@ -371,6 +371,13 @@ proc lookup*(k: KademliaProtocol, nodeId: NodeId): Future[seq[Node]] {.async.} =
   var nodesSeen = initHashSet[Node]()
   proc findNode(nodeId: NodeId, remote: Node): Future[seq[Node]] {.async.} =
+    if remote in k.neighboursCallbacks:
+      # Sometimes findNode is called while another findNode is already in flight.
+      # It's a bug when this happens, and the logic should probably be fixed
+      # elsewhere. However, this small fix has been tested and proven adequate.
+      debug "Peer in k.neighboursCallbacks already", peer = remote
+      result = newSeqOfCap[Node](BUCKET_SIZE)
+      return
     k.wire.sendFindNode(remote, nodeId)
     var candidates = await k.waitNeighbours(remote)
     if candidates.len == 0: