From e4b4b7f4af40b482774e4c284ab1bc0ad600cd2b Mon Sep 17 00:00:00 2001
From: Jamie Lokier
Date: Fri, 2 Apr 2021 22:29:02 +0100
Subject: [PATCH] discv4: Fix Kademlia crash when trying to sync (#342)

Fixes status-im/nim-eth#341, status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a
few minutes. It occurs for most of us on Eth1 mainnet, and for everyone
on Ropsten.

The cause is `findNode` being called twice in succession to the same
peer, within about 5 seconds of each other. ("About" 5 seconds, because
Chronos does not guarantee to run the timeout branch at a particular
time, due to queuing and clock-reading delays.)

The second call sends a duplicate `FindNode` message to the peer and
calls `waitNeighbours` to listen for the reply. There is already a
`waitNeighbours` callback for that peer in a shared table, so that
function hits an assert failure.

Ignoring the assert would be wrong, as it would break the timeout
logic, and sending `FindNode` twice in rapid succession would also make
us a bad peer.

As a simple workaround, skip `findNode` in this state and return a fake
empty `Neighbours` reply. This is a bit of a hack, as `findNode` should
not be called like this; there is a logic error at a higher level. But
it works.

Tested for about 4 days of constant operation on Ropsten. The crash
which occurred every few minutes no longer occurs, and discv4 keeps
working.

Signed-off-by: Jamie Lokier
---
 eth/p2p/kademlia.nim | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/eth/p2p/kademlia.nim b/eth/p2p/kademlia.nim
index bd3c70f..7ef2b92 100644
--- a/eth/p2p/kademlia.nim
+++ b/eth/p2p/kademlia.nim
@@ -371,6 +371,13 @@ proc lookup*(k: KademliaProtocol, nodeId: NodeId): Future[seq[Node]] {.async.} =
   var nodesSeen = initHashSet[Node]()
 
   proc findNode(nodeId: NodeId, remote: Node): Future[seq[Node]] {.async.} =
+    if remote in k.neighboursCallbacks:
+      # Sometimes findNode is called while another findNode is already in flight.
+      # It's a bug when this happens, and the logic should probably be fixed
+      # elsewhere. However, this small fix has been tested and proven adequate.
+      debug "Peer in k.neighboursCallbacks already", peer = remote
+      result = newSeqOfCap[Node](BUCKET_SIZE)
+      return
     k.wire.sendFindNode(remote, nodeId)
     var candidates = await k.waitNeighbours(remote)
     if candidates.len == 0:
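
For illustration, here is a minimal, self-contained sketch of the pattern
the commit message describes. Only the names `neighboursCallbacks`,
`waitNeighbours`, and `findNode` are taken from kademlia.nim; everything
else (the `uint64` stand-in for NodeId, the proc bodies, the bare
`doAssert`) is assumed for the sketch and is not the real nim-eth code:

import std/tables

type
  NodeId = uint64   # stand-in: the real discv4 NodeId is a 256-bit integer
  NeighboursCallback = proc (neighbours: seq[NodeId])

var neighboursCallbacks = initTable[NodeId, NeighboursCallback]()

proc waitNeighbours(remote: NodeId, cb: NeighboursCallback) =
  # The shared-table invariant from the commit message: at most one
  # outstanding FindNode per peer, so a duplicate registration asserts.
  doAssert remote notin neighboursCallbacks
  neighboursCallbacks[remote] = cb

proc findNode(remote: NodeId) =
  # The workaround from the patch: if a callback is already pending for
  # this peer, skip the duplicate FindNode (the caller then sees an empty
  # Neighbours reply) instead of reaching the doAssert above.
  if remote in neighboursCallbacks:
    return
  waitNeighbours(remote, proc (neighbours: seq[NodeId]) = discard)
  # ... the real code sends the FindNode packet to the peer here ...

when isMainModule:
  findNode(42)   # registers the callback for peer 42
  findNode(42)   # without the guard, this second call would assert

Running the sketch, the second `findNode(42)` returns immediately instead
of tripping the `doAssert`, which is the behaviour the seven-line patch
above adds.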