The findNode and findNodeFast operations were using the default aggressive
removal threshold (1.0) when timing out, while other timeout operations
(ping, talkReq, getProviders) correctly used NoreplyRemoveThreshold (0.5).
This inconsistency caused nodes with excellent reliability (1.0) to be removed
during heavy load scenarios when findNode/findNodeFast operations timed out,
even though the nodes were still healthy and simply slow to respond.
Changed findNode and findNodeFast timeout paths to use NoreplyRemoveThreshold,
ensuring consistent and more tolerant behavior across all timeout scenarios.
This aligns with Kademlia's recommendation to be conservative about removing
nodes, especially during temporary network congestion.
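
A minimal sketch of the effect of the two thresholds follows. It is not the
project's code: the Node class, the smoothing factor ALPHA, the way reliability
is lowered on a timeout, and the "remove if below threshold" comparison are all
assumptions made for illustration. Only the two threshold values (1.0 and 0.5)
come from the text above.

```python
from dataclasses import dataclass

DEFAULT_REMOVE_THRESHOLD = 1.0  # aggressive default named in the text
NOREPLY_REMOVE_THRESHOLD = 0.5  # value of NoreplyRemoveThreshold named in the text
ALPHA = 0.1                     # hypothetical smoothing factor, not from the source


@dataclass
class Node:
    """Hypothetical routing table entry with a reliability score in [0, 1]."""
    node_id: str
    reliability: float = 1.0


def record_timeout(node: Node, alpha: float = ALPHA) -> None:
    # Assumed update rule: an exponential moving average where a missed
    # reply counts as 0, so each timeout scales reliability by (1 - alpha).
    node.reliability *= (1.0 - alpha)


def on_timeout(table: dict, node: Node, threshold: float) -> bool:
    """Record a timeout and remove the node only if it fell below threshold.

    Returns True if the node was removed from the table.
    """
    record_timeout(node)
    if node.reliability < threshold:
        table.pop(node.node_id, None)
        return True
    return False
```

Under these assumptions, a single timeout removes a previously perfect node
when the threshold is 1.0, while the same node survives with a threshold of
0.5 and is only removed after repeated consecutive timeouts.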
Evidence from logs showing the issue:
DBG - Node added to routing table topics="discv5 routingtable" tid=1 n=1ff*7a561e:10.244.0.208:6890
DBG - bucket topics="discv5" tid=1 depth=0 len=2 standby=0
DBG - node topics="discv5" tid=1 n=130*db8a1b:10.244.2.207:6890 rttMin=1 rttAvg=2 reliability=1.0
DBG - node topics="discv5" tid=1 n=1ff*7a561e:10.244.0.208:6890 rttMin=1 rttAvg=14 reliability=1.0
DBG - Node removed from routing table topics="discv5 routingtable" tid=1 n=1ff*7a561e:10.244.0.208:6890
DBG - Total nodes in discv5 routing table topics="discv5" tid=1 total=1
DBG - bucket topics="discv5" tid=1 depth=0 len=1 standby=0
DBG - node topics="discv5" tid=1 n=130*db8a1b:10.244.2.207:6890 rttMin=1 rttAvg=165 reliability=0.957
DBG - Node removed from routing table topics="discv5 routingtable" tid=1 n=130*db8a1b:10.244.2.207:6890
DBG - Total nodes in discv5 routing table topics="discv5" tid=1 total=0
First entry shows a node with perfect reliability (1.0) and 14ms RTT being
removed. Second shows a node with 95.7% reliability also being evicted.
Signed-off-by: Chrysostomos Nanakos <chris@include.gr>
UDP packets get lost easily. We can't just remove
nodes from the routing table at first loss, as it can
create issues in small networks and in cases of temporary
connection failures.
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
* Clearer logs for adding and removing nodes; adds a routingtable log topic for filtering
* Makes node ID shortening consistent with other short-id formats
* Removes a redundant else block
* Fixes dependencies
We really don't need these to be 2 and 4 seconds.
Later we should tune them better based on measurements
or estimates. We should also check the relation between
these three values.
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
We do not need that many responses with FindNodeFast, since the
responses can be ordered by distance.
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Initialize the wait for the response before sending the request.
This is needed in cases where the response arrives before
execution moves on to the next instruction, such as in a directly
connected test.
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
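
The ordering described in the last commit above can be sketched as follows.
This is an illustration only, written in Python with asyncio rather than the
project's own code: LoopbackTransport, Client, and the request-id scheme are
hypothetical names. The transport delivers the reply inside send(), which
mimics a directly connected test where the response arrives before the caller
reaches its next instruction.

```python
import asyncio


class LoopbackTransport:
    """Hypothetical transport that delivers the reply synchronously in send()."""

    def __init__(self):
        self.on_response = None  # callback set by the client

    def send(self, request_id, payload):
        # The reply arrives before send() returns.
        self.on_response(request_id, payload.upper())


class Client:
    def __init__(self, transport):
        self.transport = transport
        self.pending = {}  # request_id -> future awaiting the reply
        transport.on_response = self._handle_response

    def _handle_response(self, request_id, payload):
        fut = self.pending.pop(request_id, None)
        if fut is not None and not fut.done():
            fut.set_result(payload)
        # If nobody is registered yet, the reply is silently dropped.

    async def request(self, request_id, payload, timeout=1.0):
        fut = asyncio.get_running_loop().create_future()
        # Register the waiter BEFORE sending, so an immediate reply is caught.
        self.pending[request_id] = fut
        self.transport.send(request_id, payload)
        try:
            return await asyncio.wait_for(fut, timeout)
        finally:
            self.pending.pop(request_id, None)


def demo():
    client = Client(LoopbackTransport())
    return asyncio.run(client.request(1, "ping"))
```

If the registration line were moved after the send() call, the loopback reply
would find no waiter in pending and the request would time out even though a
response was delivered.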