8e67d8eaeb
Occasionally we are seeing the go-test-api job time out at 10 minutes. Looking at the stack trace I saw the following:

1. Lots of tests blocked on server.Stop in NewTestServerConfigT. This suggests that SIGINT is being sent to the server, but the server is not properly shutting down.

2. Over 20k goroutines that look like this:

    goroutine 16355 [select, 8 minutes]:
    net/http.(*persistConn).readLoop(0xc004270240)
        /usr/local/go/src/net/http/transport.go:2099 +0x99e
    created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:1647 +0xc56

Issue 1 seems to be the main problem, but debugging that directly is not possible because our buffered logs do not get sent when the tests time out. To mitigate this problem I've added a timeout to the cmd.Wait() to force kill the process and return an error. Unfortunately, because we retry this operation, we still may not see the cause because the next attempt will likely pass. I'm tempted to remove the retry around NewTestServerConfigT.

Issue 2 seems to be caused by not closing the response body. Since the request is performed many times in a loop, many goroutines are created and are not closed until the response body is closed.
Internal SDK
Please note that this folder, while public, is not meant for new consumers of these libraries; it should currently be considered an internal, not external, SDK. It is public due to existing needs from other HashiCorp software. The tags in this folder will stay at the 0.x.y level; accordingly, users should expect that things can move around, disappear, or change API at any time.