Commit Graph

4723 Commits

Author SHA1 Message Date
R.B. Boyer 3c44116a8f
cache: refactor agent cache fetching to prevent unnecessary fetches on error (#14956)
This continues the work done in #14908 where a crude solution to prevent a
goroutine leak was implemented. The former code would launch a perpetual
goroutine family every iteration (+1 +1) and the fixed code simply caused a
new goroutine family to first cancel the prior one to prevent the
leak (-1 +1 == 0).

This PR refactors this code completely to:

- make it more understandable
- remove the recursion-via-goroutine strangeness
- prevent unnecessary RPC fetches when the prior one has errored.

The core issue arose from a conflation of the entry.Fetching field to mean:

- there is an RPC (blocking query) in flight right now
- there is a goroutine running to manage the RPC fetch retry loop

The problem is that the goroutine-leak-avoidance check would treat
Fetching like (2), but within the body of a goroutine it would flip that
boolean back to false before the retry sleep. This would cause a new
chain of goroutines to launch which #14908 would correct crudely.

The refactored code uses a plain for-loop and changes the semantics
to track state for "is there a goroutine associated with this cache entry"
instead of the former.

We use a uint64 unique identity per goroutine instead of a boolean so
that any orphaned goroutines can tell when they've been replaced when
the expiry loop deletes a cache entry while the goroutine is still running
and is later replaced.
2022-10-25 10:27:26 -05:00
R.B. Boyer da70daba43
test: ensure that all dependencies in a test agent use the test logger (#14996) 2022-10-24 17:02:38 -05:00
Chris S. Kim 9f0ed81cfd Remove invalid 1xx HTTP codes
These tests started failing in go1.19, presumably due to
support for valid 1xx responses being added.

https://github.com/golang/go/issues/56346
2022-10-24 16:12:08 -04:00
Chris S. Kim bde57c0dd0 Regenerate files according to 1.19.2 formatter 2022-10-24 16:12:08 -04:00
cskh db82ffe503
fix(peering): replicating wan address (#15108)
* fix(peering): replicating wan address

* add changelog

* unit test
2022-10-24 15:44:57 -04:00
Iryna Shustava 176abb5ff2
proxycfg: watch service-defaults config entries (#15025)
To support Destinations on the service-defaults (for tproxy with terminating gateway), we need to now also make servers watch service-defaults config entries.
2022-10-24 12:50:28 -06:00
Chris S. Kim b236e86030 Move oss-only test to its own file 2022-10-24 14:17:43 -04:00
R.B. Boyer d04cf25fa8
test: fix flaky TestHealthServiceNodes_NodeMetaFilter by waiting until the streaming subsystem has a valid grpc connection (#15019)
Also potentially unflakes TestHealthIngressServiceNodes for similar
reasons.
2022-10-24 13:09:53 -05:00
R.B. Boyer 300860412c
chore: update golangci-lint to v1.50.1 (#15022) 2022-10-24 11:48:02 -05:00
Venu Yanamandra efc813e92d
Update error message when restoring ENT snapshot in OSS (#15066) 2022-10-24 11:40:26 -04:00
freddygv d65e60de86 Return forbidden on permission denied
This commit updates the establish endpoint to bubble up a 403 status
code to callers when the establishment secret from the token is invalid.
This is a signal that a new peering token must be generated.
2022-10-20 17:11:49 -06:00
Chris S. Kim a7ea26192b Update expected encoding in test
go-memdb was updated in v1.3.3 to make integers in indexes sortable, which changed how integers were encoded.
2022-10-20 14:32:42 -04:00
freddygv 6d9be5fb15 Use plain TaggedAddressWAN 2022-10-19 16:32:44 -06:00
freddygv 8d211cc9cc Add unit test 2022-10-19 16:26:15 -06:00
cskh 058ee4fb84 fix: wan address isn't used by peering token 2022-10-19 16:33:25 -04:00
Nitya Dhanushkodi 5e156772f6
Remove ability to specify external addresses in GenerateToken endpoint (#14930)
* Reverts "update generate token endpoint to take external addresses (#13844)"

This reverts commit f47319b7c6.
2022-10-19 09:31:36 -07:00
Kyle Havlovitz 5c3427608b
Merge pull request #15035 from hashicorp/vault-ttl-update-warn
Warn instead of returning error when missing intermediate mount tune permissions
2022-10-18 15:41:52 -07:00
cskh d562d363fc
peering: skip registering duplicate node and check from the peer (#14994)
* peering: skip register duplicate node and check from the peer

* Prebuilt the nodes map and checks map to avoid repeated for loop

* use key type to struct: node id, service id, and check id
2022-10-18 16:19:24 -04:00
Chris S. Kim 29a297d3e9
Refactor client RPC timeouts (#14965)
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.
2022-10-18 15:05:09 -04:00
Kyle Havlovitz d122108992 Warn instead of returning an error when intermediate mount tune permission is missing 2022-10-18 12:01:25 -07:00
R.B. Boyer 0cca4c088d
test: possibly fix flake in TestIntentionGetExact (#15021)
Restructure test setup to be similar to TestAgent_ServerCertificate
and see if that's enough to avoid flaking after join.
2022-10-18 10:51:20 -05:00
R.B. Boyer fe2d41ddad
cache: prevent goroutine leak in agent cache (#14908)
There is a bug in the error handling code for the Agent cache subsystem discovered:

1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in
   a loop (which backs off on-error up to 1 minute)

2. getWithIndex calls fetch if there’s no valid entry in the cache

3. fetch starts a goroutine which calls Fetch on the cache-type, waits
   for a while (again with backoff up to 1 minute for errors) and then
   calls fetch to trigger a refresh

The end result being that every 1 minute notifyBlockingQuery spawns an
ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by `fetch` cancels any prior
goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely
error, this patch prevented goroutine counts from skyrocketing.
2022-10-17 14:38:10 -05:00
R.B. Boyer 02a858efa0
ca: fix a masked bug in leaf cert generation that would not be notified of root cert rotation after the first one (#15005)
In practice this was masked by #14956 and was only uncovered fixing the
other bug.

  go test ./agent -run TestAgentConnectCALeafCert_goodNotLocal

would fail when only #14956 was fixed.
2022-10-17 13:24:27 -05:00
Chris S. Kim 3d2dffff16
Merge pull request #13388 from deblasis/feature/health-checks_windows_service
Feature: Health checks windows service
2022-10-17 09:26:19 -04:00
Dan Upton f8b4b41205
proxycfg: fix goroutine leak when service is re-registered (#14988)
Fixes a bug where we'd leak a goroutine in state.run when the given
context was canceled while there was a pending update.
2022-10-17 11:31:10 +01:00
Kyle Havlovitz aaf892a383 Extend tcp keepalive settings to work for terminating gateways as well 2022-10-14 17:05:46 -07:00
Kyle Havlovitz 2c569f6b9c Update docs and add tcp_keepalive_probes setting 2022-10-14 17:05:46 -07:00
Kyle Havlovitz 2242d1ec4a Add TCP keepalive settings to proxy config for mesh gateways 2022-10-14 17:05:46 -07:00
Derek Menteer 2a33d0ff96 Fix issue with incorrect method signature on test. 2022-10-14 11:04:57 -05:00
Freddy 24d0c8801a
Merge pull request #14981 from hashicorp/peering/dial-through-gateways 2022-10-14 09:44:56 -06:00
Dan Upton 328e3ff563
proxycfg: rate-limit delivery of config snapshots (#14960)
Adds a user-configurable rate limiter to proxycfg snapshot delivery,
with a default limit of 250 updates per second.

This addresses a problem observed in our load testing of Consul
Dataplane where updating a "global" resource such as a wildcard
intention or the proxy-defaults config entry could starve the Raft or
Memberlist goroutines of CPU time, causing general cluster instability.
2022-10-14 15:52:00 +01:00
Derek Menteer 29ebcf5ff0 Add tests for peering state snapshots / restores. 2022-10-14 09:48:04 -05:00
Derek Menteer e3ff9912d0 Add test for ExportedServicesForAllPeersByName 2022-10-14 09:48:04 -05:00
Dan Upton e6b55d1d81
perf: remove expensive reflection from xDS hot path (#14934)
Replaces the reflection-based implementation of proxycfg's
ConfigSnapshot.Clone with code generated by deep-copy.

While load testing server-based xDS (for consul-dataplane) we discovered
this method is extremely expensive. The ConfigSnapshot struct, directly
or indirectly, contains a copy of many of the structs in the agent/structs
package, which creates a large graph for copystructure.Copy to traverse
at runtime, on every proxy reconfiguration.
2022-10-14 10:26:42 +01:00
freddygv c77123a2aa Use split var in tests 2022-10-13 17:12:47 -06:00
freddygv bf51021c07 Use split wildcard partition name
This way OSS avoids passing a non-empty label, which will be rejected in
OSS consul.
2022-10-13 16:55:28 -06:00
Freddy ee4cdc4985
Merge pull request #14935 from hashicorp/fix/alias-leak 2022-10-13 16:31:15 -06:00
freddygv 573aa408a1 Lint 2022-10-13 15:55:55 -06:00
Derek Menteer 0f424e3cdf Reset wait on ensureServerAddrSubscription 2022-10-13 15:58:26 -05:00
freddygv 96fdd3728a Fix CA init error code 2022-10-13 14:58:11 -06:00
freddygv 2c99a21596 Update leader routine to maybe use gateways 2022-10-13 14:58:00 -06:00
freddygv e69bc727ec Update peering establishment to maybe use gateways
When peering through mesh gateways we expect outbound dials to peer
servers to flow through the local mesh gateway addresses.

Now when establishing a peering we get a list of dial addresses as a
ring buffer that includes local mesh gateway addresses if the local DC
is configured to peer through mesh gateways. The ring buffer includes
the mesh gateway addresses first, but also includes the remote server
addresses as a fallback.

This fallback is present because it's possible that direct egress from
the servers may be allowed. If not allowed then the leader will cycle
back to a mesh gateway address through the ring.

When attempting to dial the remote servers we retry up to a fixed
timeout. If using mesh gateways we also have an initial wait in
order to allow for the mesh gateways to configure themselves.

Note that if we encounter a permission denied error we do not retry
since that error indicates that the secret in the peering token is
invalid.
2022-10-13 14:57:55 -06:00
malizz b0b0cbb8ee
increase protobuf size limit for cluster peering (#14976) 2022-10-13 13:46:51 -07:00
Derek Menteer 4e140c98bc Address PR comments. 2022-10-13 14:11:02 -05:00
Derek Menteer 1e394da400 Disallow peering to the same cluster. 2022-10-13 14:11:02 -05:00
Derek Menteer 8742fbe14f Prevent consul peer-exports by discovery chain. 2022-10-13 12:45:09 -05:00
Derek Menteer f366edcb8d Prevent the "consul" service from being exported. 2022-10-13 12:45:09 -05:00
Derek Menteer caa1396255 Add remote peer partition and datacenter info. 2022-10-13 10:37:41 -05:00
Dan Upton cbb4a030c4
xds: properly merge central config for "agentless" services (#14962) 2022-10-13 12:04:59 +01:00
Dan Upton 0af9f16343
bug: fix goroutine leaks caused by incorrect usage of `WatchCh` (#14916)
memdb's `WatchCh` method creates a goroutine that will publish to the
returned channel when the watchset is triggered or the given context
is canceled. Although this is called out in its godoc comment, it's
not obvious that this method creates a goroutine who's lifecycle you
need to manage.

In the xDS capacity controller, we were calling `WatchCh` on each
iteration of the control loop, meaning the number of goroutines would
grow on each autopilot event until there was catalog churn.

In the catalog config source, we were calling `WatchCh` with the
background context, meaning that the goroutine would keep running after
the sync loop had terminated.
2022-10-13 12:04:27 +01:00