Commit Graph

4704 Commits

Author SHA1 Message Date
Chris S. Kim 29a297d3e9
Refactor client RPC timeouts (#14965)
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.
2022-10-18 15:05:09 -04:00
R.B. Boyer 0cca4c088d
test: possibly fix flake in TestIntentionGetExact (#15021)
Restructure test setup to be similar to TestAgent_ServerCertificate
and see if that's enough to avoid flaking after join.
2022-10-18 10:51:20 -05:00
R.B. Boyer fe2d41ddad
cache: prevent goroutine leak in agent cache (#14908)
There is a bug in the error handling code for the Agent cache subsystem discovered:

1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in
   a loop (which backs off on-error up to 1 minute)

2. getWithIndex calls fetch if there’s no valid entry in the cache

3. fetch starts a goroutine which calls Fetch on the cache-type, waits
   for a while (again with backoff up to 1 minute for errors) and then
   calls fetch to trigger a refresh

The end result being that every 1 minute notifyBlockingQuery spawns an
ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by `fetch` cancels any prior
goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely
error, this patch prevented goroutine counts from skyrocketing.
2022-10-17 14:38:10 -05:00
R.B. Boyer 02a858efa0
ca: fix a masked bug in leaf cert generation that would not be notified of root cert rotation after the first one (#15005)
In practice this was masked by #14956 and was only uncovered fixing the
other bug.

  go test ./agent -run TestAgentConnectCALeafCert_goodNotLocal

would fail when only #14956 was fixed.
2022-10-17 13:24:27 -05:00
Chris S. Kim 3d2dffff16
Merge pull request #13388 from deblasis/feature/health-checks_windows_service
Feature: Health checks windows service
2022-10-17 09:26:19 -04:00
Dan Upton f8b4b41205
proxycfg: fix goroutine leak when service is re-registered (#14988)
Fixes a bug where we'd leak a goroutine in state.run when the given
context was canceled while there was a pending update.
2022-10-17 11:31:10 +01:00
Kyle Havlovitz aaf892a383 Extend tcp keepalive settings to work for terminating gateways as well 2022-10-14 17:05:46 -07:00
Kyle Havlovitz 2c569f6b9c Update docs and add tcp_keepalive_probes setting 2022-10-14 17:05:46 -07:00
Kyle Havlovitz 2242d1ec4a Add TCP keepalive settings to proxy config for mesh gateways 2022-10-14 17:05:46 -07:00
Derek Menteer 2a33d0ff96 Fix issue with incorrect method signature on test. 2022-10-14 11:04:57 -05:00
Freddy 24d0c8801a
Merge pull request #14981 from hashicorp/peering/dial-through-gateways 2022-10-14 09:44:56 -06:00
Dan Upton 328e3ff563
proxycfg: rate-limit delivery of config snapshots (#14960)
Adds a user-configurable rate limiter to proxycfg snapshot delivery,
with a default limit of 250 updates per second.

This addresses a problem observed in our load testing of Consul
Dataplane where updating a "global" resource such as a wildcard
intention or the proxy-defaults config entry could starve the Raft or
Memberlist goroutines of CPU time, causing general cluster instability.
2022-10-14 15:52:00 +01:00
Derek Menteer 29ebcf5ff0 Add tests for peering state snapshots / restores. 2022-10-14 09:48:04 -05:00
Derek Menteer e3ff9912d0 Add test for ExportedServicesForAllPeersByName 2022-10-14 09:48:04 -05:00
Dan Upton e6b55d1d81
perf: remove expensive reflection from xDS hot path (#14934)
Replaces the reflection-based implementation of proxycfg's
ConfigSnapshot.Clone with code generated by deep-copy.

While load testing server-based xDS (for consul-dataplane) we discovered
this method is extremely expensive. The ConfigSnapshot struct, directly
or indirectly, contains a copy of many of the structs in the agent/structs
package, which creates a large graph for copystructure.Copy to traverse
at runtime, on every proxy reconfiguration.
2022-10-14 10:26:42 +01:00
freddygv c77123a2aa Use split var in tests 2022-10-13 17:12:47 -06:00
freddygv bf51021c07 Use split wildcard partition name
This way OSS avoids passing a non-empty label, which will be rejected in
OSS consul.
2022-10-13 16:55:28 -06:00
Freddy ee4cdc4985
Merge pull request #14935 from hashicorp/fix/alias-leak 2022-10-13 16:31:15 -06:00
freddygv 573aa408a1 Lint 2022-10-13 15:55:55 -06:00
Derek Menteer 0f424e3cdf Reset wait on ensureServerAddrSubscription 2022-10-13 15:58:26 -05:00
freddygv 96fdd3728a Fix CA init error code 2022-10-13 14:58:11 -06:00
freddygv 2c99a21596 Update leader routine to maybe use gateways 2022-10-13 14:58:00 -06:00
freddygv e69bc727ec Update peering establishment to maybe use gateways
When peering through mesh gateways we expect outbound dials to peer
servers to flow through the local mesh gateway addresses.

Now when establishing a peering we get a list of dial addresses as a
ring buffer that includes local mesh gateway addresses if the local DC
is configured to peer through mesh gateways. The ring buffer includes
the mesh gateway addresses first, but also includes the remote server
addresses as a fallback.

This fallback is present because it's possible that direct egress from
the servers may be allowed. If not allowed then the leader will cycle
back to a mesh gateway address through the ring.

When attempting to dial the remote servers we retry up to a fixed
timeout. If using mesh gateways we also have an initial wait in
order to allow for the mesh gateways to configure themselves.

Note that if we encounter a permission denied error we do not retry
since that error indicates that the secret in the peering token is
invalid.
2022-10-13 14:57:55 -06:00
malizz b0b0cbb8ee
increase protobuf size limit for cluster peering (#14976) 2022-10-13 13:46:51 -07:00
Derek Menteer 4e140c98bc Address PR comments. 2022-10-13 14:11:02 -05:00
Derek Menteer 1e394da400 Disallow peering to the same cluster. 2022-10-13 14:11:02 -05:00
Derek Menteer 8742fbe14f Prevent consul peer-exports by discovery chain. 2022-10-13 12:45:09 -05:00
Derek Menteer f366edcb8d Prevent the "consul" service from being exported. 2022-10-13 12:45:09 -05:00
Derek Menteer caa1396255 Add remote peer partition and datacenter info. 2022-10-13 10:37:41 -05:00
Dan Upton cbb4a030c4
xds: properly merge central config for "agentless" services (#14962) 2022-10-13 12:04:59 +01:00
Dan Upton 0af9f16343
bug: fix goroutine leaks caused by incorrect usage of `WatchCh` (#14916)
memdb's `WatchCh` method creates a goroutine that will publish to the
returned channel when the watchset is triggered or the given context
is canceled. Although this is called out in its godoc comment, it's
not obvious that this method creates a goroutine who's lifecycle you
need to manage.

In the xDS capacity controller, we were calling `WatchCh` on each
iteration of the control loop, meaning the number of goroutines would
grow on each autopilot event until there was catalog churn.

In the catalog config source, we were calling `WatchCh` with the
background context, meaning that the goroutine would keep running after
the sync loop had terminated.
2022-10-13 12:04:27 +01:00
Hans Hasselberg 0d5935ab83
adding configuration option cloud.scada_address (#14936)
* adding scada_address

* config tests

* add changelog entry
2022-10-13 11:31:28 +02:00
Paul Glass bcda205f88
Add consul.xds.server.streamStart metric (#14957)
This adds a new consul.xds.server.streamStart metric to measure the time taken to first generate xDS resources after an xDS stream is opened.
2022-10-12 14:17:58 -05:00
Riddhi Shah 345191a0df
Service http checks data source for agentless proxies (#14924)
Adds another datasource for proxycfg.HTTPChecks, for use on server agents. Typically these checks are performed by local client agents and there is no equivalent of this in agentless (where servers configure consul-dataplane proxies).
Hence, the data source is mostly a no-op on servers but in the case where the service is present within the local state, it delegates to the cache data source.
2022-10-12 07:49:56 -07:00
Freddy 9ca8bb8ec4
Merge pull request #14958 from hashicorp/peering/nonce 2022-10-12 08:18:15 -06:00
freddygv 1b46b35041 Actually track nonce in test 2022-10-12 07:50:17 -06:00
Derek Menteer f330438a45 Fix incorrect backoff-wait logic. 2022-10-12 08:01:10 -05:00
freddygv 7f9a5d0f58 Add basic nonce management
This commit adds a monotonically increasing nonce to include in peering
replication response messages. Every ack/nack from the peer handling a
response will include this nonce, allowing to correlate the ack/nack
with a specific resource.

At the moment nothing is done with the nonce when it is received. In the
future we may want to add functionality such as retries on NACKs,
depending on the class of error.
2022-10-11 19:02:04 -06:00
Paul Glass d17af23641
gRPC server metrics (#14922)
* Move stats.go from grpc-internal to grpc-middleware
* Update grpc server metrics with server type label
* Add stats test to grpc-external
* Remove global metrics instance from grpc server tests
2022-10-11 17:00:32 -05:00
cskh e0356e1502
fix(peering): add missing grpc_tls_port for server address reconciliation (#14944) 2022-10-11 10:56:29 -04:00
freddygv f4cc4577ca Fix alias check leak
Preivously when alias check was removed it would not be stopped nor
cleaned up from the associated aliasChecks map.

This means that any time an alias check was deregistered we would
leak a goroutine for CheckAlias.run() because the stopCh would never
be closed.

This issue mostly affects service mesh deployments on platforms where
the client agent is mostly static but proxy services come and go
regularly, since by default sidecars are registered with an alias check.
2022-10-10 16:42:29 -06:00
James Oulman b8bd7a3058
Configure Envoy alpn_protocols based on service protocol (#14356)
* Configure Envoy alpn_protocols based on service protocol

* define alpnProtocols in a more standard way

* http2 protocol should be h2 only

* formatting

* add test for getAlpnProtocol()

* create changelog entry

* change scope is connect-proxy

* ignore errors on ParseProxyConfig; fixes linter

* add tests for grpc and http2 public listeners

* remove newlines from PR

* Add alpn_protocol configuration for ingress gateway

* Guard against nil tlsContext

* add ingress gateway w/ TLS tests for gRPC and HTTP2

* getAlpnProtocols: add TCP protocol test

* add tests for ingress gateway with grpc/http2 and per-listener TLS config

* add tests for ingress gateway with grpc/http2 and per-listener TLS config

* add Gateway level TLS config with mixed protocol listeners to validate ALPN

* update changelog to include ingress-gateway

* add http/1.1 to http2 ALPN

* go fmt

* fix test on custom-trace-listener
2022-10-10 13:13:56 -07:00
freddygv bf72df7b0e Fixup test 2022-10-10 13:20:14 -06:00
Chris S. Kim 4f4112662e Fix nil pointer 2022-10-10 13:20:14 -06:00
Chris S. Kim b0a4c5c563 Include stream-related information in peering endpoints 2022-10-10 13:20:14 -06:00
Paul Glass c0c187f1c5
Merge central config for GetEnvoyBootstrapParams (#14869)
This fixes GetEnvoyBootstrapParams to merge in proxy-defaults and service-defaults.

Co-authored-by: Dan Upton <daniel@floppy.co>
2022-10-10 12:40:27 -05:00
Freddy 4abad02abd
Merge pull request #14796 from hashicorp/peering/use-connect-ca 2022-10-07 10:37:37 -06:00
freddygv 7d4da6eb22 Fixup test 2022-10-07 09:34:16 -06:00
freddygv 3034df6a5c Require Connect and TLS to generate peering tokens
By requiring Connect and a gRPC TLS listener we can automatically
configure TLS for all peering control-plane traffic.
2022-10-07 09:06:29 -06:00
freddygv fac3ddc857 Use internal server certificate for peering TLS
A previous commit introduced an internally-managed server certificate
to use for peering-related purposes.

Now the peering token has been updated to match that behavior:
- The server name matches the structure of the server cert
- The CA PEMs correspond to the Connect CA

Note that if Conect is disabled, and by extension the Connect CA, we
fall back to the previous behavior of returning the manually configured
certs and local server SNI.

Several tests were updated to use the gRPC TLS port since they enable
Connect by default. This means that the peering token will embed the
Connect CA, and the dialer will expect a TLS listener.
2022-10-07 09:05:32 -06:00