Commit Graph

5479 Commits

Author SHA1 Message Date
Derek Menteer ac83ac1343
Fix streaming RPCs for agentless. (#20868)
* Fix streaming RPCs for agentless.

This PR fixes an issue where cross-dc RPCs were unable to utilize
the streaming backend due to having the node name set. The result
of this was the agent-cache being utilized, which would cause high
cpu utilization and memory consumption due to the fact that it
keeps queries alive for 72 hours before purging inactive entries.

This resource consumption is compounded by the fact that each pod
in consul-k8s gets a unique token. Since the agent-cache uses the
token as a component of the key, the same query is duplicated for
each pod that is deployed.

* Add changelog.
2024-03-15 14:44:51 -05:00
Derek Menteer 0ac8ae6c3b
Fix xDS deadlock due to syncLoop termination. (#20867)
* Fix xDS deadlock due to syncLoop termination.

This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.

Effectively, the trigger for this deadlock stems from the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

Which results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.  +
`syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.

* Add changelog.
2024-03-15 13:57:11 -05:00
Derek Menteer eabff257d7
Various bug-fixes and improvements (#20866)
* Shuffle the list of servers returned by `pbserverdiscovery.WatchServers`.

This randomizes the list of servers to help reduce the chance of clients
all connecting to the same server simultaneously. Consul-dataplane is one
such client that does not randomize its own list of servers.

* Fix potential goroutine leak in xDS recv loop.

This commit ensures that the goroutine which receives xDS messages from
proxies will not block forever if the stream's context is cancelled but
the `processDelta()` function never consumes the message (due to being
terminated).

* Add changelog.
2024-03-15 13:10:48 -05:00
sarahalsmiller 262f435800
NET-6821 Disable Terminating Gateway Auto Host Header Rewrite (#20802)
* disable terminating gateway auto host rewrite

* add changelog

* clean up unneeded additional snapshot fields

* add new field to docs

* squash

* fix test
2024-03-12 15:37:20 -05:00
Michael Zalimeni d4761c0ccd
security: upgrade google.golang.org/protobuf to 1.33.0 (#20801)
Resolves CVE-2024-24786.
2024-03-06 23:04:42 +00:00
Matt Keeler abe14f11e6
Remove redundant usage metrics (#20674)
* Remove redundant usage metrics

* Add the changelog

* Update website/content/docs/upgrading/upgrade-specific.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* Update website/content/docs/upgrading/upgrade-specific.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* Update website/content/docs/upgrading/upgrade-specific.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* Update website/content/docs/upgrading/upgrade-specific.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* Update website/content/docs/upgrading/upgrade-specific.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2024-03-05 14:09:47 -05:00
Matt Keeler 5c936fba33
Enable callers to control whether per-tenant usage metrics are included in calls to store.ServiceUsage (#20672)
* Enable callers to control whether per-tenant usage metrics are included in calls to store.ServiceUsage

* Add changelog
2024-03-01 13:44:55 -05:00
John Murret a1c6181677
DNS v2 - split up router into multiple responsibilities & break up router tests into multiple files. (#20688)
* Update agent/dns.go

Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>

* PR feedback

* split tests out into multiple files.

* Extract responsibilities from router into discoveryResultsFetcher, messageSerializer, responseGenerator.

* adding recordmaker tests

* add response generator test coverage.

* changing tests case name based on PR feedback

---------

Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>
2024-03-01 15:36:37 +00:00
John Murret a15a957a36
NET-8056 - v2 DNS Testing Improvements (#20710)
* NET-8056 - v2 DNS Testing Improvements

* adding TestDNSServer_Lifecycle

* add license headers to new files.
2024-03-01 05:42:42 -07:00
sarahalsmiller 670ee90a77
Use correct enterprise meta on wildcard service update (#20721)
* use correct enterprise meta on wildcard service update

* changelog

* rename changelog file
2024-02-26 12:03:08 -06:00
John Murret 26eed12f04
NET-7813 - DNS : SERVFAIL when resolving PTR records (#20679)
* NET-7813 - DNS : SERVFAIL when resolving PTR records

* Update agent/dns.go

Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>

* PR feedback

---------

Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>
2024-02-21 17:44:04 +00:00
Semir Patel 943426bc79
v2tenancy: add optional LicenseFeature to type Registration struct (#20673) 2024-02-20 14:42:31 -06:00
Dan Stough 14efb28086
fix(v2dns): add node ttl to workloads, comment cleanup, and changelog (#20643)
* fix(v2dns): add node ttl to workloads, plus comment cleanup

* docs(v2dns): changelog
2024-02-14 17:38:11 -05:00
Derek Menteer 9f7626d501
Ensure all topics are refreshed on FSM restore and add supervisor loop to v1 controller subscriptions (#20642)
Ensure all topics are refreshed on FSM restore and add supervisor loop to v1 controller subscriptions

This PR fixes two issues:

1. Not all streams were force closed whenever a snapshot restore happened. This means that anything consuming data from the stream (controllers, queries, etc) were unaware that the data they have is potentially stale / invalid. This first part ensures that all topics are purged.

2. The v1 controllers did not properly handle stream errors (which are likely to appear much more often due to 1 above) and so it introduces a supervisor thread to restart the watches when these errors occur.
2024-02-14 14:17:55 -06:00
Dan Stough 137c9c0973
[CE] Misc cleanup for V2 DNS (#20640)
* chore: gitignore zed editor

* chore(v2dns): remove ent/ce split from router

* fix(v2dns): v2 workloads now have tenancy in output

* feat(v2dns): support 'cluster' label

* chore(v2dns): less chatty debug logs
2024-02-14 12:40:38 -05:00
Melissa Kam 64cd172f30
[CC-7411] Fix environment variable precedence when linking to HCP (#20527)
Fix so that link API values are used over env vars

When a link is created via the API, those values should take precedence over
the values set by environment variables. This change loads all the env vars
initially as part of the config builder rather than on demand.
2024-02-13 14:06:18 -06:00
Michael Zalimeni 2c1addfd64
[NET-7015] DNS v2 + Catalog v2 int test (#20607)
test(v2dns): Add Catalog v2 integration test

Add a basic integration test covering major functionality tested against
Catalog v2 resources. This complements existing tests that ensure
compatibility between v1 and v2 DNS when testing against Catalog v1
resources.
2024-02-13 17:40:08 +00:00
Dan Stough 0f0b080514
[CE] feat(v2dns): add v2 style query metrics (#20608)
feat(v2dns): add v2 style query metrics
2024-02-13 12:08:01 -05:00
Semir Patel b716a9ef6b
resource: reconcile managed types every ~8hrs (#20606) 2024-02-13 10:51:54 -06:00
John Murret 7e8f2e5f08
NET-7644/NET-7634 - Implement query lookup for tagged addresses on nodes and services including WAN translation. (#20583)
NET-7644 - Implement tagged addresses and wan translation
2024-02-12 14:27:25 -05:00
Dan Stough 5802080db1
feat(v2dns): enable peering queries (#20581) 2024-02-12 14:25:45 -05:00
Nick Cellino 5fb6ab6a3a
Move HCP Manager lifecycle management out of Link controller (#20401)
* Add function to get update channel for watching HCP Link

* Add MonitorHCPLink function

This function can be called in a goroutine to manage the lifecycle
of the HCP manager.

* Update HCP Manager config in link monitor before starting

This updates HCPMonitorLink so it updates the HCP manager
with an HCP client and management token when a Link is upserted.

* Let MonitorHCPManager handle lifecycle instead of link controller

* Remove cleanup from Link controller and move it to MonitorHCPLink

Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.

* Remove HCP Manager dependency from Link Controller

The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.

* Add Linked prefix to Linked status variables

This is in preparation for adding a new status type to the
Link resource.

* Add new "validated" status type to link resource

The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (eg the HCP manager) know when the Link is ready to link
with HCP.

* Fix tests

* Handle new 'EndOfSnapshot' WatchList event

* Fix watch test

* Remove unnecessary config from TestAgent_scadaProvider

Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.

This change is not exactly related to the changes from this PR,
but rather is something small and sort of related that was noticed
while working on this PR.

* Simplify link watch test and remove sleep from link watch

This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.

This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.

* Add better logging for link validation. Remove EndOfSnapshot test.

* Refactor link monitor test into a table test

* Add some clarifying comments to link monitor

* Simplify link watch test

* Test a bunch more errors cases in link monitor test

* Use exponential backoff instead of errorCounter in LinkWatch

* Move link watch and link monitor into a single goroutine called from server.go

* Refactor HCP link watcher to use single go-routine.

Previously, if the WatchClient errored, we would've never recovered
because we never retry to create the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
2024-02-12 10:48:23 -05:00
John Murret c8e4cea69c
set up ent and CE specific DNS tests to be able to run v1 and v2 (#20571) 2024-02-09 15:53:56 -07:00
Dan Stough 01001f630e
feat(v2dns): catalog v2 service query support (#20564) 2024-02-09 17:41:40 -05:00
Dan Stough 24e15cc24e
feat(v2dns): prepared query ttls (#20563) 2024-02-09 11:26:02 -05:00
John Murret 7cac918811
NET-7637 / NET-7659/NET-7636/NET-7647/NET-7648/NET-7646/NET-7649/NET-7645 - Multiple DNS v2 fixes (#20556) 2024-02-08 19:56:04 -07:00
Derek Menteer a1c8d4dd19
Decouple xds capacity controller and raft-autopilot (#20511)
Decouple xds capacity controller and autopilot

This prevents a potential bug where autopilot deadlocks while attempting
to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller
that stopped consuming messages.
2024-02-08 15:31:44 -06:00
Chris S. Kim 26661a1c3b
Add default intention policy (#20544) 2024-02-08 20:25:42 +00:00
Joshua Timmons 242b777547
Fix logging when we fail to export metrics to hcp (#20514) 2024-02-08 11:00:47 -05:00
Joshua Timmons c790740cc6
Fix: avoid redundant logs on failures to export metrics (#20519) 2024-02-08 11:00:20 -05:00
John Murret 8ac54707d6
DNS v2 Multiple fixes. (#20525)
* DNS v2 Multiple fixes.

* add license header

* get rid of DefaultIntentionPolicy change that was not supposed to be there.
2024-02-07 21:24:00 -07:00
Nathan Coleman 45d645471b
[NET-7414] Reconcile PST for mesh gateway workloads on change to ComputedExportedServices (#20271)
* Reconcile ProxyStateTemplate on change to ComputedExportedServices

* gofmt changeset

---------

Co-authored-by: NiniOak <anita.akaeze@hashicorp.com>
2024-02-07 21:27:13 +00:00
skpratt 57bad0df85
add traffic permissions excludes and tests (#20453)
* add traffic permissions tests

* review fixes

* Update internal/mesh/internal/controllers/sidecarproxy/builder/local_app.go

Co-authored-by: John Landa <jonathanlanda@gmail.com>

---------

Co-authored-by: John Landa <jonathanlanda@gmail.com>
2024-02-07 20:21:44 +00:00
Eric Haberkorn 1bd253021b
V1 Compat Exported Services Controller Optimizations (#20517)
V1 compat exported services controller optimizations

* Don't start the v2 exported services controller in v1 mode.
* Use the controller cache.
2024-02-07 14:05:42 -05:00
Matt Keeler 49e6c0232d
Panic for unregistered types (#20476)
* Panic when controllers attempt to make invalid requests to the resource service

This will help to catch bugs in tests that could cause infinite errors to be emitted.

* Disable the API GW v2 controller

With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.

* Ensure that a test server gets the full type registry instead of constructing its own

* Skip TestServer_ControllerDependencies

* Fix peering tests so that they use the full resource registry.
2024-02-06 11:23:06 -05:00
Dan Stough fcc43a9a36
feat(v2dns): catalog v2 SOA and NS support (#20480) 2024-02-06 11:12:04 -05:00
John Murret 3bf999e46b
NET-7631 - Fix Node records that point to external/ non-IP addresses (#20491)
* NET-7630 - Fix TXT record creation on node queries

* NET-7631 - Fix Node records that point to external/ non-IP addresses

* NET-7630 - Fix TXT record creation on node queries
2024-02-06 15:16:02 +00:00
John Murret 7d4deda640
NET-7630 - Fix TXT record creation on node queries (#20483) 2024-02-06 09:53:39 -05:00
Ashesh Vidyut cffb5d7c6e
Fix audit-log encoding issue (CC-7337) (#20345)
* add changes

* added changelog

* change update

* CE chnages

* Removed gzip size fix

* fix changelog

* Update .changelog/20345.txt

Co-authored-by: Hans Hasselberg <hans@hashicorp.com>

* Adding comments

---------

Co-authored-by: Abhishek Sahu <abhishek.sahu@hashicorp.com>
Co-authored-by: Hans Hasselberg <hans@hashicorp.com>
Co-authored-by: srahul3 <rahulsharma@hashicorp.com>
2024-02-06 16:40:07 +05:30
Tauhid Anjum 88b8a1cc36
NET-6776 - Update Routes controller to use ComputedFailoverPolicy CE (#20496)
Update Routes controller to use ComputedFailoverPolicy
2024-02-06 13:28:18 +05:30
Derek Menteer 922844b8e0
Fix issue with persisting proxy-defaults (#20481)
Fix issue with persisting proxy-defaults

This resolves an issue introduced in hashicorp/consul#19829
where the proxy-defaults configuration entry with an HTTP protocol
cannot be updated after it has been persisted once and a router
exists. This occurs because the protocol field is not properly
pre-computed before being passed into validation functions.
2024-02-05 16:00:19 -06:00
John Murret 0d434dafac
Do not parallelize DNS tests because they consume too many ports (#20482) 2024-02-05 14:54:05 -07:00
John Murret 602e3c4fd5
DNS V2 - Revise discovery result to have service and node name and address fields. (#20468)
* DNS V2 - Revise discovery result to have service and node name and address fields.

* NET-7488 - dns v2 add support for prepared queries in catalog v1 data model (#20470)

NET-7488 - dns v2 add support for prepared queries in catalog v1 data model.
2024-02-03 03:23:52 +00:00
Dan Stough 9602b43183
feat(v2dns): catalog v2 workload query support (#20466) 2024-02-02 18:29:38 -05:00
R.B. Boyer c029b20615
v2: ensure the controller caches are fully populated before first use (#20421)
The new controller caches are initialized before the DependencyMappers or the 
Reconciler run, but importantly they are not populated. The expectation is that 
when the WatchList call is made to the resource service it will send an initial 
snapshot of all resources matching a single type, and then perpetually send 
UPSERT/DELETE events afterward. This initial snapshot will cycle through the 
caching layer and will catch it up to reflect the stored data.

Critically the dependency mappers and reconcilers will race against the restoration 
of the caches on server startup or leader election. During this time it is possible a
 mapper or reconciler will use the cache to lookup a specific relationship and 
not find it. That very same reconciler may choose to then recompute some 
persisted resource and in effect rewind it to a prior computed state.

Change

- Since we are updating the behavior of the WatchList RPC, it was aligned to 
  match that of pbsubscribe and pbpeerstream using a protobuf oneof instead of the enum+fields option.

- The WatchList rpc now has 3 alternating response events: Upsert, Delete, 
  EndOfSnapshot. When set the initial batch of "snapshot" Upserts sent on a new 
  watch, those operations will be followed by an EndOfSnapshot event before beginning 
  the never-ending sequence of Upsert/Delete events.

- Within the Controller startup code we will launch N+1 goroutines to execute WatchList 
  queries for the watched types. The UPSERTs will be applied to the nascent cache
   only (no mappers will execute).

- Upon witnessing the END operation, those goroutines will terminate.

- When all cache priming routines complete, then the normal set of N+1 long lived 
watch routines will launch to officially witness all events in the system using the 
primed cached.
2024-02-02 15:11:05 -06:00
wangxinyi7 fb2b696c0e
missing prefix / (#20447)
* missing prefix / and fix typos
2024-02-02 12:48:45 -08:00
Eric Haberkorn 543c6a30af
Trigger the V1 Compat exported-services Controller when V1 Config Entries are Updated (#20456)
* Trigger the v1 compat exported-services controller when the v1 config entry is modified.

* Hook up exported-services config entries to the event publisher.
* Add tests to the v2 exported services shim.
* Use the local materializer trigger updates on the v1 compat exported services controller when exported-services config entries are modified.

* stop sleeping when context is cancelled
2024-02-02 15:30:04 -05:00
Eric Haberkorn d0243b618d
Change the multicluster group to v2 (#20430) 2024-02-01 12:08:26 -05:00
Chris S. Kim b6f10bc58f
Skip filter chain created by permissive mtls (#20406) 2024-01-31 16:39:12 -05:00
wangxinyi7 3b44be530d
only forwarding the resource service traffic in client agent to server agent (#20347)
* only forwarding the resource service traffic in client agent to server agent
2024-01-31 12:05:47 -08:00