Ensure all topics are refreshed on FSM restore and add supervisor loop to v1 controller subscriptions
This PR fixes two issues:
1. Not all streams were force closed whenever a snapshot restore happened. This means that anything consuming data from the stream (controllers, queries, etc) were unaware that the data they have is potentially stale / invalid. This first part ensures that all topics are purged.
2. The v1 controllers did not properly handle stream errors (which are likely to appear much more often due to 1 above) and so it introduces a supervisor thread to restart the watches when these errors occur.
Fix so that link API values are used over env vars
When a link is created via the API, those values should take precedence over
the values set by environment variables. This change loads all the env vars
initially as part of the config builder rather than on demand.
test(v2dns): Add Catalog v2 integration test
Add a basic integration test covering major functionality tested against
Catalog v2 resources. This complements existing tests that ensure
compatibility between v1 and v2 DNS when testing against Catalog v1
resources.
* docs: document behaviour of tls.https.verify_outgoing
At first it's not clear what verify_outgoing would do for the https
listener as it seems like Consul agent's don't make https requests. Upon
further investigation, it's clear that Consul agents do make https
requests in the following scenarios:
- to implement watches
- to perform checks
In the first scenario, this setting is used here:
a1c8d4dd19/agent/config/runtime.go (L1725)
In the second scenario, it's actually the internal_rpc setting that is
used:
a1c8d4dd19/tlsutil/config.go (L903)
* Update website/content/docs/agent/config/config-files.mdx
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
---------
Co-authored-by: David Yu <dyu@hashicorp.com>
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
* Updated docs for Consul ECS 0.8.x, architecture, tproxy support
* Apply suggestions from code review
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Co-authored-by: Ganesh S <ganesh.seetharaman@hashicorp.com>
* add apigw as feature, update images
---------
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Co-authored-by: Ganesh S <ganesh.seetharaman@hashicorp.com>
Wire the ComputedImplicitDestinations resource into the sidecar controller, replacing the inline version already present.
Also:
- Rewrite the controller to use the controller cache
- Rewrite it to no longer depend on ServiceEndpoints
- Remove the fetcher and (local) cache abstraction
* Add function to get update channel for watching HCP Link
* Add MonitorHCPLink function
This function can be called in a goroutine to manage the lifecycle
of the HCP manager.
* Update HCP Manager config in link monitor before starting
This updates HCPMonitorLink so it updates the HCP manager
with an HCP client and management token when a Link is upserted.
* Let MonitorHCPManager handle lifecycle instead of link controller
* Remove cleanup from Link controller and move it to MonitorHCPLink
Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.
* Remove HCP Manager dependency from Link Controller
The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.
* Add Linked prefix to Linked status variables
This is in preparation for adding a new status type to the
Link resource.
* Add new "validated" status type to link resource
The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (eg the HCP manager) know when the Link is ready to link
with HCP.
* Fix tests
* Handle new 'EndOfSnapshot' WatchList event
* Fix watch test
* Remove unnecessary config from TestAgent_scadaProvider
Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.
This change is not exactly related to the changes from this PR,
but rather is something small and sort of related that was noticed
while working on this PR.
* Simplify link watch test and remove sleep from link watch
This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.
This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.
* Add better logging for link validation. Remove EndOfSnapshot test.
* Refactor link monitor test into a table test
* Add some clarifying comments to link monitor
* Simplify link watch test
* Test a bunch more errors cases in link monitor test
* Use exponential backoff instead of errorCounter in LinkWatch
* Move link watch and link monitor into a single goroutine called from server.go
* Refactor HCP link watcher to use single go-routine.
Previously, if the WatchClient errored, we would've never recovered
because we never retry to create the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
Creates a new controller to create ComputedImplicitDestinations resources by
composing ComputedRoutes, Services, and ComputedTrafficPermissions to
infer all ParentRef services that could possibly send some portion of traffic to a
Service that has at least one accessible Workload Identity. A followup PR will
rewire the sidecar controller to make use of this new resource.
As this is a performance optimization, rather than a security feature the following
aspects of traffic permissions have been ignored:
- DENY rules
- port rules (all ports are allowed)
Also:
- Add some v2 TestController machinery to help test complex dependency mappers.
Previously calling `index.New` would return an object with the index information such as the Indexer, whether it was required, and the name of the index as well as a radix tree to store indexed data.
Now the main `Index` type doesn’t contain the radix tree for indexed data. Instead the `IndexedData` method can be used to combine the main `Index` with a radix tree in the `IndexedData` structure.
The cache still only allows configuring the `Index` type and will invoke the `IndexedData` method on the provided indexes to get the structure that the cache can use for actual data management.
All of this makes it now safe to reuse the `index.Index` types.
Decouple xds capacity controller and autopilot
This prevents a potential bug where autopilot deadlocks while attempting
to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller
that stopped consuming messages.
* Update cache section for certain API calls
Providing more detail to the cache section to address behavior of API calls using streaming backend. This will help users understand that when '?index' is used and '?cached' is not, caching to servers will be bypassed, causing entry_fetch_max_burst and entry_fetch_rate to not be used in this scenario.
* Update website/content/docs/agent/config/config-files.mdx
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
---------
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
* Skip collecting data directory stats in dev mode
In dev mode, the data directory is not set, so this metrics collection would
always fail and logs errors.
* Log collection errors at DEBUG level
There isn't much action a user can take to fix these errors, so
logging them as DEBUG rather than ERROR.