Ensure all topics are refreshed on FSM restore and add supervisor loop to v1 controller subscriptions
This PR fixes two issues:
1. Not all streams were force closed whenever a snapshot restore happened. This means that anything consuming data from the stream (controllers, queries, etc) were unaware that the data they have is potentially stale / invalid. This first part ensures that all topics are purged.
2. The v1 controllers did not properly handle stream errors (which are likely to appear much more often due to 1 above) and so it introduces a supervisor thread to restart the watches when these errors occur.
Fix so that link API values are used over env vars
When a link is created via the API, those values should take precedence over
the values set by environment variables. This change loads all the env vars
initially as part of the config builder rather than on demand.
test(v2dns): Add Catalog v2 integration test
Add a basic integration test covering major functionality tested against
Catalog v2 resources. This complements existing tests that ensure
compatibility between v1 and v2 DNS when testing against Catalog v1
resources.
* Add function to get update channel for watching HCP Link
* Add MonitorHCPLink function
This function can be called in a goroutine to manage the lifecycle
of the HCP manager.
* Update HCP Manager config in link monitor before starting
This updates HCPMonitorLink so it updates the HCP manager
with an HCP client and management token when a Link is upserted.
* Let MonitorHCPManager handle lifecycle instead of link controller
* Remove cleanup from Link controller and move it to MonitorHCPLink
Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.
* Remove HCP Manager dependency from Link Controller
The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.
* Add Linked prefix to Linked status variables
This is in preparation for adding a new status type to the
Link resource.
* Add new "validated" status type to link resource
The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (eg the HCP manager) know when the Link is ready to link
with HCP.
* Fix tests
* Handle new 'EndOfSnapshot' WatchList event
* Fix watch test
* Remove unnecessary config from TestAgent_scadaProvider
Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.
This change is not exactly related to the changes from this PR,
but rather is something small and sort of related that was noticed
while working on this PR.
* Simplify link watch test and remove sleep from link watch
This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.
This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.
* Add better logging for link validation. Remove EndOfSnapshot test.
* Refactor link monitor test into a table test
* Add some clarifying comments to link monitor
* Simplify link watch test
* Test a bunch more errors cases in link monitor test
* Use exponential backoff instead of errorCounter in LinkWatch
* Move link watch and link monitor into a single goroutine called from server.go
* Refactor HCP link watcher to use single go-routine.
Previously, if the WatchClient errored, we would've never recovered
because we never retry to create the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
Decouple xds capacity controller and autopilot
This prevents a potential bug where autopilot deadlocks while attempting
to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller
that stopped consuming messages.
* Panic when controllers attempt to make invalid requests to the resource service
This will help to catch bugs in tests that could cause infinite errors to be emitted.
* Disable the API GW v2 controller
With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.
* Ensure that a test server gets the full type registry instead of constructing its own
* Skip TestServer_ControllerDependencies
* Fix peering tests so that they use the full resource registry.
* NET-7630 - Fix TXT record creation on node queries
* NET-7631 - Fix Node records that point to external/ non-IP addresses
* NET-7630 - Fix TXT record creation on node queries
Fix issue with persisting proxy-defaults
This resolves an issue introduced in hashicorp/consul#19829
where the proxy-defaults configuration entry with an HTTP protocol
cannot be updated after it has been persisted once and a router
exists. This occurs because the protocol field is not properly
pre-computed before being passed into validation functions.
* DNS V2 - Revise discovery result to have service and node name and address fields.
* NET-7488 - dns v2 add support for prepared queries in catalog v1 data model (#20470)
NET-7488 - dns v2 add support for prepared queries in catalog v1 data model.
The new controller caches are initialized before the DependencyMappers or the
Reconciler run, but importantly they are not populated. The expectation is that
when the WatchList call is made to the resource service it will send an initial
snapshot of all resources matching a single type, and then perpetually send
UPSERT/DELETE events afterward. This initial snapshot will cycle through the
caching layer and will catch it up to reflect the stored data.
Critically the dependency mappers and reconcilers will race against the restoration
of the caches on server startup or leader election. During this time it is possible a
mapper or reconciler will use the cache to lookup a specific relationship and
not find it. That very same reconciler may choose to then recompute some
persisted resource and in effect rewind it to a prior computed state.
Change
- Since we are updating the behavior of the WatchList RPC, it was aligned to
match that of pbsubscribe and pbpeerstream using a protobuf oneof instead of the enum+fields option.
- The WatchList rpc now has 3 alternating response events: Upsert, Delete,
EndOfSnapshot. When set the initial batch of "snapshot" Upserts sent on a new
watch, those operations will be followed by an EndOfSnapshot event before beginning
the never-ending sequence of Upsert/Delete events.
- Within the Controller startup code we will launch N+1 goroutines to execute WatchList
queries for the watched types. The UPSERTs will be applied to the nascent cache
only (no mappers will execute).
- Upon witnessing the END operation, those goroutines will terminate.
- When all cache priming routines complete, then the normal set of N+1 long lived
watch routines will launch to officially witness all events in the system using the
primed cached.
* Trigger the v1 compat exported-services controller when the v1 config entry is modified.
* Hook up exported-services config entries to the event publisher.
* Add tests to the v2 exported services shim.
* Use the local materializer trigger updates on the v1 compat exported services controller when exported-services config entries are modified.
* stop sleeping when context is cancelled
* Add Stop method to telemetry provider
Stop the main loop of the provider and set the config
to disabled.
* Add interface for telemetry provider
Added for easier testing. Also renamed Run to Start, which better
fits with Stop.
* Add Stop method to HCP manager
* Add manager interface, rename implementation
Add interface for easier testing, rename existing Manager to HCPManager.
* Stop HCP manager in link Finalizer
* Attempt to cleanup if resource has been deleted
The link should be cleaned up by the finalizer, but there's an edge
case in a multi-server setup where the link is fully deleted on one
server before the other server reconciles. This will cover the case
where the reconcile happens after the resource is deleted.
* Add a delete mananagement token function
Passes a function to the HCP manager that deletes the management token
that was initially created by the manager.
* Delete token as part of stopping the manager
* Lock around disabling config, remove descriptions
* Check for ACL write permissions on write
Link eventually will be creating a token, so require acl:write.
* Convert Run to Start, only allow to start once
* Always initialize HCP components at startup
* Support for updating config and client
* Pass HCP manager to controller
* Start HCP manager in link resource
Start as part of link creation rather than always starting. Update
the HCP manager with values from the link before starting as well.
* Fix metrics sink leaked goroutine
* Remove the hardcoded disabled hostname prefix
The HCP metrics sink will always be enabled, so the length of sinks will
always be greater than zero. This also means that we will also always
default to prefixing metrics with the hostname, which is what our
documentation states is the expected behavior anyway.
* Add changelog
* Check and set running status in one method
* Check for primary datacenter, add back test
* Clarify merge reasoning, fix timing issue in test
* Add comment about controller placement
* Expand on breaking change, fix typo in changelog
* If a workload does not implement a port, it should not be included in the list of endpoints for the Envoy cluster for that port.
* Adds tenancy tests for xds controller and xdsv2 resource generation, and adds all those files.
* The original change in this PR was for filtering the list of endpoints by the port being routed to (bullet 1). Since I made changes to sidecarproxycontroller golden files, I realized some of the golden files were unused because of the tenancy changes, so when I deleted those, that broke xds controller tests which weren't correctly using tenancy. So when I fixed that, then the xdsv2 tests broke, so I added tenancy support there too. So now, from sidecarproxy controller -> xds controller -> xdsv2 we now have tenancy support and all the golden files are lined up.