This refactors and relocates the following packages to live under internal/gossip instead of either in the toplevel lib or agent/consul:
- librtt : related to serf coordinates
- libserf : random serf stuff
* Define file-system-certificate config entry
* Collect file-system-certificate(s) referenced by api-gateway onto snapshot
* Add file-system-certificate to config entry kind allow lists
* Remove inapplicable validation
This validation makes sense for inline certificates since Consul server is holding the certificate; however, for file system certificates, Consul server never actually sees the certificate.
* Support file-system-certificate as source for listener TLS certificate
* Add more required mappings for the new config entry type
* Construct proper TLS context based on certificate kind
* Add support or SDS in xdscommon
* Remove unused param
* Adds back verification of certs for inline-certificates
* Undo tangential changes to TLS config consumption
* Remove stray curly braces
* Undo some more tangential changes
* Improve function name for generating API gateway secrets
* Add changelog entry
* Update .changelog/20873.txt
Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>
* Add some nil-checking, remove outdated TODO
* Update test assertions to include file-system-certificate
* Add documentation for file-system-certificate config entry
Add new doc to nav
* Fix grammar mistake
* Rename watchmaps, remove outdated TODO
---------
Co-authored-by: Melisa Griffin <melisa.griffin@hashicorp.com>
Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>
This operation would previously fail due to unconsumed bytes in the
decoder buffer when reading the Ent snapshot (the first byte of the
record would be misinterpreted as a type indicator, and the remaining
bytes would fail to be deserialized or read as invalid data).
Ensure restore succeeds by decoding the ignored record as an
interface{}, which will consume the record bytes without requiring a
concrete target struct, then moving on to the next record.
* put conditionals are hcp initialization for consul server
* put more things behind configuration flags
* add changelog
* TestServer_hcpManager
* fix TestAgent_scadaProvider
Currently, when a client starts a blocking query and an ACL token expires within
that time, Consul will return ACL not found error with a 403 status code. However,
sometimes if an ACL token is invalidated at the same time as the query's deadline is reached,
Consul will instead return an empty response with a 200 status code.
This is because of the events being executed.
1. Client issues a blocking query request with timeout `t`.
2. ACL is deleted.
3. Server detects a change in ACLs and force closes the gRPC stream.
4. Client resubscribes with the same token and resets its state (view).
5. Client sees "ACL not found" error.
If ACL is deleted before step 4, the client is unaware that the stream was closed due to
an ACL error and will return an empty view (from the reset state) with the 200 status code.
To fix this problem, we introduce another state to the subsciption to indicate when a change
to ACLs has occured. If the server sees that there was an error due to ACL change, it will
re-authenticate the request and return an error if the token is no longer valid.
Fixes#20790
* disable terminating gateway auto host rewrite
* add changelog
* clean up unneeded additional snapshot fields
* add new field to docs
* squash
* fix test
Ensure all topics are refreshed on FSM restore and add supervisor loop to v1 controller subscriptions
This PR fixes two issues:
1. Not all streams were force closed whenever a snapshot restore happened. This means that anything consuming data from the stream (controllers, queries, etc) were unaware that the data they have is potentially stale / invalid. This first part ensures that all topics are purged.
2. The v1 controllers did not properly handle stream errors (which are likely to appear much more often due to 1 above) and so it introduces a supervisor thread to restart the watches when these errors occur.
* Add function to get update channel for watching HCP Link
* Add MonitorHCPLink function
This function can be called in a goroutine to manage the lifecycle
of the HCP manager.
* Update HCP Manager config in link monitor before starting
This updates HCPMonitorLink so it updates the HCP manager
with an HCP client and management token when a Link is upserted.
* Let MonitorHCPManager handle lifecycle instead of link controller
* Remove cleanup from Link controller and move it to MonitorHCPLink
Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.
* Remove HCP Manager dependency from Link Controller
The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.
* Add Linked prefix to Linked status variables
This is in preparation for adding a new status type to the
Link resource.
* Add new "validated" status type to link resource
The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (eg the HCP manager) know when the Link is ready to link
with HCP.
* Fix tests
* Handle new 'EndOfSnapshot' WatchList event
* Fix watch test
* Remove unnecessary config from TestAgent_scadaProvider
Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.
This change is not exactly related to the changes from this PR,
but rather is something small and sort of related that was noticed
while working on this PR.
* Simplify link watch test and remove sleep from link watch
This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.
This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.
* Add better logging for link validation. Remove EndOfSnapshot test.
* Refactor link monitor test into a table test
* Add some clarifying comments to link monitor
* Simplify link watch test
* Test a bunch more errors cases in link monitor test
* Use exponential backoff instead of errorCounter in LinkWatch
* Move link watch and link monitor into a single goroutine called from server.go
* Refactor HCP link watcher to use single go-routine.
Previously, if the WatchClient errored, we would've never recovered
because we never retry to create the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
Decouple xds capacity controller and autopilot
This prevents a potential bug where autopilot deadlocks while attempting
to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller
that stopped consuming messages.
* Panic when controllers attempt to make invalid requests to the resource service
This will help to catch bugs in tests that could cause infinite errors to be emitted.
* Disable the API GW v2 controller
With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.
* Ensure that a test server gets the full type registry instead of constructing its own
* Skip TestServer_ControllerDependencies
* Fix peering tests so that they use the full resource registry.
Fix issue with persisting proxy-defaults
This resolves an issue introduced in hashicorp/consul#19829
where the proxy-defaults configuration entry with an HTTP protocol
cannot be updated after it has been persisted once and a router
exists. This occurs because the protocol field is not properly
pre-computed before being passed into validation functions.
* Trigger the v1 compat exported-services controller when the v1 config entry is modified.
* Hook up exported-services config entries to the event publisher.
* Add tests to the v2 exported services shim.
* Use the local materializer trigger updates on the v1 compat exported services controller when exported-services config entries are modified.
* stop sleeping when context is cancelled
* Add Stop method to telemetry provider
Stop the main loop of the provider and set the config
to disabled.
* Add interface for telemetry provider
Added for easier testing. Also renamed Run to Start, which better
fits with Stop.
* Add Stop method to HCP manager
* Add manager interface, rename implementation
Add interface for easier testing, rename existing Manager to HCPManager.
* Stop HCP manager in link Finalizer
* Attempt to cleanup if resource has been deleted
The link should be cleaned up by the finalizer, but there's an edge
case in a multi-server setup where the link is fully deleted on one
server before the other server reconciles. This will cover the case
where the reconcile happens after the resource is deleted.
* Add a delete mananagement token function
Passes a function to the HCP manager that deletes the management token
that was initially created by the manager.
* Delete token as part of stopping the manager
* Lock around disabling config, remove descriptions
* Check for ACL write permissions on write
Link eventually will be creating a token, so require acl:write.
* Convert Run to Start, only allow to start once
* Always initialize HCP components at startup
* Support for updating config and client
* Pass HCP manager to controller
* Start HCP manager in link resource
Start as part of link creation rather than always starting. Update
the HCP manager with values from the link before starting as well.
* Fix metrics sink leaked goroutine
* Remove the hardcoded disabled hostname prefix
The HCP metrics sink will always be enabled, so the length of sinks will
always be greater than zero. This also means that we will also always
default to prefixing metrics with the hostname, which is what our
documentation states is the expected behavior anyway.
* Add changelog
* Check and set running status in one method
* Check for primary datacenter, add back test
* Clarify merge reasoning, fix timing issue in test
* Add comment about controller placement
* Expand on breaking change, fix typo in changelog
* API Gateway proto
* fix lint issue
* new line
* run make proto format
* checkpoint
* stub
* Update internal/mesh/internal/controllers/apigateways/controller.go
* Move config-dependent methods to separate package
In order to reuse the fetching and file creation part of the
bootstrap package, move the code that would cause cyclical
dependencies to a different package.
* Export needed bootstrap methods and variables
Also add back validating persisted config and update tests.
* Add support to check for just management token
Add a new method that fetches the bootstrap configuration only if
there isn't a valid management token file instead of checking for
all the hcp-config files.
* Pass data dir as a dependency to link controller
The link controller needs to check the data directory for
the hcp-config files.
* Fetch bootstrap config for token in controller
Load the management token when reconciling a link resource, which will
fetch the agent boostrap configuration if the token is not already
persisted locally. Skip this step if the cluster is in read-only mode.
* Validate resource ID format in link creation
* Handle unauthorized and forbidden errors
Check for 401 and 403s when making GNM requests, exit bootstrap fetch
loop and return specific failure statuses for link.
* Move test function to a testing file
* Log load and status write errors
* Exported services api implemented
* Tests added, refactored code
* Adding server tests
* changelog added
* Proto gen added
* Adding codegen changes
* changing url, response object
* Fixing lint error by having namespace and partition directly
* Tests changes
* refactoring tests
* Simplified uniqueness logic for exported services, sorted the response in order of service name
* Fix lint errors, refactored code
Ultimately we will have to rectify wan federation with v2 catalog adjacent
experiments, but for now blanket prevent usage of the resource-apis,
v2dns, and v2tenancy experiments in secondary datacenters.
* Create HCP management token in HCP manager
* Change InitializeManagementToken to ManagementTokenUpserter
* Implement and use management token upsert function
* Fix race condition in test
* Add idea for improvement as comment
* Return early in upsertManagementToken if token exists
* Add Initializer to the controller
The Initializer adds support for running any required initialization
steps when the controller is first started.
* Implement HCP Link initializer
The link initializer will create a Link resource if the
cloud configuration has been set.
* Simplify retry logic and testing
* Remove internal retry, replace with logging logic
This add a fix to properly verify the gateway mode before creating a watch specific to mesh gateways. This watch have a high performance cost and when mesh gateways are not used is not used.
This also adds an optimization to only return the nodes when watching the Internal.ServiceDump RPC to avoid unnecessary disco chain compilation. As watches in proxy config only need the nodes.
* Option to set HCP client at runtime
Allows us to initially set a nil HCP client for the
telemetry provider and update it later.
* Set telemetry provider HCP client in HCP manager
Set the telemetry provider as a dependency and pass it to
the manager. Update the telemetry provider's HCP client
when the HCP manager starts.
* Add a provider interface for the metrics client
This provider will allow us to configure and reconfigure the
retryable HTTP client and the headers for the metrics client.
* Move HTTP retryable client to separate file
Copied directly from the metrics client.
* Abstract HCP specific values in HTTP client
Remove HCP specific references and instead initiate with
a generic TLS configuration and authentication source.
* Set up HTTP client and headers in the provider
Move setup from the metrics client to the HCP telemetry
provider.
* Update the telemetry provider in the HCP manager
Initialize the provider without the HCP configs and then update
it in the HCP manager to enable it.
* Improve test assertion, fix method comment
* Move client provider to metrics client
* Stop the manager on setup error
* Add separate lock for http configuration
* Start telemetry provider in HCP manager
* Update HCP client and config as part of Run
* Remove option to set config at initialization
* Simplify and clean up setting HCP configs
* Add test for telemetry provider Run method
* Fix race condition
* Use clone of HTTP headers
* Only allow initial update and run once
* Implement In-Process gRPC for use by controller caching/indexing
This replaces the pipe base listener implementation we were previously using. The new style CAN avoid cloning resources which our controller caching/indexing is taking advantage of to not duplicate resource objects in memory.
To maintain safety for controllers and for them to be able to modify data they get back from the cache and the resource service, the client they are presented in their runtime will be wrapped with an autogenerated client which clones request and response messages as they pass through the client.
Another sizable change in this PR is to consolidate how server specific gRPC services get registered and managed. Before this was in a bunch of different methods and it was difficult to track down how gRPC services were registered. Now its all in one place.
* Fix race in tests
* Ensure the resource service is registered to the multiplexed handler for forwarding from client agents
* Expose peer streaming on the internal handler
* Add HCCLink resource type
* Register HCCLink resource type with basic validation
* Add validation for required fields
* Add test for default ACLs
* Add no-op controller for HCCLink
* Add resource-apis semantic validation check in hcclink controller
* Add copyright headers
* Rename HCCLink to Link
* Add hcp_cluster_url to link proto
* Update 'disabled' reason with more detail
* Update link status name to consul.io/hcp/link
* Change link version from v1 to v2
* Use feature flag/experiment to enable v2 resources with HCP
* Update SCADA provider version
Also update mocks for SCADA provider.
* Create SCADA provider w/o HCP config, then update
Adds a placeholder config option to allow us to initialize a SCADA provider
without the HCP configuration. Also adds an update method to then add the
HCP configuration. We need this to be able to eventually always register a
SCADA listener at startup before the HCP config values are known.
* Pass cloud configuration to HCP manager
Save the entire cloud configuration and pass it to the HCP
manager.
* Update and start SCADA provider in HCP manager
Move config updating and starting to the HCP manager. The HCP manager
will eventually be responsible for all processes that contribute
to linking to HCP.