This refactors and relocates the following packages to live under internal/gossip instead of either the top-level `lib` or `agent/consul`:
- librtt: Serf network coordinate helpers
- libserf: miscellaneous Serf helpers
security: upgrade vault/api to remove go-jose.v2
This dependency has an open vulnerability (GO-2024-2631), and is no
longer needed by the latest `vault/api`. This is a follow-up to the
upgrade of `go-jose/v3` in this repository to make all our dependencies
consolidate on v3.
Also remove the recently added security scan triage block for
GO-2024-2631, which was added due to incorrect reports that
`go-jose/v3@3.0.3` was impacted; in reality, it was this indirect
client dependency (not impacted by CVE) that the scanner was flagging. A
bug report has been filed to address the incorrect reporting.
* Fix xDS deadlock due to syncLoop termination.
This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.
Effectively, this deadlock is triggered by the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202
When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192
This results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247
Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.
----
The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:
+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.
+ `syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.
Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock described above can no longer occur.
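A minimal sketch of the shape of the fix, using the channel names from the description above (the surrounding type is simplified and is not the actual ConfigSource code):

```go
package catalog // illustrative placement only

import (
	"context"
	"sync"
)

type configSource struct {
	mu             sync.Mutex
	stopSyncLoopCh chan struct{} // closed to ask syncLoop() to stop soon
	syncLoopDoneCh chan struct{} // closed by syncLoop() once it has stopped
}

func (c *configSource) syncLoop(ctx context.Context) {
	// However syncLoop() exits -- context cancellation, a stop request, or an
	// irrecoverable error -- the done channel is always closed, so nothing
	// waiting on it (e.g. cleanup()) can block forever.
	defer close(c.syncLoopDoneCh)

	for {
		select {
		case <-ctx.Done():
			return
		case <-c.stopSyncLoopCh:
			return
		// ... plus the cases that consume watch/update events ...
		}
	}
}

func (c *configSource) cleanup() {
	c.mu.Lock()
	defer c.mu.Unlock()

	close(c.stopSyncLoopCh) // request termination
	<-c.syncLoopDoneCh      // guaranteed to unblock thanks to the deferred close
}
```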
We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.
* Add changelog.
When executing the test with multiple tenancies, the various mesh controllers remained running across multiple sub-tests. While the resourcetest.Client cleans up resources created with it, resources created by the controllers were not being cleaned up. This could result in subsequent sub-tests encountering unexpected values and failing when they should pass.
* WIP
* got empty state working and cleaned up additional code
* Moved API gateway mapper to its own file, removed mesh port usage for
API gateway, and allowed gateway to reconcile without routes
* Update internal/mesh/internal/controllers/gatewayproxy/controller.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Feedback from PR Review:
- rename referencedTCPRoutes variable to be more descriptive
- put fetchers in alphabetical order
- move log statements to be more indicative of internal state
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
Wire the ComputedImplicitDestinations resource into the sidecar controller, replacing the inline version already present.
Also:
- Rewrite the controller to use the controller cache
- Rewrite it to no longer depend on ServiceEndpoints
- Remove the fetcher and (local) cache abstraction
* Add function to get update channel for watching HCP Link
* Add MonitorHCPLink function
This function can be called in a goroutine to manage the lifecycle
of the HCP manager.
* Update HCP Manager config in link monitor before starting
This updates MonitorHCPLink so that it updates the HCP manager
with an HCP client and management token when a Link is upserted.
* Let MonitorHCPLink handle lifecycle instead of link controller
* Remove cleanup from Link controller and move it to MonitorHCPLink
Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.
* Remove HCP Manager dependency from Link Controller
The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.
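Taken together, the commits above describe a monitor loop roughly like the following sketch; the names, types, and signatures here are illustrative placeholders, not the actual MonitorHCPLink code:

```go
package hcpmonitor // illustrative

import "context"

// Hypothetical shapes used only for this sketch.
type linkUpdate struct {
	Deleted         bool
	ManagementToken string
}

type hcpManager interface {
	UpdateConfig(managementToken string)
	Start(ctx context.Context) error
	Stop() error
}

// monitorHCPLink manages the HCP manager's lifecycle from Link updates; it is
// meant to be run in its own goroutine for the life of the server.
func monitorHCPLink(ctx context.Context, mgr hcpManager, updates <-chan linkUpdate) {
	for {
		select {
		case <-ctx.Done():
			return
		case upd := <-updates:
			if upd.Deleted {
				// Link deleted: stop the manager and perform the cleanup of
				// HCP-related files formerly done by the Link controller.
				_ = mgr.Stop()
				continue
			}
			// Link upserted: refresh the manager's client/token config
			// before (re)starting it.
			mgr.UpdateConfig(upd.ManagementToken)
			_ = mgr.Start(ctx)
		}
	}
}
```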
* Add Linked prefix to Linked status variables
This is in preparation for adding a new status type to the
Link resource.
* Add new "validated" status type to link resource
The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (e.g., the HCP manager) know when the Link is ready to link
with HCP.
* Fix tests
* Handle new 'EndOfSnapshot' WatchList event
* Fix watch test
* Remove unnecessary config from TestAgent_scadaProvider
Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.
This change is not strictly related to the rest of this PR, but it is a
small, adjacent cleanup that was noticed while working on it.
* Simplify link watch test and remove sleep from link watch
This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.
This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.
* Add better logging for link validation. Remove EndOfSnapshot test.
* Refactor link monitor test into a table test
* Add some clarifying comments to link monitor
* Simplify link watch test
* Test a bunch more error cases in link monitor test
* Use exponential backoff instead of errorCounter in LinkWatch
* Move link watch and link monitor into a single goroutine called from server.go
* Refactor HCP link watcher to use a single goroutine.
Previously, if the WatchClient errored, we would never have recovered
because we never retried creating the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
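A minimal sketch of that single long-lived watcher goroutine with exponential backoff; the stream constructor here is a stand-in for the real WatchClient call, not its actual API:

```go
package hcp // illustrative

import (
	"context"
	"time"
)

func runLinkWatch(ctx context.Context, newStream func(context.Context) (<-chan struct{}, error)) {
	backoff := time.Second
	const maxBackoff = time.Minute

	for ctx.Err() == nil {
		events, err := newStream(ctx)
		if err != nil {
			// Stream creation failed: wait (respecting cancellation), back off
			// exponentially, and retry for the life of the server agent.
			select {
			case <-time.After(backoff):
			case <-ctx.Done():
				return
			}
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
			continue
		}
		backoff = time.Second // reset once a stream is established

		for range events {
			// handle link events until the stream closes or errors
		}
	}
}
```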
Creates a new controller to generate ComputedImplicitDestinations resources by
composing ComputedRoutes, Services, and ComputedTrafficPermissions to
infer all ParentRef services that could possibly send some portion of traffic to a
Service that has at least one accessible Workload Identity. A followup PR will
rewire the sidecar controller to make use of this new resource.
As this is a performance optimization rather than a security feature, the following
aspects of traffic permissions have been ignored:
- DENY rules
- port rules (all ports are allowed)
Also:
- Add some v2 TestController machinery to help test complex dependency mappers.
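As a conceptual illustration of the inference described above (not the actual controller code; every name here is a placeholder), the ALLOW-only, port-agnostic check might look like:

```go
// computeImplicitDestinations returns the services a given source workload
// identity may implicitly reach: a service is included as soon as at least one
// of its backing workload identities is allowed by an ALLOW rule (DENY rules
// and port restrictions are ignored, per the notes above).
func computeImplicitDestinations(
	sourceIdentity string,
	serviceIdentities map[string][]string, // service name -> backing workload identities
	allowed func(source, destination string) bool, // derived from ComputedTrafficPermissions
) []string {
	var dests []string
	for svc, ids := range serviceIdentities {
		for _, dest := range ids {
			if allowed(sourceIdentity, dest) {
				dests = append(dests, svc)
				break // one accessible identity is enough to include the service
			}
		}
	}
	return dests
}
```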
Previously, calling `index.New` returned an object containing the index information (the Indexer, whether it was required, and the name of the index) as well as a radix tree to store indexed data.
Now the main `Index` type no longer contains the radix tree for indexed data. Instead, the `IndexedData` method can be used to combine the main `Index` with a radix tree in the `IndexedData` structure.
The cache still only allows configuring the `Index` type and will invoke the `IndexedData` method on the provided indexes to get the structure that the cache can use for actual data management.
All of this makes it now safe to reuse the `index.Index` types.
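An illustrative-only sketch of that split; the names mirror the description above, but this is not the actual cache/index package (a plain map stands in for the radix tree to keep the sketch short):

```go
// Index holds only the index configuration and is safe to reuse.
type Index struct {
	Name     string
	Required bool
	Indexer  func(resource any) ([]byte, bool) // extracts the indexed key
}

// IndexedData pairs an Index with its own data store.
type IndexedData struct {
	*Index
	data map[string]any
}

// IndexedData is what the cache invokes on each configured Index to get a
// structure it can use for actual data management.
func (i *Index) IndexedData() *IndexedData {
	return &IndexedData{Index: i, data: make(map[string]any)}
}
```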
* Panic when controllers attempt to make invalid requests to the resource service
This will help to catch bugs in tests that could cause infinite errors to be emitted.
* Disable the API GW v2 controller
With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.
* Ensure that a test server gets the full type registry instead of constructing its own
* Skip TestServer_ControllerDependencies
* Fix peering tests so that they use the full resource registry.
The endpoints controller currently encodes the list of unique workload identities
referenced by all workloads matched by a Service into a special data-bearing
status condition on that Service. This allows a downstream controller to avoid an
expensive watch on the ServiceEndpoints type just to get this data.
The current encoding does not lend itself well to machine parsing, which is what
the field is meant for, so this PR simplifies the encoding from:
"blah blah: " + strings.Join(ids, ",") + "."
to
strings.Join(ids, ",")
It also provides an exported utility function to easily extract this data.
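A sketch of the new encoding and the extraction helper; the function names are illustrative, not necessarily the exported names:

```go
package endpoints // illustrative

import "strings"

// encodeBoundIdentities produces the new machine-friendly condition message,
// e.g. []string{"api", "web"} -> "api,web".
func encodeBoundIdentities(ids []string) string {
	return strings.Join(ids, ",")
}

// decodeBoundIdentities is the extraction utility mentioned above; it recovers
// the identity list from the condition message.
func decodeBoundIdentities(msg string) []string {
	if msg == "" {
		return nil
	}
	return strings.Split(msg, ",")
}
```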
The new controller caches are initialized before the DependencyMappers or the
Reconciler run, but importantly they are not populated. The expectation is that
when the WatchList call is made to the resource service it will send an initial
snapshot of all resources matching a single type, and then perpetually send
UPSERT/DELETE events afterward. This initial snapshot will cycle through the
caching layer and will catch it up to reflect the stored data.
Critically, the dependency mappers and reconcilers will race against the restoration
of the caches on server startup or leader election. During this time it is possible that a
mapper or reconciler will use the cache to look up a specific relationship and
not find it. That very same reconciler may choose to then recompute some
persisted resource and in effect rewind it to a prior computed state.
Change
- Since we are updating the behavior of the WatchList RPC, it was aligned to
match that of pbsubscribe and pbpeerstream by using a protobuf oneof instead of the enum+fields option.
- The WatchList RPC now has three response event types: Upsert, Delete, and
EndOfSnapshot. On a new watch, the initial batch of "snapshot" Upserts is followed
by an EndOfSnapshot event before the never-ending sequence of Upsert/Delete
events begins.
- Within the Controller startup code we will launch N+1 goroutines to execute WatchList
queries for the watched types. The UPSERTs will be applied to the nascent cache
only (no mappers will execute).
- Upon witnessing the EndOfSnapshot event, those goroutines will terminate.
- When all cache priming routines complete, the normal set of N+1 long-lived
watch routines will launch to officially witness all events in the system using the
primed cache (a sketch of the priming phase follows below).
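The priming phase, sketched with simplified stand-ins for the real WatchList stream and controller cache (none of these types are the actual generated or cache types):

```go
package controller // illustrative

import (
	"context"
	"errors"
)

type resource struct {
	ID   string
	Data any
}

type watchEvent struct {
	Upsert        *resource
	DeleteID      string
	EndOfSnapshot bool
}

// primeCache applies the snapshot events to the nascent cache and returns once
// EndOfSnapshot is witnessed; no dependency mappers run during this phase.
func primeCache(ctx context.Context, events <-chan watchEvent, cache map[string]*resource) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-events:
			if !ok {
				return errors.New("watch stream closed before the snapshot completed")
			}
			switch {
			case ev.EndOfSnapshot:
				// Snapshot replayed: this priming goroutine terminates and the
				// long-lived watch routines (which do run mappers) take over.
				return nil
			case ev.DeleteID != "":
				delete(cache, ev.DeleteID)
			case ev.Upsert != nil:
				cache[ev.Upsert.ID] = ev.Upsert
			}
		}
	}
}
```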
* Trigger the v1 compat exported-services controller when the v1 config entry is modified.
* Hook up exported-services config entries to the event publisher.
* Add tests to the v2 exported services shim.
* Use the local materializer to trigger updates on the v1 compat exported-services controller when exported-services config entries are modified.
* stop sleeping when context is cancelled
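The pattern behind that one-liner, sketched for clarity (the helper name is illustrative):

```go
package lib // illustrative

import (
	"context"
	"time"
)

// sleepCtx waits for d or until ctx is cancelled, whichever happens first, so
// shutdown no longer has to wait out a fixed time.Sleep.
func sleepCtx(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```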
* Add validation of MeshGateway name + listeners
* Adds test for ValidateMeshGateway
* Fixes data fetcher test for gatewayproxy
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
[NET-6429] Program ProxyStateTemplate to route cross-partition traffic to the correct destination mesh gateway
* Program mesh port to route wildcarded gateway SNI to the appropriate remote partition's mesh gateway
* Update target + route ports in service endpoint refs when building PST
* Use proper name of local datacenter when constructing SNI for gateway target
* Use destination identities for TLS when routing L4 traffic through the mesh gateway
* Use new constants, move comment to correct location
* Use new constants for port names
* Update test assertions
* Undo debug logging change
* Use a full EndpointRef on ComputedRoutes targets instead of just the ID
Today, the `ComputedRoutes` targets have the appropriate ID set for their `ServiceEndpoints` reference; however, the `MeshPort` and `RoutePort` are assumed to be those of the target when adding the endpoints reference in the sidecar's `ProxyStateTemplate`.
This is problematic when the target lives behind a `MeshGateway` and the `Mesh/RoutePort` used in the sidecar's `ProxyStateTemplate` should be that of the `MeshGateway` instead of the target.
Instead of assuming the `MeshPort` and `RoutePort` when building the `ProxyStateTemplate` for the sidecar, let's just add the full `EndpointRef` -- including the ID and the ports -- when hydrating the computed destinations.
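Purely illustrative structs showing the shape of that change; these are not the real proto-generated types:

```go
// Before: only the ServiceEndpoints ID was carried on the target, and the
// sidecar's ProxyStateTemplate builder assumed the target's own ports.
type targetBefore struct {
	EndpointsID string
}

// After: the full endpoint reference is hydrated onto the target, so the
// builder can use the MeshGateway's ports when the target sits behind one.
type endpointRef struct {
	ID        string
	MeshPort  string
	RoutePort string
}

type targetAfter struct {
	Endpoints endpointRef
}
```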
* Make sure the UID from the existing ServiceEndpoints makes it onto ComputedRoutes
* Update test assertions
* Undo confusing whitespace change
* Remove one-line function wrapper
* Use plural name for endpoints ref
* Add constants for gateway name, kind and port names
* Add Stop method to telemetry provider
Stop the main loop of the provider and set the config
to disabled.
* Add interface for telemetry provider
Added for easier testing. Also renamed Run to Start, which better
fits with Stop.
* Add Stop method to HCP manager
* Add manager interface, rename implementation
Add interface for easier testing, rename existing Manager to HCPManager.
* Stop HCP manager in link Finalizer
* Attempt to cleanup if resource has been deleted
The link should be cleaned up by the finalizer, but there's an edge
case in a multi-server setup where the link is fully deleted on one
server before the other server reconciles. This will cover the case
where the reconcile happens after the resource is deleted.
* Add a delete management token function
Passes a function to the HCP manager that deletes the management token
that was initially created by the manager.
* Delete token as part of stopping the manager
* Lock around disabling config, remove descriptions
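A sketch of the Stop pattern described in the commits above; the type and fields are illustrative, not the actual telemetry provider or HCP manager code:

```go
package hcp // illustrative

import (
	"context"
	"sync"
)

type telemetryProvider struct {
	mu       sync.Mutex
	disabled bool
	cancel   context.CancelFunc // cancels the main loop started by Start
}

func (p *telemetryProvider) Start(ctx context.Context) {
	ctx, p.cancel = context.WithCancel(ctx)
	go p.run(ctx)
}

func (p *telemetryProvider) run(ctx context.Context) {
	<-ctx.Done() // placeholder for the real export loop
}

// Stop halts the main loop and disables the config; the lock prevents a
// concurrent reader from observing a half-disabled configuration. (Per the
// commits above, the HCP manager's own Stop additionally deletes the
// management token it originally created.)
func (p *telemetryProvider) Stop() {
	if p.cancel != nil {
		p.cancel()
	}
	p.mu.Lock()
	p.disabled = true
	p.mu.Unlock()
}
```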