* Fix xDS deadlock due to syncLoop termination.
This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.
Effectively, the trigger for this deadlock stems from the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202
When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192
Which results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247
Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.
----
The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:
+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon. +
`syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.
Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.
We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.
* Add changelog.
When executing the test with multiple tenancies the various mesh controllers remained running across multiple sub-tests. While the resourcetest.Client cleans up resources created with it, resources created by the controllers were not being cleaned up. This could result in subsequent sub tests encountering unexpected values and failing when they should pass.
* WIP
* got empty state working and cleaned up additional code
* Moved api gateway mapper to it's own file, removed mesh port usage for
api gateway, allow gateway to reconcile without routes
* Update internal/mesh/internal/controllers/gatewayproxy/controller.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Feedback from PR Review:
- rename referencedTCPRoutes variable to be more descriptive
- fetchers are in alphabetical order
- move log statements to be more indicative of internal state
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
Wire the ComputedImplicitDestinations resource into the sidecar controller, replacing the inline version already present.
Also:
- Rewrite the controller to use the controller cache
- Rewrite it to no longer depend on ServiceEndpoints
- Remove the fetcher and (local) cache abstraction
Creates a new controller to create ComputedImplicitDestinations resources by
composing ComputedRoutes, Services, and ComputedTrafficPermissions to
infer all ParentRef services that could possibly send some portion of traffic to a
Service that has at least one accessible Workload Identity. A followup PR will
rewire the sidecar controller to make use of this new resource.
As this is a performance optimization, rather than a security feature the following
aspects of traffic permissions have been ignored:
- DENY rules
- port rules (all ports are allowed)
Also:
- Add some v2 TestController machinery to help test complex dependency mappers.
* Panic when controllers attempt to make invalid requests to the resource service
This will help to catch bugs in tests that could cause infinite errors to be emitted.
* Disable the API GW v2 controller
With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.
* Ensure that a test server gets the full type registry instead of constructing its own
* Skip TestServer_ControllerDependencies
* Fix peering tests so that they use the full resource registry.
* Add validation of MeshGateway name + listeners
* Adds test for ValidateMeshGateway
* Fixes data fetcher test for gatewayproxy
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
[NET-6429] Program ProxyStateTemplate to route cross-partition traffic to the correct destination mesh gateway
* Program mesh port to route wildcarded gateway SNI to the appropriate remote partition's mesh gateway
* Update target + route ports in service endpoint refs when building PST
* Use proper name of local datacenter when constructing SNI for gateway target
* Use destination identities for TLS when routing L4 traffic through the mesh gateway
* Use new constants, move comment to correct location
* Use new constants for port names
* Update test assertions
* Undo debug logging change
* Use a full EndpointRef on ComputedRoutes targets instead of just the ID
Today, the `ComputedRoutes` targets have the appropriate ID set for their `ServiceEndpoints` reference; however, the `MeshPort` and `RoutePort` are assumed to be that of the target when adding the endpoints reference in the sidecar's `ProxyStateTemplate`.
This is problematic when the target lives behind a `MeshGateway` and the `Mesh/RoutePort` used in the sidecar's `ProxyStateTemplate` should be that of the `MeshGateway` instead of the target.
Instead of assuming the `MeshPort` and `RoutePort` when building the `ProxyStateTemplate` for the sidecar, let's just add the full `EndpointRef` -- including the ID and the ports -- when hydrating the computed destinations.
* Make sure the UID from the existing ServiceEndpoints makes it onto ComputedRoutes
* Update test assertions
* Undo confusing whitespace change
* Remove one-line function wrapper
* Use plural name for endpoints ref
* Add constants for gateway name, kind and port names
[OG Author: michael.zalimeni@hashicorp.com, rebase needed a separate PR]
* v2: support virtual port in Service port references
In addition to Service target port references, allow users to specify a
port by stringified virtual port value. This is useful in environments
such as Kubernetes where typical configuration is written in terms of
Service virtual ports rather than workload (pod) target port names.
Retaining the option of referencing target ports by name supports VMs,
Nomad, and other use cases where virtual ports are not used by default.
To support both uses cases at once, we will strictly interpret port
references based on whether the value is numeric. See updated
`ServicePort` docs for more details.
* v2: update service ref docs for virtual port support
Update proto and generated .go files with docs reflecting virtual port
reference support.
* v2: add virtual port references to L7 topo test
Add coverage for mixed virtual and target port references to existing
test.
* update failover policy controller tests to work with computed failover policy and assert error conditions against FailoverPolicy and ComputedFailoverPolicy resources
* accumulate services; don't overwrite them in enterprise
---------
Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
* If a workload does not implement a port, it should not be included in the list of endpoints for the Envoy cluster for that port.
* Adds tenancy tests for xds controller and xdsv2 resource generation, and adds all those files.
* The original change in this PR was for filtering the list of endpoints by the port being routed to (bullet 1). Since I made changes to sidecarproxycontroller golden files, I realized some of the golden files were unused because of the tenancy changes, so when I deleted those, that broke xds controller tests which weren't correctly using tenancy. So when I fixed that, then the xdsv2 tests broke, so I added tenancy support there too. So now, from sidecarproxy controller -> xds controller -> xdsv2 we now have tenancy support and all the golden files are lined up.
* API Gateway proto
* fix lint issue
* new line
* run make proto format
* checkpoint
* stub
* Update internal/mesh/internal/controllers/apigateways/controller.go
* panic when passing an incorrect type to the data fetcher
* Add assertions for sidecarproxy datafetcher as well
* rename assertion function
* Add in comments to ensure devs know about potential panics for using
invalid types
* fix method call
* Use black hole cluster for default router when no matches
* Update test assertions
* Use null route cluster instead of black hole cluster concept
* Update test assertions
* Add cache resource decoding helpers
* Implement a common package for workload selection facilities. This includes:
* Controller cache Index
* ACL hooks
* Dependency Mapper to go from workload to list of resources which select it
* Dependency Mapper to go from a resource which selects workloads to all the workloads it selects.
* Update the endpoints controller to use the cache instead of custom mappers.
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* Upgrade Go to 1.21
* ci: detect Go backwards compatibility test version automatically
For our submodules and other places we choose to test against previous
Go versions, detect this version automatically from the current one
rather than hard-coding it.
* NET-6426 Create ProxyStateTemplate when reconciling MeshGateway resource
* Add TODO for switching fetch method based on gateway type
* Use gateway-kind in workload metadata instead of owner reference
* Create ProxyStateTemplate builder for gatewayproxy controller
* Update to use new controller interface
* Add copyright headers
* Set correct name for ProxyStateTemplate identity reference
* Generate empty ProxyStateTemplate by fetching MeshGateway
This cheats and looks up the MeshGateway directly. In the future, we will need a Workload => xGateway mapper
* Specify owner reference when writing ProxyStateTemplate
* Update dependency mapper to account for multiple controllers per resource type
* Regenerate v2 resource dependencies map
* Add helpful trace logs, tag TODOs with ticket identifiers
* NET-6899 Create name-aligned Service when reconciling MeshGateway resource
The Service has an owner reference added to it indicating that it belongs to a MeshGateway
* Specify port list when creating Service
* Use constants, add TODO w/ ticket reference
* Include gateway-kind in metadata of Service resource
* NET-6663 Modify sidecarproxy controller to skip xGateway resources
* Check workload metadata after nil-check for workload
* Add test asserting that workloads with meta gateway-kind are ignored
* Use more common pattern for map access to increase readability
* Add a make target to run lint-consul-retry on all the modules
* Cleanup sdk/testutil/retry
* Fix a bunch of retry.Run* usage to not use the outer testing.T
* Fix some more recent retry lint issues and pin to v1.4.0 of lint-consul-retry
* Fix codegen copywrite lint issues
* Don’t perform cleanup after each retry attempt by default.
* Use the common testutil.TestingTB interface in test-integ/tenancy
* Fix retry tests
* Update otel access logging extension test to perform requests within the retry block
test: Address occasional flakes in sidecarproxy/controller_test.go
We've observed an occasional flake in this test where some state check
fails. Adding in some wait wrappers to these state checks will hopefully
address the issue, assuming it is a simple flake.
* Add meshconfiguration/controller
* Add MeshConfiguration Registration function
* Fix the TODOs on the RegisterMeshGateway function
* Call RegisterMeshConfiguration
* Add comment to MeshConfigurationRegistration
* Add a test for Reconcile and some comments