consul/agent/xds
Derek Menteer 0ac8ae6c3b
Fix xDS deadlock due to syncLoop termination. (#20867)
* Fix xDS deadlock due to syncLoop termination.

This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.

The deadlock is triggered by the following return statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

This results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.
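
The failure mode described above can be reproduced in miniature. The sketch below is a simplified model and not the Consul code: the `source` type, its fields, and `mutexFreedAfterCleanup` are hypothetical names, and the early return in `syncLoop` stands in for the error path linked above. It only illustrates why a request/response exchange over a `chan chan struct{}` hangs once the receiving loop is gone.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// source is a hypothetical, stripped-down model of the pre-fix design.
type source struct {
	mu         sync.Mutex         // shared by every stream, as described above
	shutdownCh chan chan struct{} // RPC over channels: send a reply channel, wait for it to close
	lockedCh   chan struct{}      // demo only: closed once cleanup holds mu
}

func newSource() *source {
	return &source{
		shutdownCh: make(chan chan struct{}),
		lockedCh:   make(chan struct{}),
	}
}

// syncLoop models the loop. With fail set it returns early, so nothing
// reads shutdownCh afterwards.
func (s *source) syncLoop(fail bool) {
	if fail {
		return
	}
	doneCh := <-s.shutdownCh
	close(doneCh)
}

// cleanup asks the loop to stop and waits for the reply while holding mu.
func (s *source) cleanup() {
	s.mu.Lock()
	defer s.mu.Unlock()
	close(s.lockedCh)
	doneCh := make(chan struct{})
	s.shutdownCh <- doneCh // blocks forever if syncLoop already returned
	<-doneCh
}

// mutexFreedAfterCleanup reports whether another goroutine can take mu
// within half a second of cleanup acquiring it.
func mutexFreedAfterCleanup(fail bool) bool {
	s := newSource()
	go s.syncLoop(fail)
	go s.cleanup()
	<-s.lockedCh

	acquired := make(chan struct{})
	go func() {
		s.mu.Lock()
		s.mu.Unlock()
		close(acquired)
	}()
	select {
	case <-acquired:
		return true
	case <-time.After(500 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("healthy loop, mutex freed:", mutexFreedAfterCleanup(false))
	fmt.Println("terminated loop, mutex freed:", mutexFreedAfterCleanup(true))
}
```

In the terminated case the goroutine running `cleanup` never returns, so every later caller that needs `mu` waits behind it.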

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.
+ `syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.
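
As a rough illustration of the two-channel arrangement, here is a minimal sketch. The channel names come from the description above; the `watcher` type, the `events` channel, `stopOnce`, and `cleanupReturns` are assumptions made for the example and are not taken from the Consul source.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// watcher is a hypothetical holder for the two channels.
type watcher struct {
	stopSyncLoopCh chan struct{} // closed to ask syncLoop to terminate soon
	syncLoopDoneCh chan struct{} // closed by syncLoop once it has terminated
	stopOnce       sync.Once     // assumption: guards against a double close
}

func newWatcher() *watcher {
	return &watcher{
		stopSyncLoopCh: make(chan struct{}),
		syncLoopDoneCh: make(chan struct{}),
	}
}

// syncLoop runs until it is asked to stop or an event carries an error.
// The deferred close fires on every return path, including the error one.
func (w *watcher) syncLoop(events <-chan error) {
	defer close(w.syncLoopDoneCh)
	for {
		select {
		case <-w.stopSyncLoopCh:
			return
		case err, ok := <-events:
			if !ok || err != nil {
				return // treated as irrecoverable in this sketch
			}
		}
	}
}

// cleanup signals the loop and waits for it. Closing a channel never
// blocks, and receiving from a closed channel returns immediately, so
// this cannot hang even if syncLoop exited long ago.
func (w *watcher) cleanup() {
	w.stopOnce.Do(func() { close(w.stopSyncLoopCh) })
	<-w.syncLoopDoneCh
}

// cleanupReturns reports whether cleanup finishes within one second.
// With failFirst set, syncLoop has already exited on an error beforehand.
func cleanupReturns(failFirst bool) bool {
	w := newWatcher()
	events := make(chan error, 1)
	if failFirst {
		events <- errors.New("irrecoverable")
	}
	go w.syncLoop(events)
	if failFirst {
		<-w.syncLoopDoneCh // wait until the loop has terminated on its own
	}

	returned := make(chan struct{})
	go func() {
		w.cleanup()
		close(returned)
	}()
	select {
	case <-returned:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("cleanup returns, loop running:", cleanupReturns(false))
	fmt.Println("cleanup returns, loop already exited:", cleanupReturns(true))
}
```

The point of the split is that neither side depends on the other still being present: the stop signal is a close rather than a send, and the done signal is delivered by a deferred close rather than by an explicit reply.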

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.

* Add changelog.
2024-03-15 13:57:11 -05:00
accesslogs catalog,mesh,auth: Bump versions to v2beta1 (#18930) 2023-09-22 10:51:15 -06:00
config Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
configfetcher chore: fix missing/incorrect license headers (#18555) 2023-08-22 17:23:54 -05:00
extensionruntime Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
naming [NET-4799] [OSS] xdsv2: listeners L4 support for connect proxies (#18436) 2023-08-15 11:57:07 -07:00
platform Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
proxystateconverter Remove V2 PeerName field from pbresource.Tenancy (#19865) 2024-01-29 15:08:31 -05:00
response add traffic permissions excludes and tests (#20453) 2024-02-07 20:21:44 +00:00
testcommon [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
testdata add traffic permissions excludes and tests (#20453) 2024-02-07 20:21:44 +00:00
validateupstream-test Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
clusters.go Fix SAN matching on terminating gateways (#20417) 2024-01-31 12:17:45 -06:00
clusters_test.go Migrate remaining individual resource tests for service mesh to TestAllResourcesFromSnapshot (#19583) 2023-11-09 20:08:37 +00:00
delta.go Fix xDS deadlock due to syncLoop termination. (#20867) 2024-03-15 13:57:11 -05:00
delta_envoy_extender_ce_test.go Skip filter chain created by permissive mtls (#20406) 2024-01-31 16:39:12 -05:00
delta_envoy_extender_test.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
delta_test.go Fix xDS deadlock due to syncLoop termination. (#20867) 2024-03-15 13:57:11 -05:00
endpoints.go [NET-6221] Ensure LB policy set for locality-aware routing (CE) (#19283) 2023-10-19 10:13:27 -04:00
endpoints_test.go Migrate individual resource tests for Mesh Gateway to TestAllResourcesFromSnapshot (#19502) 2023-11-09 16:39:16 +00:00
failover_policy.go Fix SAN matching on terminating gateways (#20417) 2024-01-31 12:17:45 -06:00
failover_policy_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
golden_test.go xds: update golden tests to be deterministic (#18707) 2023-09-11 11:40:19 -05:00
gw_per_route_filters_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
jwt_authn.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
jwt_authn_ce.go [NET-5457] Fix CE code for jwt multiple virtual hosts bug (#19123) 2023-10-10 16:25:36 -04:00
jwt_authn_test.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
listeners.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
listeners_apigateway.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
listeners_ingress.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
listeners_test.go Migrate remaining individual resource tests for service mesh to TestAllResourcesFromSnapshot (#19583) 2023-11-09 20:08:37 +00:00
locality_policy.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
locality_policy_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
protocol_trace.go NET-5338 - NET-5338 - Run a v2 mode xds server (#18579) 2023-08-24 16:44:14 -06:00
rbac.go NET-6946 / NET-6941 - Replace usage of deprecated Envoy fields envoy.config.route.v3.HeaderMatcher.safe_regex_match and envoy.type.matcher.v3.RegexMatcher.google_re2 (#20013) 2024-01-03 09:53:39 -07:00
rbac_test.go add traffic permissions excludes and tests (#20453) 2024-02-07 20:21:44 +00:00
resources.go [NET-4799] [OSS] xdsv2: listeners L4 support for connect proxies (#18436) 2023-08-15 11:57:07 -07:00
resources_ce_test.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
resources_test.go Skip filter chain created by permissive mtls (#20406) 2024-01-31 16:39:12 -05:00
routes.go NET-6821 Disable Terminating Gateway Auto Host Header Rewrite (#20802) 2024-03-12 15:37:20 -05:00
routes_test.go Migrate individual resource tests for API Gateway to TestAllResourcesFromSnapshot (#19584) 2023-11-09 17:01:54 +00:00
secrets.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
server.go Fix xDS deadlock due to syncLoop termination. (#20867) 2024-03-15 13:57:11 -05:00
server_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
testing.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
xds.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
xds_protocol_helpers_test.go Fix xDS deadlock due to syncLoop termination. (#20867) 2024-03-15 13:57:11 -05:00
z_xds_packages.go Various bits of cleanup detected when using Go Workspaces (#17462) 2023-06-05 16:08:39 -04:00
z_xds_packages_test.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00