consul/agent/proxycfg
Derek Menteer 0ac8ae6c3b
Fix xDS deadlock due to syncLoop termination. (#20867)
* Fix xDS deadlock due to syncLoop termination.

This fixes an issue where agentless xDS streams can deadlock and stay
deadlocked until the server is restarted. While the deadlock is in effect, no
new proxies can connect to the server.

The deadlock is triggered by the following return statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

As a result, the `ConfigSource.cleanup()` function never receives a response
and holds a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.
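
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the Consul code itself: the `source` type, its fields, and the `reproduce` helper are hypothetical names, and the exact request/response shape of the original channel is an assumption. It shows a loop that returns early on an error and a cleanup function that then blocks forever while holding a shared mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// source is a hypothetical, stripped-down stand-in for the catalog ConfigSource.
type source struct {
	mu         sync.Mutex
	shutdownCh chan chan struct{} // request/response over a channel
}

// syncLoop exits early on an error, after which nothing reads shutdownCh.
func (s *source) syncLoop(work <-chan error) {
	for {
		select {
		case done := <-s.shutdownCh:
			close(done) // acknowledge the shutdown request
			return
		case err := <-work:
			if err != nil {
				return // early return: shutdownCh is no longer serviced
			}
		}
	}
}

// cleanup holds the shared mutex while waiting for the loop to acknowledge.
func (s *source) cleanup() {
	s.mu.Lock()
	defer s.mu.Unlock()
	done := make(chan struct{})
	s.shutdownCh <- done // blocks forever once syncLoop has returned
	<-done
}

// reproduce reports whether cleanup returned in time and whether the mutex is free.
func reproduce() (cleanupReturned, mutexFree bool) {
	s := &source{shutdownCh: make(chan chan struct{})}
	work := make(chan error)
	go s.syncLoop(work)
	work <- errors.New("irrecoverable") // received by syncLoop, which then returns

	finished := make(chan struct{})
	go func() { s.cleanup(); close(finished) }()
	select {
	case <-finished:
		cleanupReturned = true
	case <-time.After(200 * time.Millisecond):
	}
	if s.mu.TryLock() { // TryLock requires Go 1.18 or newer
		mutexFree = true
		s.mu.Unlock()
	}
	return
}

func main() {
	returned, free := reproduce()
	fmt.Println("cleanup returned:", returned, "mutex free:", free)
}
```

In this sketch both values come back `false`: the cleanup goroutine is stuck on the channel send, and every other caller that needs the mutex would queue up behind it.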

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.
+ `syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.
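
A minimal sketch of the two-channel pattern, assuming a simplified loop: only the channel names `stopSyncLoopCh` and `syncLoopDoneCh` come from this commit message, while the `source` type, the `work` channel, and the `cleanupReturns` helper are hypothetical. Because the done channel is closed by a `defer`, it is closed on every exit path, including the early return on error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// source is a hypothetical stand-in for the per-proxy sync state.
type source struct {
	mu             sync.Mutex
	stopSyncLoopCh chan struct{} // closed by cleanup to request termination
	syncLoopDoneCh chan struct{} // closed by syncLoop when it returns, for any reason
}

func newSource() *source {
	return &source{
		stopSyncLoopCh: make(chan struct{}),
		syncLoopDoneCh: make(chan struct{}),
	}
}

// syncLoop runs until it is asked to stop or hits an irrecoverable error.
func (s *source) syncLoop(work <-chan error) {
	defer close(s.syncLoopDoneCh) // runs on every return path
	for {
		select {
		case <-s.stopSyncLoopCh:
			return
		case err, ok := <-work:
			if !ok || err != nil {
				return // early exit still closes syncLoopDoneCh
			}
		}
	}
}

// cleanup must be called at most once in this sketch; a second close would panic.
func (s *source) cleanup() {
	s.mu.Lock()
	defer s.mu.Unlock()
	close(s.stopSyncLoopCh) // closing never blocks, even with no receiver
	<-s.syncLoopDoneCh      // returns at once if the loop already exited
}

// cleanupReturns reports whether cleanup completes after the loop died on an error.
func cleanupReturns() bool {
	s := newSource()
	work := make(chan error)
	go s.syncLoop(work)
	work <- errors.New("irrecoverable") // loop receives this and returns

	finished := make(chan struct{})
	go func() { s.cleanup(); close(finished) }()
	select {
	case <-finished:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("cleanup returned:", cleanupReturns())
}
```

Neither side waits on a handshake here: closing a channel never blocks, and receiving from a closed channel returns immediately, so the order in which the loop exits and cleanup runs no longer matters.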

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.
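
The eviction behavior can be pictured as one more case in the stream's select loop. This is a sketch under stated assumptions, not the actual xDS delta handler: `serveStream`, `errSourceTerminated`, `runStream`, and the string-typed updates are all hypothetical. The idea is that the stream returns an error once the source's done channel is closed, so the proxy reconnects instead of waiting on updates that will never arrive.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errSourceTerminated is a hypothetical error used to end the stream.
var errSourceTerminated = errors.New("config source terminated")

// serveStream forwards updates until the context ends or the source terminates.
func serveStream(ctx context.Context, updates <-chan string, srcDone <-chan struct{}, send func(string)) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-srcDone:
			// The sync loop behind this stream is gone: evict the connection.
			return errSourceTerminated
		case u := <-updates:
			send(u)
		}
	}
}

// runStream delivers one update, terminates the source, and returns what happened.
func runStream() ([]string, error) {
	updates := make(chan string)
	srcDone := make(chan struct{})
	var sent []string
	result := make(chan error, 1)
	go func() {
		result <- serveStream(context.Background(), updates, srcDone, func(u string) {
			sent = append(sent, u)
		})
	}()
	updates <- "snapshot-1" // unbuffered: returns once the stream has received it
	close(srcDone)          // simulate the sync loop terminating
	err := <-result
	return sent, err
}

func main() {
	sent, err := runStream()
	fmt.Println(sent, err)
}
```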

* Add changelog.
2024-03-15 13:57:11 -05:00
internal/watch [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
api_gateway.go Add the plumbing for APIGW JWT work (#18609) 2023-08-31 12:23:59 -04:00
api_gateway_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
config_snapshot_glue.go Run copyright after running deep-copy as part of the Makefile/CI (#18741) 2023-09-11 13:50:52 -04:00
config_snapshot_glue_test.go Run copyright after running deep-copy as part of the Makefile/CI (#18741) 2023-09-11 13:50:52 -04:00
connect_proxy.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
data_sources.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
data_sources_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
deep-copy.sh [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
ingress_gateway.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
manager.go Fix xDS deadlock due to syncLoop termination. (#20867) 2024-03-15 13:57:11 -05:00
manager_test.go parse config protocol on write to optimize disco-chain compilation (#19829) 2023-12-07 13:46:46 -05:00
mesh_gateway.go Fix to not create a watch to `Internal.ServiceDump` when mesh gateway is not used (#20168) 2024-01-18 16:44:53 -06:00
mesh_gateway_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
naming.go OSS -> CE (community edition) changes (#18517) 2023-08-22 09:46:03 -05:00
naming_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
naming_test.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
proxycfg.deepcopy.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
proxycfg.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
snapshot.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
snapshot_test.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
state.go Fix to not create a watch to `Internal.ServiceDump` when mesh gateway is not used (#20168) 2024-01-18 16:44:53 -06:00
state_ce_test.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
state_test.go Fix to not create a watch to `Internal.ServiceDump` when mesh gateway is not used (#20168) 2024-01-18 16:44:53 -06:00
terminating_gateway.go Allow connections through Terminating Gateways from peered clusters NET-3463 (#18959) 2023-10-05 21:54:23 +00:00
testing.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
testing_api_gateway.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
testing_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
testing_connect_proxy.go [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
testing_ingress_gateway.go Migrate individual resource tests for Ingress Gateway to TestAllResourcesFromSnapshot (#19506) 2023-11-09 16:08:07 +00:00
testing_mesh_gateway.go parse config protocol on write to optimize disco-chain compilation (#19829) 2023-12-07 13:46:46 -05:00
testing_peering.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
testing_terminating_gateway.go NET-6821 Disable Terminating Gateway Auto Host Header Rewrite (#20802) 2024-03-12 15:37:20 -05:00
testing_tproxy.go Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871) 2023-12-11 09:25:11 -06:00
testing_upstreams.go [NET-5455] Allow disabling request and idle timeouts with negative values in service router and service resolver (#19992) 2023-12-19 15:36:07 -08:00
testing_upstreams_ce.go Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
upstreams.go Fix to not create a watch to `Internal.ServiceDump` when mesh gateway is not used (#20168) 2024-01-18 16:44:53 -06:00