consul/internal
Derek Menteer 0ac8ae6c3b
Fix xDS deadlock due to syncLoop termination. (#20867)
* Fix xDS deadlock due to syncLoop termination.

This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.

The trigger for this deadlock is the following return statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

As a result, the `ConfigSource.cleanup()` function never receives a response
and holds a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.
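
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the Consul code: `legacySyncLoop`, `legacyCleanupBlocks`, `shutdownCh`, and `exited` are hypothetical names, and a timeout stands in for the mutex that the real `cleanup()` would hold forever.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// legacySyncLoop mimics the RPC-over-channels shape: a caller sends a reply
// channel on shutdownCh and waits for the loop to close it.
// exited is a demo-only signal and is an assumption of this sketch.
func legacySyncLoop(events <-chan error, shutdownCh <-chan chan struct{}, exited chan<- struct{}) {
	defer close(exited)
	for {
		select {
		case err := <-events:
			if err != nil {
				// Early return: nothing reads shutdownCh after this point.
				return
			}
		case doneCh := <-shutdownCh:
			close(doneCh)
			return
		}
	}
}

// legacyCleanupBlocks reports whether a cleanup request is left unanswered
// after the loop has already terminated on an error.
func legacyCleanupBlocks() bool {
	events := make(chan error, 1)
	shutdownCh := make(chan chan struct{})
	exited := make(chan struct{})
	go legacySyncLoop(events, shutdownCh, exited)

	events <- errors.New("irrecoverable")
	<-exited // the loop is gone before cleanup starts

	doneCh := make(chan struct{})
	select {
	case shutdownCh <- doneCh:
		<-doneCh
		return false
	case <-time.After(50 * time.Millisecond):
		// Without this timeout the caller would wait forever,
		// which is the deadlock when a shared mutex is held.
		return true
	}
}

func main() {
	fmt.Println("cleanup blocked:", legacyCleanupBlocks())
}
```

The send on `shutdownCh` can never complete because the only receiver has returned, so any lock held across that send is never released.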

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.
+ `syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.
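
As a rough illustration of the two-channel shape, here is a minimal sketch under stated assumptions. Only the channel names `stopSyncLoopCh` and `syncLoopDoneCh` come from this change; the `watch` and `source` types, the `events` channel, and `cleanupAfterFailure` are hypothetical stand-ins and do not mirror the real `ConfigSource`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// watch is a hypothetical per-proxy record holding the two channels.
type watch struct {
	stopSyncLoopCh chan struct{} // closed to ask syncLoop to terminate soon
	syncLoopDoneCh chan struct{} // closed by syncLoop once it has terminated
}

func newWatch() *watch {
	return &watch{
		stopSyncLoopCh: make(chan struct{}),
		syncLoopDoneCh: make(chan struct{}),
	}
}

// syncLoop runs until it is asked to stop or an event carries an error.
func (w *watch) syncLoop(events <-chan error) {
	// The deferred close runs on every return path, including early returns.
	defer close(w.syncLoopDoneCh)
	for {
		select {
		case <-w.stopSyncLoopCh:
			return
		case err, ok := <-events:
			if !ok || err != nil {
				return
			}
		}
	}
}

// source is a hypothetical owner of watches guarded by a shared mutex.
type source struct {
	mu      sync.Mutex
	watches map[string]*watch
}

// cleanup stops the loop and waits for it while holding the mutex.
func (s *source) cleanup(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	w, ok := s.watches[id]
	if !ok {
		return
	}
	close(w.stopSyncLoopCh)
	<-w.syncLoopDoneCh // returns immediately if the loop already exited
	delete(s.watches, id)
}

// cleanupAfterFailure reports whether cleanup completes even though the
// loop terminated on its own beforehand.
func cleanupAfterFailure() bool {
	events := make(chan error, 1)
	w := newWatch()
	s := &source{watches: map[string]*watch{"proxy-1": w}}
	go w.syncLoop(events)

	events <- errors.New("irrecoverable")
	<-w.syncLoopDoneCh
	s.cleanup("proxy-1")
	return len(s.watches) == 0
}

func main() {
	fmt.Println("cleanup completed:", cleanupAfterFailure())
}
```

Because a closed channel can be received from any number of times, waiting on `syncLoopDoneCh` does not depend on the loop still being alive to answer.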

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.
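
The eviction idea can be sketched as a stream handler that also selects on the done channel. This is an assumption-laden illustration rather than the actual xDS delta code: `serveStream`, `updates`, `errSyncLoopTerminated`, and `evictOnTermination` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errSyncLoopTerminated is a hypothetical error used to end the stream so
// that the proxy reconnects instead of being orphaned.
var errSyncLoopTerminated = errors.New("sync loop terminated; proxy must reconnect")

// serveStream forwards updates until the update source closes or the
// corresponding sync loop reports that it has terminated.
func serveStream(updates <-chan string, syncLoopDoneCh <-chan struct{}) (int, error) {
	sent := 0
	for {
		select {
		case <-syncLoopDoneCh:
			// The loop feeding this proxy is gone: drop the connection.
			return sent, errSyncLoopTerminated
		case _, ok := <-updates:
			if !ok {
				return sent, nil
			}
			sent++ // a real handler would push the update to the proxy here
		}
	}
}

// evictOnTermination runs a stream whose sync loop has already terminated.
func evictOnTermination() error {
	updates := make(chan string) // no updates ever arrive in this demo
	syncLoopDoneCh := make(chan struct{})
	close(syncLoopDoneCh)
	_, err := serveStream(updates, syncLoopDoneCh)
	return err
}

func main() {
	fmt.Println("stream ended with:", evictOnTermination())
}
```

Returning an error from the handler, rather than continuing to serve, is what turns a silent orphan into a visible disconnect that the proxy can recover from by reconnecting.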

* Add changelog.
2024-03-15 13:57:11 -05:00
..
auth Update ComputedTrafficPermissions ACL hooks (#20622) 2024-02-13 15:16:54 -05:00
catalog mesh: add ComputedImplicitDestinations resource for future use (#20547) 2024-02-09 15:42:10 -06:00
controller resource: reconcile managed types every ~8hrs (#20606) 2024-02-13 10:51:54 -06:00
dnsutil NET-7644/NET-7634 - Implement query lookup for tagged addresses on nodes and services including WAN translation. (#20583) 2024-02-12 14:27:25 -05:00
go-sso [Cloud][CC-6925] Updates to pushing server state (#19682) 2023-12-04 10:25:18 -05:00
hcp Move HCP Manager lifecycle management out of Link controller (#20401) 2024-02-12 10:48:23 -05:00
mesh Fix xDS deadlock due to syncLoop termination. (#20867) 2024-03-15 13:57:11 -05:00
multicluster [CE] Test tenancies for exported-services config manager (#20678) 2024-02-20 10:03:49 -07:00
protohcl security: upgrade google.golang.org/protobuf to 1.33.0 (#20801) 2024-03-06 23:04:42 +00:00
protoutil mesh: compute more of the xRoute features into ComputedRoutes (#18980) 2023-09-22 16:13:24 -05:00
radix [COMPLIANCE] License changes (#18443) 2023-08-11 09:12:13 -04:00
resource v2tenancy: make CE specific version of `resource.Registration` (#20681) 2024-02-20 15:38:06 -08:00
resourcehcl mesh: rename Upstreams and UpstreamsConfiguration to Destinations* (#18995) 2023-09-25 12:03:45 -06:00
storage v2: ensure the controller caches are fully populated before first use (#20421) 2024-02-02 15:11:05 -06:00
tenancy Remove V2 PeerName field from pbresource.Tenancy (#19865) 2024-01-29 15:08:31 -05:00
testing Add `consul snapshot decode` command (#20824) 2024-03-14 12:59:06 -04:00
tools security: upgrade google.golang.org/protobuf to 1.33.0 (#20801) 2024-03-06 23:04:42 +00:00