* Fix xDS deadlock due to syncLoop termination.
This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.
Effectively, the trigger for this deadlock stems from the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202
When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192
Which results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247
Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.
----
The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:
+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon. +
`syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.
Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.
We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.
* Add changelog.
* Adding explicit MPL license for sub-package
This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.
* Adding explicit MPL license for sub-package
This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.
* Updating the license from MPL to Business Source License
Going forward, this project will be licensed under the Business Source License v1.1. Please see our blog post for more details at <Blog URL>, FAQ at www.hashicorp.com/licensing-faq, and details of the license at www.hashicorp.com/bsl.
* add missing license headers
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
* Update copyright file headers to BUSL-1.1
---------
Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
Receiving an "acl not found" error from an RPC in the agent cache and the
streaming/event components will cause any request loops to cease under the
assumption that they will never work again if the token was destroyed. This
prevents log spam (#14144, #9738).
Unfortunately due to things like:
- authz requests going to stale servers that may not have witnessed the token
creation yet
- authz requests in a secondary datacenter happening before the tokens get
replicated to that datacenter
- authz requests from a primary TO a secondary datacenter happening before the
tokens get replicated to that datacenter
The caller will get an "acl not found" *before* the token exists, rather than
just after. The machinery added above in the linked PRs will kick in and
prevent the request loop from looping around again once the tokens actually
exist.
For `consul-dataplane` usages, where xDS is served by the Consul servers
rather than the clients ultimately this is not a problem because in that
scenario the `agent/proxycfg` machinery is on-demand and launched by a new xDS
stream needing data for a specific service in the catalog. If the watching
goroutines are terminated it ripples down and terminates the xDS stream, which
CDP will eventually re-establish and restart everything.
For Consul client usages, the `agent/proxycfg` machinery is ahead-of-time
launched at service registration time (called "local" in some of the proxycfg
machinery) so when the xDS stream comes in the data is already ready to go. If
the watching goroutines terminate it should terminate the xDS stream, but
there's no mechanism to re-spawn the watching goroutines. If the xDS stream
reconnects it will see no `ConfigSnapshot` and will not get one again until
the client agent is restarted, or the service is re-registered with something
changed in it.
This PR fixes a few things in the machinery:
- there was an inadvertent deadlock in fetching snapshot from the proxycfg
machinery by xDS, such that when the watching goroutine terminated the
snapshots would never be fetched. This caused some of the xDS machinery to
get indefinitely paused and not finish the teardown properly.
- Every 30s we now attempt to re-insert all locally registered services into
the proxycfg machinery.
- When services are re-inserted into the proxycfg machinery we special case
"dead" ones such that we unilaterally replace them rather that doing that
conditionally.
Previously, we'd begin a session with the xDS concurrency limiter
regardless of whether the proxy was registered in the catalog or in
the server's local agent state.
This caused problems for users who run `consul connect envoy` directly
against a server rather than a client agent, as the server's locally
registered proxies wouldn't be included in the limiter's capacity.
Now, the `ConfigSource` is responsible for beginning the session and we
only do so for services in the catalog.
Fixes: https://github.com/hashicorp/consul/issues/15753
Previously, the MergeNodeServiceWithCentralConfig method accepted a
ServiceSpecificRequest argument, of which only the Datacenter and
QueryOptions fields were used.
Digging a little deeper, it turns out these fields were only passed
down to the ComputeResolvedServiceConfig method (through the
ServiceConfigRequest struct) which didn't actually use them.
As such, not all call-sites passed a valid ServiceSpecificRequest
so it's safer to remove the argument altogether to prevent future
changes from depending on it.
memdb's `WatchCh` method creates a goroutine that will publish to the
returned channel when the watchset is triggered or the given context
is canceled. Although this is called out in its godoc comment, it's
not obvious that this method creates a goroutine who's lifecycle you
need to manage.
In the xDS capacity controller, we were calling `WatchCh` on each
iteration of the control loop, meaning the number of goroutines would
grow on each autopilot event until there was catalog churn.
In the catalog config source, we were calling `WatchCh` with the
background context, meaning that the goroutine would keep running after
the sync loop had terminated.
Fixes a bug where a service getting deleted from the catalog would cause
the ConfigSource to spin in a hot loop attempting to look up the service.
This is because we were returning a nil WatchSet which would always
unblock the select.
Kudos to @freddygv for discovering this!
At the end of this test we were trying to ensure that updating a service in the local state causes it to re-register the service with the config manager.
The config manager in the same method will also call RegisteredProxies to determine if any need to be removed. This portion of the test is not attempting to verify that behavior.
Because the test is only blocked waiting for the Register event before it can end and assert all the mock expectations were met, we may not see the call to RegisteredProxies. This is especially apparent when tests are run with the race detector.
As we don’t actually care if that method is executed before the end of the test we can simply transition from expecting it to be called exactly once to a 0 or 1 times assertion.
OSS port of enterprise PR 1822
Includes the necessary changes to the `proxycfg` and `xds` packages to enable
Consul servers to configure arbitrary proxies using catalog data.
Broadly, `proxycfg.Manager` now has public methods for registering,
deregistering, and listing registered proxies — the existing local agent
state-sync behavior has been moved into a separate component that makes use of
these methods.
When an xDS session is started for a proxy service in the catalog, a goroutine
will be spawned to watch the service in the server's state store and
re-register it with the `proxycfg.Manager` whenever it is updated (and clean
it up when the client goes away).