Currently, when a client starts a blocking query and an ACL token expires within
that time, Consul will return ACL not found error with a 403 status code. However,
sometimes if an ACL token is invalidated at the same time as the query's deadline is reached,
Consul will instead return an empty response with a 200 status code.
This is because of the events being executed.
1. Client issues a blocking query request with timeout `t`.
2. ACL is deleted.
3. Server detects a change in ACLs and force closes the gRPC stream.
4. Client resubscribes with the same token and resets its state (view).
5. Client sees "ACL not found" error.
If ACL is deleted before step 4, the client is unaware that the stream was closed due to
an ACL error and will return an empty view (from the reset state) with the 200 status code.
To fix this problem, we introduce another state to the subsciption to indicate when a change
to ACLs has occured. If the server sees that there was an error due to ACL change, it will
re-authenticate the request and return an error if the token is no longer valid.
Fixes#20790
* Shuffle the list of servers returned by `pbserverdiscovery.WatchServers`.
This randomizes the list of servers to help reduce the chance of clients
all connecting to the same server simultaneously. Consul-dataplane is one
such client that does not randomize its own list of servers.
* Fix potential goroutine leak in xDS recv loop.
This commit ensures that the goroutine which receives xDS messages from
proxies will not block forever if the stream's context is cancelled but
the `processDelta()` function never consumes the message (due to being
terminated).
* Add changelog.
The new controller caches are initialized before the DependencyMappers or the
Reconciler run, but importantly they are not populated. The expectation is that
when the WatchList call is made to the resource service it will send an initial
snapshot of all resources matching a single type, and then perpetually send
UPSERT/DELETE events afterward. This initial snapshot will cycle through the
caching layer and will catch it up to reflect the stored data.
Critically the dependency mappers and reconcilers will race against the restoration
of the caches on server startup or leader election. During this time it is possible a
mapper or reconciler will use the cache to lookup a specific relationship and
not find it. That very same reconciler may choose to then recompute some
persisted resource and in effect rewind it to a prior computed state.
Change
- Since we are updating the behavior of the WatchList RPC, it was aligned to
match that of pbsubscribe and pbpeerstream using a protobuf oneof instead of the enum+fields option.
- The WatchList rpc now has 3 alternating response events: Upsert, Delete,
EndOfSnapshot. When set the initial batch of "snapshot" Upserts sent on a new
watch, those operations will be followed by an EndOfSnapshot event before beginning
the never-ending sequence of Upsert/Delete events.
- Within the Controller startup code we will launch N+1 goroutines to execute WatchList
queries for the watched types. The UPSERTs will be applied to the nascent cache
only (no mappers will execute).
- Upon witnessing the END operation, those goroutines will terminate.
- When all cache priming routines complete, then the normal set of N+1 long lived
watch routines will launch to officially witness all events in the system using the
primed cached.
* Exported services api implemented
* Tests added, refactored code
* Adding server tests
* changelog added
* Proto gen added
* Adding codegen changes
* changing url, response object
* Fixing lint error by having namespace and partition directly
* Tests changes
* refactoring tests
* Simplified uniqueness logic for exported services, sorted the response in order of service name
* Fix lint errors, refactored code
* Increase timeouts for flakey peering test.
* Various test fixes.
* Fix race condition in reconcilePeering.
This resolves an issue where a peering object in the state store was
incorrectly mutated by a function, resulting in the test being flagged as
failing when the -race flag was used.
* Implement In-Process gRPC for use by controller caching/indexing
This replaces the pipe base listener implementation we were previously using. The new style CAN avoid cloning resources which our controller caching/indexing is taking advantage of to not duplicate resource objects in memory.
To maintain safety for controllers and for them to be able to modify data they get back from the cache and the resource service, the client they are presented in their runtime will be wrapped with an autogenerated client which clones request and response messages as they pass through the client.
Another sizable change in this PR is to consolidate how server specific gRPC services get registered and managed. Before this was in a bunch of different methods and it was difficult to track down how gRPC services were registered. Now its all in one place.
* Fix race in tests
* Ensure the resource service is registered to the multiplexed handler for forwarding from client agents
* Expose peer streaming on the internal handler
* Add a make target to run lint-consul-retry on all the modules
* Cleanup sdk/testutil/retry
* Fix a bunch of retry.Run* usage to not use the outer testing.T
* Fix some more recent retry lint issues and pin to v1.4.0 of lint-consul-retry
* Fix codegen copywrite lint issues
* Don’t perform cleanup after each retry attempt by default.
* Use the common testutil.TestingTB interface in test-integ/tenancy
* Fix retry tests
* Update otel access logging extension test to perform requests within the retry block
* [NET-6438] Add tenancy to xDS Tests
* [NET-6438] Add tenancy to xDS Tests
- Fixing imports
* [NET-6438] Add tenancy to xDS Tests
- Added cleanup post test run
* [NET-6356] Add tenancy to xDS Tests
- Added cleanup post test run
* [NET-6438] Add tenancy to xDS Tests
- using t.Cleanup instead of defer delete
* [NET-6438] Add tenancy to xDS Tests
- rebased
* [NET-6438] Add tenancy to xDS Tests
- rebased
* [NET-6356] Add tenancy to Failover Tests
* [NET-6438] Add tenancy to xDS Tests
- Added cleanup post test run
* [NET-6356] Add tenancy to failover Tests
- using t.Cleanup instead of defer delete
* fix: update watch endpoint to default based on scope
* test: additional test
* refactor: rename list validate function
* refactor: rename validate<Op>Request() -> ensure<Op>RequestValid() for consistency
Prior to the introduction of this configuration, grpc keepalive messages were
sent after 2 hours of inactivity on the stream. This posed issues in various
scenarios where the server-side xds connection balancing was unaware that envoy
instances were uncleanly killed / force-closed, since the connections would
only be cleaned up after ~5 minutes of TCP timeouts occurred. Setting this
config to a 30 second interval with a 20 second timeout ensures that at most,
it should take up to 50 seconds for a dead xds connection to be closed.
This change adds ACL hooks to the remaining catalog and mesh resources, excluding any computed ones. Those will for now continue using the default operator:x permissions.
It refactors a lot of the common testing functions so that they can be re-used between resources.
There are also some types that we don't yet support (e.g. virtual IPs) that this change adds ACL hooks to for future-proofing.
The ACLs.Read hook for a resource only allows for the identity of a
resource to be passed in for use in authz consideration. For some
resources we wish to allow for the current stored value to dictate how
to enforce the ACLs (such as reading a list of applicable services from
the payload and allowing service:read on any of them to control reading the enclosing resource).
This change update the interface to usually accept a *pbresource.ID,
but if the hook decides it needs more data it returns a sentinel error
and the resource service knows to defer the authz check until after
fetching the data from storage.
Adding coauthors who mobbed/paired at various points throughout last week.
Co-authored-by: Dan Stough <dan.stough@hashicorp.com>
Co-authored-by: Iryna Shustava <iryna@hashicorp.com>
Co-authored-by: John Murret <john.murret@hashicorp.com>
Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: Michael Wilkerson <mwilkerson@hashicorp.com>
This PR enables the GetEnvoyBootstrapParams endpoint to construct envoy bootstrap parameters from v2 catalog and mesh resources.
* Make bootstrap request and response parameters less specific to services so that we can re-use them for workloads or service instances.
* Remove ServiceKind from bootstrap params response. This value was unused previously and is not needed for V2.
* Make access logs generation generic so that we can generate them using v1 or v2 resources.