1. Fix a bug where the peering leader routine would not track all active
peerings in the "stored" reconciliation map. This could lead to
tearing down streams where the token was generated, since the
ConnectedStreams() method used for reconciliation returns all streams
and not just the ones initiated by this leader routine.
2. Fix a race where stream contexts were being canceled before
termination messages were being processed by a peer.
Previously the leader routine would tear down streams by canceling
their context right after the termination message was sent. This
context cancelation could be propagated to the server side faster
than the termination message. Now there is a change where the
dialing peer uses CloseSend() to signal when no more messages will
be sent. Eventually the server peer will read an EOF after receiving
and processing the preceding termination message.
Using CloseSend() is actually not enough to address the issue
mentioned, since it doesn't wait for the server peer to finish
processing messages. Because of this now the dialing peer also reads
from the stream until an error signals that there are no more
messages. Receiving an EOF from our peer indicates that they
processed the termination message and have no additional work to do.
Given that the stream is being closed, all the messages received by
Recv are discarded. We only check for errors to avoid importing new
data.
When deleting a peering we do not want to delete the peering and all
imported data in a single operation, since deleting a large amount of
data at once could overload Consul.
Instead we defer deletion of peerings so that:
1. When a peering deletion request is received via gRPC the peering is
marked for deletion by setting the DeletedAt field.
2. A leader routine will monitor for peerings that are marked for
deletion and kick off a throttled deletion of all imported resources
before deleting the peering itself.
This commit mostly addresses point #1 by modifying the peering service
to mark peerings for deletion. Another key change is to add a
PeeringListDeleted state store function which can return all peerings
marked for deletion. This function is what will be watched by the
deferred deletion leader routine.
Previously, imported data would never be deleted. As
nodes/services/checks were registered and deregistered, resources
deleted from the exporting cluster would accumulate in the imported
cluster.
This commit makes updates to replication so that whenever an update is
received for a service name we reconcile what was present in the catalog
against what was received.
This handleUpdateService method can handle both updates and deletions.
When converting from Consul intentions to xds RBAC rules, services imported from other peers must encode additional data like partition (from the remote cluster) and trust domain.
This PR updates the PeeringTrustBundle to hold the sending side's local partition as ExportedPartition. It also updates RBAC code to encode SpiffeIDs of imported services with the ExportedPartition and TrustDomain.
Envoy's SPIFFE certificate validation extension allows for us to
validate against different root certificates depending on the trust
domain of the dialing proxy.
If there are any trust bundles from peers in the config snapshot then we
use the SPIFFE validator as the validation context, rather than the
usual TrustedCA.
The injected validation config includes the local root certificates as
well.
There are a handful of changes in this commit:
* When querying trust bundles for a service we need to be able to
specify the namespace of the service.
* The endpoint needs to track the index because the cache watches use
it.
* Extracted bulk of the endpoint's logic to a state store function
so that index tracking could be tested more easily.
* Removed check for service existence, deferring that sort of work to ACL authz
* Added the cache type
For mTLS to work between two proxies in peered clusters with different root CAs,
proxies need to configure their outbound listener to use different root certificates
for validation.
Up until peering was introduced proxies would only ever use one set of root certificates
to validate all mesh traffic, both inbound and outbound. Now an upstream proxy
may have a leaf certificate signed by a CA that's different from the dialing proxy's.
This PR makes changes to proxycfg and xds so that the upstream TLS validation
uses different root certificates depending on which cluster is being dialed.
I noticed that the JSON api endpoints for peerings json encodes protobufs directly, rather than converting them into their `api` package equivalents before marshal/unmarshaling them.
I updated this and used `mog` to do the annoying part in the middle.
Other changes:
- the status enum was converted into the friendlier string form of the enum for readability with tools like `curl`
- some of the `api` library functions were slightly modified to match other similar endpoints in UX (cc: @ndhanushkodi )
- peeringRead returns `nil` if not found
- partitions are NOT inferred from the agent's partition (matching 1.11-style logic)
The importing peer will need to know what SNI and SPIFFE name
corresponds to each exported service. Additionally it will need to know
at a high level the protocol in use (L4/L7) to generate the appropriate
connection pool and local metrics.
For replicated connect synthetic entities we edit the `Connect{}` part
of a `NodeService` to have a new section:
{
"PeerMeta": {
"SNI": [
"web.default.default.owt.external.183150d5-1033-3672-c426-c29205a576b8.consul"
],
"SpiffeID": [
"spiffe://183150d5-1033-3672-c426-c29205a576b8.consul/ns/default/dc/dc1/svc/web"
],
"Protocol": "tcp"
}
}
This data is then replicated and saved as-is at the importing side. Both
SNI and SpiffeID are slices for now until I can be sure we don't need
them for how mesh gateways will ultimately work.
* Install `buf` instead of `protoc`
* Created `buf.yaml` and `buf.gen.yaml` files in the two proto directories to control how `buf` generates/lints proto code.
* Invoke `buf` instead of `protoc`
* Added a `proto-format` make target.
* Committed the reformatted proto files.
* Added a `proto-lint` make target.
* Integrated proto linting with CI
* Fixed tons of proto linter warnings.
* Got rid of deprecated builtin protoc-gen-go grpc plugin usage. Moved to direct usage of protoc-gen-go-grpc.
* Unified all proto directories / go packages around using pb prefixes but ensuring all proto packages do not have the prefix.
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.
Co-authored-by: kisunji <ckim@hashicorp.com>
* mogify needed pbcommon structs
* mogify needed pbconnect structs
* fix compilation errors and make config_translate_test pass
* add missing file
* remove redundant oss func declaration
* fix EnterpriseMeta to copy the right data for enterprise
* rename pbcommon package to pbcommongogo
* regenerate proto and mog files
* add missing mog files
* add pbcommon package
* pbcommon no mog
* fix enterprise meta code generation
* fix enterprise meta code generation (pbcommongogo)
* fix mog generation for gogo
* use `protoc-go-inject-tag` to inject tags
* rename proto package
* pbcommon no mog
* use `protoc-go-inject-tag` to inject tags
* add non gogo proto to make file
* fix proto get
Introduces the capability to configure TLS differently for Consul's
listeners/ports (i.e. HTTPS, gRPC, and the internal multiplexed RPC
port) which is useful in scenarios where you may want the HTTPS or
gRPC interfaces to present a certificate signed by a well-known/public
CA, rather than the certificate used for internal communication which
must have a SAN in the form `server.<dc>.consul`.