This PR fixes a bug that was introduced in:
https://github.com/hashicorp/consul/pull/16021
A user setting a protocol in proxy-defaults would cause tproxy implicit
upstreams to not honor the upstream service's protocol set in its
`ServiceDefaults.Protocol` field, and would instead always use the
proxy-defaults value.
Due to the fact that upstreams configured with "tcp" can successfully contact
upstream "http" services, this issue was not recognized until recently (a
proxy-defaults with "tcp" and a listening service with "http" would make
successful requests, but not the opposite).
As a temporary work-around, users experiencing this issue can explicitly set
the protocol on the `ServiceDefaults.UpstreamConfig.Overrides`, which should
take precedence.
The fix in this PR removes the proxy-defaults protocol from the wildcard
upstream that tproxy uses to configure implicit upstreams. When the protocol
was included, it would always overwrite the value during discovery chain
compilation, which was not correct. The discovery chain compiler also consumes
proxy defaults to determine the protocol, so simply excluding it from the
wildcard upstream config map resolves the issue.
* # This is a combination of 9 commits.
# This is the 1st commit message:
init without tests
# This is the commit message #2:
change log
# This is the commit message #3:
fix tests
# This is the commit message #4:
fix tests
# This is the commit message #5:
added tests
# This is the commit message #6:
change log breaking change
# This is the commit message #7:
removed breaking change
# This is the commit message #8:
fix test
# This is the commit message #9:
keeping the test behaviour same
* # This is a combination of 12 commits.
# This is the 1st commit message:
init without tests
# This is the commit message #2:
change log
# This is the commit message #3:
fix tests
# This is the commit message #4:
fix tests
# This is the commit message #5:
added tests
# This is the commit message #6:
change log breaking change
# This is the commit message #7:
removed breaking change
# This is the commit message #8:
fix test
# This is the commit message #9:
keeping the test behaviour same
# This is the commit message #10:
made enable debug atomic bool
# This is the commit message #11:
fix lint
# This is the commit message #12:
fix test true enable debug
* parent 10f500e895d92cc3691ade7b74a33db755d22039
author absolutelightning <ashesh.vidyut@hashicorp.com> 1687352587 +0530
committer absolutelightning <ashesh.vidyut@hashicorp.com> 1687352592 +0530
init without tests
change log
fix tests
fix tests
added tests
change log breaking change
removed breaking change
fix test
keeping the test behaviour same
made enable debug atomic bool
fix lint
fix test true enable debug
using enable debug in agent as atomic bool
test fixes
fix tests
fix tests
added update on correct locaiton
fix tests
fix reloadable config enable debug
fix tests
fix init and acl 403
* revert commit
* Ensure RSA keys are at least 2048 bits in length
* Add changelog
* update key length check for FIPS compliance
* Fix no new variables error and failing to return when error exists from
validating
* clean up code for better readability
* actually return value
* Fix a bug that wrongly trims domains when there is an overlap with DC name
Before this change, when DC name and domain/alt-domain overlap, the domain name incorrectly trimmed from the query.
Example:
Given: datacenter = dc-test, alt-domain = test.consul.
Querying for "test-node.node.dc-test.consul" will faile, because the
code was trimming "test.consul" instead of just ".consul"
This change, fixes the issue by adding dot (.) before trimming
* trimDomain: ensure domain trimmed without modyfing original domains
* update changelog
---------
Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
For consistency, resource type names must follow these rules:
- `Group` must be snake case, and in most cases a single word.
- `GroupVersion` must be lowercase, start with a "v" and end with a number.
- `Kind` must be pascal case.
These were chosen because they map to our protobuf type naming
conventions.
Update CA provider docs
Clarify that providers can differ between
primary and secondary datacenters
Provide a comparison chart for consul vs
vault CA providers
Loosen Vault CA provider validation for RootPKIPath
Update Vault CA provider documentation
* Reject inbound Prop Override patch with Services
Services filtering is only supported for outbound TrafficDirection patches.
* Improve Prop Override unexpected type validation
- Guard against additional invalid parent and target types
- Add specific error handling for Any fields (unsupported)
Fix issue with streaming service health watches.
This commit fixes an issue where the health streams were unaware of service
export changes. Whenever an exported-services config entry is modified, it is
effectively an ACL change.
The bug would be triggered by the following situation:
- no services are exported
- an upstream watch to service X is spawned
- the streaming backend filters out data for service X (due to lack of exports)
- service X is finally exported
In the situation above, the streaming backend does not trigger a refresh of its
data. This means that any events that were supposed to have been received prior
to the export are NOT backfilled, and the watches never see service X spawning.
We currently have decided to not trigger a stream refresh in this situation due
to the potential for a thundering herd effect (touching exports would cause a
re-fetch of all watches for that partition, potentially). Therefore, a local
blocking-query approach was added by this commit for agentless.
It's also worth noting that the streaming subscription is currently bypassed
most of the time with agentful, because proxycfg has a `req.Source.Node != ""`
which prevents the `streamingEnabled` check from passing. This means that while
agents should technically have this same issue, they don't experience it with
mesh health watches.
Note that this is a temporary fix that solves the issue for proxycfg, but not
service-discovery use cases.
* agent: remove agent cache dependency from service mesh leaf certificate management
This extracts the leaf cert management from within the agent cache.
This code was produced by the following process:
1. All tests in agent/cache, agent/cache-types, agent/auto-config,
agent/consul/servercert were run at each stage.
- The tests in agent matching .*Leaf were run at each stage.
- The tests in agent/leafcert were run at each stage after they
existed.
2. The former leaf cert Fetch implementation was extracted into a new
package behind a "fake RPC" endpoint to make it look almost like all
other cache type internals.
3. The old cache type was shimmed to use the fake RPC endpoint and
generally cleaned up.
4. I selectively duplicated all of Get/Notify/NotifyCallback/Prepopulate
from the agent/cache.Cache implementation over into the new package.
This was renamed as leafcert.Manager.
- Code that was irrelevant to the leaf cert type was deleted
(inlining blocking=true, refresh=false)
5. Everything that used the leaf cert cache type (including proxycfg
stuff) was shifted to use the leafcert.Manager instead.
6. agent/cache-types tests were moved and gently replumbed to execute
as-is against a leafcert.Manager.
7. Inspired by some of the locking changes from derek's branch I split
the fat lock into N+1 locks.
8. The waiter chan struct{} was eventually replaced with a
singleflight.Group around cache updates, which was likely the biggest
net structural change.
9. The awkward two layers or logic produced as a byproduct of marrying
the agent cache management code with the leaf cert type code was
slowly coalesced and flattened to remove confusion.
10. The .*Leaf tests from the agent package were copied and made to work
directly against a leafcert.Manager to increase direct coverage.
I have done a best effort attempt to port the previous leaf-cert cache
type's tests over in spirit, as well as to take the e2e-ish tests in the
agent package with Leaf in the test name and copy those into the
agent/leafcert package to get more direct coverage, rather than coverage
tangled up in the agent logic.
There is no net-new test coverage, just coverage that was pushed around
from elsewhere.
This includes prioritize by localities on disco chain targets rather than
resolvers, allowing different targets within the same partition to have
different policies.
* Add header filter to api-gateway xDS golden test
* Stop adding all header filters to virtual host when generating xDS for api-gateway
* Regenerate xDS golden file for api-gateway w/ header filter
Ensure that the embedded api struct is properly parsed when
deserializing config containing a set ResourceFilter.Services field.
Also enhance existing integration test to guard against bugs and
exercise this field.
TLDR with many modules the versions included in each diverged quite a bit. Attempting to use Go Workspaces produces a bunch of errors.
This commit:
1. Fixes envoy-library-references.sh to work again
2. Ensures we are pulling in go-control-plane@v0.11.0 everywhere (previously it was at that version in some modules and others were much older)
3. Remove one usage of golang/protobuf that caused us to have a direct dependency on it.
4. Remove deprecated usage of the Endpoint field in the grpc resolver.Target struct. The current version of grpc (v1.55.0) has removed that field and recommended replacement with URL.Opaque and calls to the Endpoint() func when needing to consume the previous field.
4. `go work init <all the paths to go.mod files>` && `go work sync`. This syncrhonized versions of dependencies from the main workspace/root module to all submodules
5. Updated .gitignore to ignore the go.work and go.work.sum files. This seems to be standard practice at the moment.
6. Update doc comments in protoc-gen-consul-rate-limit to be go fmt compatible
7. Upgraded makefile infra to perform linting, testing and go mod tidy on all modules in a flexible manner.
8. Updated linter rules to prevent usage of golang/protobuf
9. Updated a leader peering test to account for an extra colon in a grpc error message.
When UpstreamEnvoyExtender was introduced, some code was left duplicated
between it and BasicEnvoyExtender. One path in that code panics when a
TProxy listener patch is attempted due to no upstream data in
RuntimeConfig matching the local service (which would only happen in
rare cases).
Instead, we can remove the special handling of upstream VIPs from
BasicEnvoyExtender entirely, greatly simplifying the listener filter
patch code and avoiding the panic. UpstreamEnvoyExtender, which needs
this code to function, is modified to ensure a panic does not occur.
This also fixes a second regression in which the Lua extension was not
applied to TProxy outbound listeners.
Sameness groups with default-for-failover enabled did not function properly with
tproxy whenever all instances of the service disappeared from the local cluster.
This occured, because there were no corresponding resolvers (due to the implicit
failover policy) which caused VIPs to be deallocated.
This ticket expands upon the VIP allocations so that both service-defaults and
service-intentions (without destination wildcards) will ensure that the virtual
IP exists.
This commit only contains the OSS PR (datacenter query param support).
A separate enterprise PR adds support for ap and namespace query params.
Resources in Consul can exists within scopes such as datacenters, cluster
peers, admin partitions, and namespaces. You can refer to those resources from
interfaces such as the CLI, HTTP API, DNS, and configuration files.
Some scope levels have consistent naming: cluster peers are always referred to
as "peer".
Other scope levels use a short-hand in DNS lookups...
- "ns" for namespace
- "ap" for admin partition
- "dc" for datacenter
...But use long-hand in CLI commands:
- "namespace" for namespace
- "partition" for admin partition
- and "datacenter"
However, HTTP API query parameters do not follow a consistent pattern,
supporting short-hand for some scopes but long-hand for others:
- "ns" for namespace
- "partition" for admin partition
- and "dc" for datacenter.
This inconsistency is confusing, especially for users who have been exposed to
providing scope names through another interface such as CLI or DNS queries.
This commit improves UX by consistently supporting both short-hand and
long-hand forms of the namespace, partition, and datacenter scopes in HTTP API
query parameters.
* add upstream service targeting to property override extension
* Also add baseline goldens for service specific property override extension.
* Refactor the extension framework to put more logic into the templates.
* fix up the golden tests
* Move hcp client to subpackage hcpclient (#16800)
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* [HCP Observability] OTELExporter (#17128)
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* [HCP Observability] OTELSink (#17159)
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Initialize OTELSink with sync.Map for all the instrument stores.
* Moved PeriodicReader init to NewOtelReader function. This allows us to use a ManualReader for tests.
* Switch to mutex instead of sync.Map to avoid type assertion
* Add gauge store
* Clarify comments
* return concrete sink type
* Fix lint errors
* Move gauge store to be within sink
* Use context.TODO,rebase and clenaup opts handling
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Fix imports
* Update to latest stable version by rebasing on cc-4933, fix import, remove mutex init, fix opts error messages and use logger from ctx
* Add lots of documentation to the OTELSink
* Fix gauge store comment and check ok
* Add select and ctx.Done() check to gauge callback
* use require.Equal for attributes
* Fixed import naming
* Remove float64 calls and add a NewGaugeStore method
* Change name Store to Set in gaugeStore, add concurrency tests in both OTELSink and gauge store
* Generate 100 gauge operations
* Seperate the labels into goroutines in sink test
* Generate kv store for the test case keys to avoid using uuid
* Added a race test with 300 samples for OTELSink
* Do not pass in waitgroup and use error channel instead.
* Using SHA 7dea2225a218872e86d2f580e82c089b321617b0 to avoid build failures in otel
* Fix nits
* [HCP Observability] Init OTELSink in Telemetry (#17162)
* Move hcp client to subpackage hcpclient (#16800)
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Initialize OTELSink with sync.Map for all the instrument stores.
* Moved PeriodicReader init to NewOtelReader function. This allows us to use a ManualReader for tests.
* Switch to mutex instead of sync.Map to avoid type assertion
* Add gauge store
* Clarify comments
* return concrete sink type
* Fix lint errors
* Move gauge store to be within sink
* Use context.TODO,rebase and clenaup opts handling
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Fix imports
* Update to latest stable version by rebasing on cc-4933, fix import, remove mutex init, fix opts error messages and use logger from ctx
* Add lots of documentation to the OTELSink
* Fix gauge store comment and check ok
* Add select and ctx.Done() check to gauge callback
* use require.Equal for attributes
* Fixed import naming
* Remove float64 calls and add a NewGaugeStore method
* Change name Store to Set in gaugeStore, add concurrency tests in both OTELSink and gauge store
* Generate 100 gauge operations
* Seperate the labels into goroutines in sink test
* Generate kv store for the test case keys to avoid using uuid
* Added a race test with 300 samples for OTELSink
* [HCP Observability] OTELExporter (#17128)
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Do not pass in waitgroup and use error channel instead.
* Using SHA 7dea2225a218872e86d2f580e82c089b321617b0 to avoid build failures in otel
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Initialize OTELSink with sync.Map for all the instrument stores.
* Added telemetry agent to client and init sink in deps
* Fixed client
* Initalize sink in deps
* init sink in telemetry library
* Init deps before telemetry
* Use concrete telemetry.OtelSink type
* add /v1/metrics
* Avoid returning err for telemetry init
* move sink init within the IsCloudEnabled()
* Use HCPSinkOpts in deps instead
* update golden test for configuration file
* Switch to using extra sinks in the telemetry library
* keep name MetricsConfig
* fix log in verifyCCMRegistration
* Set logger in context
* pass around MetricSink in deps
* Fix imports
* Rebased onto otel sink pr
* Fix URL in test
* [HCP Observability] OTELSink (#17159)
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Initialize OTELSink with sync.Map for all the instrument stores.
* Moved PeriodicReader init to NewOtelReader function. This allows us to use a ManualReader for tests.
* Switch to mutex instead of sync.Map to avoid type assertion
* Add gauge store
* Clarify comments
* return concrete sink type
* Fix lint errors
* Move gauge store to be within sink
* Use context.TODO,rebase and clenaup opts handling
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Fix imports
* Update to latest stable version by rebasing on cc-4933, fix import, remove mutex init, fix opts error messages and use logger from ctx
* Add lots of documentation to the OTELSink
* Fix gauge store comment and check ok
* Add select and ctx.Done() check to gauge callback
* use require.Equal for attributes
* Fixed import naming
* Remove float64 calls and add a NewGaugeStore method
* Change name Store to Set in gaugeStore, add concurrency tests in both OTELSink and gauge store
* Generate 100 gauge operations
* Seperate the labels into goroutines in sink test
* Generate kv store for the test case keys to avoid using uuid
* Added a race test with 300 samples for OTELSink
* Do not pass in waitgroup and use error channel instead.
* Using SHA 7dea2225a218872e86d2f580e82c089b321617b0 to avoid build failures in otel
* Fix nits
* pass extraSinks as function param instead
* Add default interval as package export
* remove verifyCCM func
* Add clusterID
* Fix import and add t.Parallel() for missing tests
* Kick Vercel CI
* Remove scheme from endpoint path, and fix error logging
* return metrics.MetricSink for sink method
* Update SDK
* [HCP Observability] Metrics filtering and Labels in Go Metrics sink (#17184)
* Move hcp client to subpackage hcpclient (#16800)
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* [HCP Observability] New MetricsClient (#17100)
* Client configured with TLS using HCP config and retry/throttle
* Add tests and godoc for metrics client
* close body after request
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* remove clone
* Extract CloudConfig and mock for future PR
* Switch to hclog.FromContext
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Initialize OTELSink with sync.Map for all the instrument stores.
* Moved PeriodicReader init to NewOtelReader function. This allows us to use a ManualReader for tests.
* Switch to mutex instead of sync.Map to avoid type assertion
* Add gauge store
* Clarify comments
* return concrete sink type
* Fix lint errors
* Move gauge store to be within sink
* Use context.TODO,rebase and clenaup opts handling
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Fix imports
* Update to latest stable version by rebasing on cc-4933, fix import, remove mutex init, fix opts error messages and use logger from ctx
* Add lots of documentation to the OTELSink
* Fix gauge store comment and check ok
* Add select and ctx.Done() check to gauge callback
* use require.Equal for attributes
* Fixed import naming
* Remove float64 calls and add a NewGaugeStore method
* Change name Store to Set in gaugeStore, add concurrency tests in both OTELSink and gauge store
* Generate 100 gauge operations
* Seperate the labels into goroutines in sink test
* Generate kv store for the test case keys to avoid using uuid
* Added a race test with 300 samples for OTELSink
* [HCP Observability] OTELExporter (#17128)
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Do not pass in waitgroup and use error channel instead.
* Using SHA 7dea2225a218872e86d2f580e82c089b321617b0 to avoid build failures in otel
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Initialize OTELSink with sync.Map for all the instrument stores.
* Added telemetry agent to client and init sink in deps
* Fixed client
* Initalize sink in deps
* init sink in telemetry library
* Init deps before telemetry
* Use concrete telemetry.OtelSink type
* add /v1/metrics
* Avoid returning err for telemetry init
* move sink init within the IsCloudEnabled()
* Use HCPSinkOpts in deps instead
* update golden test for configuration file
* Switch to using extra sinks in the telemetry library
* keep name MetricsConfig
* fix log in verifyCCMRegistration
* Set logger in context
* pass around MetricSink in deps
* Fix imports
* Rebased onto otel sink pr
* Fix URL in test
* [HCP Observability] OTELSink (#17159)
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Create new OTELExporter which uses the MetricsClient
Add transform because the conversion is in an /internal package
* Fix lint error
* early return when there are no metrics
* Add NewOTELExporter() function
* Downgrade to metrics SDK version: v1.15.0-rc.1
* Fix imports
* fix small nits with comments and url.URL
* Fix tests by asserting actual error for context cancellation, fix parallel, and make mock more versatile
* Cleanup error handling and clarify empty metrics case
* Fix input/expected naming in otel_transform_test.go
* add comment for metric tracking
* Add a general isEmpty method
* Add clear error types
* update to latest version 1.15.0 of OTEL
* Client configured with TLS using HCP config and retry/throttle
* run go mod tidy
* Remove one abstraction to use the config from deps
* Address PR feedback
* Initialize OTELSink with sync.Map for all the instrument stores.
* Moved PeriodicReader init to NewOtelReader function. This allows us to use a ManualReader for tests.
* Switch to mutex instead of sync.Map to avoid type assertion
* Add gauge store
* Clarify comments
* return concrete sink type
* Fix lint errors
* Move gauge store to be within sink
* Use context.TODO,rebase and clenaup opts handling
* Rebase onto otl exporter to downgrade metrics API to v1.15.0-rc.1
* Fix imports
* Update to latest stable version by rebasing on cc-4933, fix import, remove mutex init, fix opts error messages and use logger from ctx
* Add lots of documentation to the OTELSink
* Fix gauge store comment and check ok
* Add select and ctx.Done() check to gauge callback
* use require.Equal for attributes
* Fixed import naming
* Remove float64 calls and add a NewGaugeStore method
* Change name Store to Set in gaugeStore, add concurrency tests in both OTELSink and gauge store
* Generate 100 gauge operations
* Seperate the labels into goroutines in sink test
* Generate kv store for the test case keys to avoid using uuid
* Added a race test with 300 samples for OTELSink
* Do not pass in waitgroup and use error channel instead.
* Using SHA 7dea2225a218872e86d2f580e82c089b321617b0 to avoid build failures in otel
* Fix nits
* pass extraSinks as function param instead
* Add default interval as package export
* remove verifyCCM func
* Add clusterID
* Fix import and add t.Parallel() for missing tests
* Kick Vercel CI
* Remove scheme from endpoint path, and fix error logging
* return metrics.MetricSink for sink method
* Update SDK
* Added telemetry agent to client and init sink in deps
* Add node_id and __replica__ default labels
* add function for default labels and set x-hcp-resource-id
* Fix labels tests
* Commit suggestion for getDefaultLabels
Co-authored-by: Joshua Timmons <joshua.timmons1@gmail.com>
* Fixed server.id, and t.Parallel()
* Make defaultLabels a method on the TelemetryConfig object
* Rename FilterList to lowercase filterList
* Cleanup filter implemetation by combining regex into a single one, and making the type lowercase
* Fix append
* use regex directly for filters
* Fix x-resource-id test to use mocked value
* Fix log.Error formats
* Forgot the len(opts.Label) optimization)
* Use cfg.NodeID instead
---------
Co-authored-by: Joshua Timmons <joshua.timmons1@gmail.com>
* remove replic tag (#17484)
* [HCP Observability] Add custom metrics for OTEL sink, improve logging, upgrade modules and cleanup metrics client (#17455)
* Add custom metrics for Exporter and transform operations
* Improve deps logging
Run go mod tidy
* Upgrade SDK and OTEL
* Remove the partial success implemetation and check for HTTP status code in metrics client
* Add x-channel
* cleanup logs in deps.go based on PR feedback
* Change to debug log and lowercase
* address test operation feedback
* use GetHumanVersion on version
* Fix error wrapping
* Fix metric names
* [HCP Observability] Turn off retries for now until dynamically configurable (#17496)
* Remove retries for now until dynamic configuration is possible
* Clarify comment
* Update changelog
* improve changelog
---------
Co-authored-by: Joshua Timmons <joshua.timmons1@gmail.com>
* Support Listener in Property Override
Add support for patching `Listener` resources via the builtin
`property-override` extension.
Refactor existing listener patch code in `BasicEnvoyExtender` to
simplify addition of resource support.
* Support ClusterLoadAssignment in Property Override
Add support for patching `ClusterLoadAssignment` resources via the
builtin `property-override` extension.
`property-override` is an extension that allows for arbitrarily
patching Envoy resources based on resource matching filters. Patch
operations resemble a subset of the JSON Patch spec with minor
differences to facilitate patching pre-defined (protobuf) schemas.
See Envoy Extension product documentation for more details.
Co-authored-by: Eric Haberkorn <eric.haberkorn@hashicorp.com>
Co-authored-by: Kyle Havlovitz <kyle@hashicorp.com>
* perf: Remove expensive reflection from raft/mesh hot path
Replaces a reflection-based copy of a struct in the mesh topology with a
deep-copy generated implementation.
This is in the hot-path of raft FSM updates, and the reflection overhead was a
substantial part of mesh registration times (~90%). This could manifest as raft
thread saturation, and resulting instability.
Co-authored-by: Joel Brandhorst <joel.brandhorst@gmail.com>
* add changelog
---------
Co-authored-by: Joel Brandhorst <joel.brandhorst@gmail.com>
Co-authored-by: John Murret <john.murret@hashicorp.com>
This will likely happen frequently with sameness groups. Relaxing this
constraint is harmless for failover because xds/endpoints exludes cross
partition and peer endpoints.
* xds generation for routes api gateway
* Update gateway.go
* move buildHttpRoute into xds package
* Update agent/consul/discoverychain/gateway.go
* remove unneeded function
* convert http route code to only run for http protocol to future proof code path
* Update agent/consul/discoverychain/gateway.go
Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
* fix tests, clean up http check logic
* clean up todo
* Fix casing in docstring
* Fix import block, adjust docstrings
* Rename func
* Consolidate docstring onto single line
* Remove ToIngress() conversion for APIGW, which generates its own xDS now
* update name and comment
* use constant value
* use constant
* rename readyUpstreams to readyListeners to better communicate what that function is doing
---------
Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* xds generation for routes api gateway
* Update gateway.go
* move buildHttpRoute into xds package
* Update agent/consul/discoverychain/gateway.go
* remove unneeded function
* convert http route code to only run for http protocol to future proof code path
* Update agent/consul/discoverychain/gateway.go
Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
* fix tests, clean up http check logic
* clean up todo
* Fix casing in docstring
* Fix import block, adjust docstrings
* update name and comment
* use constant value
* use constant
---------
Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
Fix ACL check on health endpoint
Prior to this change, the service health API would not explicitly return an
error whenever a token with invalid permissions was given, and it would instead
return empty results. With this change, a "Permission denied" error is returned
whenever data is queried. This is done to better support the agent cache, which
performs a fetch backoff sleep whenever ACL errors are encountered. Affected
endpoints are: `/v1/health/connect/` and `/v1/health/ingress/`.
* Fix namespaced peer service updates / deletes.
This change fixes a function so that namespaced services are
correctly queried when handling updates / deletes. Prior to this
change, some peered services would not correctly be un-exported.
* Add changelog.
To avoid unintended tampering with remote downstreams via service
config, refactor BasicEnvoyExtender and RuntimeConfig to disallow
typical Envoy extensions from being applied to non-local proxies.
Continue to allow this behavior for AWS Lambda and the read-only
Validate builtin extensions.
Addresses CVE-2023-2816.
* API Gateway XDS Primitives, endpoints and clusters (#17002)
* XDS primitive generation for endpoints and clusters
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* server_test
* deleted extra file
* add missing parents to test
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Routes for API Gateway (#17158)
* XDS primitive generation for endpoints and clusters
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* server_test
* deleted extra file
* add missing parents to test
* checkpoint
* delete extra file
* httproute flattening code
* linting issue
* so close on this, calling for tonight
* unit test passing
* add in header manip to virtual host
* upstream rebuild commented out
* Use consistent upstream name whether or not we're rebuilding
* Start working through route naming logic
* Fix typos in test descriptions
* Simplify route naming logic
* Simplify RebuildHTTPRouteUpstream
* Merge additional compiled discovery chains instead of overwriting
* Use correct chain for flattened route, clean up + add TODOs
* Remove empty conditional branch
* Restore previous variable declaration
Limit the scope of this PR
* Clean up, improve TODO
* add logging, clean up todos
* clean up function
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* checkpoint, skeleton, tests not passing
* checkpoint
* endpoints xds cluster configuration
* resources test fix
* fix reversion in resources_test
* checkpoint
* Update agent/proxycfg/api_gateway.go
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
* unit tests passing
* gofmt
* add deterministic sorting to appease the unit test gods
* remove panic
* Find ready upstream matching listener instead of first in list
* Clean up, improve TODO
* Modify getReadyUpstreams to filter upstreams by listener (#17410)
Each listener would previously have all upstreams from any route that bound to the listener. This is problematic when a route bound to one listener also binds to other listeners and so includes upstreams for multiple listeners. The list for a given listener would then wind up including upstreams for other listeners.
* clean up todos, references to api gateway in listeners_ingress
* merge in Nathan's fix
* Update agent/consul/discoverychain/gateway.go
* cleanup current todos, remove snapshot manipulation from generation code
* Update agent/structs/config_entry_gateways.go
Co-authored-by: Thomas Eckert <teckert@hashicorp.com>
* Update agent/consul/discoverychain/gateway.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Update agent/consul/discoverychain/gateway.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Update agent/proxycfg/snapshot.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* clarified header comment for FlattenHTTPRoute, changed RebuildHTTPRouteUpstream to BuildHTTPRouteUpstream
* simplify cert logic
* Delete scratch
* revert route related changes in listener PR
* Update agent/consul/discoverychain/gateway.go
* Update agent/proxycfg/snapshot.go
* clean up uneeded extra lines in endpoints
---------
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
Co-authored-by: Thomas Eckert <teckert@hashicorp.com>
* endpoints xds cluster configuration
* clusters xds native generation
* resources test fix
* fix reversion in resources_test
* Update agent/proxycfg/api_gateway.go
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
* gofmt
* Modify getReadyUpstreams to filter upstreams by listener (#17410)
Each listener would previously have all upstreams from any route that bound to the listener. This is problematic when a route bound to one listener also binds to other listeners and so includes upstreams for multiple listeners. The list for a given listener would then wind up including upstreams for other listeners.
* Update agent/proxycfg/api_gateway.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Restore import blocking
* Undo removal of unrelated code
---------
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
This change enables workflows where you are reapplying a resource that should have an owner ref to publish modifications to the resources data without performing a read to figure out the current owner resource incarnations UID.
Basically we want workflows similar to `kubectl apply` or `consul config write` to be able to work seamlessly even for owned resources.
In these cases the users intention is to have the resource owned by the “current” incarnation of the owner resource.
* endpoints xds cluster configuration
* resources test fix
* fix reversion in resources_test
* Update agent/proxycfg/api_gateway.go
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
* gofmt
* Modify getReadyUpstreams to filter upstreams by listener (#17410)
Each listener would previously have all upstreams from any route that bound to the listener. This is problematic when a route bound to one listener also binds to other listeners and so includes upstreams for multiple listeners. The list for a given listener would then wind up including upstreams for other listeners.
* Update agent/proxycfg/api_gateway.go
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* Restore import blocking
* Skip to next route if route has no upstreams
* cleanup
* change set from bool to empty struct
---------
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
* agent: configure server lastseen timestamp
Signed-off-by: Dan Bond <danbond@protonmail.com>
* use correct config
Signed-off-by: Dan Bond <danbond@protonmail.com>
* add comments
Signed-off-by: Dan Bond <danbond@protonmail.com>
* use default age in test golden data
Signed-off-by: Dan Bond <danbond@protonmail.com>
* add changelog
Signed-off-by: Dan Bond <danbond@protonmail.com>
* fix runtime test
Signed-off-by: Dan Bond <danbond@protonmail.com>
* agent: add server_metadata
Signed-off-by: Dan Bond <danbond@protonmail.com>
* update comments
Signed-off-by: Dan Bond <danbond@protonmail.com>
* correctly check if metadata file does not exist
Signed-off-by: Dan Bond <danbond@protonmail.com>
* follow instructions for adding new config
Signed-off-by: Dan Bond <danbond@protonmail.com>
* add comments
Signed-off-by: Dan Bond <danbond@protonmail.com>
* update comments
Signed-off-by: Dan Bond <danbond@protonmail.com>
* Update agent/agent.go
Co-authored-by: Dan Upton <daniel@floppy.co>
* agent/config: add validation for duration with min
Signed-off-by: Dan Bond <danbond@protonmail.com>
* docs: add new server_rejoin_age_max config definition
Signed-off-by: Dan Bond <danbond@protonmail.com>
* agent: add unit test for checking server last seen
Signed-off-by: Dan Bond <danbond@protonmail.com>
* agent: log continually for 60s before erroring
Signed-off-by: Dan Bond <danbond@protonmail.com>
* pr comments
Signed-off-by: Dan Bond <danbond@protonmail.com>
* remove unneeded todo
* agent: fix error message
Signed-off-by: Dan Bond <danbond@protonmail.com>
---------
Signed-off-by: Dan Bond <danbond@protonmail.com>
Co-authored-by: Dan Upton <daniel@floppy.co>
* Add v1/internal/service-virtual-ip for manually setting service VIPs
* Attach service virtual IP info to compiled discovery chain
* Separate auto-assigned and manual VIPs in response
The grpc resolver implementation is fed from changes to the
router.Router. Within the router there is a map of various areas storing
the addressing information for servers in those areas. All map entries
are of the WAN variety except a single special entry for the LAN.
Addressing information in the LAN "area" are local addresses intended
for use when making a client-to-server or server-to-server request.
The client agent correctly updates this LAN area when receiving lan serf
events, so by extension the grpc resolver works fine in that scenario.
The server agent only initially populates a single entry in the LAN area
(for itself) on startup, and then never mutates that area map again.
For normal RPCs a different structure is used for LAN routing.
Additionally when selecting a server to contact in the local datacenter
it will randomly select addresses from either the LAN or WAN addressed
entries in the map.
Unfortunately this means that the grpc resolver stack as it exists on
server agents is either broken or only accidentally functions by having
servers dial each other over the WAN-accessible address. If the operator
disables the serf wan port completely likely this incidental functioning
would break.
This PR enforces that local requests for servers (both for stale reads
or leader forwarded requests) exclusively use the LAN "area" information
and also fixes it so that servers keep that area up to date in the
router.
A test for the grpc resolver logic was added, as well as a higher level
full-stack test to ensure the externally perceived bug does not return.
* snapshot: some improvments to the snapshot process
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
UNIX domain socket paths are limited to 104-108 characters, depending on
the OS. This limit was quite easy to exceed when testing the feature on
Kubernetes, due to how proxy IDs encode the Pod ID eg:
metrics-collector-59467bcb9b-fkkzl-hcp-metrics-collector-sidecar-proxy
To ensure we stay under that character limit this commit makes a
couple changes:
- Use a b64 encoded SHA1 hash of the namespace + proxy ID to create a
short and deterministic socket file name.
- Add validation to proxy registrations and proxy-defaults to enforce a
limit on the socket directory length.
Fix multiple issues related to proxycfg health queries.
1. The datacenter was not being provided to a proxycfg query, which resulted in
bypassing agentless query optimizations and using the normal API instead.
2. The health rpc endpoint would return a zero index when insufficient ACLs were
detected. This would result in the agent cache performing an infinite loop of
queries in rapid succession without backoff.
Fix issue with peer stream node cleanup.
This commit encompasses a few problems that are closely related due to their
proximity in the code.
1. The peerstream utilizes node IDs in several locations to determine which
nodes / services / checks should be cleaned up or created. While VM deployments
with agents will likely always have a node ID, agentless uses synthetic nodes
and does not populate the field. This means that for consul-k8s deployments, all
services were likely bundled together into the same synthetic node in some code
paths (but not all), resulting in strange behavior. The Node.Node field should
be used instead as a unique identifier, as it should always be populated.
2. The peerstream cleanup process for unused nodes uses an incorrect query for
node deregistration. This query is NOT namespace aware and results in the node
(and corresponding services) being deregistered prematurely whenever it has zero
default-namespace services and 1+ non-default-namespace services registered on
it. This issue is tricky to find due to the incorrect logic mentioned in #1,
combined with the fact that the affected services must be co-located on the same
node as the currently deregistering service for this to be encountered.
3. The stream tracker did not understand differences between services in
different namespaces and could therefore report incorrect numbers. It was
updated to utilize the full service name to avoid conflicts and return proper
results.
When using vault as a CA and generating the local signing cert, try to
enable the PKI endpoint's auto-tidy feature with it set to tidy expired
issuers.
This adds filtering for service-defaults: consul config list -filter 'MutualTLSMode == "permissive"'.
It adds CLI warnings when the CLI writes a config entry and sees that either service-defaults or proxy-defaults contains MutualTLSMode=permissive, or sees that the mesh config entry contains AllowEnablingPermissiveMutualTLSMode=true.
Partitioned downstreams with peered upstreams could not properly merge central config info (i.e. proxy-defaults and service-defaults things like mesh gateway modes) if the upstream had an empty DestinationPartition field in Enterprise.
Due to data flow, if this setup is done using Consul client agents the field is never empty and thus does not experience the bug.
When a service is registered directly to the catalog as is the case for consul-dataplane use this field may be empty and and the internal machinery of the merging function doesn't handle this well.
This PR ensures the internal machinery of that function is referentially self-consistent.
* Persist HCP management token from server config
We want to move away from injecting an initial management token into
Consul clusters linked to HCP. The reasoning is that by using a separate
class of token we can have more flexibility in terms of allowing HCP's
token to co-exist with the user's management token.
Down the line we can also more easily adjust the permissions attached to
HCP's token to limit it's scope.
With these changes, the cloud management token is like the initial
management token in that iit has the same global management policy and
if it is created it effectively bootstraps the ACL system.
* Update SDK and mock HCP server
The HCP management token will now be sent in a special field rather than
as Consul's "initial management" token configuration.
This commit also updates the mock HCP server to more accurately reflect
the behavior of the CCM backend.
* Refactor HCP bootstrapping logic and add tests
We want to allow users to link Consul clusters that already exist to
HCP. Existing clusters need care when bootstrapped by HCP, since we do
not want to do things like change ACL/TLS settings for a running
cluster.
Additional changes:
* Deconstruct MaybeBootstrap so that it can be tested. The HCP Go SDK
requires HTTPS to fetch a token from the Auth URL, even if the backend
server is mocked. By pulling the hcp.Client creation out we can modify
its TLS configuration in tests while keeping the secure behavior in
production code.
* Add light validation for data received/loaded.
* Sanitize initial_management token from received config, since HCP will
only ever use the CloudConfig.MangementToken.
* Add changelog entry
* Move status condition for invalid certifcate to reference the listener
that is using the certificate
* Fix where we set the condition status for listeners and certificate
refs, added tests
* Add changelog
* Add MaxEjectionPercent to config entry
* Add BaseEjectionTime to config entry
* Add MaxEjectionPercent and BaseEjectionTime to protobufs
* Add MaxEjectionPercent and BaseEjectionTime to api
* Fix integration test breakage
* Verify MaxEjectionPercent and BaseEjectionTime in integration test upstream confings
* Website docs for MaxEjectionPercent and BaseEjection time
* Add `make docs` to browse docs at http://localhost:3000
* Changelog entry
* so that is the difference between consul-docker and dev-docker
* blah
* update proto funcs
* update proto
---------
Co-authored-by: Maliz <maliheh.monshizadeh@hashicorp.com>
* Fix straggler from renaming Register->RegisterTypes
* somehow a lint failure got through previously
* Fix lint-consul-retry errors
* adding in fix for success jobs getting skipped. (#17132)
* Temporarily disable inmem backend conformance test to get green pipeline
* Another test needs disabling
---------
Co-authored-by: John Murret <john.murret@hashicorp.com>
* normalize status conditions for gateways and routes
* Added tests for checking condition status and panic conditions for
validating combinations, added dummy code for fsm store
* get rid of unneeded gateway condition generator struct
* Remove unused file
* run go mod tidy
* Update tests, add conflicted gateway status
* put back removed status for test
* Fix linting violation, remove custom conflicted status
* Update fsm commands oss
* Fix incorrect combination of type/condition/status
* cleaning up from PR review
* Change "invalidCertificate" to be of accepted status
* Move status condition enums into api package
* Update gateways controller and generated code
* Update conditions in fsm oss tests
* run go mod tidy on consul-container module to fix linting
* Fix type for gateway endpoint test
* go mod tidy from changes to api
* go mod tidy on troubleshoot
* Fix route conflicted reason
* fix route conflict reason rename
* Fix text for gateway conflicted status
* Add valid certificate ref condition setting
* Revert change to resolved refs to be handled in future PR
* added method for converting SamenessGroupConfigEntry
- added new method `ToQueryFailoverTargets` for converting a SamenessGroupConfigEntry's members to a list of QueryFailoverTargets
- renamed `ToFailoverTargets` ToServiceResolverFailoverTargets to distinguish it from `ToQueryFailoverTargets`
* Added SamenessGroup to PreparedQuery
- exposed Service.Partition to API when defining a prepared query
- added a method for determining if a QueryFailoverOptions is empty
- This will be useful for validation
- added unit tests
* added method for retrieving a SamenessGroup to state store
* added logic for using PQ with SamenessGroup
- added branching path for SamenessGroup handling in execute. It will be handled separate from the normal PQ case
- added a new interface so that the `GetSamenessGroupFailoverTargets` can be properly tested
- separated the execute logic into a `targetSelector` function so that it can be used for both failover and sameness group PQs
- split OSS only methods into new PQ OSS files
- added validation that `samenessGroup` is an enterprise only feature
* added documentation for PQ SamenessGroup
Before this change, we were not fetching service resolvers (and therefore
service defaults) configuration entries for services on members of sameness
groups.
This implements permissive mTLS , which allows toggling services into "permissive" mTLS mode.
Permissive mTLS mode allows incoming "non Consul-mTLS" traffic to be forward unmodified to the application.
* Update service-defaults and proxy-defaults config entries with a MutualTLSMode field
* Update the mesh config entry with an AllowEnablingPermissiveMutualTLS field and implement the necessary validation. AllowEnablingPermissiveMutualTLS must be true to allow changing to MutualTLSMode=permissive, but this does not require that all proxy-defaults and service-defaults are currently in strict mode.
* Update xDS listener config to add a "permissive filter chain" when MutualTLSMode=permissive for a particular service. The permissive filter chain matches incoming traffic by the destination port. If the destination port matches the service port from the catalog, then no mTLS is required and the traffic sent is forwarded unmodified to the application.
This commit adds the PrioritizeByLocality field to both proxy-config
and service-resolver config entries for locality-aware routing. The
field is currently intended for enterprise only, and will be used to
enable prioritization of service-mesh connections to services based
on geographical region / zone.
- added Sameness Group to config entries
- added Sameness Group to subscriptions
* generated proto files
* added Sameness Group events to the state store
- added test cases
* Refactored health RPC Client
- moved code that is common to rpcclient under rpcclient common.go. This will help set us up to support future RPC clients
* Refactored proxycfg glue views
- Moved views to rpcclient config entry. This will allow us to reuse this code for a config entry client
* added config entry RPC Client
- Copied most of the testing code from rpcclient/health
* hooked up new rpcclient in agent
* fixed documentation and comments for clarity
* Add a test to reproduce the race condition
* Fix race condition by publishing the event after the commit and adding a lock to prevent out of order events.
* split publish to generate the list of events before committing the transaction.
* add changelog
* remove extra func
* Apply suggestions from code review
Co-authored-by: Dan Upton <daniel@floppy.co>
* add comment to explain test
---------
Co-authored-by: Dan Upton <daniel@floppy.co>
Prior to this change, peer services would be targeted by service-default
overrides as long as the new `peer` field was not found in the config entry.
This commit removes that deprecated backwards-compatibility behavior. Now
it is necessary to specify the `peer` field in order for upstream overrides
to apply to a peer upstream.
The old setting of 24 hours was not enough time to deal with an expiring certificates. This change ups it to 28 days OR 40% of the full cert duration, whichever is shorter. It also adds details to the log message to indicate which certificate it is logging about and a suggested action.
Currently, if an acceptor peer deletes a peering the dialer's peering
will eventually get to a "terminated" state. If the two clusters need to
be re-peered the acceptor will re-generate the token but the dialer will
encounter this error on the call to establish:
"failed to get addresses to dial peer: failed to refresh peer server
addresses, will continue to use initial addresses: there is no active
peering for "<<<ID>>>""
This is because in `exchangeSecret().GetDialAddresses()` we will get an
error if fetching addresses for an inactive peering. The peering shows
up as inactive at this point because of the existing terminated state.
Rather than checking whether a peering is active we can instead check
whether it was deleted. This way users do not need to delete terminated
peerings in the dialing cluster before re-establishing them.
* Rename Intermediate cert references to LeafSigningCert
Within the Consul CA subsystem, the term "Intermediate"
is confusing because the meaning changes depending on
provider and datacenter (primary vs secondary). For
example, when using the Consul CA the "ActiveIntermediate"
may return the root certificate in a primary datacenter.
At a high level, we are interested in knowing which
CA is responsible for signing leaf certs, regardless of
its position in a certificate chain. This rename makes
the intent clearer.
* Move provider state check earlier
* Remove calls to GenerateLeafSigningCert
GenerateLeafSigningCert (formerly known
as GenerateIntermediate) is vestigial in
non-Vault providers, as it simply returns
the root certificate in primary
datacenters.
By folding Vault's intermediate cert logic
into `GenerateRoot` we can encapsulate
the intermediate cert handling within
`newCARoot`.
* Move GenerateLeafSigningCert out of PrimaryProvidder
Now that the Vault Provider calls
GenerateLeafSigningCert within
GenerateRoot, we can remove the method
from all other providers that never
used it in a meaningful way.
* Add test for IntermediatePEM
* Rename GenerateRoot to GenerateCAChain
"Root" was being overloaded in the Consul CA
context, as different providers and configs
resulted in a single root certificate or
a chain originating from an external trusted
CA. Since the Vault provider also generates
intermediates, it seems more accurate to
call this a CAChain.
This PR adds the sameness-group field to exported-service
config entries, which allows for services to be exported
to multiple destination partitions / peers easily.
* Use merge of enterprise meta's rather than new custom method
* Add merge logic for tcp routes
* Add changelog
* Normalize certificate refs on gateways
* Fix infinite call loop
* Explicitly call enterprise meta
This commit swaps the partition field to the local partition for
discovery chains targeting peers. Prior to this change, peer upstreams
would always use a value of default regardless of which partition they
exist in. This caused several issues in xds / proxycfg because of id
mismatches.
Some prior fixes were made to deal with one-off id mismatches that this
PR also cleans up, since they are no longer needed.
* delete config when nil
* fix mock interface implementation
* fix handler test to use the right assertion
* extract DeleteConfig as a separate API.
* fix mock limiter implementation to satisfy the new interface
* fix failing tests
* add test comments
* Remove unused are hosts set check
* Remove all traces of unused 'AreHostsSet' parameter
* Remove unused Hosts attribute
* Remove commented out use of snap.APIGateway.Hosts
* Refactored "NewGatewayService" to handle namespaces, fixed
TestHTTPRouteFlattening test
* Fixed existing http_route tests for namespacing
* Squash aclEnterpriseMeta for ResourceRefs and HTTPServices, accept
namespace for creating connect services and regular services
* Use require instead of assert after creating namespaces in
http_route_tests
* Refactor NewConnectService and NewGatewayService functions to use cfg
objects to reduce number of method args
* Rename field on SidecarConfig in tests from `SidecarServiceName` to
`Name` to avoid stutter
This commit fixes an issue where trust bundles could not be read
by services in a non-default namespace, unless they had excessive
ACL permissions given to them.
Prior to this change, `service:write` was required in the default
namespace in order to read the trust bundle. Now, `service:write`
to a service in any namespace is sufficient.
If a CA config update did not cause a root change, the codepath would return early and skip some steps which preserve its intermediate certificates and signing key ID. This commit re-orders some code and prevents updates from generating new intermediate certificates.
This commit adds a sameness-group config entry to the API and structs packages. It includes some validation logic and a new memdb index that tracks the default sameness-group for each partition. Sameness groups will simplify the effort of managing failovers / intentions / exports for peers and partitions.
Note that this change purely to introduce the configuration entry and does not include the full functionality of sameness-groups.
Co-authored-by: Ashvitha Sridharan <ashvitha.sridharan@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Add a new envoy flag: "envoy_hcp_metrics_bind_socket_dir", a directory
where a unix socket will be created with the name
`<namespace>_<proxy_id>.sock` to forward Envoy metrics.
If set, this will configure:
- In bootstrap configuration a local stats_sink and static cluster.
These will forward metrics to a loopback listener sent over xDS.
- A dynamic listener listening at the socket path that the previously
defined static cluster is sending metrics to.
- A dynamic cluster that will forward traffic received at this listener
to the hcp-metrics-collector service.
Reasons for having a static cluster pointing at a dynamic listener:
- We want to secure the metrics stream using TLS, but the stats sink can
only be defined in bootstrap config. With dynamic listeners/clusters
we can use the proxy's leaf certificate issued by the Connect CA,
which isn't available at bootstrap time.
- We want to intelligently route to the HCP collector. Configuring its
addreess at bootstrap time limits our flexibility routing-wise. More
on this below.
Reasons for defining the collector as an upstream in `proxycfg`:
- The HCP collector will be deployed as a mesh service.
- Certificate management is taken care of, as mentioned above.
- Service discovery and routing logic is automatically taken care of,
meaning that no code changes are required in the xds package.
- Custom routing rules can be added for the collector using discovery
chain config entries. Initially the collector is expected to be
deployed to each admin partition, but in the future could be deployed
centrally in the default partition. These config entries could even be
managed by HCP itself.
Add support for using existing vault auto-auth configurations as the
provider configuration when using Vault's CA provider with AliCloud.
AliCloud requires 2 extra fields to enable it to use STS (it's preferred
auth setup). Our vault-plugin-auth-alicloud package contained a method
to help generate them as they require you to make an http call to
a faked endpoint proxy to get them (url and headers base64 encoded).
Receiving an "acl not found" error from an RPC in the agent cache and the
streaming/event components will cause any request loops to cease under the
assumption that they will never work again if the token was destroyed. This
prevents log spam (#14144, #9738).
Unfortunately due to things like:
- authz requests going to stale servers that may not have witnessed the token
creation yet
- authz requests in a secondary datacenter happening before the tokens get
replicated to that datacenter
- authz requests from a primary TO a secondary datacenter happening before the
tokens get replicated to that datacenter
The caller will get an "acl not found" *before* the token exists, rather than
just after. The machinery added above in the linked PRs will kick in and
prevent the request loop from looping around again once the tokens actually
exist.
For `consul-dataplane` usages, where xDS is served by the Consul servers
rather than the clients ultimately this is not a problem because in that
scenario the `agent/proxycfg` machinery is on-demand and launched by a new xDS
stream needing data for a specific service in the catalog. If the watching
goroutines are terminated it ripples down and terminates the xDS stream, which
CDP will eventually re-establish and restart everything.
For Consul client usages, the `agent/proxycfg` machinery is ahead-of-time
launched at service registration time (called "local" in some of the proxycfg
machinery) so when the xDS stream comes in the data is already ready to go. If
the watching goroutines terminate it should terminate the xDS stream, but
there's no mechanism to re-spawn the watching goroutines. If the xDS stream
reconnects it will see no `ConfigSnapshot` and will not get one again until
the client agent is restarted, or the service is re-registered with something
changed in it.
This PR fixes a few things in the machinery:
- there was an inadvertent deadlock in fetching snapshot from the proxycfg
machinery by xDS, such that when the watching goroutine terminated the
snapshots would never be fetched. This caused some of the xDS machinery to
get indefinitely paused and not finish the teardown properly.
- Every 30s we now attempt to re-insert all locally registered services into
the proxycfg machinery.
- When services are re-inserted into the proxycfg machinery we special case
"dead" ones such that we unilaterally replace them rather that doing that
conditionally.
Adds support for the approle auth-method. Only handles using the approle
role/secret to auth and it doesn't support the agent's extra management
configuration options (wrap and delete after read) as they are not
required as part of the auth (ie. they are vault agent things).
* Fix issue where terminating gateway service resolvers weren't properly cleaned up
* Add integration test for cleaning up resolvers
* Add changelog entry
* Use state test and drop integration test
* Leverage ServiceResolver ConnectTimeout for route timeouts to make TerminatingGateway upstream timeouts configurable
* Regenerate golden files
* Add RequestTimeout field
* Add changelog entry
Adds support for a jwt token in a file. Simply reads the file and sends
the read in jwt along to the vault login.
It also supports a legacy mode with the jwt string being passed
directly. In which case the path is made optional.
Does the required dance with the local HTTP endpoint to get the required
data for the jwt based auth setup in Azure. Keeps support for 'legacy'
mode where all login data is passed on via the auth methods parameters.
Refactored check for hardcoded /login fields.
Registering gRPC balancers is thread-unsafe because they are stored in a
global map variable that is accessed without holding a lock. Therefore,
it's expected that balancers are registered _once_ at the beginning of
your program (e.g. in a package `init` function) and certainly not after
you've started dialing connections, etc.
> NOTE: this function must only be called during initialization time
> (i.e. in an init() function), and is not thread-safe.
While this is fine for us in production, it's challenging for tests that
spin up multiple agents in-memory. We currently register a balancer per-
agent which holds agent-specific state that cannot safely be shared.
This commit introduces our own registry that _is_ thread-safe, and
implements the Builder interface such that we can call gRPC's `Register`
method once, on start-up. It uses the same pattern as our resolver
registry where we use the dial target's host (aka "authority"), which is
unique per-agent, to determine which builder to use.
Prior to this commit, all peer services were transmitted as connect-enabled
as long as a one or more mesh-gateways were healthy. With this change, there
is now a difference between typical services and connect services transmitted
via peering.
A service will be reported as "connect-enabled" as long as any of these
conditions are met:
1. a connect-proxy sidecar is registered for the service name.
2. a connect-native instance of the service is registered.
3. a service resolver / splitter / router is registered for the service name.
4. a terminating gateway has registered the service.