And remove BuildAndValidate. This commit completes some earlier work to reduce the config
interface a single Load function.
The last remaining test was converted to use Load instad of BuildAndValidate.
Previously, for a POST request to the /v1/operator/autopilot/configuration
endpoint, any fields not included in the payload were set to a zero-initialized
value rather than the documented default value.
Now, if an optional field is not included in the payload, it will be set to its
documented default value:
- CleanupDeadServers: true
- LastContactThreshold: "200ms"
- MaxTrailingLogs: 250
- MinQuorum: 0
- ServerStabilizationTime: "10s"
- RedundancyZoneTag: ""
- DisableUpgradeMigration: false
- UpgradeVersionTag: ""
* trim carriage return from certificates when inserting rootCA in the inMemDB
* format rootCA properly when returning the CA on the connect CA endpoint
* Fix linter warnings
* Fix providers to trim certs before returning it
* trim newlines on write when possible
* add changelog
* make sure all provider return a trailing newline after the root and intermediate certs
* Fix endpoint to return trailing new line
* Fix failing test with vault provider
* make test more robust
* make sure all provider return a trailing newline after the leaf certs
* Check for suffix before removing newline and use function
* Add comment to consul provider
* Update change log
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix typo
* simplify code callflow
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* extract requireNewLine as shared func
* remove dependency to testify in testing file
* remove extra newline in vault provider
* Add cert newline fix to envoy xds
* remove new line from mock provider
* Remove adding a new line from provider and fix it when the cert is read
* Add a comment to explain the fix
* Add missing for leaf certs
* fix missing new line
* fix missing new line in leaf certs
* remove extra new line in test
* updage changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* fix in vault provider and when reading cache (RPC call)
* fix AWS provider
* fix failing test in the provider
* remove comments and empty lines
* add check for empty cert in test
* fix linter warnings
* add new line for leaf and private key
* use string concat instead of Sprintf
* fix new lines for leaf signing
* preallocate slice and remove append
* Add new line to `SignIntermediate` and `CrossSignCA`
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
The hcl decoding apparently uses strconv.ParseInt, which fails to parse a 64bit int.
Since hcl v1 is basically EOl, it seems unlikely we'll fix this in hcl.
Since this test is only about loading values from config files, the extra large number
doesn't seem important. Trim a few zeros from the numbers so that they parse
properly on 32bit platforms.
Also skip a slow test when -short is used.
sync/atomic must be used with 64-bit aligned fields, and that alignment is difficult to
ensure unless the field is the first one in the struct.
https://golang.org/pkg/sync/atomic/#pkg-note-BUG.
If a value was already available in the local view the request is considered a cache hit.
If the materialized had to wait for a value, it is considered a cache miss.
This test is super racy (it's not just a single line).
This test also starts failing once streaming is enabled, because the
cache rate limit no longer applies to the requests in the test. The
queries use streaming instead of the cache.
This test is no longer valid, and the functionality is already well
tested by TestCacheThrottle. Instead of spending time rewriting this
test, let's remove it.
```
WARNING: DATA RACE
Read at 0x00c01de410fc by goroutine 735:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:1024 +0x9af
github.com/hashicorp/consul/testrpc.WaitForTestAgent()
/home/daniel/pers/code/consul/testrpc/wait.go:99 +0x209
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:966 +0x1ad
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
Previous write at 0x00c01de410fc by goroutine 605:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1.2()
/home/daniel/pers/code/consul/agent/agent_test.go:998 +0xe9
Goroutine 735 (running) created at:
testing.(*T).Run()
/usr/lib/go/src/testing/testing.go:1238 +0x5d7
github.com/hashicorp/consul/agent.TestCacheRateLimit()
/home/daniel/pers/code/consul/agent/agent_test.go:961 +0x375
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
Goroutine 605 (finished) created at:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:1022 +0x91e
github.com/hashicorp/consul/testrpc.WaitForTestAgent()
/home/daniel/pers/code/consul/testrpc/wait.go:99 +0x209
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:966 +0x1ad
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
```
The query metrics are actually reported for all read queries, not only
ones that use a MinIndex to block for updates.
Also clarify the raft.apply metric is only on the leader.
As part of this change, we ensure that the SAN extensions are marked as
critical when the subject is empty so that AWS PCA tolerates the loss of
common names well and continues to function as a Connect CA provider.
Parts of this currently hack around a bug in crypto/x509 and can be
removed after https://go-review.googlesource.com/c/go/+/329129 lands in
a Go release.
Note: the AWS PCA tests do not run automatically, but the following
passed locally for me:
ENABLE_AWS_PCA_TESTS=1 go test ./agent/connect/ca -run TestAWS
* return an invalid record when asked for an addr dns with type other then A and AAAA
* add changelog
* fix ANY use case and add a test for it
* update changelog type
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* return empty response if the question record type do not match for addr
* set comment in the right place
* return A\AAAA record in extra section if record type is not A\AAAA for addr
* Fix failing test
* remove commented code
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use require for test validation
* use variable to init struct
* fix failing test
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update .changelog/10401.txt
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* fix compilation error
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
With an optional interface that providers can use to indicate if they
use an intermediate cert in the primary DC.
This removes the need to look up the provider config when renewing the
intermediate.
Replace two methods with a single one that returns the cert. This moves more
of the logic into the single caller (auto-config).
tlsutil.Configurator is widely used. By keeping it smaller and focused only on storing and
returning TLS config, we make the code easier to follow.
These two methods were more related to auto-config than to tlsutil, so reducing the interface
moves the logic closer to the feature that requires it.
There is no interaction between these handlers, so splitting them into separate files
makes it easier to discover the full implementation of each kindHandler.
This commit extracts all the kind-specific logic into handler types, and
keeps the generic parts on the state struct. This change should make it
easier to add new kinds, and see the implementation of each kind more
clearly.
* remove flush for each write to http response in the agent monitor endpoint
* fix race condition when we stop and start monitor multiple times, the doneCh is closed and never recover.
* start log reading goroutine before adding the sink to avoid filling the log channel before getting a chance of reading from it
* flush every 500ms to optimize log writing in the http server side.
* add changelog file
* add issue url to changelog
* fix changelog url
* Update changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use ticker to flush and avoid race condition when flushing in a different goroutine
* stop the ticker when done
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Revert "fix race condition when we stop and start monitor multiple times, the doneCh is closed and never recover."
This reverts commit 1eeddf7a
* wait for log consumer loop to start before registering the sink
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
In the absence of stats_tags to handle this pattern, when we pass
"ingress_upstream.$port" as the stat_prefix, Envoy splits up that prefix
and makes the port a part of the metric name.
For example:
- stat_prefix: ingress_upstream.8080
This leads to metric names like envoy_http_8080_no_route. Changing the
stat_prefix to ingress_upstream_80880 yields the expected metric names
such as envoy_http_no_route.
Note that we don't encode the destination's name/ns/dc in this
stat_prefix because for HTTP services ingress gateways use a single
filter chain. Only cluster metrics are available on a per-upstream
basis.
This change makes it so that the stat prefix for terminating gateways
matches that of connect proxies. By using the structure of
"upstream.svc.ns.dc" we can extract labels for the destination service,
namespace, and datacenter.
Updates to a cluster will clear the associated endpoints, and updates to
a listener will clear the associated routes. Update the incremental xDS
logic to account for this implicit cleanup so that we can finish warming
the clusters and listeners.
Fixes#10379
CatalogDestinationsOnly is a passthrough that would enable dialing
addresses outside of Consul's catalog. However, when this flag is set to
true only _connect_ endpoints for services can be dialed.
This flag is being renamed to signal that non-Connect endpoints can't be
dialed by transparent proxies when the value is set to true.
Previously we would return an error if duplicate paths were specified.
This could lead to problems in cases where a user has the same path,
say /healthz, on two different ports.
This validation was added to signal a potential misconfiguration.
Instead we will only check for duplicate listener ports, since that is
what would lead to ambiguity issues when generating xDS config.
In the future we could look into using a single listener and creating
distinct filter chains for each path/port.
These two new struct types will allow us to make polymorphic handler for each kind, instad of
having all the logic for each proxy kind on the state struct.
context.Context should never be stored on a struct (as it says in the godoc) because it is easy to
to end up with the wrong context when it is stored.
Also see https://blog.golang.org/context-and-structs
This change is also in preparation for splitting state into kind-specific handlers so that the
implementation of each kind is grouped together.
This field is available in DebugConfig, but that field is not stable and could change at any time.
The consul-k8s needs to be able to detect the primary DC for tests, so adding this field to the
stable part of the API response.
Both NextLink and NextNoBlock had the same logic, with slightly
different return values. By adding a bool return value (similar to map
lookups) we can remove the duplicate method.
The head of the topic buffer was being ignored when creating a snapshot. This commit fixes
the bug by ensuring that the head of the topic buffer is included in the snapshot
before handing it off to the subscription.
When info.Timeout is 0, it should have no timeout. Previously it was using a 0 duration timeout
which caused it to return without waiting.
This bug was masked by using a timeout in the tests. Removing the timeout caused the tests to fail.
This PR adds cluster members to the metrics API. The number of members per
segment are reported as well as the total number of members.
Tested by running a multi-node cluster locally and ensuring the numbers were
correct. Also added unit test coverage to add the new expected gauges to
existing test cases.
We have seen test flakes caused by 'concurrent map read and map write', and the race detector
reports the problem as well (prevent us from running some tests with -race).
The root of the problem is the grpc expects resolvers to be registered at init time
before any requests are made, but we were using a separate resolver for each test.
This commit introduces a resolver registry. The registry is registered as the single
resolver for the consul scheme. Each test uses the Authority section of the target
(instead of the scheme) to identify the resolver that should be used for the test.
The scheme is used for lookup, which is why it can no longer be used as the unique
key.
This allows us to use a lock around the map of resolvers, preventing the data race.
* fix tests to use a dummy nodeName and not fail when hostname is not a valid nodeName
* remove conditional testing
* add test when node name is invalid
* debug: remove the CLI check for debug_enabled
The API allows collecting profiles even debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.
Also:
- fix the API client to return errors on non-200 status codes for debug
endpoints
- improve the failure messages when pprof data can not be collected
Co-Authored-By: Dhia Ayachi <dhia@hashicorp.com>
* remove parallel test runs
parallel runs create a race condition that fail the debug tests
* Add changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
The bulk of this commit is moving the LeaderRoutineManager from the agent/consul package into its own package: lib/gort. It also got a renaming and its Start method now requires a context. Requiring that context required updating a whole bunch of other places in the code.
This refactor is to make it easier to see how serf feature flags are
encoded as serf tags, and where those feature flags are read.
- use constants for both the prefix and feature flag name. A constant
makes it much easier for an IDE to locate the read and write location.
- isolate the feature-flag encoding logic in the metadata package, so
that the feature flag prefix can be unexported. Only expose a function
for encoding the flags into tags. This logic is now next to the logic
which reads the tags.
- remove the duplicate `addEnterpriseSerfTags` functions. Both Client
and Server structs had the same implementation. And neither
implementation needed the method receiver.
The prior solution to call reply.Reset() aged poorly since newer fields
were added to the reply, but not added to Reset() leading serial
blocking query loops on the server to blend replies.
This could manifest as a service-defaults protocol change from
default=>http not reverting back to default after the config entry
reponsible was deleted.
* Save exposed HTTP or GRPC ports to the agent's store
* Add those the health checks API so we can retrieve them from the API
* Change redirect-traffic command to also exclude those ports from inbound traffic redirection when expose.checks is set to true.
TestACLEndpoint_Login_with_TokenLocality was reguardly being reported as failed even though
it was not failing. I took another look and I suspect it is because t.Parllel was being
called in a goroutine.
This would lead to strange behaviour which apparently confused the 'go test' runner.
- return errors in TestAgent.Start so that the retry works correctly
- remove duplicate logging, the error is returned already
- add a missing t.Helper() to retry.Run
- properly set a.Agent to nil so that subsequent retry attempts will actually try to start
Also accept an RPCInfo instead of interface{}. Accepting an interface
lead to a bug where the caller was expecting the arg to be the response
when in fact it was always passed the request. By accepting RPCInfo
it should indicate that this is actually the request value.
One caller of canRetry already passed an RPCInfo, the second handles
the type assertion before calling canRetry.
A recent change in 1.9.x inverted the order of these two lines, which caused the
X-Consul-Effective-Consistency header to be missing for the servie health endpoints
Only default to the user token and agent token for the sync. Change the
exported methods to only return the stored tokens associated with a
specific check or service.
Also fixes a bug with listing kind=mesh config entries. ValidateConfigEntryKind was only being used by
the List endpoint, and was yet another place where we have to enumerate all the kinds.
This commit removes ValidateConfigEntryKind and uses MakeConfigEntry instead. This change removes
the need to maintain two separate functions at the cost of creating an instance of the config entry which will be thrown away immediately.
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Previously we would associate the address of a discovery chain target
with the discovery chain's filter chain. This was broken for a few reasons:
- If the upstream is a virtual service, the client proxy has no way of
dialing it because virtual services are not targets of their discovery
chains. The targets are distinct services. This is addressed by watching
the endpoints of all upstream services, not just their discovery chain
targets.
- If multiple discovery chains resolve to the same target, that would
lead to multiple filter chains attempting to match on the target's
virtual IP. This is addressed by only matching on the upstream's virtual
IP.
NOTE: this implementation requires an intention to the redirecting
virtual service and not just to the final destination. This is how
we can know that the virtual service is an upstream to watch.
A later PR will look into traversing discovery chains when computing
upstreams so that intentions are only required to the discovery chain
targets.
* WIP reloadable raft config
* Pre-define new raft gauges
* Update go-metrics to change gauge reset behaviour
* Update raft to pull in new metric and reloadable config
* Add snapshot persistance timing and installSnapshot to our 'protected' list as they can be infrequent but are important
* Update telemetry docs
* Update config and telemetry docs
* Add note to oldestLogAge on when it is visible
* Add changelog entry
* Update website/content/docs/agent/options.mdx
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
* Give descriptive error if auth method not found
Previously during a `consul login -method=blah`, if the auth method was not found, the
error returned would be "ACL not found". This is potentially confusing
because there may be many different ACLs involved in a login: the ACL of
the Consul client, perhaps the binding rule or the auth method.
Now the error will be "auth method blah not found", which is much easier
to debug.
Initially we were loading every potential upstream address into Envoy
and then routing traffic to the logical upstream service. The downside
of this behavior is that traffic meant to go to a specific instance
would be load balanced across ALL instances.
Traffic to specific instance IPs should be forwarded to the original
destination and if it's a destination in the mesh then we should ensure
the appropriate certificates are used.
This PR makes transparent proxying a Kubernetes-only feature for now
since support for other environments requires generating virtual IPs,
and Consul does not do that at the moment.
No config entry needs a Kind field. It is only used to determine the Go type to
target. As we introduce new config entries (like this one) we can remove the kind field
and have the GetKind method return the single supported value.
In this case (similar to proxy-defaults) the Name field is also unnecessary. We always
use the same value. So we can omit the name field entirely.
The only thing that needed fixing up pertained to this section of the 1.18.x release notes:
> grpc_stats: the default value for stats_for_all_methods is switched from true to false, in order to avoid possible memory exhaustion due to an untrusted downstream sending a large number of unique method names. The previous default value was deprecated in version 1.14.0. This only changes the behavior when the value is not set. The previous behavior can be used by setting the value to true. This behavior change by be overridden by setting runtime feature envoy.deprecated_features.grpc_stats_filter_enable_stats_for_all_methods_by_default.
For now to maintain status-quo I'm explicitly setting `stats_for_all_methods=true` in all versions to avoid relying upon the default.
Additionally the naming of the emitted metrics for these gRPC requests changed slightly so the integration test assertions for `case-grpc` needed adjusting.
This ensures that if someone does include some extension Consul does not currently make use of, that extension is actually usable. Without linking these envoy protobufs into the main binary it can't round trip the escape hatches to send them down to envoy.
Whenenver the go-control-plane library is upgraded next we just have to re-run 'make envoy-library'.
This adds support for the Incremental xDS protocol when using xDS v3. This is best reviewed commit-by-commit and will not be squashed when merged.
Union of all commit messages follows to give an overarching summary:
xds: exclusively support incremental xDS when using xDS v3
Attempts to use SoTW via v3 will fail, much like attempts to use incremental via v2 will fail.
Work around a strange older envoy behavior involving empty CDS responses over incremental xDS.
xds: various cleanups and refactors that don't strictly concern the addition of incremental xDS support
Dissolve the connectionInfo struct in favor of per-connection ResourceGenerators instead.
Do a better job of ensuring the xds code uses a well configured logger that accurately describes the connected client.
xds: pull out checkStreamACLs method in advance of a later commit
xds: rewrite SoTW xDS protocol tests to use protobufs rather than hand-rolled json strings
In the test we very lightly reuse some of the more boring protobuf construction helper code that is also technically under test. The important thing of the protocol tests is testing the protocol. The actual inputs and outputs are largely already handled by the xds golden output tests now so these protocol tests don't have to do double-duty.
This also updates the SoTW protocol test to exclusively use xDS v2 which is the only variant of SoTW that will be supported in Consul 1.10.
xds: default xds.Server.AuthCheckFrequency at use-time instead of construction-time
This config entry is being renamed primarily because in k8s the name
cluster could be confusing given that the config entry applies across
federated datacenters.
Additionally, this config entry will only apply to Consul as a service
mesh, so the more generic "cluster" name is not needed.
Previous getFromView would call view.Result when the result may not have been returned
(because the index is updated past the minIndex. This would allocate a slice and sort it
for no reason, because the values would never be returned.
Fix this by re-ordering the operations in getFromView.
The test changes in this commit were an attempt to cover the case where
an update is received but the index does not exceed the minIndex.
Also rename it to readEntry now that it doesn't return the entire entry. Based on feedback
in PR review, the full entry is not used by the caller, and accessing the fields wouldn't be
safe outside the lock, so it is safer to return only the Materializer
grpclog.SetLoggerV2 is meant to be called only once before any gRPC requests are received, but
each test that uses TestAgent will call NewBaseDeps again. Use a sync.Once to prevent the grpc
logging from being re-initialized by each test.
This will mean that a test can't use a fake logger to capture logs from the gRPC server.
These tests can flake when we get a notification for an earlier event.
Retry the read from update channel a few times to make sure we get the
event we expect.
Split the TestStreamingClient into the two logical components the real
client uses. This allows us to test multiple clients properly.
Previously writing of ctx from multiple Subscribe calls was showing a
data race.
Once this was fixed a test started to fail because the request had to be
made with a greater index, so that the store.Get call did not return
immediately.
The idleTTL was being written and read concurrently. Instead move the idleTTL to a struct
field so that when one test patches the TTL it does not impact others.
The background goroutines for the store can outlive a test because context cancellation
is async.
Setting this field to a value is equivalent to using the 'near' query paramter.
The intent is to sort the results by proximity to the node requesting
them. However with connect we send the results to envoy, which doesn't
care about the order, so setting this field is increasing the work
performed for no gain.
It is necessary to unset this field now because we would like connect
to use streaming, but streaming does not support sorting by proximity.
Remove View.Result error return value, it was always nil, and seems like it will likely always remain nill
since it is simply reading a stored value.
Also replace some cache types with local types.
And only start expiration time when the last request ends. This makes tracking expiry simpler, and
ensures that no entry can be expired while there are active requests.
Previously canRetry was attempting to retrieve this error from args, however there was never
any callers that would pass an error to args.
With the change to raftApply to move this error to the error return value, it is now possible
to receive this error from the err argument.
This commit updates canRetry to check for ErrChunkingResubmit in err.
Previously we were inconsistently checking the response for errors. This
PR moves the response-is-error check into raftApply, so that all callers
can look at only the error response, instead of having to know that
errors could come from two places.
This should expose a few more errors that were previously hidden because
in some calls to raftApply we were ignoring the response return value.
Also handle errors more consistently. In some cases we would log the
error before returning it. This can be very confusing because it can
result in the same error being logged multiple times. Instead return
a wrapped error.