filterACLWithAuthorizer could never return an error. This change moves us a little bit
closer to being able to enable errcheck and catch problems caused by unhandled error
return values.
These functions are moved to the one place they are called to improve code locality.
They are being moved out of agent/consul/acl.go in preparation for moving
ACLResolver to an acl package.
These functions are used in only one place. Move the functions next to their one caller
to improve code locality.
This change is being made in preparation for moving the ACLResolver into an
acl package. The moved functions were previously in the same file as the ACLResolver.
By moving them out of that file we may be able to move the entire file
with fewer modifications.
* defer setting the state before returning to avoid being stuck in `INITIALIZING` state
* add changelog
* move comment with the right if statement
* ca: report state transition error from setState
* update comment to reflect state transition
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Follow up to: https://github.com/hashicorp/consul/pull/10738#discussion_r680190210
Previously we were passing an Authorizer that would always allow the
operation, then later checking the authorization using vetServiceTxnOp.
On the surface this seemed strange, but I think it was actually masking
a bug as well. Over time `servicePreApply` was changed to add additional
authorization for `service.Proxy.DestinationServiceName`, but because
we were passing a nil Authorizer, that authorization was not handled on
the txn_endpoint.
`TxnServiceOp.FillAuthzContext` has some special handling in enterprise,
so we need to make sure to continue to use that from the Txn endpoint.
This commit removes the `vetServiceTxnOp` function, and passes in the
`FillAuthzContext` function so that `servicePreApply` can be used by
both the catalog and txn endpoints. This should be much less error prone
and prevent bugs like this in the future.
Follow up to https://github.com/hashicorp/consul/pull/10737#discussion_r680134445
Move the check for the Intention.DestinationName into the Authorizer to remove the
need to check what kind of Authorizer is being used.
It sounds like this check is only for legacy ACLs, so it is probably just a safeguard.
1. do not emit the metric if Query fails
2. properly check for PrimaryUsesIntermediate, the logic was inverted
Also improve the logging by including the metric name in the log message
* fix state index for `CAOpSetRootsAndConfig` op
* add changelog
* Update changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* remove the change log as it's not needed
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
These checks were a bit more involved. They were previously skipping some code paths
when the authorizer was nil. After looking through these it seems correct to remove the
authz == nil check, since it will never evaluate to true.
These cases are already impossible conditions, because most of these functions already start
with a check for ACLs being disabled. So the code path being removed could never be reached.
The one other case (ConnectAuthorized) was already changed in a previous commit. This commit
removes an impossible branch because authz == nil can never be true.
These methods are no longer used. Remove the methods, and update the
tests to use the actual method used by production code.
Also removes the 'authz == nil' check, which is no longer a possible code path
now that we are returning a non-nil acl.Authorizer when ACLs are disabled.
The blocking query backend sets the default value on the server side.
The streaming backend does not use blocking queries, so we must set the timeout on
the client.
Now that we have at least one endpoint that uses context for cancellation we can
encounter this scenario where the returned error is a context.Canceled or
context.DeadlineExceeded.
If request.Context().Err() is not nil, then we know the request itself was cancelled, so
we can log a different message at Info level instead of logging an error.
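A minimal sketch of that check, assuming an hclog-style logger (the function and message names here are illustrative, not the actual agent code):

```go
import (
	"net/http"

	"github.com/hashicorp/go-hclog"
)

// logRequestError sketches the check described above: when the request's
// own context is done (context.Canceled or context.DeadlineExceeded), the
// failure is client-driven and is logged at Info rather than Error.
func logRequestError(logger hclog.Logger, req *http.Request, err error) {
	if err == nil {
		return
	}
	if req.Context().Err() != nil {
		logger.Info("request cancelled by client", "url", req.URL.String(), "error", err)
		return
	}
	logger.Error("request failed", "url", req.URL.String(), "error", err)
}
```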
Knowing that blocking queries are firing does not provide much
information on its own. If we know the correlation IDs we can
piece together which parts of the snapshot have been populated.
Some of these responses might be empty from the blocking
query timing out. But if they're returning quickly I think we
can reasonably assume they contain data.
* return an error when the index is not valid
* check response as bool when applying `CAOpSetConfig`
* remove check for bool response
* fix error message and add check to test
* fix comment
* add changelog
If multiple instances of a service are co-located on the same node then
their proxies will all share a cache entry for their resolved service
configuration. This is because the cache key contains the name of the
watched service but does not take into account the ID of the watching
proxies.
This means that there will be multiple agent service manager watches
that can wake up on the same cache update. These watchers then
concurrently modify the value in the cache when merging the resolved
config into the local proxy definitions.
To avoid this concurrent map write we will only delete the key from
opaque config in the local proxy definition after the merge, rather
than from the cached value before the merge.
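A rough sketch of that safer ordering, assuming the resolved config is a plain map (the names and the deleted key below are illustrative, not the actual service manager code):

```go
// mergeResolvedConfig treats the shared cache value as read-only and only
// deletes keys from the merged, per-proxy copy after the merge.
func mergeResolvedConfig(cached, local map[string]interface{}) map[string]interface{} {
	merged := make(map[string]interface{}, len(cached)+len(local))
	for k, v := range cached { // never mutate the cached value in place
		merged[k] = v
	}
	for k, v := range local { // the local proxy definition takes precedence
		merged[k] = v
	}
	delete(merged, "resolved_only_key") // hypothetical key; removed post-merge
	return merged
}
```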
This change adds a new `dns_config.recursor_strategy` option which
controls how Consul queries DNS resolvers listed in the `recursors`
config option. The supported options are `sequential` (default), and
`random`.
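As an illustration of what the two strategies imply for query order (a sketch, not the actual DNS server code):

```go
import "math/rand"

// recursorOrder returns the order in which configured recursors would be
// queried: "sequential" keeps the configured order, "random" shuffles it.
func recursorOrder(recursors []string, strategy string) []string {
	out := make([]string, len(recursors))
	copy(out, recursors)
	if strategy == "random" {
		rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	}
	return out
}
```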
Closes #8807
Co-authored-by: Blake Covarrubias <blake@covarrubi.as>
Co-authored-by: Priyanka Sengupta <psengupta@flatiron.com>
A previous commit used SetHash on two of the cases to fix a data race. This commit applies
that change to all cases. Using SetHash in this test helper should ensure that the
test helper behaves closer to production.
These changes ensure that the identity of services dialed is
cryptographically verified.
For all upstreams we validate against SPIFFE IDs in the format used by
Consul's service mesh:
spiffe://<trust-domain>/ns/<namespace>/dc/<datacenter>/svc/<service>
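For illustration, an upstream SPIFFE ID in that format could be assembled as follows (a hypothetical helper; Consul's connect package has its own types for this):

```go
import "fmt"

// upstreamSpiffeID builds the URI SAN expected for an upstream service.
func upstreamSpiffeID(trustDomain, namespace, datacenter, service string) string {
	return fmt.Sprintf("spiffe://%s/ns/%s/dc/%s/svc/%s",
		trustDomain, namespace, datacenter, service)
}
```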
The dnsConfig pulled from the atomic.Value is a pointer, so modifying it in place
creates a data race. Use the exported ReloadConfig interface instead.
The LogOutput io.Writer used by TestAgent must allow concurrent reads and writes, and a
bytes.Buffer does not allow this. The bytes.Buffer must be wrapped with a lock to make this safe.
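A minimal sketch of such a wrapper (the helper actually used in the test code may differ):

```go
import (
	"bytes"
	"sync"
)

// syncBuffer makes a bytes.Buffer safe for use as a LogOutput that is
// written by agent goroutines and read by the test.
type syncBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *syncBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *syncBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}
```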
Some global variables are patched to shorter values in these tests. But the goroutines that read
them can outlive the test because nothing waited for them to exit.
This commit adds a Wait() method to the routine manager, so that tests can wait for the goroutines
to exit. This prevents the data race because the 'reset to original value' can happen
after all other goroutines have stopped.
A previous commit started using QueryRefuced, but that is not correct. QueryRefuced refers to
the OpCode, not the query type.
Instead use errNoAnswer because we have no records for that query type.
Now that trimDNSResponse is handled by the caller we don't need to pass this value
around. We can remove it from both the serviceLookup struct, and two functions.
The dispatch function was called from a single place and did nothing but add a default value.
Removing it makes code easier to trace by removing an unnecessary hop.
The compatv2 integration tests were failing because they use an older CLI version with a newer
HTTP API. This commit restores the GRPCPort field to the DebugConfig output to allow older
CLIs to continue to fetch the port.
The DebugConfig in the self endpoint can change at any time. It's not a stable API.
With the previous change to rename GRPCPort to XDSPort this command would have broken.
This commit adds the XDSPort to a stable part of the XDS api, and changes the envoy command to read
this new field.
It includes support for the old API as well, in case a newer CLI is used with an older API, and
adds a test for both cases.
* ca: move provider creation into CAManager
This further decouples the CAManager from Server. It reduces the interface between them and
removes the need for the SetLogger method on providers.
* ca: move SignCertificate to CAManager
To reduce the scope of Server, and keep all the CA logic together
* ca: move SignCertificate to the file where it is used
* auto-config: move autoConfigBackend impl off of Server
Most of these methods are used exclusively for the AutoConfig RPC
endpoint. This PR uses a pattern that we've used in other places as an
incremental step to reducing the scope of Server.
* fix linter issues
* check error when `raftApplyMsgpack`
* ca: move SignCertificate to CAManager
To reduce the scope of Server, and keep all the CA logic together
* check expiry date of the intermediate before using it to sign a leaf
* fix typo in comment
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
* Fix test name
* do not check cert start date
* wrap error to mention it is the intermediate expired
* Fix failing test
* update comment
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use shim to avoid sleep in test
* add root cert validation
* remove duplicate code
* Revert "fix linter issues"
This reverts commit 6356302b54f06c8f2dee8e59740409d49e84ef24.
* fix import issue
* gofmt leader_connect_ca
* add changelog entry
* update error message
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* fix error message in test
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Most of these methods are used exclusively for the AutoConfig RPC
endpoint. This PR uses a pattern that we've used in other places as an
incremental step to reducing the scope of Server.
raftApply was removed, so ApplyCARequest needs to handle all the possible operations.
Also set the providerShim to use the mock provider.
Other changes are small test improvements that were necessary to debug the failures,
and a small refactor to getCAProvider so that GoLand is less confused about what it is doing.
Previously it was reporting that the for condition was always true, which was not the case.
After moving ca.ConsulProviderStateDelegate into the interface we now
have the ApplyCARequest method which does the same thing. Use this more
specific method instead of raftApply.
This field was documented as enabling TLS for outgoing RPC, but that was not the case.
All this field did was set the use_tls serf tag.
Instead of setting this field in a place far from where it is used, move the logic to where
the serf tag is set, so that the code is much more obvious.
tlsutil.Config already presents an excellent structure for this
configuration. Copying the runtime config fields to agent/consul.Config
makes code harder to trace, and provides no advantage.
Instead of copying the fields around, use the tlsutil.Config struct
directly instead.
This is one small step in removing the many layers of duplicate
configuration.
* add intermediate ca metric routine
* add Gauge config for intermediate cert
* Stop metrics routine when stopping leader
* add changelog entry
* update changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use variables instead of a map
* go imports sort
* Add metrics for primary and secondary ca
* start metrics routine in the right DC
* add telemetry documentation
* update docs
* extract expiry fetching in a func
* merge metrics for primary and secondary into signing ca metric
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
And remove BuildAndValidate. This commit completes some earlier work to reduce the config
interface to a single Load function.
The last remaining test was converted to use Load instead of BuildAndValidate.
Previously, for a POST request to the /v1/operator/autopilot/configuration
endpoint, any fields not included in the payload were set to a zero-initialized
value rather than the documented default value.
Now, if an optional field is not included in the payload, it will be set to its
documented default value:
- CleanupDeadServers: true
- LastContactThreshold: "200ms"
- MaxTrailingLogs: 250
- MinQuorum: 0
- ServerStabilizationTime: "10s"
- RedundancyZoneTag: ""
- DisableUpgradeMigration: false
- UpgradeVersionTag: ""
* trim carriage return from certificates when inserting rootCA in the inMemDB
* format rootCA properly when returning the CA on the connect CA endpoint
* Fix linter warnings
* Fix providers to trim certs before returning it
* trim newlines on write when possible
* add changelog
* make sure all providers return a trailing newline after the root and intermediate certs
* Fix endpoint to return trailing new line
* Fix failing test with vault provider
* make test more robust
* make sure all providers return a trailing newline after the leaf certs
* Check for suffix before removing newline and use function
* Add comment to consul provider
* Update change log
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix typo
* simplify code callflow
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* extract requireNewLine as shared func
* remove dependency to testify in testing file
* remove extra newline in vault provider
* Add cert newline fix to envoy xds
* remove new line from mock provider
* Remove adding a new line from provider and fix it when the cert is read
* Add a comment to explain the fix
* Add missing for leaf certs
* fix missing new line
* fix missing new line in leaf certs
* remove extra new line in test
* update changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* fix in vault provider and when reading cache (RPC call)
* fix AWS provider
* fix failing test in the provider
* remove comments and empty lines
* add check for empty cert in test
* fix linter warnings
* add new line for leaf and private key
* use string concat instead of Sprintf
* fix new lines for leaf signing
* preallocate slice and remove append
* Add new line to `SignIntermediate` and `CrossSignCA`
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
The hcl decoding apparently uses strconv.ParseInt, which fails to parse a 64-bit int on
32-bit platforms. Since hcl v1 is basically EOL, it seems unlikely we'll fix this in hcl.
Since this test is only about loading values from config files, the extra-large number
doesn't seem important. Trim a few zeros from the numbers so that they parse
properly on 32-bit platforms.
Also skip a slow test when -short is used.
sync/atomic must be used with 64-bit aligned fields, and that alignment is difficult to
ensure unless the field is the first one in the struct.
https://golang.org/pkg/sync/atomic/#pkg-note-BUG.
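For example (illustrative names only), a struct whose counter is updated with sync/atomic keeps that field first:

```go
import "sync/atomic"

// stats keeps the atomically accessed 64-bit field first in the struct so
// it stays 64-bit aligned even on 32-bit platforms.
type stats struct {
	hits uint64 // must remain the first field; see the alignment note above
	name string
}

func (s *stats) incr() uint64 { return atomic.AddUint64(&s.hits, 1) }
```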
If a value was already available in the local view the request is considered a cache hit.
If the materialized view had to wait for a value, it is considered a cache miss.
This test is super racy (it's not just a single line).
This test also starts failing once streaming is enabled, because the
cache rate limit no longer applies to the requests in the test. The
queries use streaming instead of the cache.
This test is no longer valid, and the functionality is already well
tested by TestCacheThrottle. Instead of spending time rewriting this
test, let's remove it.
```
WARNING: DATA RACE
Read at 0x00c01de410fc by goroutine 735:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:1024 +0x9af
github.com/hashicorp/consul/testrpc.WaitForTestAgent()
/home/daniel/pers/code/consul/testrpc/wait.go:99 +0x209
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:966 +0x1ad
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
Previous write at 0x00c01de410fc by goroutine 605:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1.2()
/home/daniel/pers/code/consul/agent/agent_test.go:998 +0xe9
Goroutine 735 (running) created at:
testing.(*T).Run()
/usr/lib/go/src/testing/testing.go:1238 +0x5d7
github.com/hashicorp/consul/agent.TestCacheRateLimit()
/home/daniel/pers/code/consul/agent/agent_test.go:961 +0x375
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
Goroutine 605 (finished) created at:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:1022 +0x91e
github.com/hashicorp/consul/testrpc.WaitForTestAgent()
/home/daniel/pers/code/consul/testrpc/wait.go:99 +0x209
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:966 +0x1ad
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
```
The query metrics are actually reported for all read queries, not only
ones that use a MinIndex to block for updates.
Also clarify the raft.apply metric is only on the leader.
As part of this change, we ensure that the SAN extensions are marked as
critical when the subject is empty so that AWS PCA tolerates the loss of
common names well and continues to function as a Connect CA provider.
Parts of this currently hack around a bug in crypto/x509 and can be
removed after https://go-review.googlesource.com/c/go/+/329129 lands in
a Go release.
Note: the AWS PCA tests do not run automatically, but the following
passed locally for me:
ENABLE_AWS_PCA_TESTS=1 go test ./agent/connect/ca -run TestAWS
* return an invalid record when asked for an addr dns with type other than A and AAAA
* add changelog
* fix ANY use case and add a test for it
* update changelog type
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* return empty response if the question record type does not match for addr
* set comment in the right place
* return A/AAAA record in extra section if record type is not A/AAAA for addr
* Fix failing test
* remove commented code
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use require for test validation
* use variable to init struct
* fix failing test
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update .changelog/10401.txt
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* fix compilation error
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
With an optional interface that providers can use to indicate if they
use an intermediate cert in the primary DC.
This removes the need to look up the provider config when renewing the
intermediate.
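A sketch of the optional-interface pattern being described (the interface and method names here are assumptions, not necessarily the exact ones in the ca package):

```go
// PrimaryUsesIntermediate is an optional marker interface; a provider that
// signs with an intermediate certificate in the primary DC implements it.
type PrimaryUsesIntermediate interface {
	PrimaryUsesIntermediate()
}

// primaryUsesIntermediate lets callers type-assert the provider instead of
// looking up its configuration when renewing the intermediate.
func primaryUsesIntermediate(provider interface{}) bool {
	_, ok := provider.(PrimaryUsesIntermediate)
	return ok
}
```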
Replace two methods with a single one that returns the cert. This moves more
of the logic into the single caller (auto-config).
tlsutil.Configurator is widely used. By keeping it smaller and focused only on storing and
returning TLS config, we make the code easier to follow.
These two methods were more related to auto-config than to tlsutil, so reducing the interface
moves the logic closer to the feature that requires it.
There is no interaction between these handlers, so splitting them into separate files
makes it easier to discover the full implementation of each kindHandler.
This commit extracts all the kind-specific logic into handler types, and
keeps the generic parts on the state struct. This change should make it
easier to add new kinds, and see the implementation of each kind more
clearly.
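A hedged sketch of the shape this describes (method names and types are hypothetical):

```go
import "context"

// kindHandler holds the kind-specific logic; the state struct keeps the
// generic bookkeeping and delegates to one of these per proxy kind.
type kindHandler interface {
	// initialize sets up the watches and returns the initial snapshot.
	initialize(ctx context.Context) (interface{}, error)
	// handleUpdate applies a single watch update to the snapshot.
	handleUpdate(ctx context.Context, update interface{}, snapshot interface{}) error
}
```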
* remove flush for each write to http response in the agent monitor endpoint
* fix race condition when we stop and start monitor multiple times, the doneCh is closed and never recover.
* start log reading goroutine before adding the sink to avoid filling the log channel before getting a chance of reading from it
* flush every 500ms to optimize log writing in the http server side.
* add changelog file
* add issue url to changelog
* fix changelog url
* Update changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use ticker to flush and avoid race condition when flushing in a different goroutine
* stop the ticker when done
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Revert "fix race condition when we stop and start monitor multiple times, the doneCh is closed and never recover."
This reverts commit 1eeddf7a
* wait for log consumer loop to start before registering the sink
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
In the absence of stats_tags to handle this pattern, when we pass
"ingress_upstream.$port" as the stat_prefix, Envoy splits up that prefix
and makes the port a part of the metric name.
For example:
- stat_prefix: ingress_upstream.8080
This leads to metric names like envoy_http_8080_no_route. Changing the
stat_prefix to ingress_upstream_8080 yields the expected metric names
such as envoy_http_no_route.
Note that we don't encode the destination's name/ns/dc in this
stat_prefix because for HTTP services ingress gateways use a single
filter chain. Only cluster metrics are available on a per-upstream
basis.
This change makes it so that the stat prefix for terminating gateways
matches that of connect proxies. By using the structure of
"upstream.svc.ns.dc" we can extract labels for the destination service,
namespace, and datacenter.
Updates to a cluster will clear the associated endpoints, and updates to
a listener will clear the associated routes. Update the incremental xDS
logic to account for this implicit cleanup so that we can finish warming
the clusters and listeners.
Fixes #10379
CatalogDestinationsOnly is a passthrough that would enable dialing
addresses outside of Consul's catalog. However, when this flag is set to
true only _connect_ endpoints for services can be dialed.
This flag is being renamed to signal that non-Connect endpoints can't be
dialed by transparent proxies when the value is set to true.
Previously we would return an error if duplicate paths were specified.
This could lead to problems in cases where a user has the same path,
say /healthz, on two different ports.
This validation was added to signal a potential misconfiguration.
Instead we will only check for duplicate listener ports, since that is
what would lead to ambiguity issues when generating xDS config.
In the future we could look into using a single listener and creating
distinct filter chains for each path/port.
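The described validation, as a rough sketch (the exposedPath type here is hypothetical):

```go
import "fmt"

type exposedPath struct {
	Path         string
	ListenerPort int
}

// validateExposePaths errors only on duplicate listener ports; duplicate
// paths on different ports are allowed.
func validateExposePaths(paths []exposedPath) error {
	seen := make(map[int]string, len(paths))
	for _, p := range paths {
		if prev, ok := seen[p.ListenerPort]; ok {
			return fmt.Errorf("listener port %d is used by both %q and %q", p.ListenerPort, prev, p.Path)
		}
		seen[p.ListenerPort] = p.Path
	}
	return nil
}
```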
These two new struct types will allow us to make a polymorphic handler for each kind, instead of
having all the logic for each proxy kind on the state struct.
context.Context should never be stored on a struct (as it says in the godoc) because it is easy
to end up with the wrong context when it is stored.
Also see https://blog.golang.org/context-and-structs
This change is also in preparation for splitting state into kind-specific handlers so that the
implementation of each kind is grouped together.
This field is available in DebugConfig, but that field is not stable and could change at any time.
The consul-k8s project needs to be able to detect the primary DC for tests, so this field is added to
the stable part of the API response.
Both NextLink and NextNoBlock had the same logic, with slightly
different return values. By adding a bool return value (similar to map
lookups) we can remove the duplicate method.
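A sketch of the consolidated, map-lookup-style signature (the item type below is simplified and not the real event buffer):

```go
import "sync/atomic"

type item struct {
	value string
	next  atomic.Value // holds *item once the following item is published
}

// nextNoBlock returns the next item if one is already available; the bool
// return lets a single method serve both former callers.
func (i *item) nextNoBlock() (*item, bool) {
	n := i.next.Load()
	if n == nil {
		return nil, false
	}
	return n.(*item), true
}
```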
The head of the topic buffer was being ignored when creating a snapshot. This commit fixes
the bug by ensuring that the head of the topic buffer is included in the snapshot
before handing it off to the subscription.
When info.Timeout is 0, it should have no timeout. Previously it was using a 0 duration timeout
which caused it to return without waiting.
This bug was masked by using a timeout in the tests. Removing the timeout caused the tests to fail.
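Illustratively, the corrected behaviour looks something like this (names hypothetical):

```go
import (
	"context"
	"time"
)

// contextForTimeout applies a deadline only when a timeout was actually
// requested; a zero timeout means "wait indefinitely", not "return now".
func contextForTimeout(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout > 0 {
		return context.WithTimeout(ctx, timeout)
	}
	return context.WithCancel(ctx)
}
```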
This PR adds cluster members to the metrics API. The number of members per
segment is reported, as well as the total number of members.
Tested by running a multi-node cluster locally and ensuring the numbers were
correct. Also added unit test coverage to add the new expected gauges to
existing test cases.
We have seen test flakes caused by 'concurrent map read and map write', and the race detector
reports the problem as well (preventing us from running some tests with -race).
The root of the problem is that grpc expects resolvers to be registered at init time
before any requests are made, but we were using a separate resolver for each test.
This commit introduces a resolver registry. The registry is registered as the single
resolver for the consul scheme. Each test uses the Authority section of the target
(instead of the scheme) to identify the resolver that should be used for the test.
The scheme is used for lookup, which is why it can no longer be used as the unique
key.
This allows us to use a lock around the map of resolvers, preventing the data race.
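A sketch of the registry idea (illustrative names; the actual agent code registers its own Builder with gRPC):

```go
import (
	"sync"

	"google.golang.org/grpc/resolver"
)

// registry is registered once, at init time, for the "consul" scheme. Each
// test registers its resolver under a unique authority, and the mutex
// prevents the concurrent map read/write the race detector reported.
type registry struct {
	mu          sync.Mutex
	byAuthority map[string]resolver.Builder
}

func (r *registry) register(authority string, b resolver.Builder) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byAuthority[authority] = b
}

func (r *registry) lookup(authority string) (resolver.Builder, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	b, ok := r.byAuthority[authority]
	return b, ok
}
```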
* fix tests to use a dummy nodeName and not fail when hostname is not a valid nodeName
* remove conditional testing
* add test when node name is invalid
* debug: remove the CLI check for debug_enabled
The API allows collecting profiles even when debug_enabled=false as long as
ACLs are enabled. Remove this check from the CLI so that users do not
need to set debug_enabled=true for no reason.
Also:
- fix the API client to return errors on non-200 status codes for debug
endpoints
- improve the failure messages when pprof data can not be collected
Co-Authored-By: Dhia Ayachi <dhia@hashicorp.com>
* remove parallel test runs
parallel runs create a race condition that fails the debug tests
* Add changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
The bulk of this commit is moving the LeaderRoutineManager from the agent/consul package into its own package: lib/gort. It also got a renaming and its Start method now requires a context. Requiring that context required updating a whole bunch of other places in the code.
This refactor is to make it easier to see how serf feature flags are
encoded as serf tags, and where those feature flags are read.
- use constants for both the prefix and feature flag name. A constant
makes it much easier for an IDE to locate the read and write location.
- isolate the feature-flag encoding logic in the metadata package, so
that the feature flag prefix can be unexported. Only expose a function
for encoding the flags into tags. This logic is now next to the logic
which reads the tags.
- remove the duplicate `addEnterpriseSerfTags` functions. Both Client
and Server structs had the same implementation. And neither
implementation needed the method receiver.
The prior solution of calling reply.Reset() aged poorly since newer fields
were added to the reply but not added to Reset(), leading serial
blocking query loops on the server to blend replies.
This could manifest as a service-defaults protocol change from
default=>http not reverting back to default after the config entry
responsible was deleted.
* Save exposed HTTP or GRPC ports to the agent's store
* Add those to the health checks API so we can retrieve them from the API
* Change redirect-traffic command to also exclude those ports from inbound traffic redirection when expose.checks is set to true.
TestACLEndpoint_Login_with_TokenLocality was regularly being reported as failed even though
it was not failing. I took another look and I suspect it is because t.Parallel was being
called in a goroutine.
This would lead to strange behaviour which apparently confused the 'go test' runner.
- return errors in TestAgent.Start so that the retry works correctly
- remove duplicate logging, the error is returned already
- add a missing t.Helper() to retry.Run
- properly set a.Agent to nil so that subsequent retry attempts will actually try to start
Also accept an RPCInfo instead of interface{}. Accepting an interface
led to a bug where the caller was expecting the arg to be the response
when in fact it was always passed the request. By accepting RPCInfo
it should indicate that this is actually the request value.
One caller of canRetry already passed an RPCInfo; the second handles
the type assertion before calling canRetry.
A recent change in 1.9.x inverted the order of these two lines, which caused the
X-Consul-Effective-Consistency header to be missing for the service health endpoints