When de-registering in anti-entropy sync, when there is no service or
check token.
The agent token will fall back to the default (aka user) token if no agent
token is set, so the existing behaviour still works, but it will prefer
the agent token over the user token if both are set.
ref: https://www.consul.io/docs/agent/options#acl_tokens
The agent token seems more approrpiate in this case, since this is an
"internal operation", not something initiated by the user.
These types are used as values (not pointers) in other structs. Using a pointer receiver causes
problems when the value is printed. fmt will not call the String method if it is passed a value
and the String method has a pointer receiver. By using a value receiver the correct string is printed.
Also remove some unused methods.
Add a skip condition to all tests slower than 100ms.
This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests
With this change:
```
$ time go test -count=1 -short ./agent
ok github.com/hashicorp/consul/agent 0.743s
real 0m4.791s
$ time go test -count=1 -short ./agent/consul
ok github.com/hashicorp/consul/agent/consul 4.229s
real 0m8.769s
```
There are a couple of things in here.
First, just like auto encrypt, any Cluster.AutoConfig RPC will implicitly use the less secure RPC mechanism.
This drastically modifies how the Consul Agent starts up and moves most of the responsibilities (other than signal handling) from the cli command and into the Agent.
In current implementation of Consul, check alias cannot determine
if a service exists or not. Because a service without any check
is semantically considered as passing, so when no healthchecks
are found for an agent, the check was considered as passing.
But this make little sense as the current implementation does not
make any difference between:
* a non-existing service (passing)
* a service without any check (passing as well)
In order to make it work, we have to ensure that when a check did
not find any healthcheck, the service does indeed exists. If it
does not, lets consider the check as failing.
Previously this happened to be using the method on the Server/Client that was meant to allow the ACLResolver to locally resolve tokens. On Servers that had tokens (primary or secondary dc + token replication) this function would lookup the token from raft and return the ACLIdentity. On clients this was always a noop. We inadvertently used this function instead of creating a new one when we added logging accessor ids for permission denied RPC requests.
With this commit, a new method is used for resolving the identity properly via the ACLResolver which may still resolve locally in the case of being on a server with tokens but also supports remote token resolution.
The Init method provided the same functionality as the New constructor.
The constructor is both more widely used, and more idiomatic, so remove
the Init method.
This change is in preparation for fixing printing of these IDs.
This fixes#7020.
There are two problems this PR solves:
* if the node info changes it is highly likely to get service and check registration permission errors unless those service tokens have node:write. Hopefully services you register don’t have this permission.
* the timer for a full sync gets reset for every partial sync which means that many partial syncs are preventing a full sync from happening
Instead of syncing node info last, after services and checks, and possibly saving one RPC because it is included in every service sync, I am syncing node info first. It is only ever going to be a single RPC that we are only doing when node info has changed. This way we are guaranteed to sync node info even when something goes wrong with services or checks which is more likely because there are more syncs happening for them.
* Fix segfault when removing both a service and associated check
updateSyncState creates entries in the services and checks maps for
remote services/checks that are not found locally, so that we can then
make sure to delete them in our reconciliation process. However, the
values added to the map are missing key fields that the rest of the code
expects to not be nil.
* Add comment stating Check field can be nil
Deregistering a service from the catalog automatically deregisters its
checks, however the agent still performs a deregister call for each
service checks even after the service has been deregistered.
With ACLs enabled this results in logs like:
"message:consul: "Catalog.Deregister" RPC failed to server
server_ip:8300: rpc error making call: rpc error making call: Unknown
check 'check_id'"
This change removes associated checks from the agent state when
deregistering a service, which results in less calls to the servers and
supresses the error logs.
Fixes: #5396
This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.
Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.
This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.
Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.
In this initial implementation requests to these paths are not
authenticated/encrypted.
* Store primaries root in secondary after intermediate signature
This ensures that the intermediate exists within the CA root stored in raft and not just in the CA provider state. This has the very nice benefit of actually outputting the intermediate cert within the ca roots HTTP/RPC endpoints.
This change means that if signing the intermediate fails it will not set the root within raft. So far I have not come up with a reason why that is bad. The secondary CA roots watch will pull the root again and go through all the motions. So as soon as getting an intermediate CA works the root will get set.
* Make TestAgentAntiEntropy_Check_DeferSync less flaky
I am not sure this is the full fix but it seems to help for me.
Read requests performed during anti antropy full sync currently target
the leader only. This generates a non-negligible load on the leader when
the DC is large enough and can be offloaded to the followers following
the "eventually consistent" policy for the agent state.
We switch the AE read calls to use stale requests with a small (2s)
MaxStaleDuration value and make sure we do not read too fast after a
write.
The observed bug was that a full restart of a consul datacenter (servers
and clients) in conjunction with a restart of a connect-flavored
application with bring-your-own-service-registration logic would very
frequently cause the envoy sidecar service check to never reflect the
aliased service.
Over the course of investigation several bugs and unfortunate
interactions were corrected:
(1)
local.CheckState objects were only shallow copied, but the key piece of
data that gets read and updated is one of the things not copied (the
underlying Check with a Status field). When the stock code was run with
the race detector enabled this highly-relevant-to-the-test-scenario field
was found to be racy.
Changes:
a) update the existing Clone method to include the Check field
b) copy-on-write when those fields need to change rather than
incrementally updating them in place.
This made the observed behavior occur slightly less often.
(2)
If anything about how the runLocal method for node-local alias check
logic was ever flawed, there was no fallback option. Those checks are
purely edge-triggered and failure to properly notice a single edge
transition would leave the alias check incorrect until the next flap of
the aliased check.
The change was to introduce a fallback timer to act as a control loop to
double check the alias check matches the aliased check every minute
(borrowing the duration from the non-local alias check logic body).
This made the observed behavior eventually go away when it did occur.
(3)
Originally I thought there were two main actions involved in the data race:
A. The act of adding the original check (from disk recovery) and its
first health evaluation.
B. The act of the HTTP API requests coming in and resetting the local
state when re-registering the same services and checks.
It took awhile for me to realize that there's a third action at work:
C. The goroutines associated with the original check and the later
checks.
The actual sequence of actions that was causing the bad behavior was
that the API actions result in the original check to be removed and
re-added _without waiting for the original goroutine to terminate_. This
means for brief windows of time during check definition edits there are
two goroutines that can be sending updates for the alias check status.
In extremely unlikely scenarios the original goroutine sees the aliased
check start up in `critical` before being removed but does not get the
notification about the nearly immediate update of that check to
`passing`.
This is interlaced wit the new goroutine coming up, initializing its
base case to `passing` from the current state and then listening for new
notifications of edge triggers.
If the original goroutine "finishes" its update, it then commits one
more write into the local state of `critical` and exits leaving the
alias check no longer reflecting the underlying check.
The correction here is to enforce that the old goroutines must terminate
before spawning the new one for alias checks.
* First conversion
* Use serf 0.8.2 tag and associated updated deps
* * Move freeport and testutil into internal/
* Make internal/ its own module
* Update imports
* Add replace statements so API and normal Consul code are
self-referencing for ease of development
* Adapt to newer goe/values
* Bump to new cleanhttp
* Fix ban nonprintable chars test
* Update lock bad args test
The error message when the duration cannot be parsed changed in Go 1.12
(ae0c435877d3aacb9af5e706c40f9dddde5d3e67). This updates that test.
* Update another test as well
* Bump travis
* Bump circleci
* Bump go-discover and godo to get rid of launchpad dep
* Bump dockerfile go version
* fix tar command
* Bump go-cleanhttp
Prevent race between register and deregister requests by saving them
together in the local state on registration.
Also adds more cleaning in case of failure when registering services
/ checks.
This PR adds two features which will be useful for operators when ACLs are in use.
1. Tokens set in configuration files are now reloadable.
2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload)
Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change.
Some other secondary changes:
* Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly.
* Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain to be valid.
* Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, its no longer used only for ACL token/policy replication. The new functions will detect 404s and fallback to using the older token paths when talking to pre-1.4.3 agents.
* Docs updated to reflect the API additions and to show using the new endpoints.
* Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.
This way we can avoid unnecessary panics which cause other tests not to run.
This doesn't remove all the possibilities for panics causing other tests not to run, it just fixes the TestAgent
Fixes: https://github.com/hashicorp/consul/issues/3676
This fixes a bug were registering an agent with a non-existent ACL token can prevent other
services registered with a good token from being synced to the server when using
`acl_enforce_version_8 = false`.
## Background
When `acl_enforce_version_8` is off the agent does not check the ACL token validity before
storing the service in its state.
When syncing a service registered with a missing ACL token we fall into the default error
handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255)
and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082)
without setting its Synced property to true like in the permission denied case.
This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random since we iterate on a map. So eventually
all services with good ACL tokens will be synced, this can however take some time and is influenced
by the cluster size, the bigger the slower because retries are less frequent.
Having a service in this state also prevent all further sync of checks as they are done after
the services.
## Changes
This change modify the sync process to continue even if there is an error.
This fixes the issue described above as well as making the sync more error tolerant: if the server repeatedly refuses
a service (the ACL token could have been deleted by the time the service is synced, the servers
were upgraded to a newer version that has more strict checks on the service definition...).
Then all services and check that can be synced will, and those that don't will be marked as errors in
the logs instead of blocking the whole process.