* Fix leaked Vault LifetimeRenewers
When the Vault CA Provider is reconfigured we do not stop the
LifetimeRenewers which can cause them to leak until the Consul processes
recycles. On Configure execute stopWatcher if it exists and is not nil
before starting a new renewal
* Add jitter before restarting the LifetimeWatcher
If we fail to login to Vault or our token is no longer valid we can
overwhelm a Vault instance with many requests very quickly by restarting
the LifetimeWatcher. Before restarting the LifetimeWatcher provide a
backoff time of 1 second or less.
* Use a retry.Waiter instead of RandomStagger
* changelog
* gofmt'd
* Swap out bool for atomic.Unit32 in test
* Provide some extra clarification in comment and changelog
* tlsutil: initial implementation of types/TLSVersion
tlsutil: add test for parsing deprecated agent TLS version strings
tlsutil: return TLSVersionInvalid with error
tlsutil: start moving tlsutil cipher suite lookups over to types/tls
tlsutil: rename tlsLookup to ParseTLSVersion, add cipherSuiteLookup
agent: attempt to use types in runtime config
agent: implement b.tlsVersion validation in config builder
agent: fix tlsVersion nil check in builder
tlsutil: update to renamed ParseTLSVersion and goTLSVersions
tlsutil: fixup TestConfigurator_CommonTLSConfigTLSMinVersion
tlsutil: disable invalid config parsing tests
tlsutil: update tests
auto_config: lookup old config strings from base.TLSMinVersion
auto_config: update endpoint tests to use TLS types
agent: update runtime_test to use TLS types
agent: update TestRuntimeCinfig_Sanitize.golden
agent: update config runtime tests to expect TLS types
* website: update Consul agent tls_min_version values
* agent: fixup TLS parsing and compilation errors
* test: fixup lint issues in agent/config_runtime_test and tlsutil/config_test
* tlsutil: add CHACHA20_POLY1305 cipher suites to goTLSCipherSuites
* test: revert autoconfig tls min version fixtures to old format
* types: add TLSVersions public function
* agent: add warning for deprecated TLS version strings
* agent: move agent config specific logic from tlsutil.ParseTLSVersion into agent config builder
* tlsutil(BREAKING): change default TLS min version to TLS 1.2
* agent: move ParseCiphers logic from tlsutil into agent config builder
* tlsutil: remove unused CipherString function
* agent: fixup import for types package
* Revert "tlsutil: remove unused CipherString function"
This reverts commit 6ca7f6f58d268e617501b7db9500113c13bae70c.
* agent: fixup config builder and runtime tests
* tlsutil: fixup one remaining ListenerConfig -> ProtocolConfig
* test: move TLS cipher suites parsing test from tlsutil into agent config builder tests
* agent: remove parseCiphers helper from auto_config_endpoint_test
* test: remove unused imports from tlsutil
* agent: remove resolved FIXME comment
* tlsutil: remove TODO and FIXME in cipher suite validation
* agent: prevent setting inherited cipher suite config when TLS 1.3 is specified
* changelog: add entry for converting agent config to TLS types
* agent: remove FIXME in runtime test, this is covered in builder tests with invalid tls9 value now
* tlsutil: remove config tests for values checked at agent config builder boundary
* tlsutil: remove tls version check from loadProtocolConfig
* tlsutil: remove tests and TODOs for logic checked in TestBuilder_tlsVersion and TestBuilder_tlsCipherSuites
* website: update search link for supported Consul agent cipher suites
* website: apply review suggestions for tls_min_version description
* website: attempt to clean up markdown list formatting for tls_min_version
* website: moar linebreaks to fix tls_min_version formatting
* Revert "website: moar linebreaks to fix tls_min_version formatting"
This reverts commit 38585927422f73ebf838a7663e566ac245f2a75c.
* autoconfig: translate old values for TLSMinVersion
* agent: rename var for translated value of deprecated TLS version value
* Update agent/config/deprecated.go
Co-authored-by: Dan Upton <daniel@floppy.co>
* agent: fix lint issue
* agent: fixup deprecated config test assertions for updated warning
Co-authored-by: Dan Upton <daniel@floppy.co>
Looks like something got munged at some point. Not sure how it slipped in, but my best guess is that because TestTxn_Apply_ACLDeny is marked flaky we didn't block merge because it failed.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* mogify needed pbcommon structs
* mogify needed pbconnect structs
* fix compilation errors and make config_translate_test pass
* add missing file
* remove redundant oss func declaration
* fix EnterpriseMeta to copy the right data for enterprise
* rename pbcommon package to pbcommongogo
* regenerate proto and mog files
* add missing mog files
* add pbcommon package
* pbcommon no mog
* fix enterprise meta code generation
* fix enterprise meta code generation (pbcommongogo)
* fix mog generation for gogo
* use `protoc-go-inject-tag` to inject tags
* rename proto package
* pbcommon no mog
* use `protoc-go-inject-tag` to inject tags
* add non gogo proto to make file
* fix proto get
This extends the acl.AllowAuthorizer with source of authority information.
The next step is to unify the AllowAuthorizer and ACLResolveResult structures; that will be done in a separate PR.
Part of #12481
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Introduces the capability to configure TLS differently for Consul's
listeners/ports (i.e. HTTPS, gRPC, and the internal multiplexed RPC
port) which is useful in scenarios where you may want the HTTPS or
gRPC interfaces to present a certificate signed by a well-known/public
CA, rather than the certificate used for internal communication which
must have a SAN in the form `server.<dc>.consul`.
Give the user a hint that they might be doing something wrong if their GET
request has a non-empty body, which can easily happen using curl's
--data-urlencode if specifying request type via "--request GET" rather than
"--get". See https://github.com/hashicorp/consul/issues/11471.
Currently the config_entry.go subsystem delegates authorization decisions via the ConfigEntry interface CanRead and CanWrite code. Unfortunately this returns a true/false value and loses the details of the source.
This is not helpful, especially since it the config subsystem can be more complex to understand, since it covers so many domains.
This refactors CanRead/CanWrite to return a structured error message (PermissionDenied or the like) with more details about the reason for denial.
Part of #12241
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* First pass for helper for bulk changes
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Convert ACLRead and ACLWrite to new form
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* AgentRead and AgentWRite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix EventWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* KeyRead, KeyWrite, KeyList
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* KeyRing
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* NodeRead NodeWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* OperatorRead and OperatorWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* PreparedQuery
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Intention partial
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix ServiceRead, Write ,etc
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Error check ServiceRead?
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix Sessionread/Write
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup snapshot ACL
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Error fixups for txn
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Add changelog
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup review comments
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Prior to this PR for the envoy xDS golden tests in the agent/xds package we
were hand-creating a proxycfg.ConfigSnapshot structure in the proper format for
input to the xDS generator. Over time this intermediate structure has gotten
trickier to build correctly for the various tests.
This PR proposes to switch to using the existing mechanism for turning a
structs.NodeService and a sequence of cache.UpdateEvent copies into a
proxycfg.ConfigSnapshot, as that is less error prone to construct and aligns
more with how the data arrives.
NOTE: almost all of this is in test-related code. I tried super hard to craft
correct event inputs to get the golden files to be the same, or similar enough
after construction to feel ok that i recreated the spirit of the original test
cases.
Improves tests from #12362
These tests try to setup the following concurrent scenario:
1. (goroutine 1) execute read RPC with index=0
2. (goroutine 1) get response from (1) @ index=10
3. (goroutine 1) execute read RPC with index=10 and block
4. (goroutine 2) WHILE (3) is blocking, start slamming the system with stray writes that will cause the WatchSet to wakeup
5. (goroutine 2) after doing all writes, shut down the reader above
6. (goroutine 1) stops reading, double checks that it only ever woke up once (from 1)
Minor fix for behavior in #12362
IsDefault sometimes returns true even if there was a proxy-defaults or service-defaults config entry that was consulted. This PR fixes that.
before:
$ go test ./agent/consul -run TestLeader_ReapOrLeftMember_IgnoreSelf
ok github.com/hashicorp/consul/agent/consul 21.147s
after:
$ go test ./agent/consul -run TestLeader_ReapOrLeftMember_IgnoreSelf
ok github.com/hashicorp/consul/agent/consul 5.402s
Previously we were using two different criteria to decide where to run a
test. The main `go-test` job would skip Vault tests based on the
presence of the `vault` binary, but the `test-connect-ca-providers` job
would run tests based on the name.
This led to a scenario where a test may never run in CI.
To fix this problem I added a name check to the function we use to skip
the test. This should ensure that any test that requires vault is named
correctly to be run as part of the `test-connect-ca-providers` job.
At the same time I relaxed the regex we use. I verified this runs the
same tests using `go test --list Vault`. I made this change because a
bunch of tests in `agent/connect/ca` used `Vault` in the name, without
the underscores. Instead of changing a bunch of test names, this seemed
easier.
With this approach, the worst case is that we run a few extra tests in
the `test-connect-ca-providers` job, which doesn't seem like a problem.
Starting from and extending the mechanism introduced in #12110 we can specially handle the 3 main special Consul RPC endpoints that react to many config entries in a single blocking query in Connect:
- `DiscoveryChain.Get`
- `ConfigEntry.ResolveServiceConfig`
- `Intentions.Match`
All of these will internally watch for many config entries, and at least one of those will likely be not found in any given query. Because these are blends of multiple reads the exact solution from #12110 isn't perfectly aligned, but we can tweak the approach slightly and regain the utility of that mechanism.
### No Config Entries Found
In this case, despite looking for many config entries none may be found at all. Unlike #12110 in this scenario we do not return an empty reply to the caller, but instead synthesize a struct from default values to return. This can be handled nearly identically to #12110 with the first 1-2 replies being non-empty payloads followed by the standard spurious wakeup suppression mechanism from #12110.
### No Change Since Last Wakeup
Once a blocking query loop on the server has completed and slept at least once, there is a further optimization we can make here to detect if any of the config entries that were present at specific versions for the prior execution of the loop are identical for the loop we just woke up for. In that scenario we can return a slightly different internal sentinel error and basically externally handle it similar to #12110.
This would mean that even if 20 discovery chain read RPC handling goroutines wakeup due to the creation of an unrelated config entry, the only ones that will terminate and reply with a blob of data are those that genuinely have new data to report.
### Extra Endpoints
Since this pattern is pretty reusable, other key config-entry-adjacent endpoints used by `agent/proxycfg` also were updated:
- `ConfigEntry.List`
- `Internal.IntentionUpstreams` (tproxy)
Many places in consul already treated node names case insensitively.
The state store indexes already do it, but there are a few places that
did a direct byte comparison which have now been corrected.
One place of particular consideration is ensureCheckIfNodeMatches
which is executed during snapshot restore (among other places). If a
node check used a slightly different casing than the casing of the node
during register then the snapshot restore here would deterministically
fail. This has been fixed.
Primary approach:
git grep -i "node.*[!=]=.*node" -- ':!*_test.go' ':!docs'
git grep -i '\[[^]]*member[^]]*\]
git grep -i '\[[^]]*\(member\|name\|node\)[^]]*\]' -- ':!*_test.go' ':!website' ':!ui' ':!agent/proxycfg/testing.go:' ':!*.md'
There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.
Config entry replication procedes in a particular sorted order by kind
and name.
Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.
This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.
This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.
Fixes#9319
The scenario that is NOT ADDRESSED by this PR is as follows:
1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web
If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
* add config watcher to the config package
* add logging to watcher
* add test and refactor to add WatcherEvent.
* add all API calls and fix a bug with recreated files
* add tests for watcher
* remove the unnecessary use of context
* Add debug log and a test for file rename
* use inode to detect if the file is recreated/replaced and only listen to create events.
* tidy ups (#1535)
* tidy ups
* Add tests for inode reconcile
* fix linux vs windows syscall
* fix linux vs windows syscall
* fix windows compile error
* increase timeout
* use ctime ID
* remove remove/creation test as it's a use case that fail in linux
* fix linux/windows to use Ino/CreationTime
* fix the watcher to only overwrite current file id
* fix linter error
* fix remove/create test
* set reconcile loop to 200 Milliseconds
* fix watcher to not trigger event on remove, add more tests
* on a remove event try to add the file back to the watcher and trigger the handler if success
* fix race condition
* fix flaky test
* fix race conditions
* set level to info
* fix when file is removed and get an event for it after
* fix to trigger handler when we get a remove but re-add fail
* fix error message
* add tests for directory watch and fixes
* detect if a file is a symlink and return an error on Add
* rename Watcher to FileWatcher and remove symlink deref
* add fsnotify@v1.5.1
* fix go mod
* fix flaky test
* Apply suggestions from code review
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
* fix a possible stack overflow
* do not reset timer on errors, rename OS specific files
* start the watcher when creating it
* fix data race in tests
* rename New func
* do not call handler when a remove event happen
* events trigger on write and rename
* fix watcher tests
* make handler async
* remove recursive call
* do not produce events for sub directories
* trim "/" at the end of a directory when adding
* add missing test
* fix logging
* add todo
* fix failing test
* fix flaking tests
* fix flaky test
* add logs
* fix log text
* increase timeout
* reconcile when remove
* check reconcile when removed
* fix reconcile move test
* fix logging
* delete invalid file
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix review comments
* fix is watched to properly catch a remove
* change test timeout
* fix test and rename id
* fix test to create files with different mod time.
* fix deadlock when stopping watcher
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix a deadlock when calling stop while emitting event is blocked
* make sure to close the event channel after the event loop is done
* add go doc
* back date file instead of sleeping
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* check error
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Otherwise when the query times out we might incorrectly send a value for
the reply, when we should send an empty reply.
Also document errNotFound and how to handle the result in that case.
The interface is documented as 'Sign will only return the leaf', and the other providers
only return the leaf. It seems like this was added during the initial implementation, so
is likely just something we missed. It doesn't break anything , but it does cause confusing cert chains
in the API response which could break something in the future.
* Parse datacenter from request
- Parse the value of the datacenter from the create/delete requests for AuthMethods and BindingRules so that they can be created in and deleted from the datacenters specified in the request.
By using the query results as state.
Blocking queries are efficient when the query matches some results,
because the ModifyIndex of those results, returned as queryMeta.Mindex,
will never change unless the items themselves change.
Blocking queries for non-existent items are not efficient because the
queryMeta.Index can (and often does) change when other entities are
written.
This commit reduces the churn of these queries by using a different
comparison for "has changed". Instead of using the modified index, we
use the existence of the results. If the previous result was "not found"
and the new result is still "not found", we know we can ignore the
modified index and continue to block.
This is done by setting the minQueryIndex to the returned
queryMeta.Index, which prevents the query from returning before a state
change is observed.
This test shows how blocking queries are not efficient when the query
returns no results. The test fails with 100+ calls instead of the
expected 2.
This test is still a bit flaky because it depends on the timing of the
writes. It can sometimes return 3 calls.
A future commit should fix this and make blocking queries even more
optimal for not-found results.
Follow the Go convention of accepting a small interface that documents
the methods used by the function.
Clarify the rules for implementing a query function passed to
blockingQuery.
This will both save on unnecessary raft operations as well as
unnecessarily incrementing the raft modify index of config entries
subject to no-op updates.
This commit syncs ENT changes to the OSS repo.
Original commit details in ENT:
```
commit 569d25f7f4578981c3801e6e067295668210f748
Author: FFMMM <FFMMM@users.noreply.github.com>
Date: Thu Feb 10 10:23:33 2022 -0800
Vendor fork net rpc (#1538)
* replace net/rpc w consul-net-rpc/net/rpc
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* replace msgpackrpc and go-msgpack with fork from mono repo
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* gofmt all files touched
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
```
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
The race detector noticed this initially in `TestAgentConfigWatcherSidecarProxy` but it is not restricted to just tests.
The two main changes here were:
- ensure that before we mutate the internal `agent/local` representation of a Service (for tags or VIPs) we clone those fields
- ensure that there's no function argument joint ownership between the caller of a function and the local state when calling `AddService`, `AddCheck`, and related using `copystructure` for now.
* First phase of refactoring PermissionDeniedError
Add extended type PermissionDeniedByACLError that captures information
about the accessor, particular permission type and the object and name
of the thing being checked.
It may be worth folding the test and error return into a single helper
function, that can happen at a later date.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Transparent proxies typically cannot dial upstreams in remote
datacenters. However, if their upstream configures a redirect to a
remote DC then the upstream targets will be in another datacenter.
In that sort of case we should use the WAN address for the passthrough.
Due to timing, a transparent proxy could have two upstreams to dial
directly with the same address.
For example:
- The orders service can dial upstreams shipping and payment directly.
- An instance of shipping at address 10.0.0.1 is deregistered.
- Payments is scaled up and scheduled to have address 10.0.0.1.
- The orders service receives the event for the new payments instance
before seeing the deregistration for the shipping instance. At this
point two upstreams have the same passthrough address and Envoy will
reject the listener configuration.
To disambiguate this commit considers the Raft index when storing
passthrough addresses. In the example above, 10.0.0.1 would only be
associated with the newer payments service instance.
Transparent proxies can set up filter chains that allow direct
connections to upstream service instances. Services that can be dialed
directly are stored in the PassthroughUpstreams map of the proxycfg
snapshot.
Previously these addresses were not being cleaned up based on new
service health data. The list of addresses associated with an upstream
service would only ever grow.
As services scale up and down, eventually they will have instances
assigned to an IP that was previously assigned to a different service.
When IP addresses are duplicated across filter chain match rules the
listener config will be rejected by Envoy.
This commit updates the proxycfg snapshot management so that passthrough
addresses can get cleaned up when no longer associated with a given
upstream.
There is still the possibility of a race condition here where due to
timing an address is shared between multiple passthrough upstreams.
That concern is mitigated by #12195, but will be further addressed
in a follow-up.