* Implement the ServerDiscovery.WatchServers gRPC endpoint
* Fix the ConnectCA.Sign gRPC endpoints metadata forwarding.
* Unify public gRPC endpoints around the public.TraceID function for request_id logging
Fixes#12048Fixes#12319
Regression introduced in #11693
Local reproduction steps:
1. `consul agent -dev`
2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'`
3. make note of the `X-Consul-Index` header returned
4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'`
5. Kill the above curl when it hangs with Ctrl-C
6. Repeat (2) and it should not hang.
Adds a new gRPC endpoint to get envoy bootstrap params. The new consul-dataplane service will use this
endpoint to generate an envoy bootstrap configuration.
Whenever autopilot updates its state it notifies Consul. That notification will then trigger Consul to extract out the ready server information. If the ready servers have changed, then an event will be published to notify any subscribers of the full set of ready servers.
All these ready server event things are contained within an autopilotevents package instead of the consul package to make importing them into the grpc related packages possible
Introduces a gRPC endpoint for signing Connect leaf certificates. It's also
the first of the public gRPC endpoints to perform leader-forwarding, so
establishes the pattern of forwarding over the multiplexed internal RPC port.
Previously we had 1 EventPublisher per state.Store. When a state store was closed/abandoned such as during a consul snapshot restore, this had the behavior of force closing subscriptions for that topic and evicting event snapshots from the cache.
The intention of this commit is to keep all that behavior. To that end, the shared EventPublisher now supports the ability to refresh a topic. That will perform the force close + eviction. The FSM upon abandoning the previous state.Store will call RefreshTopic for all the topics with events generated by the state.Store.
Just like standard upstreams the order of applicability in descending precedence:
1. caller's `service-defaults` upstream override for destination
2. caller's `service-defaults` upstream defaults
3. destination's `service-resolver` ConnectTimeout
4. system default of 5s
Co-authored-by: mrspanishviking <kcardenas@hashicorp.com>
If a service is automatically registered because it has a critical health check
for longer than deregister_critical_service_after, the error message will now
include:
- mention of the deregister_critical_service_after option
- the value of deregister_critical_service_after for that check
* Fixes a lint warning about t.Errorf not supporting %w
* Enable running autopilot on all servers
On the non-leader servers all they do is update the state and do not attempt any modifications.
* Fix the RPC conn limiting tests
Technically they were relying on racey behavior before. Now they should be reliable.
Adds a new gRPC service and endpoint to return the list of supported
consul dataplane features. The Consul Dataplane will use this API to
customize its interaction with that particular server.
Adds a new gRPC streaming endpoint (WatchRoots) that dataplane clients will
use to fetch the current list of active Connect CA roots and receive new
lists whenever the roots are rotated.
* add config watcher to the config package
* add logging to watcher
* add test and refactor to add WatcherEvent.
* add all API calls and fix a bug with recreated files
* add tests for watcher
* remove the unnecessary use of context
* Add debug log and a test for file rename
* use inode to detect if the file is recreated/replaced and only listen to create events.
* tidy ups (#1535)
* tidy ups
* Add tests for inode reconcile
* fix linux vs windows syscall
* fix linux vs windows syscall
* fix windows compile error
* increase timeout
* use ctime ID
* remove remove/creation test as it's a use case that fail in linux
* fix linux/windows to use Ino/CreationTime
* fix the watcher to only overwrite current file id
* fix linter error
* fix remove/create test
* set reconcile loop to 200 Milliseconds
* fix watcher to not trigger event on remove, add more tests
* on a remove event try to add the file back to the watcher and trigger the handler if success
* fix race condition
* fix flaky test
* fix race conditions
* set level to info
* fix when file is removed and get an event for it after
* fix to trigger handler when we get a remove but re-add fail
* fix error message
* add tests for directory watch and fixes
* detect if a file is a symlink and return an error on Add
* rename Watcher to FileWatcher and remove symlink deref
* add fsnotify@v1.5.1
* fix go mod
* do not reset timer on errors, rename OS specific files
* rename New func
* events trigger on write and rename
* add missing test
* fix flaking tests
* fix flaky test
* check reconcile when removed
* delete invalid file
* fix test to create files with different mod time.
* back date file instead of sleeping
* add watching file in agent command.
* fix watcher call to use new API
* add configuration and stop watcher when server stop
* add certs as watched files
* move FileWatcher to the agent start instead of the command code
* stop watcher before replacing it
* save watched files in agent
* add add and remove interfaces to the file watcher
* fix remove to not return an error
* use `Add` and `Remove` to update certs files
* fix tests
* close events channel on the file watcher even when the context is done
* extract `NotAutoReloadableRuntimeConfig` is a separate struct
* fix linter errors
* add Ca configs and outgoing verify to the not auto reloadable config
* add some logs and fix to use background context
* add tests to auto-config reload
* remove stale test
* add tests to changes to config files
* add check to see if old cert files still trigger updates
* rename `NotAutoReloadableRuntimeConfig` to `StaticRuntimeConfig`
* fix to re add both key and cert file. Add test to cover this case.
* review suggestion
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add check to static runtime config changes
* fix test
* add changelog file
* fix review comments
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* update flag description
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
* fix compilation error
* add static runtime config support
* fix test
* fix review comments
* fix log test
* Update .changelog/12329.txt
Co-authored-by: Dan Upton <daniel@floppy.co>
* transfer tests to runtime_test.go
* fix filewatcher Replace to not deadlock.
* avoid having lingering locks
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* split ReloadConfig func
* fix warning message
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* convert `FileWatcher` into an interface
* fix compilation errors
* fix tests
* extract func for adding and removing files
* add a coalesceTimer with a very small timer
* extract coaelsce Timer and add a shim for testing
* add tests to coalesceTimer fix to send remaining events
* set `coalesceTimer` to 1 Second
* support symlink, fix a nil deref.
* fix compile error
* fix compile error
* refactor file watcher rate limiting to be a Watcher implementation
* fix linter issue
* fix runtime config
* fix runtime test
* fix flaky tests
* fix compile error
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix agent New to return an error if File watcher New return an error
* quit timer loop if ctx is canceled
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
Co-authored-by: Daniel Upton <daniel@floppy.co>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
* Avoid doing list of /sys/mounts
From an internal ticket "Support standard "Vault namespace in the path" semantics for Connect Vault CA Provider"
Vault allows the namespace to be specified as a prefix in the path of
a PKI definition, but this doesn't currently work for
```IntermediatePKIPath``` specifications, because we attempt to list
all of the paths to check if ours is already defined. This doesn't
really work in a namespaced world.
This changes the IntermediatePKIPath code to follow the same pattern
as the root key, where we directly get the key rather than listing.
This code is difficult to write automated tests for because it relies
on features of Vault Enterprise, which isn't currently part of our
test framework, so it was tested manually.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* add changelog
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* add config watcher to the config package
* add logging to watcher
* add test and refactor to add WatcherEvent.
* add all API calls and fix a bug with recreated files
* add tests for watcher
* remove the unnecessary use of context
* Add debug log and a test for file rename
* use inode to detect if the file is recreated/replaced and only listen to create events.
* tidy ups (#1535)
* tidy ups
* Add tests for inode reconcile
* fix linux vs windows syscall
* fix linux vs windows syscall
* fix windows compile error
* increase timeout
* use ctime ID
* remove remove/creation test as it's a use case that fail in linux
* fix linux/windows to use Ino/CreationTime
* fix the watcher to only overwrite current file id
* fix linter error
* fix remove/create test
* set reconcile loop to 200 Milliseconds
* fix watcher to not trigger event on remove, add more tests
* on a remove event try to add the file back to the watcher and trigger the handler if success
* fix race condition
* fix flaky test
* fix race conditions
* set level to info
* fix when file is removed and get an event for it after
* fix to trigger handler when we get a remove but re-add fail
* fix error message
* add tests for directory watch and fixes
* detect if a file is a symlink and return an error on Add
* rename Watcher to FileWatcher and remove symlink deref
* add fsnotify@v1.5.1
* fix go mod
* do not reset timer on errors, rename OS specific files
* rename New func
* events trigger on write and rename
* add missing test
* fix flaking tests
* fix flaky test
* check reconcile when removed
* delete invalid file
* fix test to create files with different mod time.
* back date file instead of sleeping
* add watching file in agent command.
* fix watcher call to use new API
* add configuration and stop watcher when server stop
* add certs as watched files
* move FileWatcher to the agent start instead of the command code
* stop watcher before replacing it
* save watched files in agent
* add add and remove interfaces to the file watcher
* fix remove to not return an error
* use `Add` and `Remove` to update certs files
* fix tests
* close events channel on the file watcher even when the context is done
* extract `NotAutoReloadableRuntimeConfig` is a separate struct
* fix linter errors
* add Ca configs and outgoing verify to the not auto reloadable config
* add some logs and fix to use background context
* add tests to auto-config reload
* remove stale test
* add tests to changes to config files
* add check to see if old cert files still trigger updates
* rename `NotAutoReloadableRuntimeConfig` to `StaticRuntimeConfig`
* fix to re add both key and cert file. Add test to cover this case.
* review suggestion
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add check to static runtime config changes
* fix test
* add changelog file
* fix review comments
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* update flag description
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
* fix compilation error
* add static runtime config support
* fix test
* fix review comments
* fix log test
* Update .changelog/12329.txt
Co-authored-by: Dan Upton <daniel@floppy.co>
* transfer tests to runtime_test.go
* fix filewatcher Replace to not deadlock.
* avoid having lingering locks
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* split ReloadConfig func
* fix warning message
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* convert `FileWatcher` into an interface
* fix compilation errors
* fix tests
* extract func for adding and removing files
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
Co-authored-by: Daniel Upton <daniel@floppy.co>
This adds an aws-iam auth method type which supports authenticating to Consul using AWS IAM identities.
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
- `tls.incoming`: applies to the inbound mTLS targeting the public
listener on `connect-proxy` and `terminating-gateway` envoy instances
- `tls.outgoing`: applies to the outbound mTLS dialing upstreams from
`connect-proxy` and `ingress-gateway` envoy instances
Fixes#11966
* Fix leaked Vault LifetimeRenewers
When the Vault CA Provider is reconfigured we do not stop the
LifetimeRenewers which can cause them to leak until the Consul processes
recycles. On Configure execute stopWatcher if it exists and is not nil
before starting a new renewal
* Add jitter before restarting the LifetimeWatcher
If we fail to login to Vault or our token is no longer valid we can
overwhelm a Vault instance with many requests very quickly by restarting
the LifetimeWatcher. Before restarting the LifetimeWatcher provide a
backoff time of 1 second or less.
* Use a retry.Waiter instead of RandomStagger
* changelog
* gofmt'd
* Swap out bool for atomic.Unit32 in test
* Provide some extra clarification in comment and changelog
* tlsutil: initial implementation of types/TLSVersion
tlsutil: add test for parsing deprecated agent TLS version strings
tlsutil: return TLSVersionInvalid with error
tlsutil: start moving tlsutil cipher suite lookups over to types/tls
tlsutil: rename tlsLookup to ParseTLSVersion, add cipherSuiteLookup
agent: attempt to use types in runtime config
agent: implement b.tlsVersion validation in config builder
agent: fix tlsVersion nil check in builder
tlsutil: update to renamed ParseTLSVersion and goTLSVersions
tlsutil: fixup TestConfigurator_CommonTLSConfigTLSMinVersion
tlsutil: disable invalid config parsing tests
tlsutil: update tests
auto_config: lookup old config strings from base.TLSMinVersion
auto_config: update endpoint tests to use TLS types
agent: update runtime_test to use TLS types
agent: update TestRuntimeCinfig_Sanitize.golden
agent: update config runtime tests to expect TLS types
* website: update Consul agent tls_min_version values
* agent: fixup TLS parsing and compilation errors
* test: fixup lint issues in agent/config_runtime_test and tlsutil/config_test
* tlsutil: add CHACHA20_POLY1305 cipher suites to goTLSCipherSuites
* test: revert autoconfig tls min version fixtures to old format
* types: add TLSVersions public function
* agent: add warning for deprecated TLS version strings
* agent: move agent config specific logic from tlsutil.ParseTLSVersion into agent config builder
* tlsutil(BREAKING): change default TLS min version to TLS 1.2
* agent: move ParseCiphers logic from tlsutil into agent config builder
* tlsutil: remove unused CipherString function
* agent: fixup import for types package
* Revert "tlsutil: remove unused CipherString function"
This reverts commit 6ca7f6f58d268e617501b7db9500113c13bae70c.
* agent: fixup config builder and runtime tests
* tlsutil: fixup one remaining ListenerConfig -> ProtocolConfig
* test: move TLS cipher suites parsing test from tlsutil into agent config builder tests
* agent: remove parseCiphers helper from auto_config_endpoint_test
* test: remove unused imports from tlsutil
* agent: remove resolved FIXME comment
* tlsutil: remove TODO and FIXME in cipher suite validation
* agent: prevent setting inherited cipher suite config when TLS 1.3 is specified
* changelog: add entry for converting agent config to TLS types
* agent: remove FIXME in runtime test, this is covered in builder tests with invalid tls9 value now
* tlsutil: remove config tests for values checked at agent config builder boundary
* tlsutil: remove tls version check from loadProtocolConfig
* tlsutil: remove tests and TODOs for logic checked in TestBuilder_tlsVersion and TestBuilder_tlsCipherSuites
* website: update search link for supported Consul agent cipher suites
* website: apply review suggestions for tls_min_version description
* website: attempt to clean up markdown list formatting for tls_min_version
* website: moar linebreaks to fix tls_min_version formatting
* Revert "website: moar linebreaks to fix tls_min_version formatting"
This reverts commit 38585927422f73ebf838a7663e566ac245f2a75c.
* autoconfig: translate old values for TLSMinVersion
* agent: rename var for translated value of deprecated TLS version value
* Update agent/config/deprecated.go
Co-authored-by: Dan Upton <daniel@floppy.co>
* agent: fix lint issue
* agent: fixup deprecated config test assertions for updated warning
Co-authored-by: Dan Upton <daniel@floppy.co>
Looks like something got munged at some point. Not sure how it slipped in, but my best guess is that because TestTxn_Apply_ACLDeny is marked flaky we didn't block merge because it failed.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* mogify needed pbcommon structs
* mogify needed pbconnect structs
* fix compilation errors and make config_translate_test pass
* add missing file
* remove redundant oss func declaration
* fix EnterpriseMeta to copy the right data for enterprise
* rename pbcommon package to pbcommongogo
* regenerate proto and mog files
* add missing mog files
* add pbcommon package
* pbcommon no mog
* fix enterprise meta code generation
* fix enterprise meta code generation (pbcommongogo)
* fix mog generation for gogo
* use `protoc-go-inject-tag` to inject tags
* rename proto package
* pbcommon no mog
* use `protoc-go-inject-tag` to inject tags
* add non gogo proto to make file
* fix proto get
This extends the acl.AllowAuthorizer with source of authority information.
The next step is to unify the AllowAuthorizer and ACLResolveResult structures; that will be done in a separate PR.
Part of #12481
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Introduces the capability to configure TLS differently for Consul's
listeners/ports (i.e. HTTPS, gRPC, and the internal multiplexed RPC
port) which is useful in scenarios where you may want the HTTPS or
gRPC interfaces to present a certificate signed by a well-known/public
CA, rather than the certificate used for internal communication which
must have a SAN in the form `server.<dc>.consul`.
Give the user a hint that they might be doing something wrong if their GET
request has a non-empty body, which can easily happen using curl's
--data-urlencode if specifying request type via "--request GET" rather than
"--get". See https://github.com/hashicorp/consul/issues/11471.
Currently the config_entry.go subsystem delegates authorization decisions via the ConfigEntry interface CanRead and CanWrite code. Unfortunately this returns a true/false value and loses the details of the source.
This is not helpful, especially since it the config subsystem can be more complex to understand, since it covers so many domains.
This refactors CanRead/CanWrite to return a structured error message (PermissionDenied or the like) with more details about the reason for denial.
Part of #12241
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* First pass for helper for bulk changes
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Convert ACLRead and ACLWrite to new form
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* AgentRead and AgentWRite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix EventWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* KeyRead, KeyWrite, KeyList
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* KeyRing
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* NodeRead NodeWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* OperatorRead and OperatorWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* PreparedQuery
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Intention partial
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix ServiceRead, Write ,etc
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Error check ServiceRead?
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix Sessionread/Write
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup snapshot ACL
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Error fixups for txn
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Add changelog
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup review comments
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Prior to this PR for the envoy xDS golden tests in the agent/xds package we
were hand-creating a proxycfg.ConfigSnapshot structure in the proper format for
input to the xDS generator. Over time this intermediate structure has gotten
trickier to build correctly for the various tests.
This PR proposes to switch to using the existing mechanism for turning a
structs.NodeService and a sequence of cache.UpdateEvent copies into a
proxycfg.ConfigSnapshot, as that is less error prone to construct and aligns
more with how the data arrives.
NOTE: almost all of this is in test-related code. I tried super hard to craft
correct event inputs to get the golden files to be the same, or similar enough
after construction to feel ok that i recreated the spirit of the original test
cases.
Improves tests from #12362
These tests try to setup the following concurrent scenario:
1. (goroutine 1) execute read RPC with index=0
2. (goroutine 1) get response from (1) @ index=10
3. (goroutine 1) execute read RPC with index=10 and block
4. (goroutine 2) WHILE (3) is blocking, start slamming the system with stray writes that will cause the WatchSet to wakeup
5. (goroutine 2) after doing all writes, shut down the reader above
6. (goroutine 1) stops reading, double checks that it only ever woke up once (from 1)
Minor fix for behavior in #12362
IsDefault sometimes returns true even if there was a proxy-defaults or service-defaults config entry that was consulted. This PR fixes that.
before:
$ go test ./agent/consul -run TestLeader_ReapOrLeftMember_IgnoreSelf
ok github.com/hashicorp/consul/agent/consul 21.147s
after:
$ go test ./agent/consul -run TestLeader_ReapOrLeftMember_IgnoreSelf
ok github.com/hashicorp/consul/agent/consul 5.402s
Previously we were using two different criteria to decide where to run a
test. The main `go-test` job would skip Vault tests based on the
presence of the `vault` binary, but the `test-connect-ca-providers` job
would run tests based on the name.
This led to a scenario where a test may never run in CI.
To fix this problem I added a name check to the function we use to skip
the test. This should ensure that any test that requires vault is named
correctly to be run as part of the `test-connect-ca-providers` job.
At the same time I relaxed the regex we use. I verified this runs the
same tests using `go test --list Vault`. I made this change because a
bunch of tests in `agent/connect/ca` used `Vault` in the name, without
the underscores. Instead of changing a bunch of test names, this seemed
easier.
With this approach, the worst case is that we run a few extra tests in
the `test-connect-ca-providers` job, which doesn't seem like a problem.
Starting from and extending the mechanism introduced in #12110 we can specially handle the 3 main special Consul RPC endpoints that react to many config entries in a single blocking query in Connect:
- `DiscoveryChain.Get`
- `ConfigEntry.ResolveServiceConfig`
- `Intentions.Match`
All of these will internally watch for many config entries, and at least one of those will likely be not found in any given query. Because these are blends of multiple reads the exact solution from #12110 isn't perfectly aligned, but we can tweak the approach slightly and regain the utility of that mechanism.
### No Config Entries Found
In this case, despite looking for many config entries none may be found at all. Unlike #12110 in this scenario we do not return an empty reply to the caller, but instead synthesize a struct from default values to return. This can be handled nearly identically to #12110 with the first 1-2 replies being non-empty payloads followed by the standard spurious wakeup suppression mechanism from #12110.
### No Change Since Last Wakeup
Once a blocking query loop on the server has completed and slept at least once, there is a further optimization we can make here to detect if any of the config entries that were present at specific versions for the prior execution of the loop are identical for the loop we just woke up for. In that scenario we can return a slightly different internal sentinel error and basically externally handle it similar to #12110.
This would mean that even if 20 discovery chain read RPC handling goroutines wakeup due to the creation of an unrelated config entry, the only ones that will terminate and reply with a blob of data are those that genuinely have new data to report.
### Extra Endpoints
Since this pattern is pretty reusable, other key config-entry-adjacent endpoints used by `agent/proxycfg` also were updated:
- `ConfigEntry.List`
- `Internal.IntentionUpstreams` (tproxy)
Many places in consul already treated node names case insensitively.
The state store indexes already do it, but there are a few places that
did a direct byte comparison which have now been corrected.
One place of particular consideration is ensureCheckIfNodeMatches
which is executed during snapshot restore (among other places). If a
node check used a slightly different casing than the casing of the node
during register then the snapshot restore here would deterministically
fail. This has been fixed.
Primary approach:
git grep -i "node.*[!=]=.*node" -- ':!*_test.go' ':!docs'
git grep -i '\[[^]]*member[^]]*\]
git grep -i '\[[^]]*\(member\|name\|node\)[^]]*\]' -- ':!*_test.go' ':!website' ':!ui' ':!agent/proxycfg/testing.go:' ':!*.md'
There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.
Config entry replication procedes in a particular sorted order by kind
and name.
Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.
This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.
This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.
Fixes#9319
The scenario that is NOT ADDRESSED by this PR is as follows:
1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web
If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
* add config watcher to the config package
* add logging to watcher
* add test and refactor to add WatcherEvent.
* add all API calls and fix a bug with recreated files
* add tests for watcher
* remove the unnecessary use of context
* Add debug log and a test for file rename
* use inode to detect if the file is recreated/replaced and only listen to create events.
* tidy ups (#1535)
* tidy ups
* Add tests for inode reconcile
* fix linux vs windows syscall
* fix linux vs windows syscall
* fix windows compile error
* increase timeout
* use ctime ID
* remove remove/creation test as it's a use case that fail in linux
* fix linux/windows to use Ino/CreationTime
* fix the watcher to only overwrite current file id
* fix linter error
* fix remove/create test
* set reconcile loop to 200 Milliseconds
* fix watcher to not trigger event on remove, add more tests
* on a remove event try to add the file back to the watcher and trigger the handler if success
* fix race condition
* fix flaky test
* fix race conditions
* set level to info
* fix when file is removed and get an event for it after
* fix to trigger handler when we get a remove but re-add fail
* fix error message
* add tests for directory watch and fixes
* detect if a file is a symlink and return an error on Add
* rename Watcher to FileWatcher and remove symlink deref
* add fsnotify@v1.5.1
* fix go mod
* fix flaky test
* Apply suggestions from code review
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
* fix a possible stack overflow
* do not reset timer on errors, rename OS specific files
* start the watcher when creating it
* fix data race in tests
* rename New func
* do not call handler when a remove event happen
* events trigger on write and rename
* fix watcher tests
* make handler async
* remove recursive call
* do not produce events for sub directories
* trim "/" at the end of a directory when adding
* add missing test
* fix logging
* add todo
* fix failing test
* fix flaking tests
* fix flaky test
* add logs
* fix log text
* increase timeout
* reconcile when remove
* check reconcile when removed
* fix reconcile move test
* fix logging
* delete invalid file
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix review comments
* fix is watched to properly catch a remove
* change test timeout
* fix test and rename id
* fix test to create files with different mod time.
* fix deadlock when stopping watcher
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix a deadlock when calling stop while emitting event is blocked
* make sure to close the event channel after the event loop is done
* add go doc
* back date file instead of sleeping
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* check error
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Otherwise when the query times out we might incorrectly send a value for
the reply, when we should send an empty reply.
Also document errNotFound and how to handle the result in that case.
The interface is documented as 'Sign will only return the leaf', and the other providers
only return the leaf. It seems like this was added during the initial implementation, so
is likely just something we missed. It doesn't break anything , but it does cause confusing cert chains
in the API response which could break something in the future.
* Parse datacenter from request
- Parse the value of the datacenter from the create/delete requests for AuthMethods and BindingRules so that they can be created in and deleted from the datacenters specified in the request.
By using the query results as state.
Blocking queries are efficient when the query matches some results,
because the ModifyIndex of those results, returned as queryMeta.Mindex,
will never change unless the items themselves change.
Blocking queries for non-existent items are not efficient because the
queryMeta.Index can (and often does) change when other entities are
written.
This commit reduces the churn of these queries by using a different
comparison for "has changed". Instead of using the modified index, we
use the existence of the results. If the previous result was "not found"
and the new result is still "not found", we know we can ignore the
modified index and continue to block.
This is done by setting the minQueryIndex to the returned
queryMeta.Index, which prevents the query from returning before a state
change is observed.
This test shows how blocking queries are not efficient when the query
returns no results. The test fails with 100+ calls instead of the
expected 2.
This test is still a bit flaky because it depends on the timing of the
writes. It can sometimes return 3 calls.
A future commit should fix this and make blocking queries even more
optimal for not-found results.
Follow the Go convention of accepting a small interface that documents
the methods used by the function.
Clarify the rules for implementing a query function passed to
blockingQuery.
This will both save on unnecessary raft operations as well as
unnecessarily incrementing the raft modify index of config entries
subject to no-op updates.
This commit syncs ENT changes to the OSS repo.
Original commit details in ENT:
```
commit 569d25f7f4578981c3801e6e067295668210f748
Author: FFMMM <FFMMM@users.noreply.github.com>
Date: Thu Feb 10 10:23:33 2022 -0800
Vendor fork net rpc (#1538)
* replace net/rpc w consul-net-rpc/net/rpc
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* replace msgpackrpc and go-msgpack with fork from mono repo
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* gofmt all files touched
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
```
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
The race detector noticed this initially in `TestAgentConfigWatcherSidecarProxy` but it is not restricted to just tests.
The two main changes here were:
- ensure that before we mutate the internal `agent/local` representation of a Service (for tags or VIPs) we clone those fields
- ensure that there's no function argument joint ownership between the caller of a function and the local state when calling `AddService`, `AddCheck`, and related using `copystructure` for now.
* First phase of refactoring PermissionDeniedError
Add extended type PermissionDeniedByACLError that captures information
about the accessor, particular permission type and the object and name
of the thing being checked.
It may be worth folding the test and error return into a single helper
function, that can happen at a later date.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Transparent proxies typically cannot dial upstreams in remote
datacenters. However, if their upstream configures a redirect to a
remote DC then the upstream targets will be in another datacenter.
In that sort of case we should use the WAN address for the passthrough.
Due to timing, a transparent proxy could have two upstreams to dial
directly with the same address.
For example:
- The orders service can dial upstreams shipping and payment directly.
- An instance of shipping at address 10.0.0.1 is deregistered.
- Payments is scaled up and scheduled to have address 10.0.0.1.
- The orders service receives the event for the new payments instance
before seeing the deregistration for the shipping instance. At this
point two upstreams have the same passthrough address and Envoy will
reject the listener configuration.
To disambiguate this commit considers the Raft index when storing
passthrough addresses. In the example above, 10.0.0.1 would only be
associated with the newer payments service instance.
Transparent proxies can set up filter chains that allow direct
connections to upstream service instances. Services that can be dialed
directly are stored in the PassthroughUpstreams map of the proxycfg
snapshot.
Previously these addresses were not being cleaned up based on new
service health data. The list of addresses associated with an upstream
service would only ever grow.
As services scale up and down, eventually they will have instances
assigned to an IP that was previously assigned to a different service.
When IP addresses are duplicated across filter chain match rules the
listener config will be rejected by Envoy.
This commit updates the proxycfg snapshot management so that passthrough
addresses can get cleaned up when no longer associated with a given
upstream.
There is still the possibility of a race condition here where due to
timing an address is shared between multiple passthrough upstreams.
That concern is mitigated by #12195, but will be further addressed
in a follow-up.
Fixes#11876
This enforces that multiple xDS mutations are not issued on the same ADS connection at once, so that we can 100% control the order that they are applied. The original code made assumptions about the way multiple in-flight mutations were applied on the Envoy side that was incorrect.
This commit makes two changes to the validation.
Previously we would call this validation in GenerateRoot, which happens
both on initialization (when a follower becomes leader), and when a
configuration is updated. We only want to do this validation during
config update so the logic was moved to the UpdateConfiguration
function.
Previously we would compare the config values against the actual cert.
This caused problems when the cert was created manually in Vault (not
created by Consul). Now we compare the new config against the previous
config. Using a already created CA cert should never error now.
Adding the key bit and types to the config should only error when
the previous values were not the defaults.
These two tests require debug logging enabled, because they look for log lines.
Also switched to testify assertions because the previous errors were not clear.
This test found a bug in the secondary. We were appending the root cert
to the PEM, but that cert was already appended. This was failing
validation in Vault here:
https://github.com/hashicorp/vault/blob/sdk/v0.3.0/sdk/helper/certutil/types.go#L329
Previously this worked because self signed certs have the same
SubjectKeyID and AuthorityKeyID. So having the same self-signed cert
repeated doesn't fail that check.
However with an intermediate that is not self-signed, those values are
different, and so we fail the check. A test I added in a previous commit
should show that this continues to work with self-signed root certs as
well.
This is safer than embedding two interface because there are a number of
places where we check the concrete type. If we check the concrete type
on the top-level interface it will fail. So instead expose the
ACLIdentity from a method.
This change allows us to remove one of the last remaining duplicate
resolve token methods (Server.ResolveToken).
With this change we are down to only 2, where the second one also
handles setting the default EnterpriseMeta from the token.
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream,
prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field
with the full list of currently subscribed items, instead of simply omitting it
to infer that it wanted everything (which is what wildcard mode means).
This upstream issue was filed in envoyproxy/envoy#16063 and fixed in
envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later
versions (later refactored in envoyproxy/envoy#16855).
This PR conditionally forces LDS/CDS to be wildcard-only even when the
connected Envoy requests a non-wildcard subscription, but only does so on
versions prior to `1.19.0`, as we should not need to do this on later versions.
This fixes the failure case as described here: #11833 (comment)
Co-authored-by: Huan Wang <fredwanghuan@gmail.com>
Now that ACLResolver is embedded we don't need ResolveTokenToIdentity on
Client and Server.
Moving ResolveTokenAndDefaultMeta to ACLResolver removes the duplicate
implementation.
set -euo pipefail
unset CDPATH
cd "$(dirname "$0")"
for f in $(git grep '\brequire := require\.New(' | cut -d':' -f1 | sort -u); do
echo "=== require: $f ==="
sed -i '/require := require.New(t)/d' $f
# require.XXX(blah) but not require.XXX(tblah) or require.XXX(rblah)
sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\([^tr]\)/require.\1(t,\2/g' $f
# require.XXX(tblah) but not require.XXX(t, blah)
sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\(t[^,]\)/require.\1(t,\2/g' $f
# require.XXX(rblah) but not require.XXX(r, blah)
sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\(r[^,]\)/require.\1(t,\2/g' $f
gofmt -s -w $f
done
for f in $(git grep '\bassert := assert\.New(' | cut -d':' -f1 | sort -u); do
echo "=== assert: $f ==="
sed -i '/assert := assert.New(t)/d' $f
# assert.XXX(blah) but not assert.XXX(tblah) or assert.XXX(rblah)
sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\([^tr]\)/assert.\1(t,\2/g' $f
# assert.XXX(tblah) but not assert.XXX(t, blah)
sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\(t[^,]\)/assert.\1(t,\2/g' $f
# assert.XXX(rblah) but not assert.XXX(r, blah)
sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\(r[^,]\)/assert.\1(t,\2/g' $f
gofmt -s -w $f
done
The gist here is that now we use a value-type struct proxycfg.UpstreamID
as the map key in ConfigSnapshot maps where we used to use "upstream
id-ish" strings. These are internal only and used just for bidirectional
trips through the agent cache keyspace (like the discovery chain target
struct).
For the few places where the upstream id needs to be projected into xDS,
that's what (proxycfg.UpstreamID).EnvoyID() is for. This lets us ALWAYS
inject the partition and namespace into these things without making
stuff like the golden testdata diverge.
Remove some unnecessary comments around query_blocking metric. The only
line that needs any comments in the atomic decrement.
Cleanup the block and return comments and logic. The old comment about
AbandonCh may have been relevant before, but it is expected behaviour
now.
The logic was simplified by inverting the err condition.
This helps keep the logic in blockingQuery more focused. In the future we
may have a separate struct for RPC queries which may allow us to move this
off of Server.
This safeguard should be safe to apply in general. We are already
applying it to non-blocking queries that call blockingQuery, so it
should be fine to apply it to others.
To remove the TODO, and make it more readable.
In general this reduces the scope of variables, making them easier to reason about.
It also introduces more early returns so that we can see the flow from the structure
of the function.
* xds: refactor ingress listener SDS configuration
* xds: update resolveListenerSDS call args in listeners_test
* ingress: add TLS min, max and cipher suites to GatewayTLSConfig
* xds: implement envoyTLSVersions and envoyTLSCipherSuites
* xds: merge TLS config
* xds: configure TLS parameters with ingress TLS context from leaf
* xds: nil check in resolveListenerTLSConfig validation
* xds: nil check in makeTLSParameters* functions
* changelog: add entry for TLS params on ingress config entries
* xds: remove indirection for TLS params in TLSConfig structs
* xds: return tlsContext, nil instead of ambiguous err
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
* xds: switch zero checks to types.TLSVersionUnspecified
* ingress: add validation for ingress config entry TLS params
* ingress: validate listener TLS config
* xds: add basic ingress with TLS params tests
* xds: add ingress listeners mixed TLS min version defaults precedence test
* xds: add more explicit tests for ingress listeners inheriting gateway defaults
* xds: add test for single TLS listener on gateway without TLS defaults
* xds: regen golden files for TLSVersionInvalid zero value, add TLSVersionAuto listener test
* types/tls: change TLSVersion to string
* types/tls: update TLSCipherSuite to string type
* types/tls: implement validation functions for TLSVersion and TLSCipherSuites, make some maps private
* api: add TLS params to GatewayTLSConfig, add tests
* api: add TLSMinVersion to ingress gateway config entry test JSON
* xds: switch to Envoy TLS cipher suite encoding from types package
* xds: fixup validation for TLSv1_3 min version with cipher suites
* add some kitchen sink tests and add a missing struct tag
* xds: check if mergedCfg.TLSVersion is in TLSVersionsWithConfigurableCipherSuites
* xds: update connectTLSEnabled comment
* xds: remove unsued resolveGatewayServiceTLSConfig function
* xds: add makeCommonTLSContextFromLeafWithoutParams
* types/tls: add LessThan comparator function for concrete values
* types/tls: change tlsVersions validation map from string to TLSVersion keys
* types/tls: remove unused envoyTLSCipherSuites
* types/tls: enable chacha20 cipher suites for Consul agent
* types/tls: remove insecure cipher suites from allowed config
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 and TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 are both explicitly listed as insecure and disabled in the Go source.
Refs https://cs.opensource.google/go/go/+/refs/tags/go1.17.3:src/crypto/tls/cipher_suites.go;l=329-330
* types/tls: add ValidateConsulAgentCipherSuites function, make direct lookup map private
* types/tls: return all unmatched cipher suites in validation errors
* xds: check that Envoy API value matching TLS version is found when building TlsParameters
* types/tls: check that value is found in map before appending to slice in MarshalEnvoyTLSCipherSuiteStrings
* types/tls: cast to string rather than fmt.Printf in TLSCihperSuite.String()
* xds: add TLSVersionUnspecified to list of configurable cipher suites
* structs: update note about config entry warning
* xds: remove TLS min version cipher suite unconfigurable test placeholder
* types/tls: update tests to remove assumption about private map values
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
`newIntermediate` is always equal to `needsNewIntermediate`, so we can
remove the extra variable and use the original directly.
Also remove the `activeRoot.ID != newActiveRoot.ID` case from an if,
because that case is already checked above, and `needsNewIntermediate` will
already be true in that case.
This condition now reads a lot better:
> Persist a new root if we did not have one before, or if generated a new intermediate.
In the previous commit the single use of this storedRoot was removed.
In this commit the original objective is completed. The
Provider.ActiveRoot is being removed because
1. the secondary should get the active root from the Consul primary DC,
not the provider, so that secondary DCs do not need to communicate
with a provider instance in a different DC.
2. so that the Provider.ActiveRoot interface can be changed without
impacting other code paths.
This method had only one caller, which always looked for the active
root. This commit moves the lookup into the method to reduce the logic
in the one caller.
This is being done in preparation for a larger change. Keeping this
separate so it is easier to see.
The `storedRootID != primaryRoots.ActiveRootID` is being removed because
these can never be different.
The `storedRootID` comes from `provider.ActiveRoot`, the
`primaryRoots.ActiveRootID` comes from the store `CARoot` from the
primary. In both cases the source of the data is the primary DC.
Technically they could be different if someone modified the provider
outside of Consul, but that would break many things, so is not a
supported flow.
If these were out of sync because of ordering of events then the
secondary will soon receive an update to `primaryRoots` and everything
will be sorted out again.
ActiveRoot should not be called from the secondary DC, because there
should not be a requirement to run the same Vault instance in a
secondary DC. SignIntermediate is called in a secondary DC, so it should
not call ActiveRoot
We would also like to change the interface of ActiveRoot so that we can
support using an intermediate cert as the primary CA in Consul. In
preparation for making that change I am reducing the number of calls to
ActiveRoot, so that there are fewer code paths to modify when the
interface changes.
This change required a change to the mockCAServerDelegate we use in
tests. It was returning the RootCert for SignIntermediate, but that is
not an accurate fake of production. In production this would also be a
separate cert.
Immediately above this line we are already appending the full list of
intermediates. The `provider.ActiveIntermediate` MUST be in this list of
intermediates because it must be available to all the other non-leader
Servers. If it was not in this list of intermediates then any proxy
that received data from a non-leader would have the wrong certs.
This is being removed now because we are planning on changing the
`Provider.ActiveIntermediate` interface, and removing these extra calls ahead of
time helps make that change easier.
Using tracing and cpu profiling I found that the majority of the time in
these test cases is spent generating a private key. We really don't need
separate private keys, so we can generate only one and use it for all
cases.
With this change the test runs much faster.
Fix the name to match the function it is testing
Remove unused code
Fix the signature, instead of returning (error, string) which should be (string, error)
accept a testing.T to emit errors.
Handle the error from encode.
Update the `/agent/check/deregister/` API endpoint to return a 404
HTTP response code when an attempt is made to de-register a check ID
that does not exist on the agent.
This brings the behavior of /agent/check/deregister/ in line with the
behavior of /agent/service/deregister/ which was changed in #10632 to
similarly return a 404 when de-registering non-existent services.
Fixes#5821
* clone the service under lock to avoid a data race
* add change log
* create a struct and copy the pointer to mutate it to avoid a data race
* fix failing test
* revert added space
* add comments, to clarify the data race.
The only function passed to SnapshotRPC today always returns a nil error, so there's no
way to exercise this bug in practice. This change is being made for correctness so that
it doesn't become a problem in the future, if we ever pass a different function to
SnapshotRPC.
Error messages related to service and check operations previously included
the following substrings:
- service %q
- check %q
From this error message, it isn't clear that the expected field is the ID for
the entity, not the name. For example, if the user has a service named test,
the error message would read 'Unknown service "test"'. This is misleading -
a service with that *name* does exist, but not with that *ID*.
The substrings above have been modified to make it clear that ID is needed,
not name:
- service with ID %q
- check with ID %q
Previously we could get into a state where discovery chain entries were
not cleaned up after the associated watch was cancelled. These changes
add handling for that case where stray chain references are encountered.
When a URL path is not found, return a non-empty message with the 404 status
code to help the user understand what went wrong. If the URL path was not
prefixed with '/v1/', suggest that may be the cause of the problem (which is a
common mistake).
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Dan Upton <daniel@floppy.co>
This query has been incorrectly querying by accessor ID since New ACLs
were added. However, the legacy token compat allowed this to continue to
work, since it made a fallback query for the anonymousToken ID.
PR #11184 removed this legacy token query, which means that the query by
accessor ID is now the only check for the anonymous token's existence.
This PR updates the GetBySecret call to use the secret ID of the token.
These helper functions actually end up hiding important setup details
that should be visible from the test case. We already have a convenient
way of setting this config when calling newTestServerWithConfig.
I suspect one problem was that we set structs.IntermediateCertRenewInterval to 1ms, which meant
that in some cases the intermediate could renew before we stored the original value.
Another problem was that the 'wait for intermediate' loop was calling the provider.ActiveIntermediate,
but the comparison needs to use the RPC endpoint to accurately represent a user request. So
changing the 'wait for' to use the state store ensures we don't race.
Also moves the patching into a separate function.
Removes the addition of ca.CertificateTimeDriftBuffer as part of calculating halfTime. This was added
in a previous commit to attempt to fix the flake, but it did not appear to fix the problem. Adding the
time here was making the tests fail when using the shared patch
function. It's not clear to me why, but there's no reason we should be
including this time in the halfTime calculation.
Use the new verifyLearfCert to show the cert verifies with intermediates
from both sources. This required using the RPC interface so that the
leaf pem was constructed correctly.
Add IndexedCARoots.Active since that is a common operation we see in a
few places.
Previously we had a couple copies that reproduced the FSM operation.
These copies introduce risk that the test does not accurately match
production.
This PR removes the test versions of the FSM operation, and exports the
real production FSM operation so that it can be used in tests.
The consul provider tests did need to change because of this. Previously
we would return a hardcoded value of 2, but in production this value is
always incremented.
Failing over to a partition is more siimilar to failing over to another
datacenter than it is to failing over to a namespace. In a future
release we should update how localities for failover are specified. We
should be able to accept a list of localities which can include both
partition and datacenter.
* Add partition fields to targets like service route destinations
* Update validation to prevent cross-DC + cross-partition references
* Handle partitions when reading config entries for disco chain
* Encode partition in compiled targets
Fixes a bug whereby servers present in multiple network areas would be
properly segmented in the Router, but not in the gRPC mirror. This would
lead servers in the current datacenter leaving from a network area
(possibly during the network area's removal) from deleting their own
records that still exist in the standard WAN area.
The gRPC client stack uses the gRPC server tracker to execute all RPCs,
even those targeting members of the current datacenter (which is unlike
the net/rpc stack which has a bypass mechanism).
This would manifest as a gRPC method call never opening a socket because
it would block forever waiting for the current datacenter's pool of
servers to be non-empty.
Given that we do not allow wildcard partitions in intentions, no one ixn
can override the DefaultAllow setting. Only the default ACL policy
applies across all partitions.
This table purposefully does not index by partition/namespace. It's a
global view into all service names.
This table is intended to replace the current serviceListTxn watch in
intentionTopologyTxn. For cross-partition transparent proxying we need
to be able to calculate upstreams from intentions in any partition. This
means that the existing serviceListTxn function is insufficient since
it's scoped to a partition.
Moving away from that function is also beneficial because it watches the
main "services" table, so watchers will wake up when any instance is
registered or deregistered.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
* convert `IndexID` of `session_checks` table
* convert `indexSession` of `session_checks` table
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* fix oss linter
* fix review comments
* remove partition for Checks as it's always use the session partition
* fix tests
* fix tests
* do not namespace nodeChecks index
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Cross port of ent #1383 "Reject non-default datacenter when making partitioned ACLs"
On the OSS side this is a minor refactor to add some more checks that are only applicable to enterprise code.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
The test added in this commit shows the problem. Previously the
SigningKeyID was set to the RootCert not the local leaf signing cert.
This same bug was fixed in two other places back in 2019, but this last one was
missed.
While fixing this bug I noticed I had the same few lines of code in 3
places, so I extracted a new function for them.
There would be 4 places, but currently the InitializeCA flow sets this
SigningKeyID in a different way, so I've left that alone for now.
While working on the CA system it is important to be able to run all the
tests related to the system, without having to wait for unrelated tests.
There are many slow and unrelated tests in agent/consul, so we need some
way to filter to only the relevant tests.
This PR renames all the CA system related tests to start with either
`TestCAMananger` for tests of internal operations that don't have RPC
endpoint, or `TestConnectCA` for tests of RPC endpoints. This allows us
to run all the test with:
go test -run 'TestCAMananger|TestConnectCA' ./agent/consul
The test naming follows an undocumented convention of naming tests as
follows:
Test[<struct name>_]<function name>[_<test case description>]
I tried to always keep Primary/Secondary at the end of the description,
and _Vault_ has to be in the middle because of our regex to run those
tests as a separate CI job.
You may notice some of the test names changed quite a bit. I did my best
to identify the underlying method being tested, but I may have been
slightly off in some cases.
As a method on the struct type this would not be safe to call without first checking
c.isIntermediateUsedToSignLeaf.
So for now, move this logic to the CAMananger, so that it is always correct.
We were not adding the local signing cert to the CARoot. This commit
fixes that bug, and also adds support for fixing existing CARoot on
upgrade.
Also update the tests for both primary and secondary to be more strict.
Check the SigningKeyID is correct after initialization and rotation.
Validation was added on the config entry kind since that is called when
validating config entries to bootstrap via agent configuration and when
applying entries via the config RPC endpoint.
Previously we believe it was necessary for all code that required ports
to use freeport to prevent conflicts.
https://github.com/dnephin/freeport-test shows that it is actually save
to use port 0 (`127.0.0.1:0`) as long as it is passed directly to
`net.Listen`, and the listener holds the port for as long as it is
needed.
This works because freeport explicitly avoids the ephemeral port range,
and port 0 always uses that range. As you can see from the test output
of https://github.com/dnephin/freeport-test, the two systems never use
overlapping ports.
This commit converts all uses of freeport that were being passed
directly to a net.Listen to use port 0 instead. This allows us to remove
a bit of wrapping we had around httptest, in a couple places.
In d2ab767fef raftApply was changed to handle this check in
a single place, instad of having every caller check it. It looks like these few places
were missed when I did that clean up.
This commit removes the remaining resp.(error) checks, since they are all no-ops now.
This function is only ever called from operations that have already acquired the state lock, so checking
the value of state can never fail.
This change is being made in preparation for splitting out a separate type for the secondary logic. The
state can't easily be shared, so really only the expored top-level functions should acquire the 'state lock'.
This commit removes the actingSecondaryCA field, and removes the stateLock around it. This field
was acting as a proxy for providerRoot != nil, so replace it with that check instead.
The two methods which called secondarySetCAConfigured already set the state, so checking the
state again at this point will not catch runtime errors (only programming errors, which we can catch with tests).
In general, handling state transitions should be done on the "entrypoint" methods where execution starts, not
in every internal method.
This is being done to remove some unnecessary references to c.state, in preparations for extracting
types for primary/secondary.
This makes it easier to fake, which will allow me to use the ConsulProvider as
an 'external PKI' to test a customer setup where the actual root CA is not
the root we use for the Consul CA.
Replaces a call to the state store to fetch the clusterID with the
clusterID field already available on the built-in provider.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* remove partition for Checks as it's always use the session partition
* partition sessions index id table
* fix rebase issues
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
* convert `IndexID` of `session_checks` table
* convert `indexSession` of `session_checks` table
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* fix oss linter
* fix review comments
* remove partition for Checks as it's always use the session partition
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* Support vault auth methods for the Vault connect CA provider
* Rotate the token (re-authenticate to vault using auth method) when the token can no longer be renewed
The TrustDomain is populated from the Host() method which includes the
hard-coded "consul" domain. This means that despite having an empty
cluster ID, the TrustDomain won't be empty.
There are two restrictions:
- Writes from the primary DC which explicitly target a secondary DC.
- Writes to a secondary DC that do not explicitly target the primary DC.
The first restriction is because the config entry is not supported in
secondary datacenters.
The second restriction is to prevent the scenario where a user writes
the config entry to a secondary DC, the write gets forwarded to the
primary, but then the config entry does not apply in the secondary.
This makes the scope more explicit.
The duo of `makeUpstreamFilterChainForDiscoveryChain` and `makeListenerForDiscoveryChain` were really hard to reason about, and led to concealing a bug in their branching logic. There were several issues here:
- They tried to accomplish too much: determining filter name, cluster name, and whether RDS should be used.
- They embedded logic to handle significantly different kinds of upstream listeners (passthrough, prepared query, typical services, and catch-all)
- They needed to coalesce different data sources (Upstream and CompiledDiscoveryChain)
Rather than handling all of those tasks inside of these functions, this PR pulls out the RDS/clusterName/filterName logic.
This refactor also fixed a bug with the handling of [UpstreamDefaults](https://www.consul.io/docs/connect/config-entries/service-defaults#defaults). These defaults get stored as UpstreamConfig in the proxy snapshot with a DestinationName of "*", since they apply to all upstreams. However, this wildcard destination name must not be used when creating the name of the associated upstream cluster. The coalescing logic in the original functions here was in some situations creating clusters with a `*.` prefix, which is not a valid destination.
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.
This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
Currently getCARoots could return an empty object with an empty trust
domain before the CA is initialized. This commit returns an error while
there is no CA config or no trust domain.
There could be a CA config and no trust domain because the CA config can
be created in InitializeCA before initialization succeeds.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused func `prefixIndexFromServiceNameAsString`
* fix test error check
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* partition kvs indexID table
* add `partitionedIndexEntryName` in oss for test purpose
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* remove entmeta reference from oss
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Previously secondaryInitialize would return nil in this case, which prevented the
deferred initialize from happening, and left the CA in an uninitialized state until a config
update or root rotation.
To fix this I extracted the common parts into the delegate implementation. However looking at this
again, it seems like the handling in secondaryUpdateRoots is impossible, because that function
should never be called before the secondary is initialzied. I beleive we can remove some of that
logic in a follow up.
These two fields do not appear to be used anywhere. We use the structs.ACLPolicy ID in the
ACLResolver cache, but the acl.Policy ID and revision are not used.
* Support Vault Namespaces explicitly in CA config
If there is a Namespace entry included in the Vault CA configuration,
set it as the Vault Namespace on the Vault client
Currently the only way to support Vault namespaces in the Consul CA
config is by doing one of the following:
1) Set the VAULT_NAMESPACE environment variable which will be picked up
by the Vault API client
2) Prefix all Vault paths with the namespace
Neither of these are super pleasant. The first requires direct access
and modification to the Consul runtime environment. It's possible and
expected, not super pleasant.
The second requires more indepth knowledge of Vault and how it uses
Namespaces and could be confusing for anyone without that context. It
also infers that it is not supported
* Add changelog
* Remove fmt.Fprint calls
* Make comment clearer
* Add next consul version to website docs
* Add new test for default configuration
* go mod tidy
* Add skip if vault not present
* Tweak changelog text
* Remove some usage of md5 from the system
OSS side of https://github.com/hashicorp/consul-enterprise/pull/1253
This is a potential security issue because an attacker could conceivably manipulate inputs to cause persistence files to collide, effectively deleting the persistence file for one of the colliding elements.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* add root_cert_ttl option for consul connect, vault ca providers
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
* add changelog, pr feedback
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* Update .changelog/11428.txt, more docs
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update website/content/docs/agent/options.mdx
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
if the provided value is empty string then the client services
(DNS, HTTP, HTTPS, GRPC) are not listening and the user is not notified
in any way about what's happening.
Also, since a not provided client_addr defaults to 127.0.0.1, we make sure
we are not getting unwanted warnings
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
This will behave the way we handle SNI and SPIFFE IDs, where the default
partition is excluded.
Excluding the default ensures that don't attempt to compare default.dc2
to dc2 in OSS.
The api module has decoding functions that rely on 'kind' being present
of payloads. This is so that we can decode into the appropriate api type
for the config entry.
This commit ensures that a static kind is marshalled in responses from
Consul's api endpoints so that the api module can decode them.
These labels should be set by whatever process scrapes Consul (for
prometheus), or by the agent that receives them (for datadog/statsd).
We need to remove them here because the labels are part of the "metric
key", so we'd have to pre-declare the metrics with the labels. We could
do that, but that is extra work for labels that should be added from
elsewhere.
Also renames the closure to be more descriptive.
Prometheus scrapes metrics from each process, so when leadership transfers to a different node
the previous leader would still be reporting the old cached value.
By setting NaN, I believe we should zero-out the value, so that prometheus should only consider the
value from the new leader.
Emit the metric immediately so that after restarting an agent, the new expiry time will be
emitted. This is particularly important when this metric is being monitored, because we want
the alert to resovle itself immediately.
Also fixed a bug that was exposed in one of these metrics. The CARoot can be nil, so we have
to handle that case.
TestSubscribeBackend_IntegrationWithServer_DeliversAllMessages has been
flaking a few times. This commit cleans up the test a bit, and improves
the failure output.
I don't believe this actually fixes the flake, but I'm not able to
reproduce it reliably.
The failure appears to be that the event with Port=0 is being sent in
both the snapshot and as the first event after the EndOfSnapshot event.
Hopefully the improved logging will show us if these are really
duplicate events, or actually different events with different indexes.
This commit updates mesh gateway watches for cross-partitions
communication.
* Mesh gateways are keyed by partition and datacenter.
* Mesh gateways will now watch gateways in partitions that export
services to their partition.
* Mesh gateways in non-default partitions will not have cross-datacenter
watches. They are not involved in traditional WAN federation.
partitionAuthorizer.config can be nil if it wasn't provided on calls to
newPartitionAuthorizer outside of the ACLResolver. This usage happens
often in tests.
This commit: adds a nil check when the config is going to be used,
updates non-test usage of NewPolicyAuthorizerWithDefaults to pass a
non-nil config, and dettaches setEnterpriseConf from the ACLResolver.
When issuing cross-partition service discovery requests, ACL filtering
often checks for NodeRead privileges. This is because the common return
type is a CheckServiceNode, which contains node data.
Previously the datacenter of the gateway was the key identifier, now it
is the datacenter and partition.
When dialing services in other partitions or datacenters we now watch
the appropriate partition.
useInDatacenter was used to determine whether the mesh gateway mode of
the upstream should be returned in the discovery chain target. This
commit makes it so that the mesh gateway mode is returned every time,
and it is up to the caller to decide whether mesh gateways should be
watched or used.
Existing config entries prefixed by service- are specific to individual
services. Since this config entry applies to partitions it is being
renamed.
Additionally, the Partition label was changed to Name because using
Partition at the top-level and in the enterprise meta was leading to the
enterprise meta partition being dropped by msgpack.
The code for this was already removed, which suggests this is not actually testing what it claims.
I'm guessing these are still resolving because the tokens are converted to non-legacy tokens?
It seems like this was missing. Previously this was only called by init of ACLs during an upgrade.
Now that legacy ACLs are removed, nothing was calling stop.
Also remove an unused method from client.
To make it more clear which methods are necessary for each scenario. This can
also prevent problems which force all DCs to use the same Vault instance, which
is currently a problem.
This function is only run when the CAManager is a primary. Extracting this function
makes it clear which parts of UpdateConfiguration are run only in the primary and
also makes the cleanup logic simpler. Instead of both a defer and a local var we
can call the cleanup function in two places.
This commit renames functions to use a consistent pattern for identifying the functions that
can only be called when the Manager is run as the primary or secondary.
This is a step toward eventually creating separate types and moving these methods off of CAManager.
Add changelog to document what changed.
Add entry to telemetry section of the website to document what changed
Add docs to the usagemetric endpoint to help document the metrics in code
This commit two test failures:
1. Remove check for "in legacy ACL mode", the actual upgrade will be removed in a following commit.
2. Remove the early WaitForLeader in dc2, because with it the test was
failing with ACL not found.
TestAgentLeaks_Server was reporting a goroutine leak without this. Not sure if it would actually
be a leak in production or if this is due to the test setup, but seems easy enough to call it
this way until we remove legacyACLTokenUpgrade.
We no long need to read the acl serf tag, because servers are always either ACL enabled or
ACL disabled.
We continue to write the tag so that during an upgarde older servers will see the tag.
This commit two test failures:
1. Remove check for "in legacy ACL mode", the actual upgrade will be removed in a following commit.
2. Use the root token in WaitForLeader, because without it the test was
failing with ACL not found.
As part of removing the legacy ACL system ACL upgrading and the flag for
legacy ACLs is removed from Clients.
This commit also removes the 'acls' serf tag from client nodes. The tag is only ever read
from server nodes.
This commit also introduces a constant for the acl serf tag, to make it easier to track where
it is used.
The DebugConfig in the self endpoint can change at any time. It's not a stable API.
This commit adds the XDSPort to a stable part of the XDS api, and changes the envoy command to read
this new field.
It includes support for the old API as well, in case a newer CLI is used with an older API, and
adds a test for both cases.
Replace it with an implementation that returns an error, and rename some symbols
to use a Deprecated suffix to make it clear.
Also remove the ACLRequest struct, which is no longer referenced.
These methods only called a single function. Wrappers like this end up making code harder to read
because it adds extra ways of doing things.
We already have many helper functions for constructing these types, we don't need additional methods.
When converting these tests from the legacy ACL system to the new RPC endpoints I
initially changed most things to use _prefix rules, because that was equivalent to
the old legacy rules.
This commit modifies a few of those rules to be a bit more specific by replacing the _prefix
rule with a non-prefix one where possible.
This struct allows us to move all the deprecated config options off of
the main config struct, and keeps all the deprecation logic in a single
place, instead of spread across 3+ places.
In preparation for removing ACL.Apply.
Tests for ACL.Apply, ACL.GetPolicy, and ACL upgrades were removed
because all 3 of those will be removed shortly.
The forth test appears to be for the ACLResolver cache, so the test was moved to the correct
test file, and the name was updated to make it obvious what is being tested.
structs.ACLForceSet was deprecated 4 years ago, it should be safe to remove now.
ACLBootstrapNow was removed in a recent commit. While it is technically possible that a cluster with mixed version
could still attempt a legacy boostrap, we documented that the legacy system was deprecated in 1.4, so no
clusters that are being upgraded should be attempting a legacy boostrap.
Fixes#10563
The `resourceVersion` map was doing two jobs prior to this PR. The first job was
to track what version of every resource we know envoy currently has. The
second was to track subscriptions to those resources (by way of the empty
string for a version). This mostly works out fine, but occasionally leads to
consul removing a resource and accidentally (effectively) unsubscribing at the
same time.
The fix separates these two jobs. When all of the resources for a subscription
are removed we continue to track the subscription until envoy explicitly
unsubscribes
* Port consul-enterprise #1123 to OSS
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup missing query field
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* change to re-trigger ci system
Signed-off-by: Mark Anderson <manderson@hashicorp.com>