Commit Graph

1383 Commits

Author SHA1 Message Date
R.B. Boyer 02b2cb1d15 agent: ensure the TLS hostname verification knows about the currently configured domain (#5513) 2019-03-19 22:35:19 +01:00
Hans Hasselberg e7134a0dab
agent: only use TestAgent when appropriate (#5502) 2019-03-18 17:06:16 +01:00
Paul Banks 0b5a078b95
Optimize health watching to single chan/goroutine. (#5449)
Refs #4984.

Watching chans for every node we touch in a health query is wasteful. In #4984 it shows that if there are more than 682 service instances we always fallback to watching all services which kills performance.

We already have a record in MemDB that is reliably update whenever the service health result should change thanks to per-service watch indexes.

So in general, provided there is at least one service instances and we actually have a service index for it (we always do now) we only ever need to watch a single channel.

This saves us from ever falling back to the general index and causing the performance cliff in #4984, but it also means fewer goroutines and work done for every blocking health query.

It also saves some allocations made during the query because we no longer have to populate a WatchSet with 3 chans per service instance which saves the internal map allocation.

This passes all state store tests except the one that explicitly checked for the fallback behaviour we've now optimized away and in general seems safe.
2019-03-15 20:18:48 +00:00
Pierre Souchay 88d4383410 Ensure we remove Connect proxy before deregistering the service itself (#5482)
This will fix https://github.com/hashicorp/consul/issues/5351
2019-03-15 20:14:46 +00:00
Valentin Fritz 21f149de8b Fix checks removal when removing service (#5457)
Fix my recently discovered issue described here: #5456
2019-03-14 11:02:49 -04:00
R.B. Boyer cd96af4fc0
acl: reduce complexity of token resolution process with alternative singleflighting (#5480)
acl: reduce complexity of token resolution process with alternative singleflighting

Switches acl resolution to use golang.org/x/sync/singleflight. For the
identity/legacy lookups this is a drop-in replacement with the same
overall approach to request coalescing.

For policies this is technically a change in behavior, but when
considered holistically is approximately performance neutral (with the
benefit of less code).

There are two goals with this blob of code (speaking specifically of
policy resolution here):

  1) Minimize cross-DC requests.
  2) Minimize client-to-server LAN requests.

The previous iteration of this code was optimizing for the case of many
possibly different tokens being resolved concurrently that have a
significant overlap in linked policies such that deduplication would be
worth the complexity. While this is laudable there are some things to
consider that can help to adjust expectations:

  1) For v1.4+ policies are always replicated, and once a single policy
  shows up in a secondary DC the replicated data is considered
  authoritative for requests made in that DC. This means that our
  earlier concerns about minimizing cross-DC requests are irrelevant
  because there will be no cross-DC policy reads that occur.

  2) For Server nodes the in-memory ACL policy cache is capped at zero,
  meaning it has no caching. Only Client nodes run with a cache. This
  means that instead of having an entire DC's worth of tokens (what a
  Server might see) that can have policy resolutions coalesced these
  nodes will only ever be seeing node-local token resolutions. In a
  reasonable worst-case scenario where a scheduler like Kubernetes has
  "filled" a node with Connect services, even that will only schedule
  ~100 connect services per node. If every service has a unique token
  there will only be 100 tokens to coalesce and even then those requests
  have to occur concurrently AND be hitting an empty consul cache.

Instead of seeing a great coalescing opportunity for cutting down on
redundant Policy resolutions, in practice it's far more likely given
node densities that you'd see requests for the same token concurrently
than you would for two tokens sharing a policy concurrently (to a degree
that would warrant the overhead of the current variation of
singleflighting.

Given that, this patch switches the Policy resolution process to only
singleflight by requesting token (but keeps the cache as by-policy).
2019-03-14 09:35:34 -05:00
Hans Hasselberg 7e11dd82aa
agent: enable reloading of tls config (#5419)
This PR introduces reloading tls configuration. Consul will now be able to reload the TLS configuration which previously required a restart. It is not yet possible to turn TLS ON or OFF with these changes. Only when TLS is already turned on, the configuration can be reloaded. Most importantly the certificates and CAs.
2019-03-13 10:29:06 +01:00
R.B. Boyer 2e175be41b
acl: correctly extend the cache for acl identities during resolution (#5475) 2019-03-12 10:23:43 -05:00
Aestek 4bea29f15a [catalog] Update the node's services indexes on update (#5458)
Node updates were not updating the service indexes, which are used for
service related queries. This caused the X-Consul-Index to stay the same
after a node update as seen from a service query even though the node
data is returned in heath queries. If that happened in between queries
the client would miss this change.
We now update the indexes of the services on the node when it is
updated.

Fixes: #5450
2019-03-11 14:48:19 +00:00
Alvin Huang 8cb8108b1b fix typos 2019-03-06 14:47:33 -05:00
R.B. Boyer f4a3b9d518
fix typos reported by golangci-lint:misspell (#5434) 2019-03-06 11:13:28 -06:00
R.B. Boyer 2ffbea41c8 improve flaky LANReap tests by expliciting configuring the tombstone timeout
In TestServer_LANReap autopilot is running, so the alternate flow
through the serf reaping function is possible. In that situation the
ReconnectTimeout is not relevant so for parity also override the
TombstoneTimeout value as well.

For additional parity update the TestServer_WANReap and
TestClient_LANReap versions of this test in the same way even though
autopilot is irrelevant here .
2019-03-05 14:34:03 -06:00
R.B. Boyer 5bea49ecb0 tests: avoid leaking child processes from agent/proxyprocess package 2019-03-05 14:29:25 -06:00
Matt Keeler 567e41ff6b
Release v1.4.3 2019-03-04 19:21:20 +00:00
Matt Keeler 90040f8bff Fixes for CVE-2019-8336
Fix error in detecting raft replication errors.

Detect redacted token secrets and prevent attempting to insert.

Add a Redacted field to the TokenBatchRead and TokenRead RPC endpoints

This will indicate whether token secrets have been redacted.

Ensure any token with a redacted secret in secondary datacenters is removed.

Test that redacted tokens cannot be replicated.
2019-03-04 19:13:24 +00:00
Hans Hasselberg d35824b1fa default to tls 1.2 as promised. (#5340) 2019-03-04 09:42:04 -05:00
Aestek 2aac4d5168 Register and deregisters services and their checks atomically in the local state (#5012)
Prevent race between register and deregister requests by saving them
together in the local state on registration.
Also adds more cleaning in case of failure when registering services
/ checks.
2019-03-04 09:34:05 -05:00
Matt Keeler 6e6910ea11
Dont modify memdb owned token data for get/list requests of tokens (#5412)
Previously we were fixing up the token links directly on the *ACLToken returned by memdb. This invalidated some assumptions that a snapshot is immutable as well as potentially being able to cause a crash.

The fix here is to give the policy link fixing function copy on write semantics. When no fixes are necessary we can return the memdb object directly, otherwise we copy it and create a new list of links.

Eventually we might find a better way to keep those policy links in sync but for now this fixes the issue.
2019-03-04 09:28:46 -05:00
Aestek 02f991843f Fix race condition in DNS when using cache (#5398)
* Fix race condition in DNS when using cache

The healty node filtering was modifying the result from the cache, which
caused a crash when multiple queries were made to the same service
simultaneously.
We now copy the node slice before filtering to ensure we do not modify
the data stored in the cache.

* Fix wording in dns cache config doc

s/dns_max_age/cache_max_age/
2019-03-04 09:22:01 -05:00
Matt Keeler 200c0fb3e9
Call RemoveServer for reap events (#5317)
This ensures that servers are removed from RPC routing when they are reaped.
2019-03-04 09:19:35 -05:00
R.B. Boyer 409c901f8e test: fix concurrent map access when setting up test vault 2019-03-01 14:30:19 -06:00
R.B. Boyer 6955186239 fix ignored errors in state store internals as reported by errcheck 2019-03-01 14:18:00 -06:00
R.B. Boyer c7067645dd fix a few leap-year related clock math inaccuracies and failing tests 2019-03-01 13:51:49 -06:00
Matt Keeler 118adbb123
ACL Token Persistence and Reloading (#5328)
This PR adds two features which will be useful for operators when ACLs are in use.

1. Tokens set in configuration files are now reloadable.
2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload)

Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change.

Some other secondary changes:

* Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly.
* Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain to be valid. 
* Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, its no longer used only for ACL token/policy replication. The new functions will detect 404s and fallback to using the older token paths when talking to pre-1.4.3 agents.
* Docs updated to reflect the API additions and to show using the new endpoints.
* Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.
2019-02-27 14:28:31 -05:00
Kyle Havlovitz f07e928afc
Merge pull request #5325 from hashicorp/consul-ca-panic
connect/ca: fix a potential panic in the Consul provider
2019-02-27 09:43:44 -08:00
Hans Hasselberg 80e7d63fc2
Centralise tls configuration part 2 (#5374)
This PR is based on #5366 and continues to centralise the tls configuration in order to be reloadable eventually!

This PR is another refactoring. No tests are changed, beyond calling other functions or cosmetic stuff. I added a bunch of tests, even though they might be redundant.
2019-02-27 10:14:59 +01:00
Hans Hasselberg 786b3b1095
Centralise tls configuration part 1 (#5366)
In order to be able to reload the TLS configuration, we need one way to generate the different configurations.

This PR introduces a `tlsutil.Configurator` which holds a `tlsutil.Config`. Afterwards it is responsible for rendering every `tls.Config`. In this particular PR I moved `IncomingHTTPSConfig`, `IncomingTLSConfig`, and `OutgoingTLSWrapper` into `tlsutil.Configurator`.

This PR is a pure refactoring - not a single feature added. And not a single test added. I only slightly modified existing tests as necessary.
2019-02-26 16:52:07 +01:00
Aestek f1cdfbe40e Allow DNS interface to use agent cache (#5300)
Adds two new configuration parameters "dns_config.use_cache" and
"dns_config.cache_max_age" controlling how DNS requests use the agent
cache when querying servers.
2019-02-25 14:06:01 -05:00
R.B. Boyer c2a30c5fdd fix incorrect body of TestACLEndpoint_PolicyBatchRead
Lifted from PR #5307 as it was an unrelated drive-by fix on that PR anyway.

s/token/policy/
2019-02-22 09:32:51 -06:00
R.B. Boyer b569f222f9 update agent/agent_endpoint_test.go to use V2 tokens with attached policies 2019-02-20 11:11:47 -06:00
Nicholas Jackson 99fe9dabce Envoy config cluster (#5308)
* Start adding tests for cluster override

* Refactor tests for clusters

* Passing tests for custom upstream cluster override

* Added capability to customise local app cluster

* Rename config for local cluster override
2019-02-19 13:45:33 +00:00
Kainoa Seto b2af8862c7 Deferred updating response meta with consul headers (#5355) 2019-02-19 11:45:36 +00:00
R.B. Boyer ef8258cd4e test: switch test file from assert -> require for consistency
Also in acl_endpoint_test.go:

* convert logical blocks in some token tests to subtests
* remove use of require.New

This removes a lot of noise in a later PR.
2019-02-14 14:21:19 -06:00
Matt Keeler 766d771017
Pass a testing.T into NewTestAgent and TestAgent.Start (#5342)
This way we can avoid unnecessary panics which cause other tests not to run.

This doesn't remove all the possibilities for panics causing other tests not to run, it just fixes the TestAgent
2019-02-14 10:59:14 -05:00
R.B. Boyer adbe8ed370 correct some typos 2019-02-13 13:02:12 -06:00
R.B. Boyer 88bb53d001 ensure that we plumb our configured logger into all parts of the raft library 2019-02-13 13:02:09 -06:00
R.B. Boyer 2c983902be reduce the local scope of variable 2019-02-13 11:54:28 -06:00
R.B. Boyer de0f585583
agent: only enable TLS on gRPC if the HTTPS API port is enabled (#5287)
Currently the gRPC server assumes that if you have configured TLS
certs on the agent (for RPC) that you want gRPC to be encrypted.
If gRPC is bound to localhost this can be overkill. For the API we
let the user choose to offer HTTP or HTTPS API endpoints
independently of the TLS cert configuration for a similar reason.

This setting will let someone encrypt RPC traffic with TLS but avoid
encrypting local gRPC traffic if that is what they want to do by only
enabling TLS on gRPC if the HTTPS API port is enabled.
2019-02-13 11:49:54 -06:00
R.B. Boyer f2ed3a3777
clarify the ACL.PolicyDelete endpoint (#5337)
There was an errant early-return in PolicyDelete() that bypassed the
rest of the function.  This was ok because the only caller of this
function ignores the results.

This removes the early-return making it structurally behave like
TokenDelete() and for both PolicyDelete and TokenDelete clarify the lone
callers to indicate that the return values are ignored.

We may wish to avoid the entire return value as well, but this patch
doesn't go that far.
2019-02-13 09:16:30 -06:00
R.B. Boyer 324ba5df17
update TestStateStore_ACLBootstrap to not rely upon request mutation (#5335) 2019-02-12 16:09:26 -06:00
Matt Keeler 7073ba4ed2
Move autopilot initialization to prevent race (#5322)
`establishLeadership` invoked during leadership monitoring may use autopilot to do promotions etc. There was a race with doing that and having autopilot initialized and this fixes it.
2019-02-11 11:12:24 -05:00
Kyle Havlovitz 29e4c17b07
connect/ca: fix a potential panic in the Consul provider 2019-02-07 10:43:54 -08:00
Matt Keeler acfd87c673
Improve Connect with Prepared Queries (#5291)
Given a query like:

```
{
   "Name": "tagged-connect-query",
   "Service": {
      "Service": "foo",
      "Tags": ["tag"],
      "Connect": true
   }
}
```

And a Consul configuration like:

```
{
   "services": [
      "name": "foo",
      "port": 8080,
      "connect": { "sidecar_service": {} },
      "tags": ["tag"]
   ]
}
```

If you executed the query it would always turn up with 0 results. This was because the sidecar service was being created without any tags. You could instead make your config look like:

```
{
   "services": [
      "name": "foo",
      "port": 8080,
      "connect": { "sidecar_service": {
         "tags": ["tag"]
      } },
      "tags": ["tag"]
   ]
}
```

However that is a bit redundant for most cases. This PR ensures that the tags and service meta of the parent service get copied to the sidecar service. If there are any tags or service meta set in the sidecar service definition then this copying does not take place. After the changes, the query will now return the expected results.

A second change was made to prepared queries in this PR which is to allow filtering on ServiceMeta just like we allow for filtering on NodeMeta.
2019-02-04 09:36:51 -05:00
R.B. Boyer e1e4249e90
testutil: redirect some test agent logs to testing.T.Logf (#5304)
When tests fail, only the logs for the failing run are dumped to the
console which helps in diagnosis. This is easily added to other test
scenarios as they come up.
2019-02-01 09:21:54 -06:00
R.B. Boyer db8a871309
Merge pull request #5237 from hashicorp/term-grpc-stream-on-token-failure
Check ACLs more often for xDS endpoints.
2019-01-29 14:52:26 -06:00
mkeeler c97c712e96
Release v1.4.2 2019-01-28 21:46:00 +00:00
Kyle Havlovitz 7118f42950
Fix failing TestAgent_PurgeCheckOnDuplicate after merge 2019-01-28 13:19:38 -08:00
Matt Keeler 1736e24fb3
Don't generate TXT records just to discard them (#5272)
* Don't generate TXT records just to discard them

* Add unit test for formatNodeRecord to ensure it prevents returning TXT records
2019-01-28 14:59:58 -05:00
Kyle Havlovitz 928b7ec60d
Merge branch 'healthcheck-duration-fix' 2019-01-28 10:34:34 -08:00
Kyle Havlovitz 1a4978fb94
Re-add ReadableDuration types to health check definition
This is to fix the backwards-incompatible change made in 1.4.1 by
changing these fields to time.Duration.
2019-01-25 14:47:35 -08:00