Ensure that enabling AutoConfig sets the tls configurator properly
This also refactors the TLS configurator a bit so the naming doesn’t imply only AutoEncrypt as the source of the automatically setup TLS cert info.
Most of the groundwork was laid in previous PRs between adding the cert-monitor package to extracting the logic of signing certificates out of the connect_ca_endpoint.go code and into a method on the server.
This also refactors the auto-config package a bit to split things out into multiple files.
There were several PRs that while all passed CI independently, when they all got merged into the same branch caused compilation errors in test code.
The main changes that caused issues where changing agent/cache.Cache.New to require a concrete options struct instead of a pointer. This broke the cert monitor tests and the catalog_list_services_test.go. Another change was made to unembed the http.Server from the agent.HTTPServer struct. That coupled with another change to add a test to ensure cache rate limiting coming from HTTP requests was working as expected caused compilation failures.
This implements a solution for #7863
It does:
Add a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry, default value = rate.Inf
Add cache.entry_fetch_max_burst size of rate limit (default value = 2)
The new configuration now supports the following syntax for instance to allow 1 query every 3s:
command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'
in JSON
{
"cache": {
"entry_fetch_rate": 0.333
}
}
Replaces #7559
Running tests in parallel, with background goroutines, results in test output not being associated with the correct test. `go test` does not make any guarantees about output from goroutines being attributed to the correct test case.
Attaching log output from background goroutines also cause data races. If the goroutine outlives the test, it will race with the test being marked done. Previously this was noticed as a panic when logging, but with the race detector enabled it is shown as a data race.
The previous solution did not address the problem of correct test attribution because test output could still be hidden when it was associated with a test that did not fail. You would have to look at all of the log output to find the relevant lines. It also made debugging test failures more difficult because each log line was very long.
This commit attempts a new approach. Instead of printing all the logs, only print when a test fails. This should work well when there are a small number of failures, but may not work well when there are many test failures at the same time. In those cases the failures are unlikely a result of a specific test, and the log output is likely less useful.
All of the logs are printed from the test goroutine, so they should be associated with the correct test.
Also removes some test helpers that were not used, or only had a single caller. Packages which expose many functions with similar names can be difficult to use correctly.
Related:
https://github.com/golang/go/issues/38458 (may be fixed in go1.15)
https://github.com/golang/go/issues/38382#issuecomment-612940030
The rationale behind removing them is that all of our own code (xDS, builtin connect proxy) use the cache notification mechanism. This ensures that the blocking fetch behind the scenes is always executing. Therefore the only way you might go to get a certificate and have to wait is when 1) the request has never been made for that cert before or 2) you are using the v1/agent/connect/ca/leaf API for retrieving the cert yourself.
In the first case, the refresh change doesn’t alter the behavior. In the second case, it can be mitigated by using blocking queries with that API which just like normal cache notification mechanism will cause the blocking fetch to be initiated and to get leaf certs as soon as needed.
If you are not using blocking queries, or Envoy/xDS, or the builtin connect proxy but are retrieving the certs yourself then the HTTP endpoint might take a little longer to respond.
This also renames the RefreshTimeout field on the register options to QueryTimeout to more accurately reflect that it is used for any type that supports blocking queries.
The fallback method would still work but it would get into a state where it would let the certificate expire for 10s before getting a new one. And the new one used the less secure RPC endpoint.
This is also a pretty large refactoring of the auto encrypt code. I was going to write some tests around the certificate monitoring but it was going to be impossible to get a TestAgent configured in such a way that I could write a test that ran in less than an hour or two to exercise the functionality.
Moving the certificate monitoring into its own package will allow for dependency injection and in particular mocking the cache types to control how it hands back certificates and how long those certificates should live. This will allow for exercising the main loop more than would be possible with it coupled so tightly with the Agent.
CheckAlias already had a waitGroup, but the Add() call was happening too late, which was causing a race in tests. The add must happen before the goroutine is started.
CheckHTTP did not have a waitGroup, so I added it to match CheckAlias.
It looks like a lot of the implementation could be shared, and may not need all of channel, waitgroup and bool, but I will leave that refactor for another time.
When the race detector is enabled we see this test fail occasionally. The reordering of execution seems to make it possible for the snapshot splice to happen before any events are published to the topicBuffers.
We can handle this case in the test the same way it is handled by a subscription, by proceeding to the next event.
This change was mostly automated with the following
First generate a list of functions with:
git grep -o 'Store) \([^(]\+\)(tx \*txn' ./agent/consul/state | awk '{print $2}' | grep -o '^[^(]\+'
Then the list was curated a bit with trial/error to remove and add funcs
as necessary.
Finally the replacement was done with:
dir=agent/consul/state
file=${1-funcnames}
while read fn; do
echo "$fn"
sed -i -e "s/(s \*Store) $fn(/$fn(/" $dir/*.go
sed -i -e "s/s\.$fn(/$fn(/" $dir/*.go
sed -i -e "s/s\.store\.$fn(/$fn(/" $dir/*.go
done < $file
Making these functions allows them to be used without introducing
an artificial dependency on the struct. Many of these will be called
from streaming Event processors, which do not have a store.
This change is being made ahead of the streaming work to get to reduce
the size of the streaming diff.
Move the subscription context to Next. context.Context should generally
never be stored in a struct because it makes that struct only valid
while the context is valid. This is rarely obvious from the caller.
Adds a forceClosed channel in place of the old context, and uses the new
context as a way for the caller to stop the Subscription blocking.
Remove some recursion out of bufferImte.Next. The caller is already looping so we can continue
in that loop instead of recursing. This ensures currentItem is updated immediately (which probably
does not matter in practice), and also removes the chance that we overflow the stack.
NextNoBlock and FollowAfter do not need to handle bufferItem.Err, the caller already
handles it.
Moves filter to a method to simplify Next, and more explicitly separate filtering from looping.
Also improve some godoc
Only unwrap itemBuffer.Err when necessary
EventPublisher was receiving TopicHandlers, which had a couple of
problems:
- ChangeProcessors were being grouped by Topic, but they completely
ignored the topic and were performed on every change
- ChangeProcessors required EventPublisher to be aware of database
changes
By moving ChangeProcesors out of EventPublisher, and having Publish
accept events instead of changes, EventPublisher no longer needs to
be aware of these things.
Handlers is now only SnapshotHandlers, which are still mapped by Topic.
Also allows us to remove the small 'db' package that had only two types.
They can now be unexported types in state.
The EventPublisher is the central hub of the PubSub system. It is toughly coupled with much of
stream. Some stream internals were exported exclusively for EventPublisher.
The two Subscribe cases (with or without index) were also awkwardly split between two packages. By
moving EventPublisher into stream they are now both in the same package (although still in different files).
Also store the index in Changes instead of the Txn.
This change is in preparation for movinng EventPublisher to the stream package, and
making handleACLUpdates async once again.
It is critical that Unsubscribe be called with the same pointer to a
SubscriptionRequest that was used to create the Subscription. The
docstring made that clear, but it sill allowed a caler to get it wrong by
creating a new SubscriptionRequest.
By hiding this detail from the caller, and only exposing an Unsubscribe
method, it should be impossible to fail to Unsubscribe.
Also update some godoc strings.
Use a separate lock for subscriptions.ByToken to allow it to happen synchronously
in the commit flow.
This removes the need to create a new txn for the goroutine, and removes
the need for EventPublisher to contain a reference to DB.
Many of the fields are only needed in one place, and by using a closure
they can be removed from the struct. This reduces the scope of the variables
making it esier to see how they are used.
Otherwise the test will run with exactly the same values each time.
By printing the seed we can attempt to reproduce the test by adding an env var to override the seed
Make topicRegistry use functions instead of unbound methods
Use a regular memDB in EventPublisher to remove a reference cycle
Removes the need for EventPublisher to use a store
Also remove secretHash, which was used to hash tokens. We don't expose
these tokens anywhere, so we can use the string itself instead of a
Hash.
Fix acl_events_test.go for storing a structs type.
This is instead of having the AutoConfigBackend interface provide functions for retrieving them.
NOTE: the config is not reloadable. For now this is fine as we don’t look at any reloadable fields. If that changes then we should provide a way to make it reloadable.
A port can be sent in the Host header as defined in the HTTP RFC, so we
take any hosts that we want to match traffic to and also add another
host with the listener port added.
Also fix an issue with envoy integration tests not running the
case-ingress-gateway-tls test.
In all cases (oss/ent, client/server) this method was returning a value from config. Since the
value is consistent, it doesn't need to be part of the delegate interface.
Fixes#7527
I want to highlight this and explain what I think the implications are and make sure we are aware:
* `HTTPConnStateFunc` closes the connection when it is beyond the limit. `Close` does not block.
* `HTTPConnStateFuncWithDefault429Handler(10 * time.Millisecond)` blocks until the following is done (worst case):
1) `conn.SetDeadline(10*time.Millisecond)` so that
2) `conn.Write(429error)` is guaranteed to timeout after 10ms, so that the http 429 can be written and
3) `conn.Close` can happen
The implication of this change is that accepting any new connection is worst case delayed by 10ms. But only after a client reached the limit already.
The embedded HTTPServer struct is not used by the large HTTPServer
struct. It is used by tests and the agent. This change is a small first
step in the process of removing that field.
The eventual goal is to reduce the scope of HTTPServer making it easier
to test, and split into separate packages.
A query made with AllowNotModifiedResponse and a MinIndex, where the
result has the same Index as MinIndex, will return an empty response
with QueryMeta.NotModified set to true.
Co-authored-by: Pierre Souchay <pierresouchay@users.noreply.github.com>
Also fix a bug where Consul could segfault if TLS was enabled but no client certificate was provided. How no one has reported this as a problem I am not sure.
The initial auto encrypt CSR wasn’t containing the user supplied IP and DNS SANs. This fixes that. Also We were configuring a default :: IP SAN. This should be ::1 instead and was fixed.
Highlights:
- add new endpoint to query for intentions by exact match
- using this endpoint from the CLI instead of the dump+filter approach
- enforcing that OSS can only read/write intentions with a SourceNS or
DestinationNS field of "default".
- preexisting OSS intentions with now-invalid namespace fields will
delete those intentions on initial election or for wildcard namespaces
an attempt will be made to downgrade them to "default" unless one
exists.
- also allow the '-namespace' CLI arg on all of the intention subcommands
- update lots of docs
Split up unused key validation in config entry decode for oss/ent.
This is needed so that we can return an informative error in OSS if namespaces are provided.
Previously, we were only returning a single ListenerPort for a single
service. However, we actually allow a single service to be serviced over
multiple ports, as well as allow users to define what hostnames they
expect their services to be contacted over. When no hosts are defined,
we return the default ingress domain for any configured DNS domain.
To show this in the UI, we modify the gateway-services-nodes API to
return a GatewayConfig.Addresses field, which is a list of addresses
over which the specific service can be contacted.
This is in its own separate package so that it will be a separate test binary that runs thus isolating the go runtime from other tests and allowing accurate go routine leak checking.
This test would ideally use goleak.VerifyTestMain but that will fail 100% of the time due to some architectural things (blocking queries and net/rpc uncancellability).
This test is not comprehensive. We should enable/exercise more features and more cluster configurations. However its a start.
The old test case was a very specific regresion test for a case that is no longer possible.
Replaced with a new test that checks the default coordinate is returned.
We needed to pass a cancellable context into the limiter.Wait instead of context.Background. So I made the func take a context instead of a chan as most places were just passing through a Done chan from a context anyways.
Fix go routine leak in the gateway locator
This will allow to increase cache value when DC is not valid (aka
return SOA to avoid too many consecutive requests) and will
distinguish DC being temporarily not available from DC not existing.
Implements https://github.com/hashicorp/consul/issues/8102
On the servers they must have a certificate.
On the clients they just have to set verify_outgoing to true to attempt TLS connections for RPCs.
Eventually we may relax these restrictions but right now all of the settings we push down (acl tokens, acl related settings, certificates, gossip key) are sensitive and shouldn’t be transmitted over an unencrypted connection. Our guides and docs should recoommend verify_server_hostname on the clients as well.
Another reason to do this is weird things happen when making an insecure RPC when TLS is not enabled. Basically it tries TLS anyways. We should probably fix that to make it clearer what is going on.
The envisioned changes would allow extra settings to enable dynamically defined auth methods to be used instead of or in addition to the statically defined one in the configuration.
There are a couple of things in here.
First, just like auto encrypt, any Cluster.AutoConfig RPC will implicitly use the less secure RPC mechanism.
This drastically modifies how the Consul Agent starts up and moves most of the responsibilities (other than signal handling) from the cli command and into the Agent.
All commands which read config (agent, services, and validate) will now
print warnings when one of the config files is skipped because it did
not match an expected format.
Also ensures that config validate prints all warnings.
Right now this is only hooked into the insecure RPC server and requires JWT authorization. If no JWT authorizer is setup in the configuration then we inject a disabled “authorizer” to always report that JWT authorization is disabled.
While upgrading servers to a new version, I saw that metadata of
existing servers are not upgraded, so the version and raft meta
is not up to date in catalog.
The only way to do it was to:
* update Consul server
* make it leave the cluster, then metadata is accurate
That's because the optimization to avoid updating catalog does
not take into account metadata, so no update on catalog is performed.
And fix the 'value not used' issues.
Many of these are not bugs, but a few are tests not checking errors, and
one appears to be a missed error in non-test code.
A Node Identity is very similar to a service identity. Its main targeted use is to allow creating tokens for use by Consul agents that will grant the necessary permissions for all the typical agent operations (node registration, coordinate updates, anti-entropy).
Half of this commit is for golden file based tests of the acl token and role cli output. Another big updates was to refactor many of the tests in agent/consul/acl_endpoint_test.go to use the same style of tests and the same helpers. Besides being less boiler plate in the tests it also uses a common way of starting a test server with ACLs that should operate without any warnings regarding deprecated non-uuid master tokens etc.
Previously the logic for reading ConfigFiles and produces Sources was split
between NewBuilder and Build. This commit moves all of the logic into NewBuilder
so that Build() can operate entirely on Sources.
This change is in preparation for logging warnings when files have an
unsupported extension.
It also reduces the scope of BuilderOpts, and gets us very close to removing
Builder.options.
The nil value was never used. We can avoid a bunch of complications by
making the field a string value instead of a pointer.
This change is in preparation for fixing a silent config failure.
Flags is an overloaded term in this context. It generally is used to
refer to command line flags. This struct, however, is a data object
used as input to the construction.
It happens to be partially populated by command line flags, but
otherwise has very little to do with them.
Renaming this struct should make the actual responsibility of this struct
more obvious, and remove the possibility that it is confused with
command line flags.
This change is in preparation for adding additional fields to
BuilderOpts.
This field was populated for one reason, to test that it was empty.
Of all the callers, only a single one used this functionality. The rest
constructed a `Flags{}` struct which did not set Args.
I think this shows that the logic was in the wrong place. Only the agent
command needs to care about validating the args.
This commit removes the field, and moves the logic to the one caller
that cares.
Also fix some comments.
Passing the channel to the function which uses it significantly
reduces the scope of the variable, and makes its usage more explicit. It
also moves the initialization of the channel closer to where it is used.
Also includes a couple very small cleanups to remove a local var and
read the error from `ctx.Err()` directly instead of creating a channel
to check for an error.
This field was always read by the same function that populated the field,
so it does not need to be a field. Passing the value as an argument to
functions makes it more obvious where the value comes from, and also reduces
the scope of the variable significantly.
[The documentation for context](https://golang.org/pkg/context/)
recommends not storing context in a struct field:
> Do not store Contexts inside a struct type; instead, pass a Context
> explicitly to each function that needs it. The Context should be the
> first parameter, typically named ctx...
Sometimes there are good reasons to not follow this recommendation, but
in this case it seems easy enough to follow.
Also moved the ctx argument to be the first in one of the function calls
to follow the same recommendation.
Blocking queries issues will still be uncancellable (that cannot be helped until we get rid of net/rpc). However this makes it so that if calling getWithIndex (like during a cache Notify go routine) we can cancell the outer routine. Previously it would keep issuing more blocking queries until the result state actually changed.
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Currently when passing hostname clusters to Envoy, we set each service instance registered with Consul as an LbEndpoint for the cluster.
However, Envoy can only handle one per cluster:
[2020-06-04 18:32:34.094][1][warning][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC config for type.googleapis.com/envoy.api.v2.Cluster rejected: Error adding/updating cluster(s) dc2.internal.ddd90499-9b47-91c5-4616-c0cbf0fc358a.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint, server.dc2.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
Envoy is currently handling this gracefully by only picking one of the endpoints. However, we should avoid passing multiple to avoid these warning logs.
This PR:
* Ensures we only pass one endpoint, which is tied to one service instance.
* We prefer sending an endpoint which is marked as Healthy by Consul.
* If no endpoints are healthy we emit a warning and skip the cluster.
* If multiple unique hostnames are spread across service instances we emit a warning and let the user know which will be resolved.
In discussion with team, it was pointed out that query parameters tend
to be filter mechanism, and that semantically the "/v1/health/connect"
endpoint should return "all healthy connect-enabled endpoints (e.g.
could be side car proxies or native instances) for this service so I can
connect with mTLS".
That does not fit an ingress gateway, so we remove the query parameter
and add a new endpoint "/v1/health/ingress" that semantically means
"all the healthy ingress gateway instances that I can connect to
to access this connect-enabled service without mTLS"
Currently opaque config blocks (config entries, and CA provider config) are
modified by PatchSliceOfMaps, making it impossible for these opaque
config sections to contain slices of maps.
In order to fix this problem, any lazy-decoding of these blocks needs to support
weak decoding of []map[string]interface{} to a struct type before
PatchSliceOfMaps is replaces. This is necessary because these config
blobs are persisted, and during an upgrade an older version of Consul
could read one of the new configuration values, which would cause an error.
To support the upgrade path, this commit first introduces the new hooks
for weak decoding of []map[string]interface{} and uses them only in the
lazy-decode paths. That way, in a future release, new style
configuration will be supported by the older version of Consul.
This decode hook has a number of advantages:
1. It no longer panics. It allows mapstructure to report the error
2. It no longer requires the user to declare which fields are slices of
structs. It can deduce that information from the 'to' value.
3. It will make it possible to preserve opaque configuration, allowing
for structured opaque config.
* Fixes#5606: Tokens converted from legacy ACLs get their Hash computed
This allows new style token replication to work for legacy tokens as well when they change.
* tests: fix timestamp comparison
Co-authored-by: Matt Keeler <mjkeeler7@gmail.com>
Previously, we did not require the 'service-name.*' host header value
when on a single http service was exposed. However, this allows a user
to get into a situation where, if they add another service to the
listener, suddenly the previous service's traffic might not be routed
correctly. Thus, we always require the Host header, even if there is
only 1 service.
Also, we add the make the default domain matching more restrictive by
matching "service-name.ingress.*" by default. This lines up better with
the namespace case and more accurately matches the Consul DNS value we
expect people to use in this case.
This allows the operator to disable agent caching for the http endpoint.
It is on by default for backwards compatibility and if disabled will
ignore the url parameter `cached`.
Found using staticcheck.
binary.Write does not accept int types without a size. The error from binary.Write was ignored, so we never saw this error. Casting the data to uint64 produces a correct hash.
Also deprecate the Default{Addr,Port} fields, and prevent them from being encoded. These fields will always be empty and are not used.
Removing these would break backwards compatibility, so they are left in place for now.
Co-authored-by: Hans Hasselberg <me@hans.io>
In current implementation of Consul, check alias cannot determine
if a service exists or not. Because a service without any check
is semantically considered as passing, so when no healthchecks
are found for an agent, the check was considered as passing.
But this make little sense as the current implementation does not
make any difference between:
* a non-existing service (passing)
* a service without any check (passing as well)
In order to make it work, we have to ensure that when a check did
not find any healthcheck, the service does indeed exists. If it
does not, lets consider the check as failing.
The DNS resolution will be handled by Envoy and defaults to LOGICAL_DNS. This discovery type can be overridden on a per-gateway basis with the envoy_dns_discovery_type Gateway Option.
If a service contains an instance with a hostname as an address we set the Envoy cluster to use DNS as the discovery type rather than EDS. Since both mesh gateways and terminating gateways route to clusters using SNI, whenever there is a mix of hostnames and IP addresses associated with a service we use the hostname + CDS rather than the IPs + EDS.
Note that we detect hostnames by attempting to parse the service instance's address as an IP. If it is not a valid IP we assume it is a hostname.
The ACL.GetPolicy RPC endpoint was supposed to return the “parent” policy and not always the default policy. In the case of legacy management tokens the parent policy was supposed to be “manage”. The result of us not sending this properly was that operations that required specifically a management token such as saving a snapshot would not work in secondary DCs until they were upgraded.
* testing: replace most goe/verify.Values with require.Equal
One difference between these two comparisons is that go/verify considers
nil slices/maps to be equal to empty slices/maps, where as testify/require
does not, and does not appear to provide any way to enable that behaviour.
Because of this difference some expected values were changed from empty
slices to nil slices, and some calls to verify.Values were left.
* Remove github.com/pascaldekloe/goe/verify
Reduce the number of assertion packages we use from 2 to 1
In the past TLS usage was enforced with these variables, but these days
this decision is made by TLSConfigurator and there is no reason to keep
using the variables.
The version field has been used to decide which multiplexing to use. It
was introduced in 2457293dce. But this is
6y ago and there is no need for this differentiation anymore.
Three of the checks are temporarily disabled to limit the size of the
diff, and allow us to enable all the other checks in CI.
In a follow up we can fix the issues reported by the other checks one
at a time, and enable them.
The main fix here is to always union the `primary-gateways` list with
the list of mesh gateways in the primary returned from the replicated
federation states list. This will allow any replicated (incorrect) state
to be supplemented with user-configured (correct) state in the config
file. Eventually the game of random selection whack-a-mole will pick a
winning entry and re-replicate the latest federation states from the
primary. If the user-configured state is actually the incorrect one,
then the same eventual correct selection process will work in that case,
too.
The secondary fix is actually to finish making wanfed-via-mgws actually
work as originally designed. Once a secondary datacenter has replicated
federation states for the primary AND managed to stand up its own local
mesh gateways then all of the RPCs from a secondary to the primary
SHOULD go through two sets of mesh gateways to arrive in the consul
servers in the primary (one hop for the secondary datacenter's mesh
gateway, and one hop through the primary datacenter's mesh gateway).
This was neglected in the initial implementation. While everything
works, ideally we should treat communications that go around the mesh
gateways as just provided for bootstrapping purposes.
Now we heuristically use the success/failure history of the federation
state replicator goroutine loop to determine if our current mesh gateway
route is working as intended. If it is, we try using the local gateways,
and if those don't work we fall back on trying the primary via the union
of the replicated state and the go-discover configuration flags.
This can be improved slightly in the future by possibly initializing the
gateway choice to local on startup if we already have replicated state.
This PR does not address that improvement.
Fixes#7339
* Standardize support for Tagged and BindAddresses in Ingress Gateways
This updates the TaggedAddresses and BindAddresses behavior for Ingress
to match Mesh/Terminating gateways. The `consul connect envoy` command
now also allows passing an address without a port for tagged/bind
addresses.
* Update command/connect/envoy/envoy.go
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* PR comments
* Check to see if address is an actual IP address
* Update agent/xds/listeners.go
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* fix whitespace
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Currently checks of type gRPC will emit log messages such as,
2020/02/12 13:48:22 [INFO] parsed scheme: ""
2020/02/12 13:48:22 [INFO] scheme "" not registered, fallback to default scheme
Without adding full support for using custom gRPC schemes (maybe that's
right long-term path) we can just supply the default scheme as provided
by the grpc library.
Fixes https://github.com/hashicorp/consul/issues/7274
and https://github.com/hashicorp/nomad/issues/7415
Errors are values. We can use the error value to identify the 'comparison failed' case which makes the function easier to use and should make it harder to miss handle the error case