Most of the groundwork was laid in previous PRs between adding the cert-monitor package to extracting the logic of signing certificates out of the connect_ca_endpoint.go code and into a method on the server.
This also refactors the auto-config package a bit to split things out into multiple files.
This implements a solution for #7863
It does:
Add a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry, default value = rate.Inf
Add cache.entry_fetch_max_burst size of rate limit (default value = 2)
The new configuration now supports the following syntax for instance to allow 1 query every 3s:
command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'
in JSON
{
"cache": {
"entry_fetch_rate": 0.333
}
}
The fallback method would still work but it would get into a state where it would let the certificate expire for 10s before getting a new one. And the new one used the less secure RPC endpoint.
This is also a pretty large refactoring of the auto encrypt code. I was going to write some tests around the certificate monitoring but it was going to be impossible to get a TestAgent configured in such a way that I could write a test that ran in less than an hour or two to exercise the functionality.
Moving the certificate monitoring into its own package will allow for dependency injection and in particular mocking the cache types to control how it hands back certificates and how long those certificates should live. This will allow for exercising the main loop more than would be possible with it coupled so tightly with the Agent.
# Conflicts:
# agent/agent.go
Three of the checks are temporarily disabled to limit the size of the
diff, and allow us to enable all the other checks in CI.
In a follow up we can fix the issues reported by the other checks one
at a time, and enable them.
Based on work done in https://github.com/hashicorp/memberlist/pull/196
this allows to restrict the IP ranges that can join a given Serf cluster
and be a member of the cluster.
Restrictions on IPs can be done separatly using 2 new differents flags
and config options to restrict IPs for LAN and WAN Serf.
Previously this happened to be using the method on the Server/Client that was meant to allow the ACLResolver to locally resolve tokens. On Servers that had tokens (primary or secondary dc + token replication) this function would lookup the token from raft and return the ACLIdentity. On clients this was always a noop. We inadvertently used this function instead of creating a new one when we added logging accessor ids for permission denied RPC requests.
With this commit, a new method is used for resolving the identity properly via the ACLResolver which may still resolve locally in the case of being on a server with tokens but also supports remote token resolution.
Previously the SupportsBlocking option was specified by a method on the
type, and all the other options were specified from RegisterOptions.
This change moves RegisterOptions to a method on the type, and moves
SupportsBlocking into the options struct.
Currently there are only 2 cache-types. So all cache-types can implement
this method by embedding a struct with those predefined values. In the
future if a cache type needs to be registered more than once with different
options it can remove the embedded type and implement the method in a way
that allows for paramaterization.
* Implements a simple, tcp ingress gateway workflow
This adds a new type of gateway for allowing Ingress traffic into Connect from external services.
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
The Init method provided the same functionality as the New constructor.
The constructor is both more widely used, and more idiomatic, so remove
the Init method.
This change is in preparation for fixing printing of these IDs.
Exposing checks is supposed to allow a Consul agent bound to a different
IP address (e.g., in a different Kubernetes pod) to access healthchecks
through the proxy while the underlying service binds to localhost. This
is an important security feature that makes sure no external traffic
reaches the service except through the proxy.
However, as far as I can tell, this is subtly broken in the case where
the Consul agent cannot reach the proxy over localhost.
If a proxy is configured with: `{ LocalServiceAddress: "127.0.0.1",
Checks: true }`, as is typical with a sidecar proxy, the Consul checks
are currently rewritten to `127.0.0.1:<random port>`. A Consul agent
that does not share the loopback address cannot reach this address. Just
to make sure I was not misunderstanding, I tried configuring the proxy
with `{ LocalServiceAddress: "<pod ip>", Checks: true }`. In this case,
while the checks are rewritten as expected and the agent can reach the
dynamic port, the proxy can no longer reach its backend because the
traffic is no longer on the loopback interface.
I think rewriting the checks to use `proxy.Address`, the proxy's own
address, is more correct in this case. That is the IP where the proxy
can be reached, both by other proxies and by a Consul agent running on
a different IP. The local service address should continue to use
`127.0.0.1` in most cases.
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This fixes issue #7318
Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.
What happened when a reload was:
1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks
In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.
This PR introduces the snap parameter, so step 3 can use information from step 1
Fixes#7231. Before an agent would always emit a warning when there is
an encrypt key in the configuration and an existing keyring stored,
which is happening on restart.
Now it only emits that warning when the encrypt key from the
configuration is not part of the keyring.