Exposing checks is supposed to allow a Consul agent bound to a different
IP address (e.g., in a different Kubernetes pod) to access healthchecks
through the proxy while the underlying service binds to localhost. This
is an important security feature that makes sure no external traffic
reaches the service except through the proxy.
However, as far as I can tell, this is subtly broken in the case where
the Consul agent cannot reach the proxy over localhost.
If a proxy is configured with: `{ LocalServiceAddress: "127.0.0.1",
Checks: true }`, as is typical with a sidecar proxy, the Consul checks
are currently rewritten to `127.0.0.1:<random port>`. A Consul agent
that does not share the loopback address cannot reach this address. Just
to make sure I was not misunderstanding, I tried configuring the proxy
with `{ LocalServiceAddress: "<pod ip>", Checks: true }`. In this case,
while the checks are rewritten as expected and the agent can reach the
dynamic port, the proxy can no longer reach its backend because the
traffic is no longer on the loopback interface.
I think rewriting the checks to use `proxy.Address`, the proxy's own
address, is more correct in this case. That is the IP where the proxy
can be reached, both by other proxies and by a Consul agent running on
a different IP. The local service address should continue to use
`127.0.0.1` in most cases.
If a proxied service is a gRPC or HTTP2 service, but a path is exposed
using the HTTP1 or TCP protocol, Envoy should not be configured with
`http2ProtocolOptions` for the cluster backing the path.
A situation where this comes up is a gRPC service whose healthcheck or
metrics route (e.g. for Prometheus) is an HTTP1 service running on
a different port. Previously, if these were exposed either using
`Expose: { Checks: true }` or `Expose: { Paths: ... }`, Envoy would
still be configured to communicate with the path over HTTP2, which would
not work properly.
I spent some time today on my local Mac to figure out why Consul 1.6.3+
was not accepting limits.http_max_conns_per_client.
This adds an explicit check on number of file descriptors to be sure
it might work (this is no guarantee as if many clients are reaching
the agent, it might consume even more file descriptors)
Anyway, many users are fighting with RLIMIT_NOFILE, having a clear
message would allow them to figure out what to fix.
Example of message (reload or start):
```
2020-03-11T16:38:37.062+0100 [ERROR] agent: Error starting agent: error="system allows a max of 512 file descriptors, but limits.http_max_conns_per_client: 8192 needs at least 8212"
```
Due to merge #7562, upstream does not compile anymore.
Error is:
ERRO Running error: gofmt: analysis skipped: errors in package: [/Users/p.souchay/go/src/github.com/hashicorp/consul/agent/config_endpoint_test.go:188:33: too many arguments]
This function now only starts the agent.
Using:
git grep -l 'StartTestAgent(t, true,' | \
xargs sed -i -e 's/StartTestAgent(t, true,/StartTestAgent(t,/g'
When run in with `-dev` in DevMode, it is not possible to replace
the embeded UI with another one because `-dev` implies `-ui`.
This commit allows this an slightly change the error message
about Consul 0.7.0 which is very old and does not apply to
current version anyway.
This config entry will be used to configure terminating gateways.
It accepts the name of the gateway and a list of services the gateway will represent.
For each service users will be able to specify: its name, namespace, and additional options for TLS origination.
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
* Add Ingress gateway config entry and other relevant structs
* Add api package tests for ingress gateways
* Embed EnterpriseMeta into ingress service struct
* Add namespace fields to api module and test consul config write decoding
* Don't require a port for ingress gateways
* Add snakeJSON and camelJSON cases in command test
* Run Normalize on service's ent metadata
Sadly cannot think of a way to test this in OSS.
* Every protocol requires at least 1 service
* Validate ingress protocols
* Update agent/structs/config_entry_gateways.go
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Previously the log output included the test name twice and a long date
format. The test output is already grouped by test, so adding the test
name did not add any new information. The date and time are only useful
to understand elapsed time, so using a short format should provide
succident detail.
Also fixed a bug in NewTestAgentWithFields where nil was returned
instead of the test agent.
This test would occasionally fail because we checked for a status of
"critical" initially. This races with the actual healthcheck being run
and declared passing.
We instead use a ttl health check so that we don't rely on timing at all.
To reduce the chance of some tests not being run because it does not
match the regex passed to '-run'.
Also document why some tests are allowed to be skipped on CI.
If the CI environment is not correct for running tests the tests
should fail, so that we don't accidentally stop running some tests
because of a change to our CI environment.
Also removed a duplicate delcaration from init. I believe one was
overriding the other as they are both in the same package.
These changes are necessary to ensure advertisement happens correctly even when datacenters are connected via network areas in Consul enterprise.
This also changes how we check if ACLs can be upgraded within the local datacenter. Previously we would iterate through all LAN members. Now we just use the ServerLookup type to iterate through all known servers in the DC.
This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch:
There are several distinct chunks of code that are affected:
* new flags and config options for the server
* retry join WAN is slightly different
* retry join code is shared to discover primary mesh gateways from secondary datacenters
* because retry join logic runs in the *agent* and the results of that
operation for primary mesh gateways are needed in the *server* there are
some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur
at multiple layers of abstraction just to pass the data down to the right
layer.
* new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers
* the function signature for RPC dialing picked up a new required field (the
node name of the destination)
* several new RPCs for manipulating a FederationState object:
`FederationState:{Apply,Get,List,ListMeshGateways}`
* 3 read-only internal APIs for debugging use to invoke those RPCs from curl
* raft and fsm changes to persist these FederationStates
* replication for FederationStates as they are canonically stored in the
Primary and replicated to the Secondaries.
* a special derivative of anti-entropy that runs in secondaries to snapshot
their local mesh gateway `CheckServiceNodes` and sync them into their upstream
FederationState in the primary (this works in conjunction with the
replication to distribute addresses for all mesh gateways in all DCs to all
other DCs)
* a "gateway locator" convenience object to make use of this data to choose
the addresses of gateways to use for any given RPC or gossip operation to a
remote DC. This gets data from the "retry join" logic in the agent and also
directly calls into the FSM.
* RPC (`:8300`) on the server sniffs the first byte of a new connection to
determine if it's actually doing native TLS. If so it checks the ALPN header
for protocol determination (just like how the existing system uses the
type-byte marker).
* 2 new kinds of protocols are exclusively decoded via this native TLS
mechanism: one for ferrying "packet" operations (udp-like) from the gossip
layer and one for "stream" operations (tcp-like). The packet operations
re-use sockets (using length-prefixing) to cut down on TLS re-negotiation
overhead.
* the server instances specially wrap the `memberlist.NetTransport` when running
with gateway federation enabled (in a `wanfed.Transport`). The general gist is
that if it tries to dial a node in the SAME datacenter (deduced by looking
at the suffix of the node name) there is no change. If dialing a DIFFERENT
datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh
gateways to eventually end up in a server's :8300 port.
* a new flag when launching a mesh gateway via `consul connect envoy` to
indicate that the servers are to be exposed. This sets a special service
meta when registering the gateway into the catalog.
* `proxycfg/xds` notice this metadata blob to activate additional watches for
the FederationState objects as well as the location of all of the consul
servers in that datacenter.
* `xds:` if the extra metadata is in place additional clusters are defined in a
DC to bulk sink all traffic to another DC's gateways. For the current
datacenter we listen on a wildcard name (`server.<dc>.consul`) that load
balances all servers as well as one mini-cluster per node
(`<node>.server.<dc>.consul`)
* the `consul tls cert create` command got a new flag (`-node`) to help create
an additional SAN in certs that can be used with this flavor of federation.
This fixes issue #7318
Between versions 1.5.2 and 1.5.3, a regression has been introduced regarding health
of services. A patch #6144 had been issued for HealthChecks of nodes, but not for healthchecks
of services.
What happened when a reload was:
1. save all healthcheck statuses
2. cleanup everything
3. add new services with healthchecks
In step 3, the state of healthchecks was taken into account locally,
so at step 3, but since we cleaned up at step 2, state was lost.
This PR introduces the snap parameter, so step 3 can use information from step 1