consul

Commit Graph

Author	SHA1	Message	Date
Kyle Havlovitz	136549205c	Merge pull request #7759 from hashicorp/ingress/tls-hosts Add TLS option for Ingress Gateway listeners	2020-05-11 09:18:43 -07:00
Daniel Nephin	5655d7f34e	Add outlier_detection check to integration test Fix decoding of time.Duration types.	2020-05-08 14:56:57 -04:00
Kyle Havlovitz	d452769d92	Add TLS integration test for ingress gateway - Pull Consul Root CA from API in order to verify certificate chain - Assert on the DNSSAN as well to ensure it is correct	2020-05-06 15:12:02 -05:00
Chris Piraino	f40833d094	Allow Hosts field to be set on an ingress config entry - Validate that this cannot be set on a 'tcp' listener nor on a wildcard service. - Add Hosts field to api and test in consul config write CLI - xds: Configure envoy with user-provided hosts from ingress gateways	2020-05-06 15:06:13 -05:00
Kyle Havlovitz	e9e8c0e730	Ingress Gateways for TCP services (#7509 ) * Implements a simple, tcp ingress gateway workflow This adds a new type of gateway for allowing Ingress traffic into Connect from external services. Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>	2020-04-16 14:00:48 -07:00
Pierre Souchay	2199a134a0	More tolerant assert_alive_wan_member_count to fix unstable tests Example of failure (very frequent): https://circleci.com/gh/hashicorp/consul/157985	2020-04-13 16:02:45 +02:00
Hans Hasselberg	6a49a42e98	connect: support for envoy 1.13.1 and 1.12.3 (#7380 ) * setup new envoy versions for CI * bump version on the website too.	2020-03-10 11:04:46 +01:00
R.B. Boyer	6adad71125	wan federation via mesh gateways (#6884 ) This is like a Möbius strip of code due to the fact that low-level components (serf/memberlist) are connected to high-level components (the catalog and mesh-gateways) in a twisty maze of references which make it hard to dive into. With that in mind here's a high level summary of what you'll find in the patch: There are several distinct chunks of code that are affected: * new flags and config options for the server * retry join WAN is slightly different * retry join code is shared to discover primary mesh gateways from secondary datacenters * because retry join logic runs in the agent and the results of that operation for primary mesh gateways are needed in the server there are some methods like `RefreshPrimaryGatewayFallbackAddresses` that must occur at multiple layers of abstraction just to pass the data down to the right layer. * new cache type `FederationStateListMeshGatewaysName` for use in `proxycfg/xds` layers * the function signature for RPC dialing picked up a new required field (the node name of the destination) * several new RPCs for manipulating a FederationState object: `FederationState:{Apply,Get,List,ListMeshGateways}` * 3 read-only internal APIs for debugging use to invoke those RPCs from curl * raft and fsm changes to persist these FederationStates * replication for FederationStates as they are canonically stored in the Primary and replicated to the Secondaries. * a special derivative of anti-entropy that runs in secondaries to snapshot their local mesh gateway `CheckServiceNodes` and sync them into their upstream FederationState in the primary (this works in conjunction with the replication to distribute addresses for all mesh gateways in all DCs to all other DCs) * a "gateway locator" convenience object to make use of this data to choose the addresses of gateways to use for any given RPC or gossip operation to a remote DC. This gets data from the "retry join" logic in the agent and also directly calls into the FSM. * RPC (`:8300`) on the server sniffs the first byte of a new connection to determine if it's actually doing native TLS. If so it checks the ALPN header for protocol determination (just like how the existing system uses the type-byte marker). * 2 new kinds of protocols are exclusively decoded via this native TLS mechanism: one for ferrying "packet" operations (udp-like) from the gossip layer and one for "stream" operations (tcp-like). The packet operations re-use sockets (using length-prefixing) to cut down on TLS re-negotiation overhead. * the server instances specially wrap the `memberlist.NetTransport` when running with gateway federation enabled (in a `wanfed.Transport`). The general gist is that if it tries to dial a node in the SAME datacenter (deduced by looking at the suffix of the node name) there is no change. If dialing a DIFFERENT datacenter it is wrapped up in a TLS+ALPN blob and sent through some mesh gateways to eventually end up in a server's :8300 port. * a new flag when launching a mesh gateway via `consul connect envoy` to indicate that the servers are to be exposed. This sets a special service meta when registering the gateway into the catalog. * `proxycfg/xds` notice this metadata blob to activate additional watches for the FederationState objects as well as the location of all of the consul servers in that datacenter. * `xds:` if the extra metadata is in place additional clusters are defined in a DC to bulk sink all traffic to another DC's gateways. For the current datacenter we listen on a wildcard name (`server.<dc>.consul`) that load balances all servers as well as one mini-cluster per node (`<node>.server.<dc>.consul`) * the `consul tls cert create` command got a new flag (`-node`) to help create an additional SAN in certs that can be used with this flavor of federation.	2020-03-09 15:59:02 -05:00
Matt Keeler	0041102e29	Change where the envoy snapshots get put when a test fails (#7298 ) This will allow us to capture them in CI	2020-03-05 16:01:10 -05:00
Hans Hasselberg	9cb7adb304	add envoy version 1.12.2 and 1.13.0 to the matrix (#7240 ) * add 1.12.2 * add envoy 1.13.0 * Introduce -envoy-version to get 1.10.0 passing. * update old version and fix consul-exec case * add envoy_version and fix check * Update Envoy CLI tests to account for the 1.13 compatibility changes. Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>	2020-02-10 14:53:04 -05:00
Paschalis Tsilias	a335aa57c5	Expose Envoy's /stats for statsd agents (#7173 ) * Expose Envoy /stats for statsd agents; Add testcases * Remove merge conflict leftover * Add support for prefix instead of path; Fix docstring to mirror these changes * Add new config field to docs; Add testcases to check that /stats/prometheus is exposed as well * Parametrize matchType (prefix or path) and value * Update website/source/docs/connect/proxies/envoy.md Co-Authored-By: Paul Banks <banks@banksco.de> Co-authored-by: Paul Banks <banks@banksco.de>	2020-02-03 17:19:34 +00:00
Matt Keeler	bfc03ec587	Fix a couple bugs regarding intentions with namespaces (#7169 )	2020-01-29 17:30:38 -05:00
Matt Keeler	c09693e545	Updates to Config Entries and Connect for Namespaces (#7116 )	2020-01-24 10:04:58 -05:00
Chris Piraino	f3b54fa535	Allow configuration of upstream connection limits in Envoy (#6829 ) * Adds 'limits' field to the upstream configuration of a connect proxy This allows a user to configure the envoy connect proxy with 'max_connections', 'max_queued_requests', and 'max_concurrent_requests'. These values are defined in the local proxy on a per-service instance basis and should thus NOT be thought of as a global-level or even service-level value.	2019-12-03 14:13:33 -06:00
R.B. Boyer	2cd5a7e542	tests: make envoy integration tests more tolerant of internal retries that may inflate counters (#6539 ) This should remove false positives that look like: cluster.s2.default.primary.*cx_total - expected count: 2, actual count: 3	2019-09-25 09:08:42 -05:00
R.B. Boyer	8e22d80e35	connect: fix failover through a mesh gateway to a remote datacenter (#6259 ) Failover is pushed entirely down to the data plane by creating envoy clusters and putting each successive destination in a different load assignment priority band. For example this shows that normally requests go to 1.2.3.4:8080 but when that fails they go to 6.7.8.9:8080: - name: foo load_assignment: cluster_name: foo policy: overprovisioning_factor: 100000 endpoints: - priority: 0 lb_endpoints: - endpoint: address: socket_address: address: 1.2.3.4 port_value: 8080 - priority: 1 lb_endpoints: - endpoint: address: socket_address: address: 6.7.8.9 port_value: 8080 Mesh gateways route requests based solely on the SNI header tacked onto the TLS layer. Envoy currently only lets you configure the outbound SNI header at the cluster layer. If you try to failover through a mesh gateway you ideally would configure the SNI value per endpoint, but that's not possible in envoy today. This PR introduces a simpler way around the problem for now: 1. We identify any target of failover that will use mesh gateway mode local or remote and then further isolate any resolver node in the compiled discovery chain that has a failover destination set to one of those targets. 2. For each of these resolvers we will perform a small measurement of comparative healths of the endpoints that come back from the health API for the set of primary target and serial failover targets. We walk the list of targets in order and if any endpoint is healthy we return that target, otherwise we move on to the next target. 3. The CDS and EDS endpoints both perform the measurements in (2) for the affected resolver nodes. 4. For CDS this measurement selects which TLS SNI field to use for the cluster (note the cluster is always going to be named for the primary target) 5. For EDS this measurement selects which set of endpoints will populate the cluster. Priority tiered failover is ignored. One of the big downsides to this approach to failover is that the failover detection and correction is going to be controlled by consul rather than deferring that entirely to the data plane as with the prior version. This also means that we are bound to only failover using official health signals and cannot make use of data plane signals like outlier detection to affect failover. In this specific scenario the lack of data plane signals is ok because the effectiveness is already muted by the fact that the ultimate destination endpoints will have their data plane signals scrambled when they pass through the mesh gateway wrapper anyway so we're not losing much. Another related fix is that we now use the endpoint health from the underlying service, not the health of the gateway (regardless of failover mode).	2019-08-05 13:30:35 -05:00
Matt Keeler	3053342198	Envoy Mesh Gateway integration tests (#6187 ) * Allow setting the mesh gateway mode for an upstream in config files * Add envoy integration test for mesh gateways This necessitated many supporting changes in most of the other test cases. Add remote mode mesh gateways integration test	2019-07-24 17:01:42 -04:00
R.B. Boyer	e039dfd7f8	connect: rework how the service resolver subset OnlyPassing flag works (#6173 ) The main change is that we no longer filter service instances by health, preferring instead to render all results down into EDS endpoints in envoy and merely label the endpoints as HEALTHY or UNHEALTHY. When OnlyPassing is set to true we will force consul checks in a 'warning' state to render as UNHEALTHY in envoy. Fixes #6171	2019-07-23 20:20:24 -05:00
R.B. Boyer	aca2c5de3f	tests: adding new envoy integration tests for L7 service-resolvers (#6129 ) Additionally: - wait for bootstrap config entries to be applied - run the verify container in the host's PID namespace so we can kill envoys without mounting the docker socket * assert that we actually send HEALTHY and UNHEALTHY endpoints down in EDS during failover	2019-07-23 20:08:36 -05:00
R.B. Boyer	9138a97054	Fix bug in service-resolver redirects if the destination uses a default resolver. (#6122 ) Also: - add back an internal http endpoint to dump a compiled discovery chain for debugging purposes Before the CompiledDiscoveryChain.IsDefault() method would test: - is this chain just one resolver step? - is that resolver step just the default? But what I forgot to test: - is that resolver step for the same service that the chain represents? This last point is important because if you configured just one config entry: kind = "service-resolver" name = "web" redirect { service = "other" } and requested the chain for "web" you'd get back a default resolver for "other". In the xDS code the IsDefault() method is used to determine if this chain is "empty". If it is then we use the pre-discovery-chain logic that just uses data embedded in the Upstream object (and still lets the escape hatches function). In the example above that means certain parts of the xDS code were going to try referencing a cluster named "web..." despite the other parts of the xDS code maintaining clusters named "other...".	2019-07-12 12:21:25 -05:00
R.B. Boyer	911ed76e5b	tests: further reduce envoy integration test flakiness (#6112 ) In addition to waiting until s2 shows up healthy in the Catalog, wait until s2 endpoints show up healthy via EDS in the s1 upstream clusters.	2019-07-12 11:12:56 -05:00
R.B. Boyer	d4e58e9773	test: for envoy integration tests bump the time to wait for the upstream to be healthy (#6109 )	2019-07-10 18:07:47 -04:00
R.B. Boyer	20caa4f744	test: for envoy integration tests, wait until 's2' is healthy in consul before interrogating envoy (#6108 ) When the envoy healthy panic threshold was explicitly disabled as part of L7 traffic management it changed how envoy decided to load balance to endpoints in a cluster. This only matters when envoy is in "panic mode" aka "when you have a bunch of unhealthy endpoints". Panic mode sends traffic to unhealthy instances in certain circumstances. Note: Prior to explicitly disabling the healthy panic threshold, the default value is 50%. What was happening is that the test harness was bringing up consul the sidecars, and the service instances all at once and sometimes the proxies wouldn't have time to be checked by consul to be labeled as 'passing' in the catalog before a round of EDS happened. The xDS server in consul effectively queries /v1/health/connect/s2 and gets 1 result, but that one result has a 'critical' check so the xDS server sends back that endpoint labeled as UNHEALTHY. Envoy sees that 100% of the endpoints in the cluster are unhealthy and would enter panic mode and still send traffic to s2. This is why the test suites PRIOR to disabling the healthy panic threshold worked. They were _incorrectly_ passing. When the healthy panic threshol is disabled, envoy never enters panic mode in this situation and thus the cluster has zero healthy endpoints so load balancing goes nowhere and the tests fail. Why does this only affect the test suites for envoy 1.8.0? My guess is that https://github.com/envoyproxy/envoy/pull/4442 was merged into the 1.9.x series and somehow that plays a role. This PR modifies the bats scripts to explicitly wait until the upstream sidecar is healthy as measured by /v1/health/connect/s2?passing BEFORE trying to interrogate envoy which should make the tests less racy.	2019-07-10 15:58:25 -05:00
R.B. Boyer	4bdb690a25	activate most discovery chain features in xDS for envoy (#6024 )	2019-07-01 22:10:51 -05:00
Paul Banks	9f656a2dc8	Fix envoy 1.10 exec (#5964 ) * Make exec test assert Envoy version - it was not rebuilding before and so often ran against wrong version. This makes 1.10 fail consistenty. * Switch Envoy exec to use a named pipe rather than FD magic since Envoy 1.10 doesn't support that. * Refactor to use an internal shim command for piping the bootstrap through. * Fmt. So sad that vscode golang fails so often these days. * go mod tidy * revert go mod tidy changes * Revert "ignore consul-exec tests until fixed (#5986)" This reverts commit `683262a686`. * Review cleanups	2019-06-21 16:06:25 +01:00
Alvin Huang	19f44cd1cc	remove container after docker run exits (#5798 )	2019-05-07 10:13:07 -04:00
Paul Banks	446abd7aa2	Make central conf test work when run in a suite. (#5767 ) * Make central conf test work when run in a suite. This switches integration tests to hard restart Consul each time which causes less surpise when some tests need to set configs that don't work on consul reload. This also increases the isolation and repeatability of the tests by dropping Consul's state entirely for each case run. * Remove aborted attempt to make restart optional.	2019-05-02 12:53:06 +01:00
Paul Banks	0cfb6051ea	Add integration test for central config; fix central config WIP (#5752 ) * Add integration test for central config; fix central config WIP * Add integration test for central config; fix central config WIP * Set proxy protocol correctly and begin adding upstream support * Add upstreams to service config cache key and start new notify watcher if they change. This doesn't update the tests to pass though. * Fix some merging logic get things working manually with a hack (TODO fix properly) * Simplification to not allow enabling sidecars centrally - it makes no sense without upstreams anyway * Test compile again and obvious ones pass. Lots of failures locally not debugged yet but may be flakes. Pushing up to see what CI does * Fix up service manageer and API test failures * Remove the enable command since it no longer makes much sense without being able to turn on sidecar proxies centrally * Remove version.go hack - will make integration test fail until release * Remove unused code from commands and upstream merge * Re-bump version to 1.5.0	2019-05-01 16:39:31 -07:00
Paul Banks	421ecd32fc	Connect: allow configuring Envoy for L7 Observability (#5558 ) * Add support for HTTP proxy listeners * Add customizable bootstrap configuration options * Debug logging for xDS AuthZ * Add Envoy Integration test suite with basic test coverage * Add envoy command tests to cover new cases * Add tracing integration test * Add gRPC support WIP * Merged changes from master Docker. get CI integration to work with same Dockerfile now * Make docker build optional for integration * Enable integration tests again! * http2 and grpc integration tests and fixes * Fix up command config tests * Store all container logs as artifacts in circle on fail * Add retries to outer part of stats measurements as we keep missing them in CI * Only dump logs on failing cases * Fix typos from code review * Review tidying and make tests pass again * Add debug logs to exec test. * Fix legit test failure caused by upstream rename in envoy config * Attempt to reduce cases of bad TLS handshake in CI integration tests * bring up the right service * Add prometheus integration test * Add test for denied AuthZ both HTTP and TCP * Try ANSI term for Circle	2019-04-29 17:27:57 +01:00

29 Commits