When the envoy healthy panic threshold was explicitly disabled as part
of L7 traffic management it changed how envoy decided to load balance to
endpoints in a cluster. This only matters when envoy is in "panic mode"
aka "when you have a bunch of unhealthy endpoints". Panic mode sends
traffic to unhealthy instances in certain circumstances.
Note: Prior to explicitly disabling the healthy panic threshold, the
default value is 50%.
What was happening is that the test harness was bringing up consul the
sidecars, and the service instances all at once and sometimes the
proxies wouldn't have time to be checked by consul to be labeled as
'passing' in the catalog before a round of EDS happened.
The xDS server in consul effectively queries /v1/health/connect/s2 and
gets 1 result, but that one result has a 'critical' check so the xDS
server sends back that endpoint labeled as UNHEALTHY.
Envoy sees that 100% of the endpoints in the cluster are unhealthy and
would enter panic mode and still send traffic to s2. This is why the
test suites PRIOR to disabling the healthy panic threshold worked. They
were _incorrectly_ passing.
When the healthy panic threshol is disabled, envoy never enters panic
mode in this situation and thus the cluster has zero healthy endpoints
so load balancing goes nowhere and the tests fail.
Why does this only affect the test suites for envoy 1.8.0? My guess is
that https://github.com/envoyproxy/envoy/pull/4442 was merged into the
1.9.x series and somehow that plays a role.
This PR modifies the bats scripts to explicitly wait until the upstream
sidecar is healthy as measured by /v1/health/connect/s2?passing BEFORE
trying to interrogate envoy which should make the tests less racy.
* Make cluster names SNI always
* Update some tests
* Ensure we check for prepared query types
* Use sni for route cluster names
* Proper mesh gateway mode defaulting when the discovery chain is used
* Ignore service splits from PatchSliceOfMaps
* Update some xds golden files for proper test output
* Allow for grpc/http listeners/cluster configs with the disco chain
* Update stats expectation
* Make exec test assert Envoy version - it was not rebuilding before and so often ran against wrong version. This makes 1.10 fail consistenty.
* Switch Envoy exec to use a named pipe rather than FD magic since Envoy 1.10 doesn't support that.
* Refactor to use an internal shim command for piping the bootstrap through.
* Fmt. So sad that vscode golang fails so often these days.
* go mod tidy
* revert go mod tidy changes
* Revert "ignore consul-exec tests until fixed (#5986)"
This reverts commit 683262a686.
* Review cleanups
* Upgrade xDS (go-control-plane) API to support Envoy 1.10.
This includes backwards compatibility shim to work around the ext_authz package rename in 1.10.
It also adds integration test support in CI for 1.10.0.
* Fix go vet complaints
* go mod vendor
* Update Envoy version info in docs
* Update website/source/docs/connect/proxies/envoy.md
* Make central conf test work when run in a suite.
This switches integration tests to hard restart Consul each time which causes less surpise when some tests need to set configs that don't work on consul reload. This also increases the isolation and repeatability of the tests by dropping Consul's state entirely for each case run.
* Remove aborted attempt to make restart optional.
* Add integration test for central config; fix central config WIP
* Add integration test for central config; fix central config WIP
* Set proxy protocol correctly and begin adding upstream support
* Add upstreams to service config cache key and start new notify watcher if they change.
This doesn't update the tests to pass though.
* Fix some merging logic get things working manually with a hack (TODO fix properly)
* Simplification to not allow enabling sidecars centrally - it makes no sense without upstreams anyway
* Test compile again and obvious ones pass. Lots of failures locally not debugged yet but may be flakes. Pushing up to see what CI does
* Fix up service manageer and API test failures
* Remove the enable command since it no longer makes much sense without being able to turn on sidecar proxies centrally
* Remove version.go hack - will make integration test fail until release
* Remove unused code from commands and upstream merge
* Re-bump version to 1.5.0
* Add support for HTTP proxy listeners
* Add customizable bootstrap configuration options
* Debug logging for xDS AuthZ
* Add Envoy Integration test suite with basic test coverage
* Add envoy command tests to cover new cases
* Add tracing integration test
* Add gRPC support WIP
* Merged changes from master Docker. get CI integration to work with same Dockerfile now
* Make docker build optional for integration
* Enable integration tests again!
* http2 and grpc integration tests and fixes
* Fix up command config tests
* Store all container logs as artifacts in circle on fail
* Add retries to outer part of stats measurements as we keep missing them in CI
* Only dump logs on failing cases
* Fix typos from code review
* Review tidying and make tests pass again
* Add debug logs to exec test.
* Fix legit test failure caused by upstream rename in envoy config
* Attempt to reduce cases of bad TLS handshake in CI integration tests
* bring up the right service
* Add prometheus integration test
* Add test for denied AuthZ both HTTP and TCP
* Try ANSI term for Circle
This patch removes the porter tool which hands out free ports from a
given range with a library which does the same thing. The challenge for
acquiring free ports in concurrent go test runs is that go packages are
tested concurrently and run in separate processes. There has to be some
inter-process synchronization in preventing processes allocating the
same ports.
freeport allocates blocks of ports from a range expected to be not in
heavy use and implements a system-wide mutex by binding to the first
port of that block for the lifetime of the application. Ports are then
provided sequentially from that block and are tested on localhost before
being returned as available.
* Testutil falls back to random ports w/o porter
This PR allows the testutil server to be used without porter.
* Adds sterner-sounding fallback comments.
* new config parser for agent
This patch implements a new config parser for the consul agent which
makes the following changes to the previous implementation:
* add HCL support
* all configuration fragments in tests and for default config are
expressed as HCL fragments
* HCL fragments can be provided on the command line so that they
can eventually replace the command line flags.
* HCL/JSON fragments are parsed into a temporary Config structure
which can be merged using reflection (all values are pointers).
The existing merge logic of overwrite for values and append
for slices has been preserved.
* A single builder process generates a typed runtime configuration
for the agent.
The new implementation is more strict and fails in the builder process
if no valid runtime configuration can be generated. Therefore,
additional validations in other parts of the code should be removed.
The builder also pre-computes all required network addresses so that no
address/port magic should be required where the configuration is used
and should therefore be removed.
* Upgrade github.com/hashicorp/hcl to support int64
* improve error messages
* fix directory permission test
* Fix rtt test
* Fix ForceLeave test
* Skip performance test for now until we know what to do
* Update github.com/hashicorp/memberlist to update log prefix
* Make memberlist use the default logger
* improve config error handling
* do not fail on non-existing data-dir
* experiment with non-uniform timeouts to get a handle on stalled leader elections
* Run tests for packages separately to eliminate the spurious port conflicts
* refactor private address detection and unify approach for ipv4 and ipv6.
Fixes#2825
* do not allow unix sockets for DNS
* improve bind and advertise addr error handling
* go through builder using test coverage
* minimal update to the docs
* more coverage tests fixed
* more tests
* fix makefile
* cleanup
* fix port conflicts with external port server 'porter'
* stop test server on error
* do not run api test that change global ENV concurrently with the other tests
* Run remaining api tests concurrently
* no need for retry with the port number service
* monkey patch race condition in go-sockaddr until we understand why that fails
* monkey patch hcl decoder race condidtion until we understand why that fails
* monkey patch spurious errors in strings.EqualFold from here
* add test for hcl decoder race condition. Run with go test -parallel 128
* Increase timeout again
* cleanup
* don't log port allocations by default
* use base command arg parsing to format help output properly
* handle -dc deprecation case in Build
* switch autopilot.max_trailing_logs to int
* remove duplicate test case
* remove unused methods
* remove comments about flag/config value inconsistencies
* switch got and want around since the error message was misleading.
* Removes a stray debug log.
* Removes a stray newline in imports.
* Fixes TestACL_Version8.
* Runs go fmt.
* Adds a default case for unknown address types.
* Reoders and reformats some imports.
* Adds some comments and fixes typos.
* Reorders imports.
* add unix socket support for dns later
* drop all deprecated flags and arguments
* fix wrong field name
* remove stray node-id file
* drop unnecessary patch section in test
* drop duplicate test
* add test for LeaveOnTerm and SkipLeaveOnInt in client mode
* drop "bla" and add clarifying comment for the test
* split up tests to support enterprise/non-enterprise tests
* drop raft multiplier and derive values during build phase
* sanitize runtime config reflectively and add test
* detect invalid config fields
* fix tests with invalid config fields
* use different values for wan sanitiziation test
* drop recursor in favor of recursors
* allow dns_config.udp_answer_limit to be zero
* make sure tests run on machines with multiple ips
* Fix failing tests in a few more places by providing a bind address in the test
* Gets rid of skipped TestAgent_CheckPerformanceSettings and adds case for builder.
* Add porter to server_test.go to make tests there less flaky
* go fmt
* Updates Raft library to get new snapshot/restore API.
* Basic backup and restore working, but need some cleanup.
* Breaks out a snapshot module and adds a SHA256 integrity check.
* Adds snapshot ACL and fills in some missing comments.
* Require a consistent read for snapshots.
* Make sure snapshot works if ACLs aren't enabled.
* Adds a bit of package documentation.
* Returns an empty response from restore to avoid EOF errors.
* Adds API client support for snapshots.
* Makes internal file names match on-disk file snapshots.
* Adds DC and token coverage for snapshot API test.
* Adds missing documentation.
* Adds a unit test for the snapshot client endpoint.
* Moves the connection pool out of the client for easier testing.
* Fixes an incidental issue in the prepared query unit test.
I realized I had two servers in bootstrap mode so this wasn't a good setup.
* Adds a half close to the TCP stream and fixes panic on error.
* Adds client and endpoint tests for snapshots.
* Moves the pool back into the snapshot RPC client.
* Adds a TLS test and fixes half-closes for TLS connections.
* Tweaks some comments.
* Adds a low-level snapshot test.
This is independent of Consul so we can pull this out into a library
later if we want to.
* Cleans up snapshot and archive and completes archive tests.
* Sends a clear error for snapshot operations in dev mode.
Snapshots require the Raft snapshots to be readable, which isn't supported
in dev mode. Send a clear error instead of a deep-down Raft one.
* Adds docs for the snapshot endpoint.
* Adds a stale mode and index feedback for snapshot saves.
This gives folks a way to extract data even if the cluster has no
leader.
* Changes the internal format of a snapshot from zip to tgz.
* Pulls in Raft fix to cancel inflight before a restore.
* Pulls in new Raft restore interface.
* Adds metadata to snapshot saves and a verify function.
* Adds basic save and restore snapshot CLI commands.
* Gets rid of tarball extensions and adds restore message.
* Fixes an incidental bad link in the KV docs.
* Adds documentation for the snapshot CLI commands.
* Scuttle any request body when a snapshot is saved.
* Fixes archive unit test error message check.
* Allows for nil output writers in snapshot RPC handlers.
* Renames hash list Decode to DecodeAndVerify.
* Closes the client connection for snapshot ops.
* Lowers timeout for restore ops.
* Updates Raft vendor to get new Restore signature and integrates with Consul.
* Bounces the leader's internal state when we do a restore.