This new package provides a client agent implementation of an interface
for fetching the health of services.
This approach has a number of benefits:
1. It provides a much more explicit interface. Instead of everything
dependency on `RPC()` and `Cache.Get()` for many unrelated things
they can depend on a type that are named according to the behaviour
it provides.
2. It gives us a single place to vary the behaviour and migrate to
a new form of RPC (gRPC). The current implementation has two options
(cache, or direct RPC), and in the future we will have more.
It is also a great opporunity to start adding `context.Context` args
to these operations, which in the future will allow us to cancel
the operations.
3. As a concequence of the first, in the Server agent where we make
these calls we can replace the current in-memory RPC calls with
a thin adapter for the real method. This removes the `net/rpc`
machinery from the call in places where it is not needed.
This new package is quite small right now, but I think we can expect it
to grow to a more reasonable size as other RPC calls are replaced.
This change also happens to replace two very similar implementations with
a single implementation.
Previsouly it was done in Agent.Start, which is much later then it needs to be.
The new 'dns' package was required, because otherwise there would be an
import cycle. In the future we should move more of the dns server into
the dns package.
Previously, we were only returning a single ListenerPort for a single
service. However, we actually allow a single service to be serviced over
multiple ports, as well as allow users to define what hostnames they
expect their services to be contacted over. When no hosts are defined,
we return the default ingress domain for any configured DNS domain.
To show this in the UI, we modify the gateway-services-nodes API to
return a GatewayConfig.Addresses field, which is a list of addresses
over which the specific service can be contacted.
This will allow to increase cache value when DC is not valid (aka
return SOA to avoid too many consecutive requests) and will
distinguish DC being temporarily not available from DC not existing.
Implements https://github.com/hashicorp/consul/issues/8102
Blocking queries issues will still be uncancellable (that cannot be helped until we get rid of net/rpc). However this makes it so that if calling getWithIndex (like during a cache Notify go routine) we can cancell the outer routine. Previously it would keep issuing more blocking queries until the result state actually changed.
* Implements a simple, tcp ingress gateway workflow
This adds a new type of gateway for allowing Ingress traffic into Connect from external services.
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
* Use consts for well known tagged adress keys
* Add ipv4 and ipv6 tagged addresses for node lan and wan
* Add ipv4 and ipv6 tagged addresses for service lan and wan
* Use IPv4 and IPv6 address in DNS
* Add updated github.com/miekg/dns to go modules
* Add updated github.com/miekg/dns to vendor
* Fix github.com/miekg/dns api breakage
* Decrease size when trimming UDP packets
Need more room for the header(?), if we don't decrease the size we get an
"overflow unpacking uint32" from the dns library
* Fix dns truncate tests with api changes
* Make windows build working again. Upgrade x/sys and x/crypto and vendor
This upgrade is needed because of API breakage in x/sys introduced
by the minimal x/sys dependency of miekg/dns
This API breakage has been fixed in commit
855e68c859
Refactor dns to have same behavior between A and SRV.
Current implementation returns the node name instead of the service
address.
With this fix when querying for SRV record service address is return in
the SRV record.
And when performing a simple dns lookup it returns a CNAME to the
service address.
This allows addresses to be tagged at the service level similar to what we allow for nodes already. The address translation that can be enabled with the `translate_wan_addrs` config was updated to take these new addresses into account as well.
The DNS config parameters `recursors` and `dns_config.*` are now hot
reloaded on SIGHUP or `consul reload` and do not need an agent restart
to be modified.
Config is stored in an atomic.Value and loaded at the beginning of each
request. Reloading only affects requests that start _after_ the
reload. Ongoing requests are not affected. To match the current
behavior the recursor handler is loaded and unloaded as needed on config
reload.
* Fix race condition in DNS when using cache
The healty node filtering was modifying the result from the cache, which
caused a crash when multiple queries were made to the same service
simultaneously.
We now copy the node slice before filtering to ensure we do not modify
the data stored in the cache.
* Fix wording in dns cache config doc
s/dns_max_age/cache_max_age/
Adds two new configuration parameters "dns_config.use_cache" and
"dns_config.cache_max_age" controlling how DNS requests use the agent
cache when querying servers.
* Avoid to have infinite recursion in DNS lookups when resolving CNAMEs
This will avoid killing Consul when a Service.Address is using CNAME
to a Consul CNAME that creates an infinite recursion.
This will fix https://github.com/hashicorp/consul/issues/4907
* Use maxRecursionLevel = 3 to allow several recursions
* bugfix: use ServiceTags to generate cahce key hash
* update unit test
* update
* remote print log
* Update .gitignore
* Completely deprecate ServiceTag field internally for clarity
* Add explicit test for CacheInfo cases
This implements parts of RFC 7871 where Consul is acting as an authoritative name server (or forwarding resolver when recursors are configured)
If ECS opt is present in the request we will mirror it back and return a response with a scope of 0 (global) or with the same prefix length as the request (indicating its valid specifically for that subnet).
We only mirror the prefix-length (non-global) for prepared queries as those could potentially use nearness checks that could be affected by the subnet. In the future we could get more sophisticated with determining the scope bits and allow for better caching of prepared queries that don’t rely on nearness checks.
The other thing this does not do is implement the part of the ECS RFC related to originating ECS headers when acting as a intermediate DNS server (forwarding resolver). That would take a quite a bit more effort and in general provide very little value. Consul will currently forward the ECS headers between recursors and the clients transparently, we just don't originate them for non-ECS clients to get potentially more accurate "location aware" results.
* Implementation of Weights Data structures
Adding this datastructure will allow us to resolve the
issues #1088 and #4198
This new structure defaults to values:
```
{ Passing: 1, Warning: 0 }
```
Which means, use weight of 0 for a Service in Warning State
while use Weight 1 for a Healthy Service.
Thus it remains compatible with previous Consul versions.
* Implemented weights for DNS SRV Records
* DNS properly support agents with weight support while server does not (backwards compatibility)
* Use Warning value of Weights of 1 by default
When using DNS interface with only_passing = false, all nodes
with non-Critical healthcheck used to have a weight value of 1.
While having weight.Warning = 0 as default value, this is probably
a bad idea as it breaks ascending compatibility.
Thus, we put a default value of 1 to be consistent with existing behaviour.
* Added documentation for new weight field in service description
* Better documentation about weights as suggested by @banks
* Return weight = 1 for unknown Check states as suggested by @banks
* Fixed typo (of -> or) in error message as requested by @mkeeler
* Fixed unstable unit test TestRetryJoin
* Fixed unstable tests
* Fixed wrong Fatalf format in `testrpc/wait.go`
* Added notes regarding DNS SRV lookup limitations regarding number of instances
* Documentation fixes and clarification regarding SRV records with weights as requested by @banks
* Rephrase docs
* Fixes the DNS recursor properly resolving the requests
* Added a test case for the recursor bug
* Refactored code && added a test case for all failing recursors
* Inner indentation moved into else if check
This also changes where the enforcement of the enable_additional_node_meta_txt configuration gets applied.
formatNodeRecord returns the main RRs and the meta/TXT RRs in separate slices. Its then up to the caller to add to the appropriate sections or not.
This just makes sure that if multiple services are registered with unique service addresses that we don’t blast back multiple CNAMEs for the same service DNS name and keeps us within the DNS specs.
Will fix https://github.com/hashicorp/consul/issues/4036
Instead of removing one by one the entries, find the optimal
size using binary search.
For SRV records, with 5k nodes, duration of DNS lookups is
divided by 4 or more.
Update docs a little
Update/add tests. Make sure all the various ways of determining the source IP work
Update X-Forwarded-For header parsing. This can be a comma separated list with the first element being the original IP so we now handle csv data there.
Got rid of error return from sourceAddrFromRequest
Queries to the DNS server can contain an optional datacenter
name in the query name. You can query for 'foo.service.consul'
or 'foo.service.dc.consul' to get a response for either the
default or a specific datacenter.
Datacenter names cannot have dots, therefore the datacenter
name can refer to only one element in the DNS query name.
The DNS server allowed extra labels between the optional
datacenter name and the domain and returned a valid response
instead of returning NXDOMAIN. For example, if the domain
is set to '.consul' then 'foo.service.dc1.extra.consul'
should return NXDOMAIN because of 'extra' being between
the datacenter name 'dc1' and the domain '.consul'.
Fixes#3200
* new config parser for agent
This patch implements a new config parser for the consul agent which
makes the following changes to the previous implementation:
* add HCL support
* all configuration fragments in tests and for default config are
expressed as HCL fragments
* HCL fragments can be provided on the command line so that they
can eventually replace the command line flags.
* HCL/JSON fragments are parsed into a temporary Config structure
which can be merged using reflection (all values are pointers).
The existing merge logic of overwrite for values and append
for slices has been preserved.
* A single builder process generates a typed runtime configuration
for the agent.
The new implementation is more strict and fails in the builder process
if no valid runtime configuration can be generated. Therefore,
additional validations in other parts of the code should be removed.
The builder also pre-computes all required network addresses so that no
address/port magic should be required where the configuration is used
and should therefore be removed.
* Upgrade github.com/hashicorp/hcl to support int64
* improve error messages
* fix directory permission test
* Fix rtt test
* Fix ForceLeave test
* Skip performance test for now until we know what to do
* Update github.com/hashicorp/memberlist to update log prefix
* Make memberlist use the default logger
* improve config error handling
* do not fail on non-existing data-dir
* experiment with non-uniform timeouts to get a handle on stalled leader elections
* Run tests for packages separately to eliminate the spurious port conflicts
* refactor private address detection and unify approach for ipv4 and ipv6.
Fixes#2825
* do not allow unix sockets for DNS
* improve bind and advertise addr error handling
* go through builder using test coverage
* minimal update to the docs
* more coverage tests fixed
* more tests
* fix makefile
* cleanup
* fix port conflicts with external port server 'porter'
* stop test server on error
* do not run api test that change global ENV concurrently with the other tests
* Run remaining api tests concurrently
* no need for retry with the port number service
* monkey patch race condition in go-sockaddr until we understand why that fails
* monkey patch hcl decoder race condidtion until we understand why that fails
* monkey patch spurious errors in strings.EqualFold from here
* add test for hcl decoder race condition. Run with go test -parallel 128
* Increase timeout again
* cleanup
* don't log port allocations by default
* use base command arg parsing to format help output properly
* handle -dc deprecation case in Build
* switch autopilot.max_trailing_logs to int
* remove duplicate test case
* remove unused methods
* remove comments about flag/config value inconsistencies
* switch got and want around since the error message was misleading.
* Removes a stray debug log.
* Removes a stray newline in imports.
* Fixes TestACL_Version8.
* Runs go fmt.
* Adds a default case for unknown address types.
* Reoders and reformats some imports.
* Adds some comments and fixes typos.
* Reorders imports.
* add unix socket support for dns later
* drop all deprecated flags and arguments
* fix wrong field name
* remove stray node-id file
* drop unnecessary patch section in test
* drop duplicate test
* add test for LeaveOnTerm and SkipLeaveOnInt in client mode
* drop "bla" and add clarifying comment for the test
* split up tests to support enterprise/non-enterprise tests
* drop raft multiplier and derive values during build phase
* sanitize runtime config reflectively and add test
* detect invalid config fields
* fix tests with invalid config fields
* use different values for wan sanitiziation test
* drop recursor in favor of recursors
* allow dns_config.udp_answer_limit to be zero
* make sure tests run on machines with multiple ips
* Fix failing tests in a few more places by providing a bind address in the test
* Gets rid of skipped TestAgent_CheckPerformanceSettings and adds case for builder.
* Add porter to server_test.go to make tests there less flaky
* go fmt
This patch replaces the code which determines the list of servers in the
current cluster with an RPC call to get the list of active consul
service instances which only run on servers.
This replaces the previous implementation which was more complex and
relied on serf messages which can provide a different view than the
consistent response from the raft log.
As a side effect it makes the implementation independent of the server
and the agent which means it works consistently across both. Different
behavior for server and agent was the root cause for the bug in
http://github.com/hashicorp/consul/issue/3047.
Fixes#3407