The observed bug was that a full restart of a consul datacenter (servers
and clients) in conjunction with a restart of a connect-flavored
application with bring-your-own-service-registration logic would very
frequently cause the envoy sidecar service check to never reflect the
aliased service.
Over the course of investigation several bugs and unfortunate
interactions were corrected:
(1)
local.CheckState objects were only shallow copied, but the key piece of
data that gets read and updated is one of the things not copied (the
underlying Check with a Status field). When the stock code was run with
the race detector enabled this highly-relevant-to-the-test-scenario field
was found to be racy.
Changes:
a) update the existing Clone method to include the Check field
b) copy-on-write when those fields need to change rather than
incrementally updating them in place.
This made the observed behavior occur slightly less often.
(2)
If anything about how the runLocal method for node-local alias check
logic was ever flawed, there was no fallback option. Those checks are
purely edge-triggered and failure to properly notice a single edge
transition would leave the alias check incorrect until the next flap of
the aliased check.
The change was to introduce a fallback timer to act as a control loop to
double check the alias check matches the aliased check every minute
(borrowing the duration from the non-local alias check logic body).
This made the observed behavior eventually go away when it did occur.
(3)
Originally I thought there were two main actions involved in the data race:
A. The act of adding the original check (from disk recovery) and its
first health evaluation.
B. The act of the HTTP API requests coming in and resetting the local
state when re-registering the same services and checks.
It took awhile for me to realize that there's a third action at work:
C. The goroutines associated with the original check and the later
checks.
The actual sequence of actions that was causing the bad behavior was
that the API actions result in the original check to be removed and
re-added _without waiting for the original goroutine to terminate_. This
means for brief windows of time during check definition edits there are
two goroutines that can be sending updates for the alias check status.
In extremely unlikely scenarios the original goroutine sees the aliased
check start up in `critical` before being removed but does not get the
notification about the nearly immediate update of that check to
`passing`.
This is interlaced wit the new goroutine coming up, initializing its
base case to `passing` from the current state and then listening for new
notifications of edge triggers.
If the original goroutine "finishes" its update, it then commits one
more write into the local state of `critical` and exits leaving the
alias check no longer reflecting the underlying check.
The correction here is to enforce that the old goroutines must terminate
before spawning the new one for alias checks.
* Improve startup message to avoid confusing users when no error occurs
Several times, some users not very familiar with Consul get confused
by error message at startup:
`[INFO] agent: (LAN) joined: 1 Err: <nil>`
Having `Err: <nil>` seems weird to many users, I propose to have the
following instead:
* Success: `[INFO] agent: (LAN) joined: 1`
* Error: `[WARN] agent: (LAN) couldn't join: %d Err: ERROR`
Logic updated to evaluate with a boolean after the type assertion.
This allows us to check if the type assertion succeeded and be
more clear with errors.
* bump middleman-hashicorp to 0.3.40 and exclude guide rendering
* add notes to Makefile for volume mounts hack PR#5847
* make note of the PR number in the comment
1. Amends our `base` animation placeholder to always reset
transition-duration. This has no effect on other components that are
already using this animation.
2. Adds a confirming class whenever a key is pressed, to show the green
tick. Uses CSS via `transition-delay` for debouncing.
Reference: https://github.com/hashicorp/consul/issues/4090
Examples covering a variety of potential use cases. Verified via `sockaddr eval` and `console agent -bind` on a test machine:
```console
# Baseline
$ sockaddr eval 'GetAllInterfaces'
[127.0.0.1/8 {1 65536 lo up|loopback} ::1 {1 65536 lo up|loopback} 10.0.0.10/8 {2 1500 eth0 b8:27:eb:7b:36:95 up|broadcast|multicast} fe80::12dc:5e4d:8ff8:2d96/64 {2 1500 eth0 b8:27:eb:7b:36:95 up|broadcast|multicast} 192.168.1.10/24 {3 1500 wlan0 b8:27:eb:2e:63:c0 up|broadcast|multicast} fe80::b6dc:5758:c306:b15b/64 {3 1500 wlan0 b8:27:eb:2e:63:c0 up|broadcast|multicast}]
# Using address within a specific CIDR
$ sockaddr eval 'GetPrivateInterfaces | include "network" "10.0.0.0/8" | attr "address"'
10.0.0.10
# Using a static network interface name
$ sockaddr eval 'GetInterfaceIP "eth0"'
10.0.0.10
# Using regular expression matching for network interface name that is forwardable and up
$ sockaddr eval 'GetAllInterfaces | include "name" "^eth" | include "flags" "forwardable|up" | attr "address"'
10.0.0.10
```
https://github.com/hashicorp/consul/issues/2121https://www.freedesktop.org/software/systemd/man/systemd.service.html
When set to notify, systemd will not attempt to start any dependent
services until after consul sends the notify signal. This is useful
in cases where there services that rely on the local consul agent
to be up and functional before they can start. The default is simple,
which will immediately mark the service as up and functioning even
if consul has not yet joined the cluster and has started listening
for connnections.