This commit adds a monotonically increasing nonce to include in peering
replication response messages. Every ack/nack from the peer handling a
response will include this nonce, allowing to correlate the ack/nack
with a specific resource.
At the moment nothing is done with the nonce when it is received. In the
future we may want to add functionality such as retries on NACKs,
depending on the class of error.
This commit adds handling so that the replication stream considers
whether the user intends to peer through mesh gateways.
The subscription will return server or mesh gateway addresses depending
on the mesh configuration setting. These watches can be updated at
runtime by modifying the mesh config entry.
Previously establishment and pending secrets were only checked at the
RPC layer. However, given that these are Check-and-Set transactions we
should ensure that the given secrets are still valid when persisting a
secret exchange or promotion.
Otherwise it would be possible for concurrent requests to overwrite each
other.
Previously there was a field indicating the operation that triggered a
secrets write. Now there is a message for each operation and it contains
the secret ID being persisted.
Previously the updates to the peering secrets UUID table relied on
inferring what action triggered the update based on a reconciliation
against the existing secrets.
Instead we now explicitly require the operation to be given so that the
inference isn't necessary. This makes the UUID table logic easier to
reason about and fixes some related bugs.
There is also an update so that the peering secrets get handled on
snapshots/restores.
* Avoid logging StreamSecretID
* Wrap additional errors in stream handler
* Fix flakiness in leader test and rename servers for clarity. There was
a race condition where the peering was being deleted in the test
before the stream was active. Now the test waits for the stream to be
connected on both sides before deleting the associated peering.
* Run flaky test serially
When we receive a FailedPrecondition error, retry that more quickly
because we expect it will resolve shortly. This is particularly
important in the context of Consul servers behind a load balancer
because when establishing a connection we have to retry until we
randomly land on a leader node.
The default retry backoff goes from 2s, 4s, 8s, etc. which can result in
very long delays quite quickly. Instead, this backoff retries in 8ms
five times, then goes exponentially from there: 16ms, 32ms, ... up to a
max of 8152ms.
This mimics xDS's discovery protocol where you must request a resource
explicitly for the exporting side to send those events to you.
As part of this I aligned the overall ResourceURL with the TypeURL that
gets embedded into the encoded protobuf Any construct. The
CheckServiceNodes is now wrapped in a better named "ExportedService"
struct now.
* peerstream: dialer should reconnect when stream closes
If the stream is closed unexpectedly (i.e. when we haven't received
a terminated message), the dialer should attempt to re-establish the
stream.
Previously, the `HandleStream` would return `nil` when the stream
was closed. The caller then assumed the stream was terminated on purpose
and so didn't reconnect when instead it was stopped unexpectedly and
the dialer should have attempted to reconnect.
Previously, public referred to gRPC services that are both exposed on
the dedicated gRPC port and have their definitions in the proto-public
directory (so were considered usable by 3rd parties). Whereas private
referred to services on the multiplexed server port that are only usable
by agents and other servers.
Now, we're splitting these definitions, such that external/internal
refers to the port and public/private refers to whether they can be used
by 3rd parties.
This is necessary because the peering replication API needs to be
exposed on the dedicated port, but is not (yet) suitable for use by 3rd
parties.