Add common blocking implementation details to docs (#5358)

* Add common blocking implementation details to docs

These come up over and over again with blocking query loops in our own code and third-party's. #5333 is possibly a case (unconfirmed) where "badly behaved" blocking clients cause issues, however since we've never explicitly documented these things it's not reasonable for third-party clients to have guessed that they are needed!

This hopefully gives us something to point to for the future.

It's a little wordy - happy to consider breaking some of the blocking stuff out of this page if we think it's appropriate but just wanted to quickly plaster over this gap in our docs for now.

* Update index.html.md

* Apply suggestions from code review

Co-Authored-By: banks <banks@banksco.de>

* Update index.html.md

* Update index.html.md

* Clearified monotonically

* Fixing formating
This commit is contained in:
Paul Banks 2019-02-21 21:33:45 +00:00 committed by kaitlincarter-hc
parent 00aa50cfa2
commit 360e3acc7c

View File

@ -77,6 +77,51 @@ to the supplied maximum `wait` time to spread out the wake up time of any
concurrent requests. This adds up to `wait / 16` additional time to the maximum
duration.
### Implementation Details
While the mechanism is relatively simple to work with, there are a few edge
cases that must be handled correctly.
* **Reset the index if it goes backwards**. While indexes in general are
monotonically increasing(i.e. they should only ever increase as time passes),
there are several real-world scenarios in
which they can go backwards for a given query. Implementations must check
to see if a returned index is lower than the previous value,
and if it is, should reset index to `0` - effectively restarting their blocking loop.
Failure to do so may cause the client to miss future updates for an unbounded
time, or to use an invalid index value that causes no blocking and increases
load on the servers. Cases where this can occur include:
* If a raft snapshot is restored on the servers with older version of the data.
* KV list operations where an item with the highest index is removed.
* A Consul upgrade changes the way watches work to optimize them with more
granular indexes.
* **Sanity check index is greater than zero**. After the initial request (or a
reset as above) the `X-Consul-Index` returned _should_ always be greater than zero. It
is a bug in Consul if it is not, however this has happened a few times and can
still be triggered on some older Consul versions. It's especially bad because it
causes blocking clients that are not aware to enter a busy loop, using excessive
client CPU and causing high load on servers. It is _always_ safe to use an
index of `1` to wait for updates when the data being requested doesn't exist
yet, so clients _should_ sanity check that their index is at least 1 after
each blocking response is handled to be sure they actually block on the next
request.
* **Rate limit**. The blocking query mechanism is reasonably efficient when updates
are relatively rare (order of tens of seconds to minutes between updates). In cases
where a result gets updated very fast however - possibly during an outage or incident
with a badly behaved client - blocking query loops degrade into busy loops that
consume excessive client CPU and cause high server load. While it's possible to just add a sleep
to every iteration of the loop, this is **not** recommended since it causes update
delivery to be delayed in the happy case, and it can exacerbate the problem since
it increases the chance that the index has changed on the next request. Clients
_should_ instead rate limit the loop so that in the happy case they proceed without
waiting, but when values start to churn quickly they degrade into polling at a
reasonable rate (say every 15 seconds). Ideally this is done with an algorithm that
allows a couple of quick successive deliveries before it starts to limit rate - a
[token bucket](https://en.wikipedia.org/wiki/Token_bucket) with burst of 2 is a simple
way to achieve this.
### Hash-based Blocking Queries
A limited number of agent endpoints also support blocking however because the