mirror of
https://github.com/status-im/consul.git
synced 2025-01-26 13:40:20 +00:00
72f991d8d3
* agent: remove agent cache dependency from service mesh leaf certificate management

This extracts the leaf cert management from within the agent cache.

This code was produced by the following process:

1. All tests in agent/cache, agent/cache-types, agent/auto-config, agent/consul/servercert were run at each stage.
    - The tests in agent matching .*Leaf were run at each stage.
    - The tests in agent/leafcert were run at each stage after they existed.
2. The former leaf cert Fetch implementation was extracted into a new package behind a "fake RPC" endpoint to make it look almost like all other cache type internals.
3. The old cache type was shimmed to use the fake RPC endpoint and generally cleaned up.
4. I selectively duplicated all of Get/Notify/NotifyCallback/Prepopulate from the agent/cache.Cache implementation over into the new package. This was renamed as leafcert.Manager.
    - Code that was irrelevant to the leaf cert type was deleted (inlining blocking=true, refresh=false)
5. Everything that used the leaf cert cache type (including proxycfg stuff) was shifted to use the leafcert.Manager instead.
6. agent/cache-types tests were moved and gently replumbed to execute as-is against a leafcert.Manager.
7. Inspired by some of the locking changes from derek's branch I split the fat lock into N+1 locks.
8. The waiter chan struct{} was eventually replaced with a singleflight.Group around cache updates, which was likely the biggest net structural change.
9. The awkward two layers of logic produced as a byproduct of marrying the agent cache management code with the leaf cert type code were slowly coalesced and flattened to remove confusion.
10. The .*Leaf tests from the agent package were copied and made to work directly against a leafcert.Manager to increase direct coverage.

I have done a best effort attempt to port the previous leaf-cert cache type's tests over in spirit, as well as to take the e2e-ish tests in the agent package with Leaf in the test name and copy those into the agent/leafcert package to get more direct coverage, rather than coverage tangled up in the agent logic. There is no net-new test coverage, just coverage that was pushed around from elsewhere.
64 lines
2.5 KiB
Go
package leafcert

import (
	"time"

	"github.com/hashicorp/consul/agent/structs"
)

// calculateSoftExpiry encapsulates our logic for when to renew a cert based on
// its age. It returns a pair of times min, max which makes it easier to test
// the logic without non-deterministic jitter to account for. The caller should
// choose a time randomly in between these.
//
// We want to balance a few factors here:
//   - renew too early and it increases the aggregate CSR rate in the cluster
//   - renew too late and it risks disruption to the service if a transient
//     error prevents the renewal
//   - we want a broad amount of jitter so if there is an outage, we don't end
//     up with all services in sync and causing a thundering herd every
//     renewal period. Broader is better for smoothing requests but pushes
//     both earlier and later tradeoffs above.
//
// Somewhat arbitrarily the current strategy looks like this:
//
//	         0                              60%             90%
//	  Issued [------------------------------|===============|!!!!!] Expires
//	72h TTL: 0                             ~43h            ~65h
//	 1h TTL: 0                              36m             54m
//
// Where |===| is the soft renewal period where we jitter for the first attempt
// and |!!!| is the danger zone where we just try immediately.
//
// In the happy path (no outages) the average renewal occurs halfway through
// the soft renewal region, or at 75% of the cert lifetime, which is ~54 hours
// for a 72 hour cert, or 45 mins for a 1 hour cert.
//
// If we are already in the softRenewal period, we randomly pick a time between
// now and the start of the danger zone.
//
// We pass in now to make testing easier.
func calculateSoftExpiry(now time.Time, cert *structs.IssuedCert) (min time.Time, max time.Time) {
	certLifetime := cert.ValidBefore.Sub(cert.ValidAfter)
	if certLifetime < 10*time.Minute {
		// Shouldn't happen as we limit to 1 hour shortest elsewhere but just be
		// defensive against strange times or bugs.
		return now, now
	}

	// Find the 60% and 90% marks in the diagram above
	softRenewTime := cert.ValidAfter.Add(time.Duration(float64(certLifetime) * 0.6))
	hardRenewTime := cert.ValidAfter.Add(time.Duration(float64(certLifetime) * 0.9))

	if now.After(hardRenewTime) {
		// In the hard renew period, or already expired. Renew now!
		return now, now
	}

	if now.After(softRenewTime) {
		// Already in the soft renew period, make now the lower bound for jitter
		softRenewTime = now
	}
	return softRenewTime, hardRenewTime
}