[Fix] Services sometimes not being synced with acl_enforce_version_8 = false (#4771)

Fixes: https://github.com/hashicorp/consul/issues/3676

This fixes a bug were registering an agent with a non-existent ACL token can prevent other 
services registered with a good token from being synced to the server when using 
`acl_enforce_version_8 = false`.

## Background

When `acl_enforce_version_8` is off the agent does not check the ACL token validity before 
storing the service in its state.
When syncing a service registered with a missing ACL token we fall into the default error 
handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255)
and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082)
without setting its Synced property to true like in the permission denied case.
This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random since we iterate on a map. So eventually
all services with good ACL tokens will be synced, this can however take some time and is influenced 
by the cluster size, the bigger the slower because retries are less frequent.
Having a service in this state also prevent all further sync of checks as they are done after
the services.

## Changes 

This change modify the sync process to continue even if there is an error. 
This fixes the issue described above as well as making the sync more error tolerant: if the server repeatedly refuses
a service (the ACL token could have been deleted by the time the service is synced, the servers 
were upgraded to a newer version that has more strict checks on the service definition...). 
Then all services and check that can be synced will, and those that don't will be marked as errors in 
the logs instead of blocking the whole process.
This commit is contained in:
Aestek 2019-01-04 16:01:50 +01:00 committed by Matt Keeler
parent fde5d75c68
commit 5960974db1

View File

@ -1210,7 +1210,7 @@ func (l *State) deleteService(id string) error {
l.logger.Printf("[INFO] agent: Deregistered service %q", id)
return nil
case acl.IsErrPermissionDenied(err):
case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
// todo(fs): mark the service to be in sync to prevent excessive retrying before next full sync
// todo(fs): some backoff strategy might be a better solution
l.services[id].InSync = true
@ -1248,7 +1248,7 @@ func (l *State) deleteCheck(id types.CheckID) error {
l.logger.Printf("[INFO] agent: Deregistered check %q", id)
return nil
case acl.IsErrPermissionDenied(err):
case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
// todo(fs): mark the check to be in sync to prevent excessive retrying before next full sync
// todo(fs): some backoff strategy might be a better solution
l.checks[id].InSync = true
@ -1316,7 +1316,7 @@ func (l *State) syncService(id string) error {
l.logger.Printf("[INFO] agent: Synced service %q", id)
return nil
case acl.IsErrPermissionDenied(err):
case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
// todo(fs): mark the service and the checks to be in sync to prevent excessive retrying before next full sync
// todo(fs): some backoff strategy might be a better solution
l.services[id].InSync = true
@ -1365,7 +1365,7 @@ func (l *State) syncCheck(id types.CheckID) error {
l.logger.Printf("[INFO] agent: Synced check %q", id)
return nil
case acl.IsErrPermissionDenied(err):
case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
// todo(fs): mark the check to be in sync to prevent excessive retrying before next full sync
// todo(fs): some backoff strategy might be a better solution
l.checks[id].InSync = true
@ -1397,7 +1397,7 @@ func (l *State) syncNodeInfo() error {
l.logger.Printf("[INFO] agent: Synced node info")
return nil
case acl.IsErrPermissionDenied(err):
case acl.IsErrPermissionDenied(err), acl.IsErrNotFound(err):
// todo(fs): mark the node info to be in sync to prevent excessive retrying before next full sync
// todo(fs): some backoff strategy might be a better solution
l.nodeInfoInSync = true