Improve network automation resiliency by enabling high availability for Consul-Terraform-Sync. HA enables persistent task and event data so that CTS functions as expected during a failover event.
This topic describes how to run Consul-Terraform-Sync (CTS) configured for high availability. High availability is an enterprise capability that ensures that all changes to Consul that occur during a failover transition are processed and that CTS continues to operate as expected.
A network always has exactly one instance of the CTS cluster that is the designated leader. The leader is responsible for monitoring and running tasks. If the leader fails, CTS triggers the following process when it is configured for high availability:
1. The new leader begins running all existing tasks in `once-mode` in order to process changes that occurred during the failover transition period. In this mode, CTS runs all existing tasks one time.
1. The new leader logs any errors that occur during `once-mode` operation and the new leader continues to monitor Consul for changes.
In a standard configuration, CTS exits if errors occur when the CTS instance runs tasks in `once-mode`. In a high availability configuration, CTS logs the errors and continues to operate without interruption.
The following diagram shows operating state when high availability is enabled. CTS Instance A is the current leader and is responsible for monitoring and running tasks:
The following diagram shows the CTS cluster state after the leader stops. CTS Instance B becomes the leader responsible for monitoring and running tasks.
- The time it takes for a new leader to be elected is determined by the `high_availability.cluster.storage.session_ttl` configuration. The minimum failover time is equal to the `session_ttl` value. The maximum failover time is double the `session_ttl` value.
- If failover occurs during task execution, a new leader is elected. The new leader will attempt to run all tasks once before continuing to monitor for changes.
- If using the [HCP Terraform driver](/consul/docs/nia/network-drivers/terraform-cloud), the task finishes and CTS starts a new leader that attempts to queue a run for each task in HCP Terraform in once-mode.
- If using [Terraform driver](/consul/docs/nia/network-drivers/terraform), the task may complete depending on the cause of the failover. The new leader starts and attempts to run each task in [once-mode](/consul/docs/nia/cli/start#modes). Depending on the module and provider, the task may require manual intervention to fix any inconsistencies between the infrastructure and Terraform state.
We recommend specifying the [HCP Terraform driver](/consul/docs/nia/network-drivers/terraform-cloud) in your CTS configuration if you want to run in high availability mode.
Add the `high_availability` block in your CTS configuration and configure the required settings to enable high availability. Refer to the [Configuration reference](/consul/docs/nia/configuration#high-availability) for details about the configuration fields for the `high_availability` block.
The `session` and `keys` resources in your Consul environment must have `write` permissions. Refer to the [ACL documentation](/consul/docs/security/acl) for details on how to define ACL policies.
If the `high_availability.cluster.storage.namespace` field is configured, then your ACL policy must also enable `write` permissions for the `namespace` resource.
1. Create an HCL configuration file that includes the settings you want to include, including the `high_availability` block. Refer to [Configuration Options for Consul-Terraform-Sync](/consul/docs/nia/configuration) for all configuration options.
1. Issue the startup command and pass the configuration file. Refer to the [`start` command reference](/consul/docs/nia/cli/start#modes) for additional information about CTS startup modes.
1. You can call the `/status` API endpoint to verify the status of tasks CTS is configured to monitor. Only the leader of the cluster will return a successful response. Refer to the [`/status` API reference documentation](/consul/docs/nia/api/status) for information about usage and responses.
Repeat the procedure to start the remaining instances for your cluster. We recommend using near-identical configurations for all instances in your cluster. You may not be able to use exact configurations in all cases, but starting instances with the same configuration improves consistency and reduces confusion if you need to troubleshoot errors.
You can implement a rolling update to update a non-task configuration for a CTS instance, such as the Consul connection settings. If you need to update a task in the instance configuration, refer to [Modify tasks](#modify-tasks).
1. Identify the leader CTS instance by either making a call to the [`status/cluster` API endpoint](/consul/docs/nia/api/status#cluster-status) or by checking the logs for the following entry:
When high availability is enabled, CTS persists task and event data. Refer to [State storage and persistence](/consul/docs/nia/architecture#state-storage-and-persistence) for additional information.
You can use the following methods for modifying tasks when high availability is enabled. We recommend choosing a single method to make all task configuration changes because inconsistencies between the state and the configuration can occur when mixing methods.
We recommend deleting and recreating a task if you need to make a modification. Use the CTS API to identify the CTS leader instance and replace a task.
1. Identify the leader CTS instance by either making a call to the [`status/cluster` API endpoint](/consul/docs/nia/api/status#cluster-status) or by checking the logs for the following entry:
1. Send a `DELETE` call to the [`/task/<task-name>` endpoint](/consul/docs/nia/api/tasks#delete-task) to delete the task. In the following example, the leader instance is at `localhost:8558`:
You can restart the CTS cluster using the [`-reset-storage` flag](/consul/docs/nia/cli/start#options) to discard persisted data if you need to update a task.
1. Restart the instance you restarted in step 3 without the `-reset-storage` flag so that it starts up with the current state. If you continue to run an instance with the `-reset-storage` flag enabled, then CTS will reset the state data whenever the instance becomes the leader.
Use the following troubleshooting procedure if a previous leader had been running a task successfully but the new leader logs an error after a failover:
1. Check the logs printed to the console for errors. Refer to the [`syslog` configuration](/consul/docs/nia/configuration#syslog) for information on how to locate the logs. In the following example output, CTS reported a `401: Bad credentials` error:
| with module.config-task.provider["registry.terraform.io/integrations/github"],
| on .terraform/modules/config-task/main.tf line 11, in provider "github":
| 11: provider "github" {
|
```
1. Check for differences between the previous leader and new leader, such as differences in configurations, environment variables, and local resources.
1. Start a new instance with the fix that resolves the issue.