Manage failovers
Trigger a failover
You can trigger a failover manually using the Temporal Cloud Web UI, the tcld CLI, or the Cloud Ops API.
Always check the replication lag before initiating a failover. Forcing a failover while replication lag is significant makes it more likely that Workflow progress is rolled back.
- Web UI
- tcld
- Cloud Ops API
- Visit the Namespace page on the Temporal Cloud Web UI.
- Navigate to your Namespace details page and select the Trigger a failover option from the menu.
- Confirm your action. After confirmation, Temporal initiates the failover.
To manually trigger a failover, run the following command in your terminal:
tcld namespace failover \
--namespace <namespace_id>.<account_id> \
--region <target_region>
The <target_region> must be the name of a region (example: us-east-1) where the Namespace has a replica that is
ready to be failed over to (replica state is Activated).
If using API key authentication with the --api-key flag, you must add it directly after the tcld command and before
namespace failover.
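The flag ordering described above can be sketched as a small Python helper that assembles the argument list for the command. This is an illustration only: the function name `build_tcld_failover_argv` and the sample values are hypothetical, and the sketch uses only the flags shown in this section (`--api-key`, `--namespace`, `--region`).

```python
import shlex


def build_tcld_failover_argv(namespace_id, account_id, target_region, api_key=None):
    """Return the argument list for a `tcld namespace failover` call.

    Hypothetical helper: it only assembles the flags documented above.
    """
    argv = ["tcld"]
    if api_key is not None:
        # --api-key goes directly after `tcld`, before `namespace failover`.
        argv += ["--api-key", api_key]
    argv += [
        "namespace", "failover",
        "--namespace", f"{namespace_id}.{account_id}",
        "--region", target_region,
    ]
    return argv


# Placeholder values, not real identifiers or credentials.
argv = build_tcld_failover_argv("my-namespace", "a1b2c", "us-east-1", api_key="EXAMPLE_KEY")
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once you have confirmed the replication lag and the replica state.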
You can trigger a failover programmatically using the Cloud Ops API. The API is available via both HTTP and gRPC.
Using HTTP
Send a POST request to the FailoverNamespaceRegion endpoint:
POST https://saas-api.tmprl.cloud/cloud/namespaces/<namespace>/failover-region
Request body:
{
"region": "<target_region>",
"asyncOperationId": "<optional_async_operation_id>"
}
- region (required): The region code of the region to fail over to. Must be a region where the Namespace has a replica in Activated replica state, indicating the replica is ready to be failed over to. Example: aws-us-east-1
- asyncOperationId (optional): A user-defined ID for tracking the async operation. If not set, the server will assign one.
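As a minimal sketch, the HTTP request above could be built in Python with the standard library. The URL path and body fields come from this section. The `Authorization: Bearer` header is an assumption about how your API key is passed, and the function name `build_failover_request` and the sample values are hypothetical. The sketch builds the request without sending it.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://saas-api.tmprl.cloud"


def build_failover_request(namespace, region, api_key, async_operation_id=None):
    """Build (but do not send) the failover-region POST request."""
    body = {"region": region}
    if async_operation_id is not None:
        # Optional: the server assigns an ID when this field is omitted.
        body["asyncOperationId"] = async_operation_id
    url = f"{BASE_URL}/cloud/namespaces/{quote(namespace, safe='')}/failover-region"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # Assumption: API key sent as a bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, not real identifiers or credentials.
req = build_failover_request("my-namespace.a1b2c", "aws-us-east-1", "EXAMPLE_KEY")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen(req)` would send the request and trigger a real failover, so run that step only against a Namespace you intend to fail over.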
Using gRPC
Use the FailoverNamespaceRegion RPC with a FailoverNamespaceRegionRequest:
message FailoverNamespaceRegionRequest {
// The namespace to failover.
string namespace = 1;
// The id of the region to failover to.
// Must be a region that the namespace is currently available in.
string region = 2;
// The id to use for this async operation - optional.
string async_operation_id = 3;
}
Both methods return a FailoverNamespaceRegionResponse containing an async operation that you can use to track the failover status.
The Temporal Cloud Terraform provider does not support triggering failovers. You must use the Web UI, tcld CLI, or Cloud Ops API.
Once the failover async operation completes successfully, the Namespace has failed over. Temporal manages retries for the failover Workflow. In the rare event that an internal error prevents the failover from completing, the Temporal on-call team is automatically paged to intervene and force the failover to completion.
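Tracking the async operation until it finishes can be sketched as a simple polling loop. Everything here is illustrative: `wait_for_failover` and the `get_state` callable are hypothetical, and the state names (`pending`, `in_progress`, `fulfilled`, `failed`, `cancelled`) are assumptions to check against the Cloud Ops API reference. The caller supplies `get_state`, which would look up the operation by ID through the API.

```python
import time

# Assumed state names; verify against the Cloud Ops API reference.
SUCCESS_STATE = "fulfilled"
FAILURE_STATES = {"failed", "cancelled"}


def wait_for_failover(get_state, operation_id, poll_seconds=5.0,
                      timeout_seconds=600.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Poll until the async operation succeeds, fails, or times out.

    Returns True on success and False on failure or cancellation.
    Raises TimeoutError if no terminal state is seen in time.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state(operation_id)
        if state == SUCCESS_STATE:
            return True
        if state in FAILURE_STATES:
            return False
        if clock() >= deadline:
            raise TimeoutError(f"operation {operation_id} still {state!r}")
        sleep(poll_seconds)


# Demo with a fake state source instead of a real API call.
fake_states = iter(["pending", "in_progress", "fulfilled"])
ok = wait_for_failover(lambda op_id: next(fake_states), "example-op",
                       sleep=lambda seconds: None)
print("failover completed:", ok)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. In production, the defaults apply.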
Return to the primary with failbacks
Failback behavior depends on whether the failover was automatic or manually triggered.
After an automatic failover
After an automatic failover, Temporal Cloud automatically fails back to the original region once the region is healthy. No action is required from you. Follow Temporal's status page for updates on the original region's health.
If you prefer to manage failback yourself, you have two options:
- Opt out of automatic failback (manage failback manually): After the automatic failover has completed, disable automatic failovers on the Namespace to prevent Temporal from automatically failing back. When you're ready to return to the original region, trigger a failover to that region and then re-enable automatic failovers.
- Stay on the new region permanently ("fail forward"): After the automatic failover has completed, trigger a failover to the region that is already active. This tells Temporal that you want to treat the new region as your primary for as long as it's healthy. Automatic failovers remain enabled, so Temporal will still protect you if the new region has an outage.
After a user-triggered failover
If you triggered a failover yourself during an outage (instead of relying on an automatic failover), Temporal will not automatically fail back for you. You must trigger a failover back to the original region when it is healthy. Monitor Temporal's status page for updates on region health.
Automatic failback is only available when the most recent failover was automatic.
How to check whether your Namespace will be automatically failed back
If you are not sure whether your Namespace will be automatically failed back, check the list of failovers in the Temporal Cloud Web UI on your Namespace's detail page:
- If the most recent failover was automatic, then Temporal will fail the Namespace back when the original region is healthy.
- If the most recent failover was user-triggered, then the Namespace will not be automatically failed back. You must trigger the failback yourself.
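The rule above reduces to a check on the most recent failover. The sketch below expresses it in Python over a hypothetical record shape: a list of dicts ordered oldest to newest, each with a `trigger` field set to `"automatic"` or `"user"`. That shape is an assumption made for illustration. It is not the format the Web UI or any API returns.

```python
def will_fail_back_automatically(failovers):
    """Return True if the most recent failover was automatic.

    `failovers` is a hypothetical list of dicts, oldest first, each with
    a "trigger" key of "automatic" or "user".
    """
    if not failovers:
        # No failover has happened, so there is nothing to fail back from.
        return False
    return failovers[-1]["trigger"] == "automatic"


history = [
    {"trigger": "user"},
    {"trigger": "automatic"},
]
print(will_fail_back_automatically(history))
```

Only the last entry matters: an earlier automatic failover followed by a user-triggered one means you must trigger the failback yourself.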
Workers and failovers
Enabling High Availability for Namespaces does not require specific Worker configuration. When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. Temporal Cloud forwards their requests from the passive replica to the active region and the responses back, so Workers keep running through a failover.
To choose where your Worker fleets run across regions, see Deployment models for High Availability.
To route Workers to the passive region's replica, see How requests reach the replica.
To stop forwarding Worker polls to the active region, see Change the forwarding behavior.
To disable automatic failovers, see Enable or disable automatic failovers.
When a Namespace fails over to a replica in a different region, Workers in the original region communicate with the Namespace across regions, which adds network latency.
- If your application cannot tolerate this latency, deploy a second set of Workers in the replica's region or opt for a replica in the same region.
- In the case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to the secondary region.
Temporal Cloud enforces a maximum connection lifetime of 5 minutes, which gives your Workers an opportunity to re-resolve the DNS.
Test failovers
Temporal recommends regular failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your application continues to function even when parts of the infrastructure fail.
If this is your first time performing a failover test, run it with a test-specific Namespace and application. Practice runs help ensure the process runs smoothly during real incidents in production.