Issues Related to DR Auto-Sync Fault Recovery

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于dr auto-sync故障恢复的问题

| username: TiDBer_Lee

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.5.0
I am designing a TiDB disaster recovery plan. To balance cost against disaster recovery capability, I plan to use the DR Auto-Sync solution with the following layout:

  • 2 replicas in the primary availability zone, both are voters
  • 2 replicas in the backup availability zone, with roles of 1 follower and 1 learner


My questions:

  • How do I recover the cluster if the entire primary availability zone goes down?
  • After that, how do I switch back to the default 3-replica mode without the cluster trying to add more replicas? The backup availability zone will have fewer machines than the primary availability zone.
| username: tidb菜鸟一只 | Original post link

There should be 6 replicas:

  • 3 replicas in the primary availability zone, which are voters
  • 3 replicas in the secondary availability zone, with roles being 2 followers and 1 learner
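
To make the 6-replica layout concrete, here is a minimal Python sketch that builds a PD placement-rule bundle for it and checks whether the secondary zone could form a Raft majority on its own. The zone names `east` and `west`, the label key `az`, the `rack`/`host` location labels, and the rule IDs are assumptions for illustration and are not taken from this thread; the field names follow the PD placement-rule JSON format as I understand it, so verify them against the docs for your version before use.

```python
import json


def build_rules(primary_az="east", dr_az="west", label_key="az"):
    """Build a placement-rule bundle: 3 voters in the primary zone,
    2 followers and 1 learner in the secondary zone.
    Zone names, label key and rule IDs are illustrative assumptions."""

    def rule(rule_id, role, count, az):
        return {
            "group_id": "pd",
            "id": rule_id,
            "start_key": "",
            "end_key": "",
            "role": role,
            "count": count,
            "label_constraints": [
                {"key": label_key, "op": "in", "values": [az]}
            ],
            "location_labels": [label_key, "rack", "host"],
        }

    return [{
        "group_id": "pd",
        "group_index": 0,
        "group_override": False,
        "rules": [
            rule("primary-voters", "voter", 3, primary_az),
            rule("dr-followers", "follower", 2, dr_az),
            rule("dr-learner", "learner", 1, dr_az),
        ],
    }]


def dr_has_quorum(bundle, dr_az):
    """Return True if the replicas in dr_az alone reach a Raft majority.
    Voters and followers take part in voting; learners do not."""
    rules = bundle[0]["rules"]
    voting = [r for r in rules if r["role"] in ("voter", "follower")]
    total = sum(r["count"] for r in voting)
    in_dr = sum(
        r["count"] for r in voting
        if dr_az in r["label_constraints"][0]["values"]
    )
    return in_dr >= total // 2 + 1


bundle = build_rules()
print(json.dumps(bundle, indent=2))
print("secondary zone has quorum alone:", dr_has_quorum(bundle, "west"))
```

With 5 voting replicas the majority is 3, and the secondary zone holds only 2, so it cannot elect leaders by itself once the primary zone is lost. The JSON printed above could be saved to a file and loaded with pd-ctl's placement-rules commands, but check the exact command syntax for your version first.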
| username: dba远航 | Original post link

Take a look at the architecture design part of video 303.

| username: Jellybean | Original post link

  1. If the primary availability zone is unavailable, the cluster will be unavailable.
    If the cluster was in synchronous (sync) mode at that point, then RPO = 0 and the cluster can be restored from the data in the backup availability zone. However, you might need help from the official team for the restoration procedure.
    If it was in asynchronous (async) mode, the data in the backup availability zone lags behind the primary availability zone. With the primary availability zone unavailable, RPO is not zero, and only a lossy recovery is possible.

  2. To return the cluster to the default 3-replica mode, you need to remove the DR Auto-Sync setting and adjust the relevant PD configuration back to the normal 3-replica mode. This will most likely require a cluster restart.
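
As a rough illustration of the two points above, the Python sketch below classifies a replication status document as "RPO = 0 expected" or not, and builds the single-rule layout that corresponds to plain 3-replica mode. The JSON shape (a `mode` field plus a `dr-auto-sync` object with a `state` field) is my assumption of what PD's replication-mode status endpoint returns, and the helper names are hypothetical; none of this is an official recovery procedure.

```python
import json


def rpo_zero_expected(status_json):
    """Return True only when the cluster reports DR Auto-Sync in the
    'sync' state. Any other state (for example 'async' or
    'sync-recover') means the backup zone may lag behind.
    The JSON shape is an assumption, see the note above."""
    status = json.loads(status_json)
    if status.get("mode") != "dr-auto-sync":
        return False
    return status.get("dr-auto-sync", {}).get("state") == "sync"


def default_three_replica_rule():
    """The single catch-all rule for ordinary 3-replica mode:
    three voters with no zone constraint."""
    return {
        "group_id": "pd",
        "id": "default",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 3,
    }


# Sample status documents (hand-written for illustration, not captured
# from a real cluster).
sync_status = '{"mode": "dr-auto-sync", "dr-auto-sync": {"label-key": "az", "state": "sync"}}'
async_status = '{"mode": "dr-auto-sync", "dr-auto-sync": {"label-key": "az", "state": "async"}}'

print(rpo_zero_expected(sync_status))
print(rpo_zero_expected(async_status))
print(json.dumps(default_three_replica_rule(), indent=2))
```

Switching back to 3-replica mode would also mean changing PD's `replication-mode` setting away from `dr-auto-sync` (the default value is, to my knowledge, `majority`); confirm the exact steps in the documentation for your version.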

In summary, this dual-availability-zone high availability mode requires too much manual intervention during maintenance. Although it needs less hardware, the ongoing staffing cost is very high. By comparison, it is better to deploy across three availability zones, use primary-secondary (master-slave) clusters, or add a third arbitration zone to the dual data center solution.

| username: TiDBer_Lee | Original post link

There are only general ideas, no details.
