Questions about Switching DR auto_sync from Synchronous to Asynchronous

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DR auto_sync同步切异步疑问

| username: h5n1

Switching from synchronous to asynchronous: PD determines whether a TiKV instance is down or disconnected by periodically checking its heartbeat information. If the number of down instances in the primary or DR data center exceeds primary-replicas or dr-replicas respectively, synchronous replication cannot be completed and the state needs to be switched.
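The rule quoted above can be sketched in a few lines of Python. This is only a literal reading of that sentence, not PD's actual implementation: the function name `should_switch_to_async` and its parameters are hypothetical, and the strict "greater than" comparison is an assumption taken from the word "exceeds".

```python
def should_switch_to_async(down_primary: int, down_dr: int,
                           primary_replicas: int, dr_replicas: int) -> bool:
    """Literal reading of the quoted rule (hypothetical helper, not PD code).

    Returns True when the number of down TiKV instances on either side
    exceeds that side's configured replica count.
    """
    return down_primary > primary_replicas or down_dr > dr_replicas


# Example values assume primary-replicas = 2 and dr-replicas = 1.
print(should_switch_to_async(down_primary=3, down_dr=0,
                             primary_replicas=2, dr_replicas=1))
print(should_switch_to_async(down_primary=0, down_dr=0,
                             primary_replicas=2, dr_replicas=1))
```

Under this literal reading, the answer depends only on per-side counts of down stores, which is what the questions below ask about.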

Assume 3 voter replicas with a primary-to-backup ratio of 2:1 (primary-replicas = 2, dr-replicas = 1), 7 TiKV instances in the primary, and 3 TiKV instances in the backup:
(1) The synchronous and asynchronous states describe whether writes are guaranteed to reach the backup. How is this related to the number of failed TiKV instances in the primary compared with primary-replicas?

(2) If 2 TiKV instances in the primary fail, some regions will become unavailable. Can the unaffected regions continue to stay in sync? What is the impact of performing multi-replica failure recovery on the unavailable regions?

(3) If one TiKV instance in the backup fails, the followers on it become unavailable. Does this mean the entire TiDB cluster switches to async mode for all regions? Once max_store_down_time has elapsed and the remaining TiKV instances in the backup center have replenished the missing replicas, can the cluster revert to sync mode?

| username: TiDBer_Lee

If two TiKV nodes in the primary fail, some regions will become unavailable, and the cluster will definitely switch to asynchronous mode and report errors. If one TiKV node in the backup fails, it will also switch to asynchronous mode because the ratio is 2:1. By default, the switch happens after 1 minute.

| username: wangkk2024

Come here to learn.

| username: zhang_2023

As long as two of the three replicas of a region are available, it’s fine.
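The "two out of three" condition is the usual Raft majority rule. A minimal sketch, assuming a hypothetical helper `region_has_quorum` (not a TiKV or PD API):

```python
def region_has_quorum(available_voters: int, total_voters: int = 3) -> bool:
    """Return True if a strict majority of the region's voters is available."""
    return available_voters > total_voters // 2


# With 3 voters, a region tolerates one unavailable replica but not two.
print(region_has_quorum(2))
print(region_has_quorum(1))
```

This only says whether a region can still serve reads and writes; it does not by itself say whether the cluster stays in sync mode.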