Discussion on Single-Zone AZ Availability

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 单区 AZ 可用性探讨

| username: GreenGuan

I have a question about the diagram: should there be at least 5 PD nodes, split 3 + 2? If the Primary IDC goes down completely, the PD service has to fail over to the DR DC. However, the DR DC has only one PD, and a single PD cannot form a majority, so wouldn't the entire cluster become unavailable?

| username: zhanggame1 | Original post link

I don’t think it will work.

| username: xfworld | Original post link

A single PD is fine; Raft is built around an odd number of instances, such as 1, 3, 5, or 7.

So for a dual-center setup, you need to work out how the PD nodes are distributed across the two centers and decide which layout is more reasonable.
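
To illustrate why odd member counts are the usual choice, here is a minimal Python sketch of the majority rule. It only does the arithmetic; the function names `quorum` and `tolerated_failures` are made up for this example and are not part of PD or any TiDB tool.

```python
def quorum(voters: int) -> int:
    """Smallest number of votes that is a strict majority of `voters`."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - quorum(voters)


# Print the quorum size and failure tolerance for small cluster sizes.
for n in range(1, 8):
    print(f"{n} voters: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule, going from 3 to 4 voters raises the quorum from 2 to 3 without tolerating any additional failure, which is why even counts are generally avoided.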

| username: zhang_2023 | Original post link

Suggest 5

| username: tidb菜鸟一只 | Original post link

A single PD can run, and you can specify one PD server during deployment, but it won’t be highly available. If it fails, you’re done for.

| username: shigp_TIDBER | Original post link

It is possible to have one PD in the DR DC, but if the PD in the DR DC goes down, the entire cluster will become unavailable.

| username: zhaokede | Original post link

Available, but not highly available.

| username: GreenGuan | Original post link

I ran a test that set AZ availability aside: within a single data center, I used a firewall to isolate two PD nodes and simulate a crash. The TiKV nodes were all in the Down state.
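
For anyone repeating this kind of isolation test, a rough Python sketch for checking PD health from a script might look like the following. It assumes the PD HTTP endpoint `/pd/api/v1/health` returns a JSON list of entries with a boolean `health` field; verify that shape against your PD version. The helper names and the address in the usage note are hypothetical.

```python
import json
from urllib.request import urlopen


def has_quorum(health_entries, total_members):
    """Return True if the healthy members form a strict majority.

    `health_entries` is assumed to be a list of dicts that each carry a
    boolean "health" key (assumed shape of the PD health response).
    """
    healthy = sum(1 for entry in health_entries if entry.get("health"))
    return healthy >= total_members // 2 + 1


def fetch_health(pd_addr, timeout=3.0):
    """Fetch the health list from one PD endpoint (assumed API path)."""
    with urlopen(f"{pd_addr}/pd/api/v1/health", timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `has_quorum(fetch_health("http://127.0.0.1:2379"), 3)` would then report whether a 3-member cluster still has a majority. Note that once the majority is actually lost, the request itself may fail or time out, so a caller should treat an exception as "no quorum reachable from here".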

image

Then the final state was

image

| username: xfworld | Original post link

Sure. Adjust the scenario: with both TiKV and TiDB in the Up state, try isolating PD.

| username: TiDBer_QYr0vohO | Original post link

One PD is available.

| username: TiDBer_JUi6UvZm | Original post link

The original poster is probably asking whether the PD in the backup center can function normally after the majority of PDs go offline.

| username: TiDBer_JUi6UvZm | Original post link

Regardless of whether there are 1, 3, 5, or 7 nodes, once the primary center goes down, Raft leaves the backup center unable to elect a leader. In that case, can TiDB force the backup center to take over as leader by removing the lost PD members, so that it can continue to provide service?
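
The point that no two-center split survives the loss of the larger center can be checked with a small Python sketch. This is plain arithmetic over voter counts, not PD code, and the function names are hypothetical.

```python
def quorum(voters: int) -> int:
    """Smallest strict majority of `voters`."""
    return voters // 2 + 1


def survives_loss_of_larger_site(primary: int, backup: int) -> bool:
    """True if the smaller of the two sites alone still holds a majority."""
    total = primary + backup
    return min(primary, backup) >= quorum(total)


# Try every way of splitting 1, 3, 5 or 7 voters across two sites.
for total in (1, 3, 5, 7):
    for backup in range(total + 1):
        primary = total - backup
        result = survives_loss_of_larger_site(primary, backup)
        print(f"{primary}+{backup}: survives loss of larger site = {result}")
```

Every line prints `False`: the smaller site holds at most half of the voters, which is never a strict majority.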

| username: YuchongXU | Original post link

At least 3 centers
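
As a rough check of why a third center helps, the following Python sketch tests whether a layout of voters per center keeps a majority after losing any single center. The layouts shown are illustrative examples, not a recommendation from the thread, and the function name is hypothetical.

```python
def survives_any_single_site_loss(layout):
    """`layout` lists the number of voters in each center.

    Returns True if, after removing any one center, the remaining
    voters still form a strict majority of the original total.
    """
    total = sum(layout)
    needed = total // 2 + 1
    return all(total - site >= needed for site in layout)


for layout in [(3, 2), (1, 1, 1), (2, 2, 1), (3, 1, 1)]:
    print(layout, survives_any_single_site_loss(layout))
```

Under these assumptions, layouts such as `(1, 1, 1)` and `(2, 2, 1)` pass, while `(3, 2)` and `(3, 1, 1)` fail because a single center holds a majority by itself.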

| username: zhanggame1 | Original post link

A cluster deployed with a single PD works fine, but a cluster deployed with three PDs does not work once only one of them remains.

| username: Jack-li | Original post link

Available solutions

| username: jiayou64 | Original post link

What you said is correct. With the configuration shown, if the primary center goes down, the backup center will be unavailable.
PD nodes should be laid out on the same principle as TiKV nodes: 5 nodes split 3:2 between the primary and backup centers. In terms of roles, the primary center should have at least 2 voters and 1 follower, and the backup center at least 2 voters. This design ensures availability.

| username: 小龙虾爱大龙虾 | Original post link

  • If the primary AZ fails and most of the Voter replicas are lost, but the secondary AZ has complete data, you can recover the data from the secondary AZ. This requires manual intervention and the use of professional tools for recovery. For support, please contact PingCAP Service and Support.

| username: tidb菜鸟一只 | Original post link

In practice, it is possible, but it requires manual intervention. If one AZ goes down in a single region with dual AZs, manual intervention is needed for both PD and TiKV…