I would like to ask whether 2 machines can be deployed for high availability, i.e., if one machine goes down, the other can still provide service. I ran into a problem where, after one machine shuts down, all the PDs go down.

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 想请教下2台机器可以部署成高可用吗?就是一台宕机的情况另一台还能提供服务,我这里碰到问题一台关机后,所有pd都会down掉

| username: TiDBer_NIpp5a2i

[TiDB Usage Environment] PoC
[TiDB Version] v7.5.0
[Reproduction Path] What operations were performed to encounter the issue
[Encountered Issue: Problem Phenomenon and Impact]
When deploying with 2 machines, can one machine still provide service if the other one crashes? I encountered an issue where after one machine was shut down, all PDs went down.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachment: Screenshot/Logs/Monitoring]

| username: TiDBer_NIpp5a2i | Original post link

Hoping an expert can take a look.

| username: 啦啦啦啦啦 | Original post link

You need at least 3 nodes, and PD should be deployed as an odd number of nodes; otherwise, after a failure there is no majority left to elect a leader. Of course, if conditions permit in a production environment, it is recommended to follow the official recommended configuration.

| username: TiDBer_NIpp5a2i | Original post link

Is it really not possible with 2 machines? We only have 2 machines in this scenario, and the customer requires that the cluster remain usable if one machine goes down.

| username: 胡杨树旁 | Original post link

The official documentation recommends deploying an odd number of PD nodes; for high availability, PD itself needs at least 3 nodes.

| username: 啦啦啦啦啦 | Original post link

No matter what, 2 machines just won’t work.

| username: TiDBer_NIpp5a2i | Original post link

Okay, thank you.

| username: TiDBer_NIpp5a2i | Original post link

If one out of three machines goes down, can the other two still be used?

| username: 啦啦啦啦啦 | Original post link

If each of the 3 machines is deployed with 1 PD, 1 TiDB, and 1 TiKV, this kind of topology can still be used if one machine goes down.
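
To make this concrete, here is a rough sketch in Python (the host IPs and helper function are made up for illustration; this is not an official tiup check) of why such a 3-host mixed topology keeps its quorums when any single host fails:

```python
# Hypothetical 3-host mixed topology: host -> components deployed on that host.
topology = {
    "10.0.1.1": ["pd", "tidb", "tikv"],
    "10.0.1.2": ["pd", "tidb", "tikv"],
    "10.0.1.3": ["pd", "tidb", "tikv"],
}

def majority(n: int) -> int:
    # Raft needs a strict majority of members to elect a leader.
    return n // 2 + 1

def survives_loss_of(host: str) -> bool:
    """True if PD and a 3-replica TiKV Raft group keep a majority without `host`.

    Assumes the default 3 Region replicas are spread one per TiKV node.
    """
    remaining = [h for h in topology if h != host]
    total_pd = sum("pd" in comps for comps in topology.values())
    live_pd = sum("pd" in topology[h] for h in remaining)
    total_kv = sum("tikv" in comps for comps in topology.values())
    live_kv = sum("tikv" in topology[h] for h in remaining)
    return live_pd >= majority(total_pd) and live_kv >= majority(total_kv)

for h in topology:
    status = "still serves" if survives_loss_of(h) else "quorum lost"
    print(f"{h} down -> {status}")
# All three hosts print "still serves": losing 1 of 3 hosts leaves
# 2 of 3 PDs and 2 of 3 Region replicas, which is still a majority.
```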

| username: Miracle | Original post link

If there are only two machines, TiDB is not a good choice. For mixed deployment, at least three machines are required.

| username: TiDBer_NIpp5a2i | Original post link

The business volume is actually very small. The reasons for choosing TiDB are the domestic-technology ("xinchuang") requirement and MySQL compatibility. Our previous MySQL solution was dual-master with standby, so I wanted to try the same approach with TiDB.

| username: forever | Original post link

Regardless of the scale, it is necessary to deploy according to the official production recommendations. The resources for each machine can be slightly reduced based on your actual situation.

| username: 连连看db | Original post link

TiDB can also use two machines for a master-slave cluster, but this won’t fully utilize TiDB’s potential and features. In this case, MySQL master-slave might be more reliable.

| username: linnana | Original post link

Functional testing does not need to consider high availability; for that, you can even deploy on a single machine.

| username: TiDBer_NIpp5a2i | Original post link

It is because of the domestic-technology (xinchuang) requirement; the code is all written for MySQL, and I don’t want to change it.

| username: TiDBer_NIpp5a2i | Original post link

I tested it and it didn’t work. As long as one machine goes down, the PD on the other machine will inevitably be down as well.

| username: Kongdom | Original post link

You can’t do it with 2 machines; you need at least 3. :thinking: Or set it up as master-slave, with the second machine acting as a slave of the first.

| username: 春风十里 | Original post link

You have two machines with 4 PDs in total, 2 PDs on each machine. If either machine goes down, you lose half of the PDs, so the remaining ones no longer form a majority and cannot elect a leader. Therefore, two physical machines are not enough; you need at least three.
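
A quick sketch of the majority arithmetic behind this (plain Python, not TiDB code), using the standard Raft rule that a leader needs floor(n/2) + 1 live members:

```python
def majority(n: int) -> int:
    # Raft leader election requires a strict majority of the n members.
    return n // 2 + 1

# 2 machines with 2 PDs each: a machine failure removes 2 of the 4 PDs.
print(majority(4))           # 3 -> three votes are needed
print(4 - 2 >= majority(4))  # False -> the 2 surviving PDs cannot elect a leader

# 3 machines with 1 PD each: a machine failure removes 1 of the 3 PDs.
print(majority(3))           # 2 -> two votes are needed
print(3 - 1 >= majority(3))  # True -> the 2 surviving PDs still form a majority
```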

Single Region Multi-AZ Deployment TiDB | PingCAP Documentation Center

Raft is a distributed consensus algorithm. In the TiDB cluster, both PD and TiKV use Raft to achieve data disaster recovery. Raft’s disaster recovery capability is achieved through the following mechanisms:

  • At its core, a Raft member is log replication plus a state machine. Raft members keep data in sync by replicating logs, and they switch roles under different conditions, with the goal of electing a leader to provide service.
  • Raft is a voting system that follows the majority protocol. In a Raft Group, a member becomes leader once it receives a majority of the votes. This means that as long as a Raft Group retains a majority of its nodes, it can elect a leader and continue to provide service.

Following the reliability characteristics of Raft, in real-world scenarios:

  • To overcome the failure of any 1 server (host), at least 3 servers should be provided.
  • To overcome the failure of any 1 rack, at least 3 racks should be provided.
  • To overcome the failure of any 1 availability zone (AZ, which can also be multiple data centers in the same city), at least 3 AZs should be provided.
  • To handle the disaster scenario of any 1 region, at least 3 regions should be planned for cluster deployment.

It is evident that the native Raft protocol does not support even-numbered replicas very well. Considering the impact of cross-region network latency, a three-AZ deployment in the same region might be the most suitable high-availability and disaster recovery solution for Raft.

| username: 小龙虾爱大龙虾 | Original post link

Find a domestic single-node database for master-slave replication.

| username: dba远航 | Original post link

At least three.