PD Multi-Node Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD多节点故障

| username: Hacker_5KEgzcj2

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.1.0

  1. When multiple PD nodes fail, the official article mentions “deploying a new PD.” What specific method should be used? Scale-out doesn’t work.
    PD Recover User Guide | PingCAP Docs

  2. I also referred to this article: Column - Using pd-recover to recover from a failure of the majority of PD nodes | TiDB Community. Restarting a single node results in the following error:

Could the experts please take a look?

| username: ffeenn | Original post link

All the PD nodes are down?

| username: 考试没答案 | Original post link

Can you show the current cluster status?

| username: xfworld | Original post link

First, try to restore the state of PD.

Refer to these operations and see:

| username: tidb菜鸟一只 | Original post link

You should be creating a new cluster, not scaling out the original one.

| username: Hacker_5KEgzcj2 | Original post link

Two out of three PD nodes are down.

| username: Hacker_5KEgzcj2 | Original post link

All are in DOWN status.

| username: Hacker_5KEgzcj2 | Original post link

If a new cluster is created, how can the previous TiKV data be restored?

| username: Hacker_5KEgzcj2 | Original post link

I followed the first article as well, but it fails at this step, as mentioned in my post.

| username: WalterWj | Original post link

If all the PDs are down, set up a new PD cluster and then use pd-recover to register the TiKV.
A normal scale-out with tiup definitely won’t work…
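As a rough sketch of the pd-recover step: the tool needs the old cluster ID and an alloc ID larger than any ID the old PD has already allocated, and both can usually be found in the old pd.log. The Python helper below is only an illustration. The function names, the log line format (taken from the examples in the official PD Recover guide), the `alloc_margin` value, and the endpoint address are all assumptions, so check them against your own logs before running anything.

```python
import re
import shlex

# Assumed log fields, based on the sample lines in the PD Recover guide:
#   ["init cluster id"] [cluster-id=...]
#   ["idAllocator allocates a new id"] [alloc-id=...]
CLUSTER_ID_RE = re.compile(r"\[cluster-id=(\d+)\]")
ALLOC_ID_RE = re.compile(r"\[alloc-id=(\d+)\]")


def extract_recover_params(log_lines, alloc_margin=10000):
    """Return (cluster_id, alloc_id) parsed from old PD log lines.

    alloc_id is the largest allocated ID seen plus alloc_margin
    (the margin is a hypothetical safety buffer, not an official value).
    """
    cluster_id = None
    max_alloc_id = 0
    for line in log_lines:
        if "init cluster id" in line:
            match = CLUSTER_ID_RE.search(line)
            if match:
                cluster_id = int(match.group(1))
        elif "idAllocator allocates a new id" in line:
            match = ALLOC_ID_RE.search(line)
            if match:
                max_alloc_id = max(max_alloc_id, int(match.group(1)))
    if cluster_id is None:
        raise ValueError("no 'init cluster id' entry found in the PD log")
    return cluster_id, max_alloc_id + alloc_margin


def build_pd_recover_command(endpoint, cluster_id, alloc_id):
    """Build the pd-recover command line to run against the NEW PD."""
    args = [
        "./pd-recover",
        "-endpoints", endpoint,
        "-cluster-id", str(cluster_id),
        "-alloc-id", str(alloc_id),
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Sample lines in the assumed format; replace with lines from your pd.log.
    sample_log = [
        '[2019/10/14 10:35:38.880 +00:00] [INFO] [server.go:212] ["init cluster id"] [cluster-id=6747551640615446306]',
        '[2019/10/15 03:15:05.824 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=4000]',
        '[2019/10/15 08:55:38.559 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=5000]',
    ]
    cluster_id, alloc_id = extract_recover_params(sample_log)
    # The endpoint is a placeholder for the newly deployed PD's client URL.
    print(build_pd_recover_command("http://10.0.1.13:2379", cluster_id, alloc_id))
```

The script only prints the command, so you can review the cluster ID and alloc ID before running pd-recover yourself.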

| username: tidb菜鸟一只 | Original post link

Use pd-recover to register the PD of the new cluster as the PD of the old cluster.

| username: Hacker_5KEgzcj2 | Original post link

There is still one PD node alive, not all down.

| username: Hacker_5KEgzcj2 | Original post link

Oh, I see. It’s like deploying a new cluster and then registering the new cluster’s PD over there. I’ll give it a try.

| username: Hacker_5KEgzcj2 | Original post link

After the recovery is completed

Currently, there is the original faulty cluster and a separate new PD cluster. What should be done next?

| username: 考试没答案 | Original post link

Can a cluster with only one PD still be accessed normally? Will it work properly?

| username: Hacker_5KEgzcj2 | Original post link

No, PD will show a DOWN status.

| username: jansu-dev | Original post link

Could you post a more complete pd.log? It seems we haven’t figured out why PD is down.

| username: liuis | Original post link

No, you can’t.