PD service cannot start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD服务无法启动

| username: terry0219

【TiDB Usage Environment】Testing
【TiDB Version】7.5.0
【Reproduction Path】Simulated a PD node failure by deleting the deploy directories of all 3 PD nodes, then redeployed the PD cluster and ran into this issue
【Encountered Issue: Symptoms and Impact】
PD error log:
[2023/12/27 11:38:33.055 +08:00] [ERROR] [middleware.go:156] ["redirect but server is not leader"] [from=pd-10.0.7.64-2379] [server=pd-10.0.7.64-2379] [error="[PD:apiutil:ErrRedirect]redirect failed"]
[2023/12/27 11:38:33.056 +08:00] [ERROR] [middleware.go:156] ["redirect but server is not leader"] [from=pd-10.0.7.64-2379] [server=pd-10.0.7.64-2379] [error="[PD:apiutil:ErrRedirect]redirect failed"]

Another PD error log:
[2023/12/27 11:39:00.728 +08:00] [ERROR] [client.go:150] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2023/12/27 11:39:01.729 +08:00] [INFO] [client.go:146] ["server starts to synchronize with leader"] [server=pd-10.0.7.64-2379] [leader=pd-10.0.7.64-2379] [request-index=47800]
[2023/12/27 11:39:01.730 +08:00] [ERROR] [client.go:150] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]

| username: Jellybean | Original post link

How many PD nodes have you deployed in total?

If it's 3 nodes, then deleting all 3 PD nodes at once will definitely break the cluster. The PD cluster is the brain of the whole cluster: once the nodes are up, they serve the cluster through their leader. By deleting the deploy directories of all the nodes, you have most likely lost the original cluster's critical metadata, and recovery will be quite troublesome. The key is to make sure the TiKV data is not lost.
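As a quick sanity check that the TiKV data is still on disk, something like the following works. A minimal sketch, assuming passwordless SSH from the control machine; the host list and the data path are placeholders, adjust them to your own topology:

```python
# Minimal sketch: confirm each TiKV data directory still exists and is non-trivial in size.
import subprocess

TIKV_HOSTS = ["10.0.7.61", "10.0.7.62", "10.0.7.63"]   # hypothetical TiKV hosts
TIKV_DATA_DIR = "/tidb-data/tikv-20160"                # hypothetical data dir

for host in TIKV_HOSTS:
    # `du -sh` only confirms the region data is still present on disk.
    result = subprocess.run(
        ["ssh", host, f"du -sh {TIKV_DATA_DIR}"],
        capture_output=True, text=True, check=False,
    )
    print(host, (result.stdout or result.stderr).strip())
```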

| username: 小龙虾爱大龙虾 | Original post link

Well, all three PDs have been deleted, so you will have to rebuild the cluster.

| username: zhanggame1 | Original post link

After deleting the deploy directories of the 3 PDs, how exactly did you redeploy the PD cluster? And how many PDs were in the original cluster?

| username: wangccsy | Original post link

The logs clearly tell you that there is no PD Leader.
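You can also confirm this from the PD HTTP API. A minimal sketch, assuming the endpoint from the logs above (http://10.0.7.64:2379):

```python
# Minimal sketch: query GET /pd/api/v1/members on a PD endpoint and print the
# elected leader, if any. Endpoint taken from the logs above; adjust as needed.
import json
import urllib.error
import urllib.request

PD_ENDPOINT = "http://10.0.7.64:2379"

try:
    with urllib.request.urlopen(f"{PD_ENDPOINT}/pd/api/v1/members", timeout=5) as resp:
        members = json.load(resp)
except (urllib.error.URLError, OSError) as exc:
    # With no elected leader, PD cannot serve (or redirect) the request, which
    # matches the "redirect but server is not leader" errors in the logs.
    print("PD API unavailable:", exc)
else:
    print("members:", [m["name"] for m in members.get("members", [])])
    print("leader: ", (members.get("leader") or {}).get("name", "<none>"))
```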

| username: terry0219 | Original post link

A total of 3 PD nodes.

| username: terry0219 | Original post link

There are 3 PDs in total. The procedure was: first delete the deploy directories of the 3 nodes, then deploy a new cluster containing only PD nodes, and then copy the new deploy directory over the previous PD directory.
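Roughly, the PD-only redeploy step looked like this. A minimal sketch, where the cluster name "pd-rebuild" and the hosts other than 10.0.7.64 are placeholders; note that on its own this only gives an empty PD cluster with a new cluster ID, which is why the pd-recover step described further down is still needed:

```python
# Minimal sketch of deploying a PD-only cluster with TiUP.
import pathlib
import subprocess

topology = """\
global:
  user: "tidb"
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"
pd_servers:
  - host: 10.0.7.64
  - host: 10.0.7.65   # hypothetical
  - host: 10.0.7.66   # hypothetical
"""
pathlib.Path("pd-only.yaml").write_text(topology)

# Deploy and start a cluster that contains nothing but the 3 PD nodes.
subprocess.run(["tiup", "cluster", "deploy", "pd-rebuild", "v7.5.0", "pd-only.yaml"], check=True)
subprocess.run(["tiup", "cluster", "start", "pd-rebuild"], check=True)
```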

| username: 像风一样的男子 | Original post link

Nice simulation. I would also like to know how to recover the data when all the PDs are lost.

| username: terry0219 | Original post link

Does rebuilding the cluster mean only rebuilding the PD cluster, or does everything need to be rebuilt?

| username: 啦啦啦啦啦 | Original post link

Refer to this: 专栏 - TiDB集群数据库灾难恢复手册 | TiDB 社区 (TiDB Cluster Disaster Recovery Manual | TiDB Community)

| username: 小龙虾爱大龙虾 | Original post link

What is your test scenario? Are you testing high availability by losing 3 PDs? This scenario is unreasonable, so it is recommended that you redesign the scenario and rebuild a new environment for testing. This is the quickest and best way to restore your test cluster.

If you are specifically testing complete PD data loss and rebuilding PD, then follow the steps for rebuilding PD described here: PD Recover 使用文档 | PingCAP 文档中心 (PD Recover User Guide | PingCAP Docs)
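The core of that doc is pd-recover: rebuild an empty PD cluster, then rewrite its cluster ID and allocation ID so the surviving TiKV nodes accept it. A minimal sketch, assuming the pd-recover binary from the TiDB Toolkit is on PATH; the cluster ID shown is only a placeholder that must be replaced with the real one recovered from a surviving TiKV/TiDB log, and the alloc ID just needs to be larger than anything the old PD ever allocated:

```python
# Minimal sketch of the pd-recover step; values marked as placeholders must be
# replaced with the real ones for your cluster.
import subprocess

NEW_PD_ENDPOINT = "http://10.0.7.64:2379"
ORIGINAL_CLUSTER_ID = "6747551640615446306"   # placeholder: recover it from TiKV/TiDB logs
SAFE_ALLOC_ID = "100000000"                   # must exceed the largest ID the old PD allocated

# Rewrite the freshly deployed (empty) PD cluster's identity.
subprocess.run(
    ["pd-recover",
     "-endpoints", NEW_PD_ENDPOINT,
     "-cluster-id", ORIGINAL_CLUSTER_ID,
     "-alloc-id", SAFE_ALLOC_ID],
    check=True,
)

# Restart PD and then the rest of the cluster so TiKV re-registers with it.
subprocess.run(["tiup", "cluster", "restart", "tidb-test"], check=True)   # cluster name is a placeholder
```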

| username: kelvin | Original post link

This test doesn’t make much sense, right? Normally, it’s impossible for 3 nodes to fail simultaneously in a production environment.

| username: terry0219 | Original post link

Thank you for the reply. I just want to test some extreme scenarios to see if recovery is possible. If it goes into production in the future, I will feel more confident.

| username: terry0219 | Original post link

Thank you for your reply. I followed this document and succeeded.

| username: terry0219 | Original post link

I followed the document posted in this thread and the operation succeeded.

| username: Jellybean | Original post link

This is truly an excellent article, worth studying in depth.

The PD cluster only stores metadata; TiKV reports all of its region information to PD through heartbeats. So even after all the PD data is lost, the entire TiDB cluster can be repaired by rebuilding the PD cluster.
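After the rebuild you can confirm that PD has re-learned the store topology from the TiKV heartbeats. A minimal sketch against the PD HTTP API, with the endpoint taken from the logs above:

```python
# Minimal sketch: list the stores PD currently knows about via GET /pd/api/v1/stores.
import json
import urllib.request

PD_ENDPOINT = "http://10.0.7.64:2379"

with urllib.request.urlopen(f"{PD_ENDPOINT}/pd/api/v1/stores", timeout=5) as resp:
    stores = json.load(resp)

# Every TiKV instance should show up again with state "Up" once heartbeats arrive.
for item in stores.get("stores", []):
    store = item["store"]
    print(store["id"], store["address"], store.get("state_name"))
```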

| username: dba远航 | Original post link

The information stored in PD is gone; it's just an empty shell.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.