Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: pd 全部宕机后的异常情况和问题请教 (abnormal behavior and questions after all PD nodes went down)

[TiDB Usage Environment] Production
[TiDB Version] v6.5.0
[Reproduction Path]
- The PD processes on tidb-a and tidb-b unexpectedly went down (shown as Down in the display output);
- Attempted to scale down 1 PD and 3 TiKV processes on tidb-b using tiup on the tidb-a machine (without --force), but the scale-down failed;
- Shut down the tidb-b machine, retried the scale-down on tidb-a with the --force parameter, and it succeeded (the command reported success);
- Currently, the cluster only has 1 PD and 3 TiKV processes on tidb-a;
- Used tiup on tidb-a to restart the cluster, but TiKV could not connect to the local PD and the startup failed (the PD process hung indefinitely on port 2379);
- After powering tidb-b back on, the PD and TiKV processes on tidb-b started again automatically, and the cluster on tidb-a could then start successfully (but the cluster information displayed by tiup on tidb-a no longer included anything from tidb-b; see the verification sketch after this list);
- Introduced a third physical machine, tidb-c (same configuration), then used tiup on tidb-a to scale out a single PD onto it (the command reported success);
- At this point, the cluster is running normally.
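
A useful first step in this state is to compare what tiup's metadata says the cluster contains with what PD itself reports. A minimal check sketch, assuming the cluster is named tidb-test and the surviving PD listens on tidb-a:2379 (both names are placeholders):

```shell
# What tiup's own metadata thinks the topology is
tiup cluster display tidb-test

# What the PD cluster actually believes: current PD members...
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 member

# ...and the TiKV stores PD still has registered (Up / Down / Offline / Tombstone)
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store
```

If the two views disagree (as described above: tiup no longer lists tidb-b, yet the PD on tidb-a still contacts it), the mismatch is between tiup's metadata and the membership/stores that PD still has on record.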
[Encountered Issues: Symptoms and Impact]
- If we want to wipe the data on the tidb-b physical machine and take it down for maintenance, how can we safely shut down and remove the TiKV and PD processes on it? (The PD on tidb-a still seems to be contacting the PD port on tidb-b, even though the cluster information no longer includes any tidb-b processes; see the offline sketch after this list.)
- If we stop the TiKV processes on tidb-b without first scaling out TiKV onto tidb-c (only the PD was scaled out), will that cause data loss?
- How can we troubleshoot the original issue of the PD processes going down on both machines?
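
For reference while discussing this, the usual safe sequence is to add replacement TiKV capacity first, let PD migrate the region replicas away, and only then retire the old stores. A sketch under the same naming assumptions as above (cluster name tidb-test; the topology file name and the third TiKV port 20162 are assumptions):

```shell
# 1. Add TiKV instances on tidb-c so replicas have somewhere to move
tiup cluster scale-out tidb-test scale-out-tikv-c.yaml

# 2. Take the tidb-b TiKV nodes offline; PD migrates their regions away first
#    (only works while the nodes are still present in tiup's metadata)
tiup cluster scale-in tidb-test --node tidb-b:20160,tidb-b:20161,tidb-b:20162

# 3. Wait until those stores show Tombstone, then clean up the tombstoned nodes
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store
tiup cluster prune tidb-test
```

Since the tidb-b nodes were already force-removed from tiup's metadata in this case, step 2 would have to go through pd-ctl instead; see the sketch under "Remaining Issues" below.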
[Resource Configuration]
The current cluster consists of two physical machines (tidb-a and tidb-b):
- Each physical machine has 4 NVMe disks (1 for the system, 3 for TiKV)
- Each has 1 PD process and 3 TiKV processes
[Attachments: Screenshots/Logs/Monitoring]
Latest Progress:
From tidb-a, scaled the PD on tidb-b back out, then scaled it in again.
The PD member list is now back to normal, and the PD on tidb-b has been taken offline completely.
Remaining Issues:
The TiKV processes on tidb-b are still running, and it's unclear how to proceed (a possible approach is sketched below).
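
Since the tidb-b TiKV instances are no longer in tiup's metadata, one possible way to retire them is directly through pd-ctl, once there is enough capacity elsewhere (e.g. the TiKV instances on tidb-c mentioned in the 0710 update) to hold three replicas. A sketch, assuming the surviving PD is at tidb-a:2379; the store IDs must be read from the store output (18 and 19 appear in the logs below), not guessed:

```shell
# List all stores and note the IDs whose address points at tidb-b
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store

# Ask PD to offline one tidb-b store; its regions are scheduled away first
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store delete 19

# Once region_count reaches 0 the store becomes Tombstone and can be purged
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store remove-tombstone
```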
Latest Progress (0710):
Even after scaling out TiKV processes onto tidb-c, shutting down tidb-b still caused problems.
Checked a TiKV log on tidb-a and found it still trying to connect to TiKV on tidb-b:
[2023/07/10 15:01:41.922 +08:00] [ERROR] [raft_client.rs:821] ["wait connect timeout"] [addr=tidb-b:20160] [store_id=19]
Checked a TiKV log on tidb-c and found it also still trying to connect to TiKV on tidb-b:
[2023/07/10 15:38:54.303 +08:00] [ERROR] [raft_client.rs:821] ["wait connect timeout"] [addr=tidb-b:20161] [store_id=18]
This is truly a perplexing cluster.
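
One thing worth checking here (an assumption, not a confirmed diagnosis): as long as stores 18 and 19 are still registered in PD in a non-Tombstone state, the other TiKV nodes will keep retrying raft connections to them, which would produce exactly these log lines. Their state and remaining region count can be inspected like this:

```shell
# State (Up / Disconnected / Down / Offline / Tombstone) and region_count
# of the two stores the log lines above complain about
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store 18
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store 19

# Regions that currently have fewer replicas than configured
tiup ctl:v6.5.0 pd -u http://tidb-a:2379 region check miss-peer
```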