Consultation on Abnormal Situations and Issues After Complete PD Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd 全部宕机后的异常情况和问题请教

| username: TiDBer_QBsB7GOW

[TiDB Usage Environment] Production
[TiDB Version] v6.5.0
[Reproduction Path]

  1. The PD processes on both tidb-a and tidb-b unexpectedly went down (as shown by tiup display);
  2. Attempted to scale down the 1 PD and 3 TiKV processes on tidb-b using tiup on the tidb-a machine (without --force), but the scale-down failed;
  3. Shut down the tidb-b machine, retried from tidb-a with the --force parameter, and the scale-down succeeded (tiup reported success; see the command sketch after this list);
  4. At that point, the cluster only had the 1 PD and 3 TiKV processes on tidb-a;
  5. Used tiup on tidb-a to restart the cluster, but TiKV could not connect to the local PD and the startup failed (the PD process hung indefinitely on port 2379);
  6. After powering tidb-b back on, the PD and TiKV processes on tidb-b started again automatically, and the cluster on tidb-a could then start successfully (but the cluster information displayed by tiup on tidb-a did not include anything from tidb-b);
  7. Introduced a third physical machine, tidb-c (same configuration), then used tiup on tidb-a to scale out a new PD onto it (tiup reported success);
  8. At this point, the cluster is running normally.
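
Very roughly, the operations in steps 2, 3, and 7 correspond to tiup commands along these lines (a sketch only; <cluster-name>, the host:port node IDs, and the scale-out-pd.yaml file name are placeholders, and the real node IDs should be taken from tiup cluster display):

  # step 2: normal scale-down of a tidb-b instance (run once per instance to be removed)
  tiup cluster scale-in <cluster-name> --node tidb-b:2379
  # step 3: after the normal scale-down failed, the forced variant; this only removes the node from tiup's topology
  tiup cluster scale-in <cluster-name> --node tidb-b:2379 --force
  # step 7: scale a new PD out onto tidb-c, using a small topology file that lists only that PD
  tiup cluster scale-out <cluster-name> scale-out-pd.yaml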

[Encountered Issues: Symptoms and Impact]

  1. If we want to wipe the data and perform maintenance on the tidb-b physical machine, how can we safely shut down and uninstall the TiKV and PD processes on tidb-b? (The PD on tidb-a still seems to be contacting the PD port on tidb-b, even though the cluster information no longer includes any processes from tidb-b; see the check sketch after this list.)
  2. If we stop the TiKV processes on tidb-b without first scaling out TiKV processes to the tidb-c machine (only the PD was scaled out), will it cause data loss?
  3. How can we troubleshoot the initial issue of the PD processes on both machines going down?
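
One way to see what the surviving PD still believes is part of the cluster (a sketch, assuming pd-ctl is invoked through tiup and that the PD on tidb-a listens on 2379; adjust the version and address to the actual deployment):

  # list the PD members the running PD still knows about
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 member
  # list all TiKV stores and their states (Up / Offline / Tombstone)
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store

If tidb-b still appears in the member or store output, PD itself has not forgotten it, regardless of what the tiup topology shows.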

[Resource Configuration]

The current cluster consists of two physical machines (tidb-a and tidb-b):

  1. Each physical machine has 4 NVMe disks (1 for the system, 3 for TiKV)
  2. Each has 1 PD process and 3 TiKV processes

[Attachments: Screenshots/Logs/Monitoring]

Latest Progress:

From tidb-a, scaled the tidb-b PD back out on its own, then scaled it down again.

Currently, the PD member list has been restored to normal, and the PD on tidb-b has been completely taken offline.
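
A sketch of what this scale-out-then-scale-in sequence looks like with tiup (the cluster name, the topology file name, and the node ID are placeholders):

  # scale a PD back out onto tidb-b using a topology file that contains only that PD
  tiup cluster scale-out <cluster-name> scale-out-pd-b.yaml
  # then remove it again, this time through the normal (non-forced) path so the PD member list is cleaned up
  tiup cluster scale-in <cluster-name> --node tidb-b:2379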

Remaining Issues:

The TiKV processes on tidb-b are still running, and it’s unclear how to proceed.
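
Since the tidb-b TiKV instances were already force-removed from the tiup topology, tiup scale-in can no longer target them. One possible path (a sketch, not a verified procedure: it assumes the tidb-b TiKV processes are still up and that the remaining stores, for example after scaling TiKV out to tidb-c, have enough capacity to hold all replicas; the PD address and store IDs are placeholders) is to drive the removal through PD directly:

  # find the store IDs of the tidb-b TiKV instances
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store
  # ask PD to offline a tidb-b store; PD then migrates its regions to the remaining stores
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store delete <store-id>
  # only stop the tidb-b TiKV processes once their stores have reached the Tombstone state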

Latest Progress (0710):

After scaling out TiKV processes to tidb-c, shutting down tidb-b still caused issues.

Checked a TiKV process log on tidb-a and found it was still trying to connect to the TiKV on tidb-b:

[2023/07/10 15:01:41.922 +08:00] [ERROR] [raft_client.rs:821] ["wait connect timeout"] [addr=tidb-b:20160] [store_id=19]

Checked a TiKV process log on tidb-c and found it was also still trying to connect to the TiKV on tidb-b:

[2023/07/10 15:38:54.303 +08:00] [ERROR] [raft_client.rs:821] ["wait connect timeout"] [addr=tidb-b:20161] [store_id=18]

This is truly a perplexing cluster.
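
The store_id values in these errors (18 and 19) suggest that PD still has the tidb-b stores registered, so the other TiKV instances keep trying to reach them. A way to confirm this (a sketch; adjust the PD address and version as needed):

  # show the state of the stores the errors refer to
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store 18
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store 19

If they are still Up or Offline rather than Tombstone, their regions have not been fully migrated away, and simply shutting down tidb-b risks leaving regions unavailable.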

| username: TiDBer_QBsB7GOW | Original post link

Why is it that, even though I forcibly scaled down tidb-b from tidb-a, tidb-b still exists in the PD member list? Did the scale-down not succeed? Yet the display output shows no components on the tidb-b machine:

ID                   Role  Host            Ports        OS/Arch       Status  Data Dir                     Deploy Dir
--                   ----  ----            -----        -------       ------  --------                     ----------
{tidb-c}:2379  pd    {tidb-c}  2379/2380    linux/x86_64  Up      /home/tidb-data/pd-2379      /home/tidb-deploy/pd-2379
{tidb-a}:2379   pd    {tidb-a}   2379/2380    linux/x86_64  Up|L    /home/tidb-data/pd-2379      /home/tidb-deploy/pd-2379
{tidb-a}:20160  tikv  {tidb-a}   20160/20180  linux/x86_64  Up      /data1/tidb-data/tikv-20160  /data1/tidb-deploy/tikv-20160
{tidb-a}:20161  tikv  {tidb-a}   20161/20181  linux/x86_64  Up      /data2/tidb-data/tikv-20161  /data2/tidb-deploy/tikv-20161
{tidb-a}:20162  tikv  {tidb-a}   20162/20182  linux/x86_64  Up      /data3/tidb-data/tikv-20162  /data3/tidb-deploy/tikv-20162
Total nodes: 5
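
If the tidb-b PD is gone from the tiup topology but still shows up in PD's own member list, it can be removed directly through pd-ctl (a sketch; the member name must be taken from the actual member output, typically something like pd-tidb-b-2379 in tiup deployments, and the PD address is a placeholder):

  # list the current PD members
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 member
  # delete the stale tidb-b member by name (or by id with: member delete id <id>)
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 member delete name <tidb-b-member-name>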
| username: redgame | Original post link

  1. tiup cluster scale-in $cluster-name -N tidb-b
| username: redgame | Original post link

  1. If you copy a TiKV instance over to the tidb-c machine, you should start it and add it to the cluster on tidb-c before shutting down the TiKV process on tidb-b. That said, this kind of attempt is not really necessary.
| username: redgame | Original post link

  1. Look through the logs:
    tail -f tikv.log
    tail -f pd.log
| username: tidb菜鸟一只 | Original post link

Why did you go straight to scaling down after the PDs went down? Now the nodes on tidb-b can no longer rejoin the cluster, so the only option left is to attempt a lossy recovery…
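
For reference, the lossy path in recent versions (including v6.5) is PD's Online Unsafe Recovery, which forcibly rebuilds regions whose majority of replicas lived on the failed stores. A sketch, assuming the failed store IDs are already known from pd-ctl store; this discards data held only on those stores and should only be used when they truly cannot be brought back:

  # ask PD to recover regions that lost their majority on the failed stores
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 unsafe remove-failed-stores <store-id>,<store-id>
  # watch the progress of the recovery
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 unsafe remove-failed-stores show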

| username: TiDBer_QBsB7GOW | Original post link

Thank you for the reply:

Because at the time, every restart attempt on tidb-a kept failing while trying to connect to tidb-b, and only after the scale-down could tidb-a carry on by itself (except that its PD still could not start normally).

tidb-b was suspected of having system kernel issues and needed to be taken offline for maintenance.

So, when both PD instances were down at that point, what should have been done? Any suggestions?

| username: TiDBer_QBsB7GOW | Original post link

Hello, thank you for your reply.

When I run this step with the current tiup on tidb-a, it will not succeed, because:

There is no longer any information about tidb-b in the current cluster topology (what remains unclear is why the PD on tidb-a still relies on the PD on tidb-b to start; presumably this is because the forced scale-in only removed tidb-b from tiup's topology, while PD itself still listed tidb-b as a member, so the two-member PD cluster lost quorum once tidb-b was gone).

| username: TiDBer_QBsB7GOW | Original post link

Since it was mentioned above that "the PD on tidb-a actually still depends on the PD on tidb-b," I am currently unsure whether the cluster still relies on the TiKV instances on tidb-b.

I am also not sure how to verify that "the current cluster can operate without any services from tidb-b."
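
One way to check whether anything still depends on tidb-b (a sketch; the PD address is assumed to be tidb-a's and should be adjusted to the real deployment):

  # PD members: tidb-b should no longer be listed
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 member
  # TiKV stores: the tidb-b stores should be Tombstone (or absent), not Up/Offline
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 store
  # regions that would be at risk if tidb-b disappeared
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 region check down-peer
  tiup ctl:v6.5.0 pd -u http://tidb-a:2379 region check miss-peer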

| username: TiDBer_QBsB7GOW | Original post link

Why did you go straight to scaling down after the PDs went down? Now the nodes on tidb-b can no longer rejoin the cluster, so the only option left is to attempt a lossy recovery…

I don’t want tidb-b to rejoin the cluster now; instead, I want the cluster to be completely rid of tidb-b (because earlier, when tidb-b was shut down, the PD on tidb-a also failed to start, which in turn prevented the TiKV instances on tidb-a from starting).

| username: TiDBer_QBsB7GOW | Original post link

This will report an error:

Error: failed to destroy: cannot find node id 'tidb-b:2379' in topology
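
This is to be expected once the node has been force-removed: scale-in looks nodes up in tiup's topology, and --node expects a host:port node ID that still exists there. Since tidb-b:2379 is already gone from the topology, the stale reference has to be cleaned up on the PD side instead, for example with pd-ctl member delete as sketched after the display output above (or by scaling the PD back out and then scaling it in normally, as was done later).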

| username: TiDBer_QBsB7GOW | Original post link

Current Progress:

From tidb-a, scaled the tidb-b PD back out on its own, and then scaled it down again.

Currently, the PD member list has been restored to normal, and the PD on tidb-b has been completely decommissioned.

Remaining Issue:

The TiKV processes on tidb-b are still running, and it is temporarily unclear how to proceed.