How to Remove a DOWN Node in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB怎么移除DOWN的节点

| username: robert233

【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.2
【Reproduction Path】What operations were performed when the issue occurred

【Encountered Issue: Issue Phenomenon and Impact】

  • Issue: One physical machine malfunctioned and kept restarting. I manually evicted the leaders on that physical machine and then stopped its stores. The replicas are currently being rebuilt elsewhere. How can I manually remove these down nodes?

【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
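For reference, the evict-and-stop sequence described above can be done with pd-ctl and tiup roughly like this (a sketch only; the cluster name, PD address, store ID, and instance addresses are placeholders):

  # Evict all leaders from a store before stopping it (store ID 4 is a placeholder).
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 scheduler add evict-leader-scheduler 4
  # Stop the TiKV instances on the failing machine (addresses are placeholders).
  tiup cluster stop <cluster-name> -N <ip>:20160,<ip>:20161,<ip>:20162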

| username: 裤衩儿飞上天 | Original post link

What is the current status of the node?

  1. You can use tiup cluster display XXXX to check; XXXX is the cluster name.
  2. Use pd-ctl to check the corresponding store: what is its status?
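For example (the cluster name and PD address are placeholders):

  # Overall cluster view, including the status of every instance.
  tiup cluster display <cluster-name>
  # Store status as PD sees it (Up / Disconnected / Down / Offline / Tombstone).
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 store
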
| username: robert233 | Original post link

They are Down. Do I need to manually set these stores to Tombstone now to remove them from the cluster?

| username: 裤衩儿飞上天 | Original post link

Check if this node still has any regions @robert233

| username: robert233 | Original post link

The status of the stores in PD is Down, and they have exceeded the default max-store-down-time of 30 minutes. The monitoring shows that the cluster has started replenishing the replicas of the affected regions on the surviving stores. There must still be non-leader regions on these down nodes. How can the stores be removed from the cluster?
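To confirm or adjust that threshold, pd-ctl exposes it (a sketch; the PD address is a placeholder):

  # Show the current scheduling configuration, including max-store-down-time.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 config show | grep -i max-store-down-time
  # Optionally set it back to the 30-minute default.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 config set max-store-down-time 30m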

| username: robert233 | Original post link

(The original reply contained only a screenshot, which is not available in this translation.)

| username: 裤衩儿飞上天 | Original post link

  1. How many replicas does your cluster have, and what is the cluster topology? I can see that two nodes are currently down.
  2. Don’t rush to execute the delete operation. With 3 replicas, some regions may have lost more than one replica.
  3. If no region has lost multiple replicas, wait until the regions are replenished on other nodes before executing the delete store operation (see the check below).
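A quick way to check whether any region is short of replicas (the PD address is a placeholder):

  # Regions that currently have fewer peers than the configured replica count.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 region check miss-peer
  # Regions that still report down peers.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 region check down-peer
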
| username: robert233 | Original post link

Three physical machines with 9 TiKV instances in total; the 3 TiKV instances on one physical machine are down. I checked, and there are no regions with missing replicas.

After the regions are replenished, will the stores change from Down to Tombstone state?

| username: 裤衩儿飞上天 | Original post link

  1. First: have you set labels for TiKV?
  2. If there are no labels, you can directly run tiup cluster scale-in XXX --node IP:Port (XXX is the cluster name, IP:Port is the address of the node to be taken offline). After the status becomes Tombstone, you can perform the prune operation (see the sketch below).
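A sketch of that flow (the cluster name and node addresses are placeholders):

  # Take the down TiKV instances offline; list each instance address after --node.
  tiup cluster scale-in <cluster-name> --node <ip>:20160,<ip>:20161,<ip>:20162
  # Once the stores turn Tombstone, clean them out of the topology.
  tiup cluster prune <cluster-name>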

You can refer to this for the specific status of the store:

| username: 裤衩儿飞上天 | Original post link

!!!
When your check uses length < 3, don’t forget the TiFlash replicas.
First, use pd-ctl to check whether the node to be taken offline still has any regions, then proceed with taking the node offline.
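For example, to see whether a store still holds any region peers (store ID 4 and the PD address are placeholders):

  # Regions that still have a peer on store 4.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 region store 4
  # Details of the store itself, including region_count and leader_count.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 store 4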

| username: robert233 | Original post link

This is a single-machine multi-instance deployment; I configured labels according to the official documentation:

  config:
    server.labels:
      host: host1
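For context, in a single-machine multi-instance topology these labels usually go together with PD’s location-labels setting, roughly like this in the topology file (a sketch; hosts and ports below are placeholders):

  server_configs:
    pd:
      replication.location-labels: ["host"]
  tikv_servers:
    - host: 10.0.1.1
      port: 20160
      config:
        server.labels:
          host: host1
    # ...one entry per TiKV instance, each carrying the label of its physical machine...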

Additionally, I did not find any regions where the number of replicas on the down stores is greater than or equal to the number of healthy replicas; the following query returns an empty result:

region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(4,5,918726) then . else empty end) | length>=$total-length) }"

As for the regions with replicas on the downed stores 4, 5, 918726:

region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(any(.==(4,5,918726)))}"

So, does this mean I need to manually run operator add remove-peer for every region that has a replica on the down stores?
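If you do end up removing peers by hand, the per-region command looks roughly like this (the region ID, store ID, and PD address are placeholders):

  # Remove the peer of region 1234 that lives on the down store 4.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 operator add remove-peer 1234 4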

| username: 裤衩儿飞上天 | Original post link

  1. If you have set the labels, the nodes on the other two physical machines should not be able to replenish the replicas. Recommended check:
    under pd-ctl, execute store XX (XX is the store ID) and post that output.
  2. For a 3-replica cluster, you should first add a machine and set the relevant label;
    after the replicas are replenished, you can scale in normally (see the sketch below).
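A sketch of that scale-out step (the IP, ports, and host label are placeholders; the replacement machine gets its own host label):

  # scale-out.yaml -- one entry per TiKV instance on the replacement machine
  tikv_servers:
    - host: 10.0.1.10
      port: 20160
      status_port: 20180
      config:
        server.labels:
          host: host4
    # ...repeat for the other instances on the same machine, with distinct ports...

Then apply it with tiup cluster scale-out <cluster-name> scale-out.yaml and wait for the new stores to come Up and for the regions to be replenished there.
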
| username: robert233 | Original post link

The issue is that this physical machine has already crashed and cannot be brought back up. After scaling out with a new machine and the regions are replenished, will the scale-in then work?

| username: 裤衩儿飞上天 | Original post link

  1. The current state does not allow a normal scale-in because the remaining machines cannot satisfy the three-replica requirement. The normal process is to ensure three replicas can be placed before taking a machine offline.
  2. If you must forcefully take a machine offline, you need to remove the regions on that machine first; otherwise, the offline process will fail (see the sketch after this list).
  3. Since this is a production environment, it is recommended to follow the normal process for taking a machine offline, as this is the simplest and safest method.
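If the forced route is unavoidable, the usual sequence is to clear the peers first and only then delete the store (all IDs and addresses are placeholders; use with care on a production cluster):

  # Remove the remaining peers on the down store, region by region.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 operator add remove-peer <region-id> <store-id>
  # Ask PD to take the store offline; it becomes Tombstone once it holds no regions.
  tiup ctl:v5.4.2 pd -u http://<pd-addr>:2379 store delete <store-id>
  # Finally remove the instance from the tiup topology.
  tiup cluster scale-in <cluster-name> --node <ip>:20160 --force
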
| username: robert233 | Original post link

As the expert said, I’ll solve the problem by following the normal scaling process :+1:

| username: 裤衩儿飞上天 | Original post link

Glad it’s resolved :+1:t2::+1:t2::+1:t2: