A physical machine has been down for more than a day and won't start. There are four TiKV nodes on this machine. How should I handle this?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 有一台物理机down机超过1天了,启动不了了,这台机器上有四个tikv节点,我该如何处理

| username: xingzhenxiang

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】


[image]

| username: 小龙虾爱大龙虾 | Original post link

If the physical machine can be repaired and brought back up, bring it back up. If the machine is gone for good, scale in those nodes.
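For reference, a minimal status-check sketch before deciding (the PD address and tiup ctl version below are placeholders, not taken from this thread):

```shell
# List all TiKV stores and their state (Up / Disconnected / Down), so you can
# confirm how PD currently sees the four stores on the dead host.
tiup ctl:v6.5.0 pd -u http://10.0.0.1:2379 store
```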

| username: Fly-bird | Original post link

With so many TiKV nodes, scaling in these 4 shouldn’t be a problem. Just scale them in one by one.

| username: 普罗米修斯 | Original post link

My first reaction is to look at the labels. If the business isn’t affected right now, scale in and decommission the nodes one by one.

| username: 像风一样的男子 | Original post link

If labels have been applied, just scale in these 4 TiKV nodes directly.

| username: xingzhenxiang | Original post link

Labels are applied. Is the only option to try a forced scale-in?

| username: xingzhenxiang | Original post link

The server has been repaired and is up and running.

| username: tidb菜鸟一只 | Original post link

If labels are applied, the cluster won’t be affected in use, and it will gradually start replenishing replicas on other nodes. You can try scaling the nodes in directly through tiup. See: Column - Three Tricks for Handling Abnormal TiKV Decommissioning | TiDB Community
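For reference, a minimal scale-in sketch with tiup (the cluster name `tidb-prod` and the node addresses/ports are placeholders, not taken from this thread):

```shell
# Normal scale-in: tiup asks PD to evict the stores and waits for regions to migrate.
tiup cluster scale-in tidb-prod --node 10.0.0.5:20160,10.0.0.5:20161,10.0.0.5:20162,10.0.0.5:20163

# If the host is permanently gone and the stores can never come back,
# --force removes them without waiting (use with care, only for dead stores):
# tiup cluster scale-in tidb-prod --node 10.0.0.5:20160 --force
```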

| username: xingzhenxiang | Original post link

I applied labels at the rack level. Does that mean that even if a whole rack has a problem, the cluster won’t be affected?

| username: tidb菜鸟一只 | Original post link

If there are enough racks, then yes, it can be isolated that way. If not, isolation may only be possible at the host level…

| username: xingzhenxiang | Original post link

Four racks should be enough, right?
PD is also configured as follows:
replication.location-labels: ["rack", "host"]
replication.isolation-level: "rack"
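For reference, a sketch for verifying this configuration with pd-ctl (same placeholder PD address and version as above):

```shell
# Shows max-replicas, location-labels and isolation-level as PD sees them.
tiup ctl:v6.5.0 pd -u http://10.0.0.1:2379 config show replication

# Lists the labels currently attached to the TiKV stores.
tiup ctl:v6.5.0 pd -u http://10.0.0.1:2379 label
```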

| username: tidb菜鸟一只 | Original post link

Well, if PD’s max-replicas is 3, meaning each region keeps 3 replicas, then losing one of the four racks won’t have an impact: with rack-level isolation each replica of a region sits in a different rack, so at most one of the three is lost and the remaining two still form a Raft majority.

| username: 春风十里 | Original post link

Will it automatically recover when the physical machine is restored? Is there any manual operation required?

| username: TIDB-Learner | Original post link

Ideally, if the services are set to start automatically on boot, they come back up on their own and these TiKV nodes gradually rebalance their regions. If auto-start is not configured, start them manually.
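For reference, a manual-start sketch with tiup (cluster name, host address and ports are placeholders):

```shell
# Start only the components on the repaired host, then check their status.
tiup cluster start tidb-prod --node 10.0.0.5:20160,10.0.0.5:20161,10.0.0.5:20162,10.0.0.5:20163
tiup cluster display tidb-prod
```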

| username: xingzhenxiang | Original post link

No manual intervention is required; the node automatically cleans up its stale data first and then resynchronizes.

| username: xingzhenxiang | Original post link

Thank you, thank you.

| username: dba远航 | Original post link

It’s good that it’s resolved. Please share how you fixed it so that everyone can learn.

| username: xingzhenxiang | Original post link

No intervention is needed for a single machine; TiDB handles the repair internally. Just start the machine and make sure the services come up normally. Nothing else is required.
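For reference, a post-recovery health-check sketch with pd-ctl (same placeholder PD address and version as above):

```shell
# Empty output for both checks means replica repair has finished and
# no region is missing peers or carrying peers on down stores.
tiup ctl:v6.5.0 pd -u http://10.0.0.1:2379 region check miss-peer
tiup ctl:v6.5.0 pd -u http://10.0.0.1:2379 region check down-peer
```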

| username: xingzhenxiang | Original post link

My storage distribution is as follows: