Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: After a TiKV node's server failed, the node stayed in the Offline state after scale-in; even after it was forcibly removed, information about the faulty node still remained.
【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.3
【Reproduction Path】Operations performed that led to the issue
A TiKV node in the cluster suffered a hardware failure. Since the server could not be restored in a short time, the faulty TiKV node was scaled in, but its status remained Offline for a long time, so it was later forcibly scaled in. Checking the cluster status with tiup cluster display no longer shows the faulty node, but its information still exists in PD (see the commands sketched below). It is now unclear how to decommission the node completely.
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
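For context, a minimal sketch of how the above can be checked; the cluster name, PD address, and store ID below are placeholders, not values from the original post:

```
# Topology as seen by tiup (the faulty node no longer appears here)
tiup cluster display <cluster-name>

# Store status as seen by PD (the stuck store still appears here, status "Offline")
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store
```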
Offline is due to regions that have not been migrated.
tiup cluster scale-in | PingCAP Documentation Center …#--force
You can use tiup to forcibly remove this node. Do not forcibly take other nodes offline before the replicas are replenished.
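A hedged sketch of the force removal mentioned above; the cluster name and node address are placeholders:

```
# --force skips waiting for region migration, so use it only for a truly
# unrecoverable node; other nodes must stay up until replicas are replenished
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force
```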
The node has already been forcibly removed. After that, I checked through pd-ctl and the node's information is still there, with the status Offline; it has not been completely removed.
Part of the screenshot is missing. Check the number of regions; once the region count drops to 0, the store will naturally become Tombstone.
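For example, the region count on the stuck store can be watched with pd-ctl; the store ID and PD address are placeholders:

```
# Watch the "region_count" field; the store turns Tombstone once it reaches 0
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store <store-id>
```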
How many TiKV nodes do you have now? Did you scale down and then scale up a node again?
This faulty node cannot be recovered; the database directory on this server has been deleted. We have tried deleting some leaders, but the remaining 177 leaders cannot be deleted despite multiple attempts. Is there any way to completely clear them now?
Is it because there are not enough remaining nodes to supplement the replicas?
After the node is taken offline, once the cluster's replicas have been replenished, the node will enter the Tombstone state.
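One way to check whether replica replenishment has finished is pd-ctl's region check; the PD address is a placeholder:

```
# Regions that still lack a replica; an empty list means replenishment is done
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 region check miss-peer
```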
When the issue occurred, there were three TiKV nodes in total. After one node went down, the storage capacity of the two remaining TiKV nodes was insufficient and scheduling basically stopped. A new node was then added and the store space threshold low-space-ratio was adjusted. After the cluster returned to normal, the faulty node was restarted, but many of its data files turned out to be corrupted. Some leaders could neither be scheduled off that node nor deleted from it, so the node was forcibly removed. However, PD still has information about that node.
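For reference, the low-space-ratio adjustment mentioned above is done through pd-ctl; the value 0.85 below is only an illustrative number (the default is 0.8), and the PD address is a placeholder:

```
# Raise the threshold at which PD treats a store as low on space,
# so scheduling can continue on nearly full stores
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 config set low-space-ratio 0.85
```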
You forcibly deleted it, but there are still so many leaders, which means that these 177 regions cannot elect new leaders. You probably need to consider unsafe recovery.
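A rough sketch of Online Unsafe Recovery, which is experimental in v5.4; the store ID and PD address are placeholders, and this should only be run after confirming the data on the failed store is truly lost:

```
# Declare the failed store as permanently lost so the surviving peers can
# re-create the regions whose majority (and leader) was on that store
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 unsafe remove-failed-stores <store-id>

# Check the progress of the recovery task
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 unsafe remove-failed-stores show
```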
It has always been in an offline state, indicating that there are still regions that have not been migrated. Once the migration is complete, it will change to a tombstone state.
It is because regions are still being migrated. Once the region migration is complete, the store will become Tombstone, and prune can then be run to clear nodes in the Tombstone state.
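Once the store does reach Tombstone, the cleanup sketched below (with placeholder names) removes it from both PD and the tiup topology:

```
# Clear Tombstone stores from PD
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store remove-tombstone

# Remove Tombstone nodes from the tiup-managed topology
tiup cluster prune <cluster-name>
```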
The cluster is now normal, tiup cannot see this node, but pd-ctl can see it?
You probably need to restart the cluster to let PD re-acquire the status of TiKV.
It feels like some follow-up work hasn’t been completed.
Remove it completely and then scale it back in?
Before forcibly taking a TiKV node offline, you need to ensure that the remaining normal TiKV nodes meet the minimum requirement (at least 3 normal TiKV nodes). Then wait for the regions on the faulty TiKV node to be scheduled and migrated away before it can be properly scaled in and removed.
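For example, the state of the remaining stores can be confirmed before any scale-in; the PD address is a placeholder:

```
# All surviving stores should report "Up"; with 3 replicas you need at least
# 3 normal TiKV nodes left for PD to replenish replicas
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store | grep state_name
```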
The tiup --force command only removes the node information from the tiup metadata and does not actually decommission the node. For reference, see Column - Three Tricks for Handling TiKV Scale-in and Offline Exceptions | TiDB Community.
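So, after a --force scale-in, the store usually still has to be decommissioned on the PD side. A hedged sketch, with placeholder store ID and PD address:

```
# Ask PD to take the store offline (Offline state; regions start migrating away)
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store delete <store-id>

# Watch region_count drop to 0, at which point the store becomes Tombstone
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store <store-id>

# Finally clear the Tombstone record
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store remove-tombstone
```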