Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: After a TiKV node's server failed, the node stayed in the Offline state after scale-in; even after it was forcibly removed, information about the faulty node still remained.
【TiDB Usage Environment】Production Environment
【TiDB Version】v5.4.3
【Reproduction Path】Operations performed that led to the issue
A TiKV node in the cluster suffered a hardware failure. Since the server could not be restored in a short time, the faulty TiKV node was scaled in, but its status remained Offline for a long time, so it was later forcibly scaled in. Checking the cluster status with tiup cluster display no longer shows the faulty node, but its information still exists in PD (see the commands sketched below). It is now unclear how to decommission the node completely.
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
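For context, a minimal sketch of how the above can be checked; the cluster name, PD address, and store ID below are placeholders, not values from the original post:

```
# Topology as seen by tiup (the faulty node no longer appears here)
tiup cluster display <cluster-name>

# Store status as seen by PD (the stuck store still appears here, status "Offline")
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store
```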
Offline is due to regions that have not been migrated.
tiup cluster scale-in | PingCAP Documentation Center …#--force
You can use tiup to forcibly remove this node. Do not forcibly take other nodes offline before the replicas are replenished.
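A hedged sketch of the force removal mentioned above; the cluster name and node address are placeholders:

```
# --force skips waiting for region migration, so use it only for a truly
# unrecoverable node; other nodes must stay up until replicas are replenished
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force
```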
The node has already been forcibly removed. After that, I checked through pd-ctl and the node's information is still there, with the status Offline; it has not been completely removed.
Part of the screenshot is missing. Check the number of regions; once the region count drops to 0, the store will naturally become Tombstone.
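For example, the region count on the stuck store can be watched with pd-ctl; the store ID and PD address are placeholders:

```
# Watch the "region_count" field; the store turns Tombstone once it reaches 0
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store <store-id>
```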
How many TiKV nodes do you have now? Did you scale down and then scale up a node again?
This faulty node cannot be recovered; the database directory on this server has been deleted. We have tried deleting some leaders, but the remaining 177 leaders cannot be deleted despite multiple attempts. Is there any way to completely clear them now?
Is it because there are not enough remaining nodes to supplement the replicas?
After the node is taken offline, once the cluster's replicas have been replenished, the node will enter the Tombstone state.
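One way to check whether replica replenishment has finished is pd-ctl's region check; the PD address is a placeholder:

```
# Regions that still lack a replica; an empty list means replenishment is done
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 region check miss-peer
```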
When the issue occurred, there were three TiKV nodes in total. After one node went down, the storage capacity of the two remaining TiKV nodes was insufficient and scheduling basically stopped. A new node was then added and the store space threshold low-space-ratio was adjusted. After the cluster returned to normal, the faulty node was restarted, but many of its data files turned out to be corrupted. Some leaders could neither be scheduled off that node nor deleted from it, so the node was forcibly removed. However, PD still has information about that node.
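For reference, the low-space-ratio adjustment mentioned above is done through pd-ctl; the value 0.85 below is only an illustrative number (the default is 0.8), and the PD address is a placeholder:

```
# Raise the threshold at which PD treats a store as low on space,
# so scheduling can continue on nearly full stores
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 config set low-space-ratio 0.85
```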
You forcibly deleted it, but there are still so many leaders, which means that these 177 regions cannot elect new leaders. You probably need to consider unsafe recovery.
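A rough sketch of Online Unsafe Recovery, which is experimental in v5.4; the store ID and PD address are placeholders, and this should only be run after confirming the data on the failed store is truly lost:

```
# Declare the failed store as permanently lost so the surviving peers can
# re-create the regions whose majority (and leader) was on that store
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 unsafe remove-failed-stores <store-id>

# Check the progress of the recovery task
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 unsafe remove-failed-stores show
```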
It has always been in an offline state, indicating that there are still regions that have not been migrated. Once the migration is complete, it will change to a tombstone state.
It is because regions are still being migrated. Once the region migration is complete, the store will become Tombstone, and prune can then be run to clear nodes in the Tombstone state.
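Once the store does reach Tombstone, the cleanup sketched below (with placeholder names) removes it from both PD and the tiup topology:

```
# Clear Tombstone stores from PD
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store remove-tombstone

# Remove Tombstone nodes from the tiup-managed topology
tiup cluster prune <cluster-name>
```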
The cluster is now normal, tiup cannot see this node, but pd-ctl can see it?
You probably need to restart the cluster to let PD re-acquire the status of TiKV.
It feels like some follow-up work hasn’t been completed.
Remove it completely and then scale it back in?
Before forcibly taking a TiKV node offline, you need to ensure that the remaining normal TiKV nodes meet the minimum requirement (at least 3 normal TiKV nodes). Then wait for the regions on the faulty TiKV node to be scheduled and migrated away before it can be properly scaled in and removed.
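For example, the state of the remaining stores can be confirmed before any scale-in; the PD address is a placeholder:

```
# All surviving stores should report "Up"; with 3 replicas you need at least
# 3 normal TiKV nodes left for PD to replenish replicas
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store | grep state_name
```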
The tiup --force command only removes the node information from the tiup metadata and does not actually decommission the node. For reference, see Column - Three Tricks for Handling TiKV Scale-in and Offline Exceptions | TiDB Community.
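So, after a --force scale-in, the store usually still has to be decommissioned on the PD side. A hedged sketch, with placeholder store ID and PD address:

```
# Ask PD to take the store offline (Offline state; regions start migrating away)
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store delete <store-id>

# Watch region_count drop to 0, at which point the store becomes Tombstone
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store <store-id>

# Finally clear the Tombstone record
tiup ctl:v5.4.3 pd -u http://<pd-ip>:2379 store remove-tombstone
```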