Issue with TiKV Node Scale-Down Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点缩容失败问题

| username: 林先森cC

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 4.0.11
[Reproduction Path]
Scale down a TiKV node in the test environment.
First, use tiup to add a scheduler that evicts the leaders from the TiKV node to be scaled down:
tiup ctl pd -u http://172.2xxxx:2379 scheduler add evict-leader-scheduler 1

After executing the scheduling task, I found that one leader remained in store-1 and did not migrate.
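For reference, one way to watch the eviction progress is to query the store and its remaining regions with pd-ctl (a sketch only, using the same placeholder PD address as above and store id 1):

tiup ctl pd -u http://172.2xxxx:2379 store 1          # leader_count should drop toward 0
tiup ctl pd -u http://172.2xxxx:2379 region store 1   # lists the regions still on store 1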



I checked and found that this region is empty.
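For reference, a single region can also be inspected with pd-ctl (the region id is a placeholder); an empty region typically shows approximate_size and approximate_keys close to 0:

tiup ctl pd -u http://172.2xxxx:2379 region <region_id>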

At that point I scaled down the node, and the output of tiup cluster display showed the node as Tombstone, which looked normal.
I then executed tiup cluster prune tidb-test, and the output showed that the scaled-down node 126 had been removed from the cluster.

However, I found that the tikv_store_status table still had information about this node, showing it as down.
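For reference, this can be queried through INFORMATION_SCHEMA.TIKV_STORE_STATUS (a sketch; the TiDB host, port, and user below are placeholders):

mysql -h <tidb-host> -P 4000 -u root -e "SELECT STORE_ID, ADDRESS, STORE_STATE_NAME FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS"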


The monitoring page also showed it as down.
[Screenshot: Abnormal stores panel]
I checked other community posts and executed store remove-tombstone, but it didn’t work.
[tidb@dba-test-12124 ~]$ tiup ctl pd -u 192.168.xxx4:2379 -i
Starting component ctl: /home/tidb/.tiup/components/ctl/v4.0.11/ctl pd -u 192.xxx:2379 -i
» store remove-tombstone
Success!
and also tried the PD HTTP API:
curl -X DELETE pd-addr:port/pd/api/v1/stores/remove-tombstone
[Screenshot: Abnormal stores panel]
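For reference, the state of every store (Up / Offline / Down / Tombstone) can also be listed through the PD API; remove-tombstone only deletes stores that are already in Tombstone state, so it does not help with a store that shows as Down (pd-addr:port is a placeholder, as above):

curl http://pd-addr:port/pd/api/v1/stores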

[Encountered Problem: Problem Phenomenon and Impact]
Later, I scaled the previously removed node 126 back out again. The scale-out succeeded, but the node failed to come up when it was started automatically.
So I skipped the leader-eviction step and ran the scale-in directly. That scale-in did not complete, but the node then managed to start up on its own.
I ran the scale-in command again, still without evicting the leaders first, and this time the node was taken offline successfully. The logs are gone now, and I have repeated the scale-out/scale-in several times without reproducing the issue.

Dear teachers, please advise:

  1. In the situation described above, where the TiKV node has been taken offline and the prune command has been executed, the monitoring shows the store changing from Tombstone to a Down store. How should this be resolved? I suspect it is because the region leader was never scheduled away.

[Screenshot: Abnormal stores panel]
2. When scaling in TiKV, is it better to first evict the region leaders from the node to be removed and only scale in after all leaders have been evicted, or is it fine to execute the scale-in command directly?

  3. During the scale-in or leader-eviction scheduling task, if the region leader migration gets stuck, is there a command to manually move a specific region leader?

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: 这里介绍不了我 | Original post link

I don’t understand why we need to manually add evict-leader-scheduler.

| username: 林先森cC | Original post link

The concern is that scaling down TiKV nodes might lead to region leader migration failures or region migration getting stuck during the scaling down process, causing regions to have no leader and affecting business operations. According to the community documentation, it is recommended to manually transfer first.

| username: Billmay表妹 | Original post link

You can try the following solutions:

  1. First, you can try manually scheduling the migration of region leaders. Use the pd-ctl command scheduler add to add a scheduling task that forces leader migration. For example, execute tiup ctl pd -u http://172.2xxxx:2379 scheduler add balance-leader-scheduler to balance the distribution of region leaders.
  2. When scaling in TiKV nodes, it is recommended to evict the region leaders on the node to be removed before executing the scale-in command, so that no problems are caused by leaders that were not scheduled away during the scale-in. Use scheduler add to add an evict-leader scheduling task, wait for the leader scheduling to finish, and then execute the scale-in command (see the sketch after this list).
  3. If region leaders fail to migrate during the scale-in or leader-eviction scheduling task, you can try manually specifying the leader migration, again via scheduler add. For example, tiup ctl pd -u http://172.2xxxx:2379 scheduler add balance-leader-scheduler manually rebalances the distribution of region leaders.
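
For reference, the overall flow described above might look roughly like this (a sketch only; the PD address and the node address with port 20160 are placeholders, and tidb-test is the cluster name used earlier in this thread):

tiup ctl pd -u http://172.2xxxx:2379 scheduler add evict-leader-scheduler 1     # evict leaders from store 1
tiup ctl pd -u http://172.2xxxx:2379 store 1                                    # wait until leader_count reaches 0
tiup cluster scale-in tidb-test --node <tikv-ip>:20160                          # take the TiKV node offline
tiup cluster display tidb-test                                                  # wait for the node to show Tombstone
tiup cluster prune tidb-test                                                    # clean up the Tombstone node
tiup ctl pd -u http://172.2xxxx:2379 scheduler remove evict-leader-scheduler-1  # remove the eviction scheduler afterwards
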
| username: Billmay表妹 | Original post link

[The original reply contained only a link and was not translated.]

| username: h5n1 | Original post link

First, try handling it manually with pd-ctl: operator add remove-peer <region_id> <from_store_id>, and see if any errors are reported.
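
For reference, if the goal is to move a region's leader rather than remove a peer, pd-ctl also provides a transfer-leader operator (a sketch; the ids are placeholders):

tiup ctl pd -u http://172.2xxxx:2379 operator add transfer-leader <region_id> <to_store_id>   # move the region's leader to the target store
tiup ctl pd -u http://172.2xxxx:2379 operator show                                            # list pending operators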

| username: Kamner | Original post link

The article you are reading already has a solution further down.

Usually, it’s directly scaled in without manually migrating the leader.