Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv一节点下线很慢 (Taking a TiKV node offline is very slow)
[TiDB Usage Environment] Production Environment
[TiDB Version] v5.2.4
[Encountered Problem: Symptoms and Impact]
A server that hosted two TiKV instances crashed. One of the instances was taken offline this morning. Its region count was previously over 40,000 and is still at 38,864, so the offline process is far too slow.
TiKV Resource Monitoring
PD Resource Monitoring
TiDB Resource Monitoring
PD Scheduling Parameters
The offline process is throttled by design, so that heavy scheduling does not affect business traffic.
Previously, taking down a TiKV node wasn’t this slow. Now it’s affecting system usage. I want to speed up the process of taking it offline.
You can refer to the documentation and adjust the parameters:
- leader-schedule-limit: controls the scheduling that balances the number of leaders across TiKV nodes, which affects the query-processing load.
- region-schedule-limit: controls the scheduling that balances the number of replicas across TiKV nodes, which affects the data volume on each node.
leader-schedule-limit
: Controls the concurrency of Transfer Leader scheduling.
region-schedule-limit
: Controls the concurrency of adding and removing Peer scheduling.
disable-replace-offline-replica
: Stops the scheduling that replaces replicas on offline nodes.
disable-location-replacement
: Stops handling the scheduling related to adjusting Region isolation levels.
max-snapshot-count
: The maximum concurrency of sending and receiving Snapshots allowed per Store.
Before making adjustments, record the current settings. After the offline process completes, revert the changes; otherwise the cluster may be significantly affected.
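As a sketch, recording and later restoring the settings with pd-ctl (via tiup) could look like this; the PD address is a placeholder, and the restore values shown are the v5.x defaults, so use the values you actually saved:

```shell
# Save the current PD scheduling config so it can be restored later
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 config show > pd-config-before.json

# After the store has finished going offline, restore the originals, e.g.:
# (4 and 2048 are the v5.x defaults; take the real values from pd-config-before.json)
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 config set leader-schedule-limit 4
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 config set region-schedule-limit 2048
```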
Additionally, you can refer to this article:
After adjusting the three parameters above according to best practices, the offline speed still seems stagnant. I also added leader eviction, but the number of leaders on the store has not decreased.
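For context, leader eviction is normally added with pd-ctl's evict-leader-scheduler. A sketch, where store ID 4 and the PD address are placeholders:

```shell
# Move all leaders off the offline store (assumed store ID: 4)
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 scheduler add evict-leader-scheduler 4

# Confirm the scheduler is registered, then watch the store's leader_count
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 scheduler show
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 store 4
```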

This is the scheduling content
This is the current scheduling action
After the adjustment, the offline speed is still very slow. Last time, roughly 30k regions were migrated off within a few hours; this time, after 3 hours, fewer than 2k have been moved.
After a good sleep, it should be almost fixed.
Try adjusting these parameters to three times their original values:
- leader-schedule-limit
- region-schedule-limit
- replica-schedule-limit
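As a sketch, tripling from the v5.x defaults (4 / 2048 / 64) would look like this in pd-ctl; check your cluster's actual current values with `config show` first, since they may differ (the PD address is a placeholder):

```shell
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 config set leader-schedule-limit 12    # default 4, tripled
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 config set region-schedule-limit 6144  # default 2048, tripled
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 config set replica-schedule-limit 192  # default 64, tripled
```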
The overall idea is to adjust the scheduling parameters to speed up the scheduling process.
It’s quite bold to take down one of the two KV nodes in a production environment.
Taking a node offline won’t affect usage. Just wait for it to go down slowly; rushing it might cause other issues.
The region offline process is now complete. Under the guidance of big brother h5n1 in the group, I checked the PD logs, which were full of leader-transfer operations being canceled and retried. The cause: for many regions, two peers were both located on the offline machine (it hosted two TiKV stores), so those regions had lost a majority of their replicas and could not be scheduled normally, which made the offline extremely slow. After running multi-replica loss recovery, the TiKV store finished going offline quickly.
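For readers hitting the same situation on v5.x: multi-replica loss recovery is done with tikv-ctl's `unsafe-recover remove-fail-stores`, run on every surviving TiKV node while those TiKV processes are stopped. A sketch, where the store IDs and data path are placeholders; this discards the failed replicas and can lose data, so follow the official disaster-recovery procedure carefully:

```shell
# Stop scheduling and the affected TiKV processes first, then on each surviving TiKV node:
# (-s lists the store IDs of the failed stores; --all-regions applies the fix to every region)
tikv-ctl --db /path/to/tikv/data/db unsafe-recover remove-fail-stores -s <store_id1>,<store_id2> --all-regions

# Restart the cluster afterwards and check region health, e.g. via pd-ctl:
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 region check miss-peer
```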
Adjust the relevant parameters.
Setting the store limit can control the offline speed.
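In pd-ctl, the per-store limit looks like this; store ID 4 and the rate of 200 are illustrative values, not recommendations for this cluster:

```shell
# Raise the scheduling rate for one store (regions per minute, for add-peer and remove-peer)
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 store limit 4 200

# Or raise it for all stores, or only for remove-peer operations
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 store limit all 200
tiup ctl:v5.2.4 pd -u http://<pd-address>:2379 store limit 4 200 remove-peer
```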
Is this a bug issue or a parameter issue? How should it be adjusted?
Stop posting meaningless replies just to earn points. The issue has already been resolved and the solution provided. Instead of adjusting parameters here, check your other replies on your profile. They’re all just filler. Do something meaningful and contribute to the community. Carefully read each post, provide solutions if you can, and if you can’t, go study the documentation. Stop wasting time.