TiKV Node Goes Offline Very Slowly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv一节点下线很慢

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.2.4
[Encountered Problem: Symptoms and Impact]
A server hosting two TiKV nodes crashed. I started taking one of the nodes offline this morning. Its region count was over 40,000 before and is still at 38,864 now. The offline process is too slow.


TiKV Resource Monitoring

PD Resource Monitoring

TiDB Resource Monitoring

PD Scheduling Parameters

| username: 芮芮是产品

The offline process is deliberately slow so that fast scheduling does not affect the business.

| username: 普罗米修斯

Previously, taking down a TiKV node wasn’t this slow. Now it’s affecting system usage. I want to speed up the process of taking it offline.

| username: xfworld

You can refer to the documentation and adjust the parameters:

  • leader-schedule-limit: Controls the concurrency of Transfer Leader scheduling. Leader scheduling balances the number of leaders across TiKV nodes, which affects the query-processing load.
  • region-schedule-limit: Controls the concurrency of add-peer and remove-peer scheduling. Region scheduling balances the number of replicas across TiKV nodes, which affects the data volume on each node.
  • disable-replace-offline-replica: Stops the scheduling that handles offline nodes.
  • disable-location-replacement: Stops the scheduling that adjusts Region isolation levels.
  • max-snapshot-count: The maximum number of Snapshots each Store may send or receive concurrently.

Before making adjustments, record the current settings. Once the node has finished going offline, revert the changes; otherwise, the raised limits might significantly affect the cluster.
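
The record-then-revert advice can be sketched as a small Python helper. This is only a minimal sketch, not an official tool: it assumes the current values were copied by hand from `pd-ctl config show`, the sample numbers and the factor of 2 are hypothetical rather than recommendations, and the function names are made up for illustration. It only builds `config set` command strings and never contacts the cluster.

```python
# Sketch: build pd-ctl `config set` commands that raise scheduling limits
# temporarily, plus the matching commands that restore the recorded values.
# Assumption: `recorded` was copied by hand from `pd-ctl config show`;
# the numbers below are hypothetical, not recommendations.


def scaled_commands(recorded, factor):
    """Return `config set` commands with each recorded value multiplied by factor."""
    return [f"config set {name} {value * factor}" for name, value in recorded.items()]


def revert_commands(recorded):
    """Return `config set` commands that restore the recorded values."""
    return [f"config set {name} {value}" for name, value in recorded.items()]


recorded = {
    "leader-schedule-limit": 4,
    "region-schedule-limit": 4,
    "max-snapshot-count": 3,
}

speed_up = scaled_commands(recorded, 2)  # run these while the node goes offline
restore = revert_commands(recorded)      # run these once it has finished

for line in speed_up + restore:
    print(line)
```

Keeping the recorded values in one place means the revert commands cannot drift from what was actually changed.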

Additionally, you can refer to this article:

| username: 普罗米修斯

After adjusting the following three parameters according to best practices, the offline progress still seems to have stalled. I also added leader eviction for the store, but the number of leaders on it has not decreased.
Image
This is the scheduling content

| username: 普罗米修斯

This is the current scheduling action

| username: 芮芮是产品

Adjust the store limit.

| username: 普罗米修斯

I'll try setting it to 30.

| username: 普罗米修斯

After the adjustment, the node is still going offline very slowly. Judging by the earlier offline speed, about 30k regions were moved off within several hours.

Now, in 3 hours, fewer than 2k have been moved off.
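
To put that rate in perspective, here is a rough back-of-the-envelope estimate in Python. It is only a sketch under stated assumptions: it takes the 38,864 regions from the first post as the amount still to move, treats 2,000 regions per 3 hours as the observed rate (the real figure was lower, so the result is optimistic), and assumes the rate stays constant. The function name is hypothetical.

```python
# Sketch: estimate how long the remaining regions would take at the
# observed rate. All inputs come from the thread; a constant rate is assumed.


def hours_remaining(regions_left, regions_moved, hours_elapsed):
    """Remaining hours if regions keep moving at the observed rate."""
    rate_per_hour = regions_moved / hours_elapsed
    return regions_left / rate_per_hour


# 38,864 regions left; fewer than 2,000 moved in 3 hours, so 2,000 is an
# upper bound on progress and the estimate is a lower bound on time.
estimate = hours_remaining(38_864, 2_000, 3)
print(f"at least {estimate:.1f} hours")  # at least 58.3 hours
```

An estimate of more than two days is far from the earlier experience of about 30k regions in several hours, which suggests something other than the scheduling limits is holding the process back.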

| username: 芮芮是产品

After a good sleep, it should be almost fixed.

| username: 小龙虾爱大龙虾

Increase it further.

| username: Jellybean

leader-schedule-limit
region-schedule-limit
replica-schedule-limit

Try adjusting these parameters to three times their original values.

The overall idea is to adjust the scheduling parameters to speed up the scheduling process.

| username: 像风一样的男子

It’s quite bold to take down one of the two KV nodes in a production environment.

| username: Fly-bird

Taking a node offline won’t affect usage. Just wait for it to go down slowly; rushing it might cause other issues.

| username: 普罗米修斯

There are 10 nodes.

| username: 普罗米修斯

The regions have finished moving off the node. With guidance from h5n1 in the group chat, I checked the PD logs, which mostly showed leader election operations being canceled and then executed again, over and over. The cause was that two peers of the same region had both been on the crashed TiKV server. There were many such regions, so the offline process was very slow. After I ran the multi-replica loss recovery procedure, the TiKV node quickly finished going offline.
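
The diagnosis above, regions that have lost a majority of their replicas to failed TiKV nodes, can be checked with a short script. This is a minimal sketch rather than the procedure actually used in this thread: it assumes region data shaped like the JSON printed by the pd-ctl `region` command (a `regions` list whose entries carry `id` and `peers[].store_id`), and the store IDs, region IDs, and function name are hypothetical.

```python
# Sketch: list regions whose healthy peers no longer form a Raft majority.
# Assumption: `region_dump` mirrors pd-ctl `region` JSON output;
# the sample data and store IDs below are hypothetical.


def regions_without_quorum(region_dump, failed_stores):
    """Return IDs of regions where healthy peers are not a strict majority."""
    failed = set(failed_stores)
    stuck = []
    for region in region_dump.get("regions", []):
        stores = [peer["store_id"] for peer in region.get("peers", [])]
        healthy = sum(1 for store in stores if store not in failed)
        # Raft needs more than half of the peers to elect a leader.
        if stores and healthy * 2 <= len(stores):
            stuck.append(region["id"])
    return stuck


sample = {
    "regions": [
        {"id": 101, "peers": [{"store_id": 1}, {"store_id": 2}, {"store_id": 3}]},
        {"id": 102, "peers": [{"store_id": 1}, {"store_id": 4}, {"store_id": 5}]},
        {"id": 103, "peers": [{"store_id": 3}, {"store_id": 4}, {"store_id": 5}]},
    ]
}

# Stores 1 and 2 stand in for the two TiKV nodes on the crashed server.
stuck_regions = regions_without_quorum(sample, [1, 2])
print(stuck_regions)  # [101]
```

Region 101 has two of its three peers on the failed stores, so it cannot elect a leader and PD cannot schedule its peers away, which matches the canceled-and-retried operations seen in the PD logs.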

| username: dba远航

Adjust the relevant parameters.

| username: andone

Setting the store limit can control the offline speed.

| username: oceanzhang

Is this a bug or a parameter issue? How should it be adjusted?

| username: 普罗米修斯

Stop posting meaningless replies just to earn points. The issue has already been resolved and the solution provided. Instead of adjusting parameters here, check your other replies on your profile. They’re all just filler. Do something meaningful and contribute to the community. Carefully read each post, provide solutions if you can, and if you can’t, go study the documentation. Stop wasting time.