Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: How long does it take TiDB to scale in a server holding 1.4T of data?
Some historical data has been deleted and less space is now in use, so we are planning to scale in a few TiKV servers. How long will it take to remove 1.4T of data from one server, and how is that estimated?
First look at the number of Regions on the node, then estimate from how much that count drops over a given period. If the drain feels slow, you can tune the settings up appropriately.
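To illustrate the method with made-up numbers (not measurements from this cluster): if the store starts with about 55k Regions and the count drops by roughly 7k per hour after the scale-in begins, the drain would finish in about 55 / 7 ≈ 8 hours at that rate.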
Each node has about 55k Regions. Does that number affect the migration speed, or is it largely unrelated to data size?
It should take around 8 hours per 1 TB. If you are not worried about performance impact, you can raise the store limit to 200.
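As a rough sanity check, taking the 8 hours per TB figure above at face value (it depends heavily on hardware and scheduling settings): 1.4 TB would then take roughly 1.4 × 8 ≈ 11 hours under the same conditions.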
"store-limit-mode": "manual", this one?
Here are the current parameter values:
"hot-region-cache-hits-threshold": 3,
"hot-region-schedule-limit": 4,
"leader-schedule-limit": 4,
"leader-schedule-policy": "count",
"low-space-ratio": 0.8,
"max-merge-region-keys": 200000,
"max-merge-region-size": 20,
"max-pending-peer-count": 16,
"max-snapshot-count": 3,
"max-store-down-time": "30m0s",
"merge-schedule-limit": 8,
"patrol-region-interval": "100ms",
"region-schedule-limit": 2048,
"replica-schedule-limit": 64,
"scheduler-max-waiting-operator": 5,
"split-merge-interval": "1h0m0s",
"store-limit-mode": "manual",
"tolerant-size-ratio": 0
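For reference, these values can be read and changed on the fly through pd-ctl; a minimal example, with the PD address as a placeholder:
pd-ctl -u http://<pd-address>:2379 config show
pd-ctl -u http://<pd-address>:2379 config set replica-schedule-limit 64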
Each Region has a default size, and the scheduling of Regions is handled by PD. When all Regions and Leaders of a Store (TiKV node) are evicted, it enters the Tombstone state and can then be pruned for cleanup.
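If it helps, the store state can be checked with pd-ctl, and Tombstone stores are normally cleaned up with tiup; a sketch, with the PD address, store ID, and cluster name as placeholders:
pd-ctl -u http://<pd-address>:2379 store <store-id>   # check "state_name": Up / Offline / Tombstone
tiup cluster prune <cluster-name>                      # clean up nodes already in Tombstone state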
Just run store limit all 200 and you're done.
If it affects performance too much, set the store limit back to the default value of 15.
First, save the current configuration so you can restore it later. Then, provided cluster performance is not affected, on one hand you can increase the following parameters (think of them as consumers):
leader-schedule-limit, max-pending-peer-count, max-snapshot-count, replica-schedule-limit, merge-schedule-limit
e.g., config set parameter_name xxx
On the other hand (equivalent to producers):
Use pd-ctl to run store limit and adjust the add-peer / remove-peer rates,
e.g., store limit all 200
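Putting the two sides together, a pd-ctl session might look like the following; the values are illustrative, not recommendations:
# "consumers": allow more scheduling operators to run concurrently
config set leader-schedule-limit 8
config set replica-schedule-limit 64
config set max-pending-peer-count 64
config set max-snapshot-count 64
config set merge-schedule-limit 8
# "producers": raise the per-store add-peer / remove-peer rates
store limit all 200
# or set them separately:
store limit all 200 add-peer
store limit all 200 remove-peer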
Bookmarking this. No rush, we'll just let it run slowly in the background.
Take it easy: find a test environment and test the speed under the default settings first. Don't max out the bandwidth; stability first.
This depends on the server hardware. We usually start the scale-in first and then watch the monitoring; from the downward trend we can roughly estimate how long it will take. If it is taking too long, we adjust the parameters to speed up the scale-in.
Server hardware is the main factor, but tuning also plays a part.
Which setting does this "store limit all 200" belong to? Does it correspond to "store-limit-mode": "manual"?
Which curve are you referring to? The storage space curve of the TiKV store at the scaled-in IP?
Log in to the Grafana monitoring interface, go to the overview → TiKV panel, and check the distribution of leaders and regions to confirm that the replica migration scheduling is complete.
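You can also cross-check from the PD side; a quick sketch, with the PD address and store ID as placeholders:
pd-ctl -u http://<pd-address>:2379 store <store-id>
In the JSON output, leader_count and region_count under "status" should both drop to 0 before the store moves from Offline to Tombstone.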
After the scale-in had run for a while, we extrapolated the downward curve to roughly estimate when it would finish.
Store Limit | PingCAP Documentation Center
Check the documentation here for store limit.
If the leader count drops to 0, does that mean there are no leaders left on it, and it can be taken offline directly? I just noticed that one of my TiKV nodes has 0 leaders while its region count is about the same as the other TiKV nodes. Does that mean this node holds no Region leaders, only follower replicas?
Thank you, I will take a look.
How long approximately? Please share.
No, no, that's just because I lost my original screenshot at the time and found one from the internet. Before the node can be taken offline, both the leader count and the region count should have dropped to zero. Note that evicting leaders can also bring a node's leader count to zero even though it still holds Regions.
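For completeness, leader eviction itself is done through pd-ctl, which is one way a store can legitimately show 0 leaders while still holding Regions; the store ID below is a placeholder:
scheduler add evict-leader-scheduler <store-id>
scheduler remove evict-leader-scheduler-<store-id>   # remove the scheduler again when you are done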