How long does it take TiDB to scale in one server holding 1.4 TB of data?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb收缩1个服务器1.4T数据需要多久完成

| username: 舞动梦灵

Some historical data has been deleted, so less space is now being used, and we are planning to scale in a few TiKV servers. How long will it take to scale in one server holding 1.4 TB of data, and how is that estimated?

| username: 这里介绍不了我 | Original post link

First look at the number of Regions on the node, then estimate from how quickly that number drops over a period of time. If the offline speed feels too slow, you can adjust the scheduling parameters accordingly.
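For example, a rough way to watch the drop from the command line (just a sketch; it assumes pd-ctl is run through tiup, and <cluster-version> / <pd-ip> are placeholders for your own values):

# List all stores; note region_count and leader_count for the node being scaled in,
# check again after, say, 30 minutes, and extrapolate from the difference.
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 store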

| username: 舞动梦灵 | Original post link

Each node has about 55k Regions. Does this number affect the migration speed? Is it not so much related to the data size?

| username: TiDBer_jYQINSnf | Original post link

It should take around 8 hours per 1 TB, right? If you're not worried about the performance impact, you can increase the store limit to 200.
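For example (a sketch only; <cluster-version> and <pd-ip> are placeholders):

# Back-of-envelope at the default speed: 1.4 TB * ~8 h/TB ≈ 11 hours; raising the limit should shorten this.
# Raise the add-peer/remove-peer limits for every store from the default 15 to 200:
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 store limit all 200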

| username: 舞动梦灵 | Original post link

"store-limit-mode": "manual", this one?
Here are the current parameter values:
"hot-region-cache-hits-threshold": 3,
"hot-region-schedule-limit": 4,
"leader-schedule-limit": 4,
"leader-schedule-policy": "count",
"low-space-ratio": 0.8,
"max-merge-region-keys": 200000,
"max-merge-region-size": 20,
"max-pending-peer-count": 16,
"max-snapshot-count": 3,
"max-store-down-time": "30m0s",
"merge-schedule-limit": 8,
"patrol-region-interval": "100ms",
"region-schedule-limit": 2048,
"replica-schedule-limit": 64,
"scheduler-max-waiting-operator": 5,
"split-merge-interval": "1h0m0s",
"store-limit-mode": "manual",
"tolerant-size-ratio": 0
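(For reference, a sketch of how these scheduling values can be listed again later with pd-ctl; <cluster-version> and <pd-ip> are placeholders:)

# Print the current PD scheduling configuration, including all of the values above
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 config show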

| username: 这里介绍不了我 | Original post link

Each Region has a default size, and Region scheduling is handled by PD. When all Regions and Leaders have been moved off a Store (TiKV node), it enters the Tombstone state and can then be cleaned up.
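Once the store actually reaches Tombstone, its record can be cleaned up from PD; a minimal sketch (pd-ctl via tiup, same placeholders as elsewhere in this thread):

# Remove the metadata of all stores that are already in the Tombstone state
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 store remove-tombstone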

| username: TiDBer_jYQINSnf | Original post link

Set store limit all 200 and that's it.
If it affects performance noticeably, change the store limit back to its default value of 15.
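A sketch of rolling it back (placeholders as usual):

# Restore the add-peer/remove-peer limits for every store to the default value
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 store limit all 15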

| username: 这里介绍不了我 | Original post link

First, save the current configuration so you can restore it later, and keep the impact on cluster performance in mind. On one hand, you can raise the following parameters (think of them as the consumers):
leader-schedule-limit, max-pending-peer-count, max-snapshot-count, replica-schedule-limit, merge-schedule-limit
e.g., config set parameter_name xxx
On the other hand (the producers), use pd-ctl's store limit to adjust add-peer and remove-peer,
e.g., store limit all 200
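Putting both sides together, a minimal sketch in pd-ctl (run via tiup; <cluster-version> and <pd-ip> are placeholders, and the concrete numbers are illustrative examples rather than tuned recommendations):

# "consumers": allow PD to run more scheduling operators in parallel
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 config set replica-schedule-limit 128
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 config set max-pending-peer-count 64
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 config set max-snapshot-count 64
# "producers": allow each store to add/remove up to 200 Regions per minute
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 store limit all 200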

| username: DBAER | Original post link

Marking this thread. No rush on our side; we just let it run slowly in the background.

| username: TiDBer_JUi6UvZm | Original post link

Take it easy, find a test environment, and test the speed under default settings. Don’t max out the bandwidth. Stability first.

| username: Kongdom | Original post link

This depends on the server hardware. We usually start the scale-in first and then watch the monitoring metrics; based on the downward trend we can roughly estimate how long it will take. If it takes too long, we adjust the parameters to speed up the scale-in.

| username: Jack-li | Original post link

Hardware is the main factor, but tuning also plays a part.

| username: 舞动梦灵 | Original post link

Which setting does this "store limit all 200" refer to? Does it correspond to "store-limit-mode": "manual"?

| username: 舞动梦灵 | Original post link

Which curve are you referring to? The storage space curve of the TiKV corresponding to the scaled-down IP?

| username: Kongdom | Original post link

Log in to the Grafana monitoring interface, go to the overview → TiKV panel, and check the distribution of leaders and regions to confirm that the replica migration scheduling is complete.
:yum: After a period of scaling down, we extended the downward curve to roughly estimate when it would end.
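If you prefer the command line to Grafana, a rough sketch of the same check (pd-ctl via tiup; <cluster-version>, <pd-ip>, and <store-id> are placeholders):

# Show leader_count / region_count for the store being scaled in; sample it twice
# some time apart and extrapolate the remaining time linearly from the drop.
tiup ctl:v<cluster-version> pd -u http://<pd-ip>:2379 store <store-id>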

| username: 这里介绍不了我 | Original post link

Store Limit | PingCAP Documentation Center
Check the documentation here for store limit.

| username: 舞动梦灵 | Original post link

If the leader count drops to 0, does that mean there are no leaders left on the node and it can be taken offline directly? I just noticed that one of my TiKV nodes has 0 leaders, while its region count is the same as on the other TiKV nodes. Does that mean this node holds no Region leaders and only follower replicas?

| username: 舞动梦灵 | Original post link

Thank you, I will take a look.

| username: xiaoqiao | Original post link

How long approximately? Please share.

| username: Kongdom | Original post link

No, no, that was only because I had lost my original screenshot at the time and found one from the internet. It should be that both the leader count and the region count go to zero. If you only evict the leaders, the node's leader count can also be zero.