Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
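Original topic: In a v6.5.2 TiDB cluster with TiCDC replicating from the primary to the standby cluster, the primary cluster does not reclaim space after data is deleted; only the standby cluster does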
【TiDB Usage Environment】Production Environment
【TiDB Version】v6.5.2
【Reproduction Path】High availability between the primary and standby clusters is achieved through TiCDC
【Encountered Problem: Phenomenon and Impact】After running for a period of time, the primary cluster keeps growing while the standby cluster remains stable. The primary cluster has now reached 1TB while holding only about 200GB of data, whereas the standby cluster occupies only 280GB.
Is the primary database GC progressing?
Check if the monitoring TiDB => GC panel is running normally.
Check the GC retention time and whether GC is enabled.
First, check in monitoring whether the number of regions is continuously increasing.
The number of regions in the primary cluster keeps increasing, while the standby cluster's region count remains stable.
The primary cluster uses the default GC life time of 10 minutes. As I understand it, the effective GC time should be governed by the GC parameter of ticdc-server, and the gc-ttl set on ticdc-server is 12 hours.
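The interaction between the 10-minute GC life time and TiCDC described above can be sketched roughly as follows. This is a simplified model and not TiDB's actual code: it assumes that GC may only advance to the older of "now minus GC life time" and the oldest safe point held by any service such as TiCDC. The function name and the use of `datetime` values instead of real TSO timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def effective_gc_safe_point(now, gc_life_time, service_safe_points):
    """Return the point up to which old versions may be collected.

    Hypothetical model: GC cannot pass the oldest safe point that any
    service (for example a TiCDC changefeed) still holds.
    """
    candidate = now - gc_life_time
    return min([candidate, *service_safe_points])

now = datetime(2024, 1, 1, 12, 0, 0)
gc_life_time = timedelta(minutes=10)        # default mentioned in the thread

healthy_cdc = now - timedelta(minutes=1)    # changefeed keeping up (made-up value)
stalled_cdc = now - timedelta(hours=6)      # changefeed lagging (made-up value)

print(effective_gc_safe_point(now, gc_life_time, [healthy_cdc]))
print(effective_gc_safe_point(now, gc_life_time, [stalled_cdc]))
```

Under this model, a changefeed that keeps up does not delay GC beyond the 10-minute life time, while a lagging one holds GC back, which is why it is worth confirming that the changefeed checkpoint is advancing.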
The primary database GC is progressing, but it is not reclaiming space.
“The main cluster has reached 1T, while the backup cluster only has 280G.”
The gap is too large. How much data have you deleted? Is the disk exclusively for TiKV, or is it mixed with other services?
I eventually raised “max-merge-region-size” from 20 to 100 (MiB) and “max-merge-region-keys” from 200,000 to 500,000, and lowered merge-schedule-limit (to observe whether there was any impact on queries). Space usage and the region count on the primary cluster then started to drop slowly, with space eventually falling from 1.25T to 300G, roughly equal to the standby cluster.
I think these two PD parameters have an AND relationship, meaning a region only becomes a candidate for merging with an adjacent region if its size is <= 20MB and its number of keys is <= 200,000. If a region is larger than 20MB and has undergone a large number of delete operations, then regardless of how many keys remain (as long as it is more than 0), it will not be merged and reclaimed, which causes fragmentation. I’m not sure whether this logic can be adjusted or whether my understanding is incorrect. I welcome any insights from the experts.
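The AND relationship described above can be illustrated with a minimal sketch. It only models the condition as stated in this thread and is not PD's real scheduling code; the class, the function name, and the sample region are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MergeLimits:
    # Values taken from the thread; sizes are in MiB.
    max_merge_region_size: int = 20
    max_merge_region_keys: int = 200_000

def is_merge_candidate(size_mib, keys, limits):
    """A region qualifies only if BOTH thresholds are satisfied."""
    return (size_mib <= limits.max_merge_region_size
            and keys <= limits.max_merge_region_keys)

defaults = MergeLimits()
tuned = MergeLimits(max_merge_region_size=100, max_merge_region_keys=500_000)

# Hypothetical region after heavy deletes: few keys left, but its
# approximate size is still reported above the default size limit.
size_mib, keys = 60, 1_000

print(is_merge_candidate(size_mib, keys, defaults))  # False
print(is_merge_candidate(size_mib, keys, tuned))     # True
```

With the default limits the sample region is skipped because of its size alone, even though it holds very few keys; with the raised limits it qualifies, which matches the recovery observed after the change.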
Check the regions in the primary cluster to see which objects are occupying the space. I ran into a case where a table's historical statistics (collected by ANALYZE) were occupying hundreds of gigabytes. The SQL query is as follows:
SELECT DB_NAME, TABLE_NAME, SUM(APPROXIMATE_SIZE) AS approximate_size_mb
FROM (
    -- Deduplicate so that each region is counted once per table
    SELECT t.DB_NAME, t.TABLE_NAME, t.REGION_ID, t.APPROXIMATE_SIZE
    FROM information_schema.TIKV_REGION_STATUS t
    GROUP BY t.DB_NAME, t.TABLE_NAME, t.REGION_ID, t.APPROXIMATE_SIZE
) a
GROUP BY DB_NAME, TABLE_NAME
ORDER BY approximate_size_mb DESC;
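To make the intent of the query easier to follow, here is a small sketch of the same aggregation over in-memory rows. The rows are made-up sample data, and the function name is hypothetical; it only mirrors the deduplicate-then-sum logic of the SQL.

```python
from collections import defaultdict

def size_per_table(rows):
    """rows: iterable of (db_name, table_name, region_id, approximate_size)."""
    seen = set()
    totals = defaultdict(int)
    for db, table, region_id, size in rows:
        key = (db, table, region_id, size)
        if key in seen:
            continue  # same region listed again for this table; count once
        seen.add(key)
        totals[(db, table)] += size
    # Largest tables first, like ORDER BY ... DESC
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Made-up sample rows for illustration only
rows = [
    ("app", "orders", 1, 96),
    ("app", "orders", 1, 96),   # duplicate row for the same region
    ("app", "orders", 2, 40),
    ("mysql", "stats_history", 3, 500),
]

result = size_per_table(rows)
for (db, table), size in result:
    print(db, table, size)
```

The inner deduplication matters because a region can appear in more than one row for the same table, and summing those rows directly would overstate the table's size.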
The two parameters have an “and” relationship. Additionally, the default value of merge-schedule-limit is 8, which is already very conservative and does not need to be lowered.
Generally speaking, this kind of issue shouldn’t require parameter tuning. First, take a look at how the space is actually being used.
The application deletes data every day, with more deletions in the early stages; the disks are dedicated to TiKV.
How did you handle it in the end? Are the historical statistics from ANALYZE also stored in a table?
Change the tidb_enable_historical_stats parameter to off. In version 6.5, it should be off by default, but in version 7.5, it is on by default, which is quite troublesome. Then truncate mysql.stats_history.
In theory, deleting data doesn’t free up disk space right away, but the space can be reused after GC, so usage shouldn’t keep growing.
I remember that after TiCDC is started, it modifies the GC TTL on the TiDB cluster to prevent GC from cleaning up the data. Please check.