V6.5.2 TiDB Cluster: With TiCDC-Synchronized Primary and Secondary Clusters, Deleting Data Reclaims Space Only in the Secondary Cluster, Not in the Primary

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: v6.5.2tidb集群,ticdc同步主备集群当存在数据删除的情况下,主集群不回收空间,只有备集群回收空间

| username: TiDBer_uEurBqwn

【TiDB Usage Environment】Production Environment
【TiDB Version】v6.5.2
【Reproduction Path】Primary-standby high availability implemented with TiCDC
【Encountered Problem: Phenomenon and Impact】After running for a period of time, the primary cluster's disk usage keeps growing while the standby cluster stays stable. The primary cluster has reached 1 TB while holding only about 200 GB of data, whereas the standby cluster is only 280 GB.
【Resource Configuration】Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: tidb菜鸟一只 | Original post link

Is the primary database GC progressing?

| username: 小龙虾爱大龙虾 | Original post link

Check the TiDB => GC panel in the monitoring to see whether GC is running normally.

| username: dba远航 | Original post link

Check the GC retention time and whether GC is enabled.

| username: zhanggame1 | Original post link

First, check in the monitoring whether the number of Regions keeps increasing.

| username: TiDBer_uEurBqwn | Original post link

The regions of the primary cluster keep increasing, while the backup cluster remains stable and does not increase.
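
If it helps to cross-check outside the dashboards, the Region counts can also be read over SQL. A minimal sketch, assuming v6.5's information_schema tables:

```sql
-- Region and leader counts per TiKV store, as reported by PD
SELECT ADDRESS, REGION_COUNT, LEADER_COUNT
FROM information_schema.TIKV_STORE_STATUS;

-- Total number of Regions in the cluster
SELECT COUNT(DISTINCT REGION_ID)
FROM information_schema.TIKV_REGION_STATUS;
```

Running this periodically on both clusters makes the divergence easy to quantify.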

| username: TiDBer_uEurBqwn | Original post link

The primary cluster's GC life time is the default 10 minutes. The effective GC safepoint should also be controlled by the TiCDC server's GC parameter; the gc-ttl configured on the ticdc-server is 12 hours.
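
For reference, the GC bookkeeping that TiDB keeps can be read directly; a minimal sketch, assuming v6.5 (the gc-ttl itself lives in the TiCDC server configuration, not in TiDB):

```sql
-- GC enabled flag, life time, run interval, and last run time
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_enable', 'tikv_gc_life_time',
                        'tikv_gc_run_interval', 'tikv_gc_last_run_time');

-- The same life time exposed as a system variable
SHOW VARIABLES LIKE 'tidb_gc_life_time';
```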

| username: TiDBer_uEurBqwn | Original post link

The primary database GC is progressing, but it is not reclaiming space.

| username: wfxxh | Original post link

“The main cluster has reached 1T, while the backup cluster only has 280G.”
The gap is too large. How much data have you deleted? Is the disk exclusively for TiKV, or is it mixed with other services?

| username: TiDBer_uEurBqwn | Original post link

In the end I raised max-merge-region-size from 20 to 100 and max-merge-region-keys from 200,000 to 500,000, and lowered the merge scheduling limit (to watch whether it affected queries). The primary cluster's space usage and Region count then started to come down slowly, eventually dropping from 1.25 TB to 300 GB, roughly the same as the standby cluster.

I think these two parameters have an AND relationship: a Region only becomes a candidate for merging with an adjacent Region when its size is <= 20 MB and its key count is <= 200,000. So if a Region is larger than 20 MB and has had a large number of deletes, then no matter how low its key count is (as long as it is above 0), it will not be merged and its space will not come back, which leaves fragmentation. I'm not sure whether this logic can be adjusted or whether my understanding is wrong; I'd welcome any insights from the experts.
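
For anyone wanting to reproduce this, the merge thresholds are PD scheduling settings. A minimal sketch of checking and raising them over SQL, assuming the deployment permits online changes via SET CONFIG (pd-ctl `config set` achieves the same):

```sql
-- Current merge-related scheduling settings on PD
SHOW CONFIG WHERE type = 'pd' AND name LIKE 'schedule.max-merge-region%';
SHOW CONFIG WHERE type = 'pd' AND name = 'schedule.merge-schedule-limit';

-- Raise the thresholds so that larger, heavily-deleted Regions qualify for merging
SET CONFIG pd `schedule.max-merge-region-size` = 100;
SET CONFIG pd `schedule.max-merge-region-keys` = 500000;
```

The trade-off is that more aggressive merging means more Region scheduling work, which is why it was rolled out while watching query latency.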

| username: zhanggame1 | Original post link

Check the Regions in the primary cluster to see which objects are taking up the space. I ran into a case where a table's historical statistics from ANALYZE were occupying hundreds of gigabytes. The SQL query is as follows:

```sql
SELECT DB_NAME, TABLE_NAME, SUM(APPROXIMATE_SIZE)
FROM (
    SELECT t.DB_NAME, t.TABLE_NAME, region_id, t.APPROXIMATE_SIZE
    FROM information_schema.TIKV_REGION_STATUS t
    GROUP BY t.DB_NAME, t.TABLE_NAME, region_id, t.APPROXIMATE_SIZE
) a
GROUP BY DB_NAME, TABLE_NAME
ORDER BY 3 DESC;
```

| username: Jayjlchen | Original post link

The two parameters have an “and” relationship. Additionally, the default value of merge-schedule-limit is 8, which is already very conservative and does not need to be lowered.

| username: zhanggame1 | Original post link

Generally speaking, this kind of issue usually doesn't need parameter tuning. First take a look at how the space is actually being used.

| username: TiDBer_uEurBqwn | Original post link

The business deletes data every day, with heavier deletions early on; the disks are used exclusively for TiKV.

| username: TiDBer_uEurBqwn | Original post link

How did you handle it in the end? Will the historical data from ANALYZE also be stored in a table?

| username: zhanggame1 | Original post link

Change the tidb_enable_historical_stats parameter to off. In version 6.5, it should be off by default, but in version 7.5, it is on by default, which is quite troublesome. Then truncate mysql.stats_history.
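
Spelled out as SQL, a minimal sketch of that cleanup (assuming the historical statistics are no longer needed):

```sql
-- Stop collecting historical statistics
SET GLOBAL tidb_enable_historical_stats = OFF;

-- Roughly how much space the history table occupies (approximate size per Region, summed)
SELECT SUM(APPROXIMATE_SIZE)
FROM information_schema.TIKV_REGION_STATUS
WHERE DB_NAME = 'mysql' AND TABLE_NAME = 'stats_history';

-- Clear the accumulated history; the space becomes reusable after GC
TRUNCATE TABLE mysql.stats_history;
```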

| username: zhanggame1 | Original post link

Although in theory that doesn't immediately free up disk space, the space can be reused after GC, so usage shouldn't keep growing.

| username: dba远航 | Original post link

I remember that after TiCDC is started, it holds back the GC safepoint (within its gc-ttl) to prevent GC from cleaning up data that has not yet been replicated. Please check that.
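
A quick way to check whether something is pinning GC; a minimal sketch (pd-ctl's `service-gc-safepoint` subcommand also lists the safepoints held by services such as TiCDC):

```sql
-- With a 10-minute GC life time, the safe point should normally trail the
-- current time by roughly 10 minutes; a much larger gap suggests a lagging
-- or stopped changefeed is holding it back.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');

SELECT NOW();
```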