V6.5.2 TiDB Cluster: With TiCDC-Synchronized Primary and Secondary Clusters, Deleting Data Reclaims Space Only in the Secondary Cluster, Not in the Primary

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: v6.5.2tidb集群,ticdc同步主备集群当存在数据删除的情况下,主集群不回收空间,只有备集群回收空间

| username: TiDBer_uEurBqwn

【TiDB Usage Environment】Production Environment
【TiDB Version】v6.5.2
【Reproduction Path】Primary-standby high availability implemented with TiCDC
【Encountered Problem: Phenomenon and Impact】After running for a period of time, the primary cluster's disk usage keeps growing while the standby cluster stays stable. The primary cluster has reached 1 TB while holding only about 200 GB of data, whereas the standby cluster is only 280 GB.
【Resource Configuration】Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: tidb菜鸟一只 | Original post link

Is the primary database GC progressing?

| username: 小龙虾爱大龙虾 | Original post link

Check the TiDB => GC panel in the monitoring to see whether GC is running normally.

| username: dba远航 | Original post link

Check the GC retention time and whether GC is enabled.

| username: zhanggame1 | Original post link

First, check in the monitoring whether the number of Regions keeps increasing.

| username: TiDBer_uEurBqwn | Original post link

The regions of the primary cluster keep increasing, while the backup cluster remains stable and does not increase.
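
If it helps to cross-check outside the dashboards, the Region counts can also be read over SQL. A minimal sketch, assuming v6.5's information_schema tables:

```sql
-- Region and leader counts per TiKV store, as reported by PD
SELECT ADDRESS, REGION_COUNT, LEADER_COUNT
FROM information_schema.TIKV_STORE_STATUS;

-- Total number of Regions in the cluster
SELECT COUNT(DISTINCT REGION_ID)
FROM information_schema.TIKV_REGION_STATUS;
```

Running this periodically on both clusters makes the divergence easy to quantify.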

| username: TiDBer_uEurBqwn | Original post link

The primary cluster's GC life time is the default 10 minutes. The effective GC safepoint should also be controlled by the TiCDC server's GC parameter; the gc-ttl configured on the ticdc-server is 12 hours.
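
For reference, the GC bookkeeping that TiDB keeps can be read directly; a minimal sketch, assuming v6.5 (the gc-ttl itself lives in the TiCDC server configuration, not in TiDB):

```sql
-- GC enabled flag, life time, run interval, and last run time
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_enable', 'tikv_gc_life_time',
                        'tikv_gc_run_interval', 'tikv_gc_last_run_time');

-- The same life time exposed as a system variable
SHOW VARIABLES LIKE 'tidb_gc_life_time';
```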

| username: TiDBer_uEurBqwn | Original post link

The primary database GC is progressing, but it is not reclaiming space.

| username: wfxxh | Original post link

“The main cluster has reached 1T, while the backup cluster only has 280G.”
The gap is too large. How much data have you deleted? Is the disk exclusively for TiKV, or is it mixed with other services?

| username: TiDBer_uEurBqwn | Original post link

In the end I raised max-merge-region-size from 20 to 100 and max-merge-region-keys from 200,000 to 500,000, and lowered the merge scheduling limit (to watch whether it affected queries). The primary cluster's space usage and Region count then started to come down slowly, eventually dropping from 1.25 TB to 300 GB, roughly the same as the standby cluster.

I think these two parameters have an AND relationship: a Region only becomes a candidate for merging with an adjacent Region when its size is <= 20 MB and its key count is <= 200,000. So if a Region is larger than 20 MB and has had a large number of deletes, then no matter how low its key count is (as long as it is above 0), it will not be merged and its space will not come back, which leaves fragmentation. I'm not sure whether this logic can be adjusted or whether my understanding is wrong; I'd welcome any insights from the experts.
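
For anyone wanting to reproduce this, the merge thresholds are PD scheduling settings. A minimal sketch of checking and raising them over SQL, assuming the deployment permits online changes via SET CONFIG (pd-ctl `config set` achieves the same):

```sql
-- Current merge-related scheduling settings on PD
SHOW CONFIG WHERE type = 'pd' AND name LIKE 'schedule.max-merge-region%';
SHOW CONFIG WHERE type = 'pd' AND name = 'schedule.merge-schedule-limit';

-- Raise the thresholds so that larger, heavily-deleted Regions qualify for merging
SET CONFIG pd `schedule.max-merge-region-size` = 100;
SET CONFIG pd `schedule.max-merge-region-keys` = 500000;
```

The trade-off is that more aggressive merging means more Region scheduling work, which is why it was rolled out while watching query latency.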

| username: zhanggame1 | Original post link

Check the Regions in the primary cluster to see which objects are taking up the space. I ran into a case where a table's historical statistics from ANALYZE were occupying hundreds of gigabytes. The SQL query is as follows:

```sql
SELECT DB_NAME, TABLE_NAME, SUM(APPROXIMATE_SIZE)
FROM (
    SELECT t.DB_NAME, t.TABLE_NAME, region_id, t.APPROXIMATE_SIZE
    FROM information_schema.TIKV_REGION_STATUS t
    GROUP BY t.DB_NAME, t.TABLE_NAME, region_id, t.APPROXIMATE_SIZE
) a
GROUP BY DB_NAME, TABLE_NAME
ORDER BY 3 DESC;
```

| username: Jayjlchen | Original post link

The two parameters have an “and” relationship. Additionally, the default value of merge-schedule-limit is 8, which is already very conservative and does not need to be lowered.

| username: zhanggame1 | Original post link

Generally speaking, this kind of issue usually doesn't need parameter tuning. First take a look at how the space is actually being used.

| username: TiDBer_uEurBqwn | Original post link

The business deletes data every day, with heavier deletions early on; the disks are used exclusively for TiKV.

| username: TiDBer_uEurBqwn | Original post link

How did you handle it in the end? Will the historical data from ANALYZE also be stored in a table?

| username: zhanggame1 | Original post link

Change the tidb_enable_historical_stats parameter to off. In version 6.5, it should be off by default, but in version 7.5, it is on by default, which is quite troublesome. Then truncate mysql.stats_history.
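
Spelled out as SQL, a minimal sketch of that cleanup (assuming the historical statistics are no longer needed):

```sql
-- Stop collecting historical statistics
SET GLOBAL tidb_enable_historical_stats = OFF;

-- Roughly how much space the history table occupies (approximate size per Region, summed)
SELECT SUM(APPROXIMATE_SIZE)
FROM information_schema.TIKV_REGION_STATUS
WHERE DB_NAME = 'mysql' AND TABLE_NAME = 'stats_history';

-- Clear the accumulated history; the space becomes reusable after GC
TRUNCATE TABLE mysql.stats_history;
```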

| username: zhanggame1 | Original post link

Although in theory that doesn't immediately free up disk space, the space can be reused after GC, so usage shouldn't keep growing.

| username: dba远航 | Original post link

I remember that after TiCDC is started, it holds back the GC safepoint (within its gc-ttl) to prevent GC from cleaning up data that has not yet been replicated. Please check that.
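
A quick way to check whether something is pinning GC; a minimal sketch (pd-ctl's `service-gc-safepoint` subcommand also lists the safepoints held by services such as TiCDC):

```sql
-- With a 10-minute GC life time, the safe point should normally trail the
-- current time by roughly 10 minutes; a much larger gap suggests a lagging
-- or stopped changefeed is holding it back.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');

SELECT NOW();
```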