Some Issues with Deleting Data in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 删除TiDB数据的一些问题

| username: TiDB_C罗

There is a scenario where TiDB’s data is synchronized from other MySQL databases or merged together (such as sharding, partitioning, etc.). In TiDB, some statistical queries are performed, and only the most recent data needs to be retained in TiDB, while other data is purged using pt-archiver.

Problems:

  1. The cleared data does not release disk space.
  2. The select min(id) query is extremely slow.

Suggested improvements:

  1. GC should check if all data in a region has been deleted, and if so, delete the region.
  2. If the delete operation covers the entire region, then delete the entire region.
  3. Optimize min/max queries by storing the maximum and minimum values in each region, and the region IDs of the maximum and minimum values in the table metadata.
| username: zhanggame1 | Original post link

Is this id a primary key? It shouldn’t be very slow, right?

| username: tidb菜鸟一只 | Original post link

  1. Check the settings for “max-merge-region-size” and “max-merge-region-keys”. Normally, if there are regions in TiDB that meet the conditions, it will automatically merge regions to free up space.
  2. If there is an index or primary key on the id, min(id) will be very fast. If there is no index, since TiDB’s underlying storage is in a key-value structure and the ids on each region are not stored in descending order, getting the minimum value requires a full region scan…
| username: 小龙虾爱大龙虾 | Original post link

Here are my personal thoughts:
Questions:

  1. Due to the existence of MVCC, deletions do not actually remove data immediately; they need to wait for GC, and the underlying RocksDB also requires compaction to truly free up data.
  2. Because a large amount of GC data is scanned, you can shorten the GC time or use a different method to delete data (truncate partition).

Suggestions:

  1. A region is just a logical concept. Deleting a region will create holes, so merging is still necessary to reduce the number.
  2. A region is just a logical concept. I think what you mean is deleting the underlying corresponding SST, but I believe determining whether all SSTs of this region need to be deleted is an issue. Operations like drop table or truncate table correspond to a range of keys, which can be mapped to regions, and this range will no longer have data written to it. After GC, the underlying SST can be directly deleted. However, delete operations do not necessarily correspond to a range of keys. When is the condition for a region to be considered empty determined? What if there are subsequent inserts into this region?
  3. I’m not sure if the logical concept of a region can store the maximum and minimum values. If stored, where would it be stored, and which column’s maximum and minimum values would be stored? Which MVCC version’s maximum and minimum values would be stored? Storing the region ID with the maximum and minimum values in the table metadata is unnecessary, right? The PD will tell you which region to look for through index queries for the maximum and minimum values.
| username: yulei7633 | Original post link

Clearing data can free up disk space, but you need to wait for GC.

| username: Soysauce520 | Original post link

Adjusting the gc_file_time, which defaults to 10 minutes, is recommended to be done when there is a timely pump or BR in place. For the issue of slow select queries, it is suggested to check the execution plan to see if there is no pushdown, or if the KV resources themselves are fully utilized.

| username: TiDB_C罗 | Original post link

I waited for a month, but there was no release. A lot of data was deleted, and the number of regions in the table only increased.

| username: zhanggame1 | Original post link

Delete does not release.

| username: 像风一样的男子 | Original post link

Delete will not release space. MySQL also has this mechanism.

| username: 像风一样的男子 | Original post link

You can manually trigger compaction to free up space.

| username: TiDB_C罗 | Original post link

If part of the data in a region is deleted, not releasing it is understandable. But if the entire region is deleted, why is it still retained?

| username: Raymond | Original post link

What version are you using?
I also encountered this issue where the MVCC version is not being recycled (version 6.5.3). In the end, I adjusted a parameter and it was resolved by setting enable-compaction-filter to false (this can be adjusted online, and after adjusting this parameter, it may take 1-2 days).