Methods for Cleaning Historical Data in a 4TB TiDB Table

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb一个4T大的表清理历史数据方法

| username: 舞动梦灵

There is now a 4TB table. I want to keep the data within one year and delete the data older than one year to reduce disk usage.
Two methods:

  1. Create an empty table, insert the data within one year, and then directly drop the old table.
  2. Delete the data older than one year.

I want to ask if anyone has operated a similar scheme.
I want to know, for example, if the data within one year is 700GB, which takes longer: inserting 700GB of data or deleting the data older than one year?
Which is more efficient, and which reduces disk usage faster?

| username: Kongdom | Original post link

:thinking: When the data in the table is older than 2 years, I choose option 1.

| username: 像风一样的男子 | Original post link

Choose option 1. It is recommended to set partitions for newly created tables, as it will speed up the deletion process.

| username: 随缘天空 | Original post link

This depends on the percentage of your data from the past year relative to the entire table. If it’s around 50%, it’s recommended to directly perform a delete operation, as the time taken to insert a new record is generally higher than deleting one. If the data from the past year is a smaller proportion, you might consider creating a new table, inserting the data, and then dropping the old table; otherwise, use the other approach.

| username: 随缘天空 | Original post link

One thing to note is that the delete operation does not release disk space. Newly inserted data will reuse the disk space from the delete operation, but the drop operation will release disk space.

| username: Kongdom | Original post link

:yum: If you just want to reduce disk usage, you can clear the logs. The logs also take up a considerable amount of space.

| username: h5n1 | Original post link

Solution 1: If data import and export is not required for one year, you can use non-transactional DML and run it in batches in the background.

| username: 普罗米修斯 | Original post link

Data from over a year ago exceeds 3TB. Deleting it won’t immediately free up space; instead, it will increase space usage by inserting delete markers. Recommended solution 1: After dropping, the data will be cleared during the GC time.

| username: zhanggame1 | Original post link

Deleting large amounts of data can cause a temporary spike in space usage, so proceed with caution.

| username: zhanggame1 | Original post link

drop will release disk space

| username: Kongdom | Original post link

Even if it’s not deleted immediately, it shouldn’t cause a sudden increase in space usage, right?

| username: zhanggame1 | Original post link

The non-transactional DML delete starts in tens of milliseconds, but after running for a long time, it takes a few seconds to delete once, which is too slow.

| username: zhanggame1 | Original post link

TiDB features: All DML operations are inserts, and deletes consume a lot of disk space.

| username: zhanggame1 | Original post link

Reducing disk usage can only be achieved with solution 1, not solution 2. Solution 2 will not reduce disk usage.

| username: Kongdom | Original post link

:joy: I can understand if you say it consumes memory, but it’s really hard to understand why it consumes hard disk space. Isn’t it just marking something?

| username: zhanggame1 | Original post link

It could also be the raft logs or something similar taking up space. In any case, you should test it to find out. The most critical issue is that the more you delete, the slower the deletion process becomes.

| username: 随缘天空 | Original post link

Yes, the drop operation will release.

| username: 随缘天空 | Original post link

It will increase. Because the delete operation in TiDB is equivalent to adding a new piece of data, it just marks the status of the data as “del”. In subsequent queries, the data with the latest timestamp will be used, and once found, previous data will not be queried. The old data will be reclaimed after the GC cycle.

| username: Fly-bird | Original post link

  1. Create an empty table, insert data from within the past year, and then directly drop the old table.
| username: andone | Original post link

Choose 1. Disk 2 will not be released, and deleting it will generate a lock.