How to Quickly Delete Over 400 Million Rows from a Single Table Based on Conditions in TiDB 6.x or 7.x

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB6.x或者7.x单表中根据条件怎么快速删除超4亿的数据。

| username: TiDBer_RQobNXGv

How can I quickly delete more than 400 million rows from a single table based on conditions in TiDB 6.x or 7.x? On the previous 5.x version, deleting through a loop in application code was quite slow.

| username: dba远航 | Original post link

If it’s a full-table deletion, you can use TRUNCATE. If it’s a partial deletion and the remaining data is very small, you can use CREATE TABLE tab_name AS SELECT * FROM table plus the filter conditions, then drop the old table and rename the new one.
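
A minimal sketch of that approach, with hypothetical table and column names (big_table, created_at). One caveat: TiDB does not support the CREATE TABLE ... AS SELECT syntax, so the equivalent two-step form is used here:

```sql
-- Copy out only the rows to KEEP (hypothetical filter), then swap tables.
CREATE TABLE big_table_new LIKE big_table;

INSERT INTO big_table_new
SELECT * FROM big_table
WHERE created_at >= '2024-01-01';  -- condition selecting the rows to retain
-- (if the keep-set is large, run this INSERT in batches as well)

RENAME TABLE big_table TO big_table_old;
RENAME TABLE big_table_new TO big_table;
DROP TABLE big_table_old;          -- only after verifying the new table
```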

| username: 小龙虾爱大龙虾 | Original post link

You can give it a try :grinning:

| username: zhanggame1 | Original post link

It’s still better to delete in a loop. Be careful not to rely on a bare “LIMIT 10000” in the loop; each pass has to scan over the not-yet-GC’d versions of previously deleted rows, so it gets slower and slower. It’s best to add a primary key range to the WHERE condition.
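
A minimal sketch of that loop, assuming a hypothetical table big_table with an integer primary key id and a business filter on created_at; the client advances the key range and repeats:

```sql
-- One batch: bounded by a primary-key range, so each DELETE scans only
-- its own slice instead of re-scanning old delete marks from the start.
DELETE FROM big_table
WHERE id >= 1 AND id < 1000001        -- current key window
  AND created_at < '2023-01-01';      -- business condition

-- Advance the window ([1000001, 2000001), ...) from client code and
-- repeat until MAX(id) is passed. Keep each batch small enough to stay
-- under the transaction size limit (txn-total-size-limit).
```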

As mentioned above, if the remaining data is also in the hundreds of millions, don’t bother creating a new table.

| username: ShawnYan | Original post link

So is it still slow in the new version?

| username: zhanggame1 | Original post link

It’s slow no matter which version. A delete has two parts: 1. querying the rows to be deleted, and 2. writing the delete records. When the data volume is large, both steps are slow.

| username: wangccsy | Original post link

It depends on how much data the table holds. If the amount to retain and the amount to delete (you need to delete 400 million rows) are roughly equivalent, the original method doesn’t leave much room for performance improvement. If the data to retain is much smaller than the data to delete, or you need to delete everything, you can handle it with the method in the second post.

| username: xingzhenxiang | Original post link

I just use a script to connect and run the deletes directly; the script can’t be copied here.

| username: xingzhenxiang | Original post link

This is the script (attached as an image in the original post, not reproduced here).

| username: porpoiselxj | Original post link

For a deletion of this magnitude, the amount of data to be retained probably isn’t small either. Consider using Dumpling to export the data that needs to be kept (with a WHERE condition), then rename the original table, create a new one, and import the export with TiDB Lightning.
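
A sketch of the SQL side of that flow, again with hypothetical names; the export itself would use Dumpling’s --where filter and the import would use TiDB Lightning, both outside SQL:

```sql
-- 1. Export the rows to keep, e.g.:
--    tiup dumpling -T mydb.big_table --where "created_at >= '2024-01-01'" -o /backup
-- 2. Move the original table aside and create an empty copy:
RENAME TABLE big_table TO big_table_old;
CREATE TABLE big_table LIKE big_table_old;
-- 3. Import the exported data into mydb.big_table with TiDB Lightning.
-- 4. After verifying the import:
DROP TABLE big_table_old;
```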

| username: TiDBer_RQobNXGv | Original post link

This table is constantly being written to, and if a new table is created, some data will be lost.

| username: Kongdom | Original post link

:thinking: If you need to delete 400 million records, then this table is very large. It is still recommended to delete in batches by primary key.

| username: Miracle | Original post link

In this case, you can only use a script to delete in a loop; other methods are not applicable.

| username: tidb菜鸟一只 | Original post link

For this kind of situation, it’s best to use a partitioned table so you can directly truncate the specified partition. If the table isn’t partitioned and you need to retain a lot of data, first record the primary keys that meet the conditions, then delete them in parallel in chunks. If the amount of data to retain is small, pause the business, rename the table away and recreate the original table, then insert the retained data back into it.
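
For the partitioned-table route, a sketch assuming a hypothetical table that is range-partitioned by month:

```sql
-- Remove a whole month of expired data with one DDL instead of row deletes:
ALTER TABLE big_table TRUNCATE PARTITION p202301;
-- or drop the partition definition along with its data:
ALTER TABLE big_table DROP PARTITION p202301;
```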

| username: kkpeter | Original post link

What you said seems more reliable.

| username: yiduoyunQ | Original post link

Refer to the Paginate Results (分页查询) page in the PingCAP documentation for batch deletion. Control the speed manually so you don’t delete too fast (coordinate with your GC configuration).
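
A hedged sketch of that paginated pattern, with hypothetical names; the boundary query follows the paginate-by-primary-key idea in that doc. Compute the key boundaries of each batch once, then delete batch by batch:

```sql
-- Compute the key range of each 50,000-row page of the rows to delete:
SELECT
  FLOOR((t.row_num - 1) / 50000) AS page_num,
  MIN(t.id) AS start_key,
  MAX(t.id) AS end_key
FROM (
  SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS row_num
  FROM big_table
  WHERE created_at < '2023-01-01'
) t
GROUP BY page_num
ORDER BY page_num;

-- Then, for each (start_key, end_key) pair, from client code:
DELETE FROM big_table
WHERE id BETWEEN /*start_key*/ 1 AND /*end_key*/ 50000
  AND created_at < '2023-01-01';
```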

| username: zhanggame1 | Original post link

This kind of SQL with billions of rows will cause OOM (Out of Memory).

| username: lemonade010 | Original post link

Learned.

| username: porpoiselxj | Original post link

This table wasn’t planned well from the beginning. If its historical data needs to be cleaned up periodically, it should be built as a partitioned table, as tidb菜鸟一只 suggested; expired data can then be dropped directly by partition.

You need to make a firm decision to refactor it, and if necessary ask the business side for a window of downtime. Once the refactoring is done, it will be much more convenient going forward.
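
For reference, a hypothetical sketch of what such a refactored table could look like (names and columns invented):

```sql
CREATE TABLE big_table_part (
    id         BIGINT       NOT NULL,
    created_at DATETIME     NOT NULL,
    payload    VARCHAR(255),
    -- the partitioning column must be part of every unique key:
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
-- Expired months are then removed with:
-- ALTER TABLE big_table_part DROP PARTITION p202401;
```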

| username: zhanggame1 | Original post link

Partitioned tables have their own issues. The partitioned tables in TiDB might become more usable in the next version. I’ve encountered many problems during my testing.