How to Safely and Efficiently Delete Large Amounts of Data

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何安全高效删除大量数据

| username: 大飞飞呀

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.0.4
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
How can I delete a large amount of data without affecting online business?

DELETE FROM stat.table1
WHERE uid NOT IN (
    subquery (returns more than 8000 rows)
)

More than 70 million rows need to be deleted in total. For historical reasons this data is no longer needed; once it is deleted the table becomes much smaller and queries get faster.
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]

| username: xfworld | Original post link

  1. Make sure the DELETE can actually use an index

  2. Delete the data in batches to keep transactions small (see the sketch below)

  3. Use partitions where possible so old data can be located and dropped quickly (e.g., ALTER TABLE … TRUNCATE PARTITION)

Please refer to this.
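
For point 2, a minimal sketch in plain SQL, assuming uid is indexed and with keep_list as a hypothetical stand-in for the subquery's source:

-- Repeat until the statement reports 0 rows affected;
-- each run deletes at most 10000 rows in one small transaction.
DELETE FROM stat.table1
WHERE uid NOT IN (SELECT uid FROM keep_list)
LIMIT 10000;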

| username: 大飞飞呀 | Original post link

The problem is that a DELETE with “not in” still can’t use the index. Isn’t that frustrating?
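
A standard rewrite (not from the thread) is to express the NOT IN as NOT EXISTS, which optimizers can usually turn into an anti-join; if uid is non-nullable the two forms are equivalent, and keep_list is again a hypothetical stand-in:

DELETE FROM stat.table1
WHERE NOT EXISTS (
    SELECT 1 FROM keep_list k WHERE k.uid = stat.table1.uid
)
LIMIT 10000;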

| username: Billmay表妹 | Original post link

@大飞飞jeffery Why do you need to delete a large amount of data? What are the underlying reasons and use cases? What problem are you trying to solve, and what are your expectations? Please provide more details.

| username: tidb狂热爱好者 | Original post link

I feel like your requirement is select into and rename table new to old.

| username: 大飞飞呀 | Original post link

No, it will affect online business.

| username: tidb狂热爱好者 | Original post link

I feel your requirement is:

select into new table
rename table old to oldbackup
rename table new to old

The renames are at the millisecond level, unless you have continuous writes.

| username: Kongdom | Original post link

According to your description, the table becomes much smaller after the deletion, which means only a small amount of data actually needs to be kept. You can follow the suggestion above: first create a copy of the table, then insert the required data into the copy. At that point it is an “in” operation, right? Then rename the original table to a backup table and rename the copy to the original table’s name.
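
A sketch of that copy-and-swap flow; table1_new, table1_backup, and keep_list are hypothetical names, and it should run in a low-traffic window since writes landing between the copy and the rename would be lost:

CREATE TABLE stat.table1_new LIKE stat.table1;

-- Copy only the rows to keep; the filter is now an IN, not a NOT IN.
INSERT INTO stat.table1_new
SELECT * FROM stat.table1
WHERE uid IN (SELECT uid FROM keep_list);

RENAME TABLE stat.table1 TO stat.table1_backup;
RENAME TABLE stat.table1_new TO stat.table1;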

| username: ljluestc | Original post link

If not executed properly, deleting a large amount of data can impact online business. You can minimize the impact by first collecting the uids that have to go, then deleting them in small batches across several worker threads:

import threading
import queue
import pymysql

BATCH_SIZE = 10000
NUM_THREADS = 10
DB_HOST = 'localhost'
DB_PORT = 4000
DB_USER = 'user'
DB_PASSWORD = 'password'
DB_NAME = 'database'

# Select the uids to delete: rows whose uid is NOT in the keep list.
select_query = """
SELECT uid FROM stat.Table1
WHERE uid NOT IN (SELECT uid FROM stat.Table2)
"""
# Each batch is deleted with a plain IN list, so every transaction stays small.
delete_query = "DELETE FROM stat.Table1 WHERE uid IN ({in_values})"

delete_queue = queue.Queue()

def fetch_data_to_delete():
    """Fetch the uids to delete and push them onto the queue in batches."""
    conn = pymysql.connect(host=DB_HOST, port=DB_PORT, user=DB_USER,
                           password=DB_PASSWORD, db=DB_NAME)
    # For a very large result set, consider a server-side cursor
    # (pymysql.cursors.SSCursor) instead of fetchall().
    with conn.cursor() as cursor:
        cursor.execute(select_query)
        rows = cursor.fetchall()
    conn.close()
    batch = []
    for row in rows:
        batch.append(row[0])
        if len(batch) == BATCH_SIZE:
            delete_queue.put(batch)
            batch = []
    if batch:
        delete_queue.put(batch)

def delete_data():
    """Consume batches from the queue; each batch is one small transaction."""
    conn = pymysql.connect(host=DB_HOST, port=DB_PORT, user=DB_USER,
                           password=DB_PASSWORD, db=DB_NAME)
    while True:
        data = delete_queue.get()
        if data is None:  # Sentinel: no more batches.
            break
        in_values = ','.join(str(uid) for uid in data)  # assumes numeric uids
        with conn.cursor() as cursor:
            cursor.execute(delete_query.format(in_values=in_values))
        conn.commit()
    conn.close()

def main():
    fetch_data_to_delete()
    threads = [threading.Thread(target=delete_data) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    # One sentinel per worker, queued behind the real batches,
    # so every thread can exit.
    for _ in range(NUM_THREADS):
        delete_queue.put(None)
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
| username: dba-kit | Original post link

This method +1. In addition, some 5.x version of TiDB added support for the syntax rename table t1 to t1_backup, t1_new to t1, but I forgot the exact version; you can try whether it works in 5.0.
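
If your version does support it, the two renames collapse into a single atomic statement, which avoids the brief window where stat.table1 does not exist (names follow the sketch above):

RENAME TABLE stat.table1 TO stat.table1_backup, stat.table1_new TO stat.table1;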

| username: 张雨齐0720 | Original post link

The remaining data volume is relatively small, so do this during an early-morning window: take a backup, insert the rows you want to keep into a new table, rename the old table to a backup name, and finally rename the new table to the name used by the online business.

| username: tidb菜鸟一只 | Original post link

Even if you only needed to delete one-tenth of the data, I would recommend this process:
Stop the business at midnight → rename the original table to original_table_old → recreate the original table → insert into the original table select * from original_table_old where <the rows are still needed>
All the more so when ninety-nine percent of the data has to go…

| username: 裤衩儿飞上天 | Original post link

Saving time and effort.

| username: ealam_小羽 | Original post link

It should be possible. I often do this with version 4.0.

| username: Kongdom | Original post link

He is probably referring to the syntax for renaming multiple tables in one operation, right?

| username: 考试没答案 | Original post link

Regarding the issue mentioned by the original poster, I have a question. In other relational databases, lowering a table’s high-water mark works like this: create a new table, back up the original table, and load the data that is still needed from the original table into the new one.

Oracle can even do this online.

In TiDB, after deleting a large amount of data there will be many empty Regions. Will running Region merge mitigate the high-water-mark issue?

| username: 考试没答案 | Original post link

Or is there an online scaling-down feature? Will it be developed in the future?

| username: Kongdom | Original post link

Online scaling down? That shouldn’t mean scaling down nodes, right? I suggest starting a new thread for this question so that more people see and respond to it.

| username: buptzhoutian | Original post link

The pt-archiver from percona-toolkit should be able to solve this kind of problem for you.
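
A rough sketch of how pt-archiver could be pointed at this job; the connection values, batch size, and keep_list are placeholders, so verify the tool’s behavior against TiDB on a test table first:

# Purge rows in batches of 1000, committing after each batch.
pt-archiver \
  --source h=127.0.0.1,P=4000,u=user,p=password,D=stat,t=table1 \
  --purge \
  --where "uid NOT IN (SELECT uid FROM keep_list)" \
  --limit 1000 --commit-each \
  --progress 10000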

| username: 考试没答案 | Original post link

Did you take it to mean that Oracle can shrink nodes online? Oracle uses row movement within the table to achieve high-water-mark shrinkage at the table level.
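
For reference, the table-level shrink being described corresponds to statements like these in Oracle (table name hypothetical):

-- Allow Oracle to relocate rows, then compact the segment
-- and lower the high-water mark.
ALTER TABLE t ENABLE ROW MOVEMENT;
ALTER TABLE t SHRINK SPACE;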