Deleting Archived Historical Data in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb归档历史数据删除 (deleting archived historical data in TiDB)

| username: 海水孤独

【TiDB Usage Environment】Production Environment
【TiDB Version】7.5.1
【Encountered Issue: Phenomenon and Impact】Several large tables are severely impacting read and write performance on the core primary database. We need to retain only the most recent three months of data and archive the historical data elsewhere, which raises the question of how to delete the historical data from the primary database after archiving. Does anyone have a complete solution for archiving data and bulk-deleting the historical rows from the primary database?

| username: 这里介绍不了我 | Original post link

How much data is there from more than three months ago?

| username: jetora | Original post link

For scenarios requiring the cleanup of historical data:

  1. If you can control the table structure design in the early stages, it is best to use partitioned tables. Partition by time, and after archiving, you can quickly clean up historical data.
  2. If you cannot influence the table design stage and you want to delete data quickly, consider finding the minimum and maximum key values of the rows to be deleted, then generating SQL that deletes and transfers them in batches, key range by key range (see the sketch right after this list).
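A minimal sketch of the second approach, assuming a table named orders with an integer primary key id and a create_time column (both names are made up for illustration):

-- find the key boundaries of the rows to be deleted
SELECT MIN(id), MAX(id) FROM orders
WHERE create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH);

-- then walk that key range in fixed-size chunks, one small transaction each
DELETE FROM orders
WHERE id >= 1 AND id < 50001
  AND create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH);
-- ...advance the id window by 50000 each time until MAX(id) is covered
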
| username: tidb菜鸟一只 | Original post link

Use partitioned tables, and directly clean up partitions after archiving.
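Once a month of data has been archived, dropping its partition is a single DDL statement, for example (table and partition names are hypothetical):

-- metadata-only operation; the space is reclaimed later by TiKV GC
ALTER TABLE orders DROP PARTITION p202401;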

| username: Defined2014 | Original post link

You can use partition tables or TTL.

Dropping partitions puts less pressure on the cluster, because the data is reclaimed directly by garbage collection on the TiKV side.

With TTL, expired rows are deleted in the background at the row level without human intervention, and it can be enabled directly on the original table.
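As a sketch, TTL can be declared when a table is created; the table and column names here are illustrative:

-- rows whose create_time is more than 3 months old become eligible
-- for deletion by the background TTL jobs
CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    create_time DATETIME
) TTL = `create_time` + INTERVAL 3 MONTH;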

| username: zhaokede | Original post link

Partition by month
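For example, a monthly RANGE-partitioned table might look like this (names are illustrative; note that the partitioning column must be part of every unique key, hence the composite primary key):

CREATE TABLE orders (
    id BIGINT,
    create_time DATETIME,
    PRIMARY KEY (id, create_time)
)
PARTITION BY RANGE (TO_DAYS(create_time)) (
    PARTITION p202404 VALUES LESS THAN (TO_DAYS('2024-05-01')),
    PARTITION p202405 VALUES LESS THAN (TO_DAYS('2024-06-01')),
    PARTITION p202406 VALUES LESS THAN (TO_DAYS('2024-07-01')),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
);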

| username: 海水孤独 | Original post link

The historical data has not been partitioned, and there are about five to six hundred million records. I want to archive them to Databend. Can I use Dumpling for archiving? What is the appropriate way to delete historical data after archiving?

| username: 海水孤独 | Original post link

Since it is a core business database, we are afraid of affecting the business. Do we need to write a script for this? What would be the general logic of the script, or do you have any scripts that we can refer to?

| username: 海水孤独 | Original post link

Should the TTL be added when creating a new table, or can it be added to an existing table that is already online?

| username: Defined2014 | Original post link

TTL can be added with ALTER TABLE; see the PingCAP docs: Periodically Delete Expired Data Using TTL (Time to Live) | PingCAP Documentation Center.

You can also use ALTER TABLE t PARTITION BY ... to convert it directly into a partitioned table, but I think that carries some risk, so use it cautiously.
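For an existing table that is already online, enabling TTL is a single ALTER, along the lines of (table and column names assumed):

-- expired rows are then removed gradually by background TTL jobs
ALTER TABLE orders
    TTL = `create_time` + INTERVAL 3 MONTH
    TTL_ENABLE = 'ON';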

| username: 这里介绍不了我 | Original post link

Of course you can use Dumpling. Write a script that backs up the data older than three months on a daily or monthly basis, and then have the script delete it in batches afterward, as long as that doesn't affect the online business.
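The delete side of such a script can stay simple: issue bounded statements over one narrow time slice at a time, repeating each until it affects zero rows, then move to the next slice (table and column names assumed):

-- purge one day of already-archived data in capped chunks
DELETE FROM orders
WHERE create_time >= '2024-01-01' AND create_time < '2024-01-02'
LIMIT 10000;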

| username: 海水孤独 | Original post link

Have you written similar scripts in Python? If so, could you share one for reference?

| username: 我是吉米哥 | Original post link

I generally have three solutions for MySQL:

  1. Use Percona Toolkit's pt-archiver command:

pt-archiver --source h=10.11.1.1,P=3306,u=root,p='xxx',D=dbname,t=x_trx_log --purge --charset=latin1 --where "create_time <= '2018-12-31'" --progress 1000 --limit 1000 --txn-size=1000 --bulk-delete --statistics --dry-run

  2. Delete data in batches, splitting one large transaction into many small ones (note that TiDB itself does not support stored procedures, so there this loop would have to live in an external script):

set innodb_lock_wait_timeout=60;
/* define a procedure that loops, deleting 10000 rows at a time */
DELIMITER $$
CREATE PROCEDURE DeleteActions()
BEGIN
    DECLARE counter INT DEFAULT 1;
    REPEAT
        DELETE FROM actions_action WHERE action NOT IN ('like', 'subscribe') LIMIT 10000;
        COMMIT;
        SET counter = counter + 1;
        DO SLEEP(2); /* pause between batches to limit pressure */
    UNTIL counter >= 200
    END REPEAT;
END$$
DELIMITER ;
/* call the procedure */
CALL DeleteActions();

  3. Equivalent SQL rewriting: create a temporary table, write the data to be retained into it, swap the names of the temporary table and the original table, and finally drop the old (renamed) table. Step by step:

    1. Check the data volume, i.e. how many records there are per month:

select count(*) from next_eip.t__exec_log where fcreated_time >= DATE_FORMAT(DATE_SUB(now(),interval 3 month),'%Y-%m-%d');
select date_format(fcreated_time,'%Y-%m'),count(*) from next_eip.t__exec_log group by date_format(fcreated_time,'%Y-%m') order by 1 desc;

    2. Create a temporary table:

CREATE TABLE IF NOT EXISTS t__exec_log_tmp like t__exec_log;

    3. Add an exclusive write lock to both tables to freeze data changes:

lock table t__exec_log write, t__exec_log_tmp write;

    4. Write the most recent month of data, which is to be retained, into the temporary table:

insert into t__exec_log_tmp select * from t__exec_log where fcreated_time >= date_format(date_sub(now(),interval 1 month),'%Y-%m-%d');

    5. Swap table names: rename the original table to a backup name and the temporary table to the original name:

rename table t__exec_log to t__exec_log_bak, t__exec_log_tmp to t__exec_log;

    6. Release the table locks:

unlock tables;

| username: forever | Original post link

You can use either shell or Python.

| username: zhanggame1 | Original post link

It is not recommended to use partitioned tables. For deletion, use DELETE FROM ... WHERE <primary key> BETWEEN ... AND ..., split into multiple SQL statements, each deleting fewer than 100,000 rows (see the sketch below).
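A sketch of that pattern, plus a built-in alternative: since v6.1 TiDB can split such a delete into chunks by itself with a non-transactional BATCH statement (table and column names assumed):

-- manual key-range chunks, each well under 100,000 rows
DELETE FROM orders
WHERE id BETWEEN 1 AND 100000
  AND create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH);

-- or let TiDB do the splitting (non-transactional DML; must run with
-- autocommit, and it is not atomic as a whole)
BATCH ON id LIMIT 50000
DELETE FROM orders
WHERE create_time < DATE_SUB(NOW(), INTERVAL 3 MONTH);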

| username: 海水孤独 | Original post link

I plan to write a Python script, and I want to include some system performance monitoring in it. Which metrics should be monitored, and should I only run it when the servers are under low load? Or do you have any scripts along these lines that I could refer to? I need to be cautious in a production environment.
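One way for such a script to gauge cluster pressure between batches without leaving SQL is to poll the INFORMATION_SCHEMA.CLUSTER_LOAD table (a sketch; in practice you would also watch the TiDB/TiKV duration and CPU panels in Grafana):

-- 1-minute load average per instance; the script can sleep longer,
-- or pause entirely, when this climbs
SELECT INSTANCE, VALUE
FROM INFORMATION_SCHEMA.CLUSTER_LOAD
WHERE DEVICE_TYPE = 'cpu' AND NAME = 'load1';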

| username: vincentLi | Original post link

This feature sounds great. Besides cleaning up data older than three months, it could also be used for other applications, for example resetting passwords for accounts that have been inactive for a certain period. Which version introduced this feature?