Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb 的insert table a selct * from table b 有没有风险
How to perform data cleaning with TiDB?
Can I directly use “insert into table a select * from table b”? Are there any good methods for handling large volumes of data?
You can only manually split the select based on conditions. If batch-DML is not enabled, it may cause OOM, and the submission speed for large batches is very slow. After version 6.1, there is a non-transactional DML feature that splits batches based on the specified size of the field, but currently, it only supports delete and cannot run in parallel. It is estimated that future versions should be able to solve Insert and select issues.
Look at the specific requirements:
- Full synchronization, this is nothing more than insert into, the main consideration is how to normally find the last position after an error. The specific efficiency depends on the performance of the three machines: source end → program → target end. Other considerations include how many threads in the program, how many batches of data per thread. More threads do not necessarily mean faster, nor does more data per batch. You need to print the time and debug it. There are no overly fancy methods.
You can write it yourself, or there are many program tools, such as cloudcanal.
- Incremental synchronization, subscribe to cdc → kafka → subscribe to topic and parse canal → insert into database.table. Parse JSON, clean according to content, filter or flatten the table, etc.
Add “limit” at the end, but it needs to be done in conjunction with the code, as TiDB does not support stored procedures.
It is recommended to first export the data and then insert it in batches. The insert … select * operation is likely to process one row at a time, which is slow.
Add an update_time field to table b.
I think we can write a loop to define how many rows to commit at a time.
It does not support stored procedures. For handling such large batch inserts, it is recommended to manually split them into smaller batch inserts.
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.