Are there any risks with TiDB's "insert into table a select * from table b"?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Is there any risk with TiDB's "insert table a select * from table b"?

| username: 大飞飞呀

How do you perform data cleaning in TiDB?
Can I directly use “insert into table a select * from table b”? Are there any good methods for handling large volumes of data?

| username: Billmay表妹 | Original post link

Refer to this.

| username: h5n1 | Original post link

You can only split the SELECT manually based on filter conditions. If batch-DML is not enabled, a large statement may cause OOM, and committing a huge batch is very slow. Since version 6.1 there is a non-transactional DML feature that splits a statement into batches of a specified size on a specified column, but for now it only supports DELETE and cannot run in parallel. Future versions are expected to cover INSERT ... SELECT as well.
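
For reference, a minimal sketch of both approaches. It assumes table b has an integer primary key `id`; the ranges, batch size, and filter are made up for illustration, and the BATCH syntax shown is the non-transactional DML form as I understand it, so check the docs for your version.

```sql
-- Manual split: one small INSERT ... SELECT per key range, driven from
-- the application side, so each transaction stays small.
INSERT INTO a SELECT * FROM b WHERE id > 0      AND id <= 100000;
INSERT INTO a SELECT * FROM b WHERE id > 100000 AND id <= 200000;
-- ... continue until MAX(id) in b is covered.

-- Non-transactional DML (TiDB >= 6.1): the statement is split into batches
-- on the `id` column automatically; currently DELETE only.
BATCH ON id LIMIT 10000
DELETE FROM b WHERE create_time < '2022-01-01';
```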

| username: TiDBer_h8KizMa7 | Original post link

Look at the specific requirements:

  1. Full synchronization: this is essentially just INSERT INTO; the main thing to get right is how to recover the last position after a failure. The actual throughput depends on the performance of the three machines involved: source → program → target. Other factors are how many threads the program runs and how many rows each thread writes per batch. More threads are not necessarily faster, and neither are bigger batches; you have to log the timings and tune it. There is nothing overly fancy about it (see the resume sketch after this list).

You can write it yourself, or use one of the many existing tools, such as CloudCanal.

  2. Incremental synchronization: subscribe via CDC → Kafka → consume the topic and parse the canal-format messages → insert into database.table. Parse the JSON, clean it according to the content, and filter or flatten the table, etc.
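
As a rough illustration of the "recover the last position" point in item 1, assuming a and b live in the same TiDB cluster (as in the original question) and b has an ordered integer primary key `id`; both are assumptions, and the bounds are placeholders:

```sql
-- Find where the previous run stopped by checking the target table.
SELECT IFNULL(MAX(id), 0) AS last_id FROM a;

-- Copy the next slice after that checkpoint; the program supplies the
-- bounds and repeats until MAX(id) in b is reached.
INSERT INTO a SELECT * FROM b WHERE id > 1000000 AND id <= 1010000;
```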

| username: Hacker007 | Original post link

Add a LIMIT at the end, but it has to be driven from application code, since TiDB does not support stored procedures.
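
A sketch of that pattern, assuming b has an ordered key `id` and the application stores the last id it copied between iterations (the bookmark value and batch size below are placeholders):

```sql
-- One pass of the application-driven loop: copy the next 10000 rows
-- after the bookmark. TiDB autocommits each statement by default, so
-- every pass is its own small transaction.
INSERT INTO a
SELECT * FROM b
WHERE id > 123456          -- bookmark: last id copied in the previous pass
ORDER BY id
LIMIT 10000;

-- The new bookmark for the next pass.
SELECT MAX(id) FROM a;
```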

| username: 托马斯滑板鞋 | Original post link

It is recommended to first export the data and then insert it in batches. The insert … select * operation is likely to process one row at a time, which is slow.

| username: wuxiangdong | Original post link

Add an update_time field to table b.
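
Presumably the idea is to pull only rows that changed since the last run. A sketch under that assumption; the column definition and watermark are illustrative, and REPLACE is used so re-copied rows overwrite what is already in a:

```sql
-- Maintain a modification timestamp on the source table.
-- Note: a needs the same column too (or list explicit columns instead of *).
ALTER TABLE b
    ADD COLUMN update_time DATETIME NOT NULL
    DEFAULT CURRENT_TIMESTAMP
    ON UPDATE CURRENT_TIMESTAMP;

-- Each cleaning run copies only rows changed since the previous watermark.
REPLACE INTO a
SELECT * FROM b
WHERE update_time > '2022-09-01 00:00:00';
```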

| username: xiaohetao | Original post link

I think we can write a loop to define how many rows to commit at a time.

| username: alfred | Original post link

TiDB does not support stored procedures. For large batch inserts like this, it is recommended to split them manually into smaller batches.

| username: tidb狂热爱好者 | Original post link

Use CDC to migrate data.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.