How to improve the speed of insert into select?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何能提高insert into select的速度?

| username: Zealot

A single table has 16GB of data and 200 million rows. I want to copy the table using the insert into select method, but the actual measured speed is only about 10,000-20,000 rows per second being written.

SET @@SESSION.autocommit = 1;
SET @@SESSION.tidb_batch_insert = 1;
SET @@SESSION.tidb_batch_delete = 1;
SET @@SESSION.tidb_dml_batch_size = 1000;
insert into table_b select * from table_a;
| username: 像风一样的男子 | Original post link

You can try non-transactional statements

| username: Zealot | Original post link

I have tried this as well, and there is not much difference. Moreover, it does not support subqueries, so many things cannot be implemented.

| username: 像风一样的男子 | Original post link

Then you can use Dumpling to export and Lightning to import, this is fast.

| username: zhanggame1 | Original post link

Is it simpler to export to CSV using Dumpling and then import it?

| username: 小龙虾爱大龙虾 | Original post link

Batch processing with concurrency, without concurrency it’s very slow, or as mentioned above, use dumpling+lightning.

| username: okenJiang | Original post link

Currently, dumpling+lightning might still be faster… In future versions, there will be an “import into select” feature, which will be even faster.

| username: zhanggame1 | Original post link

The “import into select” feature has already been updated in version 7.5, but it is still essentially Lightning.

| username: Zealot | Original post link

Eh, this is too troublesome. Currently, I’m using DataX, and the speed is okay. The principle is actually running “insert into select” in multiple concurrent segments. I’ll try using TiSpark for the insert later to see if it’s faster.

| username: 随缘天空 | Original post link

It is recommended to use the officially recommended migration tools for faster speed, or use data extraction tools, such as the DataX you mentioned. Create multiple pipeline tasks to execute in parallel (split the data according to certain conditions, and each task processes a batch of data), which will be faster.

| username: andone | Original post link

Export using dumpling to CSV or use the DataX tool.

| username: come_true | Original post link

I agree with the previous comment, using an export tool might be better.

| username: Kongdom | Original post link

:thinking: There probably won’t be much change with just SQL statements. Consider using auxiliary tools. I usually use Kettle, which supports multi-threaded batch insertion. The speed is quite fast.

| username: Zealot | Original post link

I tested TiSpark executing insert into select, and the speed is quite good. Processing and inserting 100 million rows of data only took an hour and a half, and this was under tight Spark resource conditions. If there were sufficient resources, it should be even faster.

| username: Jellybean | Original post link

The original poster must be an experienced user to know about this parameter. You can use dumpling for exporting and lightning for importing, which are very fast. These two tools can be executed using tiup, and the export and import can be completed with just two commands, making it convenient and quick.

| username: zxgaa | Original post link

It would be better to export first, then import.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.