A single table has 16 GB of data and 200 million rows. I want to copy the table using the insert into select method, but the measured write speed is only about 10,000-20,000 rows per second.
My SQL:
SET @@SESSION.autocommit = 1;              -- batch insert only takes effect with autocommit enabled
SET @@SESSION.tidb_batch_insert = 1;       -- split the large INSERT into multiple smaller transactions (atomicity is lost)
SET @@SESSION.tidb_batch_delete = 1;       -- same behavior for DELETE statements (not used by this copy)
SET @@SESSION.tidb_dml_batch_size = 1000;  -- rows per batch/transaction
insert into table_b select * from table_a;
Eh, this is too much trouble. For now I'm using DataX, and the speed is okay. Under the hood it just runs "insert into select" over multiple segments concurrently. I'll try TiSpark for the insert later to see if it's faster.
For better speed, use the officially recommended migration tools, or a data extraction tool such as the DataX you mentioned: create multiple pipeline tasks that run in parallel, splitting the data by some condition so that each task handles one slice. That will be much faster.
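As a rough sketch of what each parallel task ends up running (assuming table_a has an integer primary key id; the column name and range boundaries here are made up for illustration), it is one ranged copy per session:

-- run each statement in its own connection so the four copies execute concurrently
insert into table_b select * from table_a where id >= 0         and id < 50000000;
insert into table_b select * from table_a where id >= 50000000  and id < 100000000;
insert into table_b select * from table_a where id >= 100000000 and id < 150000000;
insert into table_b select * from table_a where id >= 150000000;

DataX does essentially the same splitting for you when the reader is given a split key and the job is configured with multiple channels.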
Tuning the SQL alone probably won't change much; consider an auxiliary tool. I usually use Kettle, which supports multi-threaded batch inserts, and the speed is quite good.
I tested running insert into select through TiSpark, and the speed is quite good: processing and inserting 100 million rows took only an hour and a half, and that was with tight Spark resources. With sufficient resources it should be even faster.
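Roughly speaking, once TiSpark is set up the copy is just a Spark SQL statement; a minimal sketch, assuming a TiSpark version that allows writes through Spark SQL and a catalog registered as tidb_catalog with the tables in database test (both names are assumptions, adjust to your deployment):

-- run via spark-sql or spark.sql(...) in a Spark job with TiSpark enabled
insert into tidb_catalog.test.table_b
select * from tidb_catalog.test.table_a;

Spark then handles splitting the scan and the write across executors, which is where the parallelism comes from.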
The original poster must be an experienced user to know about those parameters. You can use dumpling for exporting and lightning for importing, which are very fast. Both tools can be run through tiup, and the export plus import take just two commands, which is convenient and quick.
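For reference, the two commands look roughly like this (a sketch; host, port, paths, and the config file name are placeholders, and lightning needs a small TOML config pointing data-source-dir at the export directory and at the target cluster):

# export the source table with dumpling: 8 threads, SQL files, ~256 MiB per file
tiup dumpling -h 127.0.0.1 -P 4000 -u root -t 8 -F 256MiB --filetype sql -T test.table_a -o /data/export

# import with lightning, driven entirely by tidb-lightning.toml
tiup tidb-lightning -config tidb-lightning.toml

Since this thread is about copying into table_b, note that lightning imports into the table named in the exported files, so one common trick is to rename the exported test.table_a schema and data files to test.table_b (or point the import at a different database) before running the second command.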