[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] Using `LOAD DATA` to batch-load CSV files, or using `BATCH ON id LIMIT 10000 INSERT INTO t2 SELECT * FROM t1` for the import
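For reference, the non-transactional DML statement in question, written out (a sketch: `t1`, `t2`, and the batch size are from this post; `DRY RUN` is TiDB's optional clause for previewing how the job would be split without executing it):

```sql
-- Split the copy into auto-committed batches of 10000 rows,
-- sharded on the id column, instead of one large transaction.
BATCH ON id LIMIT 10000 INSERT INTO t2 SELECT * FROM t1;

-- Preview how TiDB would split the job, without running it.
BATCH ON id LIMIT 10000 DRY RUN INSERT INTO t2 SELECT * FROM t1;
```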
[Encountered Issues: Problem Phenomenon and Impact] Data import is relatively slow: in the same scenario, GP is about 10 times faster than TiDB. Is there room for optimization?
Additionally, the monitoring screenshot shows the CPU and memory usage of the TiKV instances, with significant differences in usage rates between them (deployed on the same node with the same IP). Why does this happen?
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
Refer to the official documentation: TiDB Lightning Configuration Parameters | PingCAP Docs. There are quite a few CSV-format options that can be adjusted, and most CSV variants can be accommodated.
Is TiDB Lightning using local mode? (In theory, that method is the fastest.)
Pre-splitting Regions can avoid the overhead of automatic Region splitting during the import and save time.
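Pre-splitting can be done with TiDB's `SPLIT TABLE` statement; a sketch (the table name, id range, and Region count below are illustrative, not from the post):

```sql
-- Pre-split the target table into 16 Regions across the expected
-- id range before the import, so writes don't wait on automatic
-- Region splits. Adjust the bounds and Region count to your data.
SPLIT TABLE t2 BETWEEN (0) AND (100000000) REGIONS 16;
```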
P.S.: If it’s slow, you can split the CSV into multiple files and import them concurrently, as long as the filename prefixes are the same. Refer to: TiDB Lightning Quick Start | PingCAP Docs
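As a sketch of the splitting step (filenames here are hypothetical; Lightning matches source files by a `<db>.<table>.<seq>.csv` naming pattern, so keep the `db.table` prefix identical across the chunks):

```shell
# Create a small stand-in for a large CSV (10 data rows).
seq 1 10 > mydb.t1.csv

# Split into 5-line chunks named mydb.t1.00.csv, mydb.t1.01.csv, ...
# (GNU split). The shared prefix lets Lightning treat them as one
# table and import the chunks concurrently.
split -d -l 5 --additional-suffix=.csv mydb.t1.csv mydb.t1.

ls mydb.t1.0*.csv
```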
That shouldn’t be the case. Are you not using the mode that writes directly to TiKV? See the documentation: Physical Import Mode | PingCAP Docs. In my experience, import speed is generally proportional to the number of CPU cores on the tidb-lightning machine.
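A minimal `tidb-lightning.toml` sketch for the physical (local) import mode, assuming a default-port cluster (all paths and addresses below are placeholders; see the Physical Import Mode doc for the full option list):

```toml
[tikv-importer]
# "local" = physical import mode: Lightning sorts KV pairs locally
# and ingests SST files into TiKV directly, bypassing the SQL layer.
backend = "local"
# Scratch space for sorting; needs a fast disk with enough capacity.
sorted-kv-dir = "/data/lightning-sorted-kv"

[mydumper]
# Directory containing the CSV / SQL source files.
data-source-dir = "/data/export"

[tidb]
host = "127.0.0.1"
port = 4000
status-port = 10080
pd-addr = "127.0.0.1:2379"
```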
What is the configuration of the machine running your tidb-lightning? Local mode has relatively high hardware requirements for the Lightning machine; generally, the better the hardware, the faster the import.
Local mode is what you meant by direct writing, right? From what I understand, Lightning essentially also loads in batches. As it happens, I measured roughly the same results with both methods.
What is your current import speed? It should appear in the Lightning logs. What is the theoretical peak of your storage IO, and what speed do you expect? If nothing else works, split the CSV file into two parts with the same name prefix, then import with Lightning with concurrency set to 2 and see whether the speed improves.
Currently it’s about 20,000 rows/s, and IO usage is only at 20%. Concurrency might help; I’ll verify later. How can we optimize `BATCH ON id LIMIT 10000 INSERT INTO t2 SELECT * FROM t1`?
I suggest exporting everything and then importing it into the new table with Lightning; that’s the fastest way.
I’ve seen the `INSERT INTO t2 SELECT *` approach be very slow in other databases too. The likely reason is that it processes data row by row (similar to a cursor). I wonder whether TiDB is the same? (Our workaround there was also to batch-export to local files and then re-import.)
Batch DML is actually quite inefficient; its main purpose is to keep large transactions from exceeding TiDB’s default memory limit. It’s not even as fast as splitting the id range into chunks yourself and running several `INSERT ... SELECT` statements.
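As a sketch of that manual-chunking alternative (the id boundaries below are placeholders; each statement commits as its own transaction, and the ranges can be run from several sessions in parallel):

```sql
-- Run these from separate sessions concurrently; each chunk stays
-- well under TiDB's per-transaction memory limit.
INSERT INTO t2 SELECT * FROM t1 WHERE id >       0 AND id <=  500000;
INSERT INTO t2 SELECT * FROM t1 WHERE id >  500000 AND id <= 1000000;
INSERT INTO t2 SELECT * FROM t1 WHERE id > 1000000 AND id <= 1500000;
```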
A single-threaded import speed of 30,000 rows per second is already quite high; mine is usually around 20,000 rows per second.
TiDB needs higher concurrency to go faster. Alternatively, Lightning’s local mode is much faster. If you can see the SQL statements in TiDB’s processlist, then you’re not using local mode.