Efficiency of Data Import for Large Tables Using auto_random as Primary Key

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: auto_random作为主键的大表数据导入效率 (data import efficiency for large tables with auto_random primary keys)

| username: vincentLi

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] There are several tables in production, each with 300-500 million rows of data, which need to be migrated to tables with auto_random primary keys in TiDB. According to the heatmap, there are indeed no hot spots, but inserting with 4 threads takes 50-60 seconds for every 20,000 rows of data, which seems inefficient.
[Encountered Problem: Problem Phenomenon and Impact]
Is there room to improve the insertion efficiency? Each table holds about 200GB of data (Oracle space), which translates to roughly 30GB in TiDB. At 96MB per region, that works out to roughly 300 regions per table. Can tables with auto_random primary keys pre-allocate regions? If so, wouldn't that save the time spent splitting?

| username: TiDBer_jYQINSnf | Original post link

Take a look at these two:

| username: FutureDB | Original post link

  1. auto_random is designed so that a Region splits into multiple Regions once it reaches the split threshold, which avoids sequential writes to a single Region during large data imports and reduces hotspots. Initially, however, before any split happens, there is still a hotspot because all data is written to a single Region. This can be avoided by pre-splitting several Regions before the import.
  2. Regarding the slow import: are you using Lightning for the import? If so, TiDB mode or Local mode?
| username: ziptoam | Original post link

Check and adjust the relevant TiDB, PD, and TiKV parameters, such as raftstore.region-split-check-diff, coprocessor.region-split-size, and tidb_distsql_scan_concurrency, to suit large-scale data import.

| username: vincentLi | Original post link

Oh, thank you. In large companies, development and operations are generally separated, and I am on the development side. Is there a way to adjust parameters at the session level to improve import efficiency? We usually don't dare to touch cluster-level parameters; if our understanding falls short, it could affect other users or applications, which would be troublesome.

| username: zhanggame1 | Original post link

How did you import it? I used Navicat data transfer with multiple windows open, and it can run tens of thousands per second.

| username: 小龙虾爱大龙虾 | Original post link

You can pre-allocate regions to distribute the load evenly across TiKV nodes and avoid hotspots during writes. See: Split Region usage documentation | PingCAP Docs

| username: forever | Original post link

What method did you use to import it?

| username: ziptoam | Original post link

I really don’t understand this, let’s see what other experts say.

| username: 友利奈绪 | Original post link

In TiDB, using AUTO_RANDOM primary keys can effectively avoid hotspot issues, but the insertion efficiency problem you mentioned may involve multiple aspects, including parameter configuration, concurrent insertion strategies, and hardware resource utilization. Here are some suggestions and methods to improve insertion efficiency:

1. Pre-allocate Regions

Although TiDB automatically splits and schedules Regions, manually splitting Regions in advance can reduce the overhead during runtime in scenarios with large data inserts. You can use the SPLIT REGION command to pre-split Regions.

For example, for a table my_table, you can execute the following command:

-- Assuming the AUTO_RANDOM column is the primary key. Because AUTO_RANDOM
-- randomizes the high bits, values are scattered across the whole BIGINT
-- range, so split evenly over that range rather than over row counts:
SPLIT TABLE my_table BETWEEN (0) AND (9223372036854775807) REGIONS 16;

This pre-splits the table into 16 evenly spaced Regions over the primary-key range, reducing the splitting overhead during the insertion process.
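If you want to choose split points yourself instead of letting TiDB space them, the arithmetic is simple: divide the signed-BIGINT range that AUTO_RANDOM values are scattered over into equal intervals. A minimal sketch (the helper names are hypothetical; `my_table` is from the example above):

```python
# Sketch: compute evenly spaced split points over the signed-BIGINT range
# that AUTO_RANDOM values are scattered across, and build a SPLIT statement.
# Helper names are hypothetical, not a TiDB API.

MAX_BIGINT = (1 << 63) - 1  # AUTO_RANDOM keeps the sign bit 0 by default

def auto_random_split_points(regions):
    """Return regions-1 evenly spaced interior split points."""
    step = MAX_BIGINT // regions
    return [i * step for i in range(1, regions)]

def split_table_by_sql(table, regions):
    """Build a SPLIT TABLE ... BY statement from the computed points."""
    points = ", ".join(f"({p})" for p in auto_random_split_points(regions))
    return f"SPLIT TABLE {table} BY {points};"

print(split_table_by_sql("my_table", 4))
```

The generated statement can then be run once against the cluster before the import begins.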

2. Increase Concurrency

Four threads of concurrency may not be sufficient to fully utilize cluster resources. You can try increasing the concurrency, such as using more client connections and threads for insertion operations. Ensure that the tasks of each thread are evenly distributed to avoid a single thread becoming a bottleneck.
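As a rough sketch of the fan-out described above, assuming each worker holds its own database connection and runs one batched INSERT per unit of work (the `insert_fn` signature here is a hypothetical stand-in for that call):

```python
# Sketch: fan batches out to a pool of worker threads. In real use each
# worker would hold its own connection and execute one multi-row INSERT
# per batch; insert_fn stands in for that call here.
from concurrent.futures import ThreadPoolExecutor

def parallel_import(batches, insert_fn, workers=8):
    """Run insert_fn over every batch with `workers` threads;
    return the total row count reported by the workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(insert_fn, batches))
```

Increase `workers` gradually while watching cluster CPU and disk I/O; past the saturation point, more threads only add contention.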

3. Use Batch Inserts

Batch inserts can significantly improve insertion efficiency. Try to insert larger batches of data at once rather than inserting one row at a time. For example, adjust the amount of data inserted each time to 1000 rows or more, depending on your network and database configuration.
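The shape of such a batched statement can be sketched as follows; in production you would use a parameterized `executemany` through your driver rather than string building, so this is only an illustration (names are hypothetical):

```python
# Sketch: build one multi-row INSERT from a batch of rows. Illustrative
# only; prefer parameterized queries in real code to avoid injection
# and quoting issues.

def build_batch_insert(table, columns, rows):
    """Return a single INSERT covering all rows in the batch."""
    cols = ", ".join(columns)
    values = ", ".join(
        "(" + ", ".join(str(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({cols}) VALUES {values};"

print(build_batch_insert("t", ["a", "b"], [(1, 2), (3, 4)]))
# → INSERT INTO t (a, b) VALUES (1, 2), (3, 4);
```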

4. Adjust TiDB Configuration

  • TiDB Configuration: Adjust relevant TiDB configuration parameters, such as txn-total-size-limit (under the [performance] section of tidb.toml), to ensure it is large enough for your batch sizes.
txn-total-size-limit = 104857600  # 100MB
  • TiKV Configuration: Adjust relevant TiKV configuration parameters, such as setting raftstore.sync-log to false to improve write performance, at the cost of durability if a machine crashes. Note that sync-log was removed in TiKV v5.0 and later, where Raft log syncing is always on.
sync-log = false

5. Hardware Resource Check

Ensure that your cluster hardware resources (CPU, memory, disk I/O, etc.) are not becoming bottlenecks. If a node’s resources are exhausted, it will lead to overall performance degradation. You can use monitoring tools like Prometheus + Grafana to monitor the cluster status.

6. Network Optimization

If your insertion operations are performed over the network, ensure that network bandwidth and latency are not becoming bottlenecks. Optimize network configuration to reduce latency and improve bandwidth utilization.

7. Reasonable Index Design

Ensure that your table has only necessary indexes, as each index increases the insertion overhead. Avoid excessive or complex index designs.

8. Transaction Size

Control the size and number of records in transactions to avoid long lock wait times caused by overly large transactions. You can gradually adjust the number of records per transaction to find the optimal insertion efficiency.
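Keeping transactions bounded usually comes down to slicing the row stream into fixed-size chunks and committing once per chunk. A minimal sketch (the helper name is hypothetical):

```python
# Sketch: cap transaction size by slicing the row list into fixed-size
# chunks; the caller commits once per chunk. Tune batch_size to trade
# commit overhead against lock hold time and memory use.

def chunked(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]
```

Starting around a few thousand rows per transaction and adjusting from measurement is a common approach.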

9. Example of Optimized Parameters

Here are some example configurations of key parameters that can be adjusted to improve insertion performance:

# tidb.toml
[log]
level = "warn"

[performance]
max-txn-ttl = 3600000
txn-total-size-limit = 104857600

# tikv.toml
[raftstore]
sync-log = false  # removed in TiKV v5.0+
apply-pool-size = 4
store-pool-size = 4

[rocksdb]
max-sub-compactions = 3
| username: vincentLi | Original post link

First of all, excellent answer. Could you please elaborate on how to use Grafana to determine whether there is still headroom to start additional threads for concurrent import?