Is it a configuration issue that Lightning takes too long to import files into TiDB? Any better suggestions?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Lightning导入文件到tidb耗时太久,是配置问题吗?有更好的建议吗?

| username: TiDBer_ZHcgATCp

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.1.0
[Reproduction Path] What operations were performed to encounter the problem
[Encountered Problem: Problem Phenomenon and Impact] Importing a file of about 128GB from HDFS to TiDB took more than 6 hours. Seeking advice from experts on where the issue might be.
[Resource Configuration]
TiKV: 8 nodes (16 vCore, 64 GB)
PD: 2 nodes (4 vCore, 16 GB)
TiDB: 8 nodes (16 vCore, 64 GB)



[Attachments: Screenshots/Logs/Monitoring]

Configuration Parameters:

[lightning]
check-requirements = true

#index-concurrency = 4
#table-concurrency = 8
#region-concurrency = 32

level = "info"
file = "/home/hive/data/cdp_lightning_logs"

max-size = 256 # log file size in MB
max-days = 28
#io-concurrency = 5

max-error = 0
meta-schema-name = "lightning_metadata"

[tikv-importer]
backend = "local"
incremental-import = true
sorted-kv-dir = "/home/hive/data/cdp_lightning_kv"
#range-concurrency = 16
#send-kv-pairs = 98304 #32768
on-duplicate = "replace"
duplicate-resolution = "remove"
compress-kv-pairs = "gz"

[mydumper]
#read-block-size = "256MiB" # default value
no-schema = true

# Value range: 0 <= batch-import-ratio < 1
batch-import-ratio = 0.75

data-source-dir = "/home/hive/data/cdp_lightning_data"
character-set = "auto"

data-character-set = "binary"
data-invalid-char-replace = "\uFFFD"

strict-format = true
max-region-size = "256MiB" # default value

[checkpoint]
enable = true

[post-restore]
checksum = "false"
analyze = "false"

[cron]
# Interval at which TiDB Lightning automatically switches TiKV to import mode
switch-mode = "5m"
# Interval at which import progress is printed in the log
log-progress = "5m"

| username: Fly-bird | Original post link

Is the amount of data you imported large?

| username: Daniel-W | Original post link

You can check the monitoring of various resources during the import to see where the bottleneck is.

| username: TiDBer_ZHcgATCp | Original post link

The file size is around 128GB, with approximately 4.5 billion rows.

| username: TiDBer_ZHcgATCp | Original post link

Looks pretty good.

| username: TiDBer_卑微打工rer | Original post link

Is the Parquet file around 128 GB, with roughly 1 TB of KV data?

| username: TiDBer_ZHcgATCp | Original post link

Yes, it’s a Parquet file, and after conversion the KV data is about 1 TB.
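A back-of-the-envelope check on those numbers (assuming roughly 1 TB of sorted KV written over the ~6-hour import, and ignoring the extra read-back IO Lightning does while sorting):

```python
# Back-of-the-envelope throughput implied by the numbers in this thread:
# ~1 TiB of sorted KV data produced in about 6 hours.
kv_bytes = 1 * 1024**4          # ~1 TiB of sorted KV
duration_s = 6 * 3600           # 6 hours
mb_per_s = kv_bytes / duration_s / 1024**2
print(f"Implied sustained write rate: {mb_per_s:.0f} MB/s")  # → 49 MB/s
```

A sustained ~49 MB/s is HDD territory; a local SSD/NVMe sort dir typically sustains several hundred MB/s, which is why the disk behind sorted-kv-dir is the first thing to check.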

| username: 小龙虾爱大龙虾 | Original post link

The Lightning local backend mainly consumes resources on the Lightning host itself. Strengthen the Lightning machine's configuration and put the sort dir on an SSD. During the import, check Lightning's resource consumption; it is normal for the CPU to run at high usage.
For performance tuning of the local backend, see: Use Physical Import Mode | PingCAP Docs
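One quick way to verify the SSD point on Linux: the kernel reports whether each block device is rotational under /sys/block. A minimal sketch (device names will differ per host):

```python
# Check whether local block devices are SSDs: on Linux the kernel exposes
# /sys/block/<dev>/queue/rotational ("0" = SSD/NVMe, "1" = spinning disk).
from pathlib import Path

def rotational_flags(sys_block=Path("/sys/block")):
    """Return {device_name: is_rotational} for each block device."""
    flags = {}
    if not sys_block.exists():
        return flags
    for dev in sorted(sys_block.iterdir()):
        rot_file = dev / "queue" / "rotational"
        if rot_file.exists():
            flags[dev.name] = rot_file.read_text().strip() == "1"
    return flags

for dev, rotational in rotational_flags().items():
    print(f"{dev}: {'HDD (rotational)' if rotational else 'SSD/NVMe'}")
```

If the device backing sorted-kv-dir shows up as rotational, that alone can explain a multi-hour slowdown on a ~1 TB KV workload.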

| username: TiDBer_ZHcgATCp | Original post link

This might be part of the reason, but does it really have that big an impact? Normally, processing half this much data takes about an hour.

| username: Soysauce520 | Original post link

Is there an IO or CPU bottleneck on the Lightning machine?

| username: TiDBer_ZHcgATCp | Original post link

That might be the reason; Lightning's CPU usage is indeed not high.

| username: TiDBer_ZHcgATCp | Original post link

What are the reasons for low CPU usage in Lightning?

| username: 小龙虾爱大龙虾 | Original post link

Is the disk an SSD?

| username: Soysauce520 | Original post link

You can try increasing the concurrency, and as mentioned above, the local backend needs high-performance disks.

| username: Daniel-W | Original post link

Add tuning parameters to improve import performance.
You can also run multiple Lightning tasks in parallel, with each task handling a set of tables or a single database.
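A sketch of the splitting step for parallel imports: distribute the source files across N Lightning instances, each given its own data-source-dir. The file names and helper below are illustrative, not from this thread:

```python
# Hypothetical helper: split source files across n parallel Lightning tasks.
def split_round_robin(files, n):
    """Distribute files across n import tasks, round-robin in name order."""
    groups = [[] for _ in range(n)]
    for i, f in enumerate(sorted(files)):
        groups[i % n].append(f)
    return groups

# Illustrative file names, not the actual data set from this thread.
tasks = split_round_robin([f"part-{i:05d}.parquet" for i in range(8)], 3)
for i, group in enumerate(tasks):
    print(f"task {i}: {group}")
```

Each group would then go into its own directory and be imported by a separate Lightning process; note that the config above already sets incremental-import = true, which is what allows multiple Lightning instances to write into the same cluster without conflicting on metadata.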

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.