Stuck for almost a day during full import with Lightning, please help!

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: lightning全量导入时卡住快一天,求助!!!

| username: 普罗米修斯

[TiDB Usage Environment] Production
[Encountered Problem] Dumpling exported 2.7 TB of data in a full dump, and Lightning finished creating all of the databases and tables. However, when it moved on to inserting the data, it ran for more than 5 hours and has not made any progress since 2022/10/24 20:05.

Lightning script:

#!/bin/bash
nohup /home/tidb/tidb-toolkit-v5.0.1-linux-amd64/bin/tidb-lightning -config tidb-lightning.toml > /home/tidb/tidb-toolkit-v5.0.1-linux-amd64/bin/tidb-lightning.log 2>&1 &

TiUP Cluster Display Information:

| username: 普罗米修斯 | Original post link

The full dump was exported from TiDB v3.0.3 with Dumpling and is being imported into TiDB v5.2.4 with Lightning.

| username: 普罗米修斯 | Original post link

The configuration file of tidb-lightning

| username: buchuitoudegou | Original post link

It looks like both split and scatter have completed. The next step should be ingesting the KV pairs into TiKV.

You can check the logs on the TiKV side, or dump Lightning's goroutines to see where it is stuck.
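For example, a minimal sketch of dumping the goroutines over Lightning's status port (this assumes the default status-addr of :8289 and that you run it on the Lightning host; adjust the host/port if you changed it in tidb-lightning.toml):

# Dump all goroutine stacks of the running tidb-lightning process to a file
curl "http://127.0.0.1:8289/debug/pprof/goroutine?debug=2" -o lightning-goroutines.txt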

FYI: TiDB Lightning 常见问题 | PingCAP 文档中心

| username: 普罗米修斯 | Original post link

While the full import is running, memory usage on the server executing the Lightning script peaks at 79.6%, and CPU usage is also not high.

| username: 普罗米修斯 | Original post link

goroutines file

| username: 普罗米修斯 | Original post link

TiKV log, please take a look.

| username: 普罗米修斯 | Original post link

There is a warning about transmission errors in PD.

| username: 普罗米修斯 | Original post link

The import is particularly slow while Lightning is running, and scheduling keeps being paused;


The total amount of data imported over the past two days is as follows

The DC18 server is a TiKV node and is also the server where Lightning runs the import. Its memory usage has been very high, so I lowered Lightning's region-concurrency; if I raise it, the Lightning process exits after a while;

Image
I found that there are many empty regions in the target cluster. After I adjust max-merge-region-keys and max-merge-region-size, the Lightning process triggers pause-scheduling again after a while, the number of empty regions stops decreasing, and when I check the configuration again the parameters I set have been cleared back to 0;
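For reference, a hedged sketch of checking these values with pd-ctl (replace <pd-host> with your actual PD address; only restore the merge parameters after the import has finished, since Lightning expects them to stay at 0 while it is running):

# Inspect the current scheduling limits and merge parameters
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 config show

# Count the empty regions that are waiting to be merged
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 region check empty-region

# After the import completes, restore the defaults so empty regions can merge again
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 config set max-merge-region-size 20
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 config set max-merge-region-keys 200000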

Image
I have a few questions to consult:

  1. Will the lightning process increase the number of empty regions?
  2. Will an increase in empty region count affect the import speed?
  3. What is causing the slow import speed in the current situation? How can the import speed be increased?
| username: 普罗米修斯 | Original post link

There are warnings during the transmission, and errors occur when writing to KV. Can you help me check what exactly is going wrong and how to solve it? Dumpling exported 2.7 TB of data, the import has been running for more than a week and still hasn't finished, and now there are write errors. Could you please look into the very slow import rate and the write errors? Thank you. Compensation can be provided; the online business is quite urgent. Seeking help.

| username: tidb狂热爱好者 | Original post link

Separate TiDB out.

| username: tidb狂热爱好者 | Original post link

When TiDB encounters slow SQL, it will continuously OOM (Out of Memory).

| username: tidb狂热爱好者 | Original post link

The default configurations of TiDB and TiKV both assume that each component has a machine to itself. If you want a mixed deployment, you need to adjust the parameters yourself; the official documentation has the details.

| username: jansu-dev | Original post link

  1. After checking, it was found that there were many empty regions in the imported cluster. After adjusting max-merge-region-keys and max-merge-region-size, the lightning process would trigger pause scheduling after a period of time. The number of empty regions would stop decreasing, and on checking the configuration again the parameters were found to be reset to 0.
    → Before a local-backend import, Lightning normally pauses scheduling. Pausing scheduling helps the import go fast and avoids the region and leader changes caused by merges and splits. As for the XXX-keys and XXX-size parameters being reset to 0, that is usually related to an abnormal exit of Lightning: local-backend Lightning pauses scheduling first and only restores it after the import completes, and if I remember correctly these parameters are set to 0 for the duration of the import.

  2. Dumpling exported 2.7T of data in full, and Lightning completed creating the databases and tables. When inserting the data, it ran for more than 5 hours and then stopped moving after 20:05 on 2022/10/24.
    → I see that the configuration file uses the local backend, so no INSERT statements are executed; the data goes directly from csv → SST → TiKV files.

  3. I didn’t see anything useful in the profile, but the screenshot shows tcp connection reset by peer, which indicates some problem with the TCP connection. How much this affects import performance is uncertain, though. It’s best to collect a profile that shows where execution time is spent, to see which function is stuck.

curl http://{TiDBIP}:10080/debug/zip?seconds=60 --output debug.zip

Speaking of slow imports:

  1. Is there any abnormal point in lightning.log?
  2. Have you checked the disk load? (see the iostat sketch after this list)
  3. Judging from the CPU idle, not much CPU was used for the local conversion from csv to SST. Have you checked the performance of the downstream TiKV?
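For item 2, a generic way to confirm whether the disks are the bottleneck (nothing TiDB-specific here; requires the sysstat package):

# Sample per-device utilization once a second for 10 samples; %util near 100 on the
# sorted-kv-dir or TiKV data disks means IO is the bottleneck
iostat -dxm 1 10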

Please upload a copy of lightning’s own log first. :thinking:

btw: Under normal circumstances, it should be a bit slower than this speed → TiDB Lightning 导入模式 | PingCAP 归档文档站

| username: 普罗米修斯 | Original post link

  1. I didn’t see any anomalies in lightning.log; it’s all INFO messages. Now there are two warnings where errors occurred while writing to KV, please take a look.
  2. I checked the disk load and it is indeed very high.

    At the time, I expected the data exported by Dumpling to be very large, so on 192.168.80.218 I used LVM to combine sdd and sdc into one volume mounted under /data, and I also placed Lightning's sorted-kv-dir under /data/lightning. Both drives are HDDs, and their IO utilization reached 100%. The sda on 192.168.90.213 and the sdb on 192.168.80.218 are also mechanical HDDs; because there were not enough SSDs, I used them as TiKV nodes as well, and their utilization also reached 99.9%. The remaining drives are SSDs, and their utilization is also quite high, all above 90%. I am using a mixed deployment with the default configuration and did not set any parameters. Could it be related to the following parameters?
    readpool.unified.max-thread-count:
    raftstore.capacity:
    storage.block-cache.capacity:
    The memory usage on 192.168.80.217 and 192.168.80.218 is also quite high.


    Someone in the group just mentioned that a TiDB OOM caused the write to TiKV to fail.
| username: jansu-dev | Original post link

  1. Just now, someone in the group said that a TiDB OOM caused the write to TiKV to fail.
    A TiDB OOM will not cause writes to TiKV to fail, but a TiKV OOM might cause this issue. To verify whether it is an OOM, you can check /var/log/messages.

  2. Regarding the parameters:
    a. raftstore.capacity → TiKV 配置文件描述 | PingCAP 归档文档站
    b. readpool.unified.max-thread-count → TiKV 配置文件描述 | PingCAP 归档文档站
    c. storage.block-cache.capacity → TiKV 配置文件描述 | PingCAP 归档文档站
    d. readpool.unified.max-thread-count is not very relevant here. If TiKV really is OOMing, then reducing storage.block-cache.capacity will somewhat reduce the probability of OOM (a hedged example of lowering it online follows this list). raftstore.capacity mainly matters when a store's storage is nearing its limit, but it looks like the disk space used is relatively small. Is this on k8s? The dashboard looks unfamiliar :joy:.

  3. For this error, there was an issue writing SST to TiKV. Investigate along this line of thought :thinking:.
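If TiKV memory really is the problem, here is a hedged sketch of lowering the block cache online via SQL (the 8GiB value is only a placeholder; size it so that the caches of all TiKV instances sharing a host fit comfortably in RAM, and note that SET CONFIG changes are not persisted to the config file, so also update the topology config afterwards):

# Lower the block cache on every TiKV instance (placeholder value, adjust per host)
mysql -h <tidb-host> -P 4000 -u root -p -e 'SET CONFIG tikv `storage.block-cache.capacity` = "8GiB";'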

| username: tidb狂热爱好者 | Original post link

Go check his chat history in the group. He deployed 4 TiKV instances and 1 TiDB on one machine without changing the configuration. The TiKV instances are competing with each other for memory; the OOM killer kills whichever one is using the most memory, and then systemd restarts the killed TiKV. That's his problem.

| username: tidb狂热爱好者 | Original post link

So you can see from his graph that, within a few hours, it keeps OOMing and restarting due to insufficient memory. A single TiKV instance's block cache occupies about 40% of total memory by default; starting 2 instances is already enough to OOM, let alone 4.
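For a mixed deployment like this, a hedged sketch of capping each instance permanently with TiUP (the keys and sizes in the comments are only examples; pick values that fit the host):

# Edit the topology and, under each tikv_servers entry (or server_configs.tikv),
# set for example:
#   storage.block-cache.capacity: "8GiB"
#   readpool.unified.max-thread-count: 4
tiup cluster edit-config <cluster-name>

# Roll the change out to the TiKV instances
tiup cluster reload <cluster-name> -R tikv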

| username: 普罗米修斯 | Original post link

  1. This is the lightning log, please take a look;
    tidb-lightning.log (2.2 MB)
  2. Checked the downstream server's /var/log/messages log; TiKV did indeed OOM.

    The server load is too high. The 192.168.80.218 server has 5 TiKV nodes; I am now taking one TiKV offline to reduce the load, and will then re-apply the mixed-deployment parameters and try again.
  3. Previously I thought the data exported by Dumpling would be very large, so on 192.168.80.218 I used LVM to combine sdd and sdc into one volume mounted under /data, and also put Lightning's sorted-kv-dir under /data/lightning. Both drives are HDDs and their IO reached 100%. Should Lightning and Dumpling not be placed in the same directory?

    If they are separated, do I just re-specify sorted-kv-dir in the tidb-lightning.toml configuration file, copy the remaining data in the lightning directory over, and restart lightning.sh?
    Image
  4. To restart the Lightning service, do I just run the Lightning script directly, or do I need to clear the checkpoint information or perform any other operations first?
| username: jansu-dev | Original post link

  1. This pause operation keeps repeating, and the reason is currently unclear. After starting Lightning, did you manually modify or re-enable the schedulers?

  2. Oh, we need to solve the OOM issue.

  3. Should Lightning and Dumpling not be placed in the same directory?
    → It doesn't really matter; Dumpling and Lightning are not running at the same time, right? As long as they don't run simultaneously there is no IO contention between them, although it's best if both are on SSDs. However, the main problem now seems to be that TiKV can't cope once the data is written to it and OOMs. You can analyze what is consuming TiKV memory besides the block cache, and check the TiKV-Details panel (is it because IO is maxed out and all the data is being held in memory? etc.)

  4. It isn't really necessary to split them, is it? Have you already imported 600 GB of data, or almost none? Splitting them into different directories is meaningless.

I think the current issues are:

  1. Identify the root cause of memory consumption to prevent TiKV OOM, and ensure continuous data import (even if it’s slow);
  2. Find a way to solve the TiKV IO saturation issue. If you can’t change the disk, find a way to reduce concurrency;
  3. If some data has already been imported, Lightning's checkpoint mechanism supports resuming from where it stopped. Try to reuse it to save time, and focus on solving the TiKV-side issues first (a rough sketch follows this list).
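For item 3, a rough sketch of resuming (assuming checkpoints were left enabled, which is the default, and that the same tidb-lightning.toml is still in place; the db_name/table_name below are placeholders):

# Simply rerun Lightning with the same config; the checkpoint lets it skip
# chunks that were already imported successfully
nohup tidb-lightning -config tidb-lightning.toml > tidb-lightning.log 2>&1 &

# If a table failed mid-import and you want to redo it from scratch,
# clear its checkpoint (and the partially written data) first, e.g.:
tidb-lightning-ctl -config tidb-lightning.toml --checkpoint-error-destroy='`db_name`.`table_name`'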