TiDB Lightning Data Import OOM

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-lightning导入数据OOM

| username: starCrush

[TiDB Usage Environment] Production Environment / Testing
[TiDB Version] v5.4.3
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
Approximately 60 TB of data. Each time Lightning is started, it runs for a while and is then killed by the system OOM killer. The machine has 64 GB of memory and 16 logical CPU cores. region-concurrency was initially left at its default, then lowered to 10, and later to 5.
Here are two questions:

  1. According to the logs, it has reached the sample source data stage. Does this indicate that data import has already started? Or is it still in the backup file detection stage? Additionally, the sorting directory only contains some very small files. Is this normal?

  2. After Lightning starts, it reports that the checkpoint (breakpoint-resume) file is not present but cannot create a new one. Does this mean the data import has not actually started?

The current phenomenon feels like the data has not been imported, the data volume on the PD monitor has not increased, and the memory usage of lightning keeps increasing until OOM.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: starCrush | Original post link

Only one instance of Lightning is running on the machine, and the memory is entirely occupied by Lightning.

| username: tidb菜鸟一只 | Original post link

Are you using Lightning to import 60TB of data?

| username: hey-hoho | Original post link

How large is a single file? Try to split it into smaller files for import.

| username: starCrush | Original post link

Otherwise, what should we use to import?

| username: starCrush | Original post link

Some single files are quite small, and tables were split into files of at most 256 MiB during the dump.

| username: Billmay表妹 | Original post link

Preliminary judgment: TiDB Lightning hit an out-of-memory (OOM) condition during the import and was killed by the system's OOM Killer. This typically happens when the data volume is large enough that operations like sorting and merging need more memory than the machine can provide.

To address this issue, you can try the following solutions:

  1. Adjust TiDB Lightning’s parameters to reduce memory usage. For example, you can try adjusting the sorter.memory-quota parameter to lower the memory usage for sorting operations. Additionally, you can adjust the tikv-importer.sorter.num-concurrent-worker parameter to reduce the number of concurrent sorting tasks, thereby decreasing memory usage.
  2. Increase the machine’s memory. If your machine’s memory is insufficient, consider adding more memory to meet TiDB Lightning’s requirements during data import.
  3. Import data in batches. If the data volume is too large, consider importing the data in batches to reduce memory usage during a single import. For example, you can batch the data by time range, table name, etc.
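As a rough illustration of point 1, a config sketch that lowers concurrency to reduce memory pressure might look like the following. Note that the key names below are the standard v5.x Lightning options; `sorter.memory-quota` and `tikv-importer.sorter.num-concurrent-worker` mentioned above are not recognized by all releases, so verify every key against the documentation for your Lightning version. Paths and values are placeholders.

```toml
# Sketch of a tidb-lightning.toml tuned for lower memory use (v5.x keys).

[lightning]
# Fewer concurrent tasks means less data buffered in memory at once.
region-concurrency = 4
index-concurrency = 2
table-concurrency = 4
io-concurrency = 5

[tikv-importer]
backend = "local"
# Local-sort directory; needs free space comparable to the source data.
sorted-kv-dir = "/data1/sorted-kv"

[mydumper]
data-source-dir = "/data1/tmp"
```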
| username: starCrush | Original post link

Lightning must run on the machine with the backup data, right? You can find some machines with large memory, but you can’t specify a remote backup directory, can you?
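For what it's worth, Lightning does not strictly require the source files to sit on a local disk: releases since roughly v4.0 can read the data source from S3-compatible external storage, which would let the import run on a larger-memory machine. A hedged sketch (bucket name and prefix are placeholders; check your version's external-storage support first):

```toml
[mydumper]
# Read the dump files from S3-compatible storage instead of a local path.
# "my-bucket" and "dump-prefix" are placeholders.
data-source-dir = "s3://my-bucket/dump-prefix"
```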

| username: starCrush | Original post link

How do I add these two parameters to the configuration file? I tried several formats, but they all said the configuration was incorrect.
memory-quota = "55G"
sorter.memory-quota = "55G"

And how do I add this tikv-importer.sorter.num-concurrent-worker?

| username: 大鱼海棠 | Original post link

It looks like the precheck didn’t pass. Do you have a graph of the OOM? The region-concurrency is by default the number of vCPUs, so it shouldn’t be using that much memory. Did you adjust any other parameters?

| username: 大鱼海棠 | Original post link

Is there enough space in the sort area? At this point in the log, Lightning estimates the overall size of the data KVs and index KVs.

| username: starCrush | Original post link

This is the entire configuration file, please take a look.
This is the OOM log:

| username: starCrush | Original post link

The sort area space is not enough, but every time it runs, only the memory keeps increasing. There are some small files in the sort area, and it seems that the sort area hasn’t been used yet.

/dev/md0 131T 81T 44T 66% /data1

I put the dump data files and the sort area in /data1, and there is no larger disk available.

| username: 大鱼海棠 | Original post link

index-concurrency = 2
table-concurrency = 6
Try adjusting these settings. Is this machine only running Lightning? I’m not sure where the 64GB usage is coming from; it doesn’t seem reasonable. Have the export files been split?
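The two settings above belong in the `[lightning]` section of the config file; a minimal placement sketch:

```toml
[lightning]
index-concurrency = 2
table-concurrency = 6
```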

| username: starCrush | Original post link

Only Lightning was running.

These are the parameters exported by dumpling:
/root/tidb-toolkit-v5.4.2-linux-amd64/bin/dumpling -u xxx -p 'xxx.com' -P 3306 -h --filetype sql --read-timeout=1h -t 16 -o /data1/tmp -r 200000 -F256MiB

| username: hey-hoho | Original post link

Try importing with the new version of Lightning, maybe you’ve encountered a bug.

| username: 大鱼海棠 | Original post link

Try using a different version of Lightning as suggested by the expert above.

| username: starCrush | Original post link

After modifying the two parameters above, I ran it for a few more hours. Why is the sorted-kv-dir still so small? It feels like it's not being used, and memory keeps increasing. The Lightning log has also progressed from checking the data files to sampling the source data.

Here is the lightning log at the beginning:

| username: 大鱼海棠 | Original post link

It hasn't started importing data yet. There seem to be too many tables; according to the logs, it's still doing the precheck.

| username: okenJiang | Original post link

Is this the bug? Lightning: Memory Leak on Large Source Files · Issue #39331 · pingcap/tidb · GitHub