Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiDB-Lightning导入数据卡住 (TiDB-Lightning data import is stuck)
【TiDB Usage Environment】Production
【TiDB Version】5.4
【Data Volume】800 GB; 3 TiDB nodes with 6 cores and 256 GB of memory each, and 4 TiKV nodes with the same configuration
【Reproduction Path】Stuck at the following log while using Lightning to restore data
【Problem Phenomenon and Impact】
[INFO] [pd.go:406] ["pause scheduler(configs)"] [name="[balance-region-scheduler,balance-leader-scheduler,balance-hot-region-scheduler]"] [cfg="{\"enable-location-replacement\":\"false\",\"leader-schedule-limit\":40,\"max-merge-region-keys\":0,\"max-merge-region-size\":0,\"max-pending-peer-count\":2147483647,\"max-snapshot-count\":40,\"region-schedule-limit\":40}"]
I can see that progress is being made; why do you say it's stuck?
Because it has been sitting on this log line for 24 hours, and the total data volume is only 800 GB.
Please follow the steps in this answer to dump Lightning's goroutines and see where it is stuck.
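A minimal sketch of how the goroutine dump could be collected, assuming Lightning is running locally and still listens on the default status port 8289 (the status-addr setting in the [lightning] section); adjust the host and port to your deployment:

```shell
# Dump all goroutine stacks from Lightning's status port (default 8289).
curl "http://127.0.0.1:8289/debug/pprof/goroutine?debug=2" > goroutine.log

# Goroutines that have been blocked for a long time are the interesting ones.
grep -n "minutes" goroutine.log | head
```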
The 800 GB import was re-run for a day and got stuck again; the restarted Lightning is also stuck.
Lightning is deployed on a separate server with 6 cores and 256 GB of memory. It has already successfully imported several 300 GB and 100 GB databases and is now importing the largest one, at 800 GB. Can you tell whether a performance bottleneck is causing the issue?
I deleted the target database and the checkpoint files, then restarted Lightning, but it still gets stuck. The logs are as follows: goroutine (1).log (335.2 KB) tidb-lightning.log.1 (299.5 KB)
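For reference, instead of deleting checkpoint files by hand, the checkpoints can also be cleared with tidb-lightning-ctl; a sketch, where the config path is a placeholder:

```shell
# Remove all Lightning checkpoints so the next run starts from scratch.
tidb-lightning-ctl --config tidb-lightning.toml --checkpoint-remove=all
```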
I didn't see any error information in the logs. Have any monitored metrics reached their thresholds?
Monitoring is currently unavailable due to internal restrictions, so for now we can only check through system commands. The import machine's CPU has been pegged at 100% the whole time. Previously, a 300 GB table imported successfully in 3 hours; this 800 GB one has already been running for 16 hours and has been stuck for more than ten of them.
The current tidb-lightning.toml only configures checkpoint resume, index-concurrency = 1, table-concurrency = 1, and region-concurrency = 5; everything else is left at the defaults.
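Spelled out, the non-default part of that tidb-lightning.toml would look roughly like the fragment below; a sketch only, with the section names taken from the Lightning docs, the output file name as a placeholder, and everything omitted staying at its default:

```shell
# Write the fragment described above to a file for reference.
cat > lightning-fragment.toml <<'EOF'
[lightning]
index-concurrency  = 1
table-concurrency  = 1
region-concurrency = 5

[checkpoint]
enable = true
EOF
```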
There is another question: it is stuck right now. If this were a machine performance problem, what scenario would keep it stuck in one place for more than 10 hours? After all, this same configuration has already imported several 300 GB databases successfully. Is there any other way to check where it is stuck?
Is it normal for it to always prompt “switch to import mode” after this? I feel like it shouldn’t be this way.
Communicated with the original poster offline: his machine has too few CPU cores and does not meet the minimum requirements, so he needs to upgrade the hardware.
Did tidb-lightning exit abnormally at some point? That could leave the cluster stuck in import mode and cause the CPU spike. You can try forcing it back to normal mode and see if that helps.
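A sketch of how the mode could be switched back with tidb-lightning-ctl, assuming the tool is available on the import machine and the config path below is a placeholder:

```shell
# Force the TiKV stores back to normal mode.
tidb-lightning-ctl --config tidb-lightning.toml --switch-mode=normal

# Confirm the current mode of each TiKV store afterwards.
tidb-lightning-ctl --config tidb-lightning.toml --fetch-mode
```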
You can try importing in batches, one large table at a time.
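One way to batch is to limit each run to a single table with a table filter in the [mydumper] section; a sketch, where mydb.big_table is a placeholder for the table being imported in that run:

```shell
# Restrict this Lightning run to one table; repeat per batch with a new filter.
cat >> tidb-lightning.toml <<'EOF'
[mydumper]
filter = ['mydb.big_table']
EOF
```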
Resolved. There was a problem with the full export data: a single SQL file exceeded 50 GB, which might have caused issues during transfer. After replacing it, everything worked fine.
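For anyone hitting the same thing: if the source can be re-exported with Dumpling, capping the output file size avoids producing a single oversized SQL file; a sketch with placeholder connection parameters and paths:

```shell
# Re-export the database with files capped at 256 MiB each.
dumpling -h 127.0.0.1 -P 3306 -u root -p 'password' \
  -B mydb -F 256MiB -o /data/export/mydb
```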