Backup is stuck at checksum for a long time

translator_bot · June 23, 2024, 3:12am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: br 备份一直卡在checksum很长时间

| username: blaine

【TiDB Usage Environment】Production
【TiDB Version】v5.2.0
【Encountered Problem】BR full backup stays stuck at the checksum stage
【Reproduction Path】None
【Problem Phenomenon and Impact】
【Attachment】

Screenshot: the backup is stuck at checksum.

Screenshot: the backup log, which shows no errors. The total data size is only 288GB, which is not that large, yet the checksum has been stuck at 98.95% for an hour. Is this normal? If I stop the backup process now, can the data still be restored on the target end?

After waiting for about 6 hours, it finally completed.

The checksum took longer than the backup itself. What could be the reason for this? The data size is not very large, only 290GB, yet the checksum took around 6 hours. It's quite frustrating.

translator_bot · June 23, 2024, 3:12am

| username: yilong | Original post link

You can check the BR monitoring to see whether any resource is exhausted, for example whether CPU or IO is fully utilized.

translator_bot · June 23, 2024, 3:12am

| username: Tank001 | Original post link

Check the resources, or see whether anything else is using them at the same time.

translator_bot · June 23, 2024, 3:12am

| username: Hacker007 | Original post link

From experience, this looks like an IO bottleneck. I ran into a similar situation before.

translator_bot · June 23, 2024, 3:12am

| username: zhouzeru | Original post link

Take a look at the monitoring; otherwise we can only make blind guesses.

translator_bot · June 23, 2024, 3:12am

| username: alfred | Original post link

Does it take a long time to checksum every time you back up?

translator_bot · June 23, 2024, 3:12am

| username: neilshen | Original post link

BR performs checksum calculations to ensure the integrity of backup data.

Based on the current information, the slow checksum calculation may be due to the following reasons:

- There is a large amount of historical data in the TiDB cluster, leading to slow data reading. You can check the GC configuration by executing SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb; and appropriately reduce tikv_gc_life_time. Refer to TiDB Garbage Collection (GC).
- The CPU of the TiKV node does not support the PCLMULQDQ or SSE 4.1 instructions. You can check this by running cat /proc/cpuinfo.
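
Scanning the cat /proc/cpuinfo output by eye is error-prone, so the CPU check above can be sketched as a small Python script. This is only an illustration, not part of BR or TiKV: it assumes an x86 Linux host, where the two instruction sets appear as pclmulqdq and sse4_1 on the "flags" lines, and the helper name missing_cpu_flags is made up for this example.

```python
from pathlib import Path

# Flag names as they appear on the "flags" lines of /proc/cpuinfo (x86 Linux).
REQUIRED_FLAGS = ("pclmulqdq", "sse4_1")


def missing_cpu_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags not listed on any 'flags' line of the text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present.update(value.split())
    return [flag for flag in required if flag not in present]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_cpu_flags(cpuinfo.read_text())
        if missing:
            print("Missing CPU flags:", ", ".join(missing))
        else:
            print("Both pclmulqdq and sse4_1 are present.")
    else:
        print("/proc/cpuinfo not found; this check only applies to Linux.")
```

Run it on each TiKV node. If it reports a missing flag, the second cause in the list above is worth investigating.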

You can also skip the checksum calculation during BR backup by adding --checksum=false.