BR Backup Error: Why Some Clusters Have No Issues

translator_bot · June 23, 2024, 10:04am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR 备份报错，为啥有的集群没问题

| username: lxzkenney

TiDB 5.0.3

There are two other clusters online, one on 5.0.1 and the other on 5.1.1, and their backup tasks have run for a year without any issues. Last night I backed up this cluster (version 5.0.3) for the first time, and the backup failed. I then used dumpling to back up locally, but that was interrupted as well. See the screenshot at the bottom. Do I need to increase the maximum SQL execution time?

Error message:

br version:

The monitoring shows that TCP retransmissions were somewhat high during the time I was running the backup, even though network card traffic was low at that point.

dumpling interruption error:

I checked the logs. The table in dumpling's SQL has an ID column with the auto_random attribute, and it holds 250 million rows. My maximum SQL execution time is 10 minutes, so the query probably exceeded the limit and was killed, which interrupted the dumpling task. I'm not sure whether that is really what happened.
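The suspicion above can be sanity-checked with simple arithmetic. This is a minimal sketch, assuming the export runs as one statement over all 250 million rows and that the 10-minute limit applies to that single statement. Neither assumption is confirmed in this thread, and the function name is made up for illustration.

```python
def required_rows_per_second(total_rows: int, limit_seconds: float) -> float:
    """Minimum sustained read rate for one statement to finish inside the limit."""
    if limit_seconds <= 0:
        raise ValueError("limit_seconds must be positive")
    return total_rows / limit_seconds


# Figures taken from the post: 250 million rows, 10-minute execution limit.
TOTAL_ROWS = 250_000_000
LIMIT_SECONDS = 10 * 60

rate = required_rows_per_second(TOTAL_ROWS, LIMIT_SECONDS)
print(f"{rate:,.0f} rows/s needed to finish within the limit")
```

Under those assumptions the statement would have to sustain roughly 417 thousand rows per second, so a timeout is plausible. If the export is split into smaller chunks, each statement covers fewer rows and the required rate drops accordingly.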

translator_bot · June 23, 2024, 10:04am

| username: xfworld | Original post link

Try upgrading BR.

translator_bot · June 23, 2024, 10:04am

| username: ShawnYan | Original post link

The BR full backup timed out and failed. How long had the backup job been running before it failed? Is this the first time this has happened? Was there a network anomaly between BR and TiKV at that time?

translator_bot · June 23, 2024, 10:04am

| username: lxzkenney | Original post link

I then ran the backup twice with this newer version, and both times it failed with the same error.

translator_bot · June 23, 2024, 10:04am

| username: lxzkenney | Original post link

This is the first backup ever run on this cluster, although the cluster itself has been running for a long time. It is on Tencent Cloud hosts, and there are no anomalies in internal network communication. After 10 PM there is no traffic. I have retried the backup several times, but it always fails with this error after running for about 10 minutes.

I noticed that the TCP retransmissions are relatively high, coinciding with my backup attempts. I see that the network card traffic isn’t significant.

translator_bot · June 23, 2024, 10:04am

| username: xiaohetao | Original post link

Where is the backup stored, and is the network communication okay during this process?

translator_bot · June 23, 2024, 10:04am

| username: banana_jian | Original post link

Is the backup directory using NFS?

translator_bot · June 23, 2024, 10:04am

| username: lxzkenney | Original post link

It is stored on COS, Tencent Cloud's object storage. Traffic between their products goes over the internal network.

translator_bot · June 23, 2024, 10:04am

| username: lxzkenney | Original post link

COS Cloud Object Storage

translator_bot · June 23, 2024, 10:04am

| username: banana_jian | Original post link

The nodes where TiKV runs can access this storage without any problems, right? Permissions, paths, and so on?

translator_bot · June 23, 2024, 10:05am

| username: lxzkenney | Original post link

Administrator privileges were granted, the data file was written, and the backup was interrupted after running for more than 10 minutes.

translator_bot · June 23, 2024, 10:05am

| username: banana_jian | Original post link

It seems there was a similar bug before. Let’s see if other experts can help you.

translator_bot · June 23, 2024, 10:05am

| username: lxzkenney | Original post link

The TiDB cluster and the backup storage are in different clouds, and the dedicated line between them is unstable. Network communication problems during the backup caused the task to time out and be interrupted.
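An unstable link like this can be spotted before starting a long backup by repeatedly opening TCP connections to the storage endpoint and counting failures. This is a minimal sketch, not something used in this thread. The function names and the host in the usage comment are hypothetical, and a plain TCP connect only checks reachability, not throughput.

```python
import socket
import time


def probe(host, port, attempts=5, timeout=3.0, pause=0.0):
    """Open a TCP connection `attempts` times.

    Returns a list with the connect time in seconds for each success
    and None for each failed attempt.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:  # covers timeouts, refused connections, DNS errors
            results.append(None)
        time.sleep(pause)
    return results


def failure_ratio(results):
    """Fraction of probe attempts that failed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r is None for r in results) / len(results)


# Hypothetical usage, run from a TiKV node against the storage endpoint:
#   results = probe("storage.example.internal", 443, attempts=60, pause=1.0)
#   print(f"failed: {failure_ratio(results):.0%}")
```

A non-zero failure ratio, or connect times that vary widely over a few minutes, would point to the link rather than to BR itself.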

translator_bot · June 23, 2024, 10:05am

| username: cs58_dba | Original post link

We also previously synced a large amount of data from an Alibaba Cloud ECS server to OSS. Even within the same cloud, throughput is limited, and if it gets too high the connection is simply dropped.
