【TiDB Usage Environment】Production Environment
【TiDB Version】5.4.0
【Reproduction Path】The backup used to succeed every time. As the business grew, it began to fail intermittently, and it now fails consistently.
【Encountered Problem: Phenomenon and Impact】
Command executed:
br backup full
--pd ",,"
--storage "s3://aa-cn-aip-tidb-backup-1306458289/${timetag}full"
--s3.endpoint "
--s3.region "xxxx-xxxxxxx"
--ratelimit 128
--log-file /tmp/backupfull
Log screenshot

【Resource Configuration】
3 TiDB: 48 cores, 192 GB
3 PD: 32 cores, 64 GB
8 TiKV: 48 cores, 192 GB
2 TiCDC: 32 cores, 64 GB
1 monitor: 8 cores, 16 GB
Attached is the log
backupfull_2023-08-02-02-30.log (77.4 MB)

Were there any errors reported by TiKV during the same time period? Check what the errors are.

TiKV raised no alerts. The backup ran at night, and the cluster status showed no anomalies when checked in the morning.

It seems that your backup failed due to the “Region is unavailable” error. You can follow the steps below to troubleshoot:

When checking the TiKV logs corresponding to the backup error period, I found the following, not sure if it helps:

The continuous warnings might also be abnormal.

The "epoch not match" warnings keep appearing.

The following image shows errors that occur frequently even without backups.

The rest are just info logs without obvious anomalies.
Also, there are no occurrences of “busy,” “oom,” or “memory” throughout the logs.
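
For anyone who wants to repeat this kind of keyword check, here is a minimal sketch in shell. The function name `scan_log` and the log path in the usage line are my own placeholders, not something from the original post. It only counts matching lines per keyword, case-insensitively, so it is a rough first pass rather than a real diagnosis.

```shell
#!/usr/bin/env bash
# Count log lines that match each keyword (case-insensitive).
# scan_log is a hypothetical helper name, not part of TiDB or BR.
# Usage: scan_log <logfile> <keyword>...
scan_log() {
  local logfile="$1"
  shift
  local kw
  for kw in "$@"; do
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are no matches, so "|| true" keeps the loop going.
    printf '%s\t%s\n' "$kw" "$(grep -ci -- "$kw" "$logfile" || true)"
  done
}
```

Example call (the path is an assumption, adjust it to your deployment): `scan_log /path/to/tikv.log busy oom memory "Region is unavailable" "epoch not match"`. Each output line is the keyword, a tab, and the number of matching lines.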

Adding some more diagnostic information:
No OOM found in /var/log/messages.
Checked the region and store status with pd-ctl; no anomalies found.
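
For reference, a sketch of the kind of read-only pd-ctl checks meant here, assuming pd-ctl is run through `tiup ctl`. The wrapper name `pd_checks` and the PD address in the usage line are placeholders I made up; the `store` and `region check ...` subcommands are standard pd-ctl commands, but this is not necessarily the exact set that was run in this thread.

```shell
#!/usr/bin/env bash
# Run a few read-only pd-ctl checks against one PD endpoint.
# pd_checks is a hypothetical wrapper name.
# Usage: pd_checks <pd-url> [ctl-version]
pd_checks() {
  local pd_url="$1"
  local ver="${2:-v5.4.0}"   # cluster version stated in the post
  local sub
  for sub in "store" \
             "region check miss-peer" \
             "region check down-peer" \
             "region check pending-peer"; do
    echo "== pd-ctl $sub =="
    # $sub is left unquoted on purpose so it splits into separate arguments.
    tiup "ctl:${ver}" pd -u "$pd_url" $sub
  done
}
```

Example call (the address is an assumption): `pd_checks http://127.0.0.1:2379`. In the `store` output, look for stores whose state is not `Up`; any regions listed by the `region check` commands are worth a closer look.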

The region is damaged.
Check out this article: Column - A Record of Repairing a Corrupted SST File | TiDB Community

