How to Handle TiDB Database Backup Failures?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB数据库无法正常备份,如何处理?

| username: TiDBer_ydSkDlLw

[TiDB Usage Environment] Test Environment
[TiDB Version]
[Reproduction Path]
[Encountered Issue: Problem Phenomenon and Impact]
The TiDB database backup fails. The cluster currently holds two databases. The ceShiSave database is being written to continuously, while the test database I am backing up has only one table with three rows. CPU and memory on the machines are not fully utilized. BR keeps reporting a GC-related error, and the GC life time (tidb_gc_life_time) is set to 10m.
The output is as follows:
root@yiduo-VirtualBox:/opt# tiup br backup db -s "local:///opt/back1013" --pd "192.168.1.243:2379" --db test --log-file backuptable1013.log
Checking updates for component br… Timedout (after 2s)
Starting component br: /root/.tiup/components/br/v8.1.0/br backup db -s local:///opt/back1013 --pd 192.168.1.243:2379 --db test --log-file backuptable1013.log
Detail BR log in backuptable1013.log
[2024/06/12 14:10:07.977 +08:00] [INFO] [collector.go:77] ["Database Backup failed summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0]
Error: GC safepoint 450408446345609216 exceed TS 450405012801323009: [BR:Backup:ErrBackupGCSafepointExceeded]backup GC safepoint exceeded

The backup target is a local directory, and ownership of that directory has been assigned to the tidb user.

[Attachments: Screenshots/Logs/Monitoring]
Cluster Distribution


Overall Performance of Two Servers: (screenshots)

Production Environment Plan
In production we plan to deploy only 3 PD and 2 TiKV nodes, or 1 PD and 1 TiKV node. As long as performance is sufficient, that is acceptable to us. If a project has more than 3 database servers, we would consider deploying 3 PD nodes and more than 3 TiKV nodes. Can such a deployment keep the system running normally? We would take manual periodic backups to guard against data loss.

| username: WinterLiu

My guess is that the machine is short on memory and the backup ran out of memory while exporting data. Check the operating system's messages log for any OOM (out of memory) errors.

| username: 这里介绍不了我

Running a production cluster on just two machines and then running BR backups on top of it sounds like a very heavy load.

| username: lemonade010

You may need a longer GC life time, because the MVCC versions the backup reads could have expired and been garbage collected while the backup was running.

| username: TiDBer_ydSkDlLw

The database I am backing up has only one table with 3 rows. The GC life time was 10 minutes; I changed it to 20 minutes, but the error is the same.

Can GC really matter for such a small database? Can the other database affect this small one?

| username: lemonade010

Look at the two timestamps in the error message:
GC safepoint 450408446345609216 = '2024-06-12 13:57:48.689000'
Backup TS 450405012801323009 = '2024-06-12 10:19:30.757000'
The backup TS is more than 3 hours behind the GC safepoint, so you need a GC life time of at least 3 hours or more.
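
For readers who want to check the two timestamps themselves, here is a minimal Python sketch of the conversion. It relies on the TSO layout used by TiDB and PD, where the low 18 bits are a logical counter and the remaining high bits are a physical Unix time in milliseconds. The function name tso_to_datetime and the default UTC+8 offset (taken from the +08:00 in the BR log) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# TSO layout: low 18 bits = logical counter, high bits = physical Unix time in ms.
LOGICAL_BITS = 18
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tso_to_datetime(tso, utc_offset_hours=8):
    """Convert a TSO to a timezone-aware datetime (hypothetical helper)."""
    physical_ms = tso >> LOGICAL_BITS  # drop the logical counter
    local = timezone(timedelta(hours=utc_offset_hours))
    return (EPOCH + timedelta(milliseconds=physical_ms)).astimezone(local)


# Values copied from the BR error message in this thread.
gc_safepoint = 450408446345609216
backup_ts = 450405012801323009

safepoint_time = tso_to_datetime(gc_safepoint)
backup_time = tso_to_datetime(backup_ts)
gap = safepoint_time - backup_time

print(safepoint_time)  # 2024-06-12 13:57:48.689000+08:00
print(backup_time)     # 2024-06-12 10:19:30.757000+08:00
print(gap)             # 3:38:17.932000
```

The gap of about 3 hours 38 minutes is why a GC life time of 10 or 20 minutes could not cover this backup TS. Inside TiDB, the built-in TIDB_PARSE_TSO() SQL function should give the same conversion without any script.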

| username: lemonade010

Try setting it to 4 hours.

| username: TiDBer_CkS2lbTx

Judging from the error message, the data has indeed been garbage collected. Make sure the backup was not paused and that the backup log files are not duplicated. You can also try deleting all files left behind by the failed backup and then running the backup again.

| username: TiDBer_ydSkDlLw

SET GLOBAL tidb_gc_life_time = '300m';
After setting the GC life time to 5 hours and restarting the cluster, the backup runs normally.

What puzzles me is that when I change the GC life time to 10 minutes again and run the backup, it also completes normally.

| username: lemonade010

Check whether the GC task is running normally. Are there any anomalies?

| username: ziptoam

The backup failed because the GC (garbage collection) safepoint had advanced past the timestamp (TS) used by the backup. When TiDB performs a backup, it reads data as of a fixed TS, and that data must not be cleaned up by GC while the backup runs. This error occurs when the backup TS lags behind the GC safepoint, which is what "GC safepoint 450408446345609216 exceed TS 450405012801323009" reports.

| username: 小于同学

Is the GC task running normally?