BR Full Backup to S3 Error - br failed

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR全备到S3报错-br failed

| username: 开心大河马

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】Full backup; no other changes were made.
【Encountered Problem: Problem Phenomenon and Impact】
【Resource Configuration】
TiKV: physical machines, a single machine using around 2 TB; TiDB and PD are both virtual machines.
Backups go to cloud S3. The full backup now errors out (it ran normally for the past 2 days), while the log backup currently shows no errors.
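
For reference, the full backup is started with a command along these lines. This is only a sketch: the PD address, bucket, prefix, and endpoint below are placeholders, since the exact command was not posted.

```shell
# Hypothetical full-backup invocation; every address and bucket/prefix
# below is a placeholder, not the real value.
tiup br backup full \
    --pd "10.0.0.1:2379" \
    --storage "s3://db-tidb-backup01/xxx-backup/snapshot-2023-08-02_091001" \
    --s3.endpoint "https://obs.xxxx.com" \
    --ratelimit 50 \
    --log-file /tmp/br_full.log
# --ratelimit is in MiB/s per TiKV store; 50 MiB/s = 52428800 B/s, which
# matches the rateLimit=52428800 field in the BR logs below.
```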

【Attachments: Screenshots/Logs/Monitoring】
The backup errored out after 4 hours, before reaching the verification stage.

Log backup is still normal:
Starting component br: /home/tidb/.tiup/components/br/v7.1.0/br log status
Detail BR log in /tmp/br.log.2023-08-02T15.00.42+0800
● Total 1 Tasks.

#1 <
name: pitr
status: ● NORMAL
start: 2023-06-26 15:01:22.682 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: s3://xxxx/xxx-backup/log-backup
speed(est.): 1251.42 ops/s
checkpoint[global]: 2023-08-02 14:58:13.732 +0800; gap=2m30s

The error after 4 hours of full backup is as follows:

[2023/08/02 12:53:38.506 +08:00] [WARN] [s3.go:986] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Put \"https://obs.xxxx.com/db-tidb-backup01/xxx-backup/snapshot-2023-08-02_091001/checkpoints/backup/data/J7zMTyj9Mvw0nrQP25TuozAkfR7uSKGCCZRtqT-NBbE%3D_2103379633206288111.cpt\": EOF"] [backoff=1.921003575s]
[2023/08/02 12:54:08.568 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952048532] [expire-at=1690952348532]
[2023/08/02 12:58:08.411 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952288332] [expire-at=1690952588332]
[2023/08/02 13:02:08.418 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952528332] [expire-at=1690952828332]
[2023/08/02 13:06:08.621 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952768332] [expire-at=1690953068332]
[2023/08/02 13:08:50.668 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=467] [startKey=74800000000000005D5F69800000000000000600] [endKey=74800000000000005D5F698000000000000006FB] [take=15m13.11835447s]
[2023/08/02 13:08:50.668 +08:00] [ERROR] [client.go:1403] ["fail to backup"] [range-sn=467] [store-id=5] [StoreID=5] [retry=0] [stack="github.com/pingcap/tidb/br/pkg/backup.SendBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1403\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:94"]
[2023/08/02 13:08:50.668 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=468] [startKey=74800000000000005E5F7201] [endKey=74800000000000005E5F72FB] [rateLimit=52428800] [concurrency=4]
[2023/08/02 13:08:50.668 +08:00] [ERROR] [client.go:1403] ["fail to backup"] [range-sn=467] [store-id=6] [StoreID=6] [retry=0] [stack="github.com/pingcap/tidb/br/pkg/backup.SendBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1403\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:94"]
[2023/08/02 13:08:50.669 +08:00] [WARN] [backoff.go:218] ["unexpected error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:1313\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:920\ngithub.com/pingcap/tidb/br/pkg/conn/util.GetAllTiKVStores\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/util/util.go:39\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:83\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:56\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:80\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:889\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:848\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"]
[2023/08/02 13:08:50.669 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=468] [startKey=74800000000000005E5F7201] [endKey=74800000000000005E5F72FB] [take=531.372µs]
[2023/08/02 13:08:50.669 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=469] [startKey=74800000000000005E5F69800000000000000100] [endKey=74800000000000005E5F698000000000000001FB] [rateLimit=52428800] [concurrency=4]
[2023/08/02 13:08:50.669 +08:00] [WARN] [backoff.go:218] ["unexpected error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:1313\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:920\ngithub.com/pingcap/tidb/br/pkg/conn/util.GetAllTiKVStores\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/util/util.go:39\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:83\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:56\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:80\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:889\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:848\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"]
...
[2023/08/02 13:08:50.796 +08:00] [ERROR] [backup.go:54] ["failed to backup"] [error="error happen in store 6 at 10.xxx.xxx.64:20161: Io(Custom { kind: Other, error: \"failed to put object rusoto error timeout after 15mins for upload part in s3 storage\" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\nerror happen in store 6 at 10.xxx.xxx.64:20161: Io(Custom { kind: Other, error: \"failed to put object rusoto error timeout after 15mins for upload part in s3 storage\" })\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:222\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:934\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:848\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"] [stack="main.runBackupCommand\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/backup.go:54\nmain.newFullBackupCommand.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/backup.go:143\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:968\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/main.go:58\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
[2023/08/02 13:08:50.797 +08:00] [ERROR] [main.go:60] ["br failed"] [error="error happen in store 6 at 10.xxx.xxx.64:20161: Io(Custom { kind: Other, error: \"failed to put object rusoto error timeout after 15mins for upload part in s3 storage\" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\nerror happen in store 6 at 10.xxx.xxx.64:20161: Io(Custom { kind: Other, error: \"failed to put object rusoto error timeout after 15mins for upload part in s3 storage\" })\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:222\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:934\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:848\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/main.go:60\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

| username: tidb菜鸟一只 | Original post link

Can store 6 of this TiKV node access your S3 storage? Try testing it manually.
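
A minimal manual probe from that TiKV host could look like this. A sketch only: the endpoint, bucket, and prefix are placeholders, and it assumes the AWS CLI is installed and configured with the same credentials BR/TiKV uses.

```shell
# Run on the TiKV host behind store 6 (the 10.xxx.xxx.64 node in the error).
# 1. Is the OBS/S3 endpoint reachable at all?
curl -sv --max-time 10 -o /dev/null https://obs.xxxx.com

# 2. Can an object actually be uploaded with the same credentials?
echo probe > /tmp/br_probe.txt
aws s3 cp /tmp/br_probe.txt s3://db-tidb-backup01/xxx-backup/br_probe.txt \
    --endpoint-url https://obs.xxxx.com
```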

| username: zhanggame1 | Original post link

Check if there are any network issues with the TiKV backup.

| username: 开心大河马 | Original post link

The node that reported the error can connect to S3.

| username: zhanggame1 | Original post link

Are there any logs on the cloud S3 side that you can check?

| username: 开心大河马 | Original post link

I looked into the cloud services, and there were no error reports for the requests. It’s very difficult to investigate.

| username: ffeenn | Original post link

Have you adjusted the configuration for S3, such as connection concurrency limits? It’s possible that the data volume has reached a threshold that triggers these limits. If you can’t pinpoint the issue on that side, you might consider using BR’s checkpoint restore feature.
Checkpoint Restore | PingCAP Documentation Center
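
(Snapshot backup has a checkpoint mechanism as well; the `checkpoints/backup/...` objects and the "flush the checkpoint lock" lines in the logs above are exactly that data. On v6.5 and later, if a full backup is interrupted, re-running the identical command should resume from the checkpoint instead of starting over. A sketch, with placeholder values:)

```shell
# Re-run the failed backup command unchanged; BR v6.5+ finds the checkpoint
# data it already wrote under <storage>/checkpoints/ and skips the ranges
# that were fully uploaded. All values below are placeholders.
tiup br backup full \
    --pd "10.0.0.1:2379" \
    --storage "s3://db-tidb-backup01/xxx-backup/snapshot-2023-08-02_091001"
```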

| username: redgame | Original post link

Is there any configuration issue with S3 storage?

| username: 开心大河马 | Original post link

Today, I reinitiated the backup process. It has been running for about half an hour, and files are being uploaded successfully.

I reviewed yesterday’s logs, and the point where it actually broke is shown below.
Are `backup range` and `backup push down` two separate steps?
Does `backup range` mean the backup of the range itself completed, and `backup push down` the transfer of the backup data to S3?

The backup of range-sn=465 was normal, including the step that flushes the checkpoint lock:
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=465]
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=465] [store-id=1] ["retry time"=0]
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=465] [store-id=5] ["retry time"=0]
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=465] [store-id=3] ["retry time"=0]
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=465] [store-id=4] ["retry time"=0]
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=465] [store-id=6] ["retry time"=0]
[2023/08/02 12:49:06.882 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=465] [store-id=2] ["retry time"=0]
[2023/08/02 12:50:08.488 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690951808432] [expire-at=1690952108432]
[2023/08/02 12:50:52.594 +08:00] [INFO] [client.go:939] ["backup push down completed"] [range-sn=465] [small-range-count=1219]
[2023/08/02 12:50:52.595 +08:00] [INFO] [client.go:956] ["transactional range backup completed"] [range-sn=465] [StartTS=0] [EndTS=443269409376043097]
[2023/08/02 12:50:52.596 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=465] [startKey=74800000000000005D5F69800000000000000300] [endKey=74800000000000005D5F698000000000000003FB] [take=1m45.714702061s]
[2023/08/02 12:50:52.596 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=466] [startKey=74800000000000005D5F69800000000000000400] [endKey=74800000000000005D5F698000000000000004FB] [rateLimit=52428800] [concurrency=4]

The backup of range-sn=466 was also normal:
[2023/08/02 12:50:52.596 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=466] [startKey=74800000000000005D5F69800000000000000400] [endKey=74800000000000005D5F698000000000000004FB] [rateLimit=52428800] [concurrency=4]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=466]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=466] [store-id=2] ["retry time"=0]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=466] [store-id=1] ["retry time"=0]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=466] [store-id=4] ["retry time"=0]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=466] [store-id=6] ["retry time"=0]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=466] [store-id=5] ["retry time"=0]
[2023/08/02 12:50:52.597 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=466] [store-id=3] ["retry time"=0]
[2023/08/02 12:53:37.547 +08:00] [INFO] [client.go:939] ["backup push down completed"] [range-sn=466] [small-range-count=1884]
[2023/08/02 12:53:37.547 +08:00] [INFO] [client.go:956] ["transactional range backup completed"] [range-sn=466] [StartTS=0] [EndTS=443269409376043097]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=466] [startKey=74800000000000005D5F69800000000000000400] [endKey=74800000000000005D5F698000000000000004FB] [take=2m44.953455046s]

The backup of range-sn=467 hit the error: `backup range completed` was logged, but there was no `backup push down completed` log.
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=467] [startKey=74800000000000005D5F69800000000000000600] [endKey=74800000000000005D5F698000000000000006FB] [rateLimit=52428800] [concurrency=4]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=467]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=467] [store-id=1] ["retry time"=0]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=467] [store-id=4] ["retry time"=0]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=467] [store-id=6] ["retry time"=0]
[2023/08/02 12:53:37.551 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=467] [store-id=3] ["retry time"=0]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=467] [store-id=2] ["retry time"=0]
[2023/08/02 12:53:37.550 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=467] [store-id=5] ["retry time"=0]
[2023/08/02 12:53:38.506 +08:00] [WARN] [s3.go:986] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Put \"https://obs.xxx.com/db-tidb-backup01/xxx-backup/snapshot-2023-08-02_091001/checkpoints/backup/data/J7zMTyj9Mvw0nrQP25TuozAkfR7uSKGCCZRtqT-NBbE%3D_2103379633206288111.cpt\": EOF"] [backoff=1.921003575s]
[2023/08/02 12:54:08.568 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952048532] [expire-at=1690952348532]
[2023/08/02 12:58:08.411 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952288332] [expire-at=1690952588332]
[2023/08/02 13:02:08.418 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952528332] [expire-at=1690952828332]
[2023/08/02 13:06:08.621 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1690952768332] [expire-at=1690953068332]
[2023/08/02 13:08:50.668 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=467] [startKey=74800000000000005D5F69800000000000000600] [endKey=74800000000000005D5F698000000000000006FB] [take=15m13.11835447s]
[2023/08/02 13:08:50.668 +08:00] [ERROR] [client.go:1403] ["fail to backup"] [range-sn=467] [store-id=5] [StoreID=5] [retry=0] [stack="github.com/pingcap/tidb/br/pkg/backup.SendBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1403\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:94"]

However, range-sn=468 and 469 show the same pattern: `backup range completed` was logged, but there is no `backup push down started` log.
[2023/08/02 13:08:50.668 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=468] [startKey=74800000000000005E5F7201] [endKey=74800000000000005E5F72FB] [rateLimit=52428800] [concurrency=4]
[2023/08/02 13:08:50.668 +08:00] [ERROR] [client.go:1403] ["fail to backup"] [range-sn=467] [store-id=6] [StoreID=6] [retry=0] [stack="github.com/pingcap/tidb/br/pkg/backup.SendBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:1403\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:94"]
[2023/08/02 13:08:50.669 +08:00] [WARN] [backoff.go:218] ["unexpected error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:1313\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:920\ngithub.com/pingcap/tidb/br/pkg/conn/util.GetAllTiKVStores\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/util/util.go:39\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:83\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:56\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:80\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:889\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:848\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"]
[2023/08/02 13:08:50.669 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=468] [startKey=74800000000000005E5F7201] [endKey=74800000000000005E5F72FB] [take=531.372µs]
[2023/08/02 13:08:50.669 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=469] [startKey=74800000000000005E5F69800000000000000100] [endKey=74800000000000005E5F698000000000000001FB] [rateLimit=52428800] [concurrency=4]
[2023/08/02 13:08:50.669 +08:00] [WARN] [backoff.go:218] ["unexpected error, stop to retry"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).respForErr\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:1313\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230519121736-d15a686a670e/client.go:920\ngithub.com/pingcap/tidb/br/pkg/conn/util.GetAllTiKVStores\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/util/util.go:39\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:83\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:56\ngithub.com/pingcap/tidb/br/pkg/conn.GetAllTiKVStoresWithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/conn/conn.go:80\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:889\ngithub.com/pingcap/tid

| username: 开心大河马 | Original post link

The connectivity tests are all passing now, so I’m not sure whether there was a problem at that moment. If it fails again later, I’ll try a point-in-time backup.

| username: 开心大河马 | Original post link

An update:
It is still unclear whether the issue lies with the network or with S3, because today there were only WARN-level retries, and the retries succeeded. It seems unrelated to TiDB itself.

Today’s WARN entries resolved themselves after retrying; no other changes were made:

[2023/08/03 10:49:29.424 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=206] [startKey=74800000000000008D5F69800000000000000400] [endKey=74800000000000008D5F698000000000000004FB] [rateLimit=52428800] [concurrency=4]
[2023/08/03 10:49:29.424 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=206]
[2023/08/03 10:49:29.425 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=206] [store-id=5] ["retry time"=0]
[2023/08/03 10:49:29.425 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=206] [store-id=2] ["retry time"=0]
[2023/08/03 10:49:29.425 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=206] [store-id=3] ["retry time"=0]
[2023/08/03 10:49:29.425 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=206] [store-id=6] ["retry time"=0]
[2023/08/03 10:49:29.425 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=206] [store-id=1] ["retry time"=0]
[2023/08/03 10:49:29.425 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=206] [store-id=4] ["retry time"=0]
[2023/08/03 10:50:08.692 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1691031008632] [expire-at=1691031308632]
[2023/08/03 10:53:29.605 +08:00] [INFO] [client.go:939] ["backup push down completed"] [range-sn=206] [small-range-count=2353]
[2023/08/03 10:53:29.605 +08:00] [INFO] [client.go:956] ["transactional range backup completed"] [range-sn=206] [StartTS=0] [EndTS=443292058984644871]
[2023/08/03 10:53:29.609 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=206] [startKey=74800000000000008D5F69800000000000000400] [endKey=74800000000000008D5F698000000000000004FB] [take=4m0.185079912s]
[2023/08/03 10:53:29.609 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=207] [startKey=74800000000000008D5F69800000000000000500] [endKey=74800000000000008D5F698000000000000005FB] [rateLimit=52428800] [concurrency=4]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=207]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=207] [store-id=4] ["retry time"=0]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=207] [store-id=5] ["retry time"=0]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=207] [store-id=6] ["retry time"=0]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=207] [store-id=1] ["retry time"=0]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=207] [store-id=3] ["retry time"=0]
[2023/08/03 10:53:29.610 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=207] [store-id=2] ["retry time"=0]
[2023/08/03 10:54:08.838 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1691031248782] [expire-at=1691031548782]
[2023/08/03 10:55:08.617 +08:00] [WARN] [s3.go:986] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Put \"https://obs.xxx/db-tidb-xxxx/xxxx-backup/snapshot-2023-08-03_091001/checkpoints/backup/data/hNlORZx9eiNtKboz7ZYDqDfGSjvE68_QUOIFEKIhs6M%3D_17282580143202904324.cpt\": EOF"] [backoff=1.385704497s]
[2023/08/03 10:57:42.930 +08:00] [INFO] [client.go:939] ["backup push down completed"] [range-sn=207] [small-range-count=1888]
[2023/08/03 10:57:42.930 +08:00] [INFO] [client.go:956] ["transactional range backup completed"] [range-sn=207] [StartTS=0] [EndTS=443292058984644871]
[2023/08/03 10:57:42.933 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=207] [startKey=74800000000000008D5F69800000000000000500] [endKey=74800000000000008D5F698000000000000005FB] [take=4m13.32398707s]
[2023/08/03 10:57:42.933 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=208] [startKey=74800000000000008F5F7201] [endKey=74800000000000008F5F72FB] [rateLimit=52428800] [concurrency=4]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=208]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=208] [store-id=5] ["retry time"=0]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=208] [store-id=6] ["retry time"=0]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=208] [store-id=1] ["retry time"=0]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=208] [store-id=4] ["retry time"=0]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=208] [store-id=2] ["retry time"=0]
[2023/08/03 10:57:42.934 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=208] [store-id=3] ["retry time"=0]
[2023/08/03 10:58:08.692 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1691031488632] [expire-at=1691031788632]
[2023/08/03 11:02:08.806 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1691031728782] [expire-at=1691032028782]
[2023/08/03 11:03:08.728 +08:00] [WARN] [s3.go:986] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Put \"https://obs.xxx/db-tidb-xxxx/xxxx-backup/snapshot-2023-08-03_091001/checkpoints/backup/data/c6W4HZU2YWRyDMJrQIqf-eShtUDdiweeX-pt1COmN6g%3D_17207526156971781154.cpt\": write tcp 10.249.145.41:50836->121.36.235.132:443: use of closed network connection"] [backoff=1.85872139s]
[2023/08/03 11:04:08.689 +08:00] [WARN] [s3.go:986] ["failed to request s3, retrying"] [error="RequestError: send request failed\ncaused by: Put \"https://obs.xxx/db-tidb-xxxx/xxxx-backup/snapshot-2023-08-03_091001/checkpoints/backup/data/9CbB3udHMDMv2CpAzPGlMo6-Ak4SZ59SjBSIAFTn2zc%3D_3623562631776311930.cpt\": EOF"] [backoff=1.62683036s]
[2023/08/03 11:06:08.606 +08:00] [INFO] [checkpoint.go:610] ["start to flush the checkpoint lock"] [lock-at=1691031968582] [expire-at=1691032268582]
[2023/08/03 11:07:00.391 +08:00] [INFO] [client.go:939] ["backup push down completed"] [range-sn=208] [small-range-count=4360]
[2023/08/03 11:07:00.392 +08:00] [INFO] [client.go:956] ["transactional range backup completed"] [range-sn=208] [StartTS=0] [EndTS=443292058984644871]
[2023/08/03 11:07:00.406 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=208] [startKey=74800000000000008F5F7201] [endKey=74800000000000008F5F72FB] [take=9m17.472518237s]
[2023/08/03 11:07:00.406 +08:00] [INFO] [client.go:884] ["backup range started"] [range-sn=209] [startKey=74800000000000008F5F69800000000000000100] [endKey=74800000000000008F5F698000000000000001FB] [rateLimit=52428800] [concurrency=4]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:911] ["backup push down started"] [range-sn=209]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=209] [store-id=3] ["retry time"=0]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=209] [store-id=5] ["retry time"=0]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=209] [store-id=4] ["retry time"=0]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=209] [store-id=2] ["retry time"=0]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=209] [store-id=1] ["retry time"=0]
[2023/08/03 11:07:00.407 +08:00] [INFO] [client.go:1389] ["try backup"] [range-sn=209] [store-id=6] ["retry time"=0]
[2023/08/03 11:07:00.419 +08:00] [INFO] [client.go:939] ["backup push down completed"] [range-sn=209] [small-range-count=1]
[2023/08/03 11:07:00.419 +08:00] [INFO] [client.go:956] ["transactional range backup completed"] [range-sn=209] [StartTS=0] [EndTS=443292058984644871]
[2023/08/03 11:07:00.419 +08:00] [INFO] [client.go:876] ["backup range completed"] [range-sn=209] [startKey=74800000000000008F5F69800000000000000100] [endKey=74800000000000008F5F698000000000000001FB] [take=13.235186ms]

Additionally, I noticed other WARN messages during the backup indicating missing S3 permissions. The permissions were added after the backup completed, and we will keep monitoring for further errors. For reference:
[2023/07/30 09:10:03.417 +08:00] [WARN] [s3.go:474] ["failed to check object lock for bucket"] [bucket=db-tidb-xxx] [error="AccessDenied: Access Denied\n\tstatus code: 403, request id: 00000189A4586144950E8Exxxx3, host id: 32AAAQAAEAABAAAQAAEAxxxxwJEgTjC3"]
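
That 403 is BR probing the bucket’s object-lock configuration. The same probe can be reproduced outside BR to confirm which permission is missing (a sketch: the bucket and endpoint are placeholders, and it assumes the AWS CLI with the same credentials):

```shell
# Reproduce BR's object-lock check that got AccessDenied above; if this also
# returns 403, the credentials lack the object-lock read permission.
aws s3api get-object-lock-configuration \
    --bucket db-tidb-xxx \
    --endpoint-url https://obs.xxxx.com
```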

| username: zhanggame1 | Original post link

Network storage troubleshooting still heavily relies on the storage provider.

| username: 开心大河马 | Original post link

In the end, it was still an issue with the cloud S3 permissions. Currently, the necessary multipart upload permissions have been granted, and there are no more errors. If similar issues arise in the future, you can check whether the corresponding cloud S3 multipart upload permissions are sufficient.

The specific method:
If the BR backup logs do not show specifics, check tikv.log around the time of the error reported in the backup logs. It contains a detailed request ID, which the cloud S3 provider can use to quickly pinpoint the cause of the error.
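
As a rough illustration of both checks (the log path, timestamp, bucket, and endpoint are placeholders; the multipart test assumes the AWS CLI is configured with the same credentials):

```shell
# 1. On the TiKV node named in the BR error (store 6), pull the S3 request
#    IDs logged around the failure time and hand them to the S3 provider.
grep "2023/08/02 13:0" /path/to/deploy/log/tikv.log | grep -iE "s3|request.?id|rusoto"

# 2. Exercise the multipart-upload permission directly.
aws s3api create-multipart-upload --bucket db-tidb-backup01 --key br_probe \
    --endpoint-url https://obs.xxxx.com
# If it succeeds, clean up with abort-multipart-upload and the returned UploadId.
```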

| username: 开心大河马 | Original post link

Yes, in the end it was indeed insufficient storage permissions. Troubleshooting it was quite a hassle.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.