BR Backup to S3 Failed [BR:KV:ErrKVStorage] TiKV Storage Encountered I/O Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Br备份S3失败[BR:KV:ErrKVStorage]tikv storage occur I/O error

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.4
[Reproduction Path] Backup to S3, backup failed. Failed 3 times in a row, backing up multiple databases to the same S3 path, some databases succeeded, some failed. The failed databases are relatively large, 3-5TB.

[2023/08/23 00:21:29.453 +08:00] [ERROR] [main.go:58] ["br failed"] [error="[BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:552\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\tgithub.com/pingcap/br@/pkg/backup/client.go:486\ngithub.com/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\tgithub.com/pingcap/br@/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"] [stack="main.main\n\tgithub.com/pingcap/br@/cmd/br/main.go:58\nruntime.main\n\truntime/proc.go:203"]
| username: Kongdom | Original post link

Could you please provide the complete log? There could be multiple reasons causing this error.

| username: tidb菜鸟一只 | Original post link

Could you share the command? Which S3 are you using? Also, please provide the full error message.

| username: TiDBer_yyy | Original post link

AWS S3.

  • Command
tiup br:v5.0.4 backup db --pd "xxxxx:2379" --storage "s3://acc-hk-exchange-tidb-backups/tidb_backup/prod_datacenter_backups/prod_datacenter_broker_20230802" --db db --s3.region "ap-east-1" --send-credentials-to-tikv=true --ratelimit 128 --log-file backup_db_db.log
| username: TiDBer_yyy | Original post link

Bro, I’ve posted the logs below.

| username: tidb菜鸟一只 | Original post link

I found a similar post
br backup failure issue [pd] failed updateMember - TiDB technical issues - TiDB Q&A community (asktug.com)
Try adding --send-credentials-to-tikv=true, which passes the S3 access credentials to the TiKV nodes. This setting sometimes resolves the issue; you can give it a try.

| username: Kongdom | Original post link

I also think the failure is caused by the failed updateMember error.

| username: TiDBer_yyy | Original post link

The command has already included this parameter.

| username: tidb菜鸟一只 | Original post link

It should still be an issue with PD. When BR performs a backup, it needs to request the leader from PD, but it failed to obtain it. Does version 5.0 not retry? Check whether the node at 172.16.25.123:2379 is the PD leader; if not, try switching to the leader. Also, is there a network issue between the machine running the BR backup and the node at 172.16.25.123:2379?
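One way to check which PD node is the leader is to query PD's HTTP API (`/pd/api/v1/members`) and look at the `leader` field. A minimal sketch in Go; the sample JSON below is illustrative, and in practice the body would come from an `http.Get` against a PD endpoint:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// membersResponse models only the part of the /pd/api/v1/members
// response needed here: the current leader.
type membersResponse struct {
	Leader struct {
		Name       string   `json:"name"`
		ClientUrls []string `json:"client_urls"`
	} `json:"leader"`
}

// sample is an illustrative response body; the member names are
// made up for this example.
var sample = []byte(`{"leader": {"name": "pd-123", "client_urls": ["http://172.16.25.123:2379"]}}`)

// leaderAddr extracts the leader's first client URL from a
// members-API response body.
func leaderAddr(body []byte) (string, error) {
	var resp membersResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	return resp.Leader.ClientUrls[0], nil
}

func main() {
	addr, err := leaderAddr(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("PD leader:", addr)
}
```

The same check can be done with `pd-ctl member` or `curl` against any PD node; every PD member reports the same leader.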

| username: TiDBer_yyy | Original post link

It is not the leader. The databases are relatively large, and historically these backups have failed nearly 100% of the time. More than 40 databases were backed up in total, and 4 of them failed.

| username: TiDBer_yyy | Original post link

The PD node written in the command line is 172.16.25.151:2379.

| username: tidb菜鸟一只 | Original post link

However, in your log, it shows [address=http://172.16.25.123:2379]

| username: TiDBer_yyy | Original post link

Why is this happening? Why is it connecting to the 123 node?

| username: tidb菜鸟一只 | Original post link

You didn't specify it, so TiKV must have obtained the leader from PD on its own. But why it chose this node…

| username: GreenGuan | Original post link

The issue I encountered before was that the error codes returned by S3 (self-built) and those expected by the br tool were inconsistent, leading to frequent failures. The solutions are:

  1. Recompile the br tool to increase the following retry settings: NumMaxRetries, MinRetryDelay, MinThrottleDelay.
  2. Modify the S3 return codes to match what the br tool expects.
// defaultS3Retryer builds the retryer BR uses for S3 requests;
// raising NumMaxRetries and the delay floors makes BR more
// tolerant of transient S3 errors.
func defaultS3Retryer() request.Retryer {
    return retryerWithLog{
        DefaultRetryer: client.DefaultRetryer{
            NumMaxRetries:    maxRetries,
            MinRetryDelay:    1 * time.Second,
            MinThrottleDelay: 2 * time.Second,
        },
    }
}
| username: TiDBer_yyy | Original post link

Excellent. I'm not very confident about compiling it myself. Which version are you running?

| username: TiDBer_yyy | Original post link

Additional error information

[2023/08/28 15:34:36.338 +08:00] [INFO] [collector.go:67] ["Database backup failed summary"] [total-ranges=3834] [ranges-succeed=3632] [ranges-failed=202] [backup-total-ranges=3834] [unit-name="range start:7480000000000024f45f69800000000000000e00 end:7480000000000024f45f69800000000000000efb"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\ngithub.com/tikv/pd/client.(*client).GetAllStores\n\tgithub.com/tikv/pd@v1.1.0-beta.0.20210323123936-c8fa72502f16/client/client.go:1196\ngithub.com/pingcap/br/pkg/conn.GetAllTiKVStores\n\tgithub.com/pingcap/br@/pkg/conn/conn.go:140
[2023/08/28 15:34:36.341 +08:00] [ERROR] [backup.go:41] ["failed to backup"] [error="[BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error
[2023/08/28 15:34:36.341 +08:00] [ERROR] [main.go:58] ["br failed"] [error="[BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error
| username: TiDBer_yyy | Original post link

There are also some warning messages:

[2023/08/21 10:50:04.297 +08:00] [WARN] [push.go:157] ["backup occur region error"] [error="{\"RegionError\":{\"message\":\"peer is not leader for region 736943940, leader may Some(id: 1147730709 store_id: 19)\",\"not_leader\":{\"region_id\":736943940,\"leader\":{\"id\":1147730709,\"store_id\":19}}}}"]
| username: TiDBer_yyy | Original post link

After making the change, the tikv no-leader error still appears.

backup_db_stat_err.log (197.3 KB)

| username: TiDBer_yyy | Original post link

Could you please take a look?