BR Backup Fails Due to Checksum Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR 备份 checksum 失败导致备份失败

| username: liuhan907

[TiDB Usage Environment]
Production Environment

[TiDB Version]
v5.4.0

[Reproduction Path]
Back up the database with the following command:

br backup full \
  --pd 10.0.7.114:2379 \
  --storage "s3://[******]/data_2022-11-21-17-48?access-key=[******]&secret-access-key=[******]" \
  --s3.region ap-guangzhou \
  --s3.endpoint http://cos.ap-guangzhou.myqcloud.com \
  --send-credentials-to-tikv=true \
  --ratelimit 128 \
  --log-file /data/[******]/tidb/tidb_br_home/log/data_2022-11-21-17-48_backuptable.log
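One detail in this command: the --storage URL contains `&`, and an unquoted shell command would end at that character and run in the background, so the secret key and all the following flags would never reach br. A minimal Python sketch that sidesteps shell parsing by passing the same flags (and only those) as an argument list to subprocess. build_backup_args and run_backup are hypothetical helpers, not part of BR, and the bucket, keys and log path are placeholders.

```python
import shlex
import subprocess


def build_backup_args(pd, storage_url, region, endpoint, ratelimit, log_file):
    """Build the `br backup full` argv using the flags from the post."""
    return [
        "br", "backup", "full",
        "--pd", pd,
        # Passed as a single argv element, so `&` and `?` need no shell quoting.
        "--storage", storage_url,
        "--s3.region", region,
        "--s3.endpoint", endpoint,
        "--send-credentials-to-tikv=true",
        "--ratelimit", str(ratelimit),
        "--log-file", log_file,
    ]


def run_backup(args):
    """Run br without a shell and return its exit code."""
    return subprocess.run(args, check=False).returncode


if __name__ == "__main__":
    args = build_backup_args(
        pd="10.0.7.114:2379",
        # Placeholder bucket, prefix and keys.
        storage_url="s3://BUCKET/PREFIX?access-key=AK&secret-access-key=SK",
        region="ap-guangzhou",
        endpoint="http://cos.ap-guangzhou.myqcloud.com",
        ratelimit=128,
        log_file="/tmp/br_backup.log",  # hypothetical path
    )
    # Print the equivalent, correctly quoted shell command; call
    # run_backup(args) to actually start the backup.
    print(shlex.join(args))
```

shlex.join wraps the storage URL in single quotes, which is the same fix needed when typing the command directly into a shell.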

[Problem Encountered: Symptoms and Impact]
The backup failed with the following error:

[2022/11/21 14:57:04.487 +08:00] [ERROR] [global.go:46] ["checksum mismatch"] [db=lucifer-cn] [table=PlayerItems] ["origin tidb crc64"=1126378096669210666] ["calculated crc64"=6933529848215176403] ["origin tidb total kvs"=12480622] ["calculated total kvs"=12480621] ["origin tidb total bytes"=698822936] ["calculated total bytes"=698822898] [stack="github.com/pingcap/log.Error
\t/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20221015072633-39906604fb81/global.go:46
github.com/pingcap/tidb/br/pkg/checksum.FastChecksum
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/checksum/validate.go:70
github.com/pingcap/tidb/br/pkg/task.RunBackup
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/task/backup.go:550
main.runBackupCommand
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/backup.go:48
main.newFullBackupCommand.func1
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/backup.go:117
github.com/spf13/cobra.(*Command).execute
\t/go/pkg/mod/github.com/spf13/cobra@v1.5.0/command.go:872
github.com/spf13/cobra.(*Command).ExecuteC
\t/go/pkg/mod/github.com/spf13/cobra@v1.5.0/command.go:990
github.com/spf13/cobra.(*Command).Execute
\t/go/pkg/mod/github.com/spf13/cobra@v1.5.0/command.go:918
main.main
\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/main.go:57
runtime.main
\t/usr/local/go/src/runtime/proc.go:250"]
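The counters in this error can be compared mechanically. A minimal Python sketch, assuming the BR log field layout shown above ([key=value] and ["quoted key"=value]); parse_br_fields and checksum_diff are illustrative helpers, not BR APIs. Reading the "origin tidb" values as the table checksum and the "calculated" values as the totals aggregated from the backup files is an assumption based on the field names.

```python
import re

# Matches [key=value] or ["quoted key"=value]; the value stops at the next "]".
FIELD_RE = re.compile(r'\["?([^"\]=]+)"?=([^\]]*)\]')


def parse_br_fields(line):
    """Return the key=value fields of one BR log line as a dict of strings."""
    return dict(FIELD_RE.findall(line))


def checksum_diff(fields):
    """Subtract the 'calculated' counters from the 'origin tidb' counters."""
    diff = {}
    for metric in ("total kvs", "total bytes"):
        origin = int(fields["origin tidb " + metric])
        calculated = int(fields["calculated " + metric])
        diff[metric] = origin - calculated
    diff["crc64 match"] = fields["origin tidb crc64"] == fields["calculated crc64"]
    return diff


if __name__ == "__main__":
    # The error line from this thread, without the stack field.
    line = (
        '[2022/11/21 14:57:04.487 +08:00] [ERROR] [global.go:46] ["checksum mismatch"] '
        '[db=lucifer-cn] [table=PlayerItems] '
        '["origin tidb crc64"=1126378096669210666] ["calculated crc64"=6933529848215176403] '
        '["origin tidb total kvs"=12480622] ["calculated total kvs"=12480621] '
        '["origin tidb total bytes"=698822936] ["calculated total bytes"=698822898]'
    )
    print(checksum_diff(parse_br_fields(line)))
    # {'total kvs': 1, 'total bytes': 38, 'crc64 match': False}
```

For this log the two sides differ by exactly one KV pair and 38 bytes, which is consistent with a single key-value pair being counted on one side but not the other.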

| username: luancheng

Is the original cluster's data still available?
If so, manually run ADMIN CHECKSUM TABLE `lucifer-cn`.`PlayerItems`; and check whether the output matches the values recorded in the BR log.
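A fresh ADMIN CHECKSUM TABLE result carries three values (Checksum_crc64_xor, Total_kvs, Total_bytes) that can be lined up against the two sides of the BR error. A small Python sketch, assuming that column order; AdminChecksumRow and compare_with_br_log are hypothetical names, and the row in the usage example is made up rather than real output. If the table has received writes since the backup, neither side will match.

```python
from typing import Mapping, NamedTuple, Optional, Tuple


class AdminChecksumRow(NamedTuple):
    """One row of ADMIN CHECKSUM TABLE output (column order assumed)."""
    db_name: str
    table_name: str
    checksum_crc64_xor: int
    total_kvs: int
    total_bytes: int


# (crc64, total kvs, total bytes) copied from the BR error log in this thread.
BR_LOG: Mapping[str, Tuple[int, int, int]] = {
    "origin tidb": (1126378096669210666, 12480622, 698822936),
    "calculated": (6933529848215176403, 12480621, 698822898),
}


def compare_with_br_log(row: AdminChecksumRow,
                        br_log: Mapping[str, Tuple[int, int, int]] = BR_LOG) -> Optional[str]:
    """Return which side of the BR log the fresh checksum matches, or None."""
    fresh = (row.checksum_crc64_xor, row.total_kvs, row.total_bytes)
    for side, values in br_log.items():
        if fresh == values:
            return side
    return None


if __name__ == "__main__":
    # Hypothetical row; in practice read it from a MySQL client after running
    # ADMIN CHECKSUM TABLE `lucifer-cn`.`PlayerItems`;
    row = AdminChecksumRow("lucifer-cn", "PlayerItems",
                           1126378096669210666, 12480622, 698822936)
    print(compare_with_br_log(row))  # origin tidb
```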

| username: liuhan907

Since this is a production server, the data has already been cleaned up as part of the repair.

| username: luancheng

Could you please send a complete BR log?

This error means that the table checksum ("origin tidb total kvs") counted 12,480,622 keys, while the backup files ("calculated total kvs") add up to only 12,480,621, so one key is missing from the backup. Can you confirm how many keys the table should actually have, for example from its row count and number of indexes?
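To judge which count is right, it helps to see how table-level totals like these are formed. A simplified Python sketch, under the assumption that BR's fast checksum XORs the per-file crc64 values and sums the per-file KV and byte counts; BackupFile, aggregate and expected_kvs are illustrative, not BR code. expected_kvs gives only a rough expected KV count, assuming each row is one record KV plus one entry per index (a clustered primary key is not a separate index).

```python
from dataclasses import dataclass
from functools import reduce
from operator import xor


@dataclass
class BackupFile:
    # Hypothetical per-file metadata; BR records similar counters per file.
    crc64_xor: int
    total_kvs: int
    total_bytes: int


def aggregate(files):
    """Combine per-file checksums into table-level (crc64, kvs, bytes).

    XOR is order-independent, so the table crc64 is the XOR of the files'
    crc64 values; KV and byte counts simply add up.
    """
    crc = reduce(xor, (f.crc64_xor for f in files), 0)
    kvs = sum(f.total_kvs for f in files)
    size = sum(f.total_bytes for f in files)
    return crc, kvs, size


def expected_kvs(rows, indexes):
    """Rough KV count: one record KV per row plus one entry per index per row."""
    return rows * (1 + indexes)


if __name__ == "__main__":
    files = [BackupFile(0b1010, 2, 40), BackupFile(0b0110, 1, 18)]
    print(aggregate(files))                    # (12, 3, 58)
    print(expected_kvs(rows=1000, indexes=2))  # 3000
```

Because the totals are plain sums and an XOR, a single missing or extra key shifts the KV count by one and the byte count by that key's size, as seen in this thread's log.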

| username: YuJuncen

Are there any other ERROR-level messages in the log? (If there are many, the first few are enough.) BR in those versions has some rather peculiar issues and may report a checksum mismatch when the backup has actually failed for a different reason.
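Pulling the first few ERROR entries out of a large BR log is easy to script. A minimal Python sketch, assuming the [timestamp] [LEVEL] [file:line] prefix seen in the log above; first_errors is a hypothetical helper. Continuation lines of a multi-line stack trace do not carry the level field, so only the first line of each entry is returned.

```python
import sys
from itertools import islice


def first_errors(lines, limit=5):
    """Return up to `limit` log lines whose level field is [ERROR]."""
    errors = (line.rstrip("\n") for line in lines if "] [ERROR] [" in line)
    return list(islice(errors, limit))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python first_errors.py /path/to/br.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for entry in first_errors(log):
            print(entry)
```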

| username: liuhan907

Here is the log of the backup and restore process.

data_2022-11-21-14-52_backuptable.zip (704.0 KB)

| username: liuhan907

The backup and restore logs have been posted above.

| username: YuJuncen

This looks very strange. There are no other ERROR logs showing that the backup failed, so it seems some edge case was triggered that prevented one key-value pair from being backed up correctly. Can you try the backup again to see whether it succeeds?

| username: tidb狂热爱好者

Could this be because the application workload was not stopped? Try stopping the workload and then running the backup again.

| username: liuhan907

I haven’t tried backing up again, so I don’t know whether a retry would succeed.