Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiDB 备份工具问题 (TiDB backup tool issue)
I encountered the following error when using the backup tool br to back up to a self-built S3 storage:
- What is the cause of this problem?
- How does the br tool handle a 4xx error?
TiDB version is 5.4.2
[BR:KV:ErrKVStorage]tikv storage occur I/O error\nerror happen in store 7181537 at xxx:20160: Io(Custom { kind: Other, error: \"failed to put object rusoto error Request ID: None Body: <Error><Code>InvalidPart</Code><Message>One or more of the specified parts could not be found. The part might not have been uploaded, or the specified entity tag might not have matched the part's entity tag.</Message>
This looks somewhat similar to this question; you can take a look~
It's not always unusable; the issue occurs only occasionally. I asked colleagues on the company's S3 team, and they suspect that after an uploadPart request fails in BR, incorrect part information is passed to the subsequent completeMultipartUpload call, which causes the failure. So I'd like to ask the community whether any users have encountered similar errors, and it would be great if the developers could briefly explain the multipart upload logic of the BR tool.
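To make the suspected failure mode concrete, here is a minimal sketch in Python. It is a toy in-memory model of S3 multipart upload semantics, not BR's code and not a real S3 client; the class `ToyMultipartStore`, the exception `InvalidPart`, and the object key are all hypothetical names chosen for illustration. It shows how completing an upload with a part list that does not match what the server actually stored is rejected with an InvalidPart-style error.

```python
import hashlib


class InvalidPart(Exception):
    """Stands in for the S3 `InvalidPart` error code (an HTTP 400 response)."""


class ToyMultipartStore:
    """Toy in-memory model of S3 multipart upload semantics (assumption, not a real client)."""

    def __init__(self):
        self._uploads = {}  # upload_id -> {part_number: (etag, data)}
        self._keys = {}     # upload_id -> object key
        self.objects = {}   # object key -> assembled bytes
        self._next_id = 0

    def create_multipart_upload(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self._uploads[upload_id] = {}
        self._keys[upload_id] = key
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Toy ETag: a hash of the part body. Real services compute ETags their own way.
        etag = hashlib.sha256(data).hexdigest()
        self._uploads[upload_id][part_number] = (etag, data)
        return etag

    def complete_multipart_upload(self, upload_id, parts):
        """`parts` is the client's list of (part_number, etag) pairs."""
        stored = self._uploads[upload_id]
        body = b""
        for part_number, etag in parts:
            # Reject the request if a listed part is missing or its ETag differs.
            if part_number not in stored or stored[part_number][0] != etag:
                raise InvalidPart(f"part {part_number} missing or ETag mismatch")
            body += stored[part_number][1]
        self.objects[self._keys[upload_id]] = body
        del self._uploads[upload_id]


store = ToyMultipartStore()
upload_id = store.create_multipart_upload("backup/example.sst")
etag1 = store.upload_part(upload_id, 1, b"first part")
# Suppose the request for part 2 failed, but the client still recorded an entry for it.
try:
    store.complete_multipart_upload(upload_id, [(1, etag1), (2, "stale-etag")])
except InvalidPart as err:
    print("complete failed:", err)
```

In this model the server never trusts the client's part list; any mismatch between that list and the stored parts fails the whole completion request, which matches the wording of the error message in the log above.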
Are there any other nodes with a pending offline status?
Here are a few similar issues with the same error; you can refer to them~
The TiKV nodes of this cluster did not show any anomalies during the backup. This I/O error has been reported before, but it seems more likely to be a problem with either BR or S3.
The BR tool supports the standard S3 protocol, but other vendors or self-built S3-like clusters may have some incompatibilities. What kind of S3 are you using?
Scenario: BR backup data to S3
Issue: When BR submitted the multipart upload for completion, some parts were missing, which resulted in a 4xx error and caused the entire backup task to exit.
Expectation from the community: Please help determine whether there is a bug in this part of the logic that causes some parts to be recorded incorrectly. Is it reasonable for the entire backup task to exit in this case, and could a switch be added to skip the error?
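One alternative to exiting the whole task can be sketched as follows. This is only an illustration of the idea, not BR's behavior and not an existing BR option: `InvalidPartError` and `upload_with_restart` are hypothetical names, and the callable passed in is assumed to start a brand-new multipart upload on every call, so that no part list from a failed attempt is reused.

```python
class InvalidPartError(Exception):
    """Hypothetical exception standing in for an S3 `InvalidPart` (HTTP 400) response."""


def upload_with_restart(upload_whole_object, max_attempts=3):
    """Call `upload_whole_object()` until it succeeds or the attempts run out.

    Each call is expected to begin a fresh multipart upload from scratch.
    Other exceptions are not caught and propagate immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return upload_whole_object()
        except InvalidPartError as err:
            # The recorded part list is unusable, so discard it and start over.
            last_error = err
    raise last_error
```

Restarting the single object, rather than retrying completeMultipartUpload with the same part list, matters here: if the part list itself is wrong, resending it would fail the same way every time.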
Could you please clarify what the time in this screenshot represents? Also, which technology did you use to build your own S3? Is it MinIO?
I consulted a colleague on the S3 team, and it seems highly likely that the issue is with the br tool. Is there any update?
How long had BR been running when the error occurred? It seems highly likely that it's an issue with S3. You might want to check with your colleagues who develop the S3-like service to see whether there's a maximum retention time for uploaded parts.
Alternatively, you can set up an S3 storage using MinIO locally to test if there are any errors when backing up to MinIO.
This task ran for about 2 hours from the backup error to exit, and the maximum retention time is 2 days.
Well, that’s unclear. My backups to Alibaba Cloud OSS and Tencent Cloud COS are compatible, but the S3 protocol is essentially a protocol defined by Amazon itself and is still evolving. Various tools can only try to be as compatible as possible, but 100% compatibility cannot be guaranteed…