Background: A backup to self-hosted S3 failed. Can more information be printed to make troubleshooting easier?
Database version v5.4.x, BR version v5.4.x
Suggestions for improvement:
When an upload fails, the following message is printed. It looks like some S3 nodes are under high load, so could the log also state which file failed to upload? Having the filename would give the downstream S3 team more to work with. Thank you.
[2022/07/30 11:19:43.832 +08:00] [ERROR] [backup.go:40] ["failed to backup"] [error="error happen in store 9349506 at xxxxx:20160: Io(Custom { kind: Other, error: \"failed to put object rusoto error timeout after 15mins for upload part in s3 storage\" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error"]
Can the timeout duration be made a configurable parameter?
failed to put object rusoto error timeout after 15mins for upload part in s3 storage
Another question:
If the S3 connection is found to be unavailable, does BR wait or try to reconnect?
Such a log can be added, but it will be printed on the TiKV node side.
The timeout parameter cannot be adjusted at the moment. This error means the S3 put/upload_part request did not complete within 15 minutes. Previously no timeout was set, and we saw cases where the actual wait on MinIO exceeded several hours, leaving the backup stuck. We therefore set a 15-minute limit so an error is returned promptly, and rely on BR's external retries to get past the failed request. When this problem occurs, BR has a certain retry mechanism; if the failures persist, the backup will eventually fail.
If it is a storage load issue, you can work around it by lowering the concurrency parameters, such as BR's --concurrency or TiKV's [backup] num-threads; see the sketch below.
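For illustration only (the PD address, bucket path, and the chosen values are placeholders, not tuned recommendations), lowering both settings reduces how many upload streams hit S3 at the same time:

```shell
# Sketch: run the backup with a lower BR concurrency.
# Placeholders in this example: PD address, S3 path, and the value 2.
br backup full \
    --pd "127.0.0.1:2379" \
    --storage "s3://backup-bucket/2022-07-30/" \
    --concurrency 2

# On each TiKV node, lower the backup worker thread count in tikv.toml
# (whether this takes effect online or needs a restart depends on the TiKV version):
#   [backup]
#   num-threads = 4
```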
Our daily backup task already has a retry mechanism. In this business scenario the cluster is quite large and a backup takes about 30 hours. The intention behind requesting this parameter is to give the underlying S3 more time to respond.
Version v5.4.0
3 PD, 5 TiDB, 9 TiKV, with a cluster size of around 13T.
The average BR backup speed is 284 MB/s.
I feel the root cause is that when backing up SST files or running the checksum, every TiKV node needs to connect to S3; if the network is unstable or a request times out, the whole backup is treated as a failure.