Improvement of the BR tool

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: br 工具完善

| username: GreenGuan

Background: A backup to self-built S3 storage failed. Can more information be printed to make troubleshooting easier?
Database version v5.4.x, BR version v5.4.x

Suggestions for improvement:

  1. When an upload fails, the following message is printed. It seems that some S3 nodes are under high load, so could BR also print which file failed to upload? Having the filename in the output would give the downstream S3 team more to work with. Thank you.

[2022/07/30 11:19:43.832 +08:00] [ERROR] [backup.go:40] ["failed to backup"] [error="error happen in store 9349506 at xxxxx:20160: Io(Custom { kind: Other, error: \"failed to put object rusoto error timeout after 15mins for upload part in s3 storage\" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error"]

  2. Can the timeout duration be exposed as an adjustable parameter?
    failed to put object rusoto error timeout after 15mins for upload part in s3 storage

Another question:
If the S3 connection becomes unavailable, does BR wait or try to reconnect?

| username: luancheng | Original post link

  1. This log can be added, but it will be printed on the TiKV node.
  2. The timeout cannot be adjusted at the moment. The error means that the S3 put/upload_part request did not complete within 15 minutes. Previously, with no timeout set, we saw the actual wait on MinIO exceed several hours and the backup get stuck, so we set a 15-minute limit to return an error promptly and rely on BR's external retries to work around it. When this error occurs, BR retries a certain number of times; only if the failures persist does the backup ultimately fail.

If it is a storage load issue, you can work around it by lowering the concurrency parameters, such as BR's --concurrency flag or TiKV's [backup] num-threads setting; see the sketch below.
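A rough sketch of what that might look like; the PD address, bucket path, endpoint, and the concrete numbers are illustrative only, not taken from this thread:

```bash
# Hypothetical example: reduce backup pressure on the S3 side by lowering
# BR's per-TiKV task concurrency. All addresses and values are illustrative.
br backup full \
    --pd "10.0.0.1:2379" \
    --storage "s3://backup-bucket/full-2022-07-30/" \
    --s3.endpoint "http://minio.internal:9000" \
    --send-credentials-to-tikv=true \
    --concurrency 2 \
    --log-file backup.log

# The TiKV-side backup worker pool can also be shrunk in tikv.toml
# (takes effect after reloading/restarting the TiKV nodes):
#
#   [backup]
#   num-threads = 4
```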

| username: GreenGuan | Original post link

The daily backup task already has a retry mechanism. In this business scenario the cluster is quite large and a backup takes about 30 hours. The intention behind proposing this parameter is to give the underlying S3 more time to respond.

Is this the --ratelimit parameter of the BR tool?
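
(For reference, my understanding is that --ratelimit caps the backup speed of each TiKV node in MiB/s, whereas the concurrency knobs mentioned above control how many backup tasks run in parallel, so they are different levers. A minimal, purely illustrative sketch:)

```bash
# Hypothetical example: cap each TiKV node's backup throughput rather than
# lowering parallelism. The PD address, bucket, and 64 MiB/s are illustrative.
br backup full \
    --pd "10.0.0.1:2379" \
    --storage "s3://backup-bucket/full-2022-07-30/" \
    --ratelimit 64
```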

| username: IANTHEREAL | Original post link

As the cluster grows larger, the probability of encountering issues increases, so BR's retry handling needs to be made more robust.

Could you provide the following information?

  • What is the current cluster capacity? What are the cluster configurations and topology like?
  • What is the backup speed when the backup is successful? (This can be seen in the summary section of the log after a successful backup.)

| username: IANTHEREAL | Original post link

Is there a way to speed up the client’s backup process?

| username: GreenGuan | Original post link

Version v5.4.0.
3 PD, 5 TiDB, and 9 TiKV nodes, with a cluster size of around 13 TB.
BR average speed is 284 MB/s.

I feel the root cause is that when backing up SST files or running the checksum, every TiKV node needs to establish a connection to S3. If the network is unstable or a request times out, the entire backup is treated as a failure.

| username: IANTHEREAL | Original post link

The longer the backup takes, the higher the probability of hitting anomalies, which makes anomaly handling all the more important.

Additionally, may I ask if your cluster is also deployed on AWS and if you are backing up to an S3 bucket in the same region as the cluster?

| username: GreenGuan | Original post link

The cluster is deployed in a self-built data center, and the S3 is also self-built.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.