Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: BR 备份非常缓慢 (BR backup is very slow)
[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.0
We have two production environments with the same TiDB topology and resource specifications (both deployed on Kubernetes through TiDB Operator); their data volumes, query QPS, etc. are also similar. However, a BR backup takes only 37 minutes in one environment but 6 hours in the other.
Do you have any troubleshooting ideas?
Is the amount of backed-up data and the network situation the same?
The data volume is of the same order. The backup of the slow cluster (6 hours) to object storage is 35GB, while the backup of the fast cluster (37 minutes) to object storage is 24.5GB.
The network conditions are the same. The fast cluster is on Alibaba Cloud, backing up to OSS; the slow cluster is on AWS, backing up to S3. I don’t think there is a network bottleneck.
Moreover, the slow cluster only started slowing down about half a month ago.
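For reference, both environments run a plain BR full backup to object storage, roughly like the sketch below; the PD endpoint, bucket, region, and path here are placeholders, not our real values:

# BR full backup to S3 (placeholder endpoint and bucket)
br backup full \
  --pd "basicai-pd.tidb-cluster:2379" \
  --storage "s3://backup-bucket/full-2024-06-07" \
  --s3.region "us-east-1" \
  --log-file backup-full.log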
In our AWS cluster, TiKV is supposed to have 3 instances (the StatefulSet reports replicas: 3), but due to an earlier failover anomaly there are actually 4 TiKV stores registered, one of which is Down. This has caused some warn-level messages during BR backups, and I'm not sure whether that is the cause.
Currently, the distribution of TiKV stores is as follows:
tikv:
  bootStrapped: true
  failoverUID: 186982df-36ba-4391-b21b-40bba57a2222
  failureStores:
    "104":
      createdAt: "2023-11-29T14:39:14Z"
      podName: basicai-tikv-2
      storeID: "104"
  image: pingcap/tikv:v6.1.0
  phase: Scale
  statefulSet:
    collisionCount: 0
    currentReplicas: 3
    currentRevision: basicai-tikv-654cf466dd
    observedGeneration: 8
    readyReplicas: 3
    replicas: 3
    updateRevision: basicai-tikv-654cf466dd
    updatedReplicas: 3
  stores:
    "1":
      id: "1"
      ip: basicai-tikv-0.basicai-tikv-peer.tidb-cluster.svc
      lastTransitionTime: "2024-06-05T13:56:51Z"
      leaderCount: 1030
      podName: basicai-tikv-0
      state: Up
    "6":
      id: "6"
      ip: basicai-tikv-1.basicai-tikv-peer.tidb-cluster.svc
      lastTransitionTime: "2024-06-05T13:55:12Z"
      leaderCount: 1030
      podName: basicai-tikv-1
      state: Up
    "104":
      id: "104"
      ip: basicai-tikv-2.basicai-tikv-peer.tidb-cluster.svc
      lastTransitionTime: "2024-06-07T15:38:32Z"
      leaderCount: 1037
      podName: basicai-tikv-2
      state: Up
    "30001":
      id: "30001"
      ip: basicai-tikv-3.basicai-tikv-peer.tidb-cluster.svc
      lastTransitionTime: "2023-12-04T13:44:47Z"
      leaderCount: 0   # <----------- the pod basicai-tikv-3 no longer exists, but a store with ID 30001 is still registered
      podName: basicai-tikv-3
      state: Down
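The same store list can also be pulled straight from PD for cross-checking; a minimal sketch, assuming the PD pod is named basicai-pd-0, the cluster runs in the tidb-cluster namespace, and pd-ctl ships at /pd-ctl inside the PD image:

# List every store PD still tracks, including the Down one with 0 leaders
kubectl exec -n tidb-cluster basicai-pd-0 -- \
  /pd-ctl -u http://127.0.0.1:2379 store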
What kind of disk are you using?
Backing up to OSS is mostly a test of the network.
Is there any issue with the network?
@Xiao Yu It has nothing to do with the disk or network. I resolved it by cleaning up the invalid store using pd-ctl.
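For anyone hitting the same thing, the cleanup amounts to something like the pd-ctl commands below. The kubectl wrapping and pod/namespace names are assumptions for an operator deployment; only the store ID 30001 comes from the status posted above:

# Mark the orphaned Down store (ID 30001) for removal; with no regions left it turns Tombstone
kubectl exec -n tidb-cluster basicai-pd-0 -- \
  /pd-ctl -u http://127.0.0.1:2379 store delete 30001

# Then clear tombstone records once the store has transitioned
kubectl exec -n tidb-cluster basicai-pd-0 -- \
  /pd-ctl -u http://127.0.0.1:2379 store remove-tombstone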
Nice, you asked the question and answered it yourself.