BR Backup is Very Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR 备份非常缓慢

| username: TiDBer_RywnG56h

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.0
We have two production environments with the same TiDB topology and resource specifications (both deployed on Kubernetes through TiDB Operator); their data volumes, query QPS, and so on are also similar. However, a BR backup takes only 37 minutes in one environment and 6 hours in the other.

Do you have any troubleshooting ideas?
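
For comparison, I can run the same full backup in both clusters with a dedicated log file; a minimal sketch of that, with the PD address and bucket name as placeholders rather than our real values:

```shell
# Run an identical full backup in each cluster and keep a per-run log,
# so the 37-minute and 6-hour runs can be compared side by side.
br backup full \
  --pd "basicai-pd.tidb-cluster:2379" \
  --storage "s3://my-backup-bucket/br-compare" \
  --send-credentials-to-tikv=true \
  --log-file /tmp/br-backup.log

# Rough duration check: first and last timestamps in the log.
head -n 1 /tmp/br-backup.log
tail -n 1 /tmp/br-backup.log
```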

| username: zhaokede | Original post link

Is the amount of backed-up data and the network situation the same?

| username: TiDBer_RywnG56h | Original post link

The data volume is of the same order. The backup of the slow cluster (6 hours) to object storage is 35GB, while the backup of the fast cluster (37 minutes) to object storage is 24.5GB.

The network conditions are the same. The fast cluster is on Alibaba Cloud, backing up to OSS; the slow cluster is on AWS, backing up to S3. I don’t think there is a network bottleneck.

Moreover, the slow cluster has only started to become slow in the past half month.

| username: TiDBer_RywnG56h | Original post link

In our AWS cluster, TiKV is set to 2 instances, but due to some anomalies there are actually 4 TiKV stores, one of which is down. This produces some warn-level messages during the BR backup, and I'm not sure whether that is the cause.

Currently, the distribution of TiKV stores is as follows:

tikv:
    bootStrapped: true
    failoverUID: 186982df-36ba-4391-b21b-40bba57a2222
    failureStores:
      "104":
        createdAt: "2023-11-29T14:39:14Z"
        podName: basicai-tikv-2
        storeID: "104"
    image: pingcap/tikv:v6.1.0
    phase: Scale
    statefulSet:
      collisionCount: 0
      currentReplicas: 3
      currentRevision: basicai-tikv-654cf466dd
      observedGeneration: 8
      readyReplicas: 3
      replicas: 3
      updateRevision: basicai-tikv-654cf466dd
      updatedReplicas: 3
    stores:
      "1":
        id: "1"
        ip: basicai-tikv-0.basicai-tikv-peer.tidb-cluster.svc
        lastTransitionTime: "2024-06-05T13:56:51Z"
        leaderCount: 1030
        podName: basicai-tikv-0
        state: Up
      "6":
        id: "6"
        ip: basicai-tikv-1.basicai-tikv-peer.tidb-cluster.svc
        lastTransitionTime: "2024-06-05T13:55:12Z"
        leaderCount: 1030
        podName: basicai-tikv-1
        state: Up
      "104":
        id: "104"
        ip: basicai-tikv-2.basicai-tikv-peer.tidb-cluster.svc
        lastTransitionTime: "2024-06-07T15:38:32Z"
        leaderCount: 1037
        podName: basicai-tikv-2
        state: Up
      "30001":
        id: "30001"
        ip: basicai-tikv-3.basicai-tikv-peer.tidb-cluster.svc
        lastTransitionTime: "2023-12-04T13:44:47Z"
        leaderCount: 0  <----------- the pod basicai-tikv-3 no longer exists, but a store with ID 30001 is still registered
        podName: basicai-tikv-3
        state: Down
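
For reference, the same store list can be read straight from PD; a minimal sketch, assuming pd-ctl is available and using the PD service address as a placeholder:

```shell
# List every store registered in PD, including ones whose pods are gone.
pd-ctl -u http://basicai-pd.tidb-cluster:2379 store

# Inspect just the suspicious store from the status above.
pd-ctl -u http://basicai-pd.tidb-cluster:2379 store 30001
```
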
| username: 哈喽沃德 | Original post link

What kind of disk are you using?

| username: 哈喽沃德 | Original post link

Backing up to OSS mainly tests the network.

| username: 小于同学 | Original post link

Is there any issue with the network?

| username: TiDBer_RywnG56h | Original post link

@Xiao Yu It has nothing to do with the disk or network. I resolved it by cleaning up the invalid store using pd-ctl.
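
For anyone who hits the same problem, a minimal sketch of that kind of cleanup, assuming the stale store is the Down store 30001 from the status above and using a placeholder PD address (not the exact commands I ran):

```shell
# Ask PD to take the stale store offline; with leaderCount 0 and no pod
# behind it, PD should move it to Tombstone fairly quickly.
pd-ctl -u http://basicai-pd.tidb-cluster:2379 store delete 30001

# Once it shows as Tombstone in `store`, purge it from PD's store list.
pd-ctl -u http://basicai-pd.tidb-cluster:2379 store remove-tombstone
```

Once the stale store is gone from `pd-ctl store`, BR no longer sees a Down store during backup, which appears to be what was slowing it down.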

| username: zhaokede | Original post link

Excellent.

| username: lemonade010 | Original post link

Nice, you asked and answered your own question.

| username: jiayou64 | Original post link

Learned something new.