Testing Auto Sync Disaster Recovery: Unsafe Remove-Failed-Stores Not Triggered

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 测试auto sync 灾备,未触发unsafe remove-failed-stores

| username: TiDBer_yyy

[Problem Encountered] Reproduction
https://asktug.com/t/topic/573034
When following the Online Unsafe Recovery usage documentation (PingCAP documentation center), unsafe remove-failed-stores was not triggered.

[TiDB Version] v5.4.0
[TiDB Usage Environment] Test environment, virtual machine CentOS 7
Configuration file:

global:
  user: tidb
  ssh_port: 22
  deploy_dir: /data/tidb-deploy
  data_dir: /data/tidb-data/
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 39100
  blackbox_exporter_port: 39115
  deploy_dir: /data/tidb-deploy/monitor-39100
  data_dir: /data/tidb-data/monitor_data
  log_dir: /data/tidb-deploy/monitor-39100/log
server_configs:
  tidb:
    oom-use-tmp-storage: true
    performance.max-procs: 0
    performance.txn-total-size-limit: 2097152
    prepared-plan-cache.enabled: true
    tikv-client.copr-cache.capacity-mb: 128.0
    tikv-client.max-batch-wait-time: 0
    tmp-storage-path: /data/tidb-data/tmp_oom
    split-table: true
  tikv:
    coprocessor.split-region-on-table: true
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: false
    server.grpc-compression-type: none
    storage.block-cache.shared: true
  pd:
    enable-cross-table-merge: false
    replication.enable-placement-rules: true
    schedule.leader-schedule-limit: 4
    schedule.region-schedule-limit: 2048
    schedule.replica-schedule-limit: 64
    replication.location-labels: ["dc","logic","rack","host"]
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
tidb_servers:
- host: 192.168.8.11
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
- host: 192.168.8.12
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
- host: 192.168.8.13
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000

tikv_servers:
- host: 192.168.8.11
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv_data
  config:
    server.labels: { dc: "dc1",logic: "logic1",rack: "r1",host: "192_168_8_11" }

- host: 192.168.8.12
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv_data
  config:
    server.labels: { dc: "dc1",logic: "logic2",rack: "r1",host: "192_168_8_12" }

- host: 192.168.8.13
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv_data
  config:
    server.labels: { dc: "dc2",logic: "logic3",rack: "r1",host: "192_168_8_13" }
- host: 192.168.8.13
  ssh_port: 22
  port: 20161
  status_port: 20181
  deploy_dir: /data/tidb-deploy/tikv-20161
  data_dir: /data/tidb-data/tikv_data-20161
  config:
    server.labels: { dc: "dc2",logic: "logic4",rack: "r1",host: "192_168_8_13" }
pd_servers:
- host: 192.168.8.11
  ssh_port: 22
  name: pd-192.168.8.11-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data/tidb-deploy/pd-2379
  data_dir: /data/tidb-data/pd_data
- host: 192.168.8.12
  ssh_port: 22
  name: pd-192.168.8.12-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data/tidb-deploy/pd-2379
  data_dir: /data/tidb-data/pd_data
- host: 192.168.8.13
  ssh_port: 22
  name: pd-192.168.8.13-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data/tidb-deploy/pd-2379
  data_dir: /data/tidb-data/pd_data 
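
For reference (not part of the original post), a topology file like the one above would normally be deployed and brought up with tiup roughly as follows; the file name topology.yaml and the --user value are assumptions:

# Deploy, start and inspect the cluster from the topology file above
tiup cluster deploy dr-auto-sync v5.4.0 topology.yaml --user root
tiup cluster start dr-auto-sync
tiup cluster display dr-auto-sync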

rule.json

[
  {
    "group_id": "pd",
    "group_index": 0,
    "group_override": false,
    "rules": [
      {
        "group_id": "pd",
        "id": "logic1",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1,
        "location_labels": ["dc", "logic", "rack", "host"],
        "label_constraints": [{"key": "logic", "op": "in", "values": ["logic1"]}]
      },
      {
        "group_id": "pd",
        "id": "logic2",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1,
        "location_labels": ["dc", "logic", "rack", "host"],
        "label_constraints": [{"key": "logic", "op": "in", "values": ["logic2"]}]
      },
      {
        "group_id": "pd",
        "id": "logic3",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1,
        "location_labels": ["dc", "logic", "rack", "host"],
        "label_constraints": [{"key": "logic", "op": "in", "values": ["logic3"]}]
      },
      {
        "group_id": "pd",
        "id": "logic4",
        "start_key": "",
        "end_key": "",
        "role": "learner",
        "count": 1,
        "location_labels": ["dc", "logic", "rack", "host"],
        "label_constraints": [{"key": "logic", "op": "in", "values": ["logic4"]}]
      }
    ]
  }
]
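
As a quick sanity check (a sketch, not from the original post), the rule bundle can be loaded and verified with pd-ctl roughly like this, where <pd-addr> is a placeholder for any reachable PD endpoint:

# Load the bundle above (assumed saved as rule.json) and confirm the rules took effect
tiup ctl:v5.4.0 pd -u http://<pd-addr>:2379 config placement-rules rule-bundle save --in=rule.json
tiup ctl:v5.4.0 pd -u http://<pd-addr>:2379 config placement-rules show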

[Disaster Recovery Switch] Cluster Status

[root@centos3 ~]# tiup cluster display dr-auto-sync
tiup is checking updates for component cluster ...timeout!
Starting component `cluster`: /root/.tiup/components/cluster/v1.10.2/tiup-cluster display dr-auto-sync
Cluster type:       tidb
Cluster name:       dr-auto-sync
Cluster version:    v5.4.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.8.13:2379/dashboard
ID                  Role  Host          Ports        OS/Arch       Status   Data Dir                         Deploy Dir
--                  ----  ----          -----        -------       ------   --------                         ----------
192.168.8.13:2379   pd    192.168.8.13  2379/2380    linux/x86_64  Up|L|UI  /data/tidb-data/pd_data          /data/tidb-deploy/pd-2379
192.168.8.13:4000   tidb  192.168.8.13  4000/10080   linux/x86_64  Down     -                                /data/tidb-deploy/tidb-4000
192.168.8.13:20160  tikv  192.168.8.13  20160/20180  linux/x86_64  Up       /data/tidb-data/tikv_data        /data/tidb-deploy/tikv-20160
192.168.8.13:20161  tikv  192.168.8.13  20161/20181  linux/x86_64  Up       /data/tidb-data/tikv_data-20161  /data/tidb-deploy/tikv-20161
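
Before attempting the unsafe removal, it can help to confirm which PD member survived and which TiKV store IDs are actually down. A minimal check with pd-ctl against the surviving PD (the --jq filter is optional and requires jq to be installed):

# List the PD members known to the surviving PD
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 member

# List store IDs, addresses and states to identify the failed stores (1 and 6 in this thread)
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 store --jq=".stores[].store | {id, address, state_name}"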

[Reproduction Path]
Execute the command:

tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 config placement-rules rule-bundle save --in=rules_dr.json
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 config set replication-mode majority
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 unsafe remove-failed-stores 1,6
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 unsafe remove-failed-stores show
[
  "No on-going operation."
]

# It seems that after PD restarts, unsafe recover is not triggered
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 unsafe remove-failed-stores history
[
  "No unsafe recover has been triggered since PD restarted."
]


[Problem Phenomenon and Impact]
Problem: After executing tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 unsafe remove-failed-stores 1,6, unsafe remove-failed-stores was not triggered.

pd.log

[2022/07/20 13:36:17.224 +08:00] [WARN] [forwarder.go:106] ["Unable to resolve connection address since no alive TiDB instance"]
[2022/07/20 13:36:17.224 +08:00] [ERROR] [tidb_requests.go:64] ["fail to send schema request"] [component=TiDB] [error=error.tidb.no_alive_tidb]
| username: ddhe9527 | Original post link

The error says there are no available TiDB instances. Why not try scaling out to add a TiDB instance?
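
If one wanted to try that suggestion, a minimal scale-out might look like the following sketch; the file name scale-out.yaml, the port 4001 and the choice of host are assumptions, not from the original post:

# Minimal scale-out topology for one extra TiDB instance on the surviving host
cat > scale-out.yaml <<EOF
tidb_servers:
  - host: 192.168.8.13
    port: 4001
    status_port: 10081
    deploy_dir: /data/tidb-deploy/tidb-4001
EOF

tiup cluster scale-out dr-auto-sync scale-out.yaml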

| username: Gin | Original post link

Refer to this manual for operations: DR Auto-Sync Setup and Disaster Recovery Manual

| username: TiDBer_yyy | Original post link

Is it necessary to have TiDB version 6.1.0 or above?

| username: Gin | Original post link

Yes, there were quite a few bugs before version 6.1.

| username: TiDBer_yyy | Original post link

Understood. In version 5.4.0, after PD restarts it cannot process the unsafe remove-failed-stores request; you first need to run pd-server to repair the PD node (that is, create a new PD cluster).
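
For context, one common way to "repair the PD node (create a new PD cluster)" on v5.x is to restart the surviving pd-server with the --force-new-cluster flag; the DR manual linked above remains the authoritative procedure. A rough sketch, with the script path and service name assumed from the deploy layout shown earlier:

# On the surviving PD node: append --force-new-cluster to the pd-server start
# command in run_pd.sh (path assumed from deploy_dir), then restart the service
vi /data/tidb-deploy/pd-2379/scripts/run_pd.sh
systemctl restart pd-2379.service
# After the recovery is finished, remove the flag again and restart PD normally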

| username: Gin | Original post link

This is the complete disaster recovery process. For details, you can refer to the disaster recovery manual in the link above.

  1. Force-recover a single-replica PD (with 5 PDs deployed 3:2 across two centers, either of the two PDs in the disaster recovery center can be chosen for recovery; the other one is abandoned).
  2. Adjust the Placement Rules to convert the Learner replicas to Voter replicas, leaving the recovered cluster in 2-replica mode.
  3. Disable the DR Auto-Sync feature and switch back to the default Majority mode.
  4. Use pd-ctl to remove all TiKV stores in the primary center online.
  5. Use pd-recover to bump the PD alloc-id by +100000000 so that subsequently allocated region IDs and other IDs do not roll back (a rough command sketch for steps 2-5 follows).
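
A rough command-level sketch of steps 2 through 5, assuming the surviving PD at 192.168.8.13, reusing the rules_dr.json file name from the reproduction above, and with <cluster-id>/<alloc-id> as placeholders that must be taken from the actual cluster:

# Step 2: save a rule bundle in which the logic4 rule's role has been changed
# from "learner" to "voter" (assumed to be prepared as rules_dr.json)
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 config placement-rules rule-bundle save --in=rules_dr.json

# Step 3: switch from DR Auto-Sync back to the default majority mode
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 config set replication-mode majority

# Step 4: remove the failed primary-center TiKV stores online and watch progress
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 unsafe remove-failed-stores 1,6
tiup ctl:v5.4.0 pd -u http://192.168.8.13:2379 unsafe remove-failed-stores show

# Step 5: bump the allocatable ID (+100000000 over the current maximum) so
# newly allocated region/peer IDs cannot roll back
pd-recover -endpoints http://192.168.8.13:2379 -cluster-id <cluster-id> -alloc-id <alloc-id>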
| username: TiDBer_yyy | Original post link

At the fourth step, the problem may be triggered again:
unsafe remove-failed-stores history

[
  "No unsafe recover has been triggered since PD restarted."
]
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.