After TiKV Scale-In, ReplicaChecker Did Not Schedule an Operator to Handle offline-regions

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV缩容后ReplicaChecker未调度Operator处理offline-region

| username: Sin30

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.0-rc.2
【Encountered Issue】
ReplicaChecker did not schedule an Operator to handle the offline Stores.
Two new TiKV machines were added first, and then two old machines were scaled in to complete the replacement, bringing the total to 8 TiKV instances. The old machines are in Offline status, but ReplicaChecker did not schedule any Operator to transfer the extra offline-peer regions.
The dashboard shows only balance-region and balance-leader Operators; there is no replace-offline-replica or remove-extra-replica scheduling.

According to the documentation, ReplicaChecker cannot be turned off; only balance-leader-scheduler and balance-region-scheduler can be.
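This matches what pd-ctl shows: the two balance schedulers are ordinary, removable schedulers, while ReplicaChecker runs as part of PD's built-in region patrol and never appears in the scheduler list. A sketch of checking this (<pd-address> is a placeholder for the PD endpoint):

    # List the removable schedulers; ReplicaChecker is not among them because it is built in
    pd-ctl -u http://<pd-address>:2379 scheduler show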

Dear experts, how can I get this ReplicaChecker to process these offline-peer regions?
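What the dashboard shows can also be verified directly from pd-ctl (a minimal sketch; <pd-address> is a placeholder, and the exact invocation may differ by deployment):

    # List the Operators PD has currently generated; only balance-* show up in this case
    pd-ctl -u http://<pd-address>:2379 operator show

    # Confirm the effective scheduling configuration, including the enable-* switches
    pd-ctl -u http://<pd-address>:2379 config show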

【Reproduction Path】
Executed a TiKV scale-in; observed that the status of the 2 TiKV stores changed to Offline, but data transfer did not start automatically.
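The store states and the regions that still have peers on the offline stores can be inspected with pd-ctl (a sketch; <pd-address> is a placeholder):

    # Show all stores; the two scaled-in instances report state "Offline"
    pd-ctl -u http://<pd-address>:2379 store

    # List regions that still have a peer on an offline store
    pd-ctl -u http://<pd-address>:2379 region check offline-peer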

PD Configuration
{
  "replication": {
    "enable-placement-rules": "false",
    "location-labels": "",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "false",
    "enable-debug-metrics": "true",
    "enable-location-replacement": "true",
    "enable-make-up-replica": "true",
    "enable-one-way-merge": "false",
    "enable-remove-down-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-replace-offline-replica": "true",
    "high-space-ratio": 0.8,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "leader-schedule-limit": 4,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.9,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 16,
    "max-snapshot-count": 8,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 32,
    "replica-schedule-limit": 64,
    "scheduler-max-waiting-operator": 5,
    "split-merge-interval": "1h0m0s",
    "store-balance-rate": 15,
    "store-limit-mode": "manual",
    "tolerant-size-ratio": 0
  }
}
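One detail worth noting in this configuration is "store-limit-mode": "manual". In v4.0 the per-store limit caps how fast peers are added to or removed from each store, so a limit that is too low on the offline stores or on the target stores can starve replace-offline-replica Operators even though the checker itself is enabled. A sketch of inspecting and raising the limits (command syntax may differ slightly between 4.0.x releases; <store-id> is a placeholder):

    # Show the current per-store limits
    pd-ctl -u http://<pd-address>:2379 store limit

    # Raise the limit for a specific store, for example an offline store being drained
    pd-ctl -u http://<pd-address>:2379 store limit <store-id> 30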

【Issue Phenomenon and Impact】
No Operators are generated to transfer regions off the offline TiKV Stores, so the two stores stay in Offline status and the scale-in cannot complete.

【Attachments】

| username: yilong | Original post link

  1. Is there new replica information on the newly added TiKV stores? (See the sketch after this list for one way to check.)
  2. Were the scale-out and scale-in operations performed at the same time?
  3. Could you provide the complete PD monitoring data? You can refer to the following document for how to export it. Thanks.
    [FAQ] Exporting and Importing Grafana Metrics Pages - TiDB Q&A Community
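For question 1, one quick way to check whether the new stores have received replicas is to compare per-store region counts (a sketch assuming the jq tool is installed; <pd-address> is a placeholder):

    # Print each store's address, state, and region count
    pd-ctl -u http://<pd-address>:2379 store \
      | jq -r '.stores[] | "\(.store.address) \(.store.state_name) regions=\(.status.region_count)"'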