What should I do when miss-peer Regions appear in TiDB?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb出现regions miss-peer该怎么办?

| username: TiDBer_5GvAkLi0

【TiDB Usage Environment】Test Environment
【TiDB Version】v5.4.1
【Encountered Problem】
When executing tiup cluster check tidb-test --cluster, the check reported miss-peer Regions. How can this be resolved?

| username: songxuecheng | Original post link

Use pd-ctl to check for regions with missing peers:

region check miss-peer

If the data volume is not large, you can manually add the replicas:

operator add add-peer <region_id> <store_id>
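Putting the two steps together, it would look roughly like this in non-interactive pd-ctl form; the PD address and the region/store IDs are placeholders you need to substitute for your own cluster:

tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 region check miss-peer
# pick a region id from the output, then add its missing replica on a chosen store
tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 operator add add-peer <region_id> <store_id>
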
| username: TiDBer_5GvAkLi0 | Original post link

Okay, thank you for the explanation. However, I found a lot of miss-peer Regions. What should I do when there are this many? Manually adding replicas for all of them is not realistic. What causes these miss-peer issues in the first place?

| username: songxuecheng | Original post link

Normally the miss-peer count goes down on its own as PD automatically replenishes the missing replicas, so keep observing it for a while. If it stays flat, check whether the TiKV disks have reached the space threshold.
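A quick way to check that could look like the following; the PD address is a placeholder, and the default threshold values named in the comments are from memory of the v5.x defaults, so verify them on your version:

tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 store         # compare capacity vs. available for each TiKV store
tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 config show   # look at low-space-ratio (default 0.8) and high-space-ratio (default 0.7)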

| username: 张雨齐0720 | Original post link

Scheduling Policy Control

You can adjust PD’s scheduling policy using pd-ctl from the following three aspects. For more detailed information, refer to PD Control.

Start/Stop Schedulers

pd-ctl supports dynamically creating and deleting Schedulers. You can control PD’s scheduling behavior through these operations, such as the following (a command-line form is sketched after the list):

  • scheduler show: Display the current Schedulers in the system
  • scheduler remove balance-leader-scheduler: Remove (deactivate) the balance leader scheduler
  • scheduler add evict-leader-scheduler 1: Add a scheduler to evict all Leaders from Store 1
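For reference, the same scheduler operations in non-interactive form might look like this; the PD address is a placeholder and store 1 is just the example ID from the list above:

tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 scheduler show
tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 scheduler add evict-leader-scheduler 1
# the evict-leader scheduler for store 1 is removed by name once it is no longer needed
tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 scheduler remove evict-leader-scheduler-1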

Manually Add Operators

PD supports directly creating or deleting Operators via pd-ctl, such as the following (a scripted batch example follows the list):

  • operator add add-peer 2 5: Add a Peer for Region 2 on Store 5
  • operator add transfer-leader 2 5: Transfer the Leader of Region 2 to Store 5
  • operator add split-region 2: Split Region 2 into 2 equally sized Regions
  • operator remove 2: Cancel the current pending Operator for Region 2
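Coming back to the earlier concern that adding replicas one by one is unrealistic: the check output can be looped over in a script. This is only a sketch; it assumes the JSON from region check miss-peer has a regions array with an id field, and <store_id> is a store you pick yourself. Normally PD replenishes replicas automatically, so a batch operation like this should rarely be needed:

tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 region check miss-peer \
  | jq -r '.regions[].id' \
  | while read region_id; do
      # queue an add-peer operator for every Region that is short of a replica
      tiup ctl:v5.4.1 pd -u http://<pd-ip>:2379 operator add add-peer "$region_id" <store_id>
    done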

Adjust Scheduling Parameters

Use pd-ctl to execute the config show command to view all scheduling parameters, and execute config set {key} {value} to adjust the corresponding values. Common adjustments include the following (a short example follows the list):

  • leader-schedule-limit: Control the concurrency of Transfer Leader scheduling
  • region-schedule-limit: Control the concurrency of adding/removing Peer scheduling
  • enable-replace-offline-replica: Enable scheduling for node offline
  • enable-location-replacement: Enable scheduling related to adjusting Region isolation levels
  • max-snapshot-count: The maximum concurrency of Snapshot sending/receiving allowed per Store
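A short example of viewing and adjusting these values in the interactive pd-ctl prompt; the numbers are only illustrative and the defaults differ between versions, so check config show first:

» config show                          # print the current scheduling parameters
» config set region-schedule-limit 8   # raise the concurrency of add/remove-Peer scheduling (8 is just an example)
» config set max-snapshot-count 6      # allow more concurrent snapshots per Store (6 is just an example)
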
| username: HACK | Original post link

I’d like to ask: under what circumstances does the region miss-peer phenomenon typically occur?

| username: songxuecheng | Original post link

Typically when nodes are being scaled in or out, when a TiKV instance crashes, when TiFlash is replicating its replicas, and so on.

| username: TiDBer_5GvAkLi0 | Original post link

Okay, thank you for the explanation.

| username: HACK | Original post link

:ok_hand:

| username: TiDBer_5GvAkLi0 | Original post link

OK, thank you for the explanation.

| username: TiDBer_5GvAkLi0 | Original post link

May I ask: do the missing replicas (miss-peer) affect normal cluster operation? The cluster data won’t become abnormal, right? I checked and disk capacity is fine everywhere, but there are still 161 miss-peer Regions. Does that mean miss-peer can simply be ignored?

| username: songxuecheng | Original post link

Please provide the information from pd-ctl store.

| username: TiDBer_5GvAkLi0 | Original post link

» store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 4,
        "address": "10.60.28.171:20160",
        "version": "5.4.1",
        "status_address": "10.60.28.171:20180",
        "git_hash": "91fe561f0af87cc47359cdf61d6e6838471cb644",
        "start_timestamp": 1657783577,
        "deploy_path": "/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1657792548350658608,
        "state_name": "Up"
      },
      "status": {
        "capacity": "99.75GiB",
        "available": "94.71GiB",
        "used_size": "42.7MiB",
        "leader_count": 85,
        "leader_weight": 1,
        "leader_score": 85,
        "leader_size": 92,
        "region_count": 161,
        "region_weight": 1,
        "region_score": 789.034187893511,
        "region_size": 168,
        "slow_score": 1,
        "start_ts": "2022-07-14T15:26:17+08:00",
        "last_heartbeat_ts": "2022-07-14T17:55:48.350658608+08:00",
        "uptime": "2h29m31.350658608s"
      }
    },
    {
      "store": {
        "id": 63,
        "address": "10.60.28.178:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.4.1",
        "peer_address": "10.60.28.178:20170",
        "status_address": "10.60.28.178:20292",
        "git_hash": "1d20105ad4a1b8516e6ee1b20acc257e3584fdd6",
        "start_timestamp": 1657768678,
        "deploy_path": "/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1657792550566992459,
        "state_name": "Up"
      },
      "status": {
        "capacity": "99.75GiB",
        "available": "95.14GiB",
        "used_size": "84.01KiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "slow_score": 0,
        "start_ts": "2022-07-14T11:17:58+08:00",
        "last_heartbeat_ts": "2022-07-14T17:55:50.566992459+08:00",
        "uptime": "6h37m52.566992459s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.60.28.172:20160",
        "version": "5.4.1",
        "status_address": "10.60.28.172:20180",
        "git_hash": "91fe561f0af87cc47359cdf61d6e6838471cb644",
        "start_timestamp": 1657783572,
        "deploy_path": "/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1657792543334977209,
        "state_name": "Up"
      },
      "status": {
        "capacity": "99.75GiB",
        "available": "95.44GiB",
        "used_size": "33.16MiB",
        "leader_count": 76,
        "leader_weight": 1,
        "leader_score": 76,
        "leader_size": 76,
        "region_count": 161,
        "region_weight": 1,
        "region_score": 784.7309712676937,
        "region_size": 168,
        "slow_score": 1,
        "start_ts": "2022-07-14T15:26:12+08:00",
        "last_heartbeat_ts": "2022-07-14T17:55:43.334977209+08:00",
        "uptime": "2h29m31.334977209s"
      }
    }
  ]
}

| username: songxuecheng | Original post link

Two TiKV nodes and one TiFlash? With 3 replicas you need at least one more TiKV; otherwise the missing replicas can never be replenished automatically.
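You can confirm this from PD itself with standard pd-ctl:

» config show replication

If max-replicas there is 3 (the default), every Region needs 3 TiKV replicas, but the store list above contains only two TiKV stores; the TiFlash store is labeled engine=tiflash and does not hold ordinary Raft replicas, so one replica of every Region stays missing.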

| username: cs58_dba | Original post link

It feels a bit like primary-secondary replication lag: as long as it can catch up, it’s fine.

| username: TiDBer_5GvAkLi0 | Original post link

I see, so I need to add one more TiKV, i.e., keep at least 3 TiKV nodes.

| username: songxuecheng | Original post link

Yes, after adding the new TiKV, take another look.
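In case it helps, a minimal scale-out sketch with tiup; the host is a placeholder and the topology lists only the required field, so adjust ports and directories to your environment:

# scale-out.yaml
tikv_servers:
  - host: <new-tikv-ip>

tiup cluster scale-out tidb-test scale-out.yaml
tiup cluster check tidb-test --cluster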

| username: TiDBer_5GvAkLi0 | Original post link

OK, thanks for the guidance.

| username: TiDBer_5GvAkLi0 | Original post link

After adding a TiKV node, I found that there are no more miss-peer issues. Thank you for the guidance, much appreciated.

| username: songxuecheng | Original post link

OK. With 3 replicas, at least three TiKV nodes must be maintained.