[Err] 9005 - Region is unavailable, [Err] 9001 - PD server timeout - Urgent

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [Err] 9005 - Region is unavailable、[Err] 9001 - PD server timeout-加急

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] TiDB v3.0.3
[Encountered Issue] SQL statements against the cluster fail with [Err] 9005 - Region is unavailable and [Err] 9001 - PD server timeout
[Reproduction Path]

  1. Decommissioned two KV nodes; after they had been offline for about a week they were finally set to tombstone state with tikv-ctl, but 56 regions (region-count) still cannot be removed normally;
  2. Two other KV nodes in the cluster are down: one reports "last index < applied index" in the TiKV log, and the other cannot start because of an interdependency with that KV node;
  3. Ran unsafe-recover remove-fail-stores against the two down nodes' TiKV to remove the bad-regions; some were removed, but 40 still cannot be removed, because some of those regions contain peers on the KV nodes that were set to tombstone mode.

[Issue Phenomenon and Impact]
Executing SQL statements against the database returns [Err] 9005 - Region is unavailable and [Err] 9001 - PD server timeout.

[Attachment]

"regions": [
  {
    "id": 24567918,
    "start_key": "7480000000000000FF6B5F698000000000FF0000010380000000FF0000006101484E30FF3732303533FF314DFF425232562626FF33FF30302D31303030FFFF5F56524D53000000FFFC03800000000000FF03450419A6B41547FF00000001484E3037FF32303533FF310000FF0000000000F80000FD",
    "end_key": "7480000000000000FF6B5F698000000000FF0000010380000000FF0000006101484E30FF3732303533FF314EFF41434E554C48FF26FF26302D31303030FFFF5F4C460000000000FFFA03800000000000FF034A0419A6A93EEFFF00000001484E3037FF32303533FF310000FF0000000000F80000FD",
    "epoch": {
      "conf_ver": 1261,
      "version": 73
    },
    "peers": [
      {
        "id": 61333959,
        "store_id": 39485340
      },
      {
        "id": 61831614,
        "store_id": 13585107
      },
      {
        "id": 62153411,
        "store_id": 61971239
      }
    ]
  },

For example, in this region, store 61971239 is a down node, store 13585107 has been set to tombstone mode, and store 39485340 is a normal up store. How can I clear bad-regions like this one?
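For reference, the current state of an individual region or store can also be checked directly with pd-ctl before deciding how to clean it up; a minimal sketch using the region and store IDs from the listing above (the PD endpoint is a placeholder):

pd-ctl -u http://<pd_endpoint>:2379 -d region 24567918
pd-ctl -u http://<pd_endpoint>:2379 -d store 61971239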

| username: 普罗米修斯 | Original post link

Some regions are in an even worse situation:
"regions": [
  {
    "id": 53858501,
    "start_key": "7480000000000000FF6B5F698000000000FF00000301484E3037FF32303533FF324D42FF5232480000FD0380FF0000014A28449400FE",
    "end_key": "7480000000000000FF6B5F698000000000FF00000301484E3037FF32303533FF324D42FF5232480000FD0380FF0000033E22166900FE",
    "epoch": {
      "conf_ver": 1387,
      "version": 72
    },
    "peers": [
      {
        "id": 60328806,
        "store_id": 22010754
      },
      {
        "id": 62116511,
        "store_id": 61971239
      },
      {
        "id": 62129903,
        "store_id": 62119544
      }
    ]
  }
For this region, stores 61971239 and 62119544 are both down nodes, and store 22010754 was set to tombstone mode as part of decommissioning.

  1. How to clear this kind of region?
  2. For the nodes that were decommissioned by setting them to tombstone mode, their regions are still visible at the PD layer via pd-ctl; how can those regions be cleared?
| username: jansu-dev | Original post link

  1. How to clear this kind of region? → Can you accept data loss? If yes, use unsafe recover:
1. Temporarily pause the following scheduling in pd-ctl:
scheduler pause balance-leader-scheduler
scheduler pause balance-region-scheduler
scheduler pause balance-hot-region-scheduler
config set replica-schedule-limit 0
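Once the recovery is complete, the paused scheduling should be restored. A minimal sketch, assuming scheduler resume is supported by this pd-ctl version and that <original_value> is whatever config show reported for replica-schedule-limit before it was set to 0:

## restore scheduling after the recovery is finished
scheduler resume balance-leader-scheduler
scheduler resume balance-region-scheduler
scheduler resume balance-hot-region-scheduler
config set replica-schedule-limit <original_value>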

2. Use pd-ctl to check Regions with at least half of the replicas on the faulty nodes
Requirement: PD is running
Assume faulty nodes are 15, 22, 43, 45, 46
pd-ctl -u <endpoint> -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(15,22,43,45,46) then . else empty end) | length>=$total-length) }'
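As a usage example for this thread, here is the same query with the sample store IDs replaced by the two down stores and the two tombstone stores mentioned above (61971239, 62119544, 13585107, 22010754); the endpoint remains a placeholder:

pd-ctl -u <endpoint> -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(13585107,22010754,61971239,62119544) then . else empty end) | length>=$total-length)}'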

3. On all normal TiKV instances, remove all Peers located on the faulty nodes for all Regions
Requirement: Run on all non-faulty machines, TiKV nodes need to be shut down
## tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s <s1,s2,....> --all-regions
Here store id is 30725906
tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s 30725906 --all-regions
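Applied to this thread, the failed store list would presumably include both down stores and the two tombstone stores whose peers could not be removed (an assumption based on the region listings above); as before, TiKV must be stopped on every surviving node when this is run:

tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s 13585107,22010754,61971239,62119544 --all-regions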

4. Use pd-ctl to check Regions without a Leader
Requirement: PD is running
pd-ctl -u <endpoint> -d region --jq '.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}'
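If a region still has no leader after this check because every one of its replicas sat on the failed stores, the usual follow-up is to recreate it as an empty region with tikv-ctl; a sketch, assuming the recreate-region subcommand is available in this version (run with TiKV stopped; the PD endpoint and region ID are placeholders, and any data in that region is lost):

## run on one surviving TiKV node with its TiKV process stopped
tikv-ctl --db /path/to/tikv-data/db recreate-region -p <pd_endpoint>:2379 -r <region_id>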

5. Start TiKV

6. Check data index consistency
Requirement: PD, TiKV, and TiDB are running. For example, compare the row count read from the table data with the count read through a secondary index (table and index names below are placeholders):
select count(*) from table_name;                        -- c1
select count(*) from table_name force index (idx_name); -- c2
-- c1 and c2 should return the same value
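In addition, TiDB's admin check table statement can be used to verify that a table's data and its indexes are consistent (the table name is a placeholder):

admin check table table_name;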
  2. As for the nodes that were decommissioned by setting them to tombstone mode, how to clear their regions at the PD layer using pd-ctl? → After completing operation 1, Tombstone indicates that the TiKV store is fully decommissioned, and you can use the remove-tombstone interface to safely clean up TiKV stores in this state (from TiDB 数据库的调度 | PingCAP 文档中心; for the specific operation, see PD Control 使用说明 | PingCAP 文档中心).
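A minimal sketch of that cleanup, with the PD endpoint as a placeholder (store remove-tombstone deletes all stores that are already in Tombstone state from PD):

pd-ctl -u http://<pd_endpoint>:2379 -d store remove-tombstone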
| username: 普罗米修斯 | Original post link

It was completed last Friday, basically following the operations above. The difference was that after executing store remove-tombstone, some regions still had peers on the offlined stores; after running unsafe recover on all KV nodes with those tombstone stores specified and restarting TiKV, the cluster returned to normal.

| username: jansu-dev | Original post link

OK!!!

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.