TiKV Scaling Down Stuck in Pending Offline State, Information Still Exists in PD After Forced Removal

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV缩容一直处于Pending Offline状态,强制下线后,pd依旧存在该tikv信息

| username: seiang

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.0.3

Yesterday, we scaled down two TiKV nodes (originally 7 nodes), and they have been in Pending Offline status since then. We handled it as follows:

  1. Confirmed the number of regions and leaders on the two scaled-down TiKVs, leader=0, region=1.
  2. Confirmed that the only region on the two scaled-down TiKVs is region 9783357, which is an empty region.
  3. Tried to add a replica for Region 9783357 and remove-peer 9783357, but neither worked.
  4. Attempted to forcibly take one of the TiKV nodes offline using tiup cluster scale-in xxxx -N 10.30.xx.xx:20160 --force. It did disappear from tiup afterwards, but when checked with pd-ctl, the store information still existed.

I don’t know how to proceed from here. Currently, the store information is still recorded in PD. Please advise, thank you.
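
(For reference, the checks in steps 1–3 correspond roughly to the following pd-ctl commands; the PD address is a placeholder and the store/region IDs are the ones that appear later in this thread:)

./pd-ctl -u http://10.30.xx.xx:2379 -d store 8009882          # leader_count / region_count and state of the offline store
./pd-ctl -u http://10.30.xx.xx:2379 -d region 9783357         # peers, leader, and approximate_size of the remaining region
./pd-ctl -u http://10.30.xx.xx:2379 -d operator add add-peer 9783357 <store_id>
./pd-ctl -u http://10.30.xx.xx:2379 -d operator add remove-peer 9783357 <store_id>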

| username: h5n1 | Original post link

Recreate this empty region; once the store status becomes Tombstone, run remove-tombstone.
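
(A minimal sketch of that sequence, reusing the data directory and addresses from this thread and assuming the TiKV instance holding the region is stopped first:)

tiup cluster stop <cluster-name> -N 10.30.xx.xx:20160
./tikv-ctl --db /data/tidb-data/tikv-20160/db recreate-region -p 10.30.xx.xx:2379 -r 9783357
tiup cluster start <cluster-name> -N 10.30.xx.xx:20160
./pd-ctl -u http://10.30.xx.xx:2379 -d store remove-tombstone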

| username: xingzhenxiang | Original post link

./pd-ctl -u http://XXX.XXX.XXX.XXX:2379 -d store remove-tombstone

| username: seiang | Original post link

However, an error occurred when recreating this empty region:
./tikv-ctl --db /data/tidb-data/tikv-20160/db recreate-region -p 10.30.xx.xx:2379 -r 9783357
error while open kvdb: Storage Engine IO error: While lock file: /data/tidb-data/tikv-20160/db/LOCK: Resource temporarily unavailable
LOCK file conflict indicates TiKV process is running. Do NOT delete the LOCK file and force the command to run. Doing so could cause data corruption.

| username: h5n1 | Original post link

You need to stop the TiKV where the region is located. You can check it using pd-ctl region xxxx.
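
(Roughly like this; the store address still has to be mapped back to a tiup node by hand:)

./pd-ctl -u http://10.30.xx.xx:2379 -d region 9783357      # the peers list shows which store_ids hold the region
./pd-ctl -u http://10.30.xx.xx:2379 -d store 8009882       # the address field gives the ip:port of that TiKV
tiup cluster stop <cluster-name> -N <ip:port>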

| username: seiang | Original post link

./tikv-ctl --db /data/tidb-data/tikv-20160/db recreate-region -p '10.30.xx.xx:2379' -r 9783357

initing empty region 10113953 with peer_id 10113954...
Debugger::recreate_region: "[src/server/debug.rs:639]: \"[src/server/debug.rs:664]: region still exists id: 10113953 start_key: 7480000000000000FF375F698000000000FF0000040380000000FF0D2F659003800000FF0000000002038000FF00009043FDAD0000FD end_key: 7480000000000008FF875F72FC00000019FF18E0020000000000FA region_epoch { conf_ver: 1 version: 15792 } peers { id: 10113954 store_id: 8009882 }\""

Below is the region information
» region 9783357
{
  "id": 9783357,
  "start_key": "7480000000000000FF375F698000000000FF0000040380000000FF0D2F659003800000FF0000000002038000FF00009043FDAD0000FD",
  "end_key": "7480000000000008FF875F72FC00000019FF18E0020000000000FA",
  "epoch": {
    "conf_ver": 8012,
    "version": 15791
  },
  "peers": [
    {
      "id": 9783358,
      "store_id": 8009882,
      "role_name": "Voter"
    },
    {
      "id": 9783359,
      "store_id": 6,
      "role_name": "Voter"
    },
    {
      "id": 9783360,
      "store_id": 8009881,
      "role_name": "Voter"
    },
    {
      "id": 10113880,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 9783359,
    "store_id": 6,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 4967,
      "peer": {
        "id": 9783360,
        "store_id": 8009881,
        "role_name": "Voter"
      }
    },
    {
      "down_seconds": 317,
      "peer": {
        "id": 10113880,
        "store_id": 1,
        "role": 1,
        "role_name": "Learner",
        "is_learner": true
      }
    }
  ],
  "pending_peers": [
    {
      "id": 10113880,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 0
}

It still doesn't seem to work.
Store 8009881 is no longer visible in tiup but can still be seen in pd-ctl, and store 8009882 is still in Pending Offline status.

| username: 爱白话的晓辉 | Original post link

The store may still be present after an ordinary delete. If you are sure it is no longer needed, use the unsafe command to force-delete it, and it will no longer exist in the metadata.
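
(What is usually meant by a forced delete on this version is the PD store API; this is an assumption about the reply above, so double-check it against the v5.0 docs before using it, because it skips the normal offline process:)

curl -X DELETE 'http://10.30.xx.xx:2379/pd/api/v1/store/8009881?force=true'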

| username: h5n1 | Original post link

Try using remove-peer to delete the peer on store 8009882 first. If it still doesn't work, retry it.

| username: seiang | Original post link

Still not working
» operator add remove-peer 9783357 8009881
Failed! [500] "failed to add operator, maybe already have one"
» operator add remove-peer 9783357 8009882
Failed! [500] "failed to add operator, maybe already have one"
» operator add remove-peer 9783357 6
Failed! [500] "fail to build operator: plan is empty, maybe no valid leader"

./tikv-ctl --db /data/tidb-data/tikv-20160/db tombstone -p '10.30.xx.xx:2379' -r 9783357 --force

region: 9783357, error: "[src/server/debug.rs:1190]: invalid conf_ver: please make sure you have removed the peer by PD"
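
(The "failed to add operator, maybe already have one" error usually means PD already has a pending operator on that region; as a sketch, it can be inspected and cancelled in pd-ctl before retrying the remove-peer:)

» operator show                 # list the operators currently pending
» operator check 9783357        # show the operator attached to this region, if any
» operator remove 9783357       # cancel it, then retry operator add remove-peer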

| username: h5n1 | Original post link

Could you check the current status using pd-ctl region?

| username: seiang | Original post link

It is an empty region.
$ curl http://10.30.xx.xx:10080/regions/9783357
{
  "start_key": "dIAAAAAAAAA3X2mAAAAAAAAABAOAAAAADS9lkAOAAAAAAAAAAgOAAAAAkEP9rQ==",
  "end_key": "dIAAAAAAAAiHX3L8AAAAGRjgAg==",
  "start_key_hex": "7480000000000000375f69800000000000000403800000000d2f659003800000000000000203800000009043fdad",
  "end_key_hex": "7480000000000008875f72fc0000001918e002",
  "region_id": 9783357,
  "frames": null
}

| username: h5n1 | Original post link

You can only use this approach, with -s to specify those store_ids and -r the region_id.

| username: seiang | Original post link

Do we need to stop these four stores: 8009882, 6, 8009881, 1? Stores 1 and 6 are normal TiKV nodes. If we stop them, it will affect normal services, right?

./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 8009882,6,8009881,1 -r 9783357

error while open kvdb: Storage Engine IO error: While lock file: /data/tidb-data/tikv-20160/db/LOCK: Resource temporarily unavailable
LOCK file conflict indicates TiKV process is running. Do NOT delete the LOCK file and force the command to run. Doing so could cause data corruption.

| username: h5n1 | Original post link

Did your previous operations involve stopping the TiKV? These operations require stopping the TiKV where the region peer is located. For unsafe recovery, if only a few regions are involved you can stop the affected TiKV instances one by one, as in your case with just a single region; if many regions are involved, generally all TiKV instances are stopped. Stopping TiKV will cause leader migration and some fluctuation. After max-store-down-time is exceeded, replicas are automatically replenished on other nodes; you can temporarily increase this parameter with pd-ctl config set.
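
(For example, the threshold can be raised before the maintenance and restored afterwards; 30m is the default:)

» config set max-store-down-time 72h
» config set max-store-down-time 30m     # restore once the TiKV instances are back up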

| username: seiang | Original post link

Do all four stores need to be stopped simultaneously? Can’t you stop one, then execute ./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 6 -r 9783357 and then bring it back up before stopping another? Would that not work?

| username: h5n1 | Original post link

I haven't tried this; you can give it a shot. I guess it won't work, otherwise there wouldn't be a need for online unsafe-recover in version 6.1.
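
(For reference, in v6.1 and later this is the pd-ctl Online Unsafe Recovery flow, which does not require stopping TiKV; it is not available on v5.0.3 and is only shown for comparison:)

» unsafe remove-failed-stores 8009881,8009882
» unsafe remove-failed-stores show       # watch the recovery progress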

| username: seiang | Original post link

It does indeed seem infeasible, but stopping everything would impact the business.

| username: h5n1 | Original post link

There are 5 available nodes now. After stopping stores 1 and 6, leaders will migrate away, so availability will not be affected, but performance will be impacted.

| username: seiang | Original post link

After stopping everything, executing ./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 6 -r 9783357 still doesn’t work.

| username: h5n1 | Original post link

Each involved store is handled this way; -s can specify multiple stores, e.g. -s 1,2,3,4.
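
(Putting the thread together, the documented pre-v6.1 offline unsafe-recover flow for this single region would roughly be the following: -s lists the failed stores 8009881 and 8009882, and the tikv-ctl command is run on each stopped TiKV that still holds a healthy peer, i.e. stores 1 and 6 here, with the --db path adjusted per instance:)

tiup cluster stop <cluster-name> -N <store1_ip:port>,<store6_ip:port>
./tikv-ctl --db /data/tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 8009881,8009882 -r 9783357
tiup cluster start <cluster-name> -N <store1_ip:port>,<store6_ip:port>
./pd-ctl -u http://10.30.xx.xx:2379 -d store remove-tombstone     # clean up once the offline stores turn Tombstone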