TiKV remains Pending Offline and cannot be removed after scaling down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Tikv 缩容后一直 Pending Offline无法移除

| username: 最强王者

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
6.2.5
[Reproduction Path] What operations were performed to cause the issue
Scaled down TiKV
Operation command: tiup cluster scale-in tidb-online --node 10.10.4.53:20160
[Encountered Issue: Problem Phenomenon and Impact]
Stuck in Pending Offline

  "version": "6.5.2",
  "peer_address": "10.10.4.53:20160",
  "status_address": "10.10.4.53:20180",
  "git_hash": "a29f525cec48a801e9d8b1748356a88385bcfd33",
  "start_timestamp": 1698251070,
  "deploy_path": "/data1/tidb/tidb-deploy/tikv-20160/bin",
  "last_heartbeat": 1700097187495042106,
  "state_name": "Offline"
  },
  "status": {
    "capacity": "1.719TiB",
    "available": "1.44TiB",
    "used_size": "8.286GiB",
    "leader_count": 0,
    "leader_weight": 1,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 562,
    "region_weight": 1,
    "region_score": 19856.65592724587,
    "region_size": 16820,
    "witness_count": 0,
    "slow_score": 1,
    "start_ts": "2023-10-26T00:24:30+08:00",
    "last_heartbeat_ts": "2023-11-16T09:13:07.495042106+08:00",
    "uptime": "512h48m37.495042106s"
  }
},
Monitoring shows that the leaders have all been transferred away, but there are still regions left on the store, as shown in the picture.

Can anyone help take a look? Thanks

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: h5n1 | Original post link

The regions haven’t all been migrated yet.

| username: TiDBer_小阿飞 | Original post link

When a TiKV is scaled down, it will enter the Offline state. This state is just an intermediate state for TiKV going offline. In this state, TiKV will perform leader transfer and region balance. Once the leader_count/region_count both show that the transfer or balance is complete, the TiKV will change from Offline to Tombstone. While in the Offline state, TiKV can still provide services, perform GC, and other operations. Do not shut down the TiKV service, its physical server, or delete data files.

Your "leader_count": 0 and "region_count": 562 suggest the region migration simply hasn't finished yet, right?
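For example, a minimal sketch (assuming pd-ctl is reachable through tiup ctl; <pd_addr> and <store_id> are placeholders for your PD endpoint and the offline store's ID, and the ctl version should match your cluster) to watch whether those counts are still dropping:

 # Print the store's state plus remaining leader/region counts every 30 seconds
 while true
 do
    tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 store <store_id> \
       | jq '{state: .store.state_name, leaders: .status.leader_count, regions: .status.region_count}'
    sleep 30
 done

If region_count keeps shrinking, the scale-in is still making progress; if it stays flat for a long time, something is blocking the scheduling.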

| username: 最强王者 | Original post link

It has been stuck in Offline for almost a week, and the logs report that the leader cannot be found. Could it be caused by region corruption?

| username: xmlianfeng | Original post link

Check the offline store progress and the estimated time in the PD panel.
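If the Grafana panel isn't handy, a rough pd-ctl alternative (same placeholder assumptions as above) is to ask PD directly which regions still have peers on offline stores:

 # Regions that still have a peer on an Offline (pending offline) store
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 region check offline-peer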

| username: TiDBer_小阿飞 | Original post link

Then follow the documentation from the earlier reply to restore the store to Up, manually schedule the leader_count/region_count transfers, and then try taking it offline again!
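A sketch of that step, assuming a v6.x cluster where pd-ctl supports cancelling the pending delete (older releases used the PD HTTP API instead; placeholders as above):

 # Cancel the pending offline so the store state goes back to Up
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 store cancel-delete <store_id>

 # On older clusters the PD HTTP API was used to flip the state back to Up
 # curl -X POST "http://<pd_addr>:2379/pd/api/v1/store/<store_id>/state?state=Up"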

| username: 最强王者 | Original post link

Okay, I’ll give it a try.

| username: 最强王者 | Original post link

Okay, I’ll go take a look.

| username: 小龙虾爱大龙虾 | Original post link

If you still can’t find the reason, you can collect the monitoring data and I’ll help you take a look.

| username: 最强王者 | Original post link

Okay, I will first set it to “up” as suggested above and manually adjust the leader_count/region_count.

| username: 最强王者 | Original post link

Teacher, the region currently has no leader. How can it be migrated?

| username: h5n1 | Original post link

Look at the document link posted earlier.
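It may also help to first confirm what PD thinks of one of those regions (a sketch; the region ID would come from the offline store's region list, placeholders as above):

 # Show one region's peers and its current leader as PD sees it
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 region <region_id>

 # Cross-check peers PD reports as down or pending
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 region check down-peer
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 region check pending-peer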

| username: 最强王者 | Original post link

Okay, thank you.

| username: 最强王者 | Original post link

Hello, how can I manually schedule regions?

| username: 最强王者 | Original post link

The documentation is not very clear.

| username: 最强王者 | Original post link

[The original reply contains only a screenshot.]

| username: 像风一样的男子 | Original post link

The steps in that document batch-migrate the regions off the store in a loop, roughly like this:

 # Store IDs of the TiKV nodes being taken offline (get them from `pd-ctl store`)
 store_list='store1 store2...'
 for i in $store_list
 do
    # Every region that still has a peer on store $i
    for j in $(pd-ctl region store "$i" | jq -r '.regions[].id')
    do
       # Schedule removal of that region's peer from store $i
       pd-ctl operator add remove-peer "$j" "$i"
    done
    # Check the remaining region_count on the store
    pd-ctl store "$i"
 done

Here j is the region ID and i is the ID of the store the peer is removed from.
The underlying command is:
pd-ctl operator add remove-peer <region_id> <store_id>
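After adding an operator you can confirm it was accepted and watch it complete, roughly like this (placeholders as above; region 1 and store 2 are just example IDs):

 # Example: remove Region 1's peer on store 2, then list pending operators
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 operator add remove-peer 1 2
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 operator show
 # Check the operator attached to a specific region
 tiup ctl:v6.5.2 pd -u http://<pd_addr>:2379 operator check 1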