Scaling Down One of Three TiKV Nodes Results in Stuck Pending Offline Status

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 3个tikv缩容一个,结果就卡在这里一直处于 Pending Offline 状态

| username: zhimadi

[TiDB Usage Environment] Testing
[TiDB Version] v5.4.3
[Reproduction Path] Executed a scale-in operation.
[Encountered Issue: Issue Phenomenon and Impact]
After scaling in TiKV, the node's status in tiup remains Pending Offline. The scale-in command was:
tiup cluster scale-in zmd-cluster --node 10.0.0.40:20160
The following command shows that the corresponding node's store status is Offline:
tiup ctl:v5.4.3 pd -u http://10.0.0.50:2379 store

{
  "store": {
    "id": 616763,
    "address": "10.0.0.40:20160",
    "state": 1,
    "version": "5.4.3",
    "status_address": "10.0.0.40:20180",
    "git_hash": "deb149e42d97743349277ff9741f5cb9ae1c027d",
    "start_timestamp": 1702273522,
    "deploy_path": "/tidb/deploy/tikv-20160/bin",
    "last_heartbeat": 1702274328734343383,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "915.9GiB",
    "available": "758.1GiB",
    "used_size": "97.26GiB",
    "leader_count": 18,
    "leader_weight": 1,
    "leader_score": 18,
    "leader_size": 18,
    "region_count": 121772,
    "region_weight": 1,
    "region_score": 877627.021699191,
    "region_size": 650414,
    "slow_score": 1,
    "start_ts": "2023-12-11T13:45:22+08:00",
    "last_heartbeat_ts": "2023-12-11T13:58:48.734363383+08:00",
    "uptime": "13m26.734363383s"
  }
}
There are only 3 TiKV nodes, and we need to take one offline, but it is stuck in the Pending Offline state and cannot be removed. How should this be handled?
There are currently no spare machines to scale out with. How can the node be restored?

| username: tidb菜鸟一只 | Original post link

With 3 replicas and only 3 TiKV nodes, you cannot scale in by removing a node: PD has no other store to move each Region's third replica to, so the offline operation can never finish. You must scale out first and then scale in.
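
A quick check of the replica setting, assuming the PD address used earlier in the thread:

tiup ctl:v5.4.3 pd -u http://10.0.0.50:2379 config show replication

If this reports "max-replicas": 3 while only 3 stores exist, the stuck Pending Offline state is expected.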

| username: zhimadi | Original post link

In that case, may I ask the experts how to bring it back online? :sob:

| username: xingzhenxiang | Original post link

Try scaling up.

| username: 像风一样的男子 | Original post link

Is it a test environment? Then feel free to mess around: change the cluster to a single replica, let the scale-in complete, then scale a node back out in place, and finally change it back to 3 replicas.
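
A minimal sketch of that route, assuming the PD address used earlier in the thread. Note that max-replicas 1 removes all data redundancy, so only do this on a cluster you can afford to lose:

tiup ctl:v5.4.3 pd -u http://10.0.0.50:2379 config set max-replicas 1
# wait for the scale-in to complete and scale the node back out, then:
tiup ctl:v5.4.3 pd -u http://10.0.0.50:2379 config set max-replicas 3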

| username: Jolyne | Original post link

Write a scale-out topology file for the machine you just scaled in and expand it again, but change the ports to avoid conflicts with the original instance that is still going offline.
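
A hypothetical scale-out.yaml along those lines; the ports and directories here are illustrative and must all differ from the old instance that is still pending offline:

tikv_servers:
  - host: 10.0.0.40
    port: 20161          # the old instance still occupies 20160
    status_port: 20181   # the old instance still occupies 20180
    deploy_dir: /tidb/deploy/tikv-20161
    data_dir: /tidb/data/tikv-20161

Then apply it with: tiup cluster scale-out zmd-cluster scale-out.yaml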

| username: 小龙虾爱大龙虾 | Original post link

You only need to use the pd-ctl tool and execute the following command:

store cancel-delete 616763
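
For example, against the PD endpoint used earlier in the thread. If your pd-ctl build does not include the cancel-delete subcommand, the PD HTTP API shown below achieves the same thing by setting the store state back to Up:

tiup ctl:v5.4.3 pd -u http://10.0.0.50:2379 store cancel-delete 616763

curl -X POST 'http://10.0.0.50:2379/pd/api/v1/store/616763/state?state=Up'

Either way, the store should report state_name "Up" again afterwards.
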
| username: 路在何chu | Original post link

Scale out first; after that, the database can be used normally.

| username: zhanggame1 | Original post link

For handling abnormal TiKV scale-in, see the steps here: 专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区

| username: Kongdom | Original post link

It is recommended to scale out by adding a node first, and then scale in.

| username: TiDBer_小阿飞 | Original post link

Scale down, then scale up, and repeat :joy:

| username: 小龙虾爱大龙虾 | Original post link

There's no need to scale back out, because the node never actually went offline in the first place. :joy_cat:

| username: Kongdom | Original post link

There is an order to follow. When there are only three nodes, you need to scale out first and then scale in.

| username: zhimadi | Original post link

Tried it, got the error: Error: port conflict for '20160' between 'tikv_servers:10.0.0.40.port' and 'tikv_servers:10.0.0.40.port'

| username: zhimadi | Original post link

It’s a development environment, so I don’t dare to make changes casually. How can I change it to a single replica? Are there detailed steps?

| username: zhimadi | Original post link

If I change the port, won't this machine end up running two tikv-servers?

| username: 小龙虾爱大龙虾 | Original post link

Isn't your current requirement to restore the node? Just use pd-ctl to handle it.

| username: zhimadi | Original post link

What does this command do?

| username: zhimadi | Original post link

We have no way to scale out; there are no spare machines.

| username: Jolyne | Original post link

That's fine. Once the new instance comes online, the old one will migrate its data over to it.
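
Once the replacement instance is up, you can watch the old store drain and then clean it up, for example (store ID taken from the pd-ctl output earlier in the thread):

# region_count and leader_count on store 616763 should fall toward 0
tiup ctl:v5.4.3 pd -u http://10.0.0.50:2379 store 616763

# after the store reaches Tombstone, remove the stopped instance from the topology
tiup cluster prune zmd-cluster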