After Data Disk Loss Recovery, Forced Scale-In Cannot Delete Scaled-In Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据盘丢失恢复后强制缩容无法删除已缩容节点

| username: Hacker_tESkRka1

After recovering the lost data disks, a forced scale-in could not delete the scaled-in nodes.
[TiDB Usage Environment]
Production environment
[Overview] Scenario + problem description

  1. The disks backing 172.168.1.224:/data2 and 172.168.1.227:/data2 failed, causing the /data2 partitions to disappear and resulting in the loss of the TiKV nodes 172.168.1.224:20172 and 172.168.1.227:20172.

  2. Restarted the servers and remounted to recover the missing /data2, but the nodes failed to start, and the error messages were not retained.

  3. Manually scaled in the nodes with --force (a check of the leftover PD store records for this step is sketched after this list)
    tiup cluster scale-in clustername -N 172.168.1.224:20172 --force
    tiup cluster scale-in clustername -N 172.168.1.227:20172 --force

  4. Re-added the nodes
    Scale-out configuration file: cat scale-out.yml
    tikv_servers:
      - host: 172.168.1.224
        ssh_port: 22
        port: 20172
        status_port: 20182
        deploy_dir: /usr/tidb/deploy/tikv-20172
        data_dir: /data2/tidb/data/tikv-20172
        log_dir: /data2/tidb/log/tikv-20172
        config:
          server.labels:
            host: kv-1-224
        arch: amd64
        os: linux
      - host: 172.168.1.227
        ssh_port: 22
        port: 20172
        status_port: 20182
        deploy_dir: /usr/tidb/deploy/tikv-20172
        data_dir: /data2/tidb/data/tikv-20172
        log_dir: /data2/tidb/log/tikv-20172
        config:
          server.labels:
            host: kv-1-227
        arch: amd64
        os: linux

    Executed the scale-out command: tiup cluster scale-out clustername scale-out.yml
    Scale-out completed, but the nodes failed to start with the following error:
    Error: failed to start tikv: failed to start: 172.168.1.224 tikv-20172.service, please check the instance's log(/data2/tidb/log/tikv-20172) for more detail.: timed out waiting for port 20172 to be started after 2m0s
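
For context, a scale-in with --force removes the instance from the tiup topology immediately, but the store record that PD keeps for that address usually stays behind in the Offline state until its regions are migrated and it reaches Tombstone. A minimal pre-check before reusing the same address:port might look like the following sketch (store ID and PD endpoint taken from later in this thread; a pd-ctl matching the 6.1.0 cluster version is assumed):

# Inspect the leftover store record for 172.168.1.224:20172 (ID as it appears later in this thread)
tiup ctl:v6.1.0 pd -u http://172.168.1.220:2379 store 7290395
# Reusing the address:port is only safe once state_name shows "Tombstone" rather than "Offline"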

Log error information:
[2022/08/23 03:15:05.143 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:491] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "duplicated store address: id:202782520 address:\"172.168.1.224:20172\" labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timestamp:1661195703 deploy_path:\"/usr/tidb/deploy/tikv-20172/bin\" , already registered by id:7290395 address:\"172.168.1.224:20172\" state:Offline labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timestamp:1661177249 deploy_path:\"/usr/tidb/deploy/tikv-20172/bin\" last_heartbeat:1661179953422459862 node_state:Removing ", details: }))"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:500] ["reconnect failed"] [err_code=KV:PD:Unknown] [err="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]

[2022/08/23 03:15:05.142 +08:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://172.168.1.220:2379]
[2022/08/23 03:15:05.142 +08:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:701] ["connected to PD member"] [endpoints=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:218] ["heartbeat sender and receiver are stale, refreshing ..."]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:231] ["buckets sender and receiver are stale, refreshing ..."]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:258] ["update pd client"] [via=] [leader=http://172.168.1.221:2379] [prev_via=] [prev_leader=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:385] ["trying to update PD client done"] [spend=1.43018ms]
[2022/08/23 03:15:05.143 +08:00] [INFO] [tso.rs:157] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2022/08/23 03:15:05.143 +08:00] [INFO] [client.rs:147] ["TSO stream is closed, reconnect to PD"]
[2022/08/23 03:15:05.143 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:491] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "duplicated store address: id:202782520 address:\"172.168.1.224:20172\" labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timesta

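For reference, tiup itself only reports the two-minute port timeout; the actual reason is in the TiKV instance log under the log_dir named in the error. A quick filter such as the following surfaces the duplicated-store error shown above (a sketch; tikv.log is the default file name and may differ if logging was customized):

grep -E 'duplicated store address' /data2/tidb/log/tikv-20172/tikv.log | tail -n 5
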
Checking the store information shows that the removed nodes still exist:
tiup ctl:v5.0.1 pd -u http://172.168.1.220:2379 store
{
  "store": {
    "id": 7290395,
    "address": "172.168.1.224:20172",
    "state": 1,
    "labels": [
      {
        "key": "host",
        "value": "kv-1-224"
      }
    ],
    "version": "6.1.0",
    "status_address": "172.168.1.224:20182",
    "git_hash": "080d086832ae5ce2495352dccaf8df5d40f30687",
    "start_timestamp": 1661177249,
    "deploy_path": "/usr/tidb/deploy/tikv-20172/bin",
    "last_heartbeat": 1661179953422459862,
    "node_state": 2,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 1,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 0,
    "region_weight": 1,
    "region_score": 0,
    "region_size": 0,
    "slow_score": 0,
    "start_ts": "2022-08-22T22:07:29+08:00",
    "last_heartbeat_ts": "2022-08-22T22:52:33.422459862+08:00",
    "uptime": "45m4.422459862s"
  }
}

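With more than twenty stores in this cluster, the full store listing is long. A filtered query against the PD API (a sketch; assumes curl and jq are available on a host that can reach PD) narrows the output to the stores that are not Up:

curl -s http://172.168.1.220:2379/pd/api/v1/stores | jq '.stores[] | select(.store.state_name != "Up") | {id: .store.id, address: .store.address, state: .store.state_name, regions: .status.region_count}'
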
Executing the manual delete command reports success, but the node is not actually deleted:
tiup ctl:v5.0.1 pd -u http://172.168.1.221:2379 store delete 7290395
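
This matches the usual pd-ctl behavior: store delete only marks the store as Offline and returns success immediately; the record is removed (Tombstone) only after PD has finished migrating or cleaning up its regions. A hedged follow-up sequence, assuming a pd-ctl that matches the 6.1.0 cluster rather than the v5.0.1 ctl used above, would be something like:

# Re-check the store until state_name changes from "Offline" to "Tombstone"
tiup ctl:v6.1.0 pd -u http://172.168.1.221:2379 store 7290395
# Once it is Tombstone, clear the tombstone record from PD
tiup ctl:v6.1.0 pd -u http://172.168.1.221:2379 store remove-tombstone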

[Background] Operations performed

[Phenomenon] Business and database phenomena
After the failure, database reads and writes were abnormal overall. After restarting the entire cluster, reads returned to normal, but writes suffered severe timeouts of more than 4 minutes.

[Business Impact]
Business cannot write normally
[TiDB Version]
6.1.0

tiup cluster display clustername
172.168.1.220:2379 pd 172.168.1.220 2379/2380 linux/x86_64 Up /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.1.221:2379 pd 172.168.1.221 2379/2380 linux/x86_64 Up|L|UI /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.1.222:2379 pd 172.168.1.222 2379/2380 linux/x86_64 Up /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.3.81:9090 prometheus 172.168.3.81 9090/12020 linux/x86_64 Up /data/tidb/data/prometheus-9090 /usr/tidb/deploy/prometheus-9090
172.168.1.220:4000 tidb 172.168.1.220 4000/10080 linux/x86_64 Up - /usr/tidb/deploy/tidb-4000
172.168.1.221:4000 tidb 172.168.1.221 4000/10080 linux/x86_64 Up - /usr/tidb/deploy/tidb-4000
172.168.1.221:4001 tidb 172.168.1.221 4001/10081 linux/x86_64 Up - /usr/tidb/deploy/tidb-4001
172.168.1.222:4000 tidb 172.168.1.222 4000/10080 linux/x86_64 Up - /usr/tidb/deploy/tidb-4000
172.168.1.222:4001 tidb 172.168.1.222 4001/10081 linux/x86_64 Up - /usr/tidb/deploy/tidb-4001
172.168.1.223:20171 tikv 172.168.1.223 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.223:20172 tikv 172.168.1.223 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.224:20171 tikv 172.168.1.224 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.224:20172 tikv 172.168.1.224 20172/20182 linux/x86_64 Offline /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.225:20171 tikv 172.168.1.225 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.225:20172 tikv 172.168.1.225 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.226:20171 tikv 172.168.1.226 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.226:20172 tikv 172.168.1.226 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.227:20171 tikv 172.168.1.227 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.228:20171 tikv 172.168.1.228 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.228:20172 tikv 172.168.1.228 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.2.223:20171 tikv 172.168.2.223 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.2.224:20171 tikv 172.168.2.224 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171

| username: xfworld | Original post link

  1. Manually scale in nodes (using --force)
    tiup cluster scale-in clustername -N 172.168.1.224:20172 --force
    tiup cluster scale-in clustername -N 172.168.1.227:20172 --force

Is this step completed?

I see from the final output that a single machine is running multiple TiKV instances...

It still hasn't been taken offline successfully, right?

| username: Hacker_tESkRka1 | Original post link

The manual forced scale-in reported success. My TiKV deployment runs two instances per machine, and when I query with the following command, I can see that the stores corresponding to the two scaled-in nodes are still in the Offline state:
tiup ctl:v5.0.1 pd -u http://172.168.1.220:2379 store

| username: xfworld | Original post link

Try this command

tiup cluster prune cluster-name
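
As background, tiup cluster prune destroys instances whose stores have already reached Tombstone in PD and removes them from the cluster metadata, so it only takes effect once the Offline stores above finish transitioning. A rough sequence, using the names from this thread, might be:

# Confirm the stuck stores show state_name "Tombstone"
tiup ctl:v6.1.0 pd -u http://172.168.1.220:2379 store
# Then clean up the tombstoned instances and update the topology
tiup cluster prune clustername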