After Data Disk Loss Recovery, Forced Scale-In Cannot Delete Scaled-In Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据盘丢失恢复后强制缩容无法删除已缩容节点

| username: Hacker_tESkRka1

After recovering the lost data disks, a forced scale-in could not delete the scaled-in nodes.
[TiDB Usage Environment]
Production environment
[Overview] Scenario + problem description

  1. The disks backing 172.168.1.224:/data2 and 172.168.1.227:/data2 failed, causing the /data2 partitions to disappear and resulting in the loss of the TiKV nodes 172.168.1.224:20172 and 172.168.1.227:20172.

  2. Restarted the servers and remounted to recover the missing /data2, but the nodes failed to start, and the error messages were not retained.

  3. Manually scaled in the nodes with --force (a check of the leftover PD store records for this step is sketched after this list)
    tiup cluster scale-in clustername -N 172.168.1.224:20172 --force
    tiup cluster scale-in clustername -N 172.168.1.227:20172 --force

  4. Re-added the nodes
    Scale-out configuration file: cat scale-out.yml
    tikv_servers:
      - host: 172.168.1.224
        ssh_port: 22
        port: 20172
        status_port: 20182
        deploy_dir: /usr/tidb/deploy/tikv-20172
        data_dir: /data2/tidb/data/tikv-20172
        log_dir: /data2/tidb/log/tikv-20172
        config:
          server.labels:
            host: kv-1-224
        arch: amd64
        os: linux
      - host: 172.168.1.227
        ssh_port: 22
        port: 20172
        status_port: 20182
        deploy_dir: /usr/tidb/deploy/tikv-20172
        data_dir: /data2/tidb/data/tikv-20172
        log_dir: /data2/tidb/log/tikv-20172
        config:
          server.labels:
            host: kv-1-227
        arch: amd64
        os: linux

    Executed the scale-out command: tiup cluster scale-out clustername scale-out.yml
    Scale-out completed, but the nodes failed to start with the following error:
    Error: failed to start tikv: failed to start: 172.168.1.224 tikv-20172.service, please check the instance's log(/data2/tidb/log/tikv-20172) for more detail.: timed out waiting for port 20172 to be started after 2m0s
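
For context, a scale-in with --force removes the instance from the tiup topology immediately, but the store record that PD keeps for that address usually stays behind in the Offline state until its regions are migrated and it reaches Tombstone. A minimal pre-check before reusing the same address:port might look like the following sketch (store ID and PD endpoint taken from later in this thread; a pd-ctl matching the 6.1.0 cluster version is assumed):

# Inspect the leftover store record for 172.168.1.224:20172 (ID as it appears later in this thread)
tiup ctl:v6.1.0 pd -u http://172.168.1.220:2379 store 7290395
# Reusing the address:port is only safe once state_name shows "Tombstone" rather than "Offline"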

Log error information:
[2022/08/23 03:15:05.143 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:491] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "duplicated store address: id:202782520 address:\"172.168.1.224:20172\" labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timestamp:1661195703 deploy_path:\"/usr/tidb/deploy/tikv-20172/bin\" , already registered by id:7290395 address:\"172.168.1.224:20172\" state:Offline labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timestamp:1661177249 deploy_path:\"/usr/tidb/deploy/tikv-20172/bin\" last_heartbeat:1661179953422459862 node_state:Removing ", details: }))"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:500] ["reconnect failed"] [err_code=KV:PD:Unknown] [err="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]

[2022/08/23 03:15:05.142 +08:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://172.168.1.220:2379]
[2022/08/23 03:15:05.142 +08:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:701] ["connected to PD member"] [endpoints=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:218] ["heartbeat sender and receiver are stale, refreshing ..."]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:231] ["buckets sender and receiver are stale, refreshing ..."]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:258] ["update pd client"] [via=] [leader=http://172.168.1.221:2379] [prev_via=] [prev_leader=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:385] ["trying to update PD client done"] [spend=1.43018ms]
[2022/08/23 03:15:05.143 +08:00] [INFO] [tso.rs:157] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2022/08/23 03:15:05.143 +08:00] [INFO] [client.rs:147] ["TSO stream is closed, reconnect to PD"]
[2022/08/23 03:15:05.143 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:491] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "duplicated store address: id:202782520 address:\"172.168.1.224:20172\" labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timesta

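For reference, tiup itself only reports the two-minute port timeout; the actual reason is in the TiKV instance log under the log_dir named in the error. A quick filter such as the following surfaces the duplicated-store error shown above (a sketch; tikv.log is the default file name and may differ if logging was customized):

grep -E 'duplicated store address' /data2/tidb/log/tikv-20172/tikv.log | tail -n 5
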
Checking the store information shows that the removed nodes still exist:
tiup ctl:v5.0.1 pd -u http://172.168.1.220:2379 store
{
  "store": {
    "id": 7290395,
    "address": "172.168.1.224:20172",
    "state": 1,
    "labels": [
      {
        "key": "host",
        "value": "kv-1-224"
      }
    ],
    "version": "6.1.0",
    "status_address": "172.168.1.224:20182",
    "git_hash": "080d086832ae5ce2495352dccaf8df5d40f30687",
    "start_timestamp": 1661177249,
    "deploy_path": "/usr/tidb/deploy/tikv-20172/bin",
    "last_heartbeat": 1661179953422459862,
    "node_state": 2,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 1,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 0,
    "region_weight": 1,
    "region_score": 0,
    "region_size": 0,
    "slow_score": 0,
    "start_ts": "2022-08-22T22:07:29+08:00",
    "last_heartbeat_ts": "2022-08-22T22:52:33.422459862+08:00",
    "uptime": "45m4.422459862s"
  }
}

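With more than twenty stores in this cluster, the full store listing is long. A filtered query against the PD API (a sketch; assumes curl and jq are available on a host that can reach PD) narrows the output to the stores that are not Up:

curl -s http://172.168.1.220:2379/pd/api/v1/stores | jq '.stores[] | select(.store.state_name != "Up") | {id: .store.id, address: .store.address, state: .store.state_name, regions: .status.region_count}'
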
Executing the manual delete command reports success, but the node is not actually deleted:
tiup ctl:v5.0.1 pd -u http://172.168.1.221:2379 store delete 7290395
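
This matches the usual pd-ctl behavior: store delete only marks the store as Offline and returns success immediately; the record is removed (Tombstone) only after PD has finished migrating or cleaning up its regions. A hedged follow-up sequence, assuming a pd-ctl that matches the 6.1.0 cluster rather than the v5.0.1 ctl used above, would be something like:

# Re-check the store until state_name changes from "Offline" to "Tombstone"
tiup ctl:v6.1.0 pd -u http://172.168.1.221:2379 store 7290395
# Once it is Tombstone, clear the tombstone record from PD
tiup ctl:v6.1.0 pd -u http://172.168.1.221:2379 store remove-tombstone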

[Background] Operations performed

[Phenomenon] Business and database phenomena
After the failure, database reads and writes were abnormal overall. After restarting the entire cluster, reads returned to normal, but writes suffered severe timeouts of more than 4 minutes.

[Business Impact]
Business cannot write normally
[TiDB Version]
6.1.0

tiup cluster display clustername
172.168.1.220:2379 pd 172.168.1.220 2379/2380 linux/x86_64 Up /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.1.221:2379 pd 172.168.1.221 2379/2380 linux/x86_64 Up|L|UI /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.1.222:2379 pd 172.168.1.222 2379/2380 linux/x86_64 Up /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.3.81:9090 prometheus 172.168.3.81 9090/12020 linux/x86_64 Up /data/tidb/data/prometheus-9090 /usr/tidb/deploy/prometheus-9090
172.168.1.220:4000 tidb 172.168.1.220 4000/10080 linux/x86_64 Up - /usr/tidb/deploy/tidb-4000
172.168.1.221:4000 tidb 172.168.1.221 4000/10080 linux/x86_64 Up - /usr/tidb/deploy/tidb-4000
172.168.1.221:4001 tidb 172.168.1.221 4001/10081 linux/x86_64 Up - /usr/tidb/deploy/tidb-4001
172.168.1.222:4000 tidb 172.168.1.222 4000/10080 linux/x86_64 Up - /usr/tidb/deploy/tidb-4000
172.168.1.222:4001 tidb 172.168.1.222 4001/10081 linux/x86_64 Up - /usr/tidb/deploy/tidb-4001
172.168.1.223:20171 tikv 172.168.1.223 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.223:20172 tikv 172.168.1.223 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.224:20171 tikv 172.168.1.224 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.224:20172 tikv 172.168.1.224 20172/20182 linux/x86_64 Offline /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.225:20171 tikv 172.168.1.225 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.225:20172 tikv 172.168.1.225 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.226:20171 tikv 172.168.1.226 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.226:20172 tikv 172.168.1.226 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.1.227:20171 tikv 172.168.1.227 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.228:20171 tikv 172.168.1.228 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.228:20172 tikv 172.168.1.228 20172/20182 linux/x86_64 Up /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
172.168.2.223:20171 tikv 172.168.2.223 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.2.224:20171 tikv 172.168.2.224 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171

| username: xfworld | Original post link

  1. Manually scale in nodes (using --force)
    tiup cluster scale-in clustername -N 172.168.1.224:20172 --force
    tiup cluster scale-in clustername -N 172.168.1.227:20172 --force

Is this step completed?

I see from the final output that a single machine is running multiple TiKV instances...

It still hasn't been taken offline successfully, right?

| username: Hacker_tESkRka1 | Original post link

The manual forced scale-in reported success. My TiKV deployment runs two instances per machine, and when I query with the following command, I can see that the stores corresponding to the two scaled-in nodes are still in the Offline state:
tiup ctl:v5.0.1 pd -u http://172.168.1.220:2379 store

| username: xfworld | Original post link

Try this command

tiup cluster prune cluster-name
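
As background, tiup cluster prune destroys instances whose stores have already reached Tombstone in PD and removes them from the cluster metadata, so it only takes effect once the Offline stores above finish transitioning. A rough sequence, using the names from this thread, might be:

# Confirm the stuck stores show state_name "Tombstone"
tiup ctl:v6.1.0 pd -u http://172.168.1.220:2379 store
# Then clean up the tombstoned instances and update the topology
tiup cluster prune clustername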