Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: 数据盘丢失恢复后强制缩容无法删除已缩容节点
After recovering lost data disks and forcing a scale-in, the scaled-in nodes cannot be deleted
[TiDB Usage Environment]
Production environment
[Overview] Overview + Problem Description
- Disks 172.168.1.224:/data2 and 172.168.1.227:/data2 failed. The /data2 partitions disappeared, and the TiKV nodes 172.168.1.224:20172 and 172.168.1.227:20172 were lost.
- Restarted the servers and remounted the missing /data2, but the nodes failed to start. The error messages from that attempt were not retained.
- Manually scaled in the nodes (using --force):
tiup cluster scale-in clustername -N 172.168.1.224:20172
tiup cluster scale-in clustername -N 172.168.1.227:20172
- Re-added the nodes.
Scale-out configuration file: cat scale-out.yml
tikv_servers:
  - host: 172.168.1.224
    ssh_port: 22
    port: 20172
    status_port: 20182
    deploy_dir: /usr/tidb/deploy/tikv-20172
    data_dir: /data2/tidb/data/tikv-20172
    log_dir: /data2/tidb/log/tikv-20172
    config:
      server.labels:
        host: kv-1-224
    arch: amd64
    os: linux
  - host: 172.168.1.227
    ssh_port: 22
    port: 20172
    status_port: 20182
    deploy_dir: /usr/tidb/deploy/tikv-20172
    data_dir: /data2/tidb/data/tikv-20172
    log_dir: /data2/tidb/log/tikv-20172
    config:
      server.labels:
        host: kv-1-227
    arch: amd64
    os: linux
Executed the scale-out command: tiup cluster scale-out clustername scale-out.yml
Scale-out completed, but the nodes failed to start with the following error:
Error: failed to start tikv: failed to start: 172.168.1.224 tikv-20172.service, please check the instance's log(/data2/tidb/log/tikv-20172) for more detail.: timed out waiting for port 20172 to be started after 2m0s
Log error information:
[2022/08/23 03:15:05.143 +08:00] [WARN] [client.rs:149] ["failed to update PD client"] [error="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:491] ["request failed"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: "duplicated store address: id:202782520 address:\"172.168.1.224:20172\" labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timestamp:1661195703 deploy_path:\"/usr/tidb/deploy/tikv-20172/bin\" , already registered by id:7290395 address:\"172.168.1.224:20172\" state:Offline labels:<key:\"host\" value:\"kv-210-224\" > version:\"6.1.0\" status_address:\"172.168.1.224:20182\" git_hash:\"080d086832ae5ce2495352dccaf8df5d40f30687\" start_timestamp:1661177249 deploy_path:\"/usr/tidb/deploy/tikv-20172/bin\" last_heartbeat:1661179953422459862 node_state:Removing ", details: }))"]
[2022/08/23 03:15:05.144 +08:00] [ERROR] [util.rs:500] ["reconnect failed"] [err_code=KV:PD:Unknown] [err="Other("[components/pd_client/src/util.rs:334]: cancel reconnection due to too small interval")"]
[2022/08/23 03:15:05.142 +08:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://172.168.1.220:2379]
[2022/08/23 03:15:05.142 +08:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:701] ["connected to PD member"] [endpoints=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:218] ["heartbeat sender and receiver are stale, refreshing ..."]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:231] ["buckets sender and receiver are stale, refreshing ..."]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:258] ["update pd client"] [via=] [leader=http://172.168.1.221:2379] [prev_via=] [prev_leader=http://172.168.1.221:2379]
[2022/08/23 03:15:05.143 +08:00] [INFO] [util.rs:385] ["trying to update PD client done"] [spend=1.43018ms]
[2022/08/23 03:15:05.143 +08:00] [INFO] [tso.rs:157] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2022/08/23 03:15:05.143 +08:00] [INFO] [client.rs:147] ["TSO stream is closed, reconnect to PD"]
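The key line in this log is the "duplicated store address" error: PD refuses to register the new store because an older store ID still holds the same address. As a rough sketch (not part of TiKV or PD), the snippet below pulls the two store IDs and the old store's state out of such a message. The function name `parse_duplicate_store` and the regular expression are assumptions written against the message shape shown above, and the sample string is an abbreviated copy of that log line.

```python
import re

# Assumption: the message follows the shape seen in the TiKV log above.
# Quotes may or may not be backslash-escaped depending on how the log was copied.
_DUPLICATE_STORE = re.compile(
    r'duplicated store address: id:(?P<new_id>\d+) address:\\?"(?P<address>[^"\\]+)\\?"'
    r'.*?already registered by id:(?P<old_id>\d+)'
    r'.*?\bstate:(?P<old_state>\w+)',  # \b keeps "node_state:" from matching
    re.DOTALL,
)


def parse_duplicate_store(message):
    """Return the conflicting store IDs from a 'duplicated store address' message.

    Hypothetical helper for illustration; returns None if the message
    does not look like the error shown in the log.
    """
    match = _DUPLICATE_STORE.search(message)
    if match is None:
        return None
    return {
        "new_store_id": int(match.group("new_id")),
        "address": match.group("address"),
        "registered_store_id": int(match.group("old_id")),
        "registered_state": match.group("old_state"),
    }


if __name__ == "__main__":
    # Abbreviated copy of the error message from the log above.
    sample = (
        r'duplicated store address: id:202782520 address:\"172.168.1.224:20172\" '
        r'labels:<key:\"host\" value:\"kv-210-224\" > , '
        r'already registered by id:7290395 address:\"172.168.1.224:20172\" '
        r'state:Offline node_state:Removing '
    )
    print(parse_duplicate_store(sample))
```

The `registered_store_id` it reports is the ID to look up in the PD store list when checking why the old store has not gone away.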
Checking the store information in PD shows that the removed node is still registered:
tiup ctl:v5.0.1 pd -u http://172.168.1.220:2379 store
{
  "store": {
    "id": 7290395,
    "address": "172.168.1.224:20172",
    "state": 1,
    "labels": [
      {
        "key": "host",
        "value": "kv-1-224"
      }
    ],
    "version": "6.1.0",
    "status_address": "172.168.1.224:20182",
    "git_hash": "080d086832ae5ce2495352dccaf8df5d40f30687",
    "start_timestamp": 1661177249,
    "deploy_path": "/usr/tidb/deploy/tikv-20172/bin",
    "last_heartbeat": 1661179953422459862,
    "node_state": 2,
    "state_name": "Offline"
  },
  "status": {
    "capacity": "0B",
    "available": "0B",
    "used_size": "0B",
    "leader_count": 0,
    "leader_weight": 1,
    "leader_score": 0,
    "leader_size": 0,
    "region_count": 0,
    "region_weight": 1,
    "region_score": 0,
    "region_size": 0,
    "slow_score": 0,
    "start_ts": "2022-08-22T22:07:29+08:00",
    "last_heartbeat_ts": "2022-08-22T22:52:33.422459862+08:00",
    "uptime": "45m4.422459862s"
  }
}
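To spot leftover stores like this one without reading the whole listing, the JSON can be filtered by address and state. The following is a minimal sketch, assuming the output has been saved or captured as JSON and that a full listing wraps entries in a top-level "stores" array, while a single entry looks like the object above. The function names `iter_stores` and `stores_not_up` are made up for this example.

```python
import json


def iter_stores(doc):
    """Yield store entries from a full listing or from a single-store object.

    Assumption: a full listing has a top-level "stores" list whose items
    have the same "store"/"status" shape as the single entry shown above.
    """
    if "stores" in doc:
        yield from doc["stores"]
    elif "store" in doc:
        yield doc


def stores_not_up(doc, address=None):
    """Return (id, address, state_name, region_count) for stores that are not Up.

    If address is given, only stores registered at that address are considered.
    """
    found = []
    for entry in iter_stores(doc):
        store = entry["store"]
        if address is not None and store.get("address") != address:
            continue
        if store.get("state_name") != "Up":
            region_count = entry.get("status", {}).get("region_count")
            found.append(
                (store["id"], store.get("address"), store.get("state_name"), region_count)
            )
    return found


if __name__ == "__main__":
    # Trimmed copy of the entry shown above.
    raw = """
    {
      "store": {"id": 7290395, "address": "172.168.1.224:20172", "state_name": "Offline"},
      "status": {"region_count": 0}
    }
    """
    for item in stores_not_up(json.loads(raw), address="172.168.1.224:20172"):
        print(item)
```

A store that stays in a state other than Up at the address being re-added is the one that collides with the new node's registration.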
Running the manual delete command reports success, but the store is not actually removed:
tiup ctl:v5.0.1 pd -u http://172.168.1.221:2379 store delete 7290395
[Background] Operations performed
[Phenomenon] Business and database phenomena
After the failure, both reads and writes across the database were abnormal. After restarting the entire cluster, reads returned to normal, but writes still time out severely, taking more than 4 minutes.
[Business Impact]
The application cannot write to the database normally.
[TiDB Version]
6.1.0
tiup cluster display clustername
ID                   Role        Host           Ports        OS/Arch       Status   Data Dir                         Deploy Dir
172.168.1.220:2379   pd          172.168.1.220  2379/2380    linux/x86_64  Up       /data/tidb/data/pd-2379          /usr/tidb/deploy/pd-2379
172.168.1.221:2379   pd          172.168.1.221  2379/2380    linux/x86_64  Up|L|UI  /data/tidb/data/pd-2379          /usr/tidb/deploy/pd-2379
172.168.1.222:2379   pd          172.168.1.222  2379/2380    linux/x86_64  Up       /data/tidb/data/pd-2379          /usr/tidb/deploy/pd-2379
172.168.3.81:9090    prometheus  172.168.3.81   9090/12020   linux/x86_64  Up       /data/tidb/data/prometheus-9090  /usr/tidb/deploy/prometheus-9090
172.168.1.220:4000   tidb        172.168.1.220  4000/10080   linux/x86_64  Up       -                                /usr/tidb/deploy/tidb-4000
172.168.1.221:4000   tidb        172.168.1.221  4000/10080   linux/x86_64  Up       -                                /usr/tidb/deploy/tidb-4000
172.168.1.221:4001   tidb        172.168.1.221  4001/10081   linux/x86_64  Up       -                                /usr/tidb/deploy/tidb-4001
172.168.1.222:4000   tidb        172.168.1.222  4000/10080   linux/x86_64  Up       -                                /usr/tidb/deploy/tidb-4000
172.168.1.222:4001   tidb        172.168.1.222  4001/10081   linux/x86_64  Up       -                                /usr/tidb/deploy/tidb-4001
172.168.1.223:20171  tikv        172.168.1.223  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.1.223:20172  tikv        172.168.1.223  20172/20182  linux/x86_64  Up       /data2/tidb/data/tikv-20172      /usr/tidb/deploy/tikv-20172
172.168.1.224:20171  tikv        172.168.1.224  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.1.224:20172  tikv        172.168.1.224  20172/20182  linux/x86_64  Offline  /data2/tidb/data/tikv-20172      /usr/tidb/deploy/tikv-20172
172.168.1.225:20171  tikv        172.168.1.225  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.1.225:20172  tikv        172.168.1.225  20172/20182  linux/x86_64  Up       /data2/tidb/data/tikv-20172      /usr/tidb/deploy/tikv-20172
172.168.1.226:20171  tikv        172.168.1.226  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.1.226:20172  tikv        172.168.1.226  20172/20182  linux/x86_64  Up       /data2/tidb/data/tikv-20172      /usr/tidb/deploy/tikv-20172
172.168.1.227:20171  tikv        172.168.1.227  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.1.228:20171  tikv        172.168.1.228  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.1.228:20172  tikv        172.168.1.228  20172/20182  linux/x86_64  Up       /data2/tidb/data/tikv-20172      /usr/tidb/deploy/tikv-20172
172.168.2.223:20171  tikv        172.168.2.223  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
172.168.2.224:20171  tikv        172.168.2.224  20171/20181  linux/x86_64  Up       /data1/tidb/data/tikv-20171      /usr/tidb/deploy/tikv-20171
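In a listing this long, the one instance that is not Up is easy to miss. As an illustrative sketch, the snippet below scans pasted display output and returns the instances whose status is anything other than Up. It assumes each instance row has eight whitespace-separated fields in the order shown above (ID, role, host, ports, OS/arch, status, data dir, deploy dir); the function name `instances_not_up` is invented for this example.

```python
def instances_not_up(display_text):
    """Return (id, role, status) for rows whose status is not Up.

    Assumption: every instance row splits into exactly eight fields and
    its first field is a host:port ID. Other lines are ignored.
    """
    result = []
    for line in display_text.splitlines():
        fields = line.split()
        if len(fields) != 8 or ":" not in fields[0]:
            continue  # header, separator, or unrelated line
        instance_id, role, status = fields[0], fields[1], fields[5]
        # A PD leader shows up as "Up|L|UI"; only the first part is the state.
        if status.split("|")[0] != "Up":
            result.append((instance_id, role, status))
    return result


if __name__ == "__main__":
    # Three rows copied from the output above.
    sample = """\
172.168.1.221:2379 pd 172.168.1.221 2379/2380 linux/x86_64 Up|L|UI /data/tidb/data/pd-2379 /usr/tidb/deploy/pd-2379
172.168.1.224:20171 tikv 172.168.1.224 20171/20181 linux/x86_64 Up /data1/tidb/data/tikv-20171 /usr/tidb/deploy/tikv-20171
172.168.1.224:20172 tikv 172.168.1.224 20172/20182 linux/x86_64 Offline /data2/tidb/data/tikv-20172 /usr/tidb/deploy/tikv-20172
"""
    for row in instances_not_up(sample):
        print(row)
```

In the output above, this would single out 172.168.1.224:20172, which is still listed as Offline, while 172.168.1.227:20172 no longer appears in the listing at all.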