Removing a Completely Offline TiKV Node from the Cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 在集群中清除一台完全离线的tikv节点

| username: magdb

[TiDB Usage Environment] Testing
[TiDB Version] v6.5.1
[Problem Encountered]
In a TiDB cluster that originally consisted of three nodes, I added an additional node and then used tiup to forcibly take one of the TiKV servers offline. The cluster still works normally and the removed node no longer shows up in the TiDB cluster topology, but it is still visible in PD. How can I completely clear this TiKV server's information from PD?
[Attachments: Screenshots/Logs/Monitoring]
Cluster status: Node 8.117:20163 is no longer visible


Still visible in PD, status is offline
Other TiKV nodes report an error: [ERROR] [raft_client.rs:851] ["wait connect timeout"] [addr=192.168.8.117:20163] [store_id=97687]
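For reference, a quick way to inspect the lingering store entry is pd-ctl via tiup. This is only a sketch: the PD address below is a placeholder, and the store ID 97687 is taken from the error log above.

tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 store          # list all stores and their states
tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 store 97687    # show details of the offline store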

| username: h5n1 | Original post link

| username: tidb菜鸟一只 | Original post link

There are still 3 regions that haven’t been moved, right?

| username: magdb | Original post link

Yes, I don’t know how to remove it now.

| username: WalterWj | Original post link

You can shut this node down for 30 minutes and see whether it changes to the Down state. Also, is the migration stuck because the remaining stores don't have enough capacity to hold three replicas of the data?
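For context, the 30 minutes above matches PD's default max-store-down-time (30m), after which PD marks an unreachable store as Down. A hedged way to check the current value (PD address is a placeholder):

tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 config show | grep max-store-down-time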

| username: 像风一样的男子 | Original post link

Has the scale-out of the new node finished? In theory, once three replicas can be placed again, the remaining regions will be migrated off.

| username: 像风一样的男子 | Original post link

Previously, I also ran into insufficient replicas, so I used curl -X DELETE http://0.0.0.0:2379/pd/api/v1/store/storeid?force=true to put the node into the physically_destroyed state. After that, once the three replicas were restored, the remaining regions on the problematic node were migrated away automatically.
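A rough sketch of that sequence, with placeholders for the PD address and store ID (force-deleting a store is irreversible, so double-check the ID first):

# mark the dead store as physically destroyed through the PD API
curl -X DELETE "http://<pd_ip>:2379/pd/api/v1/store/<store_id>?force=true"
# verify the state change afterwards
tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 store <store_id>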

| username: h5n1 | Original post link

First, find the region_id, then use pd-ctl operator add remove-peer.
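A hedged sketch of that workflow, using the dead store ID 97687 from the log above (PD address is a placeholder):

# list regions that still have a peer on the offline store
tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 region store 97687
# for each region_id returned, ask PD to drop the peer on that store
tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 operator add remove-peer <region_id> 97687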

| username: redgame | Original post link

You can use this: curl -X DELETE http://0.0.0.0:2379/pd/api/v1/store/storeid?force=true

| username: magdb | Original post link

First, we expanded to four nodes and migrated for a few days, but the status of this node remained offline. Later, in a hurry, we shut down the node and cleared the data directory. Now the cluster is usable, but the node information can still be seen in PD. :joy:

| username: magdb | Original post link

Okay, I’ll try it first, thank you.

| username: magdb | Original post link

Execution reports "The store is set as Offline."

| username: magdb | Original post link

After execution, two regions were cleared successfully, but one region reported an error: Failed! [500] "cannot build operator for region with no leader"

| username: h5n1 | Original post link

Use pd-ctl region to check the one without a leader, and also see which table this region belongs to.
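Two hedged ways to do that, one through pd-ctl and one through TiDB's information_schema (addresses, credentials, and the region ID are placeholders):

# inspect the region's peers and leader in PD
tiup ctl:v6.5.1 pd -u http://<pd_ip>:2379 region <region_id>
# map the region to its database and table from TiDB
mysql -h <tidb_ip> -P 4000 -u root -p -e "SELECT db_name, table_name, is_index FROM information_schema.tikv_region_status WHERE region_id = <region_id>;"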

| username: magdb | Original post link

It seems there is no leader.

Region 454144
{
  "id": 454144,
  "start_key": "7480000000000012FF3B5F72800000001EFFBC8E7B0000000000FA",
  "end_key": "7480000000000012FF3B5F72800000001EFFC908F90000000000FA",
  "epoch": {
    "conf_ver": 65,
    "version": 4613
  },
  "peers": [
    {
      "id": 454148,
      "store_id": 97434,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 477478,
      "store_id": 452778,
      "role_name": "Voter"
    },
    {
      "id": 477599,
      "store_id": 97687,
      "role_name": "Voter"
    },
    {
      "id": 489038,
      "store_id": 1,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 490656,
      "store_id": 102001,
      "role_name": "Voter"
    }
  ],
  "leader": {
    "role_name": ""
  },
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 0,
  "approximate_keys": 0
}

| username: h5n1 | Original post link

Just recreate-region it directly with tikv-ctl.
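For reference, recreate-region is a tikv-ctl local-mode command that rebuilds the region from scratch (empty) on the node where it is run, so it is normally a last resort when the region's data is already considered lost, and the TiKV instance must be stopped first. A hedged sketch (data path and PD address are placeholders; exact flags can vary by version):

# stop the TiKV instance that will host the recreated region, then:
tikv-ctl --data-dir <tikv_data_dir> recreate-region -p <pd_ip>:2379 -r 454144
# restart the TiKV instance afterwards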

| username: magdb | Original post link

Execution error :smiling_face_with_tear:

| username: h5n1 | Original post link

You can first check if there is still this region information on the TiKV side:
tikv-ctl --db /var/lib/tikv/store/db raft region -r XXXX

| username: magdb | Original post link

Executing this command results in the same error as mentioned above. Should I execute it on the other two TiKV nodes?

| username: h5n1 | Original post link

The TiKV process is still running. tikv-ctl's local mode (--db / --data-dir) only works when the TiKV instance it points at has been stopped.
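For reference, tikv-ctl also has a remote mode that queries a running instance over its service address instead of opening the data files locally, roughly like this (the address is a placeholder for a surviving TiKV node):

tikv-ctl --host <tikv_ip>:20160 raft region -r 454144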