Offline Store Node Cannot Be Decommissioned

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Offline的Store节点无法下线

| username: TiDBer_ugQyGgdL

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] After a machine went down, its store became Offline. pd-ctl operator add remove-peer was used to remove all peers on that store, but three peers could not be removed and were stuck on "cannot build operator for region which is in joint state". As a result, the StatefulSet pod cannot rejoin the cluster (see the command sketch at the end of this post).
[Encountered Problem: Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
The offline progress is stuck at 99.99%
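
For reference, a minimal sketch of the pd-ctl commands involved at this stage; the PD address, store ID, and region ID are placeholders rather than values from this cluster:

# Check the store that is stuck in Offline state
pd-ctl -u <pd_addr> store <store_id>
# List the regions that still have a peer on that store
pd-ctl -u <pd_addr> region store <store_id>
# Ask PD to schedule removal of a remaining peer from one region
pd-ctl -u <pd_addr> operator add remove-peer <region_id> <store_id>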

| username: h5n1 | Original post link

Try using tikv-ctl recreate-region to rebuild these regions.
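
A sketch of how recreate-region is typically invoked, assuming the tikv-server process on that node is stopped first; the data directory, PD address, and region ID are placeholders, and the exact flags depend on the tikv-ctl version:

# Run locally on the stopped TiKV node
tikv-ctl --data-dir /path/to/tikv recreate-region -p <pd_addr>:2379 -r <region_id>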

| username: TiDBer_ugQyGgdL | Original post link

The machine hosting one of the TiKV nodes went down abnormally, and its pod has been unable to start; tikv-ctl cannot connect to that node. The newly launched pod keeps reporting the error: duplicated store address.

| username: TiDBer_jYQINSnf | Original post link

There is an IP address conflict. You need to delete the corresponding old store to resolve it.
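
A minimal sketch of deleting the old store record with pd-ctl, assuming the old store ID has already been identified; the PD address and store ID are placeholders:

# Find the old store that still occupies the address
pd-ctl -u <pd_addr> store
# Mark the old store for deletion; PD then migrates its peers and tombstones it
pd-ctl -u <pd_addr> store delete <store_id>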

| username: TiDBer_ugQyGgdL | Original post link

The old store was already deleted through pd-ctl store delete, but it has been stuck in the Offline state and never finishes going offline.

| username: TiDBer_ugQyGgdL | Original post link

{
  "count": 3,
  "regions": [
    {
      "id": 162617,
      "start_key": "313939333432333538343735333430363435383A4D54544E657773436F6E74656E74",
      "end_key": "313939333632333030353936373134323033313A4D54544E657773536E6170",
      "epoch": {
        "conf_ver": 24,
        "version": 17
      },
      "peers": [
        {
          "id": 162620,
          "store_id": 5,
          "role": 3,
          "role_name": "DemotingVoter"
        },
        {
          "id": 275614,
          "store_id": 274150,
          "role_name": "Voter"
        },
        {
          "id": 287072,
          "store_id": 274156,
          "role_name": "Voter"
        },
        {
          "id": 321305,
          "store_id": 274157,
          "role": 2,
          "role_name": "IncomingVoter"
        }
      ],
      "leader": {
        "id": 162620,
        "store_id": 5,
        "role": 3,
        "role_name": "DemotingVoter"
      },
      "down_peers": [
        {
          "down_seconds": 32596,
          "peer": {
            "id": 275614,
            "store_id": 274150,
            "role_name": "Voter"
          }
        }
      ],
      "written_bytes": 37703,
      "read_bytes": 330190,
      "written_keys": 18,
      "read_keys": 199,
      "approximate_size": 93,
      "approximate_keys": 0
    },
    {
      "id": 239166,
      "start_key": "32663062633333646638336639323631396266623039653133373739383863623A4D54545573657250726F66696C655633",
      "end_key": "32663062633333656638313363373365396238646436653133373739383863625F636F6D706C6574655F706C61793A4D545455736572416374696F6E3A766572",
      "epoch": {
        "conf_ver": 48,
        "version": 17
      },
      "peers": [
        {
          "id": 239169,
          "store_id": 46,
          "role": 3,
          "role_name": "DemotingVoter"
        },
        {
          "id": 300223,
          "store_id": 274150,
          "role_name": "Voter"
        },
        {
          "id": 319711,
          "store_id": 274149,
          "role_name": "Voter"
        },
        {
          "id": 319794,
          "store_id": 274144,
          "role": 2,
          "role_name": "IncomingVoter"
        }
      ],
      "leader": {
        "id": 239169,
        "store_id": 46,
        "role": 3,
        "role_name": "DemotingVoter"
      },
      "down_peers": [
        {
          "down_seconds": 32583,
          "peer": {
            "id": 300223,
            "store_id": 274150,
            "role_name": "Voter"
          }
        }
      ],
      "pending_peers": [
        {
          "id": 300223,
          "store_id": 274150,
          "role_name": "Voter"
        }
      ],
      "written_bytes": 204126,
      "read_bytes": 361689,
      "written_keys": 114,
      "read_keys": 668,
      "approximate_size": 142,
      "approximate_keys": 0
    },
    {
      "id": 110081,
      "start_key": "393232303531363731313230313232313233303A4D54544E657773537461746963",
      "end_key": "393232303639323239323832383139313333383A4D54544E657773436F6E74656E74",
      "epoch": {
        "conf_ver": 58,
        "version": 21
      },
      "peers": [
        {
          "id": 110082,
          "store_id": 33,
          "role": 3,
          "role_name": "DemotingVoter"
        },
        {
          "id": 110174,
          "store_id": 15,
          "role_name": "Voter"
        },
        {
          "id": 306531,
          "store_id": 274150,
          "role_name": "Voter"
        },
        {
          "id": 320541,
          "store_id": 274146,
          "role": 2,
          "role_name": "IncomingVoter"
        }
      ],
      "leader": {
        "id": 320541,
        "store_id": 274146,
        "role": 2,
        "role_name": "IncomingVoter"
      },
      "down_peers": [
        {
          "down_seconds": 32586,
          "peer": {
            "id": 306531,
            "store_id": 274150,
            "role_name": "Voter"
          }
        }
      ],
      "written_bytes": 57045,
      "read_bytes": 291690,
      "written_keys": 31,
      "read_keys": 194,
      "approximate_size": 50,
      "approximate_keys": 0
    }
  ]
}

| username: h5n1 | Original post link

curl -X DELETE http://${HostIP}:2379/pd/api/v1/admin/cache/region/{region_id}. This clears the region information cached on the PD side; try it to deal with the remaining regions.
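
For example, a sketch of applying this to the three region IDs shown above (the PD host is a placeholder):

for region_id in 162617 239166 110081; do
  curl -X DELETE "http://<pd_host>:2379/pd/api/v1/admin/cache/region/${region_id}"
done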

| username: 考试没答案 | Original post link

Is the store still not offline??? If the delete is unsuccessful, confirm that you can use the unsafe recovery command.

| username: 考试没答案 | Original post link

Since it is a high-availability cluster, there shouldn’t be a need to clean up regions, right?

| username: TiDBer_ugQyGgdL | Original post link

The regions themselves are normal: the data is live and the leader is on other peers. What needs to be cleaned up is the peer on the offline store, not the region.

| username: h5n1 | Original post link

The region is already abnormal. If the Leader is normal, you can resolve it by removing the peer as mentioned earlier. Clearing the PD region information does not mean deleting the region. The region on the PD side is just a cache. Check if the current region leader is normal.
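
A sketch of checking a region's current leader with pd-ctl, using one of the three region IDs above; the PD address is a placeholder:

pd-ctl -u <pd_addr> region 162617
# The "leader" field in the output shows which peer (and store_id) currently holds the leader role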

| username: TiDBer_ugQyGgdL | Original post link

After executing curl -X DELETE for these three regions, it returned "The region is removed from server cache", but the three regions are still returned by region store 274150.

| username: TiDBer_ugQyGgdL | Original post link

pd-ctl -u <pd_addr> unsafe remove-failed-stores 274150
pd-ctl -u <pd_addr> unsafe remove-failed-stores show
Checking again afterwards with region store 274150 shows no changes.

[
  {
    "info": "Unsafe recovery enters collect report stage: failed stores 274150",
    "time": "2023-03-22 02:08:44.691"
  },
  {
    "info": "Unsafe recovery finished",
    "time": "2023-03-22 02:08:55.895"
  }
]
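
For completeness, a sketch of how the result could be verified afterwards; the PD address is a placeholder:

# If the recovery had taken effect, this would no longer list the three stuck regions
pd-ctl -u <pd_addr> region store 274150
# And the failed store would eventually move towards Tombstone
pd-ctl -u <pd_addr> store 274150
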
| username: TiDBer_ugQyGgdL | Original post link

tikv-ctl recreate-region error:
[2023/03/22 02:26:45.596 +00:00] [WARN] [config.rs:648] ["compaction guard is disabled due to region info provider not available"]
[2023/03/22 02:26:45.596 +00:00] [WARN] [config.rs:756] ["compaction guard is disabled due to region info provider not available"]
[2023/03/22 02:26:45.611 +00:00] [ERROR] [executor.rs:1092] ["error while open kvdb: Storage Engine IO error: While lock file: /var/lib/tikv/db/LOCK: Resource temporarily unavailable"]
[2023/03/22 02:26:45.611 +00:00] [ERROR] [executor.rs:1095] ["LOCK file conflict indicates TiKV process is running. Do NOT delete the LOCK file and force the command to run. Doing so could cause data corruption."]
| username: h5n1 | Original post link

To try this, you need to stop the TiKV instances hosting these three regions, and then run the command on a normal TiKV node. It’s not easy to stop a specific TiKV in k8s, right?

| username: TiDBer_ugQyGgdL | Original post link

Yes. Because it’s a k8s pod… it’s not possible to operate like that. It was deployed using TiDB Operator.
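
If the installed TiDB Operator version supports its pod debug mode, one possible way to keep a TiKV pod running without tikv-server (so tikv-ctl can be run inside it) is sketched below; this is an assumption about the environment, so check the Operator documentation for your version before trying it. Namespace and pod name are placeholders:

# Annotate the pod so that, on its next container restart, it enters debug mode
# (the container sleeps instead of starting tikv-server)
kubectl annotate pod <tikv_pod> -n <namespace> runmode=debug
# Trigger a restart of the tikv container
kubectl exec <tikv_pod> -n <namespace> -c tikv -- kill -SIGTERM 1
# Once it is back in debug mode, exec in and run tikv-ctl against the local data directory
kubectl exec -it <tikv_pod> -n <namespace> -c tikv -- sh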

| username: h5n1 | Original post link

Then we can only proceed after stopping everything.

| username: TiDBer_ugQyGgdL | Original post link

Then the online business would be unavailable… We can’t operate like that. There are live read and write requests, so we can’t shut everything down. The impact from these regions is relatively small.

| username: TiDBer_jYQINSnf | Original post link

Does your cluster have any special characteristics? Why is there an incoming voter? I’ve never seen such a peer. Is it a 3-replica setup?

| username: h5n1 | Original post link

It’s the intermediate state of demoting the current leader and promoting another peer to leader; that’s where the DemotingVoter/IncomingVoter roles come from.