What happens if one TiKV data directory is accidentally deleted in a 3 TiKV, 3 replica setup?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 3个tikv,3副本,误删除一个tikv数据目录,会发生什么?

| username: DBRE

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.2
Operation:
The cluster has 3 TiKV nodes and 3 replicas. If the data directory of one TiKV node is deleted by mistake, what will happen?

Questions:

  1. Will there be data loss?
  2. Will application read/write requests still work normally?
  3. How should it be recovered? Force the mistakenly deleted TiKV node offline, then scale out a new TiKV node?
| username: h5n1 | Original post link

  1. No data will be lost, because a majority of the replicas (2 of 3) is still available.
  2. Read and write requests are generally normal for the same reason, barring anomalies or bugs. There may be a short-term performance drop, because requests go through backoff retries while Region leaders move to the surviving nodes.
  3. You need to scale out a new node, and the problematic node should not be forcibly taken offline. Refer to: 专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区 (Column: Three Standard Moves for Handling Abnormal TiKV Scale-in | TiDB Community)
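The "majority is available" argument can be checked with a little arithmetic. The following is a minimal sketch, not TiKV code: the function names are made up for illustration, and it assumes each Region keeps exactly one replica on each of the 3 stores.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def region_available(replicas: int, lost: int) -> bool:
    """True if the surviving replicas can still elect a leader and commit writes."""
    return replicas - lost >= quorum(replicas)


if __name__ == "__main__":
    # 3 replicas on 3 TiKV stores; one store's data directory is deleted.
    print(quorum(3))               # 2
    print(region_available(3, 1))  # True: 2 survivors meet the quorum of 2
    print(region_available(3, 2))  # False: a second loss would break the majority
```

With 3 replicas the cluster tolerates exactly one lost store, which is why the remaining two nodes should be left alone until a replacement has been added.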
| username: Kongdom | Original post link

I think you should first confirm whether the data is balanced. If it is, accidentally deleting one TiKV data directory will have no impact. We previously hit a case where the leaders had been evicted from one node, so only two nodes held leader replicas even though the cluster had three TiKV nodes and three replicas.

| username: DBRE | Original post link

So how should it be taken offline? The article says “--force is only applicable in extreme cases where the TiKV node is completely down or the data directory is deleted.”

| username: h5n1 | Original post link

The description is incomplete. That document describes how to handle problems encountered during normal decommissioning. The scale-in --force command only removes the node to be decommissioned from the TiUP topology. You can handle it as follows:

  1. Use scale-in --force and check whether the store to be decommissioned enters the Offline state (it probably won’t). If it doesn’t, use pd-ctl store delete *store_id* to delete the problematic TiKV store. The store ID can be found via pd-ctl store or INFORMATION_SCHEMA.TIKV_STORE_STATUS.
  2. Wait until the region count on the decommissioned store drops to 0, at which point the store normally changes to the Tombstone state. Then run pd-ctl store remove-tombstone.
  3. If the region count does not decrease, you can follow the document and first add, then remove, the peer. If there are still issues, further investigation will be needed.
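The steps above can be sketched as a small decision helper. This is only an illustration under stated assumptions: `next_step` and `SAMPLE` are hypothetical, the JSON field names (`stores`, `store.id`, `store.state_name`, `status.region_count`) reflect what `pd-ctl store` is assumed to print and should be checked against your version, and the function expects the ID of the store whose data directory was deleted.

```python
import json

# Hypothetical sample shaped like `pd-ctl store` output (field names are an assumption).
SAMPLE = """
{"count": 3, "stores": [
  {"store": {"id": 1, "state_name": "Up"}, "status": {"region_count": 120}},
  {"store": {"id": 4, "state_name": "Up"}, "status": {"region_count": 120}},
  {"store": {"id": 5, "state_name": "Offline"}, "status": {"region_count": 37}}
]}
"""


def next_step(pd_store_output: str, store_id: int) -> str:
    """Suggest the next action for the broken store, following the three steps above."""
    data = json.loads(pd_store_output)
    for entry in data.get("stores", []):
        store = entry.get("store", {})
        if store.get("id") != store_id:
            continue
        state = store.get("state_name", "")
        regions = entry.get("status", {}).get("region_count", 0)
        if state == "Tombstone":
            # Step 2, second half: clean up the tombstone record.
            return "run: pd-ctl store remove-tombstone"
        if state == "Offline":
            if regions > 0:
                # Step 2 (or step 3 if this number never goes down).
                return f"wait: {regions} region(s) still on the store"
            return "wait: store should become Tombstone"
        # Step 1: the store has not been marked for removal yet.
        return f"run: pd-ctl store delete {store_id}"
    return f"store {store_id} not found"


if __name__ == "__main__":
    print(next_step(SAMPLE, 5))  # wait: 37 region(s) still on the store
```

In a 3-node cluster the region count can only go down once a replacement TiKV node is available to receive the peers, so the new node should be scaled out before waiting on step 2.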
| username: kkpeter | Original post link

There will be no data loss. Just start a new TiKV and add it in.

| username: Jellybean | Original post link

The data has 3 replicas. If there are only 2 TiKV nodes for a long time, there might not be any issues in the short term, but over the long term, there could be some scheduling problems.

A good practice is that the number of your TiKV nodes should not be less than the number of your replicas, which means at least 3 TiKV nodes.
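A rough way to see why fewer nodes than replicas causes trouble: each replica of a Region has to live on a different store, so with only 2 stores one replica per Region has nowhere to go. The helper below is a hypothetical sketch of that count, with the default taken from the 3-replica setup in this thread.

```python
def missing_replicas_per_region(stores_up: int, max_replicas: int = 3) -> int:
    """Replicas per Region that cannot be placed, since each one needs its own store."""
    return max(0, max_replicas - stores_up)


if __name__ == "__main__":
    print(missing_replicas_per_region(2))  # 1: every Region stays one replica short
    print(missing_replicas_per_region(3))  # 0: all replicas can be placed
```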

| username: wuxiangdong | Original post link

Most likely no problem.

| username: 考试没答案 | Original post link

There should be no problem. Just add a new node, then remove the one whose data directory was mistakenly deleted.