Repair Methods for a Single TiKV Replica Failure [In Case of Physical Machine Damage]

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV 单个副本故障【假如一台物理机器损坏】的修复手段

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.5.8
[Encountered Problem: Phenomenon and Impact] Repair method for a single TiKV replica failure [e.g., if a physical machine is damaged]. I couldn’t find a specific operational procedure in the official documentation. For this scenario, could any expert provide an SOP? My rough idea is as follows (see the command sketch after the list), but I’m not sure if it’s accurate, and I would appreciate guidance from the experts.

  1. Use tiup cluster display to identify the faulty TiKV node.
  2. Based on its status [exactly which status, I don’t know yet], remove the faulty TiKV node with tiup.
  3. Reinitialize a server [prepare the basic environment] and edit the scale-out topology configuration.
  4. Register the new node with tiup cluster scale-out.
  5. Monitor the data recovery progress and other information on the new TiKV node.
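
A minimal command sketch of the flow proposed above, assuming a cluster named tidb-prod, a faulty TiKV at 10.0.0.5:20160, and a replacement host 10.0.0.6 (all placeholders). Note that the replies below recommend reversing the order: expand first, then remove the faulty node.

```shell
# 1. Identify the faulty TiKV node (its status shows as Down or Disconnected)
tiup cluster display tidb-prod

# 2. Remove the faulty node (add --force if the host is already unreachable)
tiup cluster scale-in tidb-prod --node 10.0.0.5:20160

# 3/4. Describe the replacement server in a scale-out topology file and register it
#      scale-out.yaml (sketch):
#        tikv_servers:
#          - host: 10.0.0.6
tiup cluster scale-out tidb-prod scale-out.yaml

# 5. Watch the new node come up and its region/leader counts grow
tiup cluster display tidb-prod
```
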
| username: Fly-bird | Original post link

A single replica cannot survive a TiKV fault. In a production environment, there should be at least 3 replicas, right?

| username: TiDBer_jYQINSnf | Original post link

After step 2, add the following:
Execute a store delete using pd-ctl.
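
A hedged example of that pd-ctl step, assuming PD listens on 127.0.0.1:2379 and the faulty TiKV turns out to be store 4 (both placeholders):

```shell
# List all stores and note the ID of the Down/Disconnected one
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 store

# Mark that store for deletion; PD then schedules its regions onto other stores
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 store delete 4
```
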

| username: 像风一样的男子 | Original post link

  1. Use tiup cluster display to identify the faulty TiKV node.
  2. Reinitialize a server, edit the scale-out topology configuration, and expand onto the new node (see the sketch after this list).
  3. Observe the data recovery progress and other information on the new TiKV node.
  4. Forcefully scale in the faulty node using tiup cluster scale-in xxx --node [faulty TiKV node] --force.
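
A minimal sketch of that order of operations (expand first, force-remove afterwards), assuming a cluster named tidb-prod, a replacement host 10.0.0.6, and the faulty node at 10.0.0.5:20160 (all placeholders):

```shell
# scale-out.yaml (sketch) describing the replacement TiKV host:
#   tikv_servers:
#     - host: 10.0.0.6

# 2. Expand onto the new server first
tiup cluster scale-out tidb-prod scale-out.yaml

# 3. Watch replicas being replenished onto the new node
tiup cluster display tidb-prod

# 4. Only then force-remove the dead node
tiup cluster scale-in tidb-prod --node 10.0.0.5:20160 --force
```
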
| username: caiyfc | Original post link

The person above is correct: expand first, then shrink. Make sure the number of healthy TiKV nodes is at least three, and only then scale in the problematic TiKV node.

| username: porpoiselxj | Original post link

Reference:

| username: TiDBer_aaO4sU46 | Original post link

Expand first, then shrink.

| username: zhanggame1 | Original post link

TiDB Cluster Recovery: TiKV Cluster Unavailable

| username: 不想干活 | Original post link

We have 3 TiKV nodes; it’s better to have more backups.

| username: residentevil | Original post link

With the Raft protocol, three replicas can tolerate the failure of one.

| username: residentevil | Original post link

From an operations perspective it looks OK; what I need now are the specific steps. I see someone has shared the official documentation, so I’ll take a look at that first.

| username: residentevil | Original post link

This article is written in great detail. I will read it carefully first. Thank you, my friend.

| username: residentevil | Original post link

This article is written in great detail. I will read it carefully first. Thank you.

| username: caiyfc | Original post link

The specific steps are as mentioned above. I have also read these two articles. Given your current situation, the first article only mentions expanding and then shrinking, while the second article, which discusses the scenario of one machine going down, does not provide specific steps but rather explains the principles and precautions.

| username: residentevil | Original post link

What I’m actually concerned about is the state of the damaged replica when it is released: after the new node has been added and all the data synchronized, does the faulty node sit in some intermediate state, and is it possible that it cannot be released under certain conditions?

| username: 胡杨树旁 | Original post link

You can use the pd-ctl tool to check whether there are any leaders on this node that have not yet been migrated. If the leader count is 0, the migration is complete.
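
A hedged example of that check, assuming the store in question has ID 4 and PD listens on 127.0.0.1:2379 (both placeholders):

```shell
# Inspect the store; the leader_count field should drop to 0 as leaders migrate away
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 store 4
```
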

| username: residentevil | Original post link

If the number of regions is in the millions, the leader migration time is indeed likely to be quite long.

| username: 胡杨树旁 | Original post link

Yes, you can only operate on the problematic node after its leaders have finished migrating; otherwise data loss may occur. We once had an incident in the test environment where a node went down unexpectedly before its leader migration had finished, so some leaders were not moved away in time.
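
If passively waiting takes too long, one option worth considering (my own assumption, not something mentioned above, so verify it against the pd-ctl docs) is to ask PD to actively evict leaders from that store and watch the count fall:

```shell
# Actively move leaders off store 4 (placeholder ID)
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 scheduler add evict-leader-scheduler 4

# Watch leader_count fall toward 0
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 store 4

# Remove the scheduler once done (the scheduler name includes the store ID)
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 scheduler remove evict-leader-scheduler-4
```
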

| username: caiyfc | Original post link

When a machine has been down for a certain period, the replicas on it are considered unavailable by the cluster, leaving only two replicas. During the scale-out, the cluster adds replicas on the new node to restore the three-replica state. Finally, when scaling in, you need to add the --force option to forcibly remove the node on the faulty machine, so there won’t be a situation where the node cannot be released.
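
A hedged sketch of that forced removal and the follow-up cleanup, assuming the cluster is tidb-prod and the dead node is 10.0.0.5:20160 (placeholders):

```shell
# Confirm the dead store really shows as Down/Disconnected before forcing anything
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 store

# Force-remove the dead node from the topology
tiup cluster scale-in tidb-prod --node 10.0.0.5:20160 --force

# Once PD eventually marks the old store as Tombstone, clear the leftover record
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 store remove-tombstone
```
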

| username: caiyfc | Original post link

Losing one replica out of three, regardless of whether the lost replica is the leader, will not affect the data. Moreover, after the new node is added, the replica count is replenished automatically, and the leaders rebalance automatically as well.
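
A hedged way to verify that replenishment has finished (standard pd-ctl region checks; the PD address is a placeholder):

```shell
# Regions still missing a replica; an empty list means the replica count is back to 3
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 region check miss-peer

# Regions that still have a peer on a down store
tiup ctl:v6.5.8 pd -u http://127.0.0.1:2379 region check down-peer
```
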