TiKV Node Failure Causes Cluster to Fail to Start Normally

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点故障,导致集群无法正常启动。

| username: liujun6315

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
TiKV node failure caused the cluster to be unable to start
[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

Yesterday, February 9th, we received an alert that a TiKV node had gone down. I logged into the cluster and verified that reads and writes still worked, but the business and development teams reported that the application was unusable. Less than two hours later, the problematic physical machine recovered, but the TiKV node was still unusable. We decided to restart the cluster, but the restart failed at the problematic TiKV node with the following error.

This is the cluster topology and the current cluster status.

By checking the logs, we found that the SST files were missing. We performed a system-level repair, but after the repair was completed, the startup still failed with the same error.

Currently, we are considering two approaches:

  1. Remove the faulty node through the PD node to restore the cluster.
  2. Repair the SST files of the faulty node and then restore the cluster.
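For approach 1, the usual tool is pd-ctl (run through tiup). A minimal sketch; the version tag, PD address, and store ID below are placeholders to fill in for this cluster, not values from the post:

```shell
# Sketch only: v6.5.0, the PD address, and <store-id> are placeholders.

# List all stores and note the ID of the one whose state is "Down"
tiup ctl:v6.5.0 pd -u http://127.0.0.1:2379 store

# Mark that store for removal; PD then schedules its replicas onto other nodes
tiup ctl:v6.5.0 pd -u http://127.0.0.1:2379 store delete <store-id>
```

Note that `store delete` begins an offline process rather than deleting immediately; it can only complete if enough healthy replicas exist on the remaining TiKV nodes.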

We have also looked at related posts, but since this is the only cluster we have, we are afraid that the operation might cause the cluster to be unrecoverable. We would like to seek guidance from experienced experts or official personnel. Thank you all.

| username: 小龙虾爱大龙虾 | Original post link

For now, don’t worry about the problematic TiKV since only one is broken. First, execute `tiup cluster start tidb-pt -R tidb`.
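Sketched out, with the cluster name `tidb-pt` taken from the command above:

```shell
# Check which components are currently down
tiup cluster display tidb-pt

# Start only the tidb-server instances; -R restricts the operation to one role,
# so the broken TiKV is left untouched while SQL access comes back
tiup cluster start tidb-pt -R tidb
```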

| username: 裤衩儿飞上天 | Original post link

If it is urgent, do not perform any operations on the failed node yet. Add a new TiKV node, and only after all replicas are complete should you handle the failed one. Whether the failure is due to disk damage or SST loss, data on that node may be lost; once the new node is added, replicas will be replenished automatically. The application is unusable because all the TiDB servers are down, so also check the basic environment for issues such as network or firewall problems.
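A sketch of the scale-out step described above; the host, ports, and directories in the topology file are placeholders, not values from this cluster:

```shell
# Placeholder topology for the replacement TiKV node
cat > scale-out.yaml <<'EOF'
tikv_servers:
  - host: 10.0.1.9            # placeholder IP, not from the post
    port: 20160
    status_port: 20180
    deploy_dir: /tidb-deploy/tikv-20160
    data_dir: /tidb-data/tikv-20160
EOF

# Add the node; PD then replenishes region replicas onto it automatically
tiup cluster scale-out tidb-pt scale-out.yaml
```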

| username: xfworld | Original post link

I’ve seen the cluster status in the group, and it’s very likely that data will be lost…

Two ways:

  1. Free up spare resources and scale out TiKV nodes to see whether the data replicas can be recovered.
  2. If the replicas are lost, you can only give up that data…

| username: Jellybean | Original post link

Priority should be given to restoring cluster access functionality, which mainly involves restarting the TiDB node.

Once the cluster is accessible, forcibly scale in (decommission) the previously downed TiKV node; you can then scale out a new TiKV node using the same directory and port.
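A sketch of that sequence; `<failed-ip>` and port 20160 are placeholders for the downed instance:

```shell
# Forcibly remove the dead TiKV; --force skips waiting for a node that can
# no longer respond (any data still on it is abandoned)
tiup cluster scale-in tidb-pt --node <failed-ip>:20160 --force

# Confirm the store is fully gone before scaling out a new TiKV
# that reuses the same deploy directory and port
tiup cluster display tidb-pt
```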

Since this is a production cluster, please promptly provide feedback here if you have any questions or updates on the progress.