TiKV Node Keeps Restarting, Failed to Open Raft Engine

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点不断重启,failed to open raft engine

| username: mar_xxy

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.1
[Reproduction Path] tiup cluster reload
[Encountered Problem: Phenomenon and Impact] Restart error: metric tikv_raftstore_region_count{type="leader"} not found
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]

| username: TiDBer_jYQINSnf

Rebuilding this node would be easier. Given the checksum mismatch, could the file on disk be corrupted?

| username: mar_xxy

What exactly are the steps?

| username: zhanggame1

When a TiKV node has a problem like this, the fix is to scale in and then scale out.

| username: zhaokede

Scale in first, then scale out.

| username: TiDBer_jYQINSnf

First, scale out by adding a new node. Ignore the faulty node for now and just keep it shut down. Once pd-ctl shows that the region count of the faulty node has dropped to 0 and its state has changed to Tombstone, you can delete that node's directory.
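
The "region count is 0 and state is Tombstone" check can be scripted. The sketch below assumes that the JSON printed by the pd-ctl `store <store_id>` command contains `state_name` and `region_count` fields, and that a missing `region_count` means 0. The function name `store_is_removable` is made up for illustration, and the sed-based parsing is deliberately simple, so compare it with your own pd-ctl output before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reads pd-ctl "store <id>" JSON on stdin and succeeds
# only when the store is Tombstone and holds no regions.
store_is_removable() {
    json=$(cat)
    # Extract the value of "state_name": "..."
    state=$(printf '%s\n' "$json" \
        | sed -n 's/.*"state_name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
        | head -n 1)
    # Extract the number after "region_count": (assumption: absent means 0)
    regions=$(printf '%s\n' "$json" \
        | sed -n 's/.*"region_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1)
    regions=${regions:-0}
    echo "state=${state:-unknown} region_count=${regions}"
    [ "$state" = "Tombstone" ] && [ "$regions" -eq 0 ]
}
```

In a TiUP deployment, the input could come from something like `tiup ctl:v6.5.1 pd -u http://<pd-address>:2379 store <store_id> | store_is_removable`, where the PD address and store ID are placeholders. Delete the node's directory only after the function succeeds.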

| username: TIDB-Learner

Under what circumstances does TiKV typically run into problems like this? For example, which operations cause it?

| username: 呢莫不爱吃鱼

At this point, scale in and then scale out again.

| username: zhaokede

A TiKV node that keeps restarting and reports “failed to open raft engine” usually means that the Raft engine hit a problem during startup. There are several possible causes:

  1. Insufficient Disk Space: TiKV requires enough disk space to store data. If the disk space is insufficient, the Raft engine may not open properly.

  2. File System Errors: There may be errors in the disk or file system, causing TiKV to be unable to read or write data.

  3. Configuration Issues: There may be issues with TiKV’s configuration, such as incorrect data directory settings or improper parameter configurations.

  4. Data Corruption: TiKV’s data may be corrupted, possibly due to hardware failures, sudden power outages, or other reasons.

To resolve this issue, you can try the following steps:

  • Check Disk Space: Use the df -h command to check if there is enough space on the disk where the TiKV data directory is located.
  • Check File System: Use the fsck command to check for errors in the file system. Run it only on an unmounted file system, with TiKV stopped.
  • Reconfigure TiKV: Ensure that TiKV’s configuration file is correct, especially the path to the data directory.
  • Recover Data: If the data is corrupted, you may need to restore data from a backup or use a tool such as tikv-ctl to repair the corrupted data.
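
The disk space check in the first step can be wrapped in a small script. This is a minimal sketch: the function names and the 80% default threshold are arbitrary choices for illustration, not TiKV settings.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage (without the % sign)
# of the file system that holds the given directory.
disk_used_pct() {
    # -P forces POSIX single-line output; column 5 is the capacity percentage.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn when usage reaches a limit (default 80, arbitrary).
check_data_dir_space() {
    dir=$1
    limit=${2:-80}
    used=$(disk_used_pct "$dir")
    [ -n "$used" ] || { echo "ERROR: cannot read usage for $dir" >&2; return 2; }
    if [ "$used" -ge "$limit" ]; then
        echo "WARN: $dir is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $dir is ${used}% full"
}
```

For example, `check_data_dir_space /path/to/your/tikv/data 90` prints a warning and returns a non-zero status once the file system holding the data directory is at least 90% full.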

If the above steps do not resolve the issue, you may need to check TiKV’s log files for more detailed error information or seek help in the TiKV community forum.
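
As a starting point for reading the logs, the sketch below scans a log file for a few tell-tale strings. The patterns come from this thread ("failed to open raft engine", "checksum mismatch") plus the standard operating system message for a full disk. They are illustrative rather than exhaustive, and the function name `scan_tikv_log` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: scan a TiKV log file for a few known strings and
# print one hint per match. Matching is case-insensitive.
scan_tikv_log() {
    log=$1
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    found=0
    if grep -qi 'failed to open raft engine' "$log"; then
        echo "raft engine failed to open"; found=1
    fi
    if grep -qi 'checksum mismatch' "$log"; then
        echo "checksum mismatch: possible on-disk corruption"; found=1
    fi
    if grep -qi 'no space left on device' "$log"; then
        echo "disk full"; found=1
    fi
    [ "$found" -eq 1 ] || echo "no known pattern found"
}
```

Call it with the path to the node's log file, for example `scan_tikv_log /path/to/your/tikv/log/tikv.log` (the path is a placeholder).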

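If the issue is due to data corruption, you can use the tikv-ctl tool in local mode, with the TiKV process stopped, to find damaged Regions and mark them as tombstone so that the node can start without them. If the Raft engine itself cannot be opened, tikv-ctl may fail in the same way, and rebuilding the node as suggested earlier in this thread is the more practical option:

# Find damaged Regions (run while TiKV is stopped)
tikv-ctl --data-dir /path/to/your/tikv/data bad-regions

# Mark a damaged Region as tombstone so that TiKV skips it on startup
# (<pd-address> and <region_id> are placeholders)
tikv-ctl --data-dir /path/to/your/tikv/data tombstone -p <pd-address>:2379 -r <region_id>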

Please note that it is best to back up your data before performing these operations, just in case.