TiKV Raft Engine Log Loss Causes TiKV Recovery Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV raft engine日志丢失Tikv恢复失败

| username: Timber

Overview: While testing recovery on a single-node TiKV, I set the recovery-mode configuration item of raft-engine to “tolerate-any-corruption” and manually deleted a raft log file. TiKV then failed to recover.

Application Framework and Business Logic Adaptation: Testing TiKV’s recovery in cases of file loss or corruption due to power outages or other anomalies.

Background:

  1. Modified the recovery-mode configuration item of raft-engine to “tolerate-any-corruption”.
  2. Restarted TiKV.
  3. Inserted data into TiKV from the application layer.
  4. Manually deleted the tail log file of raft-engine.
  5. Stopped data insertion.
  6. Restarted TiKV, and TiKV failed to start.

Phenomenon: TiKV failed to start and reported the following error:

[FATAL] [server.rs:950] ["failed to start node: Engine(Other("[components/raftstore/src/store/fsm/store.rs:1115]: \"[components/raftstore/src/store/peer_storage.rs:769]: [region 2] 3 validate state fail: Other(\\\"[components/raftstore/src/store/peer_storage.rs:595]: log at recorded commit index [8] 262607 doesn't exist, may lose data, region 2, raft state hard_state { term: 8 vote: 3 commit: 103492 } last_index: 103494, apply state applied_index: 262607 commit_index: 262607 commit_term: 8 truncated_state { index: 262600 term: 8 }\\\")\""))"]
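To make the numbers in that message easier to compare, here is a minimal Python sketch that extracts the index fields and checks them against each other. The helper names `parse_raft_state` and `diagnose` are hypothetical and not part of TiKV; the field names and values are copied from the log line in this post. The comparison is only one reading of what the message implies, not TiKV's actual validation code: the apply state records commit index 262607, while the raft log only reaches last_index 103494.

```python
import re

# Tail of the error message from this post (escaping removed for readability).
MESSAGE = (
    "log at recorded commit index [8] 262607 doesn't exist, may lose data, "
    "region 2, raft state hard_state { term: 8 vote: 3 commit: 103492 } "
    "last_index: 103494, apply state applied_index: 262607 "
    "commit_index: 262607 commit_term: 8 truncated_state { index: 262600 term: 8 }"
)

FIELDS = ("commit", "last_index", "applied_index", "commit_index")


def parse_raft_state(message):
    """Extract the integer fields listed in FIELDS from the error message."""
    state = {}
    for name in FIELDS:
        # The literal ": " after the name keeps "commit" from matching "commit_index".
        match = re.search(r"\b" + re.escape(name) + r": (\d+)", message)
        if match is None:
            raise ValueError("field not found in message: " + name)
        state[name] = int(match.group(1))
    return state


def diagnose(state):
    """Describe the gap if the apply state points past the end of the raft log."""
    if state["commit_index"] > state["last_index"]:
        gap = state["commit_index"] - state["last_index"]
        return (
            f"apply state commit_index {state['commit_index']} is beyond "
            f"raft log last_index {state['last_index']} (gap of {gap} indexes)"
        )
    return None


if __name__ == "__main__":
    print(diagnose(parse_raft_state(MESSAGE)))
```

Read this way, the raft log ends well before the index that the apply state says has already been committed and applied, which matches the "may lose data" wording in the error.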

Problem: TiKV cannot recover.

TiDB Version:
tikv:v6.1.0

| username: Billmay表妹 | Original post link

You can follow these steps to fix the issue:

  1. Set the recovery-mode configuration item of raft-engine to “tolerate-any-corruption”, which allows TiKV to continue running when data corruption is detected.
  2. Restart TiKV so that the raft-engine configuration takes effect.
  3. Write some data to TiKV to trigger the recovery process of the Raft state machine.
  4. Manually delete the tail log file of raft-engine, which forces the Raft state machine to recover from the previous state.
  5. After deleting the log file, stop writing data to TiKV.
  6. Restart TiKV again. At this point, TiKV should be able to start normally and recover the data.

| username: Timber | Original post link

Thank you for the reply. Unfortunately, I followed your steps and retried the experiment twice, but TiKV still cannot be restarted and the error remains the same.

| username: Timber | Original post link

In the experiment, I set recovery-mode to tolerate-any-corruption, restarted TiKV, inserted data, stopped the insertion, truncated the raft log file, and restarted TiKV. The same error still occurs, which is totally unexpected!

| username: Timber | Original post link

@Billmay Could you please forward this to the relevant developers? :blush:

| username: 程序员小王 | Original post link

Is it okay to use the default tolerate-tail-corruption?

## Determines how to deal with file corruption during recovery.
##
## Candidates:
##   absolute-consistency
##   tolerate-tail-corruption
##   tolerate-any-corruption
# recovery-mode = "tolerate-tail-corruption"
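As a small illustration of the option above, here is a Python sketch that reads a recovery-mode line from a config fragment and checks it against the three candidates listed in the template. `read_recovery_mode` is a hypothetical helper written for this post, and falling back to the default when the line is commented out is an assumption based on the template excerpt. It is a simple line-based check, not how TiKV parses its config, and a real TOML parser would be the better choice for anything serious.

```python
import re

# Candidate values as listed in the config template excerpt above.
CANDIDATES = (
    "absolute-consistency",
    "tolerate-tail-corruption",
    "tolerate-any-corruption",
)
# Default taken from the commented-out line in the template excerpt.
DEFAULT = "tolerate-tail-corruption"


def read_recovery_mode(config_text):
    """Return the recovery-mode set in config_text, or DEFAULT if it is not set."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out lines do not set anything
        match = re.match(r'recovery-mode\s*=\s*"([^"]*)"', stripped)
        if match:
            value = match.group(1)
            if value not in CANDIDATES:
                raise ValueError("unknown recovery-mode: " + value)
            return value
    return DEFAULT


if __name__ == "__main__":
    fragment = '[raft-engine]\nrecovery-mode = "tolerate-any-corruption"\n'
    print(read_recovery_mode(fragment))
```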