BR Restore Failure, TiKV Node Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR恢复失败,tikv节点报错

| username: GreenGuan

Hi

The cluster version information is v4.0.1. When using v4.0.16 for recovery, it shows recovery failure. After investigation, the TiKV node attempts to restart itself but is unable to start. For specific errors, see the TiKV error log. Has anyone encountered this problem? How to solve it, thank you~!!!

BR error:
Error: rpc error: code = Unavailable desc = transport is closing; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”; rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp xxxxx:21164: i/o timeout”

TiKV node error log:

[2022/06/21 09:29:25.964 +08:00] [INFO] [raw_node.rs:222] [“RawNode created with id 1216662.”] [id=1216662] [raft_id=1216662] [region_id=1216172]
[2022/06/21 09:29:25.964 +08:00] [INFO] [peer.rs:158] [“create peer”] [peer_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:783] [“became follower at term 9”] [term=9] [raft_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:285] [newRaft] [peers=“[(1216781, Progress { matched: 20, next_idx: 21, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: } }), (1216782, Progress { matched: 0, next_idx: 21, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: } }), (1216783, Progress { matched: 0, next_idx: 21, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: } })]”] [“last term”=9] [“last index”=20] [applied=20] [commit=20] [term=9] [raft_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raw_node.rs:222] [“RawNode created with id 1216781.”] [id=1216781] [raft_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [peer.rs:158] [“create peer”] [peer_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:783] [“became follower at term 7”] [term=7] [raft_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:285] [newRaft] [peers=“[(1216876, Progress { matched: 18, next_idx: 19, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: } }), (1216877, Progress { matched: 0, next_idx: 19, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: } }), (1216878, Progress { matched: 0, next_idx: 19, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: } })]”] [“last term”=7] [“last index”=18] [applied=18] [commit=18] [term=7] [raft_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raw_node.rs:222] [“RawNode created with id 1216876.”] [id=1216876] [raft_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [peer.rs:158] [“create peer”] [peer_id=1217557] [region_id=1217085]
[2022/06/21 09:29:25.964 +08:00] [FATAL] [server.rs:576] [“failed to start node: EngineTraits(Other("[components/raftstore/src/store/fsm/store.rs:812]: \"[components/raftstore/src/store/peer_storage.rs:578]: [region 1217085] 1217557 validate state fail: Other(\\\"[components/raftstore/src/store/peer_storage.rs:455]: log at recorded commit index [8] 17 doesn\\\\\\\'t exist, may lose data\\\")\""))”]

| username: HACK | Original post link

Did you fail during the recovery process, or did you fail right at the beginning of the recovery? Is there a network issue?

| username: GreenGuan | Original post link

There were no issues at the beginning of the recovery, but this problem occurred during the recovery process. I checked the network and there doesn’t seem to be any problem.

| username: yilong | Original post link

  1. 备份与恢复常见问题 | PingCAP 文档中心 Are the new cluster resources sufficient?
  2. From v4.0.1 BR restore to v4.0.16? You can try BR restore to v4.0.1 first to see if it succeeds, then upgrade to v4.0.16.
| username: YuJuncen | Original post link

Judging from this log, it says that the Raft Log is corrupted. Maybe you can ask in the neighboring TiKV section to see what’s going on?