TiKV Unable to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV无法启动

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] TiDB 3.0
[Encountered Issue]
After a power outage in the data center, when restarting TiDB, one TiKV failed to start. Checking the TiKV log showed the following error:
[FATAL] [lib.rs:499] ["[region 27] 11522647 unexpected raft log index: last_index 5177432 < applied_index 5177434"]
[Actions Taken]

  1. Checked for bad regions with /home/rds/tidb-v3.0-linux-amd64/bin/tikv-ctl --db /TiDBDisk3/deploy/data/db bad-regions
  2. Stopped the scheduling service
  3. Stopped the TiKV service on every node
  4. On each TiKV node, ran /home/rds/tidb-v3.0-linux-amd64/bin/tikv-ctl --db /TiDBDisk3/deploy/data/db unsafe-recover remove-fail-stores -s 170154 -r 27
    (170154 is the store ID of the failed TiKV)
  5. When this command was run on the failed TiKV node itself, the following error was reported

[Resource Configuration]

| username: 普罗米修斯 | Original post link

[2023/03/15 19:16:46.109 +08:00] [FATAL] [lib.rs:499] [“[region 27] 11522647 unexpected raft log index: last_index 5177432 < applied_index 5177434”] [backtrace=“stack backtrace:\n 0: 0x55bfe319b51d - backtrace::backtrace::libunwind::trace::h0500f4f2825a5d17\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54\n - backtrace::backtrace::trace::h4187244de1605a06\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70\n 1: 0x55bfe318fcd0 - tikv_util::set_panic_hook::{{closure}}::h195100b0bbd49cfb\n at /home/jenkins/.target/release/build/backtrace-e20a32a05fd0b8fe/out/capture.rs:79\n 2: 0x55bfe333464f - std::panicking::rust_panic_with_hook::h8d2408723e9a2bd4\n at src/libstd/panicking.rs:479\n 3: 0x55bfe333442d - std::panicking::continue_panic_fmt::hb2aaa9386c4e5e80\n at src/libstd/panicking.rs:382\n 4: 0x55bfe33343db - std::panicking::begin_panic_fmt::h1c91fada5a982dcd\n at src/libstd/panicking.rs:337\n 5: 0x55bfe291a49a - tikv::raftstore::store::peer::Peer::new::h114b9c5233192fb4\n at src/raftstore/store/peer.rs:0\n 6: 0x55bfe27b70c5 - tikv::raftstore::store::fsm::peer::PeerFsm::create::h796ac694c7f2d0e5\n at src/raftstore/store/fsm/peer.rs:151\n 7: 0x55bfe26a0973 - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::{{closure}}::h4c08e0783f93ab81\n at src/raftstore/store/fsm/store.rs:750\n - engine::iterable::scan_impl::h0450662012195a8e\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:198\n - engine::iterable::Iterable::scan_cf::h6d7aa7d8bbcfb1ed\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:174\n - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::h75225d858b84b1b1\n at src/raftstore/store/fsm/store.rs:721\n - tikv::raftstore::store::fsm::store::RaftBatchSystem::spawn::h4291f89945cb9126\n at src/raftstore/store/fsm/store.rs:1003\n 8: 0x55bfe267c982 - 
tikv::server::node::Node::start_store::hb6e41ce5d17f8092\n at src/server/node.rs:341\n - tikv::server::node::Node::start::h502723d45c7f0962\n at src/server/node.rs:148\n - tikv::binutil::server::run_raft_server::he798388cc8852c3a\n at src/binutil/server.rs:276\n 9: 0x55bfe26451b1 - tikv::binutil::server::run_tikv::h17cbdf211cde42b2\n at src/binutil/server.rs:79\n 10: 0x55bfe263a805 - tikv_server::main::h28eadcb59f5aa918\n at src/bin/tikv-server.rs:159\n 11: 0x55bfe25b1362 - std::rt::lang_start::{{closure}}::hd8df218522d5a046\n at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/rt.rs:64\n 12: 0x55bfe263c128 - main\n 13: 0x7f69f5917b96 - __libc_start_main\n 14: 0x55bfe25871a8 - \n 15: 0x0 - ”] [location=src/raftstore/store/peer_storage.rs:494] [thread_name=main]
[2023/03/15 19:17:01.850 +08:00] [FATAL] [lib.rs:499] [“[region 27] 11522647 unexpected raft log index: last_index 5177432 < applied_index 5177434”] [backtrace=“stack backtrace:\n 0: 0x561944e9c51d - backtrace::backtrace::libunwind::trace::h0500f4f2825a5d17\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54\n - backtrace::backtrace::trace::h4187244de1605a06\n at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70\n 1: 0x561944e90cd0 - tikv_util::set_panic_hook::{{closure}}::h195100b0bbd49cfb\n at /home/jenkins/.target/release/build/backtrace-e20a32a05fd0b8fe/out/capture.rs:79\n 2: 0x56194503564f - std::panicking::rust_panic_with_hook::h8d2408723e9a2bd4\n at src/libstd/panicking.rs:479\n 3: 0x56194503542d - std::panicking::continue_panic_fmt::hb2aaa9386c4e5e80\n at src/libstd/panicking.rs:382\n 4: 0x5619450353db - std::panicking::begin_panic_fmt::h1c91fada5a982dcd\n at src/libstd/panicking.rs:337\n 5: 0x56194461b49a - tikv::raftstore::store::peer::Peer::new::h114b9c5233192fb4\n at src/raftstore/store/peer.rs:0\n 6: 0x5619444b80c5 - tikv::raftstore::store::fsm::peer::PeerFsm::create::h796ac694c7f2d0e5\n at src/raftstore/store/fsm/peer.rs:151\n 7: 0x5619443a1973 - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::{{closure}}::h4c08e0783f93ab81\n at src/raftstore/store/fsm/store.rs:750\n - engine::iterable::scan_impl::h0450662012195a8e\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:198\n - engine::iterable::Iterable::scan_cf::h6d7aa7d8bbcfb1ed\n at /home/jenkins/workspace/release_tidb_3.0/tikv/components/engine/src/iterable.rs:174\n - tikv::raftstore::store::fsm::store::RaftPollerBuilder<T,C>::init::h75225d858b84b1b1\n at src/raftstore/store/fsm/store.rs:721\n - tikv::raftstore::store::fsm::store::RaftBatchSystem::spawn::h4291f89945cb9126\n at src/raftstore/store/fsm/store.rs:1003\n 8: 0x56194437d982 - 
tikv::server::node::Node::start_store::hb6e41ce5d17f8092\n at src/server/node.rs:341\n - tikv::server::node::Node::start::h502723d45c7f0962\n at src/server/node.rs:148\n - tikv::binutil::server::run_raft_server::he798388cc8852c3a\n at src/binutil/server.rs:276\n 9: 0x5619443461b1 - tikv::binutil::server::run_tikv::h17cbdf211cde42b2\n at src/binutil/server.rs:79\n 10: 0x56194433b805 - tikv_server::main::h28eadcb59f5aa918\n at src/bin/tikv-server.rs:159\n 11: 0x5619442b2362 - std::rt::lang_start::{{closure}}::hd8df218522d5a046\n at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/rt.rs:64\n 12: 0x56194433d128 - main\n 13: 0x7f642ab11b96 - __libc_start_main\n 14: 0x5619442881a8 - \n 15: 0x0 - ”] [location=src/raftstore/store/peer_storage.rs:494] [thread_name=main]
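The key information in the FATAL lines above is the pair of indexes: the Raft log for region 27 ends at last_index 5177432, while applied_index is already 5177434. As a rough sketch (not part of the original post), the message can be parsed to pull out these numbers when scanning a large log. The regular expression only matches the message format seen in this thread, and reading the number after the region tag as the peer ID is an assumption.

```python
import re

# Matches the message format seen in this thread; other TiKV versions may word it differently.
PATTERN = re.compile(
    r"\[region (?P<region>\d+)\] (?P<peer>\d+) unexpected raft log index: "
    r"last_index (?P<last>\d+) < applied_index (?P<applied>\d+)"
)


def parse_raft_index_error(line):
    """Return the IDs and the index gap from the log line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    info = {key: int(value) for key, value in m.groupdict().items()}
    # Number of applied entries that are no longer present in the Raft log.
    info["missing_entries"] = info["applied"] - info["last"]
    return info


sample = (
    '[FATAL] [lib.rs:499] ["[region 27] 11522647 unexpected raft log index: '
    'last_index 5177432 < applied_index 5177434"]'
)
print(parse_raft_index_error(sample))
```

For the line in this thread the gap is 2 entries, and both crash reports point at the same region, so only region 27 on this store is affected according to the log.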

| username: TI表弟 | Original post link

First of all, the cluster has several nodes, so losing one of them is not a big deal. Don't panic, and don't run risky operations in a hurry. Take it one step at a time.

| username: TI表弟 | Original post link

Step 1: Run tiup cluster display to check the status of the nodes, and pd-ctl store to check the status of the stores. Then report back with the results.

| username: 普罗米修斯 | Original post link

| username: 普罗米修斯 | Original post link

We deployed the cluster with Ansible and built our own web page for monitoring TiDB.

| username: TI表弟 | Original post link

The first step is to use tiup cluster display to check the node status, and pd-ctl store to view the node status. After checking, please report back to me, okay?

| username: 普罗米修斯 | Original post link

The cluster was deployed with Ansible, so there is no tiup tool. The store status is the same as shown above.

| username: TI表弟 | Original post link

Focus on miss-peer and pending-peer.
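In pd-ctl, these two categories are listed by region check miss-peer and region check pending-peer, which print JSON. The following is a minimal sketch for picking out the regions that still have a peer on the failed store. It assumes the output is an object with a "regions" list whose entries carry "peers" and "pending_peers" with a "store_id" field; the sample data is illustrative only and is not output from the cluster in this thread.

```python
import json

FAILED_STORE_ID = 170154  # store ID of the failed TiKV, taken from the original post


def regions_touching_store(check_output, store_id):
    """Return the IDs of regions that have a peer or pending peer on store_id."""
    data = json.loads(check_output)
    hits = []
    for region in data.get("regions") or []:
        peers = (region.get("peers") or []) + (region.get("pending_peers") or [])
        if any(peer.get("store_id") == store_id for peer in peers):
            hits.append(region["id"])
    return hits


# Illustrative data only (assumed shape of the pd-ctl JSON output).
sample_output = json.dumps({
    "count": 2,
    "regions": [
        {
            "id": 27,
            "peers": [{"id": 11522647, "store_id": 170154}, {"id": 28, "store_id": 1}],
            "pending_peers": [{"id": 11522647, "store_id": 170154}],
        },
        {
            "id": 31,
            "peers": [{"id": 32, "store_id": 1}, {"id": 33, "store_id": 4}],
        },
    ],
})
print(regions_touching_store(sample_output, FAILED_STORE_ID))
```

An empty result for both checks would suggest that no region is still waiting on the failed store.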

| username: 普罗米修斯 | Original post link

The problem has been solved. All regions and leaders on the failed TiKV were transferred to the other nodes. After manually scaling that node in and then scaling out again, everything is back to normal.
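Before scaling in a failed store like this, it helps to confirm that it really holds no leaders or regions any more. Here is a small sketch of that check against the JSON printed by pd-ctl store. The field names ("stores", "store", "status", "leader_count", "region_count") are assumed from the usual PD output and should be verified against the actual cluster; the sample values are made up for illustration.

```python
import json


def store_is_drained(store_output, store_id):
    """Return True if the given store reports zero leaders and zero regions."""
    data = json.loads(store_output)
    for entry in data.get("stores") or []:
        if entry["store"]["id"] == store_id:
            status = entry.get("status") or {}
            # Counters that are zero may be omitted from the JSON, so default to 0.
            return (status.get("leader_count", 0) == 0
                    and status.get("region_count", 0) == 0)
    raise KeyError(f"store {store_id} not found in pd-ctl output")


# Illustrative data only (assumed shape of the pd-ctl store output).
before = json.dumps({"count": 1, "stores": [
    {"store": {"id": 170154, "state_name": "Down"},
     "status": {"leader_count": 0, "region_count": 3}},
]})
after = json.dumps({"count": 1, "stores": [
    {"store": {"id": 170154, "state_name": "Down"}, "status": {}},
]})
print(store_is_drained(before, 170154))
print(store_is_drained(after, 170154))
```

The function raises KeyError when the store is missing from the output, so a mistyped store ID is not mistaken for a drained store.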

| username: TI表弟 | Original post link

Running more than four TiKV nodes, each with sufficient disk space, greatly reduces the number of such issues and makes it much less likely that the cluster becomes unavailable.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.