A TiKV Node Fails to Start, Error [FATAL] [lib.rs:491] ["attempt to overwrite compacted entries in..."]

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [tikv一个节点无法起来,报错[FATAL] [lib.rs:491] "attempt to overwrite compacted entries in

| username: devopNeverStop

[TiDB Usage Environment] Production
[TiDB Version] v6.1.0
[Reproduction Path] What operations were performed that caused the issue

[Encountered Issue: Problem Phenomenon and Impact]
One TiKV node is in the Down state.
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]
[FATAL] [lib.rs:491] [“attempt to overwrite compacted entries in 227990773”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:490:18\n 1: std::panicking::rust_panic_with_hook\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:702:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:588:13\n 3: std::sys_common::backtrace::_rust_end_short_backtrace\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:138:18\n 4: rust_begin_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:584:5\n 5: core::panicking::panic_fmt\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:143:14\n 6: raft_engine::memtable::MemTable::prepare_append\n 7: raft_engine::memtable::MemTable::append\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:334:13\n raft_engine::memtable::MemTableAccessor::apply_append_writes\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:965:21\n 8: <raft_engine::memtable::MemTableRecoverContext as raft_engine::file_pipe_log::pipe_builder::ReplayMachine>::replay\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:1112:33\n raft_engine::file_pipe_log::pipe_builder::DualPipesBuilder::recover_queue::{{closure}}\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/file_pipe_log/pipe_builder.rs:265:33\n core::ops::function::impls::<impl core::ops::function::FnMut for &F>::call_mut\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:247:13\n core::ops::function::impls::<impl core::ops::function::FnOnce for &mut F>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:280:13\n core::option::Option::map\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs:906:29\n <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:103:9\n rayon::iter::plumbing::Folder::consume_iter\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:178:21\n <rayon::iter::map::MapFolder<C,F> as rayon::iter::plumbing::Folder>::consume_iter\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/map.rs:248:21\n rayon::iter::plumbing::Producer::fold_with\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:110:9\n rayon::iter::plumbing::bridge_producer_consumer::helper\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:438:13\n 9: rayon::iter::plumbing::bridge_producer_consumer::helper::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:418:21\n rayon_core::join::join_context::call_a::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:124:17\n 
<core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n rayon_core::unwind::halt_unwinding\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/unwind.rs:17:5\n rayon_core::join::join_context::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:141:24\n 10: rayon_core::registry::in_worker\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:879:13\n rayon_core::join::join_context\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:132:5\n rayon::iter::plumbing::bridge_producer_consumer::helper\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:416:47\n 11: rayon::iter::plumbing::bridge_producer_consumer::helper::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:427:21\n rayon_core::join::join_context::call_b::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:129:25\n <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::call::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:113:21\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n rayon_core::unwind::halt_unwinding\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/unwind.rs:17:5\n <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:119:38\n 12: rayon_core::job::JobRef::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:59:9\n rayon_core::registry::WorkerThread::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:753:9\n rayon_core::registry::WorkerThread::wait_until_cold\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:730:17\n 13: rayon_core::registry::WorkerThread::wait_until\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:704:13\n rayon_core::registry::main

| username: 考试没答案 | Original post link

What operation were you performing? Did the error appear during some operation, or did it suddenly happen on its own? Please describe the operation in detail.

| username: devopNeverStop | Original post link

It just happened suddenly on its own.

| username: 考试没答案 | Original post link

Have you tried restarting? What is the current status? Is the service running normally?

| username: devopNeverStop | Original post link

Restarted the service and the server, but it still didn’t come up.
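For reference, a single down TiKV instance is usually restarted through TiUP rather than by restarting its process or the server directly; a minimal sketch, with the cluster name and node address as placeholders:

tiup cluster restart <cluster-name> -N <tikv-ip>:20160   # restart only the affected TiKV instance
tiup cluster display <cluster-name>                      # check whether the store comes back Up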

| username: h5n1 | Original post link

This is suspected to be a bug. It will probably need to be handled by scaling in and scaling out. For now, keep the current state and wait for official confirmation.

| username: devopNeverStop | Original post link

Yes, the plan is to add a new node first, wait for rebalancing to complete, and then remove the faulty node.
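For anyone following the same route, a minimal sketch of that plan with TiUP (the cluster name, topology file, and node address are placeholders, not values from this thread):

tiup cluster scale-out <cluster-name> scale-out.yaml             # add the new TiKV described in scale-out.yaml
# wait for leader/region rebalancing to finish (watch Grafana or pd-ctl store counts)
tiup cluster scale-in <cluster-name> -N <faulty-tikv-ip>:20160   # then remove the faulty TiKV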

| username: WalterWj | Original post link

Doesn't this version have a known bug in the Raft Engine? :thinking: It is recommended to upgrade to the latest 6.1.x release.
Alternatively, set the Raft Engine to a single thread.
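A minimal sketch of that second suggestion, assuming "set the Raft Engine to a single thread" refers to Raft Engine's recovery threads (my reading; the reply does not name the exact option), with the cluster name as a placeholder:

tiup cluster edit-config <cluster-name>
# In the editor, add under server_configs -> tikv (assumption: recovery-threads is the intended setting):
#   server_configs:
#     tikv:
#       raft-engine.recovery-threads: 1
tiup cluster reload <cluster-name> -R tikv   # push the change to all TiKV instances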

| username: devopNeverStop | Original post link

We don't dare to upgrade for now; we need to get all three nodes back to normal first.

| username: Minorli-PingCAP | Original post link

Scale in and out first, then recover.

| username: devopNeverStop | Original post link

The new node has been added and rebalancing is almost done: the leaders have already moved over, but the region count is still a bit short.

| username: devopNeverStop | Original post link

The new TiKV node has been added to the cluster; now the problematic old TiKV needs to be removed from the cluster and then rejoined.

  1. Scale-in the original TiKV
    After it ran, the TiKV remained stuck in the Pending Offline state.
  2. Scale-in --force the original TiKV
    The TiKV is no longer visible in TiUP.
  3. Scale-out the original TiKV back into the cluster
    The log reports an error that a TiKV with the same IP but a different ID already exists, so the new TiKV cannot start.
  4. In pd-ctl, the original TiKV's store information is still visible (it still shows over 300 regions). Deleting the original TiKV's store ID with pd-ctl returns success, but the information remains.

The original TiKV's data and deployment directories are now empty, so tikv-ctl cannot be used to clear the region information for the original TiKV that PD still shows.
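For context on step 4, a sketch of the pd-ctl call involved (the PD address is the one shown later in this thread; the store ID is whatever pd-ctl store reports for the faulty node):

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store delete <store-id>   # marks the store Offline
# The store only disappears (turns Tombstone) after its region_count drops to 0,
# which is why the information still shows while 300+ regions reference it.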

| username: devopNeverStop | Original post link

What about changing this server's IP? It should then be able to scale out into the cluster normally. Are there any other solutions?

| username: h5n1 | Original post link

The scale-in steps were incorrect; unsafe recovery needs to be performed.
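In v6.1, Online Unsafe Recovery is driven through pd-ctl; a minimal sketch, assuming the failed store's ID has been looked up with pd-ctl store (the ID here is a placeholder):

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 unsafe remove-failed-stores <store-id>   # recover regions by dropping the failed store from their membership
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 unsafe remove-failed-stores show         # check recovery progress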

| username: devopNeverStop | Original post link

Now that the TiKV node is no longer visible in tiup cluster display, can I still use unsafe recovery?

| username: h5n1 | Original post link

TiUP is only a management/display layer; the actual metadata lives in PD. You can check with pd-ctl store.
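For example, listing all stores (including ones TiUP no longer displays) looks like this; the state_name and region_count fields show whether the faulty store is still registered:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store   # lists every store PD knows about, with state and region counts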

| username: devopNeverStop | Original post link

Finally, the original abnormal TiKV was re-added to the cluster through the following steps:

  1. Remove all region peers of the original TiKV in PD
# Store 8 is the faulty TiKV. List every region that still has a peer on store 8,
# extract the region IDs, and schedule a remove-peer operator for each of them.
for i in $(tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 region store 8 | grep -B 1 start_key | grep id | awk '{print $2}' | sed 's/,//')
do
   tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 operator add remove-peer $i 8
done
  2. Clear all remaining information of the original TiKV in PD (see the check sketched after this list)
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store remove-tombstone
  3. Scale out and re-add the original TiKV node to the cluster
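One step is implicit between 1 and 2: store remove-tombstone only clears stores that are already in the Tombstone state, and a store only becomes Tombstone once its region_count reaches 0. A small check before step 2, reusing the same PD address and the store ID 8 from the loop above:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store 8   # should show region_count 0 and, eventually, the Tombstone state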

The new node is now balancing normally. Thanks to everyone for the enthusiastic support, especially @h5n1, whose first suggestion helped me solve the problem!

| username: mayjiang0203 | Original post link

This is a known issue (raft engine panic during recovery · Issue #13123 · tikv/tikv · GitHub) that has been fixed in v6.1.1. It is recommended to upgrade to the latest release in the 6.1.x series.
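For reference, an in-place upgrade within the same series is a single TiUP command; a sketch, with the cluster name as a placeholder and v6.1.1 as an example target (any later 6.1.x patch also carries the fix):

tiup update cluster                          # update the TiUP cluster component first
tiup cluster upgrade <cluster-name> v6.1.1   # rolling upgrade of the whole cluster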

| username: devopNeverStop | Original post link

Thank you.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.