Abnormal TiKV Restart in Production Environment

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 生产环境tikv异常重启

| username: Mingdr

[TiDB Usage Environment] Production Environment
[TiDB Version] 3.0.8

[Encountered Problem: Symptoms and Impact]
Today, the memory of two TiKV nodes dropped significantly. Upon inspection, it was found that the TiKV services on both servers had restarted at 22:02:00 on November 15th.
[Resource Configuration] 5kv-2server-3pd
[Attachments: Screenshots/Logs/Monitoring]
tidb.log of the tidb node:


Problematic 103 tikv server:

The FATAL log is:

[2022/11/15 22:01:36.060 +08:00] [FATAL] [lib.rs:499] ["index out of bounds: the len is 6 but the index is 6"] [backtrace="stack backtrace:\n   0:     0x56119bd6978d - backtrace::backtrace::libunwind::trace::h958f5f3eb75b2917\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/libunwind.rs:54\n                         - backtrace::backtrace::trace::hdf994f7eb3c12b81\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.2.3/src/backtrace/mod.rs:70\n   1:     0x56119bd5f490 - tikv_util::set_panic_hook::{{closure}}::hf6c0260b2e4aea39\n                        at /home/jenkins/.target/release/build/backtrace-e20a32a05fd0b8fe/out/capture.rs:79\n   2:     0x56119bf090ff - std::panicking::rust_panic_with_hook::h8d2408723e9a2bd4\n                        at src/libstd/panicking.rs:479\n   3:     0x56119bf08edd - std::panicking::continue_panic_fmt::hb2aaa9386c4e5e80\n                        at src/libstd/panicking.rs:382\n   4:     0x56119bf18745 - rust_begin_unwind\n                        at src/libstd/panicking.rs:309\n   5:     0x56119bf232eb - core::panicking::panic_fmt::h79e840586f23493b\n                        at src/libcore/panicking.rs:85\n   6:     0x56119bf22df3 - core::panicking::panic_bounds_check::h9293ee7846bbb139\n                        at src/libcore/panicking.rs:61\n   7:     0x56119bd675fe - <usize as core::slice::SliceIndex<[T]>>::index_mut::h8595bb10a522f6ec\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/slice/mod.rs:2700\n                         - core::slice::<impl core::ops::index::IndexMut<I> for [T]>::index_mut::h72f80a2aa264d71e\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/slice/mod.rs:2561\n                         - <alloc::vec::Vec<T> as core::ops::index::IndexMut<I>>::index_mut::hf0b55e5b5c64c8b5\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/vec.rs:1768\n                         - tokio_timer::wheel::Wheel<T>::insert::ha0811b20f18170ea\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/wheel/mod.rs:114\n                         - tokio_timer::timer::Timer<T,N>::add_entry::h417301f6c9308779\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:324\n   8:     0x56119bd666bf - tokio_timer::timer::Timer<T,N>::process_queue::h39928e46093b552d\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:0\n   9:     0x56119bd65e0b - <tokio_timer::timer::Timer<T,N> as tokio_executor::park::Park>::park::hdf0d553584b8cb22\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:361\n                         - tokio_timer::timer::Timer<T,N>::turn::h67305fd216915f0e\n                        at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.12/src/timer/mod.rs:256\n  10:     0x56119bd654b3 - tikv_util::timer::start_global_timer::{{closure}}::h418812a1f8076ef5\n                        at components/tikv_util/src/timer.rs:94\n  11:     0x56119bd651e5 - std::sys_common::backtrace::__rust_begin_short_backtrace::h7a15993f24510b40\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/sys_common/backtrace.rs:77\n  12:     0x56119bd651d5 - std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}::hc908468389262d3d\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:470\n  13:     0x56119bd651c5 - <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hea2226c234dcd674\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:309\n  14:     0x56119bd651b8 - std::panicking::try::do_call::h5ae108cfa09964f0\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panicking.rs:294\n                         - std::panicking::try::hb747e5c89d6c749b\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250//src/libpanic_abort/lib.rs:29\n                         - std::panic::catch_unwind::h9d095dd8e5bc9701\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/panic.rs:388\n                         - std::thread::Builder::spawn_unchecked::{{closure}}::hd8b8a97aa6180ba0\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libstd/thread/mod.rs:469\n                         - core::ops::function::FnOnce::call_once{{vtable.shim}}::hc4c2684293abcd20\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/ops/function.rs:231\n  15:     0x56119bf1774e - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::he71721d2d956d451\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746\n  16:     0x56119bf19a7b - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::he520045b8d28ce5c\n                        at /rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/liballoc/boxed.rs:746\n                         - std::sys_common::thread::start_thread::h2e98d1272dc6d74b\n                        at src/libstd/sys_common/thread.rs:13\n                         - std::sys::unix::thread::Thread::new::thread_start::h18485805666ccd3c\n                        at src/libstd/sys/unix/thread.rs:79\n  17:     0x7f42e149add4 - start_thread\n  18:     0x7f42e0ba202c - __clone\n  19:                0x0 - <unknown>"] [location=/rustc/0e4a56b4b04ea98bb16caada30cb2418dd06e250/src/libcore/slice/mod.rs:2700] [thread_name=timer]
[2022/11/15 22:02:01.278 +08:00] [INFO] [mod.rs:26] ["Welcome to TiKV."]

103 node message log:

Both restarted kv nodes have the same issue, attached monitoring:
image
image
image
The anomalies in the images are from the two problematic servers.
Could you please explain the reason for this? It’s a bit alarming that both servers encountered issues simultaneously.

| username: xfworld | Original post link

The version is quite old~

This is a known BUG
index out of bounds: the len is 6 but the index is 6

It was only fixed in versions after 4.X…

If the memory consumption is high, it can also be analyzed, no need to panic~
3.X also has slow query log, which can basically help you locate the issue.

| username: Mingdr | Original post link

I checked the slow SQL on the two server nodes, and there was no high memory-consuming slow SQL at that time.
image

However, if this is a bug, then the problem can be confirmed. Could you please provide the bug address? Thank you~

| username: Billmay表妹 | Original post link

You can search on GitHub, and also, make sure to update to the new version. Many issues in the lower versions have been resolved in the higher versions. The community here requires upgrading to at least version 4.x and above.

| username: Mingdr | Original post link

The production upgrade will have to wait until next year, so it’s best to see if we can avoid this situation now.
I just searched on GitHub and couldn’t find the issue I encountered, so I can only hope not to run into this problem again.
Thanks~

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.