TiKV Node Abnormal Restart: "index out of bounds: the len is 6 but the index is 6"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV node abnormally restarts with index out of bounds: the len is 6 but the index is 6

| username: 胡杨树旁

Production environment, database version v5.1.2. A TiKV node restarted abnormally; error log:

[2024/03/04 05:55:16.243 +08:00] [WARN] [endpoint.rs:633] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 47439535, leader may Some(id: 47439890 store_id: 4680469)\" not_leader { region_id: 47439535 leader { id: 47439890 store_id: 4680469 } }"]
[2024/03/04 05:55:37.772 +08:00] [WARN] [endpoint.rs:633] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 47439535, leader may Some(id: 47439890 store_id: 4680469)\" not_leader { region_id: 47439535 leader { id: 47439890 store_id: 4680469 } }"]
[2024/03/04 05:55:47.186 +08:00] [WARN] [endpoint.rs:633] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 47439535, leader may Some(id: 47439890 store_id: 4680469)\" not_leader { region_id: 47439535 leader { id: 47439890 store_id: 4680469 } }"]
[2024/03/04 05:55:51.506 +08:00] [FATAL] [lib.rs:463] ["index out of bounds: the len is 6 but the index is 6"] [backtrace="stack backtrace:\n   0: tikv_util::set_panic_hook::{{closure}}\n             at components/tikv_util/src/lib.rs:462\n   1: std::panicking::rust_panic_with_hook\n             at library/std/src/panicking.rs:595\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at library/std/src/panicking.rs:497\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at library/std/src/sys_common/backtrace.rs:141\n   4: rust_begin_unwind\n             at library/std/src/panicking.rs:493\n   5: core::panicking::panic_fmt\n             at library/core/src/panicking.rs:92\n   6: core::panicking::panic_bounds_check\n             at library/core/src/panicking.rs:69\n   7: <usize as core::slice::index::SliceIndex<[T]>>::index_mut\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/index.rs:188\n      core::slice::index::<impl core::ops::index::IndexMut<I> for [T]>::index_mut\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/index.rs:26\n      <alloc::vec::Vec<T,A> as core::ops::index::IndexMut<I>>::index_mut\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/vec/mod.rs:2398\n      tokio_timer::wheel::Wheel<T>::insert\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/wheel/mod.rs:114\n      tokio_timer::timer::Timer<T,N>::add_entry\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:324\n   8: tokio_timer::timer::Timer<T,N>::process_queue\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:301\n   9: <tokio_timer::timer::Timer<T,N> as tokio_executor::park::Park>::park\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:361\n      tokio_timer::timer::Timer<T,N>::turn\n             at /rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/timer/mod.rs:256\n  10: tikv_util::timer::start_global_timer::{{closure}}\n             at components/tikv_util/src/timer.rs:98\n  11: std::sys_common::backtrace::__rust_begin_short_backtrace\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/sys_common/backtrace.rs:125\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/thread/mod.rs:474\n  13: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panic.rs:344\n  14: std::panicking::try::do_call\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panicking.rs:379\n      std::panicking::try\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panicking.rs:343\n      std::panic::catch_unwind\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/panic.rs:431\n      std::thread::Builder::spawn_unchecked::{{closure}}\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/thread/mod.rs:473\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/ops/function.rs:227\n  15: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n            
 at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546\n      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546\n      std::sys::unix::thread::Thread::new::thread_start\n             at library/std/src/sys/unix/thread.rs:71\n  16: start_thread\n  17: clone\n"] [location=/rust/registry/src/github.com-1ecc6299db9ec823/tokio-timer-0.2.13/src/wheel/mod.rs:114] [thread_name=timer]
[2024/03/04 05:56:21.053 +08:00] [INFO] [lib.rs:81] ["Welcome to TiKV"]
[2024/03/04 05:56:21.053 +08:00] [INFO] [lib.rs:86] ["Release Version:   5.1.2"]
[2024/03/04 05:56:21.053 +08:00] [INFO] [lib.rs:86] ["Edition:           Enterprise"]

System log:

Mar  4 05:53:30  systemd-logind: Removed session 1543332.
Mar  4 05:54:01  systemd: Started Session 1543334 of user root.
Mar  4 05:54:01  systemd: Started Session 1543333 of user root.
Mar  4 05:54:51  systemd-logind: New session 1543335 of user shsnc.
Mar  4 05:54:51  systemd: Started Session 1543335 of user shsnc.
Mar  4 05:54:51  systemd-logind: Removed session 1543335.
Mar  4 05:56:01  systemd: Started Session 1543336 of user root.
Mar  4 05:56:01  systemd: Started Session 1543337 of user root.
Mar  4 05:56:05  systemd: tikv-20163.service: main process exited, code=exited, status=1/FAILURE
Mar  4 05:56:05  systemd: Unit tikv-20163.service entered failed state.
Mar  4 05:56:05  systemd: tikv-20163.service failed.
Mar  4 05:56:16  systemd-logind: New session 1543338 of user shsnc.
Mar  4 05:56:16  systemd: Started Session 1543338 of user shsnc.
Mar  4 05:56:16  systemd-logind: Removed session 1543338.
Mar  4 05:56:20  systemd: tikv-20163.service holdoff time over, scheduling restart.
Mar  4 05:56:20  systemd: Stopped tikv service.
Mar  4 05:56:20  systemd: Started tikv service.
Mar  4 05:56:21  run_tikv.sh: sync ...
Mar  4 05:56:21  run_tikv.sh: real    0m0.037s
Mar  4 05:56:21  run_tikv.sh: user    0m0.001s
Mar  4 05:56:21  run_tikv.sh: sys     0m0.035s
Mar  4 05:56:21  run_tikv.sh: ok
Mar  4 05:57:38  systemd-logind: New session 1543339 of user shsnc.

| username: 像风一样的男子 | Original post link

This is a known bug; upgrading the database version is recommended.

| username: dba远航 | Original post link

It is indeed a bug.

| username: tidb菜鸟一只 | Original post link

This is an old bug. Has this TiKV node really not been restarted for 2 years?

| username: 胡杨树旁 | Original post link

I saw the official notes saying this was fixed in version 4.0, but this is version 5.1, which is a bit strange.

| username: 胡杨树旁 | Original post link

It has indeed been a long time since the last restart, because the cluster has been running quite stably. If it's a bug, should I just wait for the other nodes to restart on their own? This cluster has 6 instances deployed; currently 3 have restarted and 3 have not. Do I need to intervene manually? I'm quite worried that if I restart one manually, it might not come back up…

| username: 像风一样的男子 | Original post link

Versions 4.0, 5.1, 5.2, and 5.3 all have this bug.

| username: onlyacat | Original post link

TiDB is still very stable, and there are a lot of TiKV instances that haven’t been restarted for two years :rofl:

| username: DBRE | Original post link

After discovering this issue, we inspected all of our online clusters, gave one month's advance notice, and proactively worked with the business teams to schedule restarts. If a cluster has never been restarted since it was set up, there is a high probability that all of its nodes will restart abnormally at the same time. To reduce the impact on the business, we added this check to our inspections and restarted nodes proactively.

Attached: on each cluster, log in to tidb-server and execute the following SQL to find TiKV nodes that have not been restarted for 2 years.

select INSTANCE, START_TIME, UPTIME, TIMESTAMPDIFF(day, START_TIME, now())
from information_schema.CLUSTER_INFO
where type = 'tikv'
  and TIMESTAMPDIFF(day, START_TIME, now()) > 365 * 2;
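
Any row returned is a node in the danger window; the columns show the instance address, start time, uptime, and days since start. For the restart itself, assuming the cluster is managed by TiUP, a role-scoped restart is one option (the cluster name is a placeholder):

tiup cluster restart <cluster-name> -R tikv

Restarting a single node at a time with -N <host:port> keeps the impact smaller still.
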
| username: 胡杨树旁 | Original post link

I saw this described as a TiFlash bug, but this cluster doesn't use TiFlash, only TiKV.

| username: DBRE | Original post link

This post contains

| username: 胡杨树旁 | Original post link

Okay, let me check. Not sure if it’s because it hasn’t been restarted for a long time.

| username: 像风一样的男子 | Original post link

It should be the same type of bug; upgrading to a newer version will definitely fix it.

| username: redgame | Original post link

If there’s a chance, you should upgrade.

| username: tidb菜鸟一只 | Original post link

Check the last startup time of each node. If any are approaching two years without a restart, manually restart them one by one. This bug doesn’t affect anything else; it’s just an anomaly in the process time calculation.
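
For reference on the two-year figure: the timer wheel in the backtrace has 6 levels of 64 slots each, so it can represent 64^6 = 68,719,476,736 ticks; at tokio-timer's 1 ms tick that is roughly 68,719,477 seconds, about 795 days, or ~2.2 years of uptime before the level calculation overflows.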

| username: TiDBer_小阿飞 | Original post link

So you can infer from when the bug fires that the node hasn't been restarted for 2 years? Haha :joy:

| username: tidb菜鸟一只 | Original post link

The cause of the exception is here. It seems that versions before v5.3 used tokio-timer, whose timer calculation goes wrong once the process has been running for about 2 years, causing the TiKV process to restart abnormally.
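
The backtrace above points at tokio-timer's hashed timing wheel (Wheel::insert at wheel/mod.rs:114). The wheel has 6 levels of 64 slots each at a 1 ms tick, so it can only represent deadlines up to 64^6 ms (~2.2 years) past the timer's start; once uptime exceeds that, the level computed for a new timeout comes out as 6, and indexing the 6-entry level vector panics. Below is a minimal sketch of that level calculation; it is not tokio-timer's actual source, only the 6x64 wheel shape matches:

// Simplified model (not tokio-timer's code) of how a 6-level,
// 64-slot hierarchical timing wheel picks the level for a deadline.
const NUM_LEVELS: usize = 6;
const SLOTS_PER_LEVEL: u64 = 64;

// Each level is 64x coarser than the one below; 6 bits of the tick
// count select the slot within a level, so the highest differing bit
// between "now" and the deadline decides the level. (A deadline equal
// to "now" is not handled in this sketch.)
fn level_for(elapsed_ticks: u64, deadline_ticks: u64) -> usize {
    let masked = elapsed_ticks ^ deadline_ticks;
    let significant = 63 - masked.leading_zeros() as usize;
    significant / 6
}

fn main() {
    // Total capacity: 64^6 ticks, about 2.18 years at 1 ms per tick.
    let max_ticks = SLOTS_PER_LEVEL.pow(NUM_LEVELS as u32);
    println!(
        "capacity: {} ticks (~{:.2} years at 1 ms/tick)",
        max_ticks,
        max_ticks as f64 / 1000.0 / 86400.0 / 365.0
    );

    // A deadline just past the capacity maps to level 6, but the
    // wheel only has levels 0..=5 -- hence "the len is 6 but the
    // index is 6" when the levels vector is indexed.
    let level = level_for(0, max_ticks);
    println!("overflowing deadline -> level {} (len is {})", level, NUM_LEVELS);
}

Running this prints level 6 for the overflowing deadline, which is exactly the panic in the log. It also explains why a restart "fixes" it: restarting resets the timer's epoch, so deadlines become small again.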

| username: TiDBer_小阿飞 | Original post link

Learned a lot, thanks! :+1:

| username: 哈喽沃德 | Original post link

Learned.

| username: 小于同学 | Original post link

The art of rebooting