TiKV suddenly exited with an error: tikv-20160.service main process exited, FAILURE

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv突然出错后自己退出 tikv-20160.service main process exited, FAILURE

| username: zhimadi

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.2
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact]
During off-peak business hours the system was running smoothly, when one TiKV node suddenly hit an error and exited on its own. This caused the system to crash and produced a large number of slow queries from statements that are not normally slow. Could this be a bug in TiDB?
Error message as follows:
tikv-20160.service: main process exited, code=exited, status=1/FAILURE

[2023/07/24 18:30:55.854 +08:00] [FATAL] [lib.rs:465] ["elapsed=5717168819; when=5717168817"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n 1: std::panicking::rust_panic_with_hook\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18\n 4: rust_begin_unwind\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5\n 5: std::panicking::begin_panic_fmt\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:457:5\n 6: tokio_timer::wheel::Wheel::poll\n 7: tokio_timer::timer::Timer<T,N>::process\n at /rust/git/checkouts/tokio-8e927faba632ed16/e8ac149/tokio-timer/src/timer/mod.rs:272:33\n 8: tokio_timer::timer::Timer<T,N>::turn\n 9: tikv_util::timer::start_global_steady_timer::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/timer.rs:196:17\n 10: std::sys_common::backtrace::__rust_begin_short_backtrace\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18\n 11: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476:17\n 12: <std::panic::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n 13: std::panicking::try::do_call\n at /rustc/2f
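
A line like the one above is easier to read once the header fields are split out and the escaped `\n` sequences in the backtrace become separate frames. Below is a minimal Python sketch for that, not an official TiKV tool. The `LINE` sample is a shortened copy of the posted log with the backtrace cut to two frames, and the names `parse_header` and `backtrace_frames` are made up for illustration. The pattern accepts either straight or curly quotes around the message.

```python
import re

# Shortened copy of the FATAL line from the post (backtrace cut to two frames).
LINE = (
    '[2023/07/24 18:30:55.854 +08:00] [FATAL] [lib.rs:465] '
    '["elapsed=5717168819; when=5717168817"] '
    '[backtrace="   0: tikv_util::set_panic_hook::{{closure}}\\n'
    '   6: tokio_timer::wheel::Wheel::poll\\n"]'
)

# [time] [LEVEL] [source] [message], with optional straight or curly quotes
# around the message.
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<source>[^\]]+)\] '
    r'\[["“]?(?P<message>[^"”\]]*)["”]?\]'
)


def parse_header(line):
    """Return the time, level, source and message fields, or None."""
    match = HEADER.match(line)
    return match.groupdict() if match else None


def backtrace_frames(line):
    """Split the backtrace field on the literal two-character sequence \\n."""
    match = re.search(r'\[backtrace="(.*)', line)
    if not match:
        return []
    text = match.group(1)
    # A complete line ends with "] ; a truncated one (as in the post) does not.
    if text.endswith('"]'):
        text = text[:-2]
    return [part.strip() for part in text.split('\\n') if part.strip()]


if __name__ == "__main__":
    print(parse_header(LINE))
    for frame in backtrace_frames(LINE):
        print(frame)
```

Applied to the full line, frames 6 to 9 show that the panic was raised inside `tokio_timer::wheel::Wheel::poll`, called from TiKV's global steady timer thread.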

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
System Log:

tikv.log

Monitoring Graphs:



| username: 爱白话的晓辉 | Original post link

How is the machine configured?

| username: zhimadi | Original post link

3 TiDB nodes and 6 TiKV nodes. Memory and CPU usage on the machines is normally under 50%.

| username: tidb狂热爱好者 | Original post link

This is not telling you that it is out of memory (OOM).

| username: 裤衩儿飞上天 | Original post link

It’s obviously an OOM (Out of Memory).

| username: redgame | Original post link

Add more memory…

| username: tidb菜鸟一只 | Original post link

Please provide the cluster topology diagram, the resources of each node, and whether there is any mixed deployment situation?

| username: zhanggame1 | Original post link

Where specifically can you see the OOM?

| username: zhimadi | Original post link

An OOM would be reported as an OOM in the log.

| username: zhimadi | Original post link

An OOM would be reported as an OOM in the log. This is the first time I’ve seen an error at the [FATAL] level.

| username: zhimadi | Original post link

[Images]

| username: tidb菜鸟一只 | Original post link

Are there any error messages in /var/log/messages from around the time of the crash?
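
One way to settle the OOM question from the system log is to search it for kernel OOM-killer messages and to check how systemd reported the exit. The Python sketch below assumes a syslog-style file such as /var/log/messages. The host name `host1`, the PID, and the whole `HYPOTHETICAL_OOM_LINE` are invented for illustration, and the patterns cover only common kernel wordings. An OOM kill is delivered as SIGKILL, which systemd normally reports as `code=killed, status=9/KILL`, whereas the line in this thread shows `code=exited, status=1/FAILURE`.

```python
import re

# Common wordings used by the Linux kernel when the OOM killer acts.
OOM_PATTERN = re.compile(r"oom-killer|Out of memory|Killed process", re.IGNORECASE)

# How systemd describes the end of a service's main process.
EXIT_PATTERN = re.compile(r"code=(?P<code>\w+), status=(?P<status>[\w/]+)")


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line for line in lines if OOM_PATTERN.search(line)]


def exit_info(line):
    """Return (code, status) from a systemd exit message, or None."""
    match = EXIT_PATTERN.search(line)
    return (match.group("code"), match.group("status")) if match else None


# The systemd message from this thread; the timestamp prefix and host name
# are assumptions about how it appears in the system log.
THREAD_LINE = (
    "Jul 24 18:30:56 host1 systemd: tikv-20160.service: "
    "main process exited, code=exited, status=1/FAILURE"
)

# Invented example of what an OOM kill could look like, for contrast only.
HYPOTHETICAL_OOM_LINE = (
    "Jul 24 18:30:50 host1 kernel: Out of memory: "
    "Kill process 1234 (tikv-server) score 900 or sacrifice child"
)

if __name__ == "__main__":
    print(find_oom_lines([THREAD_LINE]))            # no OOM wording in this line
    print(exit_info(THREAD_LINE))                   # process exited on its own
    print(len(find_oom_lines([HYPOTHETICAL_OOM_LINE])))
    # Against a real file:
    # with open("/var/log/messages", errors="replace") as f:
    #     print(find_oom_lines(f))
```

If the scan finds nothing near the crash time and the exit is reported as `code=exited`, a panic inside the process is a more likely explanation than the kernel killing it.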

| username: zhimadi | Original post link

Yes, look at the first picture, Jul 24 18:30:56

| username: h5n1 | Original post link

It is probably this bug: TiKV running over 2 years may panic #11940

Also, v5.4.2 is not recommended for production use because of a serious bug.

| username: tidb菜鸟一只 | Original post link

I guess this node hasn’t been running for two years, right?

| username: h5n1 | Original post link

The issue title says two years, but I don’t think the bug necessarily takes two years to trigger.
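
The two numbers in the panic message give a rough check on this. The Python sketch below assumes that `elapsed` and `when` are millisecond counts since the timer started, which the thread does not confirm. The constant and function names are made up for illustration.

```python
# Values copied from the panic message in the first post.
ELAPSED = 5717168819
WHEN = 5717168817

# ASSUMPTION: both values are in milliseconds. The thread does not state the unit.
MS_PER_DAY = 24 * 60 * 60 * 1000


def overshoot(elapsed, when):
    """How far the timer's elapsed counter is past the scheduled deadline."""
    return elapsed - when


def ms_to_days(ms):
    """Convert a millisecond count to days."""
    return ms / MS_PER_DAY


if __name__ == "__main__":
    print(overshoot(ELAPSED, WHEN))          # 2
    print(round(ms_to_days(ELAPSED), 2))     # 66.17
```

Under that assumption the timer had been running for about 66 days, far short of two years, and the panic fired because `elapsed` was 2 units past `when`.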

| username: zhimadi | Original post link

This node hasn’t been around for 2 years, but the entire cluster has.

| username: zhimadi | Original post link

I suspect we hit some bug in TiDB. When we upgraded to 5.4.2, this bug had not been exposed yet. As for upgrading, it is best to leave a cluster alone unless you have to. You know what I mean.

| username: zhimadi | Original post link

So far, there hasn’t been a more definitive discovery, so we’ll just consider it a bug for now. :joy:
