TiKV Node Panic

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV节点panic

| username: 雪落香杉树

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.3
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact]
A TiKV node restarted. dmesg shows no OOM information, but the TiKV log contains a panic. How can I troubleshoot the cause or avoid similar situations?

[2024/04/07 21:26:25.431 +08:00] [FATAL] [lib.rs:465] ["commit_ts: TimeStamp(448920645354651662), resolved_ts: TimeStamp(448920645996904752)"] [backtrace="   0: tikv_util::set_panic_hook::{{closure}}\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n   1: std::panicking::rust_panic_with_hook\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:626:17\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:519:13\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:141:18\n   4: rust_begin_unwind\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:515:5\n   5: std::panicking::begin_panic_fmt\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:457:5\n   6: cdc::delegate::Delegate::sink_put\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:630:21\n      cdc::delegate::Delegate::sink_data\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:544:21\n   7: cdc::delegate::Delegate::on_batch\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/delegate.rs:416:17\n   8: cdc::endpoint::Endpoint<T,E>::on_multi_batch\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:740:33\n      <cdc::endpoint::Endpoint<T,E> as tikv_util::worker::pool::Runnable>::run\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/cdc/src/endpoint.rs:1548:18\n   9: 
tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/worker/pool.rs:454:25\n      <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/future/mod.rs:80:19\n      yatp::task::future::RawTask<F>::poll\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/task/future.rs:59:9\n  10: yatp::task::future::TaskCell::poll\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/task/future.rs:103:9\n      <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/task/future.rs:387:20\n  11: <tikv_util::yatp_pool::YatpPoolRunner<T> as yatp::pool::runner::Runner>::handle\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/yatp_pool/mod.rs:104:24\n      yatp::pool::worker::WorkerThread<T,R>::run\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/pool/worker.rs:48:13\n      yatp::pool::builder::LazyBuilder<T>::build::{{closure}}\n             at /rust/git/checkouts/yatp-e704b73c3ee279b6/d564d19/src/pool/builder.rs:91:25\n      std::sys_common::backtrace::__rust_begin_short_backtrace\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys_common/backtrace.rs:125:18\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:476:17\n      <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:347:9\n      std::panicking::try::do_call\n             at 
/rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:401:40\n      std::panicking::try\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panicking.rs:365:19\n      std::panic::catch_unwind\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/panic.rs:434:14\n      std::thread::Builder::spawn_unchecked::{{closure}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/thread/mod.rs:475:30\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/core/src/ops/function.rs:227:5\n  13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/alloc/src/boxed.rs:1572:9\n      std::sys::unix::thread::Thread::new::thread_start\n             at /rustc/2faabf579323f5252329264cc53ba9ff803429a3/library/std/src/sys/unix/thread.rs:91:17\n  14: start_thread\n  15: clone\n"] [location=components/cdc/src/delegate.rs:630] [thread_name=cdc-0]
[2024/04/07 21:26:45.070 +08:00] [INFO] [lib.rs:81] ["Welcome to TiKV"]
[2024/04/07 21:26:45.070 +08:00] [INFO] [lib.rs:86] ["Release Version:   5.4.3"]
[2024/04/07 21:26:45.071 +08:00] [INFO] [lib.rs:86] ["Edition:           Community"]
[2024/04/07 21:26:45.071 +08:00] [INFO] [lib.rs:86] ["Git Commit Hash:   deb149e42d97743349277ff8741f5cb9ae1c027d"]
[2024/04/07 21:26:45.071 +08:00] [INFO] [lib.rs:86] ["Git Commit Branch: heads/refs/tags/v5.4.3"]
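
The trailing fields of the FATAL line already show where the panic came from. A minimal Python sketch for pulling them out is below. The `FATAL_LINE` string is an abbreviated copy of the line above (the backtrace is shortened to `...`), and the helper name `unquoted_fields` is made up for illustration.

```python
import re

# Abbreviated copy of the FATAL line from the log; the backtrace is
# shortened to "..." here, everything else is taken from the log as posted.
FATAL_LINE = (
    '[2024/04/07 21:26:25.431 +08:00] [FATAL] [lib.rs:465] '
    '["commit_ts: TimeStamp(448920645354651662), '
    'resolved_ts: TimeStamp(448920645996904752)"] '
    '[backtrace="..."] '
    '[location=components/cdc/src/delegate.rs:630] [thread_name=cdc-0]'
)

# Matches bracketed key=value fields whose value is not quoted,
# e.g. [location=...] and [thread_name=...]; quoted fields are skipped.
FIELD_RE = re.compile(r'\[(\w+)=([^\]\s"]+)\]')


def unquoted_fields(line):
    """Return the unquoted key=value fields of one log line as a dict."""
    return dict(FIELD_RE.findall(line))


fields = unquoted_fields(FATAL_LINE)
print(fields["location"], fields["thread_name"])
```

Both the location (`components/cdc/src/delegate.rs`) and the thread name (`cdc-0`) point at the CDC component inside TiKV rather than at a memory problem.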


| username: TiDBer_JUi6UvZm | Original post link

What is the output of dmesg -T | grep "error"?
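
Note that the kernel OOM killer does not necessarily log the word "error", so it can help to search for the phrases it usually writes instead. A rough Python sketch follows. The pattern list is an assumption based on typical Linux kernel messages, and the sample text is made up for illustration only (it is not from this cluster).

```python
import re

# Phrases the Linux OOM killer typically writes to the kernel log.
# This list is an assumption; wording can differ between kernel versions.
OOM_RE = re.compile(
    r"out of memory|oom-kill|invoked oom-killer|killed process",
    re.IGNORECASE,
)


def find_oom_lines(dmesg_text):
    """Return the lines of dmesg output that look like OOM-killer messages."""
    return [line for line in dmesg_text.splitlines() if OOM_RE.search(line)]


# Made-up sample, for illustration only.
SAMPLE = """\
[Sun Apr  7 21:20:01 2024] some unrelated kernel message
[Sun Apr  7 21:26:24 2024] tikv-server invoked oom-killer
[Sun Apr  7 21:26:24 2024] Out of memory: Killed process 12345 (tikv-server)
"""

for line in find_oom_lines(SAMPLE):
    print(line)
```

To use it on a real host, pass the captured output of `dmesg -T` to `find_oom_lines`. An empty result is consistent with what was reported here: no OOM kill.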

| username: 雪落香杉树 | Original post link

No information

| username: Jolyne | Original post link

If a large number of queries ran during this period and returned too much data, gRPC may not have been able to send results as fast as the Coprocessor produced them, which can also lead to an out-of-memory (OOM) condition.

| username: TiDBer_jYQINSnf | Original post link

Looking at the code, there is this assert, but I don’t understand the CDC code.
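
To make the two numbers in the panic message easier to read, they can be decoded as TSO timestamps (physical milliseconds in the high bits, an 18-bit logical counter in the low bits). The Python sketch below does that. The helper names are made up, and the +08:00 offset is taken from the log timestamps. That the assert requires commit_ts to be greater than resolved_ts is only my reading of the panic message, not something confirmed from the source.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # TSO layout: (physical milliseconds << 18) | logical counter
LOG_TZ = timezone(timedelta(hours=8))  # the log lines use +08:00


def split_tso(ts):
    """Split a TSO into (physical milliseconds since epoch, logical counter)."""
    return ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1)


def tso_to_datetime(ts, tz=LOG_TZ):
    """Convert the physical part of a TSO to a timezone-aware datetime."""
    physical_ms, _ = split_tso(ts)
    return datetime.fromtimestamp(physical_ms / 1000, tz=tz)


# Values copied from the panic message.
commit_ts = 448920645354651662
resolved_ts = 448920645996904752

for name, ts in (("commit_ts", commit_ts), ("resolved_ts", resolved_ts)):
    print(name, tso_to_datetime(ts).isoformat(), "logical =", split_tso(ts)[1])

lag_ms = split_tso(resolved_ts)[0] - split_tso(commit_ts)[0]
print("commit_ts is behind resolved_ts by about", lag_ms, "ms")
```

The commit_ts in the message is smaller than the resolved_ts, by roughly 2.45 seconds of physical time. In other words, CDC saw a commit older than a timestamp it had already treated as resolved, which is worth including in the issue.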

| username: TIDB-Learner | Original post link

I have encountered TiDB restarts, but haven’t experienced TiKV restarts. Looking at the resource usage, it seems that the CPU has hit a bottleneck.

| username: TiDBer_JUi6UvZm | Original post link

Based on this, raise an issue so the developers can take a look.

| username: 雪落香杉树 | Original post link

Okay, let me take a look.

| username: DBAER | Original post link

Mark it.

| username: WalterWj | Original post link

The panic might be caused by a bug; try upgrading to resolve it.

| username: TiDBer_21wZg5fm | Original post link

Try it on other versions and see.

| username: Hacker_PtIIxHC1 | Original post link

Is there a monitoring graph for the TiKV machine? Check if the memory and CPU are fully utilized. Are there any large queries? You can also check the TiDB logs.

| username: 雪落香杉树 | Original post link

The CPU and memory don’t seem to be fully utilized.

| username: 雪落香杉树 | Original post link

The IO usage is relatively high, and there is a bottleneck.

| username: TiDBer_JUi6UvZm | Original post link

It’s safer to raise an issue for this kind of problem. After all, it’s a production issue. The official team might have already resolved it.

| username: 友利奈绪 | Original post link

It is possible that slow queries drove CPU usage too high and the node then crashed.

| username: xfworld | Original post link

Version 5.x has reached end of life, so upgrading to a newer version such as 6.5.x is recommended.

However, do not upgrade in place. It is best to run the old and new clusters side by side on separate resources and complete compatibility testing before switching over.

| username: dba远航 | Original post link

Check the system logs for any anomalies.

| username: changpeng75 | Original post link

Is only one TiKV node restarting? Is there any system setting that limits how much memory TiKV can use? I ask because system memory was not exhausted.
For the TiKV Panic issue, you can refer to the documentation:

| username: ffeenn | Original post link

Upgrade it.