A TiKV Node Failure Causes Cluster Unavailability

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 一个tikv节点宕机导致集群不可用

| username: Holland

[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.14
[Reproduction Path]
Deployed TiDB v4.0.14 on a cloud server with 3 TiKV nodes. After one TiKV node crashed and restarted, it failed to come back up and panicked. This caused the entire cluster to become unavailable.
[Encountered Issue: Problem Description and Impact]

Finally resolved the issue by adding a new TiKV node, shutting down the faulty node, and restarting the remaining 2 TiKV nodes.
[Attachments: Screenshots/Logs/Monitoring]
Here is the TiKV error log:

{"log":"[2023/07/20 12:49:05.709 +08:00] [FATAL] [lib.rs:481] [\"to_commit 1238767 is out of range [last_index 1238765], raft_id: 893827, region_id: 893825\"] [backtrace=\"stack backtrace:\\n   0: tikv_util::set_panic_hook::{{closure}}\\n             at components/tikv_util/src/lib.rs:480\\n   1: std::panicking::rust_panic_with_hook\\n             at src/libstd/panicking.rs:475\\n   2: rust_begin_unwind\\n             at src/libstd/panicking.rs:375\\n   3: std::panicking::begin_panic_fmt\\n             at src/libstd/panicking.rs:326\\n   4: raft::raft_log::RaftLog\u003cT\u003e::commit_to\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/\u003c::std::macros::panic macros\u003e:9\\n   5: raft::raft::Raft\u003cT\u003e::handle_heartbeat\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1877\\n   6: raft::raft::Raft\u003cT\u003e::step_follower\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1718\\n      raft::raft::Raft\u003cT\u003e::step\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1129\\n   7: raft::raw_node::RawNode\u003cT\u003e::step\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:339\\n      raftstore::store::peer::Peer::step\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:941\\n      raftstore::store::fsm::peer::PeerFsmDelegate\u003cT,C\u003e::on_raft_message\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:1206\\n   8: raftstore::store::fsm::peer::PeerFsmDelegate\u003cT,C\u003e::handle_msgs\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:455\\n   9: \u003craftstore::store::fsm::store::RaftPoller\u003cT,C\u003e as batch_system::batch::PollHandler\u003craftstore::store::fsm::peer::PeerFsm\u003cengine_rocks::engine::RocksEngine\u003e,raftstore::store::fsm::store::StoreFsm\u003e\u003e::handle_normal\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:785\\n  10: batch_system::batch::Poller\u003cN,C,Handler\u003e::poll\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:325\\n  11: batch_system::batch::BatchSystem\u003cN,C\u003e::spawn::{{closure}}\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:402\\n      std::sys_common::backtrace::__rust_begin_short_backtrace\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/sys_common/backtrace.rs:136\\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:469\\n      \u003cstd::panic::AssertUnwindSafe\u003cF\u003e as core::ops::function::FnOnce\u003c()\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:318\\n      std::panicking::try::do_call\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:292\\n      
std::panicking::try\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8//src/libpanic_unwind/lib.rs:78\\n      std::panic::catch_unwind\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:394\\n      std::thread::Builder::spawn_unchecked::{{closure}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:468\\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/ops/function.rs:232\\n  13: \u003calloc::boxed::Box\u003cF\u003e as core::ops::function::FnOnce\u003cA\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\\n  14: \u003calloc::boxed::Box\u003cF\u003e as core::ops::function::FnOnce\u003cA\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\\n      std::sys_common::thread::start_thread\\n             at src/libstd/sys_common/thread.rs:13\\n      std::sys::unix::thread::Thread::new::thread_start\\n             at src/libstd/sys/unix/thread.rs:80\\n  15: \u003cunknown\u003e\\n  16: clone\\n\"] [location=/rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft_log.rs:237] [thread_name=raftstore-22185-0]\n","stream":"stderr","time":"2023-07-20T04:49:05.709255578Z"}
| username: tidb菜鸟一只 | Original post link

SHOW config WHERE NAME LIKE '%max-replicas%';
What is the max-replicas parameter set to? And why would the entire cluster become unavailable when one of the three TiKV nodes goes down…
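
For reference, the same setting can also be read from PD directly; a minimal sketch, assuming pd-ctl is available and using a placeholder PD address:

pd-ctl -u http://<pd-addr>:2379 config show replication    # max-replicas appears among the replication settings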

| username: Holland | Original post link

I'm also curious about this. Three replicas were clearly configured, yet when one TiKV fails the entire cluster becomes unreachable.

| username: TiDB_C罗 | Original post link

Please post the output of tiup cluster display.
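
For a TiUP-managed cluster that would look something like this sketch, with the cluster name as a placeholder:

tiup cluster display <cluster-name>    # lists each node's role, address, and status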

| username: Holland | Original post link

Deployed with Docker

| username: tidb菜鸟一只 | Original post link

Then check the logs of the other TiKV and TiDB nodes to see what errors they report.

| username: Holland | Original post link

The other TiKV nodes reported these errors during that period:

[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1689828446.961442851\",\"description\":\"Failed to connect to remote host: Connection timed out\",\"errno\":110,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"Connection timed out\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:10.0.1.11:10000\"}"]
[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f5daeeb9b80: Retry immediately"]
[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Failed to connect to channel, retrying"]
[2023/07/20 12:47:26.961 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") }))"] [sink_err="Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") })))"] [to_addr=10.0.1.11:10000]
[2023/07/20 12:47:26.961 +08:00] [WARN] [raft_client.rs:199] ["send to 10.0.1.11:10000 failed, the gRPC connection could be broken"]
[2023/07/20 12:47:26.961 +08:00] [ERROR] [transport.rs:163] ["send raft msg err"] [err="Other(\"[src/server/raft_client.rs:208]: RaftClient send fail\")"]

Errors reported by TiDB during the same period:

[2023/07/20 04:47:20.901 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.952 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.953 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.955 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.957 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.959 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]

The address 10.0.1.11:10000 is the downed TiKV node.
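
To cross-check how PD sees the failed store, a pd-ctl sketch like the following could help (the PD address and store ID are placeholders):

pd-ctl -u http://<pd-addr>:2379 store                      # list all stores; the downed one should show as Down/Disconnected
pd-ctl -u http://<pd-addr>:2379 region store <store-id>    # list Regions that still have a peer on that store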

| username: tidb菜鸟一只 | Original post link

Didn’t you deploy using Docker? Try scaling out a TiKV node first.
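
With a Docker deployment, scaling out means starting another tikv-server container by hand; a rough sketch, where the host names, port, and data path are placeholders:

docker run -d --name tikv-new -v /data/tikv-new:/data pingcap/tikv:v4.0.14 \
  /tikv-server --addr 0.0.0.0:20160 \
  --advertise-addr <new-host>:20160 \
  --data-dir /data \
  --pd <pd-host>:2379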

| username: Holland | Original post link

This has already been resolved by adding a new TiKV, stopping the faulty KV, and restarting the remaining two KVs. However, the cause needs to be investigated.

| username: Holland | Original post link

I don't know which step actually fixed it. In any case, after these operations the broken TiKV is still stopped. It hasn't been taken offline via pd-ctl; I'm keeping it for investigation.
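
Once the investigation is done, the dead store could be retired via pd-ctl; a sketch with placeholder PD address and store ID:

pd-ctl -u http://<pd-addr>:2379 store delete <store-id>    # marks the store Offline so PD migrates its Regions away
pd-ctl -u http://<pd-addr>:2379 store <store-id>           # re-check until its state becomes Tombstone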

| username: tidb菜鸟一只 | Original post link

What is the raftstore.sync-log parameter set to?
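
For anyone checking the same thing, the value can be read from a SQL client with the SHOW CONFIG statement (available since v4.0):

SHOW config WHERE type = 'tikv' AND name = 'raftstore.sync-log';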

| username: Holland | Original post link

It is false

| username: tidb菜鸟一只 | Original post link

Version 4.0 has a known issue here: with raftstore.sync-log set to false, Raft log writes are not synced to disk, so when a TiKV node restarts after a power outage or similar abrupt failure, its Raft log can be inconsistent and TiKV panics on startup. That would explain the panic above: the leader's commit index (1238767) is ahead of this node's last persisted log index (1238765) because the tail of the Raft log was lost in the crash. You then have to use the tikv-ctl tool to recover the affected Regions. In v5.0 this parameter was removed, and the behavior defaults to true.
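
A rough sketch of that tikv-ctl recovery flow (the data path, store ID, and Region IDs are placeholders, and both commands must run against stopped TiKV instances):

tikv-ctl --db /path/to/tikv/data/db bad-regions
# on each surviving store, drop the failed store's peers for the damaged Regions:
tikv-ctl --db /path/to/tikv/data/db unsafe-recover remove-fail-stores -s <store-id> -r <region-id>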

| username: Holland | Original post link

Okay. Then let’s upgrade the version.

| username: Holland | Original post link

But why would it cause the entire cluster to be unavailable? Is it because this KV is constantly restarting?

| username: tidb菜鸟一只 | Original post link

Version 4.0 is quite old, and issues like this are hard to avoid on it. It is recommended to upgrade to at least v5.4, which is much more stable.

| username: Holland | Original post link

Does that mean the entire cluster inevitably becomes unavailable after a TiKV panic?

| username: tidb菜鸟一只 | Original post link

I haven’t used version 4.0, but with versions 5 and 6, a 3-replica 3-node TiKV setup won’t be affected by the failure of one node; the remaining 2 nodes will still provide service. If there are issues with version 4.0, it’s hard to pinpoint the cause unless you dig into the source code. Community staff will probably just recommend upgrading your version…

| username: Holland | Original post link

Okay, thank you, boss :grinning: