A TiKV Node Failure Causes Cluster Unavailability

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 一个tikv节点宕机导致集群不可用

| username: Holland

[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.14
[Reproduction Path]
Deployed TiDB v4.0.14 on a cloud server with 3 TiKV nodes. After one TiKV node crashed and restarted, it failed to come back up and panicked. This caused the entire cluster to become unavailable.
[Encountered Issue: Problem Description and Impact]

Finally resolved the issue by adding a new TiKV node, shutting down the faulty node, and restarting the remaining 2 TiKV nodes.
[Attachments: Screenshots/Logs/Monitoring]
Here is the TiKV error log:

{"log":"[2023/07/20 12:49:05.709 +08:00] [FATAL] [lib.rs:481] [\"to_commit 1238767 is out of range [last_index 1238765], raft_id: 893827, region_id: 893825\"] [backtrace=\"stack backtrace:\\n   0: tikv_util::set_panic_hook::{{closure}}\\n             at components/tikv_util/src/lib.rs:480\\n   1: std::panicking::rust_panic_with_hook\\n             at src/libstd/panicking.rs:475\\n   2: rust_begin_unwind\\n             at src/libstd/panicking.rs:375\\n   3: std::panicking::begin_panic_fmt\\n             at src/libstd/panicking.rs:326\\n   4: raft::raft_log::RaftLog\u003cT\u003e::commit_to\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/\u003c::std::macros::panic macros\u003e:9\\n   5: raft::raft::Raft\u003cT\u003e::handle_heartbeat\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1877\\n   6: raft::raft::Raft\u003cT\u003e::step_follower\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1718\\n      raft::raft::Raft\u003cT\u003e::step\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft.rs:1129\\n   7: raft::raw_node::RawNode\u003cT\u003e::step\\n             at rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raw_node.rs:339\\n      raftstore::store::peer::Peer::step\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:941\\n      raftstore::store::fsm::peer::PeerFsmDelegate\u003cT,C\u003e::on_raft_message\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:1206\\n   8: raftstore::store::fsm::peer::PeerFsmDelegate\u003cT,C\u003e::handle_msgs\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:455\\n   9: \u003craftstore::store::fsm::store::RaftPoller\u003cT,C\u003e as batch_system::batch::PollHandler\u003craftstore::store::fsm::peer::PeerFsm\u003cengine_rocks::engine::RocksEngine\u003e,raftstore::store::fsm::store::StoreFsm\u003e\u003e::handle_normal\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:785\\n  10: batch_system::batch::Poller\u003cN,C,Handler\u003e::poll\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:325\\n  11: batch_system::batch::BatchSystem\u003cN,C\u003e::spawn::{{closure}}\\n             at home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:402\\n      std::sys_common::backtrace::__rust_begin_short_backtrace\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/sys_common/backtrace.rs:136\\n  12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:469\\n      \u003cstd::panic::AssertUnwindSafe\u003cF\u003e as core::ops::function::FnOnce\u003c()\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:318\\n      std::panicking::try::do_call\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panicking.rs:292\\n      
std::panicking::try\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8//src/libpanic_unwind/lib.rs:78\\n      std::panic::catch_unwind\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/panic.rs:394\\n      std::thread::Builder::spawn_unchecked::{{closure}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libstd/thread/mod.rs:468\\n      core::ops::function::FnOnce::call_once{{vtable.shim}}\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/libcore/ops/function.rs:232\\n  13: \u003calloc::boxed::Box\u003cF\u003e as core::ops::function::FnOnce\u003cA\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\\n  14: \u003calloc::boxed::Box\u003cF\u003e as core::ops::function::FnOnce\u003cA\u003e\u003e::call_once\\n             at rustc/0de96d37fbcc54978458c18f5067cd9817669bc8/src/liballoc/boxed.rs:1022\\n      std::sys_common::thread::start_thread\\n             at src/libstd/sys_common/thread.rs:13\\n      std::sys::unix::thread::Thread::new::thread_start\\n             at src/libstd/sys/unix/thread.rs:80\\n  15: \u003cunknown\u003e\\n  16: clone\\n\"] [location=/rust/git/checkouts/raft-rs-841f8a6db665c5c0/b5f5830/src/raft_log.rs:237] [thread_name=raftstore-22185-0]\n","stream":"stderr","time":"2023-07-20T04:49:05.709255578Z"}
| username: tidb菜鸟一只 | Original post link

SHOW config WHERE NAME LIKE '%max-replicas%';
What is the max-replicas parameter set to? And why would the entire cluster become unavailable when one of the three TiKV nodes goes down…
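
For reference, the same setting can also be read from PD directly; a minimal sketch, assuming pd-ctl is available and using a placeholder PD address:

pd-ctl -u http://<pd-addr>:2379 config show replication    # max-replicas appears among the replication settings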

| username: Holland | Original post link

I'm also curious about this. Three replicas were clearly configured, yet when one TiKV fails the entire cluster becomes unreachable.

| username: TiDB_C罗 | Original post link

Please post the output of tiup cluster display.
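
For a TiUP-managed cluster that would look something like this sketch, with the cluster name as a placeholder:

tiup cluster display <cluster-name>    # lists each node's role, address, and status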

| username: Holland | Original post link

Deployed with Docker

| username: tidb菜鸟一只 | Original post link

Then check the logs of the other TiKV and TiDB nodes to see what errors they report.

| username: Holland | Original post link

The other TiKV nodes reported these errors during that period:

[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1689828446.961442851\",\"description\":\"Failed to connect to remote host: Connection timed out\",\"errno\":110,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"Connection timed out\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:10.0.1.11:10000\"}"]
[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f5daeeb9b80: Retry immediately"]
[2023/07/20 12:47:26.961 +08:00] [INFO] [<unknown>] ["Failed to connect to channel, retrying"]
[2023/07/20 12:47:26.961 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") }))"] [sink_err="Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") })))"] [to_addr=10.0.1.11:10000]
[2023/07/20 12:47:26.961 +08:00] [WARN] [raft_client.rs:199] ["send to 10.0.1.11:10000 failed, the gRPC connection could be broken"]
[2023/07/20 12:47:26.961 +08:00] [ERROR] [transport.rs:163] ["send raft msg err"] [err="Other(\"[src/server/raft_client.rs:208]: RaftClient send fail\")"]

Errors reported by TiDB during the same period:

[2023/07/20 04:47:20.901 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.917 +00:00] [INFO] [client_batch.go:348] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.1.11:10000] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/20 04:47:20.952 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.953 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.955 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.957 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]
[2023/07/20 04:47:20.959 +00:00] [WARN] [client_batch.go:530] ["no available connections"] [target=10.0.1.11:10000]

The address 10.0.1.11:10000 is the downed TiKV node.
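
To cross-check how PD sees the failed store, a pd-ctl sketch like the following could help (the PD address and store ID are placeholders):

pd-ctl -u http://<pd-addr>:2379 store                      # list all stores; the downed one should show as Down/Disconnected
pd-ctl -u http://<pd-addr>:2379 region store <store-id>    # list Regions that still have a peer on that store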

| username: tidb菜鸟一只 | Original post link

Didn’t you deploy using Docker? Try scaling out a TiKV node first.
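
With a Docker deployment, scaling out means starting another tikv-server container by hand; a rough sketch, where the host names, port, and data path are placeholders:

docker run -d --name tikv-new -v /data/tikv-new:/data pingcap/tikv:v4.0.14 \
  /tikv-server --addr 0.0.0.0:20160 \
  --advertise-addr <new-host>:20160 \
  --data-dir /data \
  --pd <pd-host>:2379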

| username: Holland | Original post link

This has already been resolved by adding a new TiKV, stopping the faulty KV, and restarting the remaining two KVs. However, the cause needs to be investigated.

| username: Holland | Original post link

I don't know which step actually fixed it. In any case, after these operations the broken TiKV is still stopped. It hasn't been taken offline via pd-ctl; I'm keeping it for investigation.
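
Once the investigation is done, the dead store could be retired via pd-ctl; a sketch with placeholder PD address and store ID:

pd-ctl -u http://<pd-addr>:2379 store delete <store-id>    # marks the store Offline so PD migrates its Regions away
pd-ctl -u http://<pd-addr>:2379 store <store-id>           # re-check until its state becomes Tombstone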

| username: tidb菜鸟一只 | Original post link

What is the raftstore.sync-log parameter set to?
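
For anyone checking the same thing, the value can be read from a SQL client with the SHOW CONFIG statement (available since v4.0):

SHOW config WHERE type = 'tikv' AND name = 'raftstore.sync-log';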

| username: Holland | Original post link

It is false

| username: tidb菜鸟一只 | Original post link

Version 4.0 has a known issue here: with raftstore.sync-log set to false, Raft log writes are not synced to disk, so when a TiKV node restarts after a power outage or similar abrupt failure, its Raft log can be inconsistent and TiKV panics on startup. That would explain the panic above: the leader's commit index (1238767) is ahead of this node's last persisted log index (1238765) because the tail of the Raft log was lost in the crash. You then have to use the tikv-ctl tool to recover the affected Regions. In v5.0 this parameter was removed, and the behavior defaults to true.
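
A rough sketch of that tikv-ctl recovery flow (the data path, store ID, and Region IDs are placeholders, and both commands must run against stopped TiKV instances):

tikv-ctl --db /path/to/tikv/data/db bad-regions
# on each surviving store, drop the failed store's peers for the damaged Regions:
tikv-ctl --db /path/to/tikv/data/db unsafe-recover remove-fail-stores -s <store-id> -r <region-id>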

| username: Holland | Original post link

Okay. Then let’s upgrade the version.

| username: Holland | Original post link

But why would it cause the entire cluster to be unavailable? Is it because this KV is constantly restarting?

| username: tidb菜鸟一只 | Original post link

Version 4.0 is quite old, and issues like this are hard to avoid on it. It is recommended to upgrade to at least v5.4, which is much more stable.

| username: Holland | Original post link

Does that mean the entire cluster inevitably becomes unavailable after a TiKV panic?

| username: tidb菜鸟一只 | Original post link

I haven’t used version 4.0, but with versions 5 and 6, a 3-replica 3-node TiKV setup won’t be affected by the failure of one node; the remaining 2 nodes will still provide service. If there are issues with version 4.0, it’s hard to pinpoint the cause unless you dig into the source code. Community staff will probably just recommend upgrading your version…

| username: Holland | Original post link

Okay, thank you, boss :grinning: