Help identify the cause of KV startup failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 帮忙定位一下kv启动失败的原因

| username: tianjun

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.0
[Reproduction Path] What operations were performed to cause the issue
Deployed the TiDB cluster on a virtual machine. Today, when starting the cluster, it suddenly failed to start. The error message suggested checking the TiKV logs, but there were no obvious errors in them. Please help identify the cause of the failure.

[Encountered Issue: Problem Phenomenon and Impact]
TiKV failed to start, with no obvious error in its logs
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

Below are the startup logs
[2023/10/31 20:04:29.670 +08:00] [INFO] [lib.rs:88] [“Welcome to TiKV”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Release Version: 7.1.0”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Edition: Community”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Git Commit Hash: 0c34464e386940a60f2a2ce279a4ef18c9c6c45b”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Git Commit Branch: heads/refs/tags/v7.1.0”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“UTC Build Time: Unknown (env var does not exist when building)”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Enable Features: pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [lib.rs:93] [“Profile: dist_release”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [mod.rs:80] [“cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={8, 0, 5, 14, 9, 13, 10, 1, 6, 4, 3, 15, 11, 12, 7, 2}”]
[2023/10/31 20:04:29.674 +08:00] [INFO] [mod.rs:87] [“memory limit in bytes: 33547878400, cpu cores quota: 16”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [lib.rs:544] [“environment variable TZ is missing, using /etc/localtime”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [server.rs:1511] [“check: kernel”] [err=“kernel parameters net.core.somaxconn got 128, expect 32768”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [server.rs:1511] [“check: kernel”] [err=“kernel parameters net.ipv4.tcp_syncookies got 1, expect 0”]
[2023/10/31 20:04:29.674 +08:00] [WARN] [server.rs:1511] [“check: kernel”] [err=“kernel parameters vm.swappiness got 30, expect 0”]
[2023/10/31 20:04:29.681 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:29.685 +08:00] [INFO] [] [“TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter”]
[2023/10/31 20:04:31.686 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: }))”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:31.686 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:31.687 +08:00] [INFO] [] [“subchannel 0x7f9e5864f000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: connect failed: {"created":"@1698753871.687002792","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.202:2379"}”]
[2023/10/31 20:04:31.687 +08:00] [INFO] [] [“subchannel 0x7f9e5864f000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: Retry in 999 milliseconds”]
[2023/10/31 20:04:31.687 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: }))”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:31.687 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:31.688 +08:00] [INFO] [] [“subchannel 0x7f9e5864f400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: connect failed: {"created":"@1698753871.688366860","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.203:2379"}”]
[2023/10/31 20:04:31.688 +08:00] [INFO] [] [“subchannel 0x7f9e5864f400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: Retry in 999 milliseconds”]
[2023/10/31 20:04:31.688 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: }))”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:31.688 +08:00] [WARN] [client.rs:166] [“validate PD endpoints failed”] [err=“Other("[components/pd_client/src/util.rs:599]: PD cluster failed to respond")”]
[2023/10/31 20:04:31.991 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:33.993 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: "Deadline Exceeded", details: }))”] [endpoints=192.168.1.201:2379]
[2023/10/31 20:04:33.993 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:33.994 +08:00] [INFO] [] [“subchannel 0x7f9e58791000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: connect failed: {"created":"@1698753873.994289323","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.202:2379"}”]
[2023/10/31 20:04:33.994 +08:00] [INFO] [] [“subchannel 0x7f9e58791000 {address=ipv4:192.168.1.202:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.202:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.202:2379}: Retry in 998 milliseconds”]
[2023/10/31 20:04:33.994 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: }))”] [endpoints=192.168.1.202:2379]
[2023/10/31 20:04:33.994 +08:00] [INFO] [util.rs:604] [“connecting to PD endpoint”] [endpoints=192.168.1.203:2379]
[2023/10/31 20:04:33.995 +08:00] [INFO] [] [“subchannel 0x7f9e58791400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: connect failed: {"created":"@1698753873.995013352","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:192.168.1.203:2379"}”]
[2023/10/31 20:04:33.995 +08:00] [INFO] [] [“subchannel 0x7f9e58791400 {address=ipv4:192.168.1.203:2379, args=grpc.client_channel_factory=0x7f9e586979b0, grpc.default_authority=192.168.1.203:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f9e586389a0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f9e586b5cc0, grpc.server_uri=dns:///192.168.1.203:2379}: Retry in 1000 milliseconds”]
[2023/10/31 20:04:33.995 +08:00] [INFO] [util.rs:566] [“PD failed to respond”] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "failed to connect to all addresses", details: }))”] [endpoints=192.168.
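
For reference, the kernel-parameter warnings near the top of the log only report values that differ from TiKV's recommendations; aligning them would look roughly like this (a sketch; run as root on the TiKV host, and persist the values in /etc/sysctl.conf if you keep them):

    sysctl -w net.core.somaxconn=32768
    sysctl -w net.ipv4.tcp_syncookies=0
    sysctl -w vm.swappiness=0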

| username: hey-hoho | Original post link

The TiKV on 201 cannot access the PD on 203. Check if there are any network restrictions. Also, have you only deployed 2 PDs?

| username: Jellybean | Original post link

The error log indicates that TiKV is unable to establish a connection with PD.

  • Check whether the PD cluster is healthy by reviewing the PD logs.
  • Check the network connectivity between TiKV and PD, and between all the nodes in the cluster (a few quick checks are sketched below).
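
A minimal set of quick checks from the TiKV host, assuming the PD endpoints shown in the log above (adjust the IPs to your topology):

    # Raw TCP reachability of each PD client port
    for pd in 192.168.1.201 192.168.1.202 192.168.1.203; do
        nc -vz -w 3 "$pd" 2379
    done

    # If a PD answers on 2379, its members API shows whether a leader exists
    curl http://192.168.1.201:2379/pd/api/v1/members
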
| username: tianjun | Original post link

The firewall is turned off and I can ping the nodes, but I don’t know why the connection fails. It was working fine before; the problem appeared suddenly today.

| username: Fly-bird | Original post link

Can SSH connect to that node?

| username: tianjun | Original post link

SSH, FTP, ping, and so on are all normal; only the startup fails.

| username: 像风一样的男子 | Original post link

Try using tiup cluster display to check the cluster status. Have the 2 PDs failed to elect a leader?
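
For example (replace <cluster-name> with your cluster's name; the second command assumes the pd-ctl shipped with this version via tiup):

    # Component status as seen by tiup
    tiup cluster display <cluster-name>

    # Ask PD directly about its members and the current leader
    tiup ctl:v7.1.0 pd -u http://192.168.1.201:2379 member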

| username: zhanggame1 | Original post link

PD probably didn’t start up, right? Check the time on all nodes.
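
A quick way to check both points on each PD host, assuming the systemd units generated by tiup (named component-port, e.g. pd-2379):

    # Is the PD process running?
    systemctl status pd-2379 --no-pager

    # Is the clock in sync?
    timedatectl status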

| username: TiDBer_小阿飞 | Original post link

Check whether the PD and TiKV ports are already occupied by another process.
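
For example, on the PD and TiKV hosts (default ports shown; adjust to your topology):

    # Which process, if any, is holding the PD and TiKV ports
    ss -tlnp | grep -E '2379|2380|20160|20180'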

| username: h5n1 | Original post link

The cluster only has PDs on 201 and 203 at startup, but the logs also mention PD 202:2379. Let’s first figure out what happened to PD 202. Did you perform any operations on it?

| username: tianjun | Original post link

We did perform a scale-in earlier, and the scale-in was executed. When checking the cluster status, that node is no longer present, but the startup process is still looking for the removed node.

| username: h5n1 | Original post link

How did you perform the scale-in? You can delete 202 from the run_tikv.sh script in the scripts directory.
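
A rough sketch of that edit, assuming a standard TiUP layout where the TiKV startup script is <deploy_dir>/scripts/run_tikv.sh and the removed PD appears in the --pd endpoint list as 192.168.1.202:2379 followed by a comma:

    cd <tikv-deploy-dir>/scripts
    cp run_tikv.sh run_tikv.sh.bak        # back up first
    sed -i 's/192\.168\.1\.202:2379,//' run_tikv.sh
    grep -e '--pd' run_tikv.sh            # verify only 201 and 203 remain

Then try starting the cluster again.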

| username: Jellybean | Original post link

What status does tiup cluster display show?

| username: Kongdom | Original post link

Did you check the cluster status with tiup cluster display? Can you share the output?

| username: andone | Original post link

Is the SSH connection working?