TiKV cannot start (firewalls confirmed disabled; the three TiKV instances have been assigned to three separate virtual machines)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv启动不了 (防火墙确认都已经关闭,三个tikv已经分配给了三个虚拟机)

| username: TimeKiller

The configuration file is as follows:

# # Global variables are applied to all deployments and used as the default value of
# # the deployments if a specific deployment value is missing.
global:
 user: "tidb"
 ssh_port: 22
 deploy_dir: "/tidb-deploy"
 data_dir: "/tidb-data"

# # Monitored variables are applied to all the machines.
monitored:
 node_exporter_port: 9100
 blackbox_exporter_port: 9115

server_configs:
 tidb:
   instance.tidb_slow_log_threshold: 300
 tikv:
   readpool.storage.use-unified-pool: false
   readpool.coprocessor.use-unified-pool: true
 pd:
   replication.enable-placement-rules: true
   replication.location-labels: ["host"]
 tiflash:
   logger.level: "info"

pd_servers:
 - host: 175.27.241.31

tidb_servers:
 - host: 175.27.169.129

tikv_servers:
 - host: 175.27.241.31
   port: 20160
   status_port: 20180
   config:
     server.labels: { host: "logic-host-1" }

 - host: 175.27.169.129
   port: 20161
   status_port: 20181
   config:
     server.labels: { host: "logic-host-2" }

 - host: 119.45.142.75
   port: 20162
   status_port: 20182
   config:
     server.labels: { host: "logic-host-3" }

tiflash_servers:
 - host: 119.45.142.75

monitoring_servers:
 - host: 175.27.241.31

grafana_servers:
 - host: 175.27.169.129

The deployment process is as follows:

 ~ tiup cluster deploy TiDB-cluster v7.2.0 ./topology.yaml --user root -p
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.12.5/tiup-cluster deploy TiDB-cluster v7.2.0 ./topology.yaml --user root -p
Input SSH password:



+ Detect CPU Arch Name
  - Detecting node 175.27.241.31 Arch info ... Done
  - Detecting node 175.27.169.129 Arch info ... Done
  - Detecting node 119.45.142.75 Arch info ... Done



+ Detect CPU OS Name
  - Detecting node 175.27.241.31 OS info ... Done
  - Detecting node 175.27.169.129 OS info ... Done
  - Detecting node 119.45.142.75 OS info ... Done
Please confirm your topology:
Cluster type:    tidb
Cluster name:    TiDB-cluster
Cluster version: v7.2.0
Role        Host            Ports                            OS/Arch       Directories
----        ----            -----                            -------       -----------
pd          175.27.241.31   2379/2380                        linux/x86_64  /tidb-deploy/pd-2379,/tidb-data/pd-2379
tikv        175.27.241.31   20160/20180                      linux/x86_64  /tidb-deploy/tikv-20160,/tidb-data/tikv-20160
tikv        175.27.169.129  20161/20181                      linux/x86_64  /tidb-deploy/tikv-20161,/tidb-data/tikv-20161
tikv        119.45.142.75   20162/20182                      linux/x86_64  /tidb-deploy/tikv-20162,/tidb-data/tikv-20162
tidb        175.27.169.129  4000/10080                       linux/x86_64  /tidb-deploy/tidb-4000
tiflash     119.45.142.75   9000/8123/3930/20170/20292/8234  linux/x86_64  /tidb-deploy/tiflash-9000,/tidb-data/tiflash-9000
prometheus  175.27.241.31   9090/12020                       linux/x86_64  /tidb-deploy/prometheus-9090,/tidb-data/prometheus-9090
grafana     175.27.169.129  3000                             linux/x86_64  /tidb-deploy/grafana-3000
Attention:
    1. If the topology is not what you expected, check your yaml file.
    2. Please confirm there is no port/directory conflicts in same host.
Do you want to continue? [y/N]: (default=N) y
+ Generate SSH keys ... Done
+ Download TiDB components
  - Download pd:v7.2.0 (linux/amd64) ... Done
  - Download tikv:v7.2.0 (linux/amd64) ... Done
  - Download tidb:v7.2.0 (linux/amd64) ... Done
  - Download tiflash:v7.2.0 (linux/amd64) ... Done
  - Download prometheus:v7.2.0 (linux/amd64) ... Done
  - Download grafana:v7.2.0 (linux/amd64) ... Done
  - Download node_exporter: (linux/amd64) ... Done
  - Download blackbox_exporter: (linux/amd64) ... Done
+ Initialize target host environments
  - Prepare 175.27.241.31:22 ... Done
  - Prepare 175.27.169.129:22 ... Done
  - Prepare 119.45.142.75:22 ... Done
+ Deploy TiDB instance
  - Copy pd -> 175.27.241.31 ... Done
  - Copy tikv -> 175.27.241.31 ... Done
  - Copy tikv -> 175.27.169.129 ... Done
  - Copy tikv -> 119.45.142.75 ... Done
  - Copy tidb -> 175.27.169.129 ... Done
  - Copy tiflash -> 119.45.142.75 ... Done
  - Copy prometheus -> 175.27.241.31 ... Done
  - Copy grafana -> 175.27.169.129 ... Done
  - Deploy node_exporter -> 175.27.169.129 ... Done
  - Deploy node_exporter -> 119.45.142.75 ... Done
  - Deploy node_exporter -> 175.27.241.31 ... Done
  - Deploy blackbox_exporter -> 175.27.169.129 ... Done
  - Deploy blackbox_exporter -> 119.45.142.75 ... Done
  - Deploy blackbox_exporter -> 175.27.241.31 ... Done
+ Copy certificate to remote host
+ Init instance configs
  - Generate config pd -> 175.27.241.31:2379 ... Done
  - Generate config tikv -> 175.27.241.31:20160 ... Done
  - Generate config tikv -> 175.27.169.129:20161 ... Done
  - Generate config tikv -> 119.45.142.75:20162 ... Done
  - Generate config tidb -> 175.27.169.129:4000 ... Done
  - Generate config tiflash -> 119.45.142.75:9000 ... Done
  - Generate config prometheus -> 175.27.241.31:9090 ... Done
  - Generate config grafana -> 175.27.169.129:3000 ... Done
+ Init monitor configs
  - Generate config node_exporter -> 175.27.241.31 ... Done
  - Generate config node_exporter -> 175.27.169.129 ... Done
  - Generate config node_exporter -> 119.45.142.75 ... Done
  - Generate config blackbox_exporter -> 175.27.241.31 ... Done
  - Generate config blackbox_exporter -> 175.27.169.129 ... Done
  - Generate config blackbox_exporter -> 119.45.142.75 ... Done
Enabling component pd
        Enabling instance 175.27.241.31:2379
        Enable instance 175.27.241.31:2379 success
Enabling component tikv
        Enabling instance 119.45.142.75:20162
        Enabling instance 175.27.241.31:20160
        Enabling instance 175.27.169.129:20161
        Enable instance 175.27.169.129:20161 success
        Enable instance 175.27.241.31:20160 success
        Enable instance 119.45.142.75:20162 success
Enabling component tidb
        Enabling instance 175.27.169.129:4000
        Enable instance 175.27.169.129:4000 success
Enabling component tiflash
        Enabling instance 119.45.142.75:9000
        Enable instance 119.45.142.75:9000 success
Enabling component prometheus
        Enabling instance 175.27.241.31:9090
        Enable instance 175.27.241.31:9090 success
Enabling component grafana
        Enabling instance 175.27.169.129:3000
        Enable instance 175.27.169.129:3000 success
Enabling component node_exporter
        Enabling instance 119.45.142.75
        Enabling instance 175.27.169.129
        Enabling instance 175.27.241.31
        Enable 175.27.169.129 success
        Enable 119.45.142.75 success
        Enable 175.27.241.31 success
Enabling component blackbox_exporter
        Enabling instance 119.45.142.75
        Enabling instance 175.27.241.31
        Enabling instance 175.27.169.129
        Enable 175.27.169.129 success
        Enable 119.45.142.75 success
        Enable 175.27.241.31 success
Cluster `TiDB-cluster` deployed successfully, you can start it with command: `tiup cluster start TiDB-cluster --init`

The startup process is as follows:

 tiup cluster start TiDB-cluster
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.12.5/tiup-cluster start TiDB-cluster
Starting cluster TiDB-cluster...
+ [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/TiDB-cluster/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/TiDB-cluster/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=175.27.169.129
+ [Parallel] - UserSSH: user=tidb, host=119.45.142.75
+ [Parallel] - UserSSH: user=tidb, host=175.27.241.31
+ [Parallel] - UserSSH: user=tidb, host=175.27.169.129
+ [Parallel] - UserSSH: user=tidb, host=175.27.241.31
+ [Parallel] - UserSSH: user=tidb, host=175.27.241.31
+ [Parallel] - UserSSH: user=tidb, host=175.27.169.129
+ [Parallel] - UserSSH: user=tidb, host=119.45.142.75
+ [ Serial ] - StartCluster
Starting component pd
        Starting instance 175.27.241.31:2379
        Start instance 175.27.241.31:2379 success
Starting component tikv
        Starting instance 119.45.142.75:20162
        Starting instance 175.27.241.31:20160
        Starting instance 175.27.169.129:20161

Error: failed to start tikv: failed to start: 175.27.169.129 tikv-20161.service, please check the instance's log(/tidb-deploy/tikv-20161/log) for more detail.: timed out waiting for port 20161 to be started after 2m0s

Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2023-07-22-14-07-40.log.

The /tidb-deploy/tikv-20161/log file is as follows:

[2023/07/22 13:52:43.856 +08:00] [INFO] [lib.rs:88] ["Welcome to TiKV"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Release Version:   7.2.0"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Edition:           Community"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Git Commit Hash:   12ce5540f9e8f781f14d3b3a58fb9442f03b6b29"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Git Commit Branch: heads/refs/tags/v7.2.0"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["UTC Build Time:    Unknown (env var does not exist when building)"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Rust Version:      rustc 1.67.0-nightly (96ddd32c4 2022-11-14)"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Enable Features:   pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [lib.rs:93] ["Profile:           dist_release"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [mod.rs:80] ["cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={0, 1}"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [mod.rs:87] ["memory limit in bytes: 16249131008, cpu cores quota: 2"]
[2023/07/22 13:52:43.857 +08:00] [WARN] [lib.rs:544] ["environment variable `TZ` is missing, using `/etc/localtime`"]
[2023/07/22 13:52:43.857 +08:00] [INFO] [config.rs:723] ["kernel parameters"] [value=32768] [param=net.core.somaxconn]
[2023/07/22 13:52:43.857 +08:00] [INFO] [config.rs:723] ["kernel parameters"] [value=0] [param=net.ipv4.tcp_syncookies]
[2023/07/22 13:52:43.857 +08:00] [INFO] [config.rs:723] ["kernel parameters"] [value=0] [param=vm.swappiness]
[2023/07/22 13:52:43.867 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=175.27.241.31:2379]
[2023/07/22 13:52:43.868 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2023/07/22 13:52:45.868 +08:00] [INFO] [util.rs:566] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=175.27.241.31:2379]
[2023/07/22 13:52:45.868 +08:00] [WARN] [client.rs:168] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:599]: PD cluster failed to respond\")"]
[2023/07/22 13:52:46.170 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=175.27.241.31:2379]
[2023/07/22 13:52:48.171 +08:00] [INFO] [util.rs:566] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=175.27.241.31:2379]
[2023/07/22 13:52:48.472 +08:00] [INFO] [util.rs:604] ["connecting to PD endpoint"] [endpoints=175.27.241.31:2379]
[2023/07/22 13:52:50.473 +08:00] [INFO] [util.rs:566] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=175.27.241.31:2379]
[2023
| username: redgame | Original post link

The log only shows that the PD node is unresponsive or timing out. What kind of environment is this deployed in?

| username: 我是咖啡哥 | Original post link

Check the PD status and see if the port is accessible.
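
For example, assuming PD serves its HTTP API on the default client port 2379 and that curl/nc are installed, something like the following, run from each TiKV host, would show whether PD is reachable:

# Query PD's health endpoint over HTTP
curl http://175.27.241.31:2379/pd/api/v1/health
# Or simply test whether the TCP port is open
nc -zv 175.27.241.31 2379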

| username: tidb狂热爱好者 | Original post link

The port is not accessible.

| username: tidb狂热爱好者 | Original post link

Ubuntu:
service ufw stop

CentOS:
service firewalld stop
service iptables stop
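
On newer systemd-based releases, the rough equivalents (assuming firewalld or ufw is the firewall in use) would be:

# CentOS 7 and later
systemctl stop firewalld
systemctl disable firewalld
# Ubuntu
ufw disable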

| username: 霸王龙的日常 | Original post link

Looking at the logs, it seems to be a port issue.

  1. Did you run tiup cluster check as a pre-check before installation? It flags potential risks in the cluster, including disk, firewall, and port issues.
  2. Is SELinux turned off?
    setenforce 0 disables it temporarily and takes effect immediately.
    If you only changed the SELinux configuration file, the change only takes effect after a system reboot.
  3. Check whether the ports are occupied:
    lsof -i:20161
    Check all the other ports as well. I have run into port 9090 being occupied by a system process before; if that happens, consider changing the port. (Example commands are shown below.)
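
For reference, the checks above might look roughly like this, using the TiKV port from the topology in this thread:

tiup cluster check ./topology.yaml --user root -p   # pre-deployment environment check
getenforce                                          # should print Permissive or Disabled
setenforce 0                                        # temporarily disable SELinux
lsof -i:20161                                       # see whether anything already holds the TiKV port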

| username: tidb菜鸟一只 | Original post link

Try telnet 175.27.241.31 2379 on each of the three TiKV hosts.

| username: ShawnYan | Original post link

Take another look at the PD logs.
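
With the deploy_dir from the topology above, the PD log should normally sit under a path like the following (exact file names can differ between versions):

tail -n 100 /tidb-deploy/pd-2379/log/pd.log
grep -i -E "error|warn" /tidb-deploy/pd-2379/log/pd.log | tail -n 50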

| username: 裤衩儿飞上天 | Original post link

Check the result using tiup cluster check.
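
For a cluster that has already been deployed, the check can presumably also be run against the cluster name, for example:

tiup cluster check TiDB-cluster --cluster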

| username: ffeenn | Original post link

There is an issue with the PD node. First, check if PD is running normally, then check for any network problems.
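
A quick way to confirm whether PD (and every other component) is actually up is something along these lines:

tiup cluster display TiDB-cluster   # lists each instance and its current status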