TiDB 7.5.0 Failed to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 7.5.0 启动失败 (tidb 7.5.0 failed to start)

| username: 谢斌1204

Error: failed to start pd: failed to start: 10.96.129.222 pd-2379.service, please check the instance's log(/data/tidb/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 2379 to be started after 2m0s

| username: 裤衩儿飞上天 | Original post link

Take out the detailed log information and have a look~

| username: 谢斌1204 | Original post link

libnuma: Warning: node argument 1 is out of range

usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <shmmemfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
libnuma: Warning: node argument 1 is out of range
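The warning above means numactl was asked to bind the service to NUMA node 1, but the host does not have a node 1. A quick way to see which nodes the kernel actually exposes, assuming a Linux host with sysfs mounted, is:

```shell
# List the NUMA nodes the kernel knows about. A single-socket host
# typically shows only node0; asking numactl for node 1 on such a
# machine produces exactly the "node argument 1 is out of range" warning.
ls -d /sys/devices/system/node/node* 2>/dev/null \
  || echo "NUMA sysfs info not available"
```

`numactl --hardware` on the affected host gives the same information in more detail.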

| username: 谢斌1204 | Original post link

2024-01-15T14:35:21.530+0800 DEBUG retry error {"error": "operation timed out after 2m0s"}
2024-01-15T14:35:21.763+0800 INFO SSHCommand {"host": "10.96.129.221", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port \nLISTEN 0 128 *:111 *:* \nLISTEN 0 128 *:22 *:* \nLISTEN 0 100 127.0.0.1:25 *:* \nLISTEN 0 128 [::]:111 [::]:* \nLISTEN 0 128 [::]:22 [::]:* \nLISTEN 0 100 [::1]:25 [::]:* \n", "stderr": ""}
2024-01-15T14:35:21.763+0800 DEBUG retry error {"error": "operation timed out after 2m0s"}
2024-01-15T14:35:22.383+0800 INFO SSHCommand {"host": "10.96.129.223", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin ss -ltn", "stdout": "State Recv-Q Send-Q Local Address:Port Peer Address:Port \nLISTEN 0 128 *:111 *:* \nLISTEN 0 128 *:22 *:* \nLISTEN 0 100 127.0.0.1:25 *:* \nLISTEN 0 128 [::]:111 [::]:* \nLISTEN 0 128 [::]:22 [::]:* \nLISTEN 0 100 [::1]:25 [::]:* \n", "stderr": ""}
2024-01-15T14:35:22.383+0800 DEBUG retry error {"error": "operation timed out after 2m0s"}
2024-01-15T14:35:22.383+0800 DEBUG TaskFinish {"task": "StartCluster", "error": "failed to start pd: failed to start: 10.96.129.222 pd-2379.service, please check the instance's log(/data/tidb/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 2379 to be started after 2m0s", "errorVerbose": "timed out waiting for port 2379 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:92\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:129\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:167\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start: 10.96.129.222 pd-2379.service, please check the instance's log(/data/tidb/tidb-deploy/pd-2379/log) for more detail.\nfailed to start pd"}
2024-01-15T14:35:22.383+0800 INFO Execute command finished {"code": 1, "error": "failed to start pd: failed to start: 10.96.129.222 pd-2379.service, please check the instance's log(/data/tidb/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 2379 to be started after 2m0s", "errorVerbose": "timed out waiting for port 2379 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:92\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:129\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:167\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start: 10.96.129.222 pd-2379.service, please check the instance's log(/data/tidb/tidb-deploy/pd-2379/log) for more detail.\nfailed to start pd"}
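The `ss -ltn` dumps in this log are the telling part: only ports 111, 22, and 25 are listening, so PD never opened 2379 and tiup's wait simply timed out. A minimal sketch of that check, run here against the listener list captured above (on a live host you would pipe `ss -ltn` in directly):

```shell
# Sample listener table, as captured in the tiup log above.
listeners='LISTEN 0 128 *:111 *:*
LISTEN 0 128 *:22  *:*
LISTEN 0 100 127.0.0.1:25 *:*'

# If the PD client port is absent, the failure is inside PD itself,
# so the next stop is pd.log on the instance, not tiup's own log.
if echo "$listeners" | grep -q ':2379'; then
  echo "pd is listening on 2379"
else
  echo "port 2379 not open; check the PD instance log"
fi
# prints: port 2379 not open; check the PD instance log
```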

| username: WalterWj | Original post link

It seems like there is an issue with CPU pinning :thinking:. From the logs, it looks like the memory configuration exceeds the resources allocated by CPU pinning :thinking:. How about removing the CPU pinning?
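For reference, NUMA binding in a tiup topology is a per-instance `numa_node` field. A hypothetical sketch of what removing it means, using one PD entry shaped like the topology later in this thread (for the binding to work, the node number must appear in `numactl --hardware` output on that host):

```yaml
pd_servers:
  - host: 10.96.129.222
    client_port: 2379
    peer_port: 2380
    # numa_node: "1"   # remove this line, or point it at a node that
    #                  # exists, if numactl reports "node argument 1
    #                  # is out of range"
```

The usual path for this kind of change is `tiup cluster edit-config <cluster-name>` followed by restarting the affected component.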

| username: 谢斌1204 | Original post link

The numa_node has already been removed. (Is this the one?)

| username: WalterWj | Original post link

Hmm, does it still not work after removing it?

| username: 谢斌1204 | Original post link

Hmm!!!

| username: 谢斌1204 | Original post link

Why does the cluster fail to start once numa_node is added, and why doesn't changing the parameter back fix it, so that the only way out is rebuilding the cluster?

| username: tidb菜鸟一只 | Original post link

Please send the YAML configuration file.

| username: 谢斌1204 | Original post link

global:
  user: "tidb"
  group: "tidb"
  ssh_port: 22
  deploy_dir: "/data/tidb/tidb-deploy"
  data_dir: "/data/tidb/tidb-data"
  listen_host: 0.0.0.0
  arch: "amd64"

monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: "/data/tidb/tidb-deploy/monitored-9100"
  data_dir: "/data/tidb/tidb-data/monitored-9100"
  log_dir: "/data/tidb/tidb-deploy/monitored-9100/log"

pd_servers:
  - host: 10.96.129.221
    ssh_port: 22
    name: "pd-129-221"
    client_port: 2379
    peer_port: 2380
    deploy_dir: "/data/tidb/tidb-deploy/pd-2379"
    data_dir: "/data/tidb/tidb-data/pd-2379"
    log_dir: "/data/tidb/tidb-deploy/pd-2379/log"
  - host: 10.96.129.222
    ssh_port: 22
    name: "pd-129.222"
    client_port: 2379
    peer_port: 2380
    deploy_dir: "/data/tidb/tidb-deploy/pd-2379"
    data_dir: "/data/tidb/tidb-data/pd-2379"
    log_dir: "/data/tidb/tidb-deploy/pd-2379/log"
  - host: 10.96.129.223
    ssh_port: 22
    name: "pd-129-223"
    client_port: 2379
    peer_port: 2380
    deploy_dir: "/data/tidb/tidb-deploy/pd-2379"
    data_dir: "/data/tidb/tidb-data/pd-2379"
    log_dir: "/data/tidb/tidb-deploy/pd-2379/log"

tidb_servers:
  - host: 10.96.129.224
    ssh_port: 22
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb/tidb-deploy/tidb-4000"
    log_dir: "/data/tidb/tidb-deploy/tidb-4000/log"
  - host: 10.96.129.225
    ssh_port: 22
    port: 4000
    status_port: 10081
    deploy_dir: "/data/tidb/tidb-deploy/tidb-4000"
    log_dir: "/data/tidb/tidb-deploy/tidb-4000/log"
  - host: 10.96.129.226
    ssh_port: 22
    port: 4000
    status_port: 10080
    deploy_dir: "/data/tidb/tidb-deploy/tidb-4000"
    log_dir: "/data/tidb/tidb-deploy/tidb-4000/log"

tikv_servers:
  - host: 10.96.129.227
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: "/data/tidb/tidb-deploy/tikv-20160"
    data_dir: "/data/tidb/tidb-data/tikv-20160"
    log_dir: "/data/tidb/tidb-deploy/tikv-20160/log"
  - host: 10.96.129.228
    ssh_port: 22
    port: 20161
    status_port: 20181
    deploy_dir: "/data/tidb/tidb-deploy/tikv-20161"
    data_dir: "/data/tidb/tidb-data/tikv-20161"
    log_dir: "/data/tidb/tidb-deploy/tikv-20161/log"
  - host: 10.96.129.229
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: "/data/tidb/tidb-deploy/tikv-20160"
    data_dir: "/data/tidb/tidb-data/tikv-20160"
    log_dir: "/data/tidb/tidb-deploy/tikv-20160/log"

tidb_dashboard_servers:
  - host: 10.96.129.230
    ssh_port: 22
    port: 12333
    deploy_dir: "/data/tidb/tidb-deploy/tidb-dashboard-12333"
    data_dir: "/data/tidb/tidb-data/tidb-dashboard-12333"
    log_dir: "/data/tidb/tidb-deploy/tidb-dashboard-12333/log"

monitoring_servers:
  - host: 10.96.129.230
    ssh_port: 22
    port: 9090
    ng_port: 12020
    deploy_dir: "/data/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb/tidb-data/prometheus-8249"
    log_dir: "/data/tidb/tidb-deploy/prometheus-8249/log"
    #rule_dir: /data/tidb/prometheus_rule
    scrape_interval: 15s
    scrape_timeout: 10s

grafana_servers:
  - host: 10.96.129.230
    port: 3000
    deploy_dir: "/data/tidb/tidb-deploy/grafana-3000"
    #dashboard_dir: /data/tidb/dashboards

alertmanager_servers:
  - host: 10.96.129.230
    ssh_port: 22
    listen_host: 0.0.0.0
    web_port: 9093
    cluster_port: 9094
    deploy_dir: "/data/tidb/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb/tidb-deploy/alertmanager-9093/log"
    #config_file: "/data/tidb/tidb-deploy/alertmanager-9093/bin/alertmanager"

| username: dba远航 | Original post link

Check the network status.

| username: tidb菜鸟一只 | Original post link

There’s nothing wrong with the configuration file, it seems. Is it still giving an error when you start it now?

| username: TiDBer_小阿飞 | Original post link

Check if the port is being occupied.

| username: changpeng75 | Original post link

Use lsof -i:2379 to check if the port is occupied. If possible, try restarting and see if it works.
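If lsof is not installed, a bash-only probe over `/dev/tcp` works too. A small sketch (the `check_port` helper name is ours, and `/dev/tcp` is a bash feature, not POSIX sh):

```shell
# check_port HOST PORT -> exit 0 if something accepts TCP connections there.
# Opening and closing the connection happens inside a subshell, so the
# subshell's exit status is the function's result.
check_port() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# On the PD host itself: if 2379 is closed, nothing is "occupying" the
# port and PD simply never came up; if it is open, `ss -ltnp | grep 2379`
# shows which process holds it.
if check_port 127.0.0.1 2379; then
  echo "2379 is open"
else
  echo "2379 is closed"
fi
```

For remote hosts, wrap the call in `timeout 3 bash -c '...'` so an unreachable address fails fast instead of hanging on the TCP timeout.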

| username: 小于同学 | Original post link

Try restarting.