Cluster Created Successfully but Failed to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: cluster创建成功但是启动失败

| username: TiDBer_eyHUd5pk

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.0.0
[Encountered Problem: Phenomenon and Impact]
Cluster configuration and the pre-deployment checks all passed. The cluster was created successfully, but it then failed to start. SSH mutual trust is set up between the machines, and all of them run CentOS 7.9.2009. I am not sure whether this is caused by having started a playground earlier.
[Attachment: Screenshot/Log/Monitoring]

| username: zhanggame1 | Original post link

Try manually starting the PD that reported an error.

| username: TiDBer_vfJBUcxl | Original post link

Manually test the SSH connection between the failed nodes to see if it succeeds.

| username: tony5413 | Original post link

Look at the logs first to check whether SSH is the problem.

| username: tidb菜鸟一只 | Original post link

Try executing systemctl start pd-2379.service directly on the machine 10.102.2.190.

| username: redgame | Original post link

Have you already stopped and cleaned up the previous Playground-related processes and resources?
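
If a playground was started on these machines earlier, leftover processes may still be holding the default ports (for example 2379 for PD). Below is a minimal sketch for spotting them. It assumes the playground binary is named tiup-playground and that the component binaries keep their usual names; filter_tidb_procs is a hypothetical helper, not a tiup command.

```shell
# Filter a process listing (in the format printed by `ps -eo pid,args`)
# down to TiDB-related processes that a playground may have left behind.
filter_tidb_procs() {
  # The second grep drops the filter's own grep process from the listing.
  grep -E '(tiup-playground|pd-server|tikv-server|tidb-server|tiflash)' | grep -v grep
}
```

Run it as `ps -eo pid,args | filter_tidb_procs`. An empty result means none of the listed names matched a running process.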

| username: TiDBer_eyHUd5pk | Original post link

When I start it manually, it reports “not found service.” What could be the reason? The control machine should have installed PD on this machine, right? I did not install TiDB on this machine manually.

| username: TiDBer_eyHUd5pk | Original post link

The SSH connection succeeds.

| username: TiDBer_eyHUd5pk | Original post link

The processes have been stopped, but the resources have not been cleaned up.

| username: tidb菜鸟一只 | Original post link

Look for the run_tikv.sh script on this node (the approximate path is shown above), run it, and check the error output.

| username: TiDBer_eyHUd5pk | Original post link

I executed the above command. The pd-server on 10.102.2.190 has already started, but the log file still reports an error saying it has not started.

| username: tidb菜鸟一只 | Original post link

The above was incorrect; you should run run_pd.sh:
/data/tidb-deploy/pd-2379/scripts/run_pd.sh
But has your pd-server already started? Check with:
ps -ef | grep pd-server
Also, check whether the file /etc/systemd/system/pd-2379.service exists.
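
Both checks can be combined into one small helper. This is a minimal sketch: check_component is a hypothetical name, and pgrep -x is used instead of ps with grep so that the search does not match itself.

```shell
# Report whether a systemd unit file exists and whether a process with the
# given name is running.
# $1: path to the unit file, e.g. /etc/systemd/system/pd-2379.service
# $2: exact process name, e.g. pd-server
check_component() {
  if [ -f "$1" ]; then
    echo "unit file: present"
  else
    echo "unit file: missing"
  fi

  # pgrep -x matches the process name exactly.
  if pgrep -x "$2" >/dev/null 2>&1; then
    echo "process: running"
  else
    echo "process: not running"
  fi
}
```

For the PD node in this thread: `check_component /etc/systemd/system/pd-2379.service pd-server`.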

| username: TiDBer_eyHUd5pk | Original post link

I ran run_pd.sh as you suggested, and the file /etc/systemd/system/pd-2379.service indeed does not exist.

| username: tidb菜鸟一只 | Original post link

Do you see any command in run_pd.sh to generate pd-2379.service? Why wasn’t it generated?

| username: TiDBer_eyHUd5pk | Original post link

There is no such command. My path is /data/cluster-deploy/pd-2379/scripts/run_pd.sh

| username: tidb菜鸟一只 | Original post link

That’s strange, your script can execute, but it didn’t generate the service on the local machine… How about you manually create one:

vi /etc/systemd/system/pd-2379.service
[Unit]
Description=pd service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/bin/bash -c '/data/cluster-deploy/pd-2379/scripts/run_pd.sh'
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target

Then

systemctl daemon-reload
systemctl status pd-2379.service

Check it out.
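
If other components are missing their units as well, the same file can be generated rather than typed by hand. This is a rough sketch, not what tiup itself does: render_unit is a hypothetical helper, and it assumes the start script follows the run_<component>.sh naming seen for run_pd.sh and run_tikv.sh.

```shell
# Print a systemd unit with the same settings as the hand-written PD unit.
# $1: component name (pd, tikv)
# $2: deploy directory of the instance
# $3: user that runs the service
render_unit() {
  cat <<EOF
[Unit]
Description=$1 service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=$3
ExecStart=/bin/bash -c '$2/scripts/run_$1.sh'
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target
EOF
}
```

For example, `render_unit pd /data/cluster-deploy/pd-2379 tidb | sudo tee /etc/systemd/system/pd-2379.service` writes the PD unit. The service still has to be loaded and started afterwards with `systemctl daemon-reload` and `systemctl start pd-2379.service`.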

| username: TiDBer_eyHUd5pk | Original post link

This method can solve the issue of the control machine failing to start PD. I also have issues with TiKV and other services failing to start. I will try this method for those as well.

| username: TiDBer_eyHUd5pk | Original post link

Manually creating pd-2379.service works for a while, but the service shuts down by itself after some time. I don’t know what the problem is.
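
To narrow down why the service stops, it can help to pull only the error-level lines from the component log. This is a minimal sketch, assuming the log uses the bracketed [ERROR] and [FATAL] level tags of the TiDB log format; last_errors is a hypothetical helper, and the log path in the usage line is a guess based on the deploy directory mentioned earlier.

```shell
# Print the most recent ERROR/FATAL lines from a log file.
# $1: log file
# $2: maximum number of lines to print (defaults to 20)
last_errors() {
  grep -E '\[(ERROR|FATAL)\]' "$1" | tail -n "${2:-20}"
}
```

For example: `last_errors /data/cluster-deploy/pd-2379/log/pd.log 50`. The systemd side of the story is available with `journalctl -u pd-2379.service`.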

log

| username: tidb菜鸟一只 | Original post link

Use tiup cluster display tidb-xxxx to check the cluster status. I feel like there might have been an issue with your initial deployment. A normal deployment should definitely generate the service.

| username: cassblanca | Original post link

Could the deployment failure be due to the gateway or firewall restricting communication between machines since TiDB is deployed across different network segments?
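
One way to test this is to probe the component ports from each of the other machines. This is a minimal sketch: check_port is a hypothetical helper that relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, and the ports in the usage line are the PD defaults (2379 for clients, 2380 for peers), which may differ from your topology.

```shell
# Try to open a TCP connection and report the result.
# $1: host
# $2: port
check_port() {
  # timeout bounds the wait in case a firewall silently drops the packets.
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}
```

For example, from another node: `for p in 2379 2380; do check_port 10.102.2.190 "$p"; done`. An unreachable result only means the connection could not be opened; it does not tell a firewall rule apart from a service that is simply not listening.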