Error Starting Cluster with `tiup cluster start tidb-test`: PD Node Failed to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 运行tiup cluster start tidb-test 启动集群报错 pd 节点启动失败

| username: 突破边界

[TiDB Usage Environment] Testing
[TiDB Version] 7.5.0
[Reproduction Path]

  1. Ran `tiup cluster stop tidb-test`. The cluster stopped successfully.
  2. Ran `tiup cluster start tidb-test`. The cluster failed to start because the PD instance did not come up. Console output:
Error: failed to start pd: failed to start: 192.168.0.150 pd-11100.service, please check the instance's log(/mnt/filemanage/tidb/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 11100 to be started after 2m0s

However, no new logs had been generated under /mnt/filemanage/tidb/tidb-deploy/pd-2379/log (I had deleted the old log files beforehand).
I then checked /root/.tiup/logs/tiup-cluster-debug-2015-01-13-00-11-17.log, which showed the following:

2015-01-13T00:11:17.617+0800    INFO    Execute command finished        {"code": 1, "error": "failed to start pd: failed to start: 192.168.0.150 pd-11100.service, please check the instance's log(/mnt/filemanage/tidb/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 11100 to be started after 2m0s", "errorVerbose": "timed out waiting for port 11100 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:92\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:129\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:167\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start: 192.168.0.150 pd-11100.service, please check the instance's log(/mnt/filemanage/tidb/tidb-deploy/pd-2379/log) for more detail.\nfailed to start pd"}

I couldn't find any further information and don't know why the startup failed.
[Encountered Problem: Problem Phenomenon and Impact] No log clearly indicates why PD failed to start. How should I troubleshoot further?

| username: tidb菜鸟一只 | Original post link

The permissions for my directory are 755…

| username: DBAER | Original post link

Did the higher version make modifications to improve security? I see that the permissions in version 6.1 are the same as yours.

| username: 突破边界 | Original post link

Hello, I made a mistake with the logs; those were old logs, and there are actually no new error messages. I have edited the post information again. Now, because there are no clear error messages, I don’t know how to proceed with further troubleshooting.

| username: 突破边界 | Original post link

Hello, I made a mistake with the logs. I have re-edited the post. In fact, I did not see any new output logs, but I don’t know how to further analyze the reason for the PD failure.

| username: DBAER | Original post link

Check whether port 11100 is already in use.
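
One way to run this check is sketched below as a small shell helper. `port_listening` is a hypothetical name, not a TiUP command, and the sketch assumes that either `ss` (iproute2) or `netstat` (net-tools) is installed. 11100 is the PD port from the error message.

```shell
# Sketch: succeed (exit 0) if some process is listening on the given TCP port.
# port_listening is a hypothetical helper, not part of TiUP.
port_listening() {
  port="$1"
  if command -v ss >/dev/null 2>&1; then
    lister="ss -ltn"        # -l listening sockets, -t TCP, -n numeric ports
  else
    lister="netstat -ltn"   # fallback when iproute2 is not installed
  fi
  # Column 4 is the local address in both tools; match a trailing ":<port>".
  $lister 2>/dev/null | awk -v p=":${port}" '$4 ~ (p "$") { found = 1 } END { exit !found }'
}
```

For example, `port_listening 11100 && echo "in use" || echo "free"`. If the port turns out to be in use, `ss -ltnp` run as root also shows the owning process.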

| username: 突破边界 | Original post link

I checked, and nothing is using the port.

[root@localhost pd-2379]# cd log
[root@localhost log]# ls
[root@localhost log]# ps -ef|grep 11100
root     2899592 2801206  0 00:23 pts/1    00:00:00 grep --color=auto 11100
[root@localhost log]# netstat -anp|grep 11100
[root@localhost log]#

| username: TiDBer_1111 | Original post link

Is this port accessible via telnet?
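
Where telnet is not installed, the same probe can be approximated with bash alone. This is a minimal sketch: `tcp_probe` is a hypothetical helper, and it assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command. A refused connection only shows that nothing is accepting connections on the port, which is also what happens when the service itself is not running.

```shell
# Sketch: try to open a TCP connection, similar to "telnet host port".
# tcp_probe is a hypothetical helper; it relies on bash's /dev/tcp redirection.
tcp_probe() {
  # $1 = host, $2 = port; give up after 3 seconds so a filtered port cannot hang.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, `tcp_probe 192.168.0.150 11100 && echo open || echo closed`.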

| username: 突破边界 | Original post link

Telnet is not working.

| username: kelvin | Original post link

If the port cannot be reached via telnet, the start will certainly fail. You need to resolve this network issue first.

| username: 突破边界 | Original post link

Hello, the service didn't start, so the port is bound to be unreachable, isn't it? There is no problem with the network.

| username: 突破边界 | Original post link

11100 is the listening port I configured for PD. Since PD failed to start, this port cannot be reachable.

| username: TiDBer_1111 | Original post link

Try manually running run_pd.sh in the scripts directory of PD, or use systemctl start to check.
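
These checks can be sketched as follows. The unit name `pd-11100.service` comes from the tiup error message in the first post. The location of `scripts/run_pd.sh` under the deploy directory is an assumption inferred from the log path in that message, and `pd_diagnose` is a hypothetical helper name.

```shell
# Assumption: unit name taken from the tiup error message in the first post.
PD_UNIT="pd-11100.service"
# Assumption: deploy directory inferred from the log path in the same message.
PD_DEPLOY_DIR="/mnt/filemanage/tidb/tidb-deploy/pd-2379"

# pd_diagnose is a hypothetical helper that gathers what systemd knows about PD.
pd_diagnose() {
  # Current state and last exit status of the unit.
  systemctl status "$PD_UNIT" --no-pager
  # Up to 50 recent journal entries for the unit.
  journalctl -u "$PD_UNIT" -n 50 --no-pager
  # Confirm the start script exists and is executable before running it by hand.
  ls -l "$PD_DEPLOY_DIR/scripts/run_pd.sh"
}
```

Run `pd_diagnose` as root on 192.168.0.150. A start script launched by hand typically stays in the foreground for as long as the server process is alive, so a terminal that appears to hang does not by itself mean that PD failed.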

| username: Ming | Original post link

Go to the PD directory, find the scripts, and manually execute run_pd.sh to see if there is any output.

| username: 突破边界 | Original post link

Directly executing the script gets stuck, and nothing is output. When I use tiup, it also gets stuck for a long time.

| username: 突破边界 | Original post link

Directly executing the script, I found that logs were generated, and the logs show as follows:

[root@localhost log]# cat pd_stderr.log
2024-05-07 17:08:08.140755 W | pkg/fileutil: check file permission: directory "/mnt/filemanage/tidb/tidb-data/pd-2379" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.

My current directory permissions are as follows:

Then I changed the permissions to the following, but no logs were generated anymore, and it got stuck there.
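
For reference, the mode that the warning complains about can be inspected and tightened with a sketch like the one below. `dir_mode` and `tighten_dir` are hypothetical helper names, and `stat -c` assumes GNU coreutils. The log line is only a warning (level `W`), and the thread does not establish that it caused the start failure.

```shell
# Sketch: print a directory's octal permission bits, e.g. "755".
# Assumes GNU coreutils stat (the -c format option).
dir_mode() {
  stat -c '%a' "$1"
}

# Restrict a directory to its owner only (rwx------), as the warning recommends.
tighten_dir() {
  chmod 700 "$1"
}
```

For example, `dir_mode /mnt/filemanage/tidb/tidb-data/pd-2379` prints the current mode of the data directory named in the warning, and `tighten_dir` with the same path changes it to 700.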

| username: Ming | Original post link

I don't think this warning should prevent PD from starting. Check what pd.log shows for that time range.
The "stuck" state you describe probably isn't really stuck, is it? The process keeps running; it just stays in the foreground instead of going into the background. PD should have started, and then these files would be generated.
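
A small sketch for looking at the most recent lines of every log in the PD log directory at once. `tail_logs` is a hypothetical helper and the default of 20 lines is arbitrary. The file names pd.log and pd_stderr.log are the ones mentioned in this thread.

```shell
# Sketch: print the last N lines of every *.log file in a directory.
# tail_logs is a hypothetical helper; the default of 20 lines is arbitrary.
tail_logs() {
  dir="$1"
  n="${2:-20}"
  for f in "$dir"/*.log; do
    [ -f "$f" ] || continue   # skip the literal pattern when nothing matches
    echo "== $f"
    tail -n "$n" "$f"
  done
}
```

For example, `tail_logs /mnt/filemanage/tidb/tidb-deploy/pd-2379/log 50`. Empty output means the directory contains no `*.log` files yet.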

| username: WalterWj | Original post link

As I understand it, once the permissions are correct, the problem should be solved.

| username: 突破边界 | Original post link

It's very strange, but it works again. I don't know whether this is related to my changing the directory permissions to 700. I then changed them back, stopped the cluster, and started it again, and there were no issues. In the end I still don't know the cause, but it works now.