TiUP cluster deploy succeeded, but start failed: PD and TiDB cannot start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiup cluster deploy 成功,但是start 报错,pd 和tidb启动不起来

| username: dong

[TiDB Usage Environment] Production Environment
[TiDB Version]
V5.4.0
[Encountered Problem]
tiup cluster deploy succeeded, but tiup cluster start reported an error, and PD and TiDB failed to start. The PD error is as follows (no space left on device):
Failed to write to log, write /data1/tidb-deploy/pd-2379/log/pd.log: no space left on device
[2022/09/14 09:57:42.181 +08:00] [WARN] [retry_interceptor.go:61] ["retrying of unary invoker failed"] [target=endpoint://client-45d8c668-704a-472e-876c-b0168fef1cd8/10.71.130.114:2380] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.71.130.114:2380: connect: connection refused\""]
Failed to write to log, write /data1/tidb-deploy/pd-2379/log/pd.log: no space left on device

[Reproduction Path] Operations performed that led to the problem
Running tiup cluster start tidb-ipps reports this error.
[Problem Phenomenon and Impact]

[Attachments]

| username: Ming | Original post link

Disk space is insufficient, no space left on device.

| username: wisdom | Original post link

Check the node disk
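
A minimal sketch of that disk check, to be run on the PD node. The /data1 mount point is taken from the log line in the first post; the function names are hypothetical helpers, not part of TiUP. Both block usage and inode usage are worth a look, since running out of either produces "no space left on device".

```shell
#!/usr/bin/env bash
# Print the used-capacity percentage (digits only) of the filesystem holding path $1.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print the used-inode percentage for the same filesystem.
# Some filesystems report "-" here, which is printed unchanged.
inode_usage_pct() {
  df -Pi "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# /data1 comes from the error message above; fall back to / so the sketch
# still runs on a machine that has no such mount point.
target=/data1
[ -d "$target" ] || target=/
echo "space used on $target: $(disk_usage_pct "$target")%"
echo "inodes used on $target: $(inode_usage_pct "$target")%"
```

A value at or near 100% for either number would explain the "Failed to write to log" lines.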

| username: dong | Original post link

Sorry, the log above was not pasted correctly. The actual error is "mkdir /data1/tidb-data/pd-2379/member/snap: permission denied". It occurs when running tiup cluster start tidb-ipps.

| username: dong | Original post link

Sorry, the earlier log was the wrong one. The actual error is error="mkdir /data1/tidb-data/pd-2379/member/snap: permission denied", reported when I ran tiup cluster start tidb-ipps. I deployed the cluster as the root user and also started it as the root user.

| username: dong | Original post link

After the deployment, it looks like this. I don’t know what’s wrong. The non-essential components are Up, but the critical ones are Down or N/A.

10.71.130.114:3000 grafana 10.71.130.114 3000 linux/x86_64 Up - /data1/tidb-deploy/grafana-3000
10.71.130.114:2379 pd 10.71.130.114 2379/2380 linux/x86_64 Down /data1/tidb-data/pd-2379 /data1/tidb-deploy/pd-2379
10.71.130.114:9090 prometheus 10.71.130.114 9090/12020 linux/x86_64 Up /data1/tidb-data/prometheus-9090 /data1/tidb-deploy/prometheus-9090
10.71.130.114:4000 tidb 10.71.130.114 4000/10080 linux/x86_64 Down - /data1/tidb-deploy/tidb-4000
10.71.130.114:9000 tiflash 10.71.130.114 9000/8123/3930/20170/20292/8234 linux/x86_64 N/A /data1/tiflash/data,/data2/tiflash/data /data1/tidb-deploy/tiflash-9000
10.71.130.111:20160 tikv 10.71.130.111 20160/20180 linux/x86_64 N/A /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
10.71.130.112:20160 tikv 10.71.130.112 20160/20180 linux/x86_64 N/A /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
10.71.130.113:20160 tikv 10.71.130.113 20160/20180 linux/x86_64 N/A /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
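
For a longer listing, the instances that are not running can be filtered out of the display output. This is only a sketch: `not_up` is a hypothetical helper, and it assumes the column order shown above (ID, Role, Host, Ports, OS/Arch, Status), which may differ in other TiUP versions.

```shell
#!/usr/bin/env bash
# Read cluster display output on stdin and print "role id status" for every
# instance row whose status column does not start with "Up".
# Rows are recognised by an ID of the form host:port in the first column.
not_up() {
  awk '$1 ~ /:[0-9]+$/ && $6 !~ /^Up/ { print $2, $1, $6 }'
}

# Example (on the control machine):
#   tiup cluster display tidb-ipps | not_up
```

Applied to the listing above, this would leave only the pd, tidb, tiflash, and tikv rows.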

| username: Ming | Original post link

Try creating it manually and see what the result is.

| username: dong | Original post link

Creating it manually works; I have already tested that.

| username: alfred | Original post link

It looks like a directory permission issue.
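
One way to narrow this down is to list the owner and mode of every directory on the way to the failing path, on the node where PD runs. The sketch below uses a hypothetical helper, `show_path_perms`; the path in the usage comment is the one from the error message.

```shell
#!/usr/bin/env bash
# Print owner, group and mode for a path and each of its parent directories,
# so that a component the service user cannot write to or traverse stands out.
# Components that do not exist yet are skipped.
show_path_perms() {
  local p=$1
  while :; do
    [ -e "$p" ] && ls -ld "$p"
    # Stop at the filesystem root (or at "." for a relative path).
    if [ "$p" = "/" ] || [ "$p" = "." ]; then
      break
    fi
    p=$(dirname "$p")
  done
}

# Example:
#   show_path_perms /data1/tidb-data/pd-2379/member/snap
```

Note that a successful manual mkdir as root does not prove much if the PD process runs as a different user. Assuming the topology file names a non-root user for the services, the owner shown for /data1/tidb-data/pd-2379 should be that user.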

| username: wuxiangdong | Original post link

It seems to be a disk space issue.

| username: cheng | Original post link

Have you set up SSH mutual trust (passwordless login) between the machines?

| username: dong | Original post link

How should I do it? I couldn’t find any related information. I followed the tutorial and didn’t notice this step.

| username: cheng | Original post link

Some of the environment configurations before installation are mentioned in this document.

| username: cheng | Original post link

I see that your error is related to permissions. Check if the control machine can directly SSH into all the machines and if it can create directories (mkdir).
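
A sketch of that check, to be run from the control machine. `check_host` is a hypothetical helper; the `BatchMode` and `ConnectTimeout` options are standard OpenSSH client options, and `BatchMode=yes` makes ssh fail instead of prompting for a password. The hosts and the /data1 directory in the usage comment are the ones from this thread, and logging in as root is an assumption.

```shell
#!/usr/bin/env bash
# SSH client to use; overridable so the function can be tried without a network.
SSH=${SSH:-ssh}

# Attempt a non-interactive login to host $1, then create and remove a scratch
# directory under $2. Prints "<host>: ok" or "<host>: FAILED".
check_host() {
  local host=$1 dir=$2
  if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" \
       "mkdir -p '$dir/.ssh-check' && rmdir '$dir/.ssh-check'" </dev/null; then
    echo "$host: ok"
  else
    echo "$host: FAILED"
  fi
}

# Example:
#   for h in 10.71.130.111 10.71.130.112 10.71.130.113; do
#     check_host "root@$h" /data1
#   done
```

A FAILED line means either the login needed a password or the directory could not be created; running the same ssh command by hand shows which.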

| username: xiaohetao | Original post link

This means there is no space left.

| username: xiaohetao | Original post link

  1. What are the permissions on the directories before starting?
  2. Which user is specified in the YAML configuration file?
  3. Are passwordless login (the root user logging in without a password to the nodes listed in the configuration file) and mutual trust (the control machine logging in without a password to the other nodes) configured correctly?

| username: dong | Original post link

Yes, the third point was indeed not done. I’m looking into how to do it. Is there any tool that can quickly establish mutual trust?

| username: dong | Original post link

Indeed, when I SSH to another machine, I am prompted for a password, so I probably need to set up passwordless authentication. Are there any tools that can do this?

| username: Ming | Original post link

Here is the configuration for mutual trust.
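
The linked configuration is not preserved in this thread. As a generic illustration with standard OpenSSH tools (not necessarily the steps in that document), key-based login from the control machine can be set up with `ssh-keygen` and `ssh-copy-id`. The sketch below only prints the commands so they can be reviewed first; `print_trust_commands` is a hypothetical helper, and the key path and the root user are assumptions.

```shell
#!/usr/bin/env bash
# Print the commands that would set up key-based login from this machine to
# each target host. Nothing is executed; review the output, then run it by hand.
# $1 = remote user, remaining arguments = hosts.
print_trust_commands() {
  local user=$1; shift
  local key="$HOME/.ssh/id_rsa"   # assumed key location
  # Suggest generating a key pair only if one does not exist yet.
  [ -f "$key" ] || echo "ssh-keygen -t rsa -b 4096 -N '' -f $key"
  local h
  for h in "$@"; do
    # ssh-copy-id asks for the remote password once and installs the public key.
    echo "ssh-copy-id -i $key.pub $user@$h"
  done
}

# Example:
#   print_trust_commands root 10.71.130.111 10.71.130.112 10.71.130.113
```

After running the printed commands, `ssh root@10.71.130.111` from the control machine should log in without a password prompt.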

| username: dong | Original post link

Well, this is what I am looking at. I don’t quite understand the third step, and I am using tiup.

The document says this is set up automatically, so it is strange that it doesn’t work. I am using control machine A, with TiKV configured on B, C, and D. B can be reached, but C and D cannot, which is very strange.