Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tiup cluster deploy succeeds, but start reports an error; pd and tidb fail to start
[TiDB Usage Environment] Production Environment
[TiDB Version]
V5.4.0
[Encountered Problem]
tiup cluster deploy succeeded, but tiup cluster start reported an error, and PD and TiDB failed to start. The PD log shows "no space left on device":
Failed to write to log, write /data1/tidb-deploy/pd-2379/log/pd.log: no space left on device
[2022/09/14 09:57:42.181 +08:00] [WARN] [retry_interceptor.go:61] ["retrying of unary invoker failed"] [target=endpoint://client-45d8c668-704a-472e-876c-b0168fef1cd8/10.71.130.114:2380] [attempt=0] [error="rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.71.130.114:2380: connect: connection refused""]
Failed to write to log, write /data1/tidb-deploy/pd-2379/log/pd.log: no space left on device
[Reproduction Path] Operations performed that led to the problem
Running tiup cluster start tidb-ipps reports this error.
[Problem Phenomenon and Impact]
[Attachments]
Disk space is insufficient, no space left on device.
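To confirm this on the PD host, a minimal sketch along these lines may help. It assumes the deploy directory lives under /data1, as in the error message; the helper names report_free and free_kb are made up for illustration. It looks at inodes as well as blocks, since running out of either one produces "no space left on device".

```shell
# Show free blocks and free inodes for the filesystem that holds a directory.
# Both helper names are hypothetical, not part of TiUP.
report_free() {
    df -h "$1"    # free space in human-readable units
    df -i "$1"    # free inodes; exhausting these gives the same error
}

# Print the available space for a directory as a bare number of kilobytes.
free_kb() {
    # -P forces one line per filesystem, so the value is always field 4 of line 2
    df -Pk "$1" | awk 'NR==2 {print $4}'
}
```

For example, run report_free /data1 on 10.71.130.114. If free_kb /data1 prints a value close to 0, something under /data1 has to be cleaned up or moved before the cluster is started again.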
Hmm, the log wasn’t pasted correctly. The actual error is “mkdir /data1/tidb-data/pd-2379/member/snap: permission denied”. It occurs when running tiup cluster start tidb-ipps.
Hmm, the log above is incorrect. The actual error is error="mkdir /data1/tidb-data/pd-2379/member/snap: permission denied", reported when I ran tiup cluster start tidb-ipps. I deployed the cluster as the root user and also started it as the root user.
After the deployment, the cluster looks like this. I don’t know what’s wrong. The non-essential components are Up, but the critical ones are Down or N/A.
ID                   Role        Host           Ports                            OS/Arch       Status  Data Dir                                 Deploy Dir
10.71.130.114:3000   grafana     10.71.130.114  3000                             linux/x86_64  Up      -                                        /data1/tidb-deploy/grafana-3000
10.71.130.114:2379   pd          10.71.130.114  2379/2380                        linux/x86_64  Down    /data1/tidb-data/pd-2379                 /data1/tidb-deploy/pd-2379
10.71.130.114:9090   prometheus  10.71.130.114  9090/12020                       linux/x86_64  Up      /data1/tidb-data/prometheus-9090         /data1/tidb-deploy/prometheus-9090
10.71.130.114:4000   tidb        10.71.130.114  4000/10080                       linux/x86_64  Down    -                                        /data1/tidb-deploy/tidb-4000
10.71.130.114:9000   tiflash     10.71.130.114  9000/8123/3930/20170/20292/8234  linux/x86_64  N/A     /data1/tiflash/data,/data2/tiflash/data  /data1/tidb-deploy/tiflash-9000
10.71.130.111:20160  tikv        10.71.130.111  20160/20180                      linux/x86_64  N/A     /data1/tidb-data/tikv-20160              /data1/tidb-deploy/tikv-20160
10.71.130.112:20160  tikv        10.71.130.112  20160/20180                      linux/x86_64  N/A     /data1/tidb-data/tikv-20160              /data1/tidb-deploy/tikv-20160
10.71.130.113:20160  tikv        10.71.130.113  20160/20180                      linux/x86_64  N/A     /data1/tidb-data/tikv-20160              /data1/tidb-deploy/tikv-20160
Try creating it manually and see what the result is.
Manually creating the file is possible; it has already been tested.
It looks like a directory permission issue.
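One way to narrow this down is to look at the owner and mode of every directory on the path that mkdir failed on. The sketch below is only an illustration, and show_path_perms is a made-up helper name. Keep in mind that TiUP normally runs the services as the user set under global.user in the topology file (tidb by default, as far as I know), not as the user who ran tiup, so a directory that root can write to may still be unwritable for the service user.

```shell
# Print owner and mode for a directory and for each of its parents, up to /.
# show_path_perms is a hypothetical helper, not a TiUP command.
show_path_perms() {
    local p="$1"
    # Only absolute paths are accepted, so the walk always ends at /
    case "$p" in
        /*) ;;
        *) echo "need an absolute path" >&2; return 1 ;;
    esac
    while :; do
        ls -ld "$p"              # prints an error and carries on if $p is missing
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
}
```

On the PD host this could be run as show_path_perms /data1/tidb-data/pd-2379/member. Any component of the path that the service user cannot enter or write to would explain the "permission denied" from mkdir.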
It seems to be a disk space issue.
Have you set up mutual trust between the machines?
How should I do it? I couldn’t find any related information. I followed the tutorial and didn’t notice this step.
Some of the environment configurations before installation are mentioned in this document.
I see that your error is related to permissions. Check if the control machine can directly SSH into all the machines and if it can create directories (mkdir).
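A rough way to run that check from the control machine is sketched below. It is not a TiUP feature: check_hosts and the probe directory name are made up, and the user and base directory are assumptions that should match your topology. BatchMode=yes makes ssh fail instead of prompting, so a host that still asks for a password shows up as FAIL.

```shell
# check_hosts USER DIR HOST...
# For each host, try a password-less SSH login and create, then remove,
# a scratch directory under DIR. Prints OK or FAIL per host.
check_hosts() {
    local user="$1" dir="$2"
    shift 2
    local host rc=0
    local probe=".ssh_probe_$$"    # hypothetical scratch directory name
    for host in "$@"; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" \
               "mkdir -p '$dir/$probe' && rmdir '$dir/$probe'" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}
```

With the addresses from this thread, that would be check_hosts root /data1 10.71.130.111 10.71.130.112 10.71.130.113 10.71.130.114. A FAIL line means either the login or the mkdir did not work on that host; running the same ssh command by hand without the output redirection shows which.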
This means there is no space left.
Yes, the third point was indeed not done. I’m looking into how to do it. Is there any tool that can quickly establish mutual trust?
Indeed, when SSHing to another machine, you need to enter a password. You might need to set up passwordless authentication. Are there any tools that can do this?
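The standard OpenSSH tools ssh-keygen and ssh-copy-id are usually enough for this. Below is a minimal sketch, assuming an RSA key at ~/.ssh/id_rsa on the control machine and the same login user on every host; setup_trust is a made-up helper name. ssh-copy-id asks for each host's password once, and later logins should then work without one.

```shell
# setup_trust USER HOST...
# Create a key pair on the control machine if none exists, then install
# the public key on every host. setup_trust is a hypothetical helper.
setup_trust() {
    local user="$1"
    shift
    local key="$HOME/.ssh/id_rsa"    # assumed key location
    local host rc=0
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
    # Generate a key without a passphrase only when there is no key yet
    [ -f "$key" ] || ssh-keygen -t rsa -b 4096 -N "" -f "$key"
    for host in "$@"; do
        ssh-copy-id -i "$key.pub" "$user@$host" || rc=1
    done
    return $rc
}
```

For the hosts in this thread: setup_trust root 10.71.130.111 10.71.130.112 10.71.130.113 10.71.130.114. As far as I understand, tiup cluster deploy with --user root -p is meant to handle the key setup for the service user by itself, so this manual step mainly helps when that did not work for some hosts.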
Here is the configuration for mutual trust.
Well, this is what I am looking at. I don’t quite understand the third step, and I am using tiup.
The document says this is set up automatically, so it is strange that it doesn’t work. I am using control machine A, with TiKV configured on B, C, and D. A can connect to B, but not to C or D, which is very strange.