Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tiup启动时报SSH错误 (SSH error reported when starting with tiup)
When using tiup to start the TiDB cluster, an error occurred (there was no error during installation):
Error: failed to start prometheus: failed to start: slave007 prometheus-9090.service, please check the instance's log(/home/tidb/.tiup/tidb-deploy/prometheus-9090/log) for more detail.: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@slave007:22' {ssh_stderr: Failed to start prometheus-9090.service: Unit not found.
, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl start prometheus-9090.service"}, cause: Process exited with status 5
Background: I have set up passwordless SSH for both the tidb user and the root user between the control machine and the TiDB cluster machines. As shown below, I was not asked for a password:
[tidb@slave006 bin]$ ssh tidb@slave007
Last login: Tue Sep 6 10:40:09 2022
[tidb@slave007 ~]$
How can I solve this problem?
If the root account is not set up for passwordless login, does the tidb account need to be given sudo privileges?
Try running sudo systemctl status prometheus-9090.service as the tidb user on slave007 to see if it works.
Should the sudo privileges be added on the control machine (slave006) or on slave007?
[tidb@slave007 root]$ sudo -ll
Matching Defaults entries for tidb on slave007:
!visiblepw, always_set_home, match_group_by_gid, always_query_group_plugin, env_reset, env_keep="COLORS DISPLAY HOSTNAME HISTSIZE KDEDIR LS_COLORS", env_keep+="MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE",
env_keep+="LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES", env_keep+="LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE", env_keep+="LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY",
secure_path=/sbin:/bin:/usr/sbin:/usr/bin
User tidb may run the following commands on slave007:
Sudoers entry:
RunAsUsers: ALL
Options: !authenticate
Commands:
ALL
Sudoers entry:
RunAsUsers: ALL
Options: !authenticate
Commands:
ALL
[tidb@slave007 .tiup]$ sudo systemctl status prometheus-9090.service
Unit prometheus-9090.service could not be found.
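The output above suggests that SSH and sudo both work on slave007 and that the unit file itself is missing: the "Unit not found" text comes from systemctl, not from the SSH layer. A minimal sketch for confirming this on the target host follows. The function name find_unit_file and the default directory list are assumptions for illustration; the thread does not say where the unit file was installed.

```shell
# Print the path of the first matching unit file and return 0,
# or return 1 if the unit file is not present in any directory.
# Usage: find_unit_file UNIT [DIR...]
find_unit_file() {
  unit="$1"
  shift
  # Assumption: fall back to the usual systemd unit directories
  # when the caller does not pass any.
  if [ "$#" -eq 0 ]; then
    set -- /etc/systemd/system /usr/lib/systemd/system /lib/systemd/system
  fi
  for dir in "$@"; do
    if [ -f "$dir/$unit" ]; then
      printf '%s\n' "$dir/$unit"
      return 0
    fi
  done
  return 1
}

# Example call on slave007:
#   find_unit_file prometheus-9090.service || echo "unit file is missing"
```

If the function finds nothing, the problem is a missing service definition rather than an SSH or sudo misconfiguration.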
Additionally, when I installed DM just now, an SSH error was reported for slave007:
Deploy TiDB instance
- Copy dm-master → slave003 … Done
- Copy dm-worker → slave004 … Done
- Copy dm-worker → slave005 … Done
- Copy dm-worker → slave006 … Done
- Copy prometheus → slave007 … Error
- Copy grafana → slave007 … Error
- Copy alertmanager → slave003 … Done
Error: stderr: : executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@slave007:22' {ssh_stderr: , ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin tar --no-same-owner -zxf /tidb-deploy-dm/prometheus-8249/bin/prometheus-v6.2.0-linux-amd64.tar.gz -C /tidb-deploy-dm/prometheus-8249/bin && rm /tidb-deploy-dm/prometheus-8249/bin/prometheus-v6.2.0-linux-amd64.tar.gz}, cause: Run Command Timeout
The deployment failed, so it couldn’t start successfully afterward.
The TiDB deployment did not fail; the cluster had started successfully before.
This deployment failure is related to DM, not TiDB.
The full sequence was as follows. I first deployed TiDB, and it started without issues. I then deployed DM, which also installed and started successfully. However, DM's monitoring component was placed on the same node as TiDB's monitoring component and used the same port. So I uninstalled DM, changed DM's monitoring port, and redeployed it. It seems that because the two monitoring instances overlapped, uninstalling DM also affected TiDB's monitoring component. As a result, TiDB cannot start now (only the monitoring component fails to start).
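The overlap described above can be checked before deploying. The following is a rough sketch, assuming the host:port pairs of each deployment have been copied by hand (for example from the output of tiup cluster display and tiup dm display) into plain text files with one entry per line. The function name find_port_conflicts and the file names are hypothetical.

```shell
# Print every host:port entry that appears in both lists.
# Usage: find_port_conflicts LIST_A LIST_B
# Each list is a text file with one "host:port" entry per line.
find_port_conflicts() {
  # -F: fixed strings, -x: whole-line match, -f: read patterns from LIST_A
  grep -Fxf "$1" "$2" | sort -u
}

# Example with the values mentioned in this thread:
#   tidb.txt contains "slave007:9090" (TiDB's Prometheus)
#   dm.txt   contains "slave007:9090" (DM's first Prometheus port)
#   find_port_conflicts tidb.txt dm.txt    # prints slave007:9090
# After DM's Prometheus port is changed to 8249, nothing is printed.
```

An empty result means the two lists share no host:port pair; it does not prove that the deployments cannot collide in other ways, such as sharing a deploy directory.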
I used the command ./tiup dm destroy dm-test to uninstall DM.
Strange, I just reinstalled and deployed DM again, and miraculously the DM deployment was successful and it started successfully. But TiDB is still reporting an error.
The monitoring component from the first DM deployment disrupted TiDB's monitoring module, which is why that module now fails to start. You need to find the exact reason it cannot start and fix it. Alternatively, refer to this link (tiup install | PingCAP Documentation Center) and try deploying only the monitoring component as a fix.
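One possible reading of "deploy only the monitoring component" is to scale the broken Prometheus instance in and then scale it out again. The sketch below only writes a topology file and prints the commands instead of running them. The function names, the file name, and the cluster name tidb-test are hypothetical, and this is not confirmed by the thread as the fix that was used. Scaling in may also remove the instance's existing monitoring data, so back it up first if it matters.

```shell
# Write a scale-out topology that contains only a Prometheus instance.
# Usage: write_monitoring_topology FILE HOST PORT
write_monitoring_topology() {
  cat > "$1" <<EOF
monitoring_servers:
  - host: $2
    port: $3
EOF
}

# Print (do not execute) the commands for re-creating the instance.
# Usage: print_redeploy_commands CLUSTER HOST PORT TOPOLOGY_FILE
print_redeploy_commands() {
  echo "tiup cluster display $1"
  echo "tiup cluster scale-in $1 --node $2:$3"
  echo "tiup cluster scale-out $1 $4"
}

# Example (cluster name is a placeholder):
#   write_monitoring_topology scale-out.yaml slave007 9090
#   print_redeploy_commands tidb-test slave007 9090 scale-out.yaml
```

Review the printed commands against the current TiUP documentation before running any of them on a real cluster.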
Does the system have permission restrictions, and if so, do I need to enable sudo-related bash permissions?
Did you deploy using the tidb user? It might be a permissions issue.
Judging by the commands he posted, he is using the tidb user.
[tidb@slave006 bin]$ ssh tidb@slave007
Last login: Tue Sep 6 10:40:09 2022
[tidb@slave007 ~]$