Prometheus, Grafana, and Alertmanager cannot be started through TiUP, but can be started manually

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: prometheus,grafana ,alertmanager都无法通过tiup启动,但是可以手动启动

| username: tony5413

[TiDB Usage Environment] Test
[TiDB Version] 6.5.8
[Encountered Issue: Problem Phenomenon and Impact]
[tidb@tidb-server system]$ tiup cluster start tidb-test -N 192.168.116.110:3000

A new version of cluster is available: v1.15.0 → v1.15.2

To update this component:   tiup update cluster
To update all components:   tiup update --all

Starting cluster tidb-test…

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.110
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.111
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.110
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.110
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.110
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.112
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.110
  • [Parallel] - UserSSH: user=tidb, host=192.168.116.113
  • [ Serial ] - StartCluster
    Starting component grafana
    Starting instance 192.168.116.110:3000
    Failed to start grafana-3000.service: Unit not found.

Error: failed to start grafana: failed to start: 192.168.116.110 grafana-3000.service, please check the instance's log(/tidb-deploy/grafana-3000/log) for more detail.: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.
, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl start grafana-3000.service"}, cause: Process exited with status 5

Verbose debug logs have been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2024-06-20-11-29-05.log.
[Attachments: Screenshots/Logs/Monitoring]
2024-06-20T11:29:05.479+0800 INFO Starting instance 192.168.116.110:3000
2024-06-20T11:29:05.590+0800 ERROR SSHCommand {"host": "192.168.116.110", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"", "error": "Process exited with status 5", "stdout": "", "stderr": "Failed to start grafana-3000.service: Unit not found.\n"}
2024-06-20T11:29:05.590+0800 ERROR CheckPoint {"host": "192.168.116.110", "port": 22, "user": "tidb", "sudo": true, "cmd": "systemctl daemon-reload && systemctl start grafana-3000.service", "stdout": "", "stderr": "Failed to start grafana-3000.service: Unit not found.\n", "error": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"}, cause: Process exited with status 5", "errorVerbose": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"}, cause: Process exited with status 5\n at github.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:174\n at github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:86\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/module/systemd.go:106\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctl()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:424\n at github.com/pingcap/tiup/pkg/cluster/operation.startInstance()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:400\n at github.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:535\n at golang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\n at runtime.goexit()\n\truntime/asm_amd64.s:1650", "hash": "48f15f405450faf7d57136e629285724a0713cde", "func": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}
2024-06-20T11:29:05.590+0800 ERROR Failed to start grafana-3000.service: Unit not found.

2024-06-20T11:29:05.590+0800 DEBUG TaskFinish {"task": "StartCluster", "error": "failed to start grafana: failed to start: 192.168.116.110 grafana-3000.service, please check the instance's log(/tidb-deploy/grafana-3000/log) for more detail.: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"}, cause: Process exited with status 5", "errorVerbose": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"}, cause: Process exited with status 5\n at github.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:174\n at github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:86\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/module/systemd.go:106\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctl()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:424\n at github.com/pingcap/tiup/pkg/cluster/operation.startInstance()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:400\n at github.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:535\n at golang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\n at runtime.goexit()\n\truntime/asm_amd64.s:1650\nfailed to start: 192.168.116.110 grafana-3000.service, please check the instance's log(/tidb-deploy/grafana-3000/log) for more detail.\ngithub.com/pingcap/tiup/pkg/cluster/operation.toFailedActionError\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:645\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:401\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:535\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start grafana"}
2024-06-20T11:29:05.591+0800 INFO Execute command finished {"code": 1, "error": "failed to start grafana: failed to start: 192.168.116.110 grafana-3000.service, please check the instance's log(/tidb-deploy/grafana-3000/log) for more detail.: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"}, cause: Process exited with status 5", "errorVerbose": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@192.168.116.110:22' {ssh_stderr: Failed to start grafana-3000.service: Unit not found.\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin; /usr/bin/sudo -H bash -c \"systemctl daemon-reload && systemctl start grafana-3000.service\"}, cause: Process exited with status 5\n at github.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:174\n at github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:86\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/module/systemd.go:106\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctl()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:424\n at github.com/pingcap/tiup/pkg/cluster/operation.startInstance()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:400\n at github.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:535\n at golang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\n at runtime.goexit()\n\truntime/asm_amd64.s:1650\nfailed to start: 192.168.116.110 grafana-3000.service, please check the instance's log(/tidb-deploy/grafana-3000/log) for more detail.\ngithub.com/pingcap/tiup/pkg/cluster/operation.toFailedActionError\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:645\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:401\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:535\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start grafana"}

| username: zhaokede

The error message “Failed to start grafana-3000.service: Unit not found.” indicates that systemd cannot find a unit file named grafana-3000.service.
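
One way to confirm this on the monitoring host is to check whether the expected unit files exist. The sketch below is only an illustration: `check_units` is a hypothetical helper name (not part of TiUP), and it assumes the unit files are expected in a directory such as /etc/systemd/system.

```shell
# Hypothetical helper (not part of TiUP): report which unit files exist in a directory.
# Usage: check_units <unit_dir> <unit>...
check_units() {
  local unit_dir="$1" unit missing=0
  shift
  for unit in "$@"; do
    if [ -f "$unit_dir/$unit" ]; then
      echo "found: $unit"
    else
      echo "missing: $unit"
      missing=1
    fi
  done
  # Non-zero exit status if at least one unit file is absent.
  return "$missing"
}
```

For example, `check_units /etc/systemd/system grafana-3000.service prometheus-9090.service alertmanager-9093.service` would list each unit as found or missing. For a single unit, `systemctl cat grafana-3000.service` is a quick alternative, since it fails when systemd has no unit file by that name.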

| username: forever

Can it be started on the corresponding server?

| username: tony5413

No, it fails there as well:
[root@tidb-server ~]# systemctl daemon-reload && systemctl start grafana-3000.service
Failed to start grafana-3000.service: Unit not found.

| username: forever

Was the cluster just deployed?

| username: 这里介绍不了我

Judging by the error, there seems to be a problem with your systemd unit file. If nothing else works, you can back up the data files, scale in the current node, scale it out again, and then overwrite the directory of the newly scaled-out node with the backed-up data files.
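
The approach of removing the node and adding it back could be sketched roughly as follows. The function only prints the commands rather than running them. The name `print_rebuild_plan`, the data path, the backup location, and the `scale-out.yml` topology file are all assumptions for illustration; the cluster name and node address are taken from the original post.

```shell
# Hypothetical dry-run helper: prints the steps, does not execute anything.
# Usage: print_rebuild_plan <cluster> <host:port> <data_dir> <backup_dir>
print_rebuild_plan() {
  local cluster="$1" node="$2" data_dir="$3" backup_dir="$4"
  # 1. Back up the data files before removing the node.
  echo "cp -a $data_dir $backup_dir"
  # 2. Remove the broken instance from the cluster.
  echo "tiup cluster scale-in $cluster -N $node"
  # 3. Add it back; scale-out.yml is an assumed topology file describing the node.
  echo "tiup cluster scale-out $cluster scale-out.yml"
  # 4. Restore the backed-up files over the freshly deployed directory.
  echo "cp -a $backup_dir/. $data_dir/"
}

# Assumed paths; check the instance's actual configuration before running anything.
print_rebuild_plan tidb-test 192.168.116.110:3000 /tidb-deploy/grafana-3000/data /tmp/grafana-3000-data.bak
```

Printing the plan first makes it easier to review the paths before touching the cluster; the restore step would normally be done while the instance is stopped.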

| username: tony5413

Solved. For some reason the systemd unit files for Prometheus, Grafana, and Alertmanager were missing. After I added the files back to /etc/systemd/system and set their permissions, the services could be started.

grafana-3000.service

[Unit]
Description=grafana service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/bin/bash -c '/tidb-deploy/grafana-3000/scripts/run_grafana.sh'
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target

prometheus-9090.service

[Unit]
Description=prometheus service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/bin/bash -c '/tidb-deploy/prometheus-9090/scripts/run_prometheus.sh'
ExecReload=/bin/bash -c 'kill -HUP $MAINPID $(pidof /tidb-deploy/prometheus-9090/bin/ng-monitoring-server)'
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target

alertmanager-9093.service

[Unit]
Description=alertmanager service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/bin/bash -c '/tidb-deploy/alertmanager-9093/scripts/run_alertmanager.sh'
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target
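
For reference, recreating one of these files could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands used here: `write_unit` is a hypothetical helper, the unit body mirrors the Grafana and Alertmanager files above (the Prometheus unit additionally needs its ExecReload line), and mode 644 is an assumed, typical permission for unit files, since the post does not say which permissions were set.

```shell
# Hypothetical helper: write <component>-<port>.service into <unit_dir>.
# Usage: write_unit <component> <port> <deploy_root> <unit_dir>
write_unit() {
  local component="$1" port="$2" deploy_root="$3" unit_dir="$4"
  local unit_file="$unit_dir/$component-$port.service"
  cat > "$unit_file" <<EOF
[Unit]
Description=$component service
After=syslog.target network.target remote-fs.target nss-lookup.target

[Service]
LimitNOFILE=1000000
LimitSTACK=10485760
User=tidb
ExecStart=/bin/bash -c '$deploy_root/$component-$port/scripts/run_$component.sh'
Restart=always
RestartSec=15s

[Install]
WantedBy=multi-user.target
EOF
  # Assumed permission: readable by everyone, writable by the owner only.
  chmod 644 "$unit_file"
  echo "$unit_file"
}

# On the monitoring host this would be run as root, followed by a reload, e.g.:
#   write_unit grafana 3000 /tidb-deploy /etc/systemd/system
#   systemctl daemon-reload
# Then, from the TiUP control machine:
#   tiup cluster start tidb-test -N 192.168.116.110:3000
```

Writing into a scratch directory first (for example `write_unit grafana 3000 /tidb-deploy /tmp`) allows the generated file to be compared against a unit file from a healthy node before it is copied into place.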

| username: forever

Glad it is resolved. It is worth recalling which operations were performed and why the unit files went missing. I have seen system incompatibility cause problems during deployment.