Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tidb-监控重启 (tidb - monitoring restart)
Previously, I ran
tiup cluster stop tidb-test -R alertmanager,prometheus,grafana
to stop these three roles.
Now I want to start these three roles again, but they fail to start with this error:
Error: failed to start alertmanager: failed to start: 10.237.103.68 alertmanager-9093.service, please check the instance's log(/opt/tidb-deploy/alertmanager-9093/log) for more detail.: timed out waiting for port 9093 to be started after 2m0s
Can any experts tell me what the situation is?
On its face it looks like a connection timeout; you still need to check the logs.
As mentioned above, please package and upload the logs under /opt/tidb-deploy/alertmanager-9093/log.
Check if the alertmanager-9093.service is still running.
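For example, assuming the node runs systemd (which is what the unit name in the error message suggests):
systemctl status alertmanager-9093.service    # is the unit loaded and active?
journalctl -u alertmanager-9093.service -n 50    # last 50 lines of its journal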
I checked that path; there is no such directory.
It's not there. The service is also down, and the directory at that path does not exist.
There is no such directory.
Check tiup cluster display tidb-test and tiup cluster edit-config tidb-test. You don't have these directories at all? Is the component actually deployed on this host?
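For example, display should still list the node, and the Deploy Dir column should show where it was deployed; the output line below is illustrative, not from this cluster:
tiup cluster display tidb-test
# 10.237.103.68:9093  alertmanager  10.237.103.68  9093/9094  Down  -  /opt/tidb-deploy/alertmanager-9093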
The start-cluster operation starts all components of the TiDB cluster in the order PD → TiKV → Pump → TiDB → TiFlash → Drainer → TiCDC → Prometheus → Grafana → Alertmanager. Was it started in this order?
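That is the order a plain start runs through automatically:
tiup cluster start tidb-test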
Yes, it was started in that order, but I don't know why there is no directory after starting, and then it times out. When I go to check the logs, the directory doesn't exist.
Did you pass the -R parameter for alertmanager, prometheus, and grafana when starting? The -R parameter works with start as well.
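That is, mirroring the stop command from earlier:
tiup cluster start tidb-test -R alertmanager,prometheus,grafana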
Is it working now? What turned out to be the cause?
Still no luck, it won’t start up. Restarting the cluster didn’t work either, so I finally gave up.
It looks like the alertmanager deployment is missing. You can try scale-in with -N 10.237.103.68:9093 --force to forcibly remove it, and then scale it out again.
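A sketch of that, assuming the cluster name from this thread and a topology file you write yourself (scale-out.yaml is just an assumed name):
tiup cluster scale-in tidb-test -N 10.237.103.68:9093 --force
# scale-out.yaml: redeclare the node, for example:
# alertmanager_servers:
#   - host: 10.237.103.68
#     web_port: 9093
tiup cluster scale-out tidb-test scale-out.yaml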
Check whether there is still a listener on port 9093 on that machine. It is sometimes occupied by system-started services such as node_exporter; stopping the conflicting service should resolve the issue.
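For example:
ss -lntp | grep 9093    # any process still bound to 9093?
# or, on older systems without ss:
netstat -lntp | grep 9093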
Port 9093 is not open, nor is it occupied.
Does that directory exist? If not, it feels like the wrong configuration file is being picked up. Do different users on the machine have their own tiup installations?
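One way to check, since each user keeps its own tiup home under ~/.tiup (the glob below assumes home directories live under /home):
ls -d /root/.tiup /home/*/.tiup 2>/dev/null    # per-user tiup homes
tiup --version    # version seen by the current user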
This directory exists, but there is no Prometheus directory. It was there during the initial deployment, but it disappeared after shutting down.
Stopping that completely removes the directory? This is the first time I've seen that. Can you still see the Prometheus entries in tiup cluster display? If not, someone may have mistakenly scaled it in; try scaling it back out. If you can still see them, check whether there is a backup of the Prometheus directory; if there is, copy it back over and then start it up again.
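A quick check, plus the restore step under the assumption that a backup exists (the backup path and the default-port directory name below are assumptions based on the deploy path seen earlier in the thread):
tiup cluster display tidb-test | grep -E 'prometheus|alertmanager'
cp -a /backup/prometheus-9090 /opt/tidb-deploy/prometheus-9090    # hypothetical backup location
tiup cluster start tidb-test -R prometheus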