TiDB Monitoring Restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-监控重启

| username: 烂番薯0

Previously, I used
tiup cluster stop tidb-test -R alertmanager prometheus grafana to stop these three roles.
Now I want to start these three roles, but I found that I couldn’t start them and got an error.

Error: failed to start alertmanager: failed to start: 10.237.103.68 alertmanager-9093.service, please check the instance's log(/opt/tidb-deploy/alertmanager-9093/log) for more detail.: timed out waiting for port 9093 to be started after 2m0s

Can any experts tell me what the situation is?
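A typical first diagnostic pass on the affected host might look like the following. This is only a sketch: the host IP, systemd unit name, and log path are taken from the error message above and may differ in your deployment.

```shell
# Log in to the alertmanager host named in the error:
ssh 10.237.103.68

# Is the systemd unit present, and what state is it in?
systemctl status alertmanager-9093.service

# Recent unit-level errors (binary missing, bad config, etc.):
journalctl -u alertmanager-9093.service --since "10 min ago"

# The component log directory referenced by the error, if it exists:
ls -l /opt/tidb-deploy/alertmanager-9093/log/
tail -n 50 /opt/tidb-deploy/alertmanager-9093/log/alertmanager.log
```

If the log directory itself is missing, as reported later in this thread, the unit status and journal output are usually the only evidence left on the host.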

| username: ti-tiger | Original post link

From the error message it looks like a timeout waiting for the port to come up; you still need to check the logs.

| username: ffeenn | Original post link

As mentioned above, please package and upload the logs under /opt/tidb-deploy/alertmanager-9093/log.

| username: tidb菜鸟一只 | Original post link

Check if the alertmanager-9093.service is still running.

| username: 烂番薯0 | Original post link

I checked the path below. There is no such directory.

| username: 烂番薯0 | Original post link

Not there. This service is also down, and the directory doesn't exist at that path.

| username: 烂番薯0 | Original post link

There is no such directory.

| username: tidb菜鸟一只 | Original post link

Check the output of tiup cluster display tidb-test and tiup cluster edit-config tidb-test. You don't have these directories at all? Is the component actually deployed on this host?
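For reference, the two commands mentioned above would be used like this (a sketch; the cluster name comes from the thread):

```shell
# Shows each instance with its status, deploy dir and data dir:
tiup cluster display tidb-test

# Opens the cluster topology in an editor; check deploy_dir and
# data_dir under monitoring_servers, grafana_servers and
# alertmanager_servers:
tiup cluster edit-config tidb-test
```

Comparing the deploy_dir in the topology against what actually exists on the host is the quickest way to spot a mismatch.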

| username: xingzhenxiang | Original post link

Starting the cluster brings up all components of the TiDB cluster in the order PD → TiKV → Pump → TiDB → TiFlash → Drainer → TiCDC → Prometheus → Grafana → Alertmanager. Did you start them in this order?

| username: 烂番薯0 | Original post link

Yes, that’s right.

| username: 烂番薯0 | Original post link

Yes, it was started in this order, but I don't know why the directory is gone after starting, and then it times out. When I go to check the logs, the log directory doesn't exist.

| username: tomsence | Original post link

Did you pass the -R parameter for alertmanager, prometheus, and grafana when starting? The -R parameter also takes effect on start.
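For reference, tiup's -R flag takes a comma-separated list of roles, so starting just the three monitoring roles would look like this (a sketch using the cluster name from the thread):

```shell
# Start only the monitoring roles; -R takes a comma-separated
# list of role names:
tiup cluster start tidb-test -R prometheus,grafana,alertmanager
```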

| username: Tank001 | Original post link

Is it resolved now? What was the issue that caused it in the end?

| username: 烂番薯0 | Original post link

Still no luck; it won't start. Restarting the whole cluster didn't work either, so I finally gave up.

| username: srstack | Original post link

It looks like the alertmanager component is missing. You can try scale-in with -N 10.237.103.68:9093 --force to forcibly remove it, and then scale it out again.
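The force-remove-and-re-add sequence suggested above could be sketched as follows. The scale-out topology shown is an assumed minimal example; a real one may need deploy_dir, data_dir, or port settings to match the original deployment.

```shell
# Forcibly remove the broken alertmanager instance from the cluster:
tiup cluster scale-in tidb-test -N 10.237.103.68:9093 --force

# Assumed minimal topology file for re-adding it:
cat > scale-out.yaml <<'EOF'
alertmanager_servers:
  - host: 10.237.103.68
EOF

# Deploy a fresh alertmanager on the same host:
tiup cluster scale-out tidb-test scale-out.yaml
```

Since --force skips the normal offline checks, it should only be used when the instance is already unrecoverable, as appears to be the case here.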

| username: db_user | Original post link

Check whether something is still listening on port 9093 on that machine. The port is often taken by services started at the system level, such as node_exporter; shutting down the conflicting service should resolve the issue.
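A quick way to check for a conflicting listener, using standard Linux tools on the alertmanager host:

```shell
# List TCP listeners on port 9093; -p shows the owning process
# (requires root to display processes owned by other users):
ss -tlnp | grep ':9093'

# Alternatively, lsof shows any process bound to the port:
lsof -i :9093
```

No output from either command means nothing is holding the port, which matches what the original poster reports next.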

| username: 烂番薯0 | Original post link

Port 9093 is not open, nor is it occupied.

| username: db_user | Original post link

Does this directory exist? If not, it feels like the wrong configuration file is being used. Are there multiple tiup installations under different users?

| username: 烂番薯0 | Original post link

This directory exists, but there is no Prometheus directory inside. It was there during the initial deployment, but it disappeared after shutting down.

| username: db_user | Original post link

Stopping completely removing the directory is a first for me. Can you still see the Prometheus entries in tiup cluster display? If not, someone may have mistakenly scaled it in; try scaling it back out. If you can see them, check whether there are any backups of the Prometheus directory. If there are, copy one back over and then start it up again.