Grafana and Related Components Down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: grafana及相关组件Down

| username: TiDBer_小阿飞

[TiDB Usage Environment] Testing Environment
[TiDB Version] V6.5.2
[Reproduction Path] TiDB's alertmanager, grafana, and prometheus components were installed on the master node and running normally; their web pages worked correctly. Today, while installing DM on the same master node, I mistakenly configured the IP:PORT of DM's alertmanager, grafana, and prometheus components to be identical to TiDB's. After the DM installation finished, I noticed the mistake; at that point the TiDB components still showed a normal status. I then ran tiup dm destroy dm to remove DM. When I rechecked the TiDB components, they were down.
[Encountered Problem: Symptoms and Impact] After the components went down, I tried tiup cluster start tidb -R grafana and tiup cluster start tidb -N <ip:port>, but neither could start them, failing with the following error:
Error: Failed to start alertmanager: failed to start: 21.72.124.43 alertmanager-9093.service, please check the instance's log (/tidb-deploy/alertmanager-9093/log) for more details: executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@21.72.124.43:22' {ssh_stderr: Failed to start alertmanager-9093.service: Unit not found., ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl start alertmanager-9093.service"}, cause: Process exited with status 5
There are business operations and data running, so I cannot delete and rebuild the TiDB cluster, nor can I stop other TiDB modules.
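The "Unit not found" message suggests the systemd unit files for the shared monitoring ports were removed when the overlapping DM cluster was destroyed. A minimal sketch for confirming that on the affected host, assuming the default ports 9093/3000/9090 and a systemd-based deployment:

```bash
# Check whether the monitoring unit files still exist on the host
systemctl list-unit-files | grep -E 'alertmanager|grafana|prometheus'

# The units tiup deploys normally live here; empty output means they are gone
ls -l /etc/systemd/system/ | grep -E 'alertmanager-9093|grafana-3000|prometheus-9090'
```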

| username: zhanggame1 | Original post link

Scale in the alertmanager/grafana/prometheus components and then scale them back out; there is no need to rebuild the cluster.
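A minimal sketch of that procedure, assuming the TiDB cluster is named tidb, the monitoring host is 21.72.124.43, and the default ports (alertmanager 9093, grafana 3000, prometheus 9090) are in use; adjust the names, IPs, and ports to match your own topology:

```bash
# Check current status and note the IDs (<ip>:<port>) of the broken nodes
tiup cluster display tidb

# Remove the broken monitoring components only
tiup cluster scale-in tidb -N 21.72.124.43:9093,21.72.124.43:3000,21.72.124.43:9090

# Describe the components to re-add in a small scale-out topology file
cat > scale-out-monitor.yaml <<'EOF'
monitoring_servers:
  - host: 21.72.124.43
grafana_servers:
  - host: 21.72.124.43
alertmanager_servers:
  - host: 21.72.124.43
EOF

# Re-deploy them on the same host; add -p to be prompted for the SSH
# password if mutual trust to the node is missing
tiup cluster scale-out tidb scale-out-monitor.yaml

tiup cluster display tidb
```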

| username: TiDBer_小阿飞 | Original post link

After scaling in and then scaling out again, there was an annoying surprise: the scale-out failed with an error about SSH mutual trust. I checked via SSH, and sure enough, the mutual trust configured earlier was gone. I had to reconfigure mutual trust for each node one by one! It may have been broken when DM was configured earlier.
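For reference, a sketch of re-establishing mutual trust (passwordless SSH) from the control machine, assuming the deploy user is tidb and the node IP is 21.72.124.43; repeat the copy step for each node:

```bash
# Generate a key if one does not already exist (skip otherwise)
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa

# Copy the public key to the target node (prompts for the password once)
ssh-copy-id -i ~/.ssh/id_rsa.pub tidb@21.72.124.43

# Verify that passwordless login now works
ssh tidb@21.72.124.43 'hostname'
```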

| username: zhanggame1 | Original post link

Configuring DM should not automatically remove mutual trust, right?

| username: redgame | Original post link

It will not delete mutual trust.

| username: TiDBer_小阿飞 | Original post link

Right, it won't delete mutual trust, but there was a problem with the command I used when deploying DM: the deploy command included -uroot -ptidb, with the password passed on the command line. That might be the reason, who knows.
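For comparison, a hedged sketch of the more usual deploy form, assuming a DM cluster named dm, version v6.5.2, and a topology file dm-topology.yaml (all illustrative names): let tiup prompt for the password instead of embedding it, or reuse the existing key.

```bash
# Prompt interactively for the SSH password of the connecting user
tiup dm deploy dm v6.5.2 dm-topology.yaml --user root -p

# Or, if mutual trust is already set up, authenticate with the private key
tiup dm deploy dm v6.5.2 dm-topology.yaml --user tidb -i ~/.ssh/id_rsa
```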

| username: zhanggame1 | Original post link

A password in the deploy command shouldn't cause any problem, so that's probably not the issue.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.