Prometheus Monitoring Error After Cluster Upgrade

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群升级之后Prometheus监控报错

| username: realcp1018

[TiDB Usage Environment] Public Test Environment
[TiDB Version] Upgraded from v5.2.3 to v7.1.0
[Encountered Problem: Phenomenon and Impact]
After upgrading two test clusters to v7.1.0, the Prometheus monitoring for one of them can no longer be viewed (the other is fine); Grafana shows an error about the datasource (screenshot in the original post, not included in this translation).

I didn’t check the monitoring before the upgrade, so I’m not sure whether it has always been like this; I’ve never run into this issue before. Any suggestions? Thanks!
Additionally: after scaling Prometheus and Grafana in and then back out again (i.e., removing and redeploying them), the issue persists.

| username: DBRE | Original post link

  1. Confirm whether the name of the datasource under grafana -> Configuration -> Data sources is publicpublictest-cluster.
  2. In grafana -> Configuration -> Data sources, click Save & Test and check whether it passes.
  3. Visit ${prometheus_ip}:${prometheus_port} in a browser to check whether Prometheus itself is reachable (a command-line version of these checks is sketched below).
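
A command-line version of these checks, as a minimal sketch (the IP address, ports, and Grafana credentials below are placeholders, not values from this thread; substitute your own):

    # Check 3: is Prometheus itself reachable and healthy?
    curl -fsS http://192.168.1.10:9090/-/healthy          # expect HTTP 200 and a "Healthy" message
    # Check 1: which datasource names does Grafana actually have configured? (Grafana HTTP API)
    curl -fsS -u admin:admin http://192.168.1.10:3000/api/datasources
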
| username: zhanggame1 | Original post link

There is an issue with the data source.

| username: realcp1018 | Original post link

I’ve checked the data source; its name indeed differs from the one in the error message, which carries an extra “public” prefix. Save & Test also works fine.

| username: realcp1018 | Original post link

The IP and port of the data source were checked earlier and are fine; I just opened it directly in a browser and it works. However, I noticed that the data source shown under the dashboard’s Annotations is wrong. When I clicked into it there was no data source listed, and I couldn’t modify it because:

This dashboard cannot be saved from Grafana's UI since it has been provisioned from another source. Copy the JSON or save it to a file below. Then you can update your dashboard in corresponding provisioning source.
See documentation for more information about provisioning.

I checked the actual configuration files but haven’t found the one containing the erroneous prefix. I plan to follow the suggestion in this thread: Grafana保存报错 - TiDB 的问答社区 (“Grafana save error” on the TiDB Q&A forum) and make some modifications.

| username: DBRE | Original post link

You can check whether the datasource in the json files under the dashboards subdirectory in the grafana deployment directory is correct.
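
For example, a quick way to inspect those JSON files from the command line (a sketch; /data/grafana-3000 is the deployment path that appears later in this thread, so adjust it to your own directory):

    grep -rn 'publicpublic' /data/grafana-3000/dashboards/
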

| username: realcp1018 | Original post link

I couldn’t find what actually triggered the issue, so for now I’m attributing it to a historical legacy problem.
I added allowUiUpdates: true to /data/grafana-3000/provisioning/dashboards/dashboard.yml.
Then, picking any dashboard on the page, I clicked the Dashboard Settings button, opened the JSON Model, copied the content out, replaced every instance of publicpublic with public, pasted it back, and saved.
The modified dashboard now works fine, but the rest have to be fixed one by one.
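
For reference, here is a sketch of that provisioning file with the key added (only allowUiUpdates is the change described above; the provider name and path are illustrative and should stay whatever your existing dashboard.yml already uses):

    # /data/grafana-3000/provisioning/dashboards/dashboard.yml
    apiVersion: 1
    providers:
      - name: 'test-cluster'       # illustrative provider name; keep your existing value
        type: file
        allowUiUpdates: true       # lets provisioned dashboards be saved from the Grafana UI
        options:
          path: /data/grafana-3000/dashboards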

| username: realcp1018 | Original post link

I checked the data sources referenced in the files under provisioning, and they are all correct. However, following your suggestion, the files under dashboards do show anomalies. For example:

    {
      "name": "publicpublictest-cluster",
      "label": "publictest-cluster",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }

I should be able to fix it by manually replacing all the files here. Replacing the dashboards one by one on the page is a bit slow.

| username: realcp1018 | Original post link

Thank you! I did the replacement across all the files at once and the monitoring is back to normal, so there’s no need to modify anything from the page:

    cd /data/grafana-3000/dashboards
    for f in *; do sed -i 's/publicpublic/public/g' "$f"; done
    sudo systemctl restart grafana-3000
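
A quick sanity check after the restart, as a sketch using the same paths and service name as above:

    grep -rl 'publicpublic' /data/grafana-3000/dashboards || echo "no stale datasource names left"
    systemctl is-active grafana-3000    # expect: active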