How Grafana Monitoring Metrics Are Collected

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: grafana 的监控指标是如何统计出来的

| username: weixiaobing

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
A problem occurred while scaling out TiKV in the production environment. When checking the monitoring, I found that the number of TiKV nodes shown in the Services Port Status panel of the Overview dashboard did not match the number of nodes displayed on the dashboard.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: tidb菜鸟一只 | Original post link

You can query Prometheus directly to check. You can also try reloading Prometheus with tiup cluster reload <cluster-name> -R prometheus.

| username: DBRE | Original post link

Grafana's data comes from Prometheus. The data source configured in Grafana is the address of the Prometheus instance that belongs to the TiDB cluster. Open the relevant panel in Grafana, click Edit to see its PromQL expression, and then run that expression in Prometheus.

If the count is wrong, run tiup cluster reload xxxx -R prometheus (where xxxx is the cluster name) and check again.
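
The "run the panel's PromQL in Prometheus" step can also be scripted. The following is a minimal sketch using only the Python standard library and the Prometheus HTTP API endpoint /api/v1/query. The helper names (build_query_url, parse_vector, query_instances) and the address in the usage comment are made up for illustration, and the bare probe_success expression is a placeholder: copy the real expression from the panel's edit view.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    base = prometheus_base.rstrip("/")
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict) -> dict:
    """Map each series' 'instance' label to its current sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    values = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<no instance label>")
        # An instant-vector sample is [timestamp, "value-as-string"].
        values[instance] = float(series["value"][1])
    return values


def query_instances(prometheus_base: str, promql: str, timeout: float = 5.0) -> dict:
    """Run an instant query and return {instance: value}."""
    url = build_query_url(prometheus_base, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Usage (hypothetical address; replace with the cluster's Prometheus):
#   for instance, value in query_instances("http://127.0.0.1:9090", "probe_success").items():
#       print(instance, value)
```

Comparing the instances returned here with the expected node list shows whether Prometheus itself is missing a target or whether only the Grafana panel is out of date.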

| username: weixiaobing | Original post link

How is the probe_success metric calculated? Prometheus is very slow to reload and restart, and there is no monitoring during the restart, so we rarely reload it.

| username: DBRE | Original post link

The probe_success metric is produced by the blackbox_exporter that corresponds to each node. It should determine whether a node is alive by opening a TCP connection to that node's status port; the metric is 1 when the probe succeeds and 0 when it fails. See job_name: "tidb_port_probe" in the prometheus.yml configuration file. Prometheus scrapes the probe_success metric periodically. For the exact mechanism, you need to look at the blackbox_exporter implementation.
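
To make the idea of a TCP port probe concrete, here is a rough Python sketch. It is not blackbox_exporter's code (that project is written in Go, and its TCP prober has more options); the function name tcp_probe and the timeout default are assumptions chosen for illustration. It only shows the basic logic described above: try to open a TCP connection, and report 1 on success or 0 on failure.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> int:
    """Return 1 if a TCP connection opens within the timeout, else 0."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        # Covers connection refused, unreachable hosts, DNS errors, and timeouts.
        return 0
```

Under this model, a newly added TiKV node only shows up in probe_success once its address is in the probe job's target list, so a node missing from the panel is more likely a missing target than a failed probe, which would report 0 instead.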

| username: weixiaobing | Original post link

What I really want to know is the exact calculation logic behind monitoring metrics such as probe_success. Once that logic is clear, troubleshooting becomes much easier.

| username: DBRE | Original post link

I agree: understanding what a metric means and how it is calculated helps with troubleshooting. However, the official documentation gives little detailed explanation of the monitoring metrics, and even less of the metrics returned by each component's status interface. That makes it very hard to troubleshoot by combining these metrics.

| username: liuis | Original post link

Prometheus is a time-series database, and Grafana is just a visualization tool.