Monitoring Interface Error

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 监控接口报错

| username: 艾琳逼哥索

How should I handle this error? I'm a tech newbie.

| username: Daniel-W | Original post link

Check if Prometheus is functioning properly.
Verify if the network between the PD node and Prometheus is connected.

| username: tidb菜鸟一只 | Original post link

Execute tiup cluster display tidb-test to check the status of the Prometheus process.
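The display step above can be sketched as a quick filter. In the output of `tiup cluster display`, the Status column tells you whether the Prometheus process is up; the sample line below is hypothetical and only illustrates which column to read:

```shell
# On the control machine you would run:
#   tiup cluster display tidb-test | grep -i prometheus
# The sample line below is hypothetical; real output has the same columns:
#   ID  Role  Host  Ports  OS/Arch  Status  Data Dir  Deploy Dir
sample='192.168.1.10:9090  prometheus  192.168.1.10  9090/12020  linux/x86_64  Down  /data/prometheus-9090  /deploy/prometheus-9090'
# The sixth whitespace-separated column is the process status.
status=$(echo "$sample" | awk '{print $6}')
echo "prometheus status: $status"
```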

| username: DBRE | Original post link

Try modifying it.

| username: songxuecheng | Original post link

If the Prometheus source has been changed, the dashboard needs to modify the data source accordingly.

| username: 艾琳逼哥索 | Original post link

The disk is full, and there is too much data in the wal directory. How should this be handled? Can it be deleted?

| username: kelvin | Original post link

Check the Prometheus process status. If the status is abnormal, check if there are any network issues.

| username: 小于同学 | Original post link

Ping the address that is showing the error.

| username: tidb菜鸟一只 | Original post link

The WAL directory should be quite small, usually just storing the logs for the current day.
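To confirm whether the WAL is really what is filling the disk, a quick size check helps. The path below is an assumption; adjust it to your deployment's actual Prometheus data directory:

```shell
# Report the size of a Prometheus WAL directory (the path is a placeholder;
# in a tiup deployment it usually sits under the Prometheus data dir).
wal_usage() {
  du -sh "$1" 2>/dev/null || echo "not found: $1"
}
wal_usage "/data/prometheus-9090/data/wal"
```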

| username: changpeng75 | Original post link

Is the connection to port 9090 being refused due to the firewall blocking it, or is the application not allowing the connection?
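One way to separate the two cases: a host with nothing listening refuses the connection almost instantly, while a firewall that drops packets makes the attempt hang until it times out. A minimal probe sketch, with host and port as placeholders:

```shell
# Probe a TCP port. "closed" covers both a refusal and a timeout, but a
# refusal returns almost instantly while a firewall drop waits out the
# full 3-second timeout -- the elapsed time tells the two cases apart.
probe() {  # usage: probe <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}
# Result depends on whether anything is listening on 9090 locally.
probe 127.0.0.1 9090
```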

| username: DBRE | Original post link

  1. It is not recommended to delete the WAL directly, as doing so may lose data.
  2. Filter the Prometheus logs to check whether Prometheus is restarting frequently; frequent restarts show repeated “Starting TSDB …” entries. In that case, you can modify the Prometheus configuration file to drop some metric collection, then restart Prometheus with tiup.
    Find the section with job_name: “tikv” and add:
    • source_labels: [__name__]
      separator: ;
      regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
      action: drop
    • source_labels: [__name__, name]
      separator: ;
      regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
      action: drop
  3. Alternatively, you can redeploy Prometheus by scaling it in and then scaling it out again.
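Put together, the drop rules above belong under `metric_relabel_configs` in the tikv scrape job of `prometheus.yml`. The surrounding layout below is a sketch based on a typical tiup-deployed Prometheus, so verify the exact nesting against your own config; only the two drop entries come from the reply itself:

```yaml
# prometheus.yml (fragment) -- drop a few high-cardinality TiKV metrics.
scrape_configs:
  - job_name: "tikv"
    # ... existing scrape settings for the tikv job ...
    metric_relabel_configs:
      - source_labels: [__name__]
        separator: ;
        regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
        action: drop
      - source_labels: [__name__, name]
        separator: ;
        regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
        action: drop
```

Note that tiup can regenerate `prometheus.yml` when the cluster is reloaded, so manual edits may not survive a `tiup cluster reload`; after editing, restart only the Prometheus component (e.g. `tiup cluster restart tidb-test -R prometheus`).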

| username: caiyfc | Original post link

If the monitoring data is not important, delete some of it. If it is important, scale out a new Prometheus instance and point Grafana's Prometheus data source at it; you can then view the monitoring information collected after the scale-out. After it has run for a while, scale in the problematic Prometheus node. To view the old data on the problematic node, you would still need to switch the Grafana data source back.
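The scale-out path above uses a tiup topology file. A minimal sketch, where the host IP is a placeholder and the cluster name `tidb-test` comes from the thread:

```yaml
# scale-out.yaml -- add a second monitoring (Prometheus) node.
# The host is a placeholder; the port defaults to 9090.
monitoring_servers:
  - host: 10.0.1.100
```

Then, roughly: `tiup cluster scale-out tidb-test scale-out.yaml` to add the node, and later `tiup cluster scale-in tidb-test -N <old-host>:9090` to remove the problematic one.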

| username: dba远航 | Original post link

There is an issue with obtaining the data source. Please check the connection to Prometheus.

| username: zhang_2023 | Original post link

It doesn’t seem to be a network issue, firewall, or port.

| username: DBAER | Original post link

Check the status of Prometheus.