After upgrading TiDB and DM, the DM dashboard monitoring shows no data

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 升级后 dm 也升级了 dm dashboard 监控无数据

| username: TiDBer_ZsnVPQB4

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] Upgraded TiDB version from v5.2.1 to v6.5.0, and also upgraded the DM cluster to v6.5.0
[Encountered Problem: Problem Phenomenon and Impact]
After upgrading the DM cluster, the DM syncer lag alert in the monitoring dashboard shows no data.
I tried reloading the entire cluster, restarting the whole DM cluster, restarting Grafana on its own, and stopping and restarting the DM replication task, but there is still no data.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: TiDBer_ZsnVPQB4 | Original post link

It looks like the metrics are no longer being collected, and everything now shows null. Which component is responsible for collecting metrics in a DM cluster? I couldn't find the collection code in dm-worker or dm-master, and I didn't see a service like node_exporter handling collection either.

| username: Lucien-卢西恩 | Original post link

You can troubleshoot top-down along the data path: Grafana → Prometheus → dm-worker. A common cause is that the related configuration or services for dm-worker and dm-master have not been started.
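A minimal sketch of the "Prometheus" step of that check, assuming Python 3.9+ and a reachable Prometheus server. It reads the standard Prometheus HTTP API endpoint /api/v1/targets and lists the health of DM scrape targets. The job-name prefix "dm" (for jobs such as dm_master or dm_worker) and the example address are assumptions; adjust them to your topology.

```python
import json
import urllib.request


def fetch_targets(base_url: str) -> dict:
    """Fetch the scrape-target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def dm_target_health(payload: dict) -> list[tuple[str, str, str, str]]:
    """Return (job, instance, health, lastError) for each active target whose
    job name starts with "dm" (assumption: DM jobs are named like dm_master/dm_worker)."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        job = labels.get("job", "")
        if job.startswith("dm"):
            rows.append((
                job,
                labels.get("instance", ""),
                target.get("health", ""),
                target.get("lastError", ""),
            ))
    return rows


# Example usage (hypothetical Prometheus address):
# for row in dm_target_health(fetch_targets("http://127.0.0.1:9090")):
#     print(row)
```

A target reported as "down" with a non-empty lastError points at the dm-worker/dm-master side; if every DM target is "up", the problem is more likely in the queries Grafana or the alert rules are running.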

| username: TiDBer_ZsnVPQB4 | Original post link

  1. Below is what the DM dashboard currently shows. The main problem is that only the alert panel returns no values; the other three monitoring/alert panels show data normally.


| username: TiDBer_ZsnVPQB4 | Original post link

Found something: this metric is gone. The new version probably uses the dm_syncer_replication_lag_gauge metric instead.
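If the alert or panel query still references the old bare metric name, one option is to point it at the gauge. The sketch below is hypothetical and assumes that dm_syncer_replication_lag_gauge carries the same per-task lag value the old metric did, which this thread only suspects. It walks a Grafana dashboard JSON object (where Prometheus queries live under "expr" keys) and rewrites only the bare name, leaving the _gauge, _sum, _count and _bucket series untouched.

```python
import json
import re

# Matches the bare old metric name only. It does not match
# dm_syncer_replication_lag_gauge/_sum/_count/_bucket, because "_" is a word
# character, so there is no word boundary right after "lag" in those names.
OLD_METRIC = re.compile(r"\bdm_syncer_replication_lag\b")
NEW_METRIC = "dm_syncer_replication_lag_gauge"  # assumed replacement


def rewrite_exprs(node):
    """Return a copy of a dashboard JSON object with every "expr" string rewritten."""
    if isinstance(node, dict):
        return {
            key: (OLD_METRIC.sub(NEW_METRIC, value)
                  if key == "expr" and isinstance(value, str)
                  else rewrite_exprs(value))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rewrite_exprs(item) for item in node]
    return node


def rewrite_dashboard_file(src_path: str, dst_path: str) -> None:
    """Read an exported dashboard JSON file and write the rewritten copy."""
    with open(src_path, encoding="utf-8") as f:
        dashboard = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(rewrite_exprs(dashboard), f, indent=2)
```

Writing to a new file keeps the original export for comparison before re-importing it into Grafana.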

| username: Lucien-卢西恩 | Original post link

Is dm_syncer_replication_lag_gauge recorded in Prometheus?
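One way to answer this is to ask Prometheus which metric names it knows about. This sketch uses the Prometheus HTTP API endpoint /api/v1/label/__name__/values and filters by prefix; the server address in the usage comment is a hypothetical placeholder.

```python
import json
import urllib.request


def fetch_metric_names(base_url: str) -> list[str]:
    """Return every metric name Prometheus currently has stored."""
    url = f"{base_url}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]


def names_with_prefix(names: list[str], prefix: str) -> list[str]:
    """Return the metric names that start with prefix, sorted alphabetically."""
    return sorted(name for name in names if name.startswith(prefix))


# Example usage (hypothetical Prometheus address):
# names = fetch_metric_names("http://127.0.0.1:9090")
# print(names_with_prefix(names, "dm_syncer_replication_lag"))
```

A name appearing in this list only shows that Prometheus has stored the series at some point; querying the metric directly confirms whether it currently has samples.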

| username: TiDBer_ZsnVPQB4 | Original post link

Yes, dm_syncer_replication_lag_gauge is recorded. In the new version, dm_syncer_replication_lag seems to have been split into four metrics, and the plain dm_syncer_replication_lag metric no longer exists. I can't find it among the metrics:

  • dm_syncer_replication_lag_sum
  • dm_syncer_replication_lag_gauge
  • dm_syncer_replication_lag_count
  • dm_syncer_replication_lag_bucket
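The _sum, _count and _bucket suffixes are the three series a Prometheus classic histogram exposes, so the old name appears to have become a histogram, while _gauge looks like a separate gauge. This is an inference from the naming, not something confirmed in this thread. The sketch below shows how a histogram is read: the mean is increase(_sum) / increase(_count), and a quantile is estimated from cumulative buckets by linear interpolation, similar to what PromQL's histogram_quantile() does with a query such as histogram_quantile(0.95, sum(rate(dm_syncer_replication_lag_bucket[5m])) by (le)).

```python
import math


def mean_lag(sum_delta: float, count_delta: float) -> float:
    """Average lag over a window: increase of _sum divided by increase of _count."""
    return sum_delta / count_delta if count_delta > 0 else math.nan


def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The last bucket must have upper bound +Inf, as in Prometheus histograms.
    Values are interpolated linearly inside the bucket that contains the rank;
    the lowest bucket is assumed to start at 0 (lag is never negative).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: return the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with buckets [(1, 10), (2, 30), (4, 40), (inf, 40)] the median falls halfway through the (1, 2] bucket, giving 1.5.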

| username: Lucien-卢西恩 | Original post link

Have you confirmed that the clocks on all the servers are synchronized? We've seen a similar case before where the data was empty because the clocks were out of sync.
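A rough way to spot large clock skew without logging into each host is to compare the HTTP Date header returned by a server with the local clock. This is a hypothetical sketch, assuming the target host runs an HTTP service that sets the Date header (Go's net/http does this by default, so Prometheus and Grafana endpoints should work). The header has one-second resolution, so it only reveals gross skew; an NTP tool on each host gives a precise answer.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def skew_seconds(date_header: str, local_now: datetime) -> float:
    """Remote clock minus local clock, in seconds, from an HTTP Date header.
    local_now must be timezone-aware (UTC)."""
    remote = parsedate_to_datetime(date_header)
    return (remote - local_now).total_seconds()


def probe(url: str) -> float:
    """Request url and compare the server's Date header with the local UTC clock."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        date_header = resp.headers["Date"]
    return skew_seconds(date_header, datetime.now(timezone.utc))


# Example usage (hypothetical address of the monitoring node's Prometheus):
# print(probe("http://127.0.0.1:9090/-/healthy"))
```

Network latency adds a small positive offset, so only differences of several seconds or more are worth investigating.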

| username: TiDBer_ZsnVPQB4 | Original post link

I checked: the clocks on the DM cluster, the TiDB nodes, and the monitoring node are all consistent.

Also, the DM dashboard runs on the same host as one of the DM cluster nodes.

It is definitely not a time issue.