After upgrading to version 7.5.1, the tidb_server_connections metric is abnormal

As shown in the figure, the total number of connections is 338, but the value returned by the /metrics interface is 1, which looks a lot like the number of active connections.

How about trying to curl from a different tidb-server node?

Or you can use netstat to filter for the connections that are actually established. The total you checked directly appears to be the connection count for the entire cluster, while curl against /metrics should only return the connection count for a single instance. Alternatively, check which expression the monitoring panel uses for the connection count.
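To compare what a single instance reports against the cluster-wide total, it can help to list every tidb_server_connections series in that instance's /metrics output. The following is a minimal sketch, assuming the text was saved from a curl against one tidb-server; the SAMPLE excerpt and its values are made up for illustration, and the parser is simplified (it does not handle label values containing a closing brace).

```python
import re

# Matches sample lines such as: tidb_server_connections{resource_group="default"} 12
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def connection_series(metrics_text, metric="tidb_server_connections"):
    """Return a list of (labels, value) pairs for every series of `metric`."""
    series = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_LINE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL.findall(m.group("labels") or ""))
        series.append((labels, float(m.group("value"))))
    return series


def total_connections(metrics_text):
    """Sum the metric over all of its label combinations on this instance."""
    return sum(value for _, value in connection_series(metrics_text))


# Hypothetical /metrics excerpt; the numbers are invented for illustration.
SAMPLE = '''\
# HELP tidb_server_connections Current number of connections.
# TYPE tidb_server_connections gauge
tidb_server_connections{resource_group="default"} 40
tidb_server_connections{resource_group="rg1"} 2
tidb_server_tokens 3
'''

if __name__ == "__main__":
    for labels, value in connection_series(SAMPLE):
        print(labels, value)
    print("total:", total_connections(SAMPLE))
```

If an instance exposes several series, with and without a resource_group label, a reading of 1 may be just one of them rather than the instance total.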

Are the other metrics normal? For example, check tidb_server_tokens, which is used to monitor the number of active sessions.

There might be some helpful information in the alert logs.

Did you configure this during the upgrade?

However, it's strange that among the multiple tidb-servers in the same cluster, one machine's metrics are normal and show only a default metric.

The other abnormal nodes also report metrics without the resource_group label.

After comparing the configurations, I found that the other tidb-server instances have the instance.tidb_force_priority parameter set, while the tidb-server with normal metrics does not have this configuration. I will remove it over the weekend and observe the results.

This is a bug, tracked in pingcap/tidb issue #51889, "Connection count metric can be less than the real value". The problem was introduced by the graceful shutdown enhancement (pingcap/tidb PR #32111 by july2993, "server: enhance graceful stop by closing connections after finish the ongoing txn"). It only became visible later, when connection metrics were added per resource group (pingcap/tidb PR #49424 by bufferflies, "metrics: add connection and fail metrics by `resource group name`").

The expert is very meticulous, a bug-catching master :+1:

Same issue here: the Connection Count panel shows duplicated IPs.

Temporary solution: tidb_server_connections{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", resource_group="default"}

I aggregated directly with sum ... by (instance), but the values are still incorrect. However, there is now only one line per instance.
sum(tidb_server_connections{cluster="$tidb_cluster"}) by (instance)
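To make the difference between the two queries in this thread concrete, here is a small sketch that mimics them on in-memory data: filtering on resource_group="default" versus summing all series by instance. The instance names and values in SERIES are hypothetical, and the functions only imitate the PromQL semantics; they are not how Prometheus evaluates the query.

```python
from collections import defaultdict


def sum_by_instance(series):
    """Mimic sum(...) by (instance): one total per instance label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels["instance"]] += value
    return dict(totals)


def only_resource_group(series, group="default"):
    """Mimic the {resource_group="default"} matcher: drop all other series."""
    return [(labels, value) for labels, value in series
            if labels.get("resource_group") == group]


# Hypothetical series; instance names and values are invented for illustration.
SERIES = [
    ({"instance": "tidb-0:10080", "resource_group": "default"}, 120.0),
    ({"instance": "tidb-0:10080"}, 1.0),
    ({"instance": "tidb-1:10080", "resource_group": "default"}, 95.0),
    ({"instance": "tidb-1:10080", "resource_group": "rg1"}, 3.0),
]

if __name__ == "__main__":
    print(sum_by_instance(SERIES))
    print(sum_by_instance(only_resource_group(SERIES)))
```

Either form collapses the duplicate lines per instance, but neither can restore connections that the gauge itself failed to count, which would be consistent with the values still looking wrong after aggregation.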

