Abnormal Data Collection by node_exporter Leading to Alerts

Original topic: node_exporter 采集数据有异常导致告警

TiDB version: 5.3.1
OS: Redhat 7.9
Symptom: Prometheus fires the alert NODE_memory_used_more_than_80%,
reporting that memory usage on one server exceeds 80%.

However, when logging into the server and checking with free -h, the memory usage is actually not over 80%.
Looking at the alert monitoring metric expression:
(((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes) * 100))
As I understand it, the alert computes:
Total OS memory - free memory - cache memory (which I expected to match the buff/cache column output by free -h)
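To make the arithmetic of the alert expression concrete, here is a minimal Python sketch that evaluates it for a single sample. The function name and the total/free figures are made up for illustration; the two cache sizes mirror the 57G and 8G figures reported later in this thread.

```python
GIB = 1024 ** 3


def alert_memory_used_percent(mem_total, mem_free, cached):
    """Evaluate (MemTotal - MemFree - Cached) / MemTotal * 100 on byte values."""
    return (mem_total - mem_free - cached) / mem_total * 100


# Hypothetical host: 128 GiB total, 10 GiB free (assumed values).
total, free = 128 * GIB, 10 * GIB

# If the whole buff/cache figure were subtracted, usage would look moderate.
print(alert_memory_used_percent(total, free, 57 * GIB))

# If only a small Cached value is subtracted, the same host crosses 80%.
print(alert_memory_used_percent(total, free, 8 * GIB))
```

With these assumed numbers, the only thing that changes between the two calls is how much memory is counted as cache, which is enough to move the result across the 80% threshold.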
But when I checked with:
curl | grep "node_memory_Cached_bytes"
I found that the value of node_memory_Cached_bytes is much smaller than the buff/cache value shown by free -h, and that is what triggers the alert.
I also checked another server, where node_memory_Cached_bytes and the buff/cache value shown by free -h are roughly consistent; only this machine has the problem.
Has anyone encountered a similar issue?
Restarting the node_exporter process did not help; the false alert persists.
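For comparing several memory fields at once instead of grepping for one, the text returned by the exporter can be parsed with a few lines of Python. This is only a sketch: the function name is hypothetical, the sample text is invented, and it assumes the node_memory_* lines carry no labels, as in the curl output described above.

```python
def parse_node_memory_metrics(exposition_text):
    """Return {metric_name: value} for unlabeled node_memory_* samples."""
    metrics = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, rest = line.partition(" ")
        if name.startswith("node_memory_") and rest:
            # The first token after the name is the sample value.
            metrics[name] = float(rest.split()[0])
    return metrics


# Invented sample of exporter output (values are assumptions).
SAMPLE = """\
# TYPE node_memory_Cached_bytes gauge
node_memory_Cached_bytes 8.589934592e+09
node_memory_MemFree_bytes 1.073741824e+10
node_load1 0.42
"""

for name, value in sorted(parse_node_memory_metrics(SAMPLE).items()):
    print(f"{name}: {value / 1024 ** 3:.1f} GiB")
```

Feeding it the saved output of the curl command makes it easy to line the values up against free -h on the same host.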

It is the allocated memory that occupies 80%, not the server’s memory.

The allocated memory should be the memory that is already in use, including buffer/cache.

  1. The buffer/cache shown in free -h is 57G.
  2. node_exporter shows only 8G for node_memory_Cached_bytes.
  3. I suspect that either node_exporter collected the wrong data, or buffer usage is very large and cache usage is very small. I will log into the server next week to check the specific Buffers and Cached values in /proc/meminfo.

If you have Zabbix, set up an agent and compare the readings.

Is the difference between total and Available equal to the used memory? Some agents calculate it this way.
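The "total minus available" calculation mentioned here can be sketched as follows. The function name and the sample numbers are hypothetical; on a real host the inputs would be the MemTotal and MemAvailable fields of /proc/meminfo, assuming the kernel exposes MemAvailable.

```python
def used_percent_from_available(mem_total, mem_available):
    """Used memory as a percentage, taking used = MemTotal - MemAvailable."""
    return (mem_total - mem_available) / mem_total * 100


# Hypothetical sample: 128 units total, 96 units available.
print(used_percent_from_available(128, 96))
```

Because both inputs are in the same unit, the function works equally on kB values from /proc/meminfo or byte values from an exporter.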

It’s not system memory, right?

Good to know. I looked this up in the node_exporter documentation:
meminfo_numa Exposes memory statistics from /sys/devices/system/node/node[0-9]*/meminfo, /sys/devices/system/node/node[0-9]*/numastat.
prometheus/node_exporter: Exporter for machine metrics (github.com)

Pay attention to whether other programs are occupying memory.

That is the information of NUMA nodes.
The overall memory usage can be obtained from /proc/meminfo.

The preliminary cause now seems to have been identified. node_exporter reads its data from /proc/meminfo, and on this machine SReclaimable in /proc/meminfo is very large (approximately 50G) while Cached is very small. An operating system memory leak is therefore suspected.
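A short Python sketch of this check is below. As far as I know, recent procps-ng versions of free report buff/cache as Buffers + Cached + SReclaimable, which would explain why free -h shows a large figure while Cached alone is small; treat that formula as an assumption to verify against your free version. The function names and the sample values are invented for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])
    return fields


def buff_cache_kb(fields):
    """buff/cache as assumed above: Buffers + Cached + SReclaimable."""
    return fields["Buffers"] + fields["Cached"] + fields["SReclaimable"]


# Invented sample resembling the situation in this thread.
SAMPLE_MEMINFO = """\
MemTotal:       131072000 kB
MemFree:         10485760 kB
Buffers:             1024 kB
Cached:           8388608 kB
SReclaimable:    52428800 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print("Cached (kB):     ", info["Cached"])
print("buff/cache (kB): ", buff_cache_kb(info))
```

On a real host the text could be read with open("/proc/meminfo").read(). A large gap between the two printed values points at reclaimable slab rather than page cache.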

