Abnormal Data Collection by node_exporter Leading to Alerts

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: node_exporter 采集数据有异常导致告警

| username: Raymond

TiDB version: 5.3.1
OS: Redhat 7.9
Phenomenon: Prometheus is firing the following alert: NODE_memory_used_more_than_80%
One server's memory usage exceeds 80%.

However, when logging into the server and checking with free -h, the memory usage is actually not over 80%.
Looking at the alert monitoring metric expression:
(((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes) * 100))
My understanding of how this alert works is:
(total OS memory - free memory - cached memory) / total memory, where the cached memory is the value shown in the buff/cache column of the free -h output
But when I checked with:
curl http://127.0.0.1:9100/metrics | grep "node_memory_Cached_bytes"
I found that the value of node_memory_Cached_bytes is much smaller than the actual buffer/cache value shown by free -h, which is causing the alert.
Moreover, I checked another server and found that node_memory_Cached_bytes is basically consistent with the buff/cache value shown by free -h; only this machine has the problem.
I would like to ask if anyone has encountered a similar issue?
I restarted the node_exporter process, but it did not help; the abnormal alert persists.
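For reference, a minimal way to re-check the alert by hand is to pull the three counters the expression uses from the exporter (same 127.0.0.1:9100 endpoint as above; values are in bytes) and plug them into the formula:
curl -s http://127.0.0.1:9100/metrics | grep -E '^node_memory_(MemTotal|MemFree|Cached)_bytes'
# then compute (MemTotal - MemFree - Cached) / MemTotal * 100 from the returned values
# and compare the result with what free -h reports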

| username: zhaokede | Original post link

It is the allocated memory that occupies 80%, not the server’s memory.

| username: Raymond | Original post link

The allocated memory should be the memory that is already in use, including buffer/cache.

| username: Raymond | Original post link

Supplement:

  1. The buffer/cache shown in free -h is 57G.
  2. node_exporter shows only 8G for node_memory_Cached_bytes.
  3. I suspect that either node_exporter is collecting the wrong data, or that Buffers usage is very large while Cached usage is small. I will log into the server next week to check the exact Buffers and Cached values in /proc/meminfo (see the sketch below).
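
For the record, the check I plan to run is just the following (a minimal sketch; note that /proc/meminfo reports values in kB while node_exporter reports bytes):
grep -E '^(MemTotal|MemFree|Buffers|Cached)' /proc/meminfo
free -h
curl -s http://127.0.0.1:9100/metrics | grep -E '^node_memory_(Buffers|Cached)_bytes'
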
| username: xiaoqiao | Original post link

If you have Zabbix, set up an agent to compare and see.

| username: TiDBer_21wZg5fm | Original post link

Haven’t found anything yet.

| username: DBAER | Original post link

Is the difference between total and Available equal to the used memory? Some agents calculate it this way.
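If you want to alert on that definition instead, a rough sketch of the expression (assuming the kernel exposes MemAvailable in /proc/meminfo, so that node_exporter exports node_memory_MemAvailable_bytes) would be:
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80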

| username: shigp_TIDBER | Original post link

Here to observe and learn.

| username: zhang_2023 | Original post link

It’s not system memory, right?

| username: stephanie | Original post link

Here to learn. I checked the node_exporter documentation: the meminfo_numa collector exposes memory statistics from /sys/devices/system/node/node[0-9]*/meminfo and /sys/devices/system/node/node[0-9]*/numastat.
Reference: prometheus/node_exporter: Exporter for machine metrics (github.com)
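If you want to confirm which collectors are actually producing data on the problem host, node_exporter also exports a per-collector status series; a quick check (again assuming the default 9100 port) is:
curl -s http://127.0.0.1:9100/metrics | grep node_scrape_collector_success
# node_memory_* comes from the meminfo collector; meminfo_numa only adds the per-NUMA-node series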

| username: QH琉璃 | Original post link

I haven’t encountered this situation.

| username: dba远航 | Original post link

Pay attention to whether other programs are occupying memory.

| username: TiDBer_JUi6UvZm | Original post link

Haven’t encountered it, keeping an eye on it.

| username: Raymond | Original post link

That is per-NUMA-node information.
The overall memory usage can be obtained from /proc/meminfo.

| username: Raymond | Original post link

At present, the preliminary cause seems to have been identified. The data collected by node_exporter comes from /proc/meminfo, and it turns out that SReclaimable in /proc/meminfo is very large (approximately 50G) while Cached is very small. Therefore, the current suspicion is an operating system memory leak.
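For anyone who wants to reproduce the check, a minimal sketch of what was compared (slabtop usually needs root):
grep -E '^(MemTotal|MemFree|Buffers|Cached|SReclaimable|Slab)' /proc/meminfo
slabtop -o -s c | head -20   # shows which kernel slab caches are holding the memory
# note: depending on the procps version, free's buff/cache column may also count SReclaimable,
# which would explain buff/cache (57G) being so much larger than node_memory_Cached_bytes (8G)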

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.