Possible Causes for Intermittent Drop to 0 in TiDB-server Memory Monitoring

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB-server 内存监控 间歇性降为0,是哪些可能得原因导致的

| username: TiDBer_vFs1A6CZ

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.1
[Reproduction Path]
Single SQL query limit: global.tidb_mem_quota_query = 40G
TiDB server memory limit: global.tidb_server_memory_limit = '80%'
GC trigger: tidb_server_memory_limit_gc_trigger = 0.7
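
For reference, here is a minimal sketch of how these limits are set and verified through system variables (assuming the v6.5 variable set; the value of tidb_mem_quota_query is given in bytes):

```sql
-- Apply the limits described above (illustrative values)
SET GLOBAL tidb_mem_quota_query = 42949672960;         -- 40 GiB, in bytes
SET GLOBAL tidb_server_memory_limit = '80%';           -- share of the instance's total memory
SET GLOBAL tidb_server_memory_limit_gc_trigger = 0.7;  -- trigger Go GC at 70% of the limit

-- Verify what is actually in effect
SHOW GLOBAL VARIABLES LIKE 'tidb_server_memory_limit%';
SHOW GLOBAL VARIABLES LIKE 'tidb_mem_quota_query';
```
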
[Encountered Problem: Phenomenon and Impact]
TiDB's total memory is 160G, and the tidb-server memory limit is tidb_server_memory_limit = 138G.
When checking the TiDB-Server memory monitoring, it intermittently drops to 0.
Additionally, the single SQL query limit is 40G, and only one task was running on the TiDB-Server during the monitored period. Why does the TiDB-Server memory monitoring show usage of around 90G?

When checking the TiDB-Server uptime, it also intermittently drops to 0, but the total uptime then continues from its previous value.

When checking the network communication of the tidb-server node, communication is normal, and no traffic was sent during the corresponding period.

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring] No OOM events were found on the corresponding node during the corresponding time window.

| username: 小龙虾爱大龙虾 | Original post link

Dropping to 0 doesn't mean the memory suddenly became 0; it means data collection stopped. The failure to collect might be due to resource exhaustion, such as CPU or memory being fully utilized. Separately, your TiDB is consuming a lot of memory, so it is recommended to optimize your SQL queries. Also, setting tidb_mem_quota_query to 40G is excessive and is unlikely to be beneficial.
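
To see which statements are actually holding the memory, one option (a sketch; the MEM column is reported in bytes) is the cluster-wide processlist:

```sql
-- Top memory-consuming statements across all tidb-servers
SELECT instance, id, user, time, mem, LEFT(info, 80) AS stmt
FROM information_schema.cluster_processlist
ORDER BY mem DESC
LIMIT 10;
```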

| username: TiDBer_vFs1A6CZ | Original post link

CPU resource usage (screenshot)

Memory usage (screenshot)

Memory & CPU resources are not exhausted, so why does this situation still occur?

| username: 小龙虾爱大龙虾 | Original post link

The load is still too high.

| username: Jellybean | Original post link

These graphs look a bit strange. Take the TiDB Uptime panel: if the tidb-server node had really crashed and restarted, the Uptime curve should climb gradually from 0 after the drop. Instead, the actual curve resumes at the same value as before the drop, which indicates that the tidb-server did not crash and restart.

Since the memory usage curve shows the same pattern, it can be concluded that the problem lies in the monitoring data collection components: either the collection is intermittent, or the charts are displaying the data discontinuously.

Therefore, you can investigate:

  • Whether the monitoring collection components are functioning properly, e.g., check whether node_exporter, blackbox_exporter, Prometheus, etc., have restarted or are stuck.
  • Investigate the Grafana display module to confirm whether the monitoring metric data collection is normal but the display has issues.
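
As a quick cross-check from the SQL side (a sketch using information_schema.cluster_info): if start_time is unchanged and uptime keeps growing, the tidb-server process never restarted.

```sql
-- Component start time and uptime; an unchanged start_time means no restart
SELECT type, instance, version, start_time, uptime
FROM information_schema.cluster_info
WHERE type = 'tidb';
```
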
| username: Jellybean | Original post link

Confirm whether the cluster has been affected, such as whether there has been a sudden drop or severe fluctuation in the overall latency and QPS of business access to the TiDB cluster.

| username: 像风一样的男子 | Original post link

It looks like TiDB OOM caused a restart.

| username: caiyfc | Original post link

Bro, how did you determine that?
I think the possibility mentioned by the moderator above is more likely, and the possibility of OOM is relatively small.

| username: tidb狂热爱好者 | Original post link

Your machine is not running properly; one of the TiKV CPUs is maxed out.

| username: 小龙虾爱大龙虾 | Original post link

If the process had been OOM-killed by the operating system, the uptime monitoring would restart from 0, so this should not be a process-level OOM.

| username: 胡杨树旁 | Original post link

Check the system logs and search for “oom kill” and similar terms. Generally, if a TiDB node OOMs and restarts, its log will contain a “Welcome” line indicating that the tidb node started up again.
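
Note that OOM-killer records live in the OS logs (dmesg/syslog). The TiDB-side restart banner, however, can also be searched from SQL via CLUSTER_LOG; a sketch, with a deliberately narrow and hypothetical time window since log search is expensive:

```sql
-- Search tidb logs for the startup banner around the incident window
SELECT time, instance, level, message
FROM information_schema.cluster_log
WHERE type = 'tidb'
  AND time BETWEEN '2024-01-01 00:00:00' AND '2024-01-01 06:00:00'  -- hypothetical window
  AND message LIKE '%Welcome%';
```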

| username: 春风十里 | Original post link

Although 80% is the default configuration for TiDB, I personally think this setting is too high. For comparison, Oracle’s default configuration is 40% of physical memory, and it is generally recommended not to exceed 70%, because the operating system itself also needs a considerable amount of memory and these limits often cannot be strictly enforced. In resource-tight scenarios, usage may exceed 80%.

For instance, the official documentation mentions that the current tidb_server_memory_limit does not terminate the following SQL operations:

  • DDL operations
  • SQL operations containing window functions and common table expressions

Warning

  • TiDB does not guarantee that the tidb_server_memory_limit limit will take effect during startup. If the operating system has insufficient free memory, TiDB may still encounter OOM. You need to ensure that the TiDB instance has enough available memory.
  • During memory control, the overall memory usage of TiDB may slightly exceed the tidb_server_memory_limit.
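
On v6.5 you can also watch the limit in action (a sketch; information_schema.memory_usage and memory_usage_ops_history were introduced in v6.5):

```sql
-- Current tidb-server memory state versus the configured limit
SELECT memory_total, memory_limit, memory_current, memory_max_used
FROM information_schema.memory_usage;

-- Recent enforcement actions (GC runs, killed sessions) triggered by the limit
SELECT * FROM information_schema.memory_usage_ops_history
ORDER BY time DESC
LIMIT 10;
```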

Reference:
TiDB OOM Troubleshooting | PingCAP Documentation Center

Column - Exploration of TiDB Server OOM Issue Optimization | TiDB Community

| username: tidb菜鸟一只 | Original post link

It looks like the memory was exhausted or the network bandwidth was saturated, so that even the monitoring could not collect data. However, judging from your memory monitoring graph, the memory should not be full. Please double-check the network…

| username: tidb狂热爱好者 | Original post link

His TiKV CPU is at 100%, so any monitoring is likely to be stuck.

| username: dba远航 | Original post link

It feels like the memory is growing too fast and is being limited by a certain parameter.