[BUG] tikv-server memory overflow triggering OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [BUG] tikv-server 内存溢出触发(OOM)

| username: Atlan

[TiDB Usage Environment] Production Environment / Testing / PoC
System Version: Ubuntu 20.04
TiDB Version: v6.5.1 (upgraded from v6.5.0)
[Reproduction Path]
During e2e testing, TiKV memory usage keeps growing even after the load has stabilized; if the e2e test runs long enough, it triggers an OOM.
[Encountered Problem: Phenomenon and Impact]
TiKV triggers an OOM and restarts.
[Resource Configuration]
Host Resource Configuration:
CPU: 16 cores
Memory: 128 GB
Disk: 20 TB (mechanical/HDD)

Monitoring screenshots (attached in the original post)
Heap pprof
000001.heap (55.9 KB)

| username: Atlan | Original post link

The issue first appeared on TiDB v6.5.0. When v6.5.1 was released we upgraded to it, but we still see memory problems after the upgrade. The e2e tool applies a steady, consistent load with no sudden spikes, yet TiKV's memory usage gradually increases until it triggers an OOM. The OOM itself was triggered on v6.5.0; it has not recurred since the upgrade, but in our scenario memory usage still keeps climbing, so we suspect an internal TiKV issue. The heap pprof has already been attached above. Could someone please analyze it? Thank you.
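
For reference, the attached 000001.heap appears to be a jemalloc heap dump, so it should be viewable with jemalloc's jeprof against the matching tikv-server binary. A rough sketch (paths are placeholders; exact flags may vary by jemalloc version):

```bash
# Use the exact tikv-server binary that produced the dump (hypothetical
# paths below), otherwise symbol resolution will be wrong or empty.
jeprof --show_bytes --text ./bin/tikv-server 000001.heap | head -n 30

# Or render a call-graph SVG for easier browsing:
jeprof --show_bytes --svg ./bin/tikv-server 000001.heap > tikv-heap.svg
```

The top entries by bytes usually make it clear whether the growth is in the block cache, the Raft machinery, or somewhere less expected.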

| username: Lucien-卢西恩 | Original post link

Can you use Clinic or MetricsTool to capture the TiKV-Details monitoring data?
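
With PingCAP Clinic, collection typically goes through the Diag component of tiup. A sketch, with a hypothetical cluster name and time window (check `tiup diag collect --help` for the exact flags of your version):

```bash
# Collect diagnostics (metrics, configs, logs) for the period when
# TiKV memory was climbing; "tidb-test" and "-4h" are placeholders.
tiup diag collect tidb-test -f="-4h"
```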

| username: magic | Original post link

I saw a similar post before; I'm not sure if it's related.

| username: Atlan | Original post link

Note: I'm not sure about the time range of this export. I made some adjustments in the last 30 minutes, so the cluster's resource usage may have dropped sharply during that window.

| username: Atlan | Original post link

In my environment we have a mixed deployment: TiDB, business applications, and other infrastructure components all run on the same hosts. I'm not sure the OOM is necessarily caused by TiDB itself; memory pressure from other services could also lead the kernel OOM killer to kill TiKV by mistake. What I find strange right now is the steady, continuous growth in memory usage.
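
To check which process the kernel OOM killer actually picked, the kernel log records the victim; on Ubuntu something like:

```bash
# Kernel OOM-killer events with human-readable timestamps:
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"

# The same records usually also land in syslog on Ubuntu:
grep -i "killed process" /var/log/syslog
```

The OOM report also includes a per-process memory table taken just before the kill, which shows whether TiKV or another service was the main consumer at that moment.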

| username: Atlan | Original post link

Flow control is enabled by default in the new version. I haven’t made any adjustments.
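
Since this is a mixed deployment, one mitigation worth considering (independent of flow control) is capping TiKV's block cache, which by default is sized as a fraction of total host memory and therefore over-allocates on hosts shared with other services. A sketch with a placeholder cluster name and value:

```bash
# Open the cluster topology for editing:
tiup cluster edit-config tidb-test

# Then under server_configs -> tikv set, for example:
#   storage.block-cache.capacity: "24GB"
# and reload TiKV instances for the change to take effect:
tiup cluster reload tidb-test -R tikv
```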

| username: Atlan | Original post link

@h5n1