A certain TiKV node frequently experiences OOM, but the server memory is sufficient

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 某tikv节点频繁oom,但服务器内存是充足的

| username: jaybing926

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.9
【Encountered Problem: Phenomenon and Impact】
A certain TiKV node in the cluster frequently encounters OOM, but the server’s memory is sufficient. What could be the issue?
【Resource Configuration】

| username: tidb狂热爱好者 | Original post link

Upgrade to v6.

| username: jaybing926 | Original post link

Since I wasn't the one who deployed the cluster, I'm afraid the upgrade might run into problems, so I've been putting it off and haven't dared to do it.

| username: tidb狂热爱好者 | Original post link

SET tidb_mem_quota_query = 8 << 30;
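
Roughly, this sets the memory quota (in bytes) that a single SQL query may use on the TiDB server; 8 << 30 is 8 * 2^30 bytes = 8 GiB. You can check what a session currently has with something like:

SHOW VARIABLES LIKE 'tidb_mem_quota_query';  -- value is in bytes, 8589934592 = 8 GiB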

| username: jaybing926 | Original post link

I still don’t quite understand this parameter. Could you please explain it in detail?

| username: WalterWj | Original post link

Did you set up NUMA core binding?

| username: jaybing926 | Original post link

So, I have 8G now, right? Is this how I should look at it?

The key point is that my physical machine still has 60G of available memory. Even if this reaches 8G, it shouldn’t OOM, right? I don’t understand.

| username: jaybing926 | Original post link

When everything is normal, the available memory is 60G. When it crashes, the memory is released, and the available memory reaches 120G. Is there any problem with this?

| username: Ming | Original post link

Sorry, I mistook it for TiKV-Details’ memory at that time. :smiling_face_with_tear:

| username: jaybing926 | Original post link

Haha, thanks for participating~~ :rofl:

| username: jaybing926 | Original post link

This is from my messages log. Can anyone make sense of it? How much memory did the process use before it got killed? 87G?
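
For reference, I pulled the line out with something like this (the path on my machine is /var/log/messages; if I'm reading it right, anon-rss in that line is printed in kB, so around 91,000,000 kB would be about 87G):

grep -i 'killed process' /var/log/messages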

| username: dba-kit | Original post link

Could you send the configuration file? Did you use Cgroup to limit the memory?
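
If it's not obvious from the config, something like this shows whether the tikv-server process sits under a memory-limited cgroup (paths assume cgroup v1):

cat /proc/$(pgrep -x tikv-server | head -1)/cgroup
cat /sys/fs/cgroup/memory/&lt;cgroup-path&gt;/memory.limit_in_bytes   # &lt;cgroup-path&gt; taken from the line above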

| username: dba-kit | Original post link

As shown here

| username: caiyfc | Original post link

Check the topology file to confirm if NUMA binding is enabled. If it is, only a portion of the memory can be used.

| username: jaybing926 | Original post link

I didn’t see any cgroup-related configuration.

| username: jaybing926 | Original post link

What is a topology file? How do you view it?

| username: Ming | Original post link

Now you can see numa_node = '1' in the configuration file. You can check the size of node 1 by running numactl --hardware; it should be no more than 60G. Combined with the anon-rss value in the second message above, that pretty much confirms this is a NUMA core-binding issue.
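
For example (the output differs per machine, just look at the "node 1 size" line; the second command is optional, if numastat is installed):

numactl --hardware                      # per-node memory sizes
numastat -p $(pgrep -x tikv-server)     # per-node memory used by the tikv-server process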

| username: tidb菜鸟一只 | Original post link

You can take a look at the reply above. My guess is that the usable memory on your numa_node 1 is only 60G. If you want to use more, either don't bind TiKV to a single NUMA node (comment out the related configuration) or bind multiple NUMA nodes, something like numa_node: "0,1" (see the sketch below).
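
For example, the TiKV part of the topology would look roughly like this (the host is a placeholder):

tikv_servers:
  - host: 192.0.2.11
    # numa_node: "1"       # commenting this out lets TiKV use memory from all nodes
    numa_node: "0,1"        # or bind it to both NUMA nodes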

| username: caiyfc | Original post link

tiup cluster edit-config <cluster-name>
<cluster-name> represents the name of the cluster to be operated on.
In the opened file, check if there are any NUMA-related settings for TiKV.
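If you do change the NUMA setting there, the TiKV role needs to be reloaded afterwards for it to take effect, roughly:

tiup cluster reload &lt;cluster-name&gt; -R tikv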

| username: TiDBer_jYQINSnf | Original post link

If you don't want it to OOM, adjust the sizes of the various block caches and memtables.
https://docs.pingcap.com/zh/tidb/stable/tune-tikv-memory-performance#tikv-内存参数性能调优

Check out this article.
TiKV's underlying storage engine is RocksDB, which has 4 CFs. Each CF has its memtables (memory use up to write-buffer-size * max-write-buffer-number) and the block cache (configured under [storage.block-cache]). Making these smaller will reduce memory usage.
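
A rough sketch of the knobs involved in tikv.toml (the values here are only placeholders, size them to your node):

[storage.block-cache]
shared = true
capacity = "16GB"                  # block cache shared by all CFs

[rocksdb.defaultcf]
write-buffer-size = "128MB"        # size of one memtable
max-write-buffer-number = 5        # memtables kept per CF

[rocksdb.writecf]
write-buffer-size = "128MB"
max-write-buffer-number = 5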