A certain TiKV node frequently experiences OOM, but the server memory is sufficient

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 某tikv节点频繁oom,但服务器内存是充足的

| username: jaybing926

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.9
【Encountered Problem: Phenomenon and Impact】
A certain TiKV node in the cluster frequently encounters OOM, but the server’s memory is sufficient. What could be the issue?
【Resource Configuration】

| username: tidb狂热爱好者 | Original post link

Upgrade to v6.

| username: jaybing926 | Original post link

Since I wasn't the one who deployed the cluster, I'm afraid the upgrade might run into problems, so I've been putting it off and haven't dared to do it.

| username: tidb狂热爱好者 | Original post link

SET tidb_mem_quota_query = 8 << 30;
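
Roughly, this sets the memory quota (in bytes) that a single SQL query may use on the TiDB server; 8 << 30 is 8 * 2^30 bytes = 8 GiB. You can check what a session currently has with something like:

SHOW VARIABLES LIKE 'tidb_mem_quota_query';  -- value is in bytes, 8589934592 = 8 GiB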

| username: jaybing926 | Original post link

I still don’t quite understand this parameter. Could you please explain it in detail?

| username: WalterWj | Original post link

Did you set up NUMA core binding?

| username: jaybing926 | Original post link

So, I have 8G now, right? Is this how I should look at it?

The key point is that my physical machine still has 60G of available memory. Even if this reaches 8G, it shouldn’t OOM, right? I don’t understand.

| username: jaybing926 | Original post link

When everything is normal, the available memory is 60G. When it crashes, the memory is released, and the available memory reaches 120G. Is there any problem with this?

| username: Ming | Original post link

Sorry, I mistook it for TiKV-Details’ memory at that time. :smiling_face_with_tear:

| username: jaybing926 | Original post link

Haha, thanks for participating~~ :rofl:

| username: jaybing926 | Original post link

This is from my messages log. Can anyone make sense of it? How much memory did the process use before it got killed? 87G?
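
For reference, I pulled the line out with something like this (the path on my machine is /var/log/messages; if I'm reading it right, anon-rss in that line is printed in kB, so around 91,000,000 kB would be about 87G):

grep -i 'killed process' /var/log/messages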

| username: dba-kit | Original post link

Could you send the configuration file? Did you use Cgroup to limit the memory?
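
If it's not obvious from the config, something like this shows whether the tikv-server process sits under a memory-limited cgroup (paths assume cgroup v1):

cat /proc/$(pgrep -x tikv-server | head -1)/cgroup
cat /sys/fs/cgroup/memory/&lt;cgroup-path&gt;/memory.limit_in_bytes   # &lt;cgroup-path&gt; taken from the line above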

| username: dba-kit | Original post link

As shown here

| username: caiyfc | Original post link

Check the topology file to confirm if NUMA binding is enabled. If it is, only a portion of the memory can be used.

| username: jaybing926 | Original post link

I didn’t see any cgroup-related configuration.

| username: jaybing926 | Original post link

What is a topology file? How do you view it?

| username: Ming | Original post link

Now you can see numa_node = '1' in the configuration file. You can check the size of node 1 by running numactl --hardware; it should be no more than 60G. Combined with the anon-rss value in the second message above, that pretty much confirms this is a NUMA core-binding issue.
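
For example (the output differs per machine, just look at the "node 1 size" line; the second command is optional, if numastat is installed):

numactl --hardware                      # per-node memory sizes
numastat -p $(pgrep -x tikv-server)     # per-node memory used by the tikv-server process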

| username: tidb菜鸟一只 | Original post link

You can take a look at the reply above. My guess is that the usable memory on your numa_node 1 is only 60G. If you want to use more, either don't bind TiKV to a single NUMA node (comment out the related configuration) or bind multiple NUMA nodes, something like numa_node: "0,1" (see the sketch below).
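
For example, the TiKV part of the topology would look roughly like this (the host is a placeholder):

tikv_servers:
  - host: 192.0.2.11
    # numa_node: "1"       # commenting this out lets TiKV use memory from all nodes
    numa_node: "0,1"        # or bind it to both NUMA nodes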

| username: caiyfc | Original post link

tiup cluster edit-config <cluster-name>
<cluster-name> represents the name of the cluster to be operated on.
In the opened file, check if there are any NUMA-related settings for TiKV.
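If you do change the NUMA setting there, the TiKV role needs to be reloaded afterwards for it to take effect, roughly:

tiup cluster reload &lt;cluster-name&gt; -R tikv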

| username: TiDBer_jYQINSnf | Original post link

If you don't want it to OOM, adjust the sizes of the various block caches and memtables.
https://docs.pingcap.com/zh/tidb/stable/tune-tikv-memory-performance#tikv-内存参数性能调优

Check out this article.
TiKV's underlying storage engine is RocksDB, which has 4 CFs. Each CF has its memtables (memory use up to write-buffer-size * max-write-buffer-number) and the block cache (configured under [storage.block-cache]). Making these smaller will reduce memory usage.
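
A rough sketch of the knobs involved in tikv.toml (the values here are only placeholders, size them to your node):

[storage.block-cache]
shared = true
capacity = "16GB"                  # block cache shared by all CFs

[rocksdb.defaultcf]
write-buffer-size = "128MB"        # size of one memtable
max-write-buffer-number = 5        # memtables kept per CF

[rocksdb.writecf]
write-buffer-size = "128MB"
max-write-buffer-number = 5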