TIKV Out of Memory (OOM)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV OOM

| username: Atlan

[TiDB Usage Environment] Production Environment / Testing / PoC
System Version: Ubuntu 20.04
TiDB Version: v6.5.1 / v6.5.0
[Reproduction Path]
During e2e testing, TiKV memory usage keeps growing even after the load stabilizes; a long-running e2e test eventually triggers an OOM.
[Encountered Problem: Problem Phenomenon and Impact]


TiKV restarts after being OOM-killed.
[Resource Configuration]
Host Resource Configuration:
CPU: 16 cores
Memory: 128 GB
Mechanical disk: 20 TB

Monitoring Screenshot


Heap pprof
000001.heap (55.9 KB)

| username: Atlan | Original post link

metricstools tidb-test-TiKV-Details_2023-03-28T08_59_55.204Z.json (14.0 MB)

| username: Atlan | Original post link

@h5n1

| username: Atlan | Original post link

Situation Description:
Three hosts: 16C 128G 20T*3
Three TiDB, three PD, six TiKV (with multiple disks), and sufficient host resources.
Note: TiKV is co-located with business workloads, but we have confirmed that the OOM was not caused by the business or other base components consuming resources and getting TiKV killed by mistake.

| username: magic | Original post link

Did you post this again? :thinking:

| username: xfworld | Original post link

With a mixed deployment, pay attention to TiKV's memory configuration. Otherwise, when system memory runs short, the Linux OOM killer will kill TiKV.

| username: Atlan | Original post link

Yes, the previous thread sank. Now the business side can't tolerate the scheduled restarts anymore.

| username: Atlan | Original post link

Is there a recommended value for this memory configuration? My host has 128G of memory, running two TiKV instances should be sufficient, right? Other programs don’t really use much memory.

| username: xfworld | Original post link

Hybrid deployment should be stress-tested in the testing environment.

Refer to these articles:

| username: tidb狂热爱好者 | Original post link

You should study the basic theory before making another post.

| username: tidb狂热爱好者 | Original post link

When an OOM occurs, the kernel kills the process with the highest memory usage. So which process is using the most memory? TiKV limits its own usage to about 80%. How much memory do your other services use?

| username: tidb菜鸟一只 | Original post link

SHOW CONFIG WHERE TYPE='tikv' AND NAME LIKE '%storage.block-cache.capacity%';
Check this parameter. If a single TiKV is deployed on one server, set it to 45% of the server’s memory. If two TiKVs are deployed on one server, set it to 22.5% of the server’s memory. Additionally, if two TiKVs are deployed on one server, you can bind each TiKV to a single NUMA node by specifying numa_node to prevent mutual interference.
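As a rough sketch of that sizing rule (45% of host memory split evenly across instances; the cluster name and the `tiup` steps below are illustrative, not from the thread):

```shell
# Hedged sketch: compute the suggested block-cache size when several
# TiKV instances share one host (45% of host memory, split evenly).
HOST_MEM_GB=128   # this thread's hosts
INSTANCES=2       # two TiKV instances per host
CACHE_GB=$(( HOST_MEM_GB * 45 / 100 / INSTANCES ))
echo "${CACHE_GB}GiB"   # prints 28GiB

# Apply it in the cluster topology and rolling-reload TiKV (cluster
# name is hypothetical; numa_node can be set per instance in the
# same file to bind each TiKV to one NUMA node):
# tiup cluster edit-config my-cluster   # storage.block-cache.capacity: "28GiB"
# tiup cluster reload my-cluster -R tikv
```

The integer division rounds down slightly, which is the safe direction for a memory cap.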

| username: Atlan | Original post link

I’ll take a look, thank you.

| username: Atlan | Original post link

Okay, thank you. I’ll take a look.

| username: Atlan | Original post link

I don’t quite understand what you mean. The total memory of my business is less than 20G.

| username: xfworld | Original post link

Memory usage will accumulate, and if it’s a mixed deployment, it’s best to use cgroup for isolation.
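For the cgroup isolation suggested above, one low-effort option is to let systemd set the limit on the business side; a minimal sketch (the service and binary names are hypothetical):

```shell
# Cap a co-located business service so it cannot squeeze TiKV;
# systemd translates MemoryMax into a cgroup memory limit.
sudo systemctl set-property biz-app.service MemoryMax=20G

# Or start a one-off process inside a limited scope:
sudo systemd-run --scope -p MemoryMax=20G /opt/biz/app
```

This avoids managing raw cgroup hierarchies by hand, at the cost of depending on systemd.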

| username: Atlan | Original post link

Using cgroup isolation would make operations a bit more complex for us.

| username: Atlan | Original post link

We originally had slow disk writes but couldn't add more machines, so we added disks and co-located two TiKV instances per host. Write latency on the SAS disks is very high.

| username: Atlan | Original post link

We're using the default cache size.

| username: h5n1 | Original post link

TiKV's default block cache size is about 45% of system memory. Set the size manually; try running it at 32 GB first.
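A hedged sketch of applying that online (TiDB's `SET CONFIG` statement can adjust `storage.block-cache.capacity` without a restart in v6.x; the host, port, and credentials below are placeholders):

```shell
# Shrink the block cache on all TiKV instances to 32 GiB, online:
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SET CONFIG tikv \`storage.block-cache.capacity\`='32GiB';"

# Verify the change took effect on every instance:
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SHOW CONFIG WHERE TYPE='tikv' AND NAME='storage.block-cache.capacity';"
```

Note that `SET CONFIG` changes are not persisted across restarts; also write the value into the cluster topology so it survives a reload.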